The AI Engineer's Stack
A prompt engineer writes the message. An AI engineer engineers everything around the model so it survives production — the harness, the context, the inference economics, the failure modes. The model is one component. The system is the job.
This page is the competency map. Each item below is a named lever and the tradeoff it governs. When the complaint is "the AI feels slow, expensive, or unreliable," the lever that moves it is on this page.
Tracking Capability — Reality, Dream, Consumed
A capability map is worthless if it lets you claim what you have not proven. So every lever holds exactly one honest state, and the state is derived from evidence, not declared:
- Reality — you have built it, and you can prove it. The proof is a triple: an engineering artifact that implements the lever, a real workflow that exercises it (not a demo), and a metric showing that workflow got cheaper, faster, or more reliable. The cleanest proof mechanism is a deterministic intent harness that scripts the inputs, runs the workflow, and asserts on the trace — so the capability is re-provable on every change, not vouched for once.
- Dream — named on the map, needed, but no artifact exists yet. This is not a failure; it is the backlog. Each Dream lever is a demand: build this, prove it on workflow X, hit metric Y.
- Consumed — real and necessary, but not yours to build. Serving-layer levers (KV cache, paged attention, continuous batching, quantization internals) belong to the inference provider; you set policy and configure, you do not engineer them. Marking scope stops you from manufacturing fake gaps against work you were never going to do.
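One way to picture the deterministic intent harness mentioned under Reality is the sketch below. It is an illustration under stated assumptions, not a prescribed design: `Trace`, `IntentCase`, `run_case`, and the `classify` / `lookup_order` step names are hypothetical, and `run_workflow` is a stub standing in for a real model-backed pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Ordered record of what a workflow did (hypothetical structure)."""
    steps: List[str] = field(default_factory=list)
    tokens_used: int = 0


def run_workflow(user_input: str, trace: Trace) -> str:
    """Stub workflow. A real one would call a model and tools here."""
    trace.steps.append("classify")
    trace.tokens_used += 40
    if "order" in user_input.lower():
        trace.steps.append("lookup_order")
        trace.tokens_used += 120
        return "order_status"
    return "general_answer"


@dataclass
class IntentCase:
    name: str
    user_input: str
    expected_intent: str
    expected_steps: List[str]
    max_tokens: int


def run_case(case: IntentCase, workflow: Callable[[str, Trace], str]) -> List[str]:
    """Run one scripted case and return its failures (empty list = passed)."""
    trace = Trace()
    intent = workflow(case.user_input, trace)
    failures = []
    if intent != case.expected_intent:
        failures.append(f"intent: expected {case.expected_intent}, got {intent}")
    if trace.steps != case.expected_steps:
        failures.append(f"steps: expected {case.expected_steps}, got {trace.steps}")
    if trace.tokens_used > case.max_tokens:
        failures.append(f"tokens: {trace.tokens_used} > budget {case.max_tokens}")
    return failures


CASES = [
    IntentCase("order lookup", "Where is my order?", "order_status",
               ["classify", "lookup_order"], max_tokens=200),
    IntentCase("small talk", "Hello there", "general_answer",
               ["classify"], max_tokens=50),
]

# Re-run on every change: an empty failure list per case is the proof.
results = {case.name: run_case(case, run_workflow) for case in CASES}
```

Because the inputs are scripted and the assertions read the trace rather than the prose of the answer, the same cases can run in CI and fail loudly when a step disappears or a token budget is exceeded.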
The maturity ladder is evidence-gated. A lever is not "Proven" because someone rated it highly — it climbs only when the artifact exists:
- L0 Blind — not addressed; you cannot even name your current state. (This is where silent failures live — stale retrieval, cross-tenant leaks.)
- L1 Named — understood and documented. Reading this page moves a lever to L1.
- L2 Implemented — exists once in a running system.
- L3 Tested — a gate, eval, or guardrail has actually fired and caught something.
- L4 Proven — measured in production over time, with a metric trend and zero incidents.
Incidents demote. The production failure-modes checklist at the foot of this page is the live test. A runaway agent in production demotes "loop budgets" to L1 no matter what was claimed. Capability is what survives contact with the workflow.
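The evidence-gated ladder can be expressed as a small function, shown here as a sketch. The `Evidence` field names are hypothetical labels for the artifacts described above; the only rules encoded are the ones this page states: each level requires the one below it, and an incident demotes the lever to L1.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence record for one lever."""
    documented: bool = False                  # L1: understood and written down
    artifact_in_running_system: bool = False  # L2: exists once in a running system
    gate_has_fired: bool = False              # L3: a gate, eval, or guardrail caught something
    production_metric_trend: bool = False     # L4: measured in production over time
    open_incidents: int = 0


def maturity_level(e: Evidence) -> int:
    """Derive the level from evidence; each rung requires the rung below it."""
    if not e.documented:
        return 0
    level = 1
    if e.artifact_in_running_system:
        level = 2
        if e.gate_has_fired:
            level = 3
            if e.production_metric_trend:
                level = 4
    # Incidents demote: a production failure sends the lever back to Named.
    if e.open_incidents > 0:
        level = 1
    return level


loop_budgets = Evidence(documented=True, artifact_in_running_system=True,
                        gate_has_fired=True, production_metric_trend=True)
claimed = maturity_level(loop_budgets)         # all evidence present
loop_budgets.open_incidents = 1                # a runaway agent in production
after_incident = maturity_level(loop_budgets)  # demoted regardless of claims
```

Deriving the level this way means nobody can set it by hand: the only way to move a lever up is to attach the evidence.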
Why this matters beyond the audit: capability tracking is the instrument panel for the loop that compounds a team. The strategy layer names the capability it needs; the engineering layer builds and proves it; proven capability raises what the strategy layer can reach next; then the loop runs again from a higher starting point. The Reality/Dream/Consumed classification is what keeps that loop honest. A map full of unproven claims breaks the loop, because you end up building on capability you do not actually have.
1. Harness, Not Just Prompt
The leap most teams miss. Output quality is set more by what surrounds the model than by the wording of the prompt.
- Harness engineering, not prompt engineering — design the loop, the tools, the retries, the budgets. The prompt is one input to a system, not the system.
- Context engineering, not long prompts — engineer what the model sees at inference time: which documents, which memory, which tool results, in which order. A longer prompt is not a better context. Curated, ordered, fresh context is.
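A minimal sketch of "curated, ordered, fresh" context assembly follows. Everything named here is an assumption for illustration: `ContextItem`, the `relevance` score (assumed to come from a retriever or reranker), the 30-day freshness cutoff, and the rough four-characters-per-token estimate, which should be replaced by the model's real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextItem:
    source: str       # e.g. "doc:...", "memory:...", "tool:..."
    text: str
    relevance: float  # 0..1, assumed to come from a retriever or reranker
    age_days: float


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4)


def build_context(items: List[ContextItem], token_budget: int,
                  max_age_days: float = 30.0) -> List[ContextItem]:
    """Drop stale items, order by relevance, then pack greedily into the budget."""
    fresh = [i for i in items if i.age_days <= max_age_days]
    ranked = sorted(fresh, key=lambda i: i.relevance, reverse=True)
    chosen, used = [], 0
    for item in ranked:
        cost = estimate_tokens(item.text)
        if used + cost > token_budget:
            continue  # skip items that do not fit; smaller ones may still fit
        chosen.append(item)
        used += cost
    return chosen


items = [
    ContextItem("doc:pricing", "p" * 40, relevance=0.90, age_days=2.0),
    ContextItem("doc:old-faq", "f" * 40, relevance=0.95, age_days=90.0),   # stale
    ContextItem("tool:report", "r" * 400, relevance=0.50, age_days=1.0),   # too large
    ContextItem("memory:pref", "m" * 20, relevance=0.40, age_days=1.0),
]
selected = [i.source for i in build_context(items, token_budget=20)]
```

The highest-scoring item is excluded because it is stale, and the large tool result is excluded because it would crowd out everything else. That is the difference between a curated context and a long prompt.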
2. Inference Economics
Where latency and cost are actually decided. Most "the AI is slow/expensive" complaints resolve here, not in the prompt.
- Prompt caching vs. semantic caching — prompt cache reuses identical prefixes (exact match, cheap, brittle: any change to the prefix is a miss); semantic cache reuses similar requests (fuzzy, powerful, risks serving a wrong-but-close answer). Choose based on what a near-miss costs you.
- KV cache management — eviction, reuse, and memory pressure at scale. Model weights are a fixed cost, but the KV cache grows with batch size and sequence length, so it is usually the dominant variable memory consumer in serving; how you evict and reuse it sets your throughput ceiling.
- Prefill vs. decode latency — they optimize differently. Prefill is compute-bound (the whole prompt at once); decode is memory-bandwidth-bound (one token at a time). A long prompt punishes prefill; a long answer punishes decode. Diagnose which one is your latency before tuning.
- Continuous batching + paged attention — throughput levers at the serving layer. Continuous batching keeps the GPU full by swapping requests in mid-flight; paged attention stops KV-cache fragmentation from wasting memory. Together they multiply throughput without touching model quality.
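To diagnose whether prefill or decode dominates a request, it is usually enough to timestamp a streamed response. The sketch below assumes you can record when the request was sent and when each token arrived; `latency_breakdown` and its field names are hypothetical. Note that time to first token also includes queueing and network time, so treat it as an upper bound on prefill.

```python
from typing import Dict, List, Union


def latency_breakdown(request_sent: float,
                      token_times: List[float]) -> Dict[str, Union[float, str]]:
    """Split end-to-end latency into time-to-first-token and decode time.

    request_sent and token_times are timestamps in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent             # prefill-dominated
    decode_total = token_times[-1] - token_times[0]  # decode-dominated
    decode_steps = len(token_times) - 1
    per_token = decode_total / decode_steps if decode_steps else 0.0
    return {
        "ttft_s": ttft,
        "decode_total_s": decode_total,
        "per_token_s": per_token,
        "dominant": "prefill" if ttft > decode_total else "decode",
    }


# Long prompt, short answer: most of the wait happens before the first token.
long_prompt = latency_breakdown(0.0, [2.0, 2.25, 2.5, 2.75, 3.0])
# Short prompt, long answer: most of the wait is token-by-token generation.
long_answer = latency_breakdown(0.0, [0.5, 1.5, 2.5, 3.5])
```

If `dominant` is `"prefill"`, shortening or caching the prompt is the lever; if it is `"decode"`, shorter outputs or faster decoding are.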
3. Model Compression Tradeoffs
How you make a model cheaper or faster — and where each choice bites.
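- Speculative decoding — a small draft model proposes several tokens and the large model verifies them in one pass. Decode gets faster whenever the draft is usually right, and because the large model checks every emitted token, output quality is unchanged. Low risk, real win.
- Quantization — INT8, INT4, FP8, AWQ, GPTQ. Shrinks weights to cut memory and latency. INT8 and FP8 are usually safe; INT4 is more aggressive and can degrade quality on hard tasks, even when reached through calibration-based methods such as AWQ and GPTQ, which exist to limit that loss. Quantization hurts most where the task is reasoning-heavy or long-context, so measure on your own evals and never trust the headline benchmark.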
- Distillation — train a smaller model to mimic a bigger one. Best when you have a narrow, stable task and high volume; wrong when the task keeps changing.
The default order to try: speculative decoding (no quality cost) → moderate quantization (cheap, but measure) → distillation (only for stable high-volume tasks).
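Why speculative decoding keeps quality unchanged is easiest to see in the greedy case, sketched below with toy models. `toy_target` and `toy_draft` are hypothetical arithmetic stand-ins for real models, and the verification loop calls the target once per token for clarity, whereas a real system scores all proposed tokens in a single forward pass. Sampling-based decoding needs a rejection-sampling variant that is not shown here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the target model generates every token itself."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft model cheaply proposes up to k tokens.
        ctx, proposal = list(seq), []
        for _ in range(min(k, goal - len(seq))):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target verifies. Every emitted token is the target's own choice,
        #    so the output matches target-only decoding exactly.
        passes += 1
        for token in proposal:
            expected = target(seq)
            seq.append(expected)
            if expected != token:
                break  # first mismatch: keep the correction, discard the rest
    return seq, passes


def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    guess = toy_target(seq)
    # Deliberately wrong whenever the sequence length is a multiple of 3.
    return guess if len(seq) % 3 else (guess + 1) % 50


prompt = [7, 3]
reference = greedy_decode(toy_target, prompt, 12)
output, passes = speculative_decode(toy_target, toy_draft, prompt, 12)
```

The speedup comes from `passes` being smaller than the number of generated tokens; how much smaller depends entirely on how often the draft agrees with the target on your traffic.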
4. Output & Tool Reliability
The model will return malformed JSON and call tools with bad arguments. Engineer for it as a certainty, not an edge case.
- Structured output failures — schema validation, repair loops, and fallback chains. Validate every structured output; on failure, repair (re-ask with the error) before you escalate; have a fallback that degrades gracefully.
- Function calling reliability — tool contracts, argument validation, and idempotency. Validate arguments before executing. Make tool calls idempotent so a retry can't double-charge, double-send, or double-write.
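The validate, repair, then fall back sequence can be sketched as follows. The schema (`city`, `days`), the `structured_call` helper, and the scripted `make_stub` model are hypothetical; a production system would more likely validate against a full JSON Schema and call a real model client.

```python
import json
from typing import Callable, List, Optional, Tuple

REQUIRED_FIELDS = {"city": str, "days": int}  # hypothetical schema


def validate(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value must be an object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            return None, f"missing field '{name}'"
        if type(data[name]) is not expected:
            return None, f"field '{name}' must be {expected.__name__}"
    return data, None


def structured_call(model: Callable[[str], str], prompt: str,
                    max_repairs: int = 2, fallback: Optional[dict] = None) -> Optional[dict]:
    """Validate every output; on failure re-ask with the error; then degrade."""
    attempt = prompt
    for _ in range(max_repairs + 1):
        raw = model(attempt)
        data, error = validate(raw)
        if error is None:
            return data
        attempt = (f"{prompt}\n\nYour previous reply was rejected: {error}\n"
                   f"Previous reply: {raw}\nReturn only valid JSON.")
    return fallback  # graceful degradation instead of an exception


def make_stub(replies: List[str]):
    """Scripted stand-in for a model: returns replies in order, repeating the last."""
    calls = {"n": 0}

    def model(prompt: str) -> str:
        reply = replies[min(calls["n"], len(replies) - 1)]
        calls["n"] += 1
        return reply

    return model, calls


flaky, flaky_calls = make_stub(['{"city": "Oslo", "days": "3"}', '{"city": "Oslo", "days": 3}'])
repaired = structured_call(flaky, "Plan a trip as JSON.")

broken, broken_calls = make_stub(["not json"])
degraded = structured_call(broken, "Plan a trip as JSON.", fallback={"city": None, "days": 0})
```

The repair budget is bounded on purpose: an unbounded repair loop is just another runaway loop.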
5. Agent Control
An unbounded agent is a runaway cost and a runaway risk. Control is engineered, not hoped for.
- Guardrails + budgets — loop budgets (max iterations), tool budgets (max calls), and explicit termination conditions. An agent without a loop budget is a billing incident waiting to happen.
- Model routing + graceful fallback — route easy requests to cheap models, hard ones to strong models; fall back when a model is down or refuses. Design the degraded-mode UX before the outage, so the product still works at reduced capability instead of failing blank.
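Loop budgets, tool budgets, and explicit termination can be enforced in the agent loop itself, as in this sketch. The `Budget` type, the `(action, payload)` step protocol, and the status strings are assumptions made for the example; the step functions are stubs in place of a model deciding what to do next.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Budget:
    max_iterations: int  # loop budget
    max_tool_calls: int  # tool budget


# A step receives the iteration index and returns (action, payload),
# where action is "think", "tool", or "final" (hypothetical protocol).
Step = Callable[[int], Tuple[str, str]]


def run_agent(step: Step, budget: Budget) -> Tuple[str, Optional[str]]:
    """Run until the agent finishes or a budget is exhausted."""
    tool_calls = 0
    for iteration in range(budget.max_iterations):
        action, payload = step(iteration)
        if action == "final":
            return "done", payload  # explicit termination condition
        if action == "tool":
            if tool_calls >= budget.max_tool_calls:
                return "tool_budget_exhausted", None
            tool_calls += 1
            # A real agent would validate arguments and execute the tool here.
    return "loop_budget_exhausted", None


def finishes(i: int) -> Tuple[str, str]:
    return ("final", "answer") if i == 2 else ("think", "")


def tool_happy(i: int) -> Tuple[str, str]:
    return ("tool", "search")


def never_ends(i: int) -> Tuple[str, str]:
    return ("think", "")


ok = run_agent(finishes, Budget(max_iterations=10, max_tool_calls=3))
too_many_tools = run_agent(tool_happy, Budget(max_iterations=10, max_tool_calls=3))
runaway = run_agent(never_ends, Budget(max_iterations=5, max_tool_calls=3))
```

Returning a status instead of raising lets the caller choose the degraded-mode response for each kind of exhaustion.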
6. Retrieval (RAG)
Retrieval quality caps answer quality. A perfect model on stale or irrelevant context still gives a wrong answer.
- RAG architecture — chunking, embeddings, hybrid search (keyword + vector), reranking, and freshness. Hybrid beats pure vector on most real corpora; reranking is the cheapest large quality gain; freshness is the silent killer.
- Retrieval evals — recall, precision, grounding, attribution, and citation quality. You cannot improve retrieval you do not measure. Grounding and citation quality are what separate "sounds right" from "is right."
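The two most basic retrieval metrics, precision and recall at a cutoff k, take only a few lines to compute. The document IDs below are made up, and `precision_recall_at_k` is a hypothetical helper name; grounding, attribution, and citation quality need labelled answers and are not covered by this sketch.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(retrieved: List[str], relevant: Set[str],
                          k: int) -> Tuple[float, float]:
    """Precision@k and recall@k for one query.

    retrieved: ranked document IDs returned by the retriever.
    relevant:  IDs a human (or golden set) marked as relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = set(retrieved[:k])
    hits = len(top_k & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d1", "d4", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
p_at_3, r_at_3 = precision_recall_at_k(retrieved, relevant, k=3)
p_at_1, r_at_1 = precision_recall_at_k(retrieved, relevant, k=1)
```

Here two of the top three results are relevant and two of the three relevant documents were found, so both metrics are 2/3 at k=3; at k=1 precision is perfect but recall drops to 1/3.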
7. Evals & Observability
Without these you are flying blind: you will not know quality has regressed until a user tells you.
- Evals — golden sets, regression tests, adversarial tests, LLM-as-judge, and human evals. Golden sets catch regressions; adversarial sets catch the failures users will find; LLM-as-judge scales scoring but must itself be validated against human judgment.
- Observability as a first-class discipline — traces, spans, tokens, latency, errors, and drift. Treat it like any production system: you need the trace of a single request and the trend across all of them.
- Cost attribution — per feature, workflow, tenant, and user journey, not just per model. "The model costs X" is useless for decisions; "this feature costs X per user journey" is a decision you can act on.
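Cost attribution is mostly a matter of tagging each request with the dimensions you want to slice by. In this sketch the `Usage` record, the feature and tenant names, and the per-million-token prices are all hypothetical; substitute your provider's actual rates and your own tags.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


@dataclass
class Usage:
    feature: str
    tenant: str
    input_tokens: int
    output_tokens: int


def request_cost(u: Usage) -> float:
    """Cost of one request in USD."""
    return (u.input_tokens * PRICE_PER_MILLION["input"]
            + u.output_tokens * PRICE_PER_MILLION["output"]) / 1_000_000


def attribute(records: List[Usage], dimension: str) -> Dict[str, float]:
    """Sum cost along one dimension, e.g. "feature" or "tenant"."""
    totals: Dict[str, float] = defaultdict(float)
    for u in records:
        totals[getattr(u, dimension)] += request_cost(u)
    return dict(totals)


records = [
    Usage("search", "tenant-a", input_tokens=1000, output_tokens=200),
    Usage("search", "tenant-b", input_tokens=2000, output_tokens=100),
    Usage("summarize", "tenant-a", input_tokens=500, output_tokens=1000),
]
by_feature = attribute(records, "feature")
by_tenant = attribute(records, "tenant")
```

The same records answer both "which feature is expensive?" and "which tenant is expensive?", which a per-model invoice cannot do.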
8. Safety & Isolation
In multi-tenant or untrusted-input systems, these are not optional — they are the difference between a product and a breach.
- Safety engineering — prompt injection defense, data leakage prevention, and permission boundaries. Treat all retrieved and user content as potentially adversarial.
- Multi-tenant isolation — cache safety and cross-user context contamination prevention. A shared cache that leaks tenant A's data into tenant B's response is a catastrophic, common, and quiet failure.
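One simple defense against cross-tenant cache leaks is to make the tenant part of the cache key, so a lookup for tenant B cannot collide with an entry written for tenant A. The sketch below is an in-memory illustration only; `TenantScopedCache` is a hypothetical class, and a real deployment would apply the same keying to whatever cache store it uses.

```python
import hashlib
from typing import Dict, Optional


class TenantScopedCache:
    """Response cache whose keys always include the tenant ID."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def key(tenant_id: str, prompt: str) -> str:
        h = hashlib.sha256()
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        for part in (tenant_id, prompt):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
        return h.hexdigest()

    def get(self, tenant_id: str, prompt: str) -> Optional[str]:
        return self._store.get(self.key(tenant_id, prompt))

    def put(self, tenant_id: str, prompt: str, response: str) -> None:
        self._store[self.key(tenant_id, prompt)] = response


cache = TenantScopedCache()
cache.put("tenant-a", "What is our refund policy?", "Tenant A policy text")
hit = cache.get("tenant-a", "What is our refund policy?")
cross_tenant = cache.get("tenant-b", "What is our refund policy?")
```

The identical prompt from another tenant is a miss by construction, so isolation does not depend on every caller remembering to filter results afterwards.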
9. Choosing the Right Tool
Fine-tuning vs. in-context learning vs. RAG vs. distillation — and when each is the wrong tool.
- In-context learning — fastest to ship, no training. Wrong when context is too large or behavior must be permanent.
- RAG — right for knowledge that changes and must be cited. Wrong for changing behavior or style.
- Fine-tuning — right for stable behavior, format, or tone. Wrong for facts that change (they go stale in the weights).
- Distillation — right for cost at high volume on a stable task. Wrong while the task is still moving.
The whole stack is one tradeoff surface: latency, quality, cost, and reliability pull against each other. There is no setting that wins all four. Engineering is choosing which to spend.
Production Failure Modes — the checklist
The things that actually break in production. Each maps to a defense above.
- Hallucinated tool calls → argument validation + tool contracts (§4)
- Malformed JSON → schema validation + repair loops (§4)
- Stale retrieval → freshness in the RAG pipeline (§6)
- Runaway agents → loop and tool budgets + termination conditions (§5)
- Silent eval regressions → regression tests + drift observability (§7)
- Cross-user contamination → cache safety + tenant isolation (§8)
Questions
When the AI system is slow, expensive, or wrong — can you name which layer of this stack is the cause, and which lever moves it? If not, the gap is observability (§7), and that is where to start.