AI Observability
When a user reports a bad experience, can you even reproduce it?
In deterministic products, you can. In AI products, you often can't — unless you're logging the full trace. Observability is the discipline of making the invisible visible: what happened inside the pipeline, why, and what it means for your next move.
Without observability, you're flying blind. Eval scores tell you how good the output was. Traces tell you why.
Trace Analysis
A trace captures every step from input to output. For simple models, that's one step. For agent architectures, it's many.
What to Capture
| Layer | What to Log | Why |
|---|---|---|
| Input | Raw user input, context, metadata | Reproduce the scenario |
| Preprocessing | Cleaned input, prompt template, parameters | Understand what the model actually received |
| Retrieval (if RAG) | Query, sources retrieved, relevance scores | Did it find the right information? |
| Reasoning (if agent) | Tool calls, intermediate decisions, chain of thought | Where did the reasoning break? |
| Generation | Raw model output, latency, token count | Performance and cost tracking |
| Postprocessing | Filtering, formatting, safety checks | What changed between raw and final? |
| Delivery | Final output, user response, follow-up actions | Did the user get value? |
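A minimal sketch of what a trace record covering these layers could look like. The field names, layer labels, and JSON-lines output are illustrative assumptions, not a prescribed schema; adapt them to your own pipeline.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceStep:
    layer: str                 # e.g. "retrieval", "generation"
    payload: dict              # what the step received or produced
    latency_ms: float = 0.0
    metadata: dict = field(default_factory=dict)

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    started_at: float = field(default_factory=time.time)
    steps: list = field(default_factory=list)

    def add(self, layer: str, payload: dict, latency_ms: float = 0.0, **metadata):
        self.steps.append(TraceStep(layer, payload, latency_ms, metadata))

    def to_jsonl(self) -> str:
        # One JSON line per trace keeps logs greppable and easy to replay later.
        return json.dumps(asdict(self))

# Usage: record each pipeline layer as it happens.
trace = Trace()
trace.add("input", {"raw": "What's our refund policy?", "user_id": "u_123"})
trace.add("retrieval", {"query": "refund policy", "doc_ids": ["kb_42"], "scores": [0.91]}, latency_ms=120)
trace.add("generation", {"output": "Refunds are accepted within 30 days...", "tokens": 186}, latency_ms=950)
print(trace.to_jsonl())
```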
Trace Review
Systematic trace review reveals what eval scores hide.
| Review Type | Frequency | Focus | Who |
|---|---|---|---|
| Failure review | Per incident | Single bad output, root cause | Engineer |
| Pattern review | Weekly | Clusters of similar failures | PM + Engineer |
| Drift review | Monthly | Are inputs changing? Are outputs degrading? | Team |
| Discovery review | Quarterly | What new use cases are emerging? | PM |
How many traces per week? Start with 20-50 from each quality tier. Enough to stay calibrated, not enough to drown.
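One way to pull that weekly sample, sketched below. The tier labels, the 30-per-tier default, and the `quality_tier` field are assumptions; tune them to your volume and labeling scheme.

```python
import random
from collections import defaultdict

def sample_for_review(traces, per_tier=30, seed=7):
    """traces: iterable of dicts with a 'quality_tier' key (e.g. 'good' / 'ok' / 'bad')."""
    by_tier = defaultdict(list)
    for t in traces:
        by_tier[t["quality_tier"]].append(t)
    rng = random.Random(seed)            # fixed seed so the weekly sample is reproducible
    sample = {}
    for tier, items in by_tier.items():
        rng.shuffle(items)
        sample[tier] = items[:per_tier]  # cap each tier so the review stays manageable
    return sample

# Usage with fake traces:
traces = [{"trace_id": i, "quality_tier": random.choice(["good", "ok", "bad"])} for i in range(500)]
weekly = sample_for_review(traces)
print({tier: len(items) for tier, items in weekly.items()})
```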
What Patterns Reveal
When you cluster failed traces, patterns emerge:
| Pattern | Diagnosis | Action |
|---|---|---|
| Failures concentrate in retrieval | Wrong sources, poor search | Improve retrieval pipeline |
| Failures concentrate in reasoning | Model can't connect the dots | Better prompting, examples, or model upgrade |
| Failures concentrate in safety filters | Over-filtering good outputs | Tune safety thresholds |
| Failures spread evenly | No single bottleneck | Likely a data/prompt quality issue |
| Failures cluster by input type | Coverage gap | Add that input type to golden dataset |
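The simplest useful version of this clustering is a count of failures by the stage that broke, as in the sketch below. The `failed_stage` field is an assumption: it presumes your failure reviews tag each bad trace with the layer that caused the problem.

```python
from collections import Counter

def failure_concentration(failed_traces):
    counts = Counter(t["failed_stage"] for t in failed_traces)
    total = sum(counts.values()) or 1
    # Return stages sorted by share of failures so the bottleneck is obvious.
    return [(stage, n, n / total) for stage, n in counts.most_common()]

failed = [
    {"trace_id": 1, "failed_stage": "retrieval"},
    {"trace_id": 2, "failed_stage": "retrieval"},
    {"trace_id": 3, "failed_stage": "reasoning"},
    {"trace_id": 4, "failed_stage": "retrieval"},
    {"trace_id": 5, "failed_stage": "safety_filter"},
]
for stage, n, share in failure_concentration(failed):
    print(f"{stage}: {n} failures ({share:.0%})")
```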
Agent Evaluation
Multi-step agents add complexity. The final output might be good despite bad intermediate steps, or bad despite good ones.
| Question | What It Tests | Failure Mode |
|---|---|---|
| Did it use the right tool? | Tool selection | Called search when it should have calculated |
| Did it find the right data? | Retrieval quality | Found plausible but wrong sources |
| Did it reason correctly? | Chain of thought | Correct data, wrong conclusion |
| Did it know when to stop? | Termination logic | Infinite loop, unnecessary steps |
| Did it recover from errors? | Resilience | First tool failed, agent gave up |
| Can it explain its steps? | Auditability | Black box multi-step with no trail |
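A sketch of turning a few of these questions into step-level checks, assuming each agent trace exposes its list of tool calls. The expected-tool mapping, the step budget, and the `tool_calls` / `ok` field names are illustrative.

```python
def check_agent_trace(trace, expected_first_tool=None, max_steps=10):
    findings = []
    calls = trace["tool_calls"]                      # e.g. [{"tool": "search", "ok": True}, ...]
    # Tool selection: did the first call match what the task actually needed?
    if expected_first_tool and calls and calls[0]["tool"] != expected_first_tool:
        findings.append(f"tool selection: used {calls[0]['tool']}, expected {expected_first_tool}")
    # Termination: too many steps suggests looping or unnecessary work.
    if len(calls) > max_steps:
        findings.append(f"termination: {len(calls)} steps exceeds budget of {max_steps}")
    # Resilience: if every call failed and nothing recovered, the agent gave up.
    failures = [c for c in calls if not c.get("ok", True)]
    if failures and len(failures) == len(calls):
        findings.append("resilience: every tool call failed and the agent never recovered")
    return findings or ["no step-level issues found"]

trace = {"tool_calls": [{"tool": "search", "ok": True},
                        {"tool": "search", "ok": True},
                        {"tool": "calculator", "ok": False}]}
print(check_agent_trace(trace, expected_first_tool="calculator", max_steps=5))
```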
Agent Metrics
Beyond output quality, track agent behavior:
| Metric | Healthy Range | Red Flag |
|---|---|---|
| Steps per task | Consistent with complexity | Increasing over time (regression) |
| Tool call success rate | Above 90% | Below 70% (tool or prompt issue) |
| Retry rate | Below 10% | Above 25% (instability) |
| Cost per task | Within budget | Increasing (prompt bloat, loops) |
| Latency p95 | Within SLA | Spiking (bottleneck emerging) |
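These metrics fall out of the same traces, as sketched below. Field names mirror the table but are assumptions about your logging schema; the thresholds in the comments come from the table above.

```python
from statistics import mean, quantiles

def agent_health(traces):
    steps = [len(t["tool_calls"]) for t in traces]
    calls = [c for t in traces for c in t["tool_calls"]]
    success_rate = sum(c["ok"] for c in calls) / len(calls)
    retry_rate = sum(c.get("retry", False) for c in calls) / len(calls)
    cost = mean(t["cost_usd"] for t in traces)
    latency_p95 = quantiles([t["latency_ms"] for t in traces], n=20)[18]  # 95th percentile cut point
    return {
        "avg_steps": round(mean(steps), 1),
        "tool_success_rate": round(success_rate, 2),   # red flag below 0.70
        "retry_rate": round(retry_rate, 2),            # red flag above 0.25
        "avg_cost_usd": round(cost, 4),
        "latency_p95_ms": round(latency_p95),
    }

# Usage with synthetic traces:
traces = [{"tool_calls": [{"ok": True}, {"ok": True, "retry": True}],
           "cost_usd": 0.012, "latency_ms": 1800 + i * 10} for i in range(100)]
print(agent_health(traces))
```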
Drift Detection
Two types of drift will degrade your AI product silently:
Data Drift
Production inputs change in ways your evals don't cover.
| Signal | Detection | Response |
|---|---|---|
| New input patterns | Compare production distribution to golden dataset | Update dataset |
| Seasonal shifts | Track input characteristics over time | Seasonal eval variants |
| User behavior change | Monitor interaction patterns | Investigate and adapt |
| New user segments | Cluster analysis on inputs | Add segment-specific evals |
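One lightweight way to compare the production distribution against the golden dataset is population stability index (PSI) over a categorical input feature, sketched below. The `intent` labels and the 0.2 "investigate" threshold are assumptions; PSI above roughly 0.2 is commonly treated as meaningful drift.

```python
import math
from collections import Counter

def psi(baseline, current, eps=1e-6):
    """Population stability index over shared categories; higher means more drift."""
    cats = set(baseline) | set(current)
    b_counts, c_counts = Counter(baseline), Counter(current)
    b_total, c_total = len(baseline), len(current)
    score = 0.0
    for cat in cats:
        b = b_counts[cat] / b_total + eps   # eps avoids log(0) for unseen categories
        c = c_counts[cat] / c_total + eps
        score += (c - b) * math.log(c / b)
    return score

# Usage: golden-dataset intents vs. last week's production intents.
golden_intents = ["refund"] * 40 + ["shipping"] * 40 + ["account"] * 20
prod_intents   = ["refund"] * 20 + ["shipping"] * 30 + ["account"] * 20 + ["warranty"] * 30
drift = psi(golden_intents, prod_intents)
print(f"PSI = {drift:.2f} -> {'update the golden dataset' if drift > 0.2 else 'within tolerance'}")
```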
Model Drift
Upstream model updates change your product's quality without a single line of your code changing.
| Signal | Detection | Response |
|---|---|---|
| Eval score change with no code change | Pin model version, run evals on update | Don't auto-upgrade models in production |
| Style/tone shift | Compare outputs across versions | Update prompts or pin version |
| New capabilities | Benchmark new version | Adopt selectively, test thoroughly |
| Deprecated behaviors | Regression tests | Catch before production |
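A sketch of the "pin and verify" gate: run the same golden set through the pinned version and the candidate version, and hold the pin if scores regress. The hard-coded scores stand in for whatever produces your per-item CRAFT scores, and the drop and regression thresholds are illustrative.

```python
from statistics import mean

def model_drift_gate(pinned_scores, candidate_scores, max_drop=0.3):
    """Scores are per-item evals on the same golden dataset, in the same order."""
    pinned_mean, candidate_mean = mean(pinned_scores), mean(candidate_scores)
    delta = candidate_mean - pinned_mean
    # Count items that dropped by more than a point: mean-preserving regressions still hurt.
    regressions = sum(1 for p, c in zip(pinned_scores, candidate_scores) if c < p - 1.0)
    hold = delta < -max_drop or regressions > len(pinned_scores) * 0.1
    verdict = "hold the pin" if hold else "safe to upgrade"
    return {"pinned": round(pinned_mean, 2), "candidate": round(candidate_mean, 2),
            "delta": round(delta, 2), "items_regressed": regressions, "verdict": verdict}

pinned_scores    = [7.5, 8.0, 6.5, 7.0, 8.5, 7.5, 6.0, 7.0]
candidate_scores = [7.0, 8.0, 5.0, 7.0, 8.5, 7.0, 5.5, 7.0]
print(model_drift_gate(pinned_scores, candidate_scores))
```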
Business Connection
Eval scores mean nothing if they don't connect to outcomes.
| Eval Signal | Business Metric | Connection |
|---|---|---|
| CRAFT score trending up | User retention | Better quality → users come back |
| Left tail shrinking | Support tickets | Fewer bad outputs → fewer complaints |
| Latency improving | Engagement | Faster → more usage |
| Safety score stable | Brand risk | No incidents → trust compounds |
| Cost per output declining | Margins | Efficiency → sustainability |
The Dashboard
What should leadership see?
| Row | Metric | Target | Current | Trend |
|---|---|---|---|---|
| Quality | CRAFT mean score | ≥7/10 | [measured] | ↑ → ↓ |
| Reliability | Score standard deviation | <1.5 | [measured] | ↑ → ↓ |
| Safety | Left tail incidents / week | <N | [measured] | ↑ → ↓ |
| Cost | $ per 1K outputs | <$X | [measured] | ↑ → ↓ |
| User | Satisfaction / edit rate | ≥N% | [measured] | ↑ → ↓ |
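The dashboard can live as data rather than slides, as in the sketch below. The row names, targets, and pass checks are placeholders taken from the table; wire `current` and `previous` to your measured values.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DashboardRow:
    name: str
    target: str
    current: float
    previous: float
    on_track: Callable[[float], bool]

    def render(self) -> str:
        trend = "↑" if self.current > self.previous else "↓" if self.current < self.previous else "→"
        status = "OK" if self.on_track(self.current) else "MISS"
        return f"{self.name:<12} target {self.target:<10} current {self.current:<8} {trend}  {status}"

rows = [
    DashboardRow("Quality",     "≥7/10",   7.4, 7.1, lambda v: v >= 7.0),
    DashboardRow("Reliability", "<1.5 sd", 1.2, 1.4, lambda v: v < 1.5),
    DashboardRow("Safety",      "<3/wk",   5,   2,   lambda v: v < 3),
    DashboardRow("Cost",        "<$2/1K",  1.8, 1.9, lambda v: v < 2.0),
]
for row in rows:
    print(row.render())
```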
Continuous Improvement
The feedback loop never stops:
| Question | Cadence | Owner |
|---|---|---|
| Are eval scores improving AND users happier? | Weekly | PM |
| Is the golden dataset still representative? | Monthly | PM + Engineer |
| Are we measuring what matters or what's easy? | Quarterly | Team |
| What's our biggest eval blind spot right now? | Quarterly | Team |
| Would we know if quality silently degraded? | Monthly | Engineer |
| Are we benchmarking against state of the art? | Quarterly | Engineer |
Team Culture
Eval quality is a shared responsibility, not one person's job.
| Practice | Why | Frequency |
|---|---|---|
| Team eval review | Shared understanding of quality | Weekly |
| Red teaming | Deliberately break the product | Monthly |
| Trace reading | Everyone sees real outputs | Weekly |
| Eval maintenance time | Datasets and rubrics stay fresh | Sprint allocation |
| Failure celebration | Found failures → prevented user harm | Always |
The most important cultural shift: celebrating eval improvements with the same energy as feature launches. A 5-point CRAFT improvement that prevents 1000 bad outputs per day is worth more than most features.
Context
- AI Evaluation — The CRAFT scoring system
- AI Requirements — Observability starts in the PRD
- AI Principles — What you're observing against
- Prediction Evaluation — Scoring discipline
- AI Frameworks — Agent infrastructure
- Work Charts — Who does what in the eval process
- VVFL Loop — Observability closes the loop