AI Observability
When a user reports a bad experience, can you even reproduce it?
In deterministic products, you can. In AI products, you often can't — unless you're logging the full trace. Observability is the discipline of making the invisible visible: what happened inside the pipeline, why, and what it means for your next move.
Without observability, you're flying blind. Eval scores tell you how good the output was. Traces tell you why.
Trace Analysis
A trace captures every step from input to output. For simple models, that's one step. For agent architectures, it's many.
What to Capture
| Layer | What to Log | Why |
|---|---|---|
| Input | Raw user input, context, metadata | Reproduce the scenario |
| Preprocessing | Cleaned input, prompt template, parameters | Understand what the model actually received |
| Retrieval (if RAG) | Query, sources retrieved, relevance scores | Did it find the right information? |
| Reasoning (if agent) | Tool calls, intermediate decisions, chain of thought | Where did the reasoning break? |
| Generation | Raw model output, latency, token count | Performance and cost tracking |
| Postprocessing | Filtering, formatting, safety checks | What changed between raw and final? |
| Delivery | Final output, user response, follow-up actions | Did the user get value? |
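A minimal sketch of what a trace record covering these layers could look like. The field names, layer labels, and JSON-lines output are illustrative assumptions, not a prescribed schema; adapt them to your own pipeline.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceStep:
    layer: str                 # e.g. "retrieval", "generation"
    payload: dict              # what the step received or produced
    latency_ms: float = 0.0
    metadata: dict = field(default_factory=dict)

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    started_at: float = field(default_factory=time.time)
    steps: list = field(default_factory=list)

    def add(self, layer: str, payload: dict, latency_ms: float = 0.0, **metadata):
        self.steps.append(TraceStep(layer, payload, latency_ms, metadata))

    def to_jsonl(self) -> str:
        # One JSON line per trace keeps logs greppable and easy to replay later.
        return json.dumps(asdict(self))

# Usage: record each pipeline layer as it happens.
trace = Trace()
trace.add("input", {"raw": "What's our refund policy?", "user_id": "u_123"})
trace.add("retrieval", {"query": "refund policy", "doc_ids": ["kb_42"], "scores": [0.91]}, latency_ms=120)
trace.add("generation", {"output": "Refunds are accepted within 30 days...", "tokens": 186}, latency_ms=950)
print(trace.to_jsonl())
```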
Trace Review
Systematic trace review reveals what eval scores hide.
| Review Type | Frequency | Focus | Who |
|---|---|---|---|
| Failure review | Per incident | Single bad output, root cause | Engineer |
| Pattern review | Weekly | Clusters of similar failures | PM + Engineer |
| Drift review | Monthly | Are inputs changing? Are outputs degrading? | Team |
| Discovery review | Quarterly | What new use cases are emerging? | PM |
How many traces per week? Start with 20-50 from each quality tier. Enough to stay calibrated, not enough to drown.
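One way to pull that weekly sample, sketched below. The tier labels, the 30-per-tier default, and the `quality_tier` field are assumptions; tune them to your volume and labeling scheme.

```python
import random
from collections import defaultdict

def sample_for_review(traces, per_tier=30, seed=7):
    """traces: iterable of dicts with a 'quality_tier' key (e.g. 'good' / 'ok' / 'bad')."""
    by_tier = defaultdict(list)
    for t in traces:
        by_tier[t["quality_tier"]].append(t)
    rng = random.Random(seed)            # fixed seed so the weekly sample is reproducible
    sample = {}
    for tier, items in by_tier.items():
        rng.shuffle(items)
        sample[tier] = items[:per_tier]  # cap each tier so the review stays manageable
    return sample

# Usage with fake traces:
traces = [{"trace_id": i, "quality_tier": random.choice(["good", "ok", "bad"])} for i in range(500)]
weekly = sample_for_review(traces)
print({tier: len(items) for tier, items in weekly.items()})
```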
What Patterns Reveal
When you cluster failed traces, patterns emerge:
| Pattern | Diagnosis | Action |
|---|---|---|
| Failures concentrate in retrieval | Wrong sources, poor search | Improve retrieval pipeline |
| Failures concentrate in reasoning | Model can't connect the dots | Better prompting, examples, or model upgrade |
| Failures concentrate in safety filters | Over-filtering good outputs | Tune safety thresholds |
| Failures spread evenly | No single bottleneck | Likely a data/prompt quality issue |
| Failures cluster by input type | Coverage gap | Add that input type to golden dataset |
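The simplest useful version of this clustering is a count of failures by the stage that broke, as in the sketch below. The `failed_stage` field is an assumption: it presumes your failure reviews tag each bad trace with the layer that caused the problem.

```python
from collections import Counter

def failure_concentration(failed_traces):
    counts = Counter(t["failed_stage"] for t in failed_traces)
    total = sum(counts.values()) or 1
    # Return stages sorted by share of failures so the bottleneck is obvious.
    return [(stage, n, n / total) for stage, n in counts.most_common()]

failed = [
    {"trace_id": 1, "failed_stage": "retrieval"},
    {"trace_id": 2, "failed_stage": "retrieval"},
    {"trace_id": 3, "failed_stage": "reasoning"},
    {"trace_id": 4, "failed_stage": "retrieval"},
    {"trace_id": 5, "failed_stage": "safety_filter"},
]
for stage, n, share in failure_concentration(failed):
    print(f"{stage}: {n} failures ({share:.0%})")
```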
Agent Evaluation
Multi-step agents add complexity. The final output might be good despite bad intermediate steps, or bad despite good ones.
| Question | What It Tests | Failure Mode |
|---|---|---|
| Did it use the right tool? | Tool selection | Called search when it should have calculated |
| Did it find the right data? | Retrieval quality | Found plausible but wrong sources |
| Did it reason correctly? | Chain of thought | Correct data, wrong conclusion |
| Did it know when to stop? | Termination logic | Infinite loop, unnecessary steps |
| Did it recover from errors? | Resilience | First tool failed, agent gave up |
| Can it explain its steps? | Auditability | Black box multi-step with no trail |
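A sketch of turning a few of these questions into step-level checks, assuming each agent trace exposes its list of tool calls. The expected-tool mapping, the step budget, and the `tool_calls` / `ok` field names are illustrative.

```python
def check_agent_trace(trace, expected_first_tool=None, max_steps=10):
    findings = []
    calls = trace["tool_calls"]                      # e.g. [{"tool": "search", "ok": True}, ...]
    # Tool selection: did the first call match what the task actually needed?
    if expected_first_tool and calls and calls[0]["tool"] != expected_first_tool:
        findings.append(f"tool selection: used {calls[0]['tool']}, expected {expected_first_tool}")
    # Termination: too many steps suggests looping or unnecessary work.
    if len(calls) > max_steps:
        findings.append(f"termination: {len(calls)} steps exceeds budget of {max_steps}")
    # Resilience: if every call failed and nothing recovered, the agent gave up.
    failures = [c for c in calls if not c.get("ok", True)]
    if failures and len(failures) == len(calls):
        findings.append("resilience: every tool call failed and the agent never recovered")
    return findings or ["no step-level issues found"]

trace = {"tool_calls": [{"tool": "search", "ok": True},
                        {"tool": "search", "ok": True},
                        {"tool": "calculator", "ok": False}]}
print(check_agent_trace(trace, expected_first_tool="calculator", max_steps=5))
```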
Agent Metrics
Beyond output quality, track agent behavior:
| Metric | Healthy Range | Red Flag |
|---|---|---|
| Steps per task | Consistent with complexity | Increasing over time (regression) |
| Tool call success rate | Above 90% | Below 70% (tool or prompt issue) |
| Retry rate | Below 10% | Above 25% (instability) |
| Cost per task | Within budget | Increasing (prompt bloat, loops) |
| Latency p95 | Within SLA | Spiking (bottleneck emerging) |
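These metrics fall out of the same traces, as sketched below. Field names mirror the table but are assumptions about your logging schema; the thresholds in the comments come from the table above.

```python
from statistics import mean, quantiles

def agent_health(traces):
    steps = [len(t["tool_calls"]) for t in traces]
    calls = [c for t in traces for c in t["tool_calls"]]
    success_rate = sum(c["ok"] for c in calls) / len(calls)
    retry_rate = sum(c.get("retry", False) for c in calls) / len(calls)
    cost = mean(t["cost_usd"] for t in traces)
    latency_p95 = quantiles([t["latency_ms"] for t in traces], n=20)[18]  # 95th percentile cut point
    return {
        "avg_steps": round(mean(steps), 1),
        "tool_success_rate": round(success_rate, 2),   # red flag below 0.70
        "retry_rate": round(retry_rate, 2),            # red flag above 0.25
        "avg_cost_usd": round(cost, 4),
        "latency_p95_ms": round(latency_p95),
    }

# Usage with synthetic traces:
traces = [{"tool_calls": [{"ok": True}, {"ok": True, "retry": True}],
           "cost_usd": 0.012, "latency_ms": 1800 + i * 10} for i in range(100)]
print(agent_health(traces))
```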
Drift Detection
Two types of drift will degrade your AI product silently:
Data Drift
Production inputs change in ways your evals don't cover.
| Signal | Detection | Response |
|---|---|---|
| New input patterns | Compare production distribution to golden dataset | Update dataset |
| Seasonal shifts | Track input characteristics over time | Seasonal eval variants |
| User behavior change | Monitor interaction patterns | Investigate and adapt |
| New user segments | Cluster analysis on inputs | Add segment-specific evals |
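One lightweight way to compare the production distribution against the golden dataset is population stability index (PSI) over a categorical input feature, sketched below. The `intent` labels and the 0.2 "investigate" threshold are assumptions; PSI above roughly 0.2 is commonly treated as meaningful drift.

```python
import math
from collections import Counter

def psi(baseline, current, eps=1e-6):
    """Population stability index over shared categories; higher means more drift."""
    cats = set(baseline) | set(current)
    b_counts, c_counts = Counter(baseline), Counter(current)
    b_total, c_total = len(baseline), len(current)
    score = 0.0
    for cat in cats:
        b = b_counts[cat] / b_total + eps   # eps avoids log(0) for unseen categories
        c = c_counts[cat] / c_total + eps
        score += (c - b) * math.log(c / b)
    return score

# Usage: golden-dataset intents vs. last week's production intents.
golden_intents = ["refund"] * 40 + ["shipping"] * 40 + ["account"] * 20
prod_intents   = ["refund"] * 20 + ["shipping"] * 30 + ["account"] * 20 + ["warranty"] * 30
drift = psi(golden_intents, prod_intents)
print(f"PSI = {drift:.2f} -> {'update the golden dataset' if drift > 0.2 else 'within tolerance'}")
```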
Model Drift
Upstream model updates change your product's quality without a single line of your code changing.
| Signal | Detection | Response |
|---|---|---|
| Eval score change with no code change | Pin model version, run evals on update | Don't auto-upgrade models in production |
| Style/tone shift | Compare outputs across versions | Update prompts or pin version |
| New capabilities | Benchmark new version | Adopt selectively, test thoroughly |
| Deprecated behaviors | Regression tests | Catch before production |
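A sketch of the "pin and verify" gate: run the same golden set through the pinned version and the candidate version, and hold the pin if scores regress. The hard-coded scores stand in for whatever produces your per-item CRAFT scores, and the drop and regression thresholds are illustrative.

```python
from statistics import mean

def model_drift_gate(pinned_scores, candidate_scores, max_drop=0.3):
    """Scores are per-item evals on the same golden dataset, in the same order."""
    pinned_mean, candidate_mean = mean(pinned_scores), mean(candidate_scores)
    delta = candidate_mean - pinned_mean
    # Count items that dropped by more than a point: mean-preserving regressions still hurt.
    regressions = sum(1 for p, c in zip(pinned_scores, candidate_scores) if c < p - 1.0)
    hold = delta < -max_drop or regressions > len(pinned_scores) * 0.1
    verdict = "hold the pin" if hold else "safe to upgrade"
    return {"pinned": round(pinned_mean, 2), "candidate": round(candidate_mean, 2),
            "delta": round(delta, 2), "items_regressed": regressions, "verdict": verdict}

pinned_scores    = [7.5, 8.0, 6.5, 7.0, 8.5, 7.5, 6.0, 7.0]
candidate_scores = [7.0, 8.0, 5.0, 7.0, 8.5, 7.0, 5.5, 7.0]
print(model_drift_gate(pinned_scores, candidate_scores))
```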
Business Connection
Eval scores mean nothing if they don't connect to outcomes.
| Eval Signal | Business Metric | Connection |
|---|---|---|
| CRAFT score trending up | User retention | Better quality → users come back |
| Left tail shrinking | Support tickets | Fewer bad outputs → fewer complaints |
| Latency improving | Engagement | Faster → more usage |
| Safety score stable | Brand risk | No incidents → trust compounds |
| Cost per output declining | Margins | Efficiency → sustainability |
The Dashboard
What should leadership see?
| Row | Metric | Target | Current | Trend |
|---|---|---|---|---|
| Quality | CRAFT mean score | ≥7/10 | [measured] | ↑ → ↓ |
| Reliability | Score standard deviation | <1.5 | [measured] | ↑ → ↓ |
| Safety | Left tail incidents / week | <N | [measured] | ↑ → ↓ |
| Cost | $ per 1K outputs | <$X | [measured] | ↑ → ↓ |
| User | Satisfaction / edit rate | ≥N% | [measured] | ↑ → ↓ |
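The dashboard can live as data rather than slides, as in the sketch below. The row names, targets, and pass checks are placeholders taken from the table; wire `current` and `previous` to your measured values.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DashboardRow:
    name: str
    target: str
    current: float
    previous: float
    on_track: Callable[[float], bool]

    def render(self) -> str:
        trend = "↑" if self.current > self.previous else "↓" if self.current < self.previous else "→"
        status = "OK" if self.on_track(self.current) else "MISS"
        return f"{self.name:<12} target {self.target:<10} current {self.current:<8} {trend}  {status}"

rows = [
    DashboardRow("Quality",     "≥7/10",   7.4, 7.1, lambda v: v >= 7.0),
    DashboardRow("Reliability", "<1.5 sd", 1.2, 1.4, lambda v: v < 1.5),
    DashboardRow("Safety",      "<3/wk",   5,   2,   lambda v: v < 3),
    DashboardRow("Cost",        "<$2/1K",  1.8, 1.9, lambda v: v < 2.0),
]
for row in rows:
    print(row.render())
```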
Continuous Improvement
The feedback loop never stops:
| Question | Cadence | Owner |
|---|---|---|
| Are eval scores improving AND users happier? | Weekly | PM |
| Is the golden dataset still representative? | Monthly | PM + Engineer |
| Are we measuring what matters or what's easy? | Quarterly | Team |
| What's our biggest eval blind spot right now? | Quarterly | Team |
| Would we know if quality silently degraded? | Monthly | Engineer |
| Are we benchmarking against state of the art? | Quarterly | Engineer |
Team Culture
Eval quality is a shared responsibility, not one person's job.
| Practice | Why | Frequency |
|---|---|---|
| Team eval review | Shared understanding of quality | Weekly |
| Red teaming | Deliberately break the product | Monthly |
| Trace reading | Everyone sees real outputs | Weekly |
| Eval maintenance time | Datasets and rubrics stay fresh | Sprint allocation |
| Failure celebration | Found failures → prevented user harm | Always |
The most important cultural shift: celebrating eval improvements with the same energy as feature launches. A 5-point CRAFT improvement that prevents 1000 bad outputs per day is worth more than most features.
Context
- AI Evaluation — The CRAFT scoring system
- AI Requirements — Observability starts in the PRD
- AI Principles — What you're observing against
- Prediction Evaluation — Scoring discipline
- AI Frameworks — Agent infrastructure
- Work Charts — Who does what in the eval process
- VVFL Loop — Observability closes the loop