
AI Observability

When a user reports a bad experience, can you even reproduce it?

In deterministic products, you can. In AI products, you often can't — unless you're logging the full trace. Observability is the discipline of making the invisible visible: what happened inside the pipeline, why, and what it means for your next move.

Without observability, you're flying blind. Eval scores tell you how good the output is. Traces tell you why.

Trace Analysis

A trace captures every step from input to output. For simple models, that's one step. For agent architectures, it's many.

What to Capture

| Layer | What to Log | Why |
| --- | --- | --- |
| Input | Raw user input, context, metadata | Reproduce the scenario |
| Preprocessing | Cleaned input, prompt template, parameters | Understand what the model actually received |
| Retrieval (if RAG) | Query, sources retrieved, relevance scores | Did it find the right information? |
| Reasoning (if agent) | Tool calls, intermediate decisions, chain of thought | Where did the reasoning break? |
| Generation | Raw model output, latency, token count | Performance and cost tracking |
| Postprocessing | Filtering, formatting, safety checks | What changed between raw and final? |
| Delivery | Final output, user response, follow-up actions | Did the user get value? |
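
As a concrete starting point, here is a minimal sketch of a per-request trace record covering these layers. The shape (a `Trace` holding `Span` entries) and all field names are illustrative assumptions, not tied to any particular tracing SDK.

```python
# Minimal sketch of a per-request trace record; field names are illustrative.
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class Span:
    """One pipeline step: preprocessing, retrieval, a tool call, generation, etc."""
    layer: str                  # e.g. "retrieval", "generation"
    started_at: float
    ended_at: float
    inputs: dict
    outputs: dict
    metadata: dict = field(default_factory=dict)   # latency, token counts, scores


@dataclass
class Trace:
    """Everything needed to reproduce one user interaction end to end."""
    user_input: str
    context: dict = field(default_factory=dict)
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    spans: list = field(default_factory=list)
    final_output: str = ""

    def log_step(self, layer, inputs, outputs, started_at, **metadata):
        """Record one completed pipeline layer."""
        self.spans.append(Span(layer, started_at, time.time(), inputs, outputs, metadata))

    def to_json(self):
        """Serialize for your log store; one JSON document per request."""
        return json.dumps(asdict(self), default=str)


# Usage (hypothetical RAG request):
# trace = Trace(user_input="How do I reset my password?")
# trace.log_step("retrieval", {"query": query}, {"doc_ids": doc_ids}, started_at=t0)
# trace.log_step("generation", {"prompt": prompt}, {"text": answer}, started_at=t1,
#                latency_ms=420, tokens=812)
```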

Trace Review

Systematic trace review reveals what eval scores hide.

| Review Type | Frequency | Focus | Who |
| --- | --- | --- | --- |
| Failure review | Per incident | Single bad output, root cause | Engineer |
| Pattern review | Weekly | Clusters of similar failures | PM + Engineer |
| Drift review | Monthly | Are inputs changing? Are outputs degrading? | Team |
| Discovery review | Quarterly | What new use cases are emerging? | PM |

How many traces per week? Start with 20-50 from each quality tier. Enough to stay calibrated, not enough to drown.
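
A simple way to pull that weekly sample is a stratified draw across quality tiers, sketched below. The tier labels and the per-tier budget follow the guidance above; everything else is an assumption to adapt to however you grade traces.

```python
# Sketch of a weekly review sample: a stratified draw of traces per quality tier.
import random
from collections import defaultdict


def weekly_review_sample(traces, per_tier=30, tiers=("good", "borderline", "bad"), seed=None):
    """traces: iterable of dicts with at least a 'quality_tier' key (assumed schema)."""
    rng = random.Random(seed)
    by_tier = defaultdict(list)
    for t in traces:
        by_tier[t["quality_tier"]].append(t)
    sample = []
    for tier in tiers:
        pool = by_tier.get(tier, [])
        sample.extend(rng.sample(pool, min(per_tier, len(pool))))
    return sample
```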

What Patterns Reveal

When you cluster failed traces, patterns emerge:

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| Failures concentrate in retrieval | Wrong sources, poor search | Improve retrieval pipeline |
| Failures concentrate in reasoning | Model can't connect the dots | Better prompting, examples, or model upgrade |
| Failures concentrate in safety filters | Over-filtering good outputs | Tune safety thresholds |
| Failures spread evenly | No single bottleneck | Likely a data/prompt quality issue |
| Failures cluster by input type | Coverage gap | Add that input type to golden dataset |
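
The clustering itself can start as a plain tally by pipeline stage, as in this sketch. It assumes each failed trace was annotated with a `failure_stage` label during review.

```python
# Sketch of the clustering step: tally failed traces by the pipeline stage that broke.
from collections import Counter


def failure_breakdown(failed_traces):
    """Return stages ordered by failure count, plus each stage's share of failures."""
    counts = Counter(t["failure_stage"] for t in failed_traces)
    total = sum(counts.values()) or 1
    return [(stage, n, n / total) for stage, n in counts.most_common()]


# A stage holding far more than its share of failures (e.g. retrieval at 60%)
# points at the concentrated patterns above; a flat spread suggests a
# data or prompt quality issue instead.
```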

Agent Evaluation

Multi-step agents add complexity. The final output might be good despite bad intermediate steps, or bad despite good ones.

| Question | What It Tests | Failure Mode |
| --- | --- | --- |
| Did it use the right tool? | Tool selection | Called search when it should have calculated |
| Did it find the right data? | Retrieval quality | Found plausible but wrong sources |
| Did it reason correctly? | Chain of thought | Correct data, wrong conclusion |
| Did it know when to stop? | Termination logic | Infinite loop, unnecessary steps |
| Did it recover from errors? | Resilience | First tool failed, agent gave up |
| Can it explain its steps? | Auditability | Black box multi-step with no trail |
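
Some of these questions can be turned into cheap automatic checks over the trace, as sketched below. The trace shape (a list of steps with `type` and `tool` fields) and the step budget are assumptions, not a fixed schema.

```python
# Sketch of turning two of the questions above into automatic checks on an agent trace.
def check_tool_selection(trace_steps, expected_tool):
    """Did the agent call the tool the task actually needed?"""
    tools_used = [s["tool"] for s in trace_steps if s.get("type") == "tool_call"]
    return expected_tool in tools_used


def check_termination(trace_steps, max_steps=15):
    """Did the agent stop within a sane step budget (no loops, no padding)?"""
    return len(trace_steps) <= max_steps
```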

Agent Metrics

Beyond output quality, track agent behavior:

| Metric | Healthy Range | Red Flag |
| --- | --- | --- |
| Steps per task | Consistent with complexity | Increasing over time (regression) |
| Tool call success rate | Above 90% | Below 70% (tool or prompt issue) |
| Retry rate | Below 10% | Above 25% (instability) |
| Cost per task | Within budget | Increasing (prompt bloat, loops) |
| Latency p95 | Within SLA | Spiking (bottleneck emerging) |
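
Here is a sketch of computing these behavior metrics from a batch of agent traces. The field names (`steps`, `ok`, `is_retry`, `cost_usd`, `latency_ms`) are placeholders for whatever your tracing schema actually records, and the batch is assumed non-empty.

```python
# Sketch of computing the agent behavior metrics above from a batch of traces.
import statistics


def agent_metrics(traces):
    tool_calls = [s for t in traces for s in t["steps"] if s.get("type") == "tool_call"]
    retries = [s for s in tool_calls if s.get("is_retry")]
    latencies = sorted(t["latency_ms"] for t in traces)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    return {
        "steps_per_task": statistics.mean(len(t["steps"]) for t in traces),
        "tool_success_rate": sum(s["ok"] for s in tool_calls) / max(len(tool_calls), 1),
        "retry_rate": len(retries) / max(len(tool_calls), 1),
        "cost_per_task": statistics.mean(t["cost_usd"] for t in traces),
        "latency_p95_ms": p95,
    }
```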

Drift Detection

Two types of drift will degrade your AI product silently:

Data Drift

Production inputs change in ways your evals don't cover.

| Signal | Detection | Response |
| --- | --- | --- |
| New input patterns | Compare production distribution to golden dataset | Update dataset |
| Seasonal shifts | Track input characteristics over time | Seasonal eval variants |
| User behavior change | Monitor interaction patterns | Investigate and adapt |
| New user segments | Cluster analysis on inputs | Add segment-specific evals |
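
One lightweight way to detect this is to compare the category mix of production inputs against the golden dataset with a population stability index (PSI), sketched below. The category labels are assumptions, and the 0.1/0.2 thresholds are a common rule of thumb rather than a hard rule.

```python
# Sketch of a data-drift check: compare production input mix to the golden dataset via PSI.
import math
from collections import Counter


def category_share(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def psi(golden_labels, production_labels, floor=1e-4):
    golden = category_share(golden_labels)
    prod = category_share(production_labels)
    score = 0.0
    for cat in set(golden) | set(prod):
        g = max(golden.get(cat, 0.0), floor)
        p = max(prod.get(cat, 0.0), floor)
        score += (p - g) * math.log(p / g)
    return score


# Rule of thumb: PSI below ~0.1 is stable, above ~0.2 is drift worth investigating,
# i.e. time to refresh the golden dataset or add segment-specific evals.
```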

Model Drift

Upstream model updates change your quality without your code changing.

| Signal | Detection | Response |
| --- | --- | --- |
| Eval score change with no code change | Pin model version, run evals on update | Don't auto-upgrade models in production |
| Style/tone shift | Compare outputs across versions | Update prompts or pin version |
| New capabilities | Benchmark new version | Adopt selectively, test thoroughly |
| Deprecated behaviors | Regression tests | Catch before production |
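
A minimal sketch of the "pin and re-evaluate" pattern: the model version lives in config, and an upgrade only goes through if eval scores on the golden dataset don't regress. The model ID, `run_evals`, and the regression threshold are all placeholders for your own config and eval harness.

```python
# Sketch of guarding against silent model upgrades: pin the version, compare evals first.
PINNED_MODEL = "provider-model-2024-06-01"   # pinned explicitly, never "latest"


def safe_to_upgrade(run_evals, candidate_model, max_regression=0.5):
    """run_evals(model_id) -> mean eval score on the golden dataset (assumed interface)."""
    baseline = run_evals(PINNED_MODEL)
    candidate = run_evals(candidate_model)
    if candidate < baseline - max_regression:
        return False, f"regression: {baseline:.2f} -> {candidate:.2f}"
    return True, f"ok: {baseline:.2f} -> {candidate:.2f}"
```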

Business Connection

Eval scores mean nothing if they don't connect to outcomes.

| Eval Signal | Business Metric | Connection |
| --- | --- | --- |
| CRAFT score trending up | User retention | Better quality → users come back |
| Left tail shrinking | Support tickets | Fewer bad outputs → fewer complaints |
| Latency improving | Engagement | Faster → more usage |
| Safety score stable | Brand risk | No incidents → trust compounds |
| Cost per output declining | Margins | Efficiency → sustainability |

The Dashboard

What should leadership see?

| Row | Metric | Target | Current | Trend |
| --- | --- | --- | --- | --- |
| Quality | CRAFT mean score | ≥ 7/10 | [measured] | ↑ / → / ↓ |
| Reliability | Score variance (stddev) | < 1.5 | [measured] | ↑ / → / ↓ |
| Safety | Left tail incidents / week | < N | [measured] | ↑ / → / ↓ |
| Cost | $ per 1K outputs | < $X | [measured] | ↑ / → / ↓ |
| User | Satisfaction / edit rate | ≥ N% | [measured] | ↑ / → / ↓ |
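
A sketch of assembling these rows from a week of scored outputs follows. The targets mirror the table; the input fields (`craft_score`, incident counts, cost totals, edit rate) are assumptions about what your logging already captures.

```python
# Sketch of computing the leadership dashboard rows from one week of data.
import statistics


def dashboard(scored_outputs, incident_count, cost_usd, output_count, edit_rate):
    scores = [o["craft_score"] for o in scored_outputs]
    return {
        "quality_mean_score": round(statistics.mean(scores), 2),
        "reliability_stddev": round(statistics.stdev(scores), 2) if len(scores) > 1 else 0.0,
        "safety_incidents_per_week": incident_count,
        "cost_per_1k_outputs": round(1000 * cost_usd / max(output_count, 1), 2),
        "user_edit_rate": edit_rate,
    }
```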

Continuous Improvement

The feedback loop never stops:

| Question | Cadence | Owner |
| --- | --- | --- |
| Are eval scores improving AND users happier? | Weekly | PM |
| Is the golden dataset still representative? | Monthly | PM + Engineer |
| Are we measuring what matters or what's easy? | Quarterly | Team |
| What's our biggest eval blind spot right now? | Quarterly | Team |
| Would we know if quality silently degraded? | Monthly | Engineer |
| Are we benchmarking against state of the art? | Quarterly | Engineer |

Team Culture

Eval quality is a shared responsibility, not one person's job.

| Practice | Why | Frequency |
| --- | --- | --- |
| Team eval review | Shared understanding of quality | Weekly |
| Red teaming | Deliberately break the product | Monthly |
| Trace reading | Everyone sees real outputs | Weekly |
| Eval maintenance time | Datasets and rubrics stay fresh | Sprint allocation |
| Failure celebration | Found failures → prevented user harm | Always |

The most important cultural shift: celebrating eval improvements with the same energy as feature launches. A 5-point CRAFT improvement that prevents 1000 bad outputs per day is worth more than most features.
