
AI Evaluation

How do you know your AI product is getting better?

Not by gut feel. Not by cherry-picked examples. By systematic evaluation against a defined standard, run consistently, tracked over time. Evaluation is to AI products what prediction scoring is to forecasting — the discipline that separates signal from noise.

The CRAFT Checklist

Five dimensions, scored 0-2 each. Total: 0-10 points. Inspired by the SMART-BF pattern.

| Dimension | Question | 0 | 1 | 2 |
| --- | --- | --- | --- | --- |
| Correctness | Is the output factually right? | Hallucinating, wrong answers | Mostly right, minor errors | Accurate, verifiable |
| Reliability | Same input, consistent quality? | Wild variance across runs | Manageable spread | Predictable range |
| Alignment | Does it match user intent? | Misses the job entirely | Partially useful | Nails the job |
| Failsafe | What happens when wrong? | Silent failure, user misled | User can detect the error | Graceful recovery, signals uncertainty |
| Trust | Safe to ship at this quality? | Harmful potential, brand risk | Needs supervision, caveats | Confident release |

Scoring Guide

| Score | Quality | Action |
| --- | --- | --- |
| 9-10 | Excellent | Ship, celebrate, study what made it work |
| 7-8 | Good | Ship with monitoring, improve over time |
| 5-6 | Marginal | Improve before scaling, limit exposure |
| 3-4 | Poor | Don't ship — fix correctness or safety first |
| 0-2 | Failing | Fundamental rethink needed |
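The checklist and this action table reduce to a few lines of code. A minimal sketch (the function name and the plain-dict input are illustrative, not a prescribed API):

```python
def craft_action(scores: dict[str, int]) -> str:
    """Map five CRAFT dimension scores (0-2 each) to a ship/no-ship action."""
    if len(scores) != 5 or any(s not in (0, 1, 2) for s in scores.values()):
        raise ValueError("expected five dimension scores, each in the range 0-2")
    total = sum(scores.values())  # 0-10
    if total >= 9:
        return "Ship, celebrate, study what made it work"
    if total >= 7:
        return "Ship with monitoring, improve over time"
    if total >= 5:
        return "Improve before scaling, limit exposure"
    if total >= 3:
        return "Don't ship; fix correctness or safety first"
    return "Fundamental rethink needed"
```

Running it on the worked example below (2, 1, 2, 1, 1) lands on the 7/10 "ship with monitoring" tier.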

Worked Example

Product: Code review assistant
Input: Pull request diff with a subtle null pointer bug

| Dimension | Score | Reasoning |
| --- | --- | --- |
| Correctness | 2 | Identified the null pointer, correct explanation |
| Reliability | 1 | Catches this class of bug ~70% of the time |
| Alignment | 2 | User wanted code review, got actionable feedback |
| Failsafe | 1 | Doesn't flag when it's uncertain about a finding |
| Trust | 1 | Good enough to assist, not replace human review |

Total: 7/10 — Good. Ship with monitoring. Priority fix: add confidence signaling (Failsafe).

Golden Datasets

The foundation of evaluation. Garbage dataset, garbage scores.

Building

| Principle | Why | Anti-Pattern |
| --- | --- | --- |
| Representative | Must reflect real traffic | Only testing "clean" inputs that flatter the AI |
| Diverse | Cover all input categories | Over-indexing on common cases |
| Tagged | Metadata per example (category, difficulty, risk) | Untagged blob of examples |
| Versioned | Track changes over time | Editing in place without history |
| Validated | Expected outputs reviewed by qualified humans | One person's opinion as ground truth |
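An entry built to these principles might look like the following sketch; every field name here is illustrative, not a required schema:

```python
# Hypothetical golden-set entry: tagged, versioned, and multi-validated.
golden_example = {
    "id": "gs-0142",
    "version": 3,                     # bumped whenever the expected output changes
    "input": "Review this diff: ...", # abbreviated; real entries carry full input
    "expected": "Flags the unchecked None dereference in the return value",
    "tags": {
        "category": "code-review",
        "difficulty": "hard",
        "risk": "high",
    },
    "validated_by": ["reviewer-a", "reviewer-b"],  # not one person's opinion
}
```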

Size Guide

| Purpose | Minimum Examples | Why |
| --- | --- | --- |
| Smoke test | 20-50 | Quick sanity check on every change |
| Regression | 100-300 | Statistical significance for comparison |
| Comprehensive | 500+ | Full coverage of input universe |
| Safety | As many adversarial inputs as possible | The tail matters more than the mean |

Maintenance

Golden datasets rot. Production traffic drifts. Refresh quarterly at minimum.

| Signal | Action |
| --- | --- |
| Production inputs look different from dataset | Add new examples |
| "Expected outputs" are now wrong (world changed) | Update golden answers |
| New failure pattern discovered | Add to adversarial set |
| New user persona emerges | Add representative inputs |
| Eval scores rising but users unhappy | Dataset doesn't reflect reality — rebuild |

Eval Rubrics

Rubrics make scoring reproducible. Without them, "4 out of 5" means different things to different people.

Calibration Test

Before using a rubric at scale, run the inter-rater test:

  1. Three people score the same 20 outputs independently
  2. Compare scores
  3. Where they disagree, discuss and refine the rubric
  4. Repeat until agreement exceeds 80%

If you can't get 80% agreement, the rubric is too subjective. Split the dimension or add examples.
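Step 4's agreement check can be computed as mean pairwise exact agreement across the raters. A minimal sketch:

```python
from itertools import combinations

def pairwise_agreement(ratings: list[list[int]]) -> float:
    """Mean fraction of items on which each pair of raters gave identical scores.

    `ratings` holds one list of per-item scores per rater; all lists must
    cover the same items in the same order.
    """
    pairs = list(combinations(ratings, 2))
    per_pair = [
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    ]
    return sum(per_pair) / len(pairs)

# Three raters scoring the same 20 outputs (scores here are made up):
r1 = [2, 1, 2, 0, 1] * 4
r2 = [2, 1, 2, 1, 1] * 4
r3 = [2, 1, 2, 0, 1] * 4
calibrated = pairwise_agreement([r1, r2, r3]) >= 0.80  # refine rubric until True
```

Exact-match agreement is the simplest measure; chance-corrected statistics such as Fleiss' kappa are stricter when score distributions are skewed.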

Rubric Template

For each quality dimension:

DIMENSION: [Name]
WEIGHT: [1-3, how much this matters]

Score 2 (Excellent): [Specific description + example]
Score 1 (Acceptable): [Specific description + example]
Score 0 (Unacceptable): [Specific description + example]

BLOCKING: [Yes/No — does a 0 here override all other scores?]

Blocking dimensions (where a 0 means don't ship regardless of total score):

  • Safety violations
  • Factual errors in high-stakes domains
  • PII exposure
  • Legal/compliance violations
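The BLOCKING flag is an override in the aggregation, not another weighted term. A sketch assuming rubrics shaped like the template above (the dimension names and the 0.7 ship cut-off are illustrative):

```python
# Each dimension: a weight (1-3) and whether a 0 blocks shipping outright.
rubric = {
    "correctness": {"weight": 3, "blocking": True},
    "tone":        {"weight": 1, "blocking": False},
}

def ship_decision(scores: dict[str, int]) -> str:
    """A 0 on any blocking dimension overrides the weighted total."""
    if any(rubric[d]["blocking"] and scores[d] == 0 for d in scores):
        return "don't ship"
    total = sum(rubric[d]["weight"] * scores[d] for d in scores)
    max_total = sum(rubric[d]["weight"] * 2 for d in scores)  # 2 = top score
    return "ship" if total / max_total >= 0.7 else "improve"
```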

Automated Eval

Not everything needs human judgment. Automate what you can, save humans for what you must.

| Layer | What It Catches | Tool Type | Human Needed? |
| --- | --- | --- | --- |
| Format | Length, structure, required fields | Code-based checks | No |
| Safety | PII, harmful content, refusal adherence | Classifier + rules | No (for clear violations) |
| Factual | Known-answer questions, verifiable claims | Code + reference data | For edge cases |
| Quality | Relevance, tone, helpfulness | LLM-as-judge | For calibration |
| Nuance | Surprising insight, cultural sensitivity, taste | Human review | Yes |
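The format layer needs no model at all. A minimal sketch of code-based checks for length, structure, and required fields (the field names and length limit are hypothetical):

```python
import json

def check_format(raw_output: str, max_chars: int = 2000) -> list[str]:
    """Return a list of format violations; an empty list means pass."""
    violations = []
    if len(raw_output) > max_chars:
        violations.append(f"too long: {len(raw_output)} > {max_chars} chars")
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return violations + ["not valid JSON"]
    for field in ("summary", "severity"):  # hypothetical required fields
        if field not in data:
            violations.append(f"missing required field: {field}")
    return violations
```

Checks like these run in milliseconds, which is why the CI table below puts them on every commit.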

LLM-as-Judge

Using one model to evaluate another. Powerful but biased.

| Bias | Pattern | Mitigation |
| --- | --- | --- |
| Verbosity | Longer outputs score higher | Control for length in rubric |
| Self-preference | Model prefers its own style | Use a different model family as judge |
| Position | First option scores higher | Randomize output order |
| Sycophancy | Polite outputs score higher | Include "correct but blunt" in golden set |

Validation cadence: Re-calibrate LLM-as-judge against human scores monthly. If agreement drops below 75%, retrain or replace.
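The monthly re-calibration is plain agreement arithmetic applied to judge-versus-human scores; the 75% floor below comes from the cadence above, while the function names are illustrative:

```python
def judge_agreement(judge: list[int], human: list[int]) -> float:
    """Fraction of items where the LLM judge and the human gave the same score."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty score lists")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def judge_still_trusted(judge: list[int], human: list[int]) -> bool:
    """Below 75% agreement: retrain or replace the judge."""
    return judge_agreement(judge, human) >= 0.75
```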

CI/CD Integration

Evals should run on every change, like tests:

Code Change → Automated Eval Suite → Pass/Warn/Fail

                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
       Format       Safety     Quality
       (fast)       (fast)     (slower)

| Tier | Speed | Runs When | Blocks Deploy? |
| --- | --- | --- | --- |
| Format + Safety | Seconds | Every commit | Yes |
| Quality (smoke) | Minutes | Every PR | Yes, if regression |
| Quality (full) | Hours | Nightly / pre-release | Advisory |

The Eval Loop

Evaluation isn't a gate — it's a feedback loop:

DEFINE RUBRIC → BUILD DATASET → RUN EVAL → ANALYZE GAPS → IMPROVE → RE-EVAL
      ↑                                                                │
      └───────────── Update rubric as understanding deepens ───────────┘

| Stage | Key Question |
| --- | --- |
| Define | Are we measuring what users actually value? |
| Build | Does the dataset represent production traffic? |
| Run | Are scores reproducible and trustworthy? |
| Analyze | Where are failures concentrated? |
| Improve | What's the highest-leverage fix? |
| Re-eval | Did the improvement actually work? |

The New PRD

Traditional specs describe deterministic behavior: given input X, produce output Y. AI products produce distributions. The spec cannot define behavior — it can only define boundaries. Evals are the artifact that replaces the traditional spec for non-deterministic systems.

An eval has three components:

| Component | What It Is | What It Replaces |
| --- | --- | --- |
| Data | A set of inputs representing real usage | Acceptance criteria examples |
| Task | The system under test, producing an output from each input | The feature being built |
| Scores | A quality rating (0-1) per output, per dimension | Manual QA pass/fail |

The composite score across all data items is the eval result. Track it over time and you have a quality trajectory — not a binary gate, but a signal that tells you whether the system is getting better or worse.
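As a sketch, the composite is simply the mean of per-item scores, and the trajectory is that mean tracked across runs (the scores below are made up):

```python
from statistics import mean

def eval_result(per_item_scores: list[float]) -> float:
    """Composite for one eval run: the mean of per-item 0-1 scores."""
    return mean(per_item_scores)

# Quality trajectory: one composite per run, ordered by time.
runs = [
    eval_result([0.6, 0.7, 0.5, 0.8]),  # baseline run
    eval_result([0.7, 0.8, 0.6, 0.9]),  # after a prompt change
]
improving = runs[-1] > runs[0]  # the signal: better or worse, not pass/fail
```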

Stories as Datasets

VV stories already contain the three components:

| Story Field | Eval Component | How It Maps |
| --- | --- | --- |
| Scenario (the struggling moment) | Data — input | The situation that triggers the eval case |
| Outcomes (measurable before/after) | Scores — expected output | The threshold that defines pass/fail |
| Counterfeit (fake success) | Data — adversarial input | The case that should score low but might score high |
| Actions (what the system does) | Task — system under test | The capability being evaluated |

A counterfeit outcome is a golden dataset's most valuable item. It tests whether the system can distinguish real progress from the appearance of progress — the left tail where invisible failures hide.

Eval Targets

Every story should declare what "good" looks like before anyone writes code. Three fields turn a story from a wish into a testable claim:

| Field | What it answers | Example |
| --- | --- | --- |
| Dimension | What kind of quality matters most here? | A search feature cares about correctness (right results). A payment flow cares about trust (safe to use). |
| Threshold | What number means "pass"? | "<2s p95 response time", ">95% accuracy", "zero data loss" |
| Baseline | Where are we starting from? | "~30s manual lookup across 3 tools", "no capability exists" |

The dimension comes from the CRAFT checklist: Correctness, Reliability, Alignment, Failsafe, or Trust. Pick the one dimension that would matter most if this story failed.

The threshold must be a number. "Fast" is not a target. "<2 seconds" is. If you cannot write a number, you do not understand the outcome well enough to build it.

The baseline is where you stand before the work starts. Without a baseline, you cannot prove the work made anything better. You are just shipping and hoping.

Why this matters for prioritization: Performance readiness carries the highest weight (0.25) in the readiness formula. If no stories have eval targets, Performance readiness cannot exceed 2 — there is nothing to measure against. A capability with no targets is a capability with no definition of done.

Worked Example

Story: "When managing 20+ contacts across 3 tools, each lookup takes 30 seconds. I need to find any contact with full deal context in one search. So I get contact found in <5s with linked deals. Not: contact found but deal history missing — looks fast, delivers incomplete."

| Field | Value |
| --- | --- |
| Dimension | correctness — the result must include linked deals, not just the contact name |
| Threshold | "<5s with linked deals" — time AND completeness |
| Baseline | "~30s across 3 tools, deals not linked" — the current pain |

This story is now testable. An engineer can build it and measure whether it passes. A scorer can run it and track whether quality improves over time. A commissioner can verify whether the deployed system meets the claim. The story stopped being a wish and became a contract.

Rubric Scaling

The CRAFT checklist (0-2 per dimension, 0-10 total) is a human-readable rubric. Machine-executable rubrics use finer scales:

| Scale | When to Use | Conversion |
| --- | --- | --- |
| 0-2 (CRAFT) | Human review, quick assessment | CRAFT 0 = machine 1-2, CRAFT 1 = machine 3, CRAFT 2 = machine 4-5 |
| 1-5 (rubric dimension) | Automated scoring per dimension | Weight dimensions, aggregate to composite |
| 0-100 (composite) | Cross-eval comparison, dashboards | ((avg(weighted_scores) - 1) / 4) * 100 |
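The last row's conversion, as code; this sketch assumes "avg(weighted_scores)" means the weighted average, which stays on the 1-5 scale (dimension names and weights are illustrative):

```python
def composite_0_100(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Convert weighted 1-5 dimension scores to a 0-100 composite.

    Implements ((avg(weighted_scores) - 1) / 4) * 100, where the weighted
    average is sum(score * weight) / sum(weights).
    """
    total_weight = sum(weights.values())
    weighted_avg = sum(scores[d] * weights[d] for d in scores) / total_weight
    return (weighted_avg - 1) / 4 * 100
```

All 5s map to 100, all 1s map to 0, and all 3s sit exactly at 50, whatever the weights.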

Thresholds convert continuous scores into decisions:

| Tier | Meaning | Action |
| --- | --- | --- |
| Below minimum | Fails basic quality | Don't ship. Fix the worst dimension. |
| Minimum | Meets floor | Ship with monitoring. Improve weekly. |
| Production | Reliable quality | Ship confidently. Improve monthly. |
| World class | Exceeds expectations | Study what works. Raise the floor for others. |
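A sketch of converting a 0-100 composite into these tiers; the numeric cut-offs are illustrative defaults, not values from this document:

```python
def tier(composite: float,
         minimum: float = 60.0,      # illustrative cut-offs; each product
         production: float = 75.0,   # sets its own thresholds
         world_class: float = 90.0) -> str:
    """Map a 0-100 composite score to a quality tier."""
    if composite >= world_class:
        return "world class"
    if composite >= production:
        return "production"
    if composite >= minimum:
        return "minimum"
    return "below minimum"
```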

Standards as Code

A human writes a quality standard. A rubric encodes it as weighted dimensions with score guides. A scoring engine executes it against data. Thresholds convert the result into a decision.

HUMAN STANDARD (what "good" means)
        ↓
RUBRIC (dimensions × weights × score guides)
        ↓
SCORING ENGINE (runs rubric against data items)
        ↓
THRESHOLDS (minimum / production / world class)
        ↓
DECISION (ship / improve / stop)

The standard is the contract between the team that defines quality and the system that measures it. When the standard changes, the rubric changes, the scores change, the decisions change. One source of truth, one direction of flow.

This is how commissioning connects to evaluation. The L0-L4 maturity model measures whether a capability was built. The eval measures whether it was built well. L3 (tested) proves the code runs. Eval scores prove the output is good. L4 (commissioned) requires both.

Common Failures

| Failure | Pattern | Fix |
| --- | --- | --- |
| False confidence | High scores on easy dataset | Add hard cases, adversarial inputs |
| Stale evals | Production drifted, evals didn't | Refresh dataset quarterly |
| One rubric | Same eval for chat, code, search | Different rubrics per use case |
| No baseline | Can't tell if changes help or hurt | Establish baseline before iterating |
| Eval debt | "We'll add evals later" | Evals in the PRD, not afterthought |
| Score worship | Optimizing eval score, not user value | Validate scores against user satisfaction regularly |


Questions

  • What would change if your eval scores went up but user satisfaction went down?

  • Which CRAFT dimension is blocking in your product — where a 0 means don't ship regardless of total score?
  • If your golden dataset hasn't been refreshed in 90 days, what production reality are your evals missing?
  • When is LLM-as-judge good enough — and when does the bias cost more than human review?
  • How many of your VV stories have a threshold you could test against right now — and what does that number tell you about Performance readiness?
  • What is the gap between your commissioning maturity (L0-L4) and your eval maturity — and which one is lying?