AI Evaluation
How do you know your AI product is getting better?
Not by gut feel. Not by cherry-picked examples. By systematic evaluation against a defined standard, run consistently, tracked over time. Evaluation is to AI products what prediction scoring is to forecasting — the discipline that separates signal from noise.
The CRAFT Checklist
Five dimensions, scored 0-2 each. Total: 0-10 points. Inspired by the SMART-BF pattern.
| Dimension | Question | 0 | 1 | 2 |
|---|---|---|---|---|
| Correctness | Is the output factually right? | Hallucinating, wrong answers | Mostly right, minor errors | Accurate, verifiable |
| Reliability | Same input, consistent quality? | Wild variance across runs | Manageable spread | Predictable range |
| Alignment | Does it match user intent? | Misses the job entirely | Partially useful | Nails the job |
| Failsafe | What happens when wrong? | Silent failure, user misled | User can detect the error | Graceful recovery, signals uncertainty |
| Trust | Safe to ship at this quality? | Harmful potential, brand risk | Needs supervision, caveats | Confident release |
Scoring Guide
| Score | Quality | Action |
|---|---|---|
| 9-10 | Excellent | Ship, celebrate, study what made it work |
| 7-8 | Good | Ship with monitoring, improve over time |
| 5-6 | Marginal | Improve before scaling, limit exposure |
| 3-4 | Poor | Don't ship — fix correctness or safety first |
| 0-2 | Failing | Fundamental rethink needed |
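The checklist and the action tiers above can be sketched as a small scoring helper. This is a minimal sketch: the dimension names come from CRAFT, and the tier boundaries mirror the scoring guide table.

```python
CRAFT_DIMENSIONS = ("correctness", "reliability", "alignment", "failsafe", "trust")

def craft_total(scores: dict[str, int]) -> int:
    """Sum the five CRAFT dimensions (each 0-2) into a 0-10 total."""
    missing = set(CRAFT_DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    if any(s not in (0, 1, 2) for s in scores.values()):
        raise ValueError("each dimension must score 0, 1, or 2")
    return sum(scores[d] for d in CRAFT_DIMENSIONS)

def craft_action(total: int) -> str:
    """Map a 0-10 total onto the action tiers from the scoring guide."""
    if total >= 9:
        return "ship, celebrate, study what made it work"
    if total >= 7:
        return "ship with monitoring"
    if total >= 5:
        return "improve before scaling"
    if total >= 3:
        return "don't ship"
    return "fundamental rethink"
```

The worked example below (2 + 1 + 2 + 1 + 1) totals 7, which maps to "ship with monitoring".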
Worked Example
Product: Code review assistant
Input: Pull request diff with a subtle null pointer bug
| Dimension | Score | Reasoning |
|---|---|---|
| Correctness | 2 | Identified the null pointer, correct explanation |
| Reliability | 1 | Catches this class of bug ~70% of the time |
| Alignment | 2 | User wanted code review, got actionable feedback |
| Failsafe | 1 | Doesn't flag when it's uncertain about a finding |
| Trust | 1 | Good enough to assist, not replace human review |
Total: 7/10 — Good. Ship with monitoring. Priority fix: add confidence signaling (Failsafe).
Golden Datasets
The foundation of evaluation. Garbage dataset, garbage scores.
Building
| Principle | Why | Anti-Pattern |
|---|---|---|
| Representative | Must reflect real traffic | Only testing "clean" inputs that flatter the AI |
| Diverse | Cover all input categories | Over-indexing on common cases |
| Tagged | Metadata per example (category, difficulty, risk) | Untagged blob of examples |
| Versioned | Track changes over time | Editing in place without history |
| Validated | Expected outputs reviewed by qualified humans | One person's opinion as ground truth |
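A minimal sketch of what a tagged, versioned golden example might look like. The field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    """One item in a golden dataset: input, expected output, and metadata tags."""
    input: str
    expected: str
    category: str                        # e.g. "billing", "code-review"
    difficulty: str                      # e.g. "easy", "hard", "adversarial"
    risk: str                            # e.g. "low", "high"
    reviewed_by: tuple[str, ...] = ()    # qualified humans who validated the expected output
    dataset_version: str = "v1"          # bump on any edit; never change in place

example = GoldenExample(
    input="Why was I charged twice this month?",
    expected="Explain the duplicate authorization hold; confirm only one charge settles.",
    category="billing",
    difficulty="hard",
    risk="high",
    reviewed_by=("reviewer_a", "reviewer_b"),
)
```

Freezing the dataclass and versioning the dataset field are the code-level analogues of "Versioned" and "Validated" above: edits produce a new version rather than silently mutating ground truth.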
Size Guide
| Purpose | Minimum Examples | Why |
|---|---|---|
| Smoke test | 20-50 | Quick sanity check on every change |
| Regression | 100-300 | Statistical significance for comparison |
| Comprehensive | 500+ | Full coverage of input universe |
| Safety | As many adversarial inputs as possible | The tail matters more than the mean |
Maintenance
Golden datasets rot. Production traffic drifts. Refresh quarterly at minimum.
| Signal | Action |
|---|---|
| Production inputs look different from dataset | Add new examples |
| "Expected outputs" are now wrong (world changed) | Update golden answers |
| New failure pattern discovered | Add to adversarial set |
| New user persona emerges | Add representative inputs |
| Eval scores rising but users unhappy | Dataset doesn't reflect reality — rebuild |
Eval Rubrics
Rubrics make scoring reproducible. Without them, "4 out of 5" means different things to different people.
Calibration Test
Before using a rubric at scale, run the inter-rater test:
- Three people score the same 20 outputs independently
- Compare scores
- Where they disagree, discuss and refine the rubric
- Repeat until agreement exceeds 80%
If you can't get 80% agreement, the rubric is too subjective. Split the dimension or add examples.
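The calibration test can be made concrete with a simple pairwise agreement rate. This is a sketch: "agreement" here means an exact score match between each pair of raters, one reasonable reading of the 80% bar.

```python
from itertools import combinations

def pairwise_agreement(ratings: list[list[int]]) -> float:
    """Fraction of (rater pair, output) comparisons with identical scores.

    `ratings` holds one list of scores per rater, all over the same outputs.
    """
    assert len(ratings) >= 2 and len({len(r) for r in ratings}) == 1
    matches = total = 0
    for a, b in combinations(ratings, 2):
        for x, y in zip(a, b):
            matches += (x == y)
            total += 1
    return matches / total

# Three raters, five outputs each (scores 0-2):
rater_scores = [
    [2, 1, 2, 0, 1],
    [2, 1, 2, 1, 1],
    [2, 1, 2, 0, 1],
]
print(pairwise_agreement(rater_scores))  # 13/15 ≈ 0.87, just above the 80% bar
```

In practice a chance-corrected statistic such as Cohen's or Fleiss' kappa is more robust than raw agreement, since raters agree by luck some fraction of the time on a coarse scale.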
Rubric Template
For each quality dimension:
DIMENSION: [Name]
WEIGHT: [1-3, how much this matters]
Score 2 (Excellent): [Specific description + example]
Score 1 (Acceptable): [Specific description + example]
Score 0 (Unacceptable): [Specific description + example]
BLOCKING: [Yes/No — does a 0 here override all other scores?]
Blocking dimensions (where a 0 means don't ship regardless of total score):
- Safety violations
- Factual errors in high-stakes domains
- PII exposure
- Legal/compliance violations
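The template above can be encoded as data, with blocking dimensions overriding the weighted total. A sketch; the dimension names and weights are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    weight: int              # 1-3, how much this matters
    blocking: bool = False   # a 0 here overrides all other scores

def rubric_verdict(dimensions: list[Dimension], scores: dict[str, int]) -> str:
    """Return 'blocked' if any blocking dimension scored 0, else the weighted 0-2 average."""
    for d in dimensions:
        if d.blocking and scores[d.name] == 0:
            return "blocked: " + d.name
    total_weight = sum(d.weight for d in dimensions)
    weighted = sum(d.weight * scores[d.name] for d in dimensions) / total_weight
    return f"weighted score: {weighted:.2f} / 2"

rubric = [
    Dimension("safety", weight=3, blocking=True),
    Dimension("helpfulness", weight=2),
    Dimension("tone", weight=1),
]
print(rubric_verdict(rubric, {"safety": 0, "helpfulness": 2, "tone": 2}))
# A safety 0 blocks the ship decision regardless of the other scores.
```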
Automated Eval
Not everything needs human judgment. Automate what you can; save humans for what you must.
| Layer | What It Catches | Tool Type | Human Needed? |
|---|---|---|---|
| Format | Length, structure, required fields | Code-based checks | No |
| Safety | PII, harmful content, refusal adherence | Classifier + rules | No (for clear violations) |
| Factual | Known-answer questions, verifiable claims | Code + reference data | For edge cases |
| Quality | Relevance, tone, helpfulness | LLM-as-judge | For calibration |
| Nuance | Surprising insight, cultural sensitivity, taste | Human review | Yes |
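The format layer is the cheapest to automate: plain code-based checks, no model involved. A sketch with illustrative limits and field names:

```python
def format_check(output: str, required_fields: list[str], max_chars: int = 2000) -> list[str]:
    """Return a list of format violations; an empty list means pass."""
    violations = []
    if len(output) > max_chars:
        violations.append(f"too long: {len(output)} > {max_chars} chars")
    for name in required_fields:
        if name not in output:
            violations.append(f"missing required field: {name}")
    return violations

issues = format_check(
    '{"summary": "Null check missing on line 42."}',
    required_fields=["summary", "severity"],
)
print(issues)  # one violation: the "severity" field is absent
```

Because these checks are deterministic and run in milliseconds, they belong in the blocking tier of CI, ahead of any model-based scoring.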
LLM-as-Judge
Using one model to evaluate another. Powerful but biased.
| Bias | Pattern | Mitigation |
|---|---|---|
| Verbosity | Longer outputs score higher | Control for length in rubric |
| Self-preference | Model prefers its own style | Use a different model family as judge |
| Position | First option scores higher | Randomize output order |
| Sycophancy | Polite outputs score higher | Include "correct but blunt" in golden set |
Validation cadence: Re-calibrate LLM-as-judge against human scores monthly. If agreement drops below 75%, retrain or replace.
CI/CD Integration
Evals should run on every change, like tests:
Code Change → Automated Eval Suite → Pass/Warn/Fail
                      │
              ┌───────┼───────┐
              ▼       ▼       ▼
           Format  Safety  Quality
           (fast)  (fast)  (slower)
| Tier | Speed | Runs When | Blocks Deploy? |
|---|---|---|---|
| Format + Safety | Seconds | Every commit | Yes |
| Quality (smoke) | Minutes | Every PR | Yes, if regression |
| Quality (full) | Hours | Nightly / pre-release | Advisory |
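A tiered gate can be sketched as plain functions wired into CI. This is illustrative: the tier names mirror the table above, and the check lambdas are stand-ins for real eval suites.

```python
from typing import Callable

# Each tier: (name, check, blocks_deploy). Checks return True on pass.
Tier = tuple[str, Callable[[], bool], bool]

def run_gates(tiers: list[Tier]) -> tuple[str, list[str]]:
    """Run tiers in order; return the overall verdict and per-tier results."""
    results, verdict = [], "pass"
    for name, check, blocks in tiers:
        ok = check()
        results.append(f"{name}: {'pass' if ok else 'FAIL'}")
        if not ok:
            verdict = "fail" if blocks else ("warn" if verdict == "pass" else verdict)
    return verdict, results

tiers = [
    ("format+safety", lambda: True, True),    # seconds, every commit, blocking
    ("quality-smoke", lambda: True, True),    # minutes, every PR, blocks on regression
    ("quality-full",  lambda: False, False),  # hours, nightly, advisory only
]
verdict, results = run_gates(tiers)
print(verdict)  # "warn": the advisory tier failed, but nothing blocking did
```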
The Eval Loop
Evaluation isn't a gate — it's a feedback loop:
DEFINE RUBRIC → BUILD DATASET → RUN EVAL → ANALYZE GAPS → IMPROVE → RE-EVAL
      ↑                                                                │
      └──────────── Update rubric as understanding deepens ────────────┘
| Stage | Key Question |
|---|---|
| Define | Are we measuring what users actually value? |
| Build | Does the dataset represent production traffic? |
| Run | Are scores reproducible and trustworthy? |
| Analyze | Where are failures concentrated? |
| Improve | What's the highest-leverage fix? |
| Re-eval | Did the improvement actually work? |
The New PRD
Traditional specs describe deterministic behavior: given input X, produce output Y. AI products produce distributions. The spec cannot define behavior — it can only define boundaries. Evals are the artifact that replaces the traditional spec for non-deterministic systems.
An eval has three components:
| Component | What It Is | What It Replaces |
|---|---|---|
| Data | A set of inputs representing real usage | Acceptance criteria examples |
| Task | The system under test, producing an output from each input | The feature being built |
| Scores | A quality rating (0-1) per output, per dimension | Manual QA pass/fail |
The composite score across all data items is the eval result. Track it over time and you have a quality trajectory — not a binary gate, but a signal that tells you whether the system is getting better or worse.
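A minimal sketch of the data/task/scores triple and the composite over it. Names are illustrative: `task` is whatever system is under test, and each scorer returns a 0-1 rating for one dimension.

```python
from statistics import mean
from typing import Callable

def run_eval(data: list[str],
             task: Callable[[str], str],
             scorers: dict[str, Callable[[str, str], float]]) -> dict[str, float]:
    """Score every (input, output) pair on every dimension.

    Returns per-dimension means plus an overall composite
    (here, the unweighted mean of the dimension means).
    """
    per_dim = {name: [] for name in scorers}
    for item in data:
        output = task(item)
        for name, score in scorers.items():
            per_dim[name].append(score(item, output))
    result = {name: mean(vals) for name, vals in per_dim.items()}
    result["composite"] = mean(result.values())
    return result

# Toy example: the task uppercases its input; one scorer checks correctness,
# one checks length.
scores = run_eval(
    data=["alpha", "beta"],
    task=lambda x: x.upper(),
    scorers={
        "correct": lambda inp, out: 1.0 if out == inp.upper() else 0.0,
        "concise": lambda inp, out: 1.0 if len(out) <= 10 else 0.0,
    },
)
print(scores["composite"])  # 1.0 on this toy data
```

Store the composite per run, and the quality trajectory described above is just this number plotted over time.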
Stories as Datasets
VV stories already contain the three components:
| Story Field | Eval Component | How It Maps |
|---|---|---|
| Scenario (the struggling moment) | Data — input | The situation that triggers the eval case |
| Outcomes (measurable before/after) | Scores — expected output | The threshold that defines pass/fail |
| Counterfeit (fake success) | Data — adversarial input | The case that should score low but might score high |
| Actions (what the system does) | Task — system under test | The capability being evaluated |
A counterfeit outcome is a golden dataset's most valuable item. It tests whether the system can distinguish real progress from the appearance of progress — the left tail where invisible failures hide.
Eval Targets
Every story should declare what "good" looks like before anyone writes code. Three fields turn a story from a wish into a testable claim:
| Field | What it answers | Example |
|---|---|---|
| Dimension | What kind of quality matters most here? | A search feature cares about correctness (right results). A payment flow cares about trust (safe to use). |
| Threshold | What number means "pass"? | "<2s p95 response time", ">95% accuracy", "zero data loss" |
| Baseline | Where are we starting from? | "~30s manual lookup across 3 tools", "no capability exists" |
The dimension comes from the CRAFT checklist: Correctness, Reliability, Alignment, Failsafe, or Trust. Pick the one dimension that would matter most if this story failed.
The threshold must be a number. "Fast" is not a target. "<2 seconds" is. If you cannot write a number, you do not understand the outcome well enough to build it.
The baseline is where you stand before the work starts. Without a baseline, you cannot prove the work made anything better. You are just shipping and hoping.
Why this matters for prioritization: Performance readiness carries the highest weight (0.25) in the readiness formula. If no stories have eval targets, Performance readiness cannot exceed 2 — there is nothing to measure against. A capability with no targets is a capability with no definition of done.
Worked Example
Story: "When managing 20+ contacts across 3 tools, each lookup takes 30 seconds. I need to find any contact with full deal context in one search. So I get contact found in <5s with linked deals. Not: contact found but deal history missing — looks fast, delivers incomplete."
| Field | Value |
|---|---|
| Dimension | correctness — the result must include linked deals, not just the contact name |
| Threshold | "<5s with linked deals" — time AND completeness |
| Baseline | "~30s across 3 tools, deals not linked" — the current pain |
This story is now testable. An engineer can build it and measure whether it passes. A scorer can run it and track whether quality improves over time. A commissioner can verify whether the deployed system meets the claim. The story stopped being a wish and became a contract.
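The story above can be expressed directly as a testable eval case. A sketch; the field names and the measurement hook are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class EvalTarget:
    dimension: str        # one CRAFT dimension
    threshold_s: float    # pass only if measured latency is below this
    require_deals: bool   # pass only if linked deals are present
    baseline_s: float     # where we started

target = EvalTarget(dimension="correctness", threshold_s=5.0,
                    require_deals=True, baseline_s=30.0)

def passes(measured_s: float, deals_linked: bool, t: EvalTarget) -> bool:
    """Time AND completeness — the counterfeit is fast-but-incomplete."""
    return measured_s < t.threshold_s and (deals_linked or not t.require_deals)

print(passes(3.2, True, target))    # True: fast and complete
print(passes(3.2, False, target))   # False: the counterfeit — looks fast, misses deals
```

Note that the counterfeit case is encoded as a conjunction: a latency-only threshold would score the fake success as a pass.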
Rubric Scaling
The CRAFT checklist (0-2 per dimension, 0-10 total) is a human-readable rubric. Machine-executable rubrics use finer scales:
| Scale | When to Use | Conversion |
|---|---|---|
| 0-2 (CRAFT) | Human review, quick assessment | CRAFT 0 = machine 1-2, CRAFT 1 = machine 3, CRAFT 2 = machine 4-5 |
| 1-5 (rubric dimension) | Automated scoring per dimension | Weight dimensions, aggregate to composite |
| 0-100 (composite) | Cross-eval comparison, dashboards | ((avg(weighted_scores) - 1) / 4) * 100 |
Thresholds convert continuous scores into decisions:
| Tier | Meaning | Action |
|---|---|---|
| Below minimum | Fails basic quality | Don't ship. Fix the worst dimension. |
| Minimum | Meets floor | Ship with monitoring. Improve weekly. |
| Production | Reliable quality | Ship confidently. Improve monthly. |
| World class | Exceeds expectations | Study what works. Raise the floor for others. |
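The 1-5 weighted composite, the 0-100 conversion, and the threshold tiers can be sketched together. The tier cutoffs below are assumptions for illustration, not values prescribed by the table.

```python
def composite_0_100(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Convert weighted 1-5 dimension scores to 0-100: ((weighted_avg - 1) / 4) * 100."""
    total_w = sum(weights.values())
    weighted_avg = sum(scores[d] * weights[d] for d in scores) / total_w
    return (weighted_avg - 1) / 4 * 100

def tier(composite: float, minimum: float = 50, production: float = 70,
         world_class: float = 90) -> str:
    """Map a 0-100 composite onto the decision tiers (cutoff values are assumptions)."""
    if composite < minimum:
        return "below minimum"
    if composite < production:
        return "minimum"
    if composite < world_class:
        return "production"
    return "world class"

c = composite_0_100({"correctness": 4, "reliability": 3},
                    {"correctness": 2, "reliability": 1})
print(round(c, 1), tier(c))
```

Here the weighted average is (4×2 + 3×1) / 3 ≈ 3.67 on the 1-5 scale, which converts to ≈66.7 on the 0-100 scale — above the assumed minimum, below production.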
Standards as Code
A human writes a quality standard. A rubric encodes it as weighted dimensions with score guides. A scoring engine executes it against data. Thresholds convert the result into a decision.
HUMAN STANDARD (what "good" means)
        ↓
RUBRIC (dimensions × weights × score guides)
        ↓
SCORING ENGINE (runs rubric against data items)
        ↓
THRESHOLDS (minimum / production / world class)
        ↓
DECISION (ship / improve / stop)
The standard is the contract between the team that defines quality and the system that measures it. When the standard changes, the rubric changes, the scores change, the decisions change. One source of truth, one direction of flow.
This is how commissioning connects to evaluation. The L0-L4 maturity model measures whether a capability was built. The eval measures whether it was built well. L3 (tested) proves the code runs. Eval scores prove the output is good. L4 (commissioned) requires both.
Common Failures
| Failure | Pattern | Fix |
|---|---|---|
| False confidence | High scores on easy dataset | Add hard cases, adversarial inputs |
| Stale evals | Production drifted, evals didn't | Refresh dataset quarterly |
| One rubric | Same eval for chat, code, search | Different rubrics per use case |
| No baseline | Can't tell if changes help or hurt | Establish baseline before iterating |
| Eval debt | "We'll add evals later" | Evals in the PRD, not afterthought |
| Score worship | Optimizing eval score, not user value | Validate scores against user satisfaction regularly |
Context
- Deterministic vs Probabilistic — AI products produce distributions, not outputs
- SMART-BF Checklist — The pattern CRAFT extends
- AI Product Principles — Define "good" before you measure it
- AI Observability — From eval scores to production insights
- VVFL Loop — Evaluation IS the feedback loop
- VV Stories — Stories are eval datasets waiting to be run
- Commissioning — L0-L4 proves it was built; evals prove it was built well
- Outer Loop Validation — Where the scoring engine lives
Questions
- What would change if your eval scores went up but user satisfaction went down?
- Which CRAFT dimension is blocking in your product — where a 0 means don't ship regardless of total score?
- If your golden dataset hasn't been refreshed in 90 days, what production reality are your evals missing?
- When is LLM-as-judge good enough — and when does the bias cost more than human review?
- How many of your VV stories have a threshold you could test against right now — and what does that number tell you about Performance readiness?
- What is the gap between your commissioning maturity (L0-L4) and your eval maturity — and which one is lying?