AI Evaluation
How do you know your AI product is getting better?
Not by gut feel. Not by cherry-picked examples. By systematic evaluation against a defined standard, run consistently, tracked over time. Evaluation is to AI products what prediction scoring is to forecasting — the discipline that separates signal from noise.
The CRAFT Checklist
Five dimensions, scored 0-2 each. Total: 0-10 points. Inspired by the SMART-BF pattern.
| Dimension | Question | 0 | 1 | 2 |
|---|---|---|---|---|
| Correctness | Is the output factually right? | Hallucinating, wrong answers | Mostly right, minor errors | Accurate, verifiable |
| Reliability | Same input, consistent quality? | Wild variance across runs | Manageable spread | Predictable range |
| Alignment | Does it match user intent? | Misses the job entirely | Partially useful | Nails the job |
| Failsafe | What happens when wrong? | Silent failure, user misled | User can detect the error | Graceful recovery, signals uncertainty |
| Trust | Safe to ship at this quality? | Harmful potential, brand risk | Needs supervision, caveats | Confident release |
Scoring Guide
| Score | Quality | Action |
|---|---|---|
| 9-10 | Excellent | Ship, celebrate, study what made it work |
| 7-8 | Good | Ship with monitoring, improve over time |
| 5-6 | Marginal | Improve before scaling, limit exposure |
| 3-4 | Poor | Don't ship — fix correctness or safety first |
| 0-2 | Failing | Fundamental rethink needed |
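The checklist and the action tiers above can be sketched as a small scoring helper. This is a minimal sketch: the dimension names come from CRAFT, and the tier boundaries mirror the scoring guide table.

```python
CRAFT_DIMENSIONS = ("correctness", "reliability", "alignment", "failsafe", "trust")

def craft_total(scores: dict[str, int]) -> int:
    """Sum the five CRAFT dimensions (each 0-2) into a 0-10 total."""
    missing = set(CRAFT_DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    if any(s not in (0, 1, 2) for s in scores.values()):
        raise ValueError("each dimension must score 0, 1, or 2")
    return sum(scores[d] for d in CRAFT_DIMENSIONS)

def craft_action(total: int) -> str:
    """Map a 0-10 total onto the action tiers from the scoring guide."""
    if total >= 9:
        return "ship, celebrate, study what made it work"
    if total >= 7:
        return "ship with monitoring"
    if total >= 5:
        return "improve before scaling"
    if total >= 3:
        return "don't ship"
    return "fundamental rethink"
```

The worked example below (2 + 1 + 2 + 1 + 1) totals 7, which maps to "ship with monitoring".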
Worked Example
Product: Code review assistant
Input: Pull request diff with a subtle null pointer bug
| Dimension | Score | Reasoning |
|---|---|---|
| Correctness | 2 | Identified the null pointer, correct explanation |
| Reliability | 1 | Catches this class of bug ~70% of the time |
| Alignment | 2 | User wanted code review, got actionable feedback |
| Failsafe | 1 | Doesn't flag when it's uncertain about a finding |
| Trust | 1 | Good enough to assist, not replace human review |
Total: 7/10 — Good. Ship with monitoring. Priority fix: add confidence signaling (Failsafe).
Golden Datasets
The foundation of evaluation. Garbage dataset, garbage scores.
Building
| Principle | Why | Anti-Pattern |
|---|---|---|
| Representative | Must reflect real traffic | Only testing "clean" inputs that flatter the AI |
| Diverse | Cover all input categories | Over-indexing on common cases |
| Tagged | Metadata per example (category, difficulty, risk) | Untagged blob of examples |
| Versioned | Track changes over time | Editing in place without history |
| Validated | Expected outputs reviewed by qualified humans | One person's opinion as ground truth |
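A minimal sketch of what a tagged, versioned golden example might look like. The field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    """One item in a golden dataset: input, expected output, and metadata tags."""
    input: str
    expected: str
    category: str                        # e.g. "billing", "code-review"
    difficulty: str                      # e.g. "easy", "hard", "adversarial"
    risk: str                            # e.g. "low", "high"
    reviewed_by: tuple[str, ...] = ()    # qualified humans who validated the expected output
    dataset_version: str = "v1"          # bump on any edit; never change in place

example = GoldenExample(
    input="Why was I charged twice this month?",
    expected="Explain the duplicate authorization hold; confirm only one charge settles.",
    category="billing",
    difficulty="hard",
    risk="high",
    reviewed_by=("reviewer_a", "reviewer_b"),
)
```

Freezing the dataclass and versioning the dataset field are the code-level analogues of "Versioned" and "Validated" above: edits produce a new version rather than silently mutating ground truth.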
Size Guide
| Purpose | Minimum Examples | Why |
|---|---|---|
| Smoke test | 20-50 | Quick sanity check on every change |
| Regression | 100-300 | Statistical significance for comparison |
| Comprehensive | 500+ | Full coverage of input universe |
| Safety | As many adversarial inputs as possible | The tail matters more than the mean |
Maintenance
Golden datasets rot. Production traffic drifts. Refresh quarterly at minimum.
| Signal | Action |
|---|---|
| Production inputs look different from dataset | Add new examples |
| "Expected outputs" are now wrong (world changed) | Update golden answers |
| New failure pattern discovered | Add to adversarial set |
| New user persona emerges | Add representative inputs |
| Eval scores rising but users unhappy | Dataset doesn't reflect reality — rebuild |
Eval Rubrics
Rubrics make scoring reproducible. Without them, "4 out of 5" means different things to different people.
Calibration Test
Before using a rubric at scale, run the inter-rater test:
- Three people score the same 20 outputs independently
- Compare scores
- Where they disagree, discuss and refine the rubric
- Repeat until agreement exceeds 80%
If you can't get 80% agreement, the rubric is too subjective. Split the dimension or add examples.
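The calibration test can be made concrete with a simple pairwise agreement rate. This is a sketch: "agreement" here means an exact score match between each pair of raters, one reasonable reading of the 80% bar.

```python
from itertools import combinations

def pairwise_agreement(ratings: list[list[int]]) -> float:
    """Fraction of (rater pair, output) comparisons with identical scores.

    `ratings` holds one list of scores per rater, all over the same outputs.
    """
    assert len(ratings) >= 2 and len({len(r) for r in ratings}) == 1
    matches = total = 0
    for a, b in combinations(ratings, 2):
        for x, y in zip(a, b):
            matches += (x == y)
            total += 1
    return matches / total

# Three raters, five outputs each (scores 0-2):
rater_scores = [
    [2, 1, 2, 0, 1],
    [2, 1, 2, 1, 1],
    [2, 1, 2, 0, 1],
]
print(pairwise_agreement(rater_scores))  # 13/15 ≈ 0.87, just above the 80% bar
```

In practice a chance-corrected statistic such as Cohen's or Fleiss' kappa is more robust than raw agreement, since raters agree by luck some fraction of the time on a coarse scale.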
Rubric Template
For each quality dimension:
DIMENSION: [Name]
WEIGHT: [1-3, how much this matters]
Score 2 (Excellent): [Specific description + example]
Score 1 (Acceptable): [Specific description + example]
Score 0 (Unacceptable): [Specific description + example]
BLOCKING: [Yes/No — does a 0 here override all other scores?]
Blocking dimensions (where a 0 means don't ship regardless of total score):
- Safety violations
- Factual errors in high-stakes domains
- PII exposure
- Legal/compliance violations
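The template above can be encoded as data, with blocking dimensions overriding the weighted total. A sketch; the dimension names and weights are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    weight: int              # 1-3, how much this matters
    blocking: bool = False   # a 0 here overrides all other scores

def rubric_verdict(dimensions: list[Dimension], scores: dict[str, int]) -> str:
    """Return 'blocked' if any blocking dimension scored 0, else the weighted 0-2 average."""
    for d in dimensions:
        if d.blocking and scores[d.name] == 0:
            return "blocked: " + d.name
    total_weight = sum(d.weight for d in dimensions)
    weighted = sum(d.weight * scores[d.name] for d in dimensions) / total_weight
    return f"weighted score: {weighted:.2f} / 2"

rubric = [
    Dimension("safety", weight=3, blocking=True),
    Dimension("helpfulness", weight=2),
    Dimension("tone", weight=1),
]
print(rubric_verdict(rubric, {"safety": 0, "helpfulness": 2, "tone": 2}))
# A safety 0 blocks the ship decision regardless of the other scores.
```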
Automated Eval
Not everything needs human judgment. Automate what you can; save humans for what you must.
| Layer | What It Catches | Tool Type | Human Needed? |
|---|---|---|---|
| Format | Length, structure, required fields | Code-based checks | No |
| Safety | PII, harmful content, refusal adherence | Classifier + rules | No (for clear violations) |
| Factual | Known-answer questions, verifiable claims | Code + reference data | For edge cases |
| Quality | Relevance, tone, helpfulness | LLM-as-judge | For calibration |
| Nuance | Surprising insight, cultural sensitivity, taste | Human review | Yes |
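The format layer is the cheapest to automate: plain code-based checks, no model involved. A sketch with illustrative limits and field names:

```python
def format_check(output: str, required_fields: list[str], max_chars: int = 2000) -> list[str]:
    """Return a list of format violations; an empty list means pass."""
    violations = []
    if len(output) > max_chars:
        violations.append(f"too long: {len(output)} > {max_chars} chars")
    for name in required_fields:
        if name not in output:
            violations.append(f"missing required field: {name}")
    return violations

issues = format_check(
    '{"summary": "Null check missing on line 42."}',
    required_fields=["summary", "severity"],
)
print(issues)  # one violation: the "severity" field is absent
```

Because these checks are deterministic and run in milliseconds, they belong in the blocking tier of CI, ahead of any model-based scoring.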
LLM-as-Judge
Using one model to evaluate another. Powerful but biased.
| Bias | Pattern | Mitigation |
|---|---|---|
| Verbosity | Longer outputs score higher | Control for length in rubric |
| Self-preference | Model prefers its own style | Use a different model family as judge |
| Position | First option scores higher | Randomize output order |
| Sycophancy | Polite outputs score higher | Include "correct but blunt" in golden set |
Validation cadence: Re-calibrate LLM-as-judge against human scores monthly. If agreement drops below 75%, retrain or replace.
CI/CD Integration
Evals should run on every change, like tests:
Code Change → Automated Eval Suite → Pass/Warn/Fail
                      │
              ┌───────┼───────┐
              ▼       ▼       ▼
           Format  Safety  Quality
           (fast)  (fast)  (slower)
| Tier | Speed | Runs When | Blocks Deploy? |
|---|---|---|---|
| Format + Safety | Seconds | Every commit | Yes |
| Quality (smoke) | Minutes | Every PR | Yes, if regression |
| Quality (full) | Hours | Nightly / pre-release | Advisory |
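A tiered gate can be sketched as plain functions wired into CI. This is illustrative: the tier names mirror the table above, and the check lambdas are stand-ins for real eval suites.

```python
from typing import Callable

# Each tier: (name, check, blocks_deploy). Checks return True on pass.
Tier = tuple[str, Callable[[], bool], bool]

def run_gates(tiers: list[Tier]) -> tuple[str, list[str]]:
    """Run tiers in order; return the overall verdict and per-tier results."""
    results, verdict = [], "pass"
    for name, check, blocks in tiers:
        ok = check()
        results.append(f"{name}: {'pass' if ok else 'FAIL'}")
        if not ok:
            verdict = "fail" if blocks else ("warn" if verdict == "pass" else verdict)
    return verdict, results

tiers = [
    ("format+safety", lambda: True, True),    # seconds, every commit, blocking
    ("quality-smoke", lambda: True, True),    # minutes, every PR, blocks on regression
    ("quality-full",  lambda: False, False),  # hours, nightly, advisory only
]
verdict, results = run_gates(tiers)
print(verdict)  # "warn": the advisory tier failed, but nothing blocking did
```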
The Eval Loop
Evaluation isn't a gate — it's a feedback loop:
DEFINE RUBRIC → BUILD DATASET → RUN EVAL → ANALYZE GAPS → IMPROVE → RE-EVAL
      ↑                                                                │
      └──────────── Update rubric as understanding deepens ────────────┘
| Stage | Key Question |
|---|---|
| Define | Are we measuring what users actually value? |
| Build | Does the dataset represent production traffic? |
| Run | Are scores reproducible and trustworthy? |
| Analyze | Where are failures concentrated? |
| Improve | What's the highest-leverage fix? |
| Re-eval | Did the improvement actually work? |
The New PRD
Traditional specs describe deterministic behavior: given input X, produce output Y. AI products produce distributions. The spec cannot define behavior — it can only define boundaries. Evals are the artifact that replaces the traditional spec for non-deterministic systems.
An eval has three components:
| Component | What It Is | What It Replaces |
|---|---|---|
| Data | A set of inputs representing real usage | Acceptance criteria examples |
| Task | The system under test, producing an output from each input | The feature being built |
| Scores | A quality rating (0-1) per output, per dimension | Manual QA pass/fail |
The composite score across all data items is the eval result. Track it over time and you have a quality trajectory — not a binary gate, but a signal that tells you whether the system is getting better or worse.
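A minimal sketch of the data/task/scores triple and the composite over it. Names are illustrative: `task` is whatever system is under test, and each scorer returns a 0-1 rating for one dimension.

```python
from statistics import mean
from typing import Callable

def run_eval(data: list[str],
             task: Callable[[str], str],
             scorers: dict[str, Callable[[str, str], float]]) -> dict[str, float]:
    """Score every (input, output) pair on every dimension.

    Returns per-dimension means plus an overall composite
    (here, the unweighted mean of the dimension means).
    """
    per_dim = {name: [] for name in scorers}
    for item in data:
        output = task(item)
        for name, score in scorers.items():
            per_dim[name].append(score(item, output))
    result = {name: mean(vals) for name, vals in per_dim.items()}
    result["composite"] = mean(result.values())
    return result

# Toy example: the task uppercases its input; one scorer checks correctness,
# one checks length.
scores = run_eval(
    data=["alpha", "beta"],
    task=lambda x: x.upper(),
    scorers={
        "correct": lambda inp, out: 1.0 if out == inp.upper() else 0.0,
        "concise": lambda inp, out: 1.0 if len(out) <= 10 else 0.0,
    },
)
print(scores["composite"])  # 1.0 on this toy data
```

Store the composite per run, and the quality trajectory described above is just this number plotted over time.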
Stories as Datasets
VV stories already contain the three components:
| Story Field | Eval Component | How It Maps |
|---|---|---|
| Scenario (the struggling moment) | Data — input | The situation that triggers the eval case |
| Outcomes (measurable before/after) | Scores — expected output | The threshold that defines pass/fail |
| Counterfeit (fake success) | Data — adversarial input | The case that should score low but might score high |
| Actions (what the system does) | Task — system under test | The capability being evaluated |
A counterfeit outcome is a golden dataset's most valuable item. It tests whether the system can distinguish real progress from the appearance of progress — the left tail where invisible failures hide.
Eval Targets
Every story should declare what "good" looks like before anyone writes code. Three fields turn a story from a wish into a testable claim:
| Field | What it answers | Example |
|---|---|---|
| Dimension | What kind of quality matters most here? | A search feature cares about correctness (right results). A payment flow cares about trust (safe to use). |
| Threshold | What number means "pass"? | "<2s p95 response time", ">95% accuracy", "zero data loss" |
| Baseline | Where are we starting from? | "~30s manual lookup across 3 tools", "no capability exists" |
The dimension comes from the CRAFT checklist: Correctness, Reliability, Alignment, Failsafe, or Trust. Pick the one dimension that would matter most if this story failed.
The threshold must be a number. "Fast" is not a target. "<2 seconds" is. If you cannot write a number, you do not understand the outcome well enough to build it.
The baseline is where you stand before the work starts. Without a baseline, you cannot prove the work made anything better. You are just shipping and hoping.
Why this matters for prioritization: Performance readiness carries the highest weight (0.25) in the readiness formula. If no stories have eval targets, Performance readiness cannot exceed 2 — there is nothing to measure against. A capability with no targets is a capability with no definition of done.
Worked Example
Story: "When managing 20+ contacts across 3 tools, each lookup takes 30 seconds. I need to find any contact with full deal context in one search. So I get contact found in <5s with linked deals. Not: contact found but deal history missing — looks fast, delivers incomplete."
| Field | Value |
|---|---|
| Dimension | correctness — the result must include linked deals, not just the contact name |
| Threshold | "<5s with linked deals" — time AND completeness |
| Baseline | "~30s across 3 tools, deals not linked" — the current pain |
This story is now testable. An engineer can build it and measure whether it passes. A scorer can run it and track whether quality improves over time. A commissioner can verify whether the deployed system meets the claim. The story stopped being a wish and became a contract.
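The story above can be expressed directly as a testable eval case. A sketch; the field names and the measurement hook are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class EvalTarget:
    dimension: str        # one CRAFT dimension
    threshold_s: float    # pass only if measured latency is below this
    require_deals: bool   # pass only if linked deals are present
    baseline_s: float     # where we started

target = EvalTarget(dimension="correctness", threshold_s=5.0,
                    require_deals=True, baseline_s=30.0)

def passes(measured_s: float, deals_linked: bool, t: EvalTarget) -> bool:
    """Time AND completeness — the counterfeit is fast-but-incomplete."""
    return measured_s < t.threshold_s and (deals_linked or not t.require_deals)

print(passes(3.2, True, target))    # True: fast and complete
print(passes(3.2, False, target))   # False: the counterfeit — looks fast, misses deals
```

Note that the counterfeit case is encoded as a conjunction: a latency-only threshold would score the fake success as a pass.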
Rubric Scaling
The CRAFT checklist (0-2 per dimension, 0-10 total) is a human-readable rubric. Machine-executable rubrics use finer scales:
| Scale | When to Use | Conversion |
|---|---|---|
| 0-2 (CRAFT) | Human review, quick assessment | CRAFT 0 = machine 1-2, CRAFT 1 = machine 3, CRAFT 2 = machine 4-5 |
| 1-5 (rubric dimension) | Automated scoring per dimension | Weight dimensions, aggregate to composite |
| 0-100 (composite) | Cross-eval comparison, dashboards | ((avg(weighted_scores) - 1) / 4) * 100 |
Thresholds convert continuous scores into decisions:
| Tier | Meaning | Action |
|---|---|---|
| Below minimum | Fails basic quality | Don't ship. Fix the worst dimension. |
| Minimum | Meets floor | Ship with monitoring. Improve weekly. |
| Production | Reliable quality | Ship confidently. Improve monthly. |
| World class | Exceeds expectations | Study what works. Raise the floor for others. |
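The 1-5 weighted composite, the 0-100 conversion, and the threshold tiers can be sketched together. The tier cutoffs below are assumptions for illustration, not values prescribed by the table.

```python
def composite_0_100(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Convert weighted 1-5 dimension scores to 0-100: ((weighted_avg - 1) / 4) * 100."""
    total_w = sum(weights.values())
    weighted_avg = sum(scores[d] * weights[d] for d in scores) / total_w
    return (weighted_avg - 1) / 4 * 100

def tier(composite: float, minimum: float = 50, production: float = 70,
         world_class: float = 90) -> str:
    """Map a 0-100 composite onto the decision tiers (cutoff values are assumptions)."""
    if composite < minimum:
        return "below minimum"
    if composite < production:
        return "minimum"
    if composite < world_class:
        return "production"
    return "world class"

c = composite_0_100({"correctness": 4, "reliability": 3},
                    {"correctness": 2, "reliability": 1})
print(round(c, 1), tier(c))
```

Here the weighted average is (4×2 + 3×1) / 3 ≈ 3.67 on the 1-5 scale, which converts to ≈66.7 on the 0-100 scale — above the assumed minimum, below production.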
Standards as Code
A human writes a quality standard. A rubric encodes it as weighted dimensions with score guides. A scoring engine executes it against data. Thresholds convert the result into a decision.
HUMAN STANDARD (what "good" means)
        ↓
RUBRIC (dimensions × weights × score guides)
        ↓
SCORING ENGINE (runs rubric against data items)
        ↓
THRESHOLDS (minimum / production / world class)
        ↓
DECISION (ship / improve / stop)
The standard is the contract between the team that defines quality and the system that measures it. When the standard changes, the rubric changes, the scores change, the decisions change. One source of truth, one direction of flow.
This is how commissioning connects to evaluation. The L0-L4 maturity model measures whether a capability was built. The eval measures whether it was built well. L3 (tested) proves the code runs. Eval scores prove the output is good. L4 (commissioned) requires both.
Common Failures
| Failure | Pattern | Fix |
|---|---|---|
| False confidence | High scores on easy dataset | Add hard cases, adversarial inputs |
| Stale evals | Production drifted, evals didn't | Refresh dataset quarterly |
| One rubric | Same eval for chat, code, search | Different rubrics per use case |
| No baseline | Can't tell if changes help or hurt | Establish baseline before iterating |
| Eval debt | "We'll add evals later" | Evals in the PRD, not afterthought |
| Score worship | Optimizing eval score, not user value | Validate scores against user satisfaction regularly |
Context
- Deterministic vs Probabilistic — AI products produce distributions, not outputs
- SMART-BF Checklist — The pattern CRAFT extends
- AI Product Principles — Define "good" before you measure it
- AI Observability — From eval scores to production insights
- VVFL Loop — Evaluation IS the feedback loop
- VV Stories — Stories are eval datasets waiting to be run
- Commissioning — L0-L4 proves it was built; evals prove it was built well
- Outer Loop Validation — Where the scoring engine lives
Questions
- What would change if your eval scores went up but user satisfaction went down?
- Which CRAFT dimension is blocking in your product — where a 0 means don't ship regardless of total score?
- If your golden dataset hasn't been refreshed in 90 days, what production reality are your evals missing?
- When is LLM-as-judge good enough — and when does the bias cost more than human review?
- How many of your VV stories have a threshold you could test against right now — and what does that number tell you about Performance readiness?
- What is the gap between your commissioning maturity (L0-L4) and your eval maturity — and which one is lying?