AI Evaluation
How do you know your AI product is getting better?
Not by gut feel. Not by cherry-picked examples. By systematic evaluation against a defined standard, run consistently, tracked over time. Evaluation is to AI products what prediction scoring is to forecasting — the discipline that separates signal from noise.
The CRAFT Checklist
Five dimensions, scored 0-2 each. Total: 0-10 points. Inspired by the SMART-BF pattern.
| Dimension | Question | 0 | 1 | 2 |
|---|---|---|---|---|
| Correctness | Is the output factually right? | Hallucinating, wrong answers | Mostly right, minor errors | Accurate, verifiable |
| Reliability | Same input, consistent quality? | Wild variance across runs | Manageable spread | Predictable range |
| Alignment | Does it match user intent? | Misses the job entirely | Partially useful | Nails the job |
| Failsafe | What happens when wrong? | Silent failure, user misled | User can detect the error | Graceful recovery, signals uncertainty |
| Trust | Safe to ship at this quality? | Harmful potential, brand risk | Needs supervision, caveats | Confident release |
Scoring Guide
| Score | Quality | Action |
|---|---|---|
| 9-10 | Excellent | Ship, celebrate, study what made it work |
| 7-8 | Good | Ship with monitoring, improve over time |
| 5-6 | Marginal | Improve before scaling, limit exposure |
| 3-4 | Poor | Don't ship — fix correctness or safety first |
| 0-2 | Failing | Fundamental rethink needed |
Worked Example
Product: Code review assistant
Input: Pull request diff with a subtle null pointer bug
| Dimension | Score | Reasoning |
|---|---|---|
| Correctness | 2 | Identified the null pointer, correct explanation |
| Reliability | 1 | Catches this class of bug ~70% of the time |
| Alignment | 2 | User wanted code review, got actionable feedback |
| Failsafe | 1 | Doesn't flag when it's uncertain about a finding |
| Trust | 1 | Good enough to assist, not replace human review |
Total: 7/10 — Good. Ship with monitoring. Priority fix: add confidence signaling (Failsafe).
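The scoring and banding above are mechanical enough to encode. A minimal Python sketch, assuming one 0-2 score per dimension; function and variable names are illustrative, not from any existing tooling:

```python
ACTION_BANDS = [  # (minimum total, quality, action) from the Scoring Guide
    (9, "Excellent", "Ship, celebrate, study what made it work"),
    (7, "Good", "Ship with monitoring, improve over time"),
    (5, "Marginal", "Improve before scaling, limit exposure"),
    (3, "Poor", "Don't ship until correctness or safety is fixed"),
    (0, "Failing", "Fundamental rethink needed"),
]

DIMENSIONS = {"correctness", "reliability", "alignment", "failsafe", "trust"}

def craft_score(scores: dict[str, int]) -> tuple[int, str, str]:
    """Sum five 0-2 dimension scores and map the total to an action band."""
    assert set(scores) == DIMENSIONS, "score all five dimensions"
    assert all(0 <= s <= 2 for s in scores.values()), "each dimension is 0-2"
    total = sum(scores.values())
    for floor, quality, action in ACTION_BANDS:
        if total >= floor:
            return total, quality, action

# The code review assistant above:
print(craft_score(
    {"correctness": 2, "reliability": 1, "alignment": 2, "failsafe": 1, "trust": 1}
))  # (7, 'Good', 'Ship with monitoring, improve over time')
```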
Golden Datasets
The foundation of evaluation. Garbage dataset, garbage scores.
Building
| Principle | Why | Anti-Pattern |
|---|---|---|
| Representative | Must reflect real traffic | Only testing "clean" inputs that flatter the AI |
| Diverse | Cover all input categories | Over-indexing on common cases |
| Tagged | Metadata per example (category, difficulty, risk) | Untagged blob of examples |
| Versioned | Track changes over time | Editing in place without history |
| Validated | Expected outputs reviewed by qualified humans | One person's opinion as ground truth |
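One way to make the Tagged, Versioned, and Validated principles concrete is a small per-example schema. A hypothetical sketch; the field names are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One golden dataset entry: input, reviewed expectation, and metadata."""
    example_id: str
    input_text: str
    expected_output: str          # reviewed by a qualified human, not one person's opinion
    category: str                 # e.g. "billing_question", "code_review"
    difficulty: str               # "easy" | "hard" | "adversarial"
    risk: str                     # "routine" | "high_stakes"
    source: str                   # "production_sample" | "synthetic" | "incident"
    dataset_version: str = "v1"   # bump whenever golden answers change
    tags: list[str] = field(default_factory=list)
```

Storing entries like this in a versioned file (JSONL in git works) makes "editing in place without history" impossible by construction.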
Size Guide
| Purpose | Minimum Examples | Why |
|---|---|---|
| Smoke test | 20-50 | Quick sanity check on every change |
| Regression | 100-300 | Enough statistical power to tell real regressions from noise |
| Comprehensive | 500+ | Full coverage of input universe |
| Safety | As many adversarial inputs as possible | The tail matters more than the mean |
Maintenance
Golden datasets rot. Production traffic drifts. Refresh quarterly at minimum.
| Signal | Action |
|---|---|
| Production inputs look different from dataset | Add new examples |
| "Expected outputs" are now wrong (world changed) | Update golden answers |
| New failure pattern discovered | Add to adversarial set |
| New user persona emerges | Add representative inputs |
| Eval scores rising but users unhappy | Dataset doesn't reflect reality — rebuild |
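The first signal in the table can be checked mechanically rather than by eye. A rough sketch, assuming every golden example and every sampled production request carries a category tag; the 10% threshold is an arbitrary illustration, not a statistical test:

```python
from collections import Counter

def category_drift(dataset_categories: list[str],
                   production_categories: list[str],
                   threshold: float = 0.10) -> dict[str, float]:
    """Return categories whose share of production traffic differs from their
    share of the golden dataset by more than `threshold`."""
    def shares(items: list[str]) -> dict[str, float]:
        counts = Counter(items)
        total = sum(counts.values())
        return {cat: n / total for cat, n in counts.items()}

    ds, prod = shares(dataset_categories), shares(production_categories)
    # A non-empty result means: add new examples or rebuild the dataset.
    return {
        cat: round(abs(ds.get(cat, 0.0) - prod.get(cat, 0.0)), 3)
        for cat in set(ds) | set(prod)
        if abs(ds.get(cat, 0.0) - prod.get(cat, 0.0)) > threshold
    }
```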
Eval Rubrics
Rubrics make scoring reproducible. Without them, "4 out of 5" means different things to different people.
Calibration Test
Before using a rubric at scale, run an inter-rater agreement test:
- Three people score the same 20 outputs independently
- Compare scores
- Where they disagree, discuss and refine the rubric
- Repeat until agreement exceeds 80%
If you can't get 80% agreement, the rubric is too subjective. Split the dimension or add examples.
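"Agreement exceeds 80%" deserves a precise definition. A minimal sketch using raw pairwise agreement; a chance-corrected statistic such as Cohen's or Fleiss' kappa is stricter and worth switching to once the rubric stabilizes:

```python
from itertools import combinations

def pairwise_agreement(ratings: list[list[int]]) -> float:
    """Fraction of (rater pair, output) comparisons with identical scores.

    `ratings` holds one list of scores per rater, covering the same outputs
    in the same order."""
    matches = comparisons = 0
    for rater_a, rater_b in combinations(ratings, 2):
        for a, b in zip(rater_a, rater_b):
            comparisons += 1
            matches += (a == b)
    return matches / comparisons

# Three raters, the same 20 outputs, scored 0-2:
# pairwise_agreement([scores_a, scores_b, scores_c]) >= 0.80 before scaling up.
```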
Rubric Template
For each quality dimension:
DIMENSION: [Name]
WEIGHT: [1-3, how much this matters]
Score 2 (Excellent): [Specific description + example]
Score 1 (Acceptable): [Specific description + example]
Score 0 (Unacceptable): [Specific description + example]
BLOCKING: [Yes/No — does a 0 here override all other scores?]
Blocking dimensions (where a 0 means don't ship regardless of total score):
- Safety violations
- Factual errors in high-stakes domains
- PII exposure
- Legal/compliance violations
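The template plus the blocking rule maps directly onto a small data structure. A sketch under the assumptions above; names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str
    weight: int             # 1-3, how much this dimension matters
    blocking: bool = False  # a 0 here overrides every other score

def apply_rubric(dimensions: list[RubricDimension], scores: dict[str, int]) -> dict:
    """Weighted total (each dimension scored 0-2), with blocking dimensions
    able to veto the release regardless of the total."""
    for dim in dimensions:
        if dim.blocking and scores[dim.name] == 0:
            return {"ship": False, "reason": f"blocking failure on {dim.name}"}
    total = sum(dim.weight * scores[dim.name] for dim in dimensions)
    maximum = sum(dim.weight * 2 for dim in dimensions)
    return {"ship": True, "total": total, "max": maximum}
```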
Automated Eval
Not everything needs human judgment. Automate what you can, save humans for what you must.
| Layer | What It Catches | Tool Type | Human Needed? |
|---|---|---|---|
| Format | Length, structure, required fields | Code-based checks | No |
| Safety | PII, harmful content, refusal adherence | Classifier + rules | No (for clear violations) |
| Factual | Known-answer questions, verifiable claims | Code + reference data | For edge cases |
| Quality | Relevance, tone, helpfulness | LLM-as-judge | For calibration |
| Nuance | Surprising insight, cultural sensitivity, taste | Human review | Yes |
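The Format row is the cheapest layer to automate and a good first eval to wire into CI. A sketch of code-based checks, assuming the product emits JSON with a few required fields; the field names and limits are invented for illustration:

```python
import json

MAX_CHARS = 2000                                          # illustrative limit
REQUIRED_FIELDS = {"summary", "findings", "confidence"}   # illustrative schema

def format_check(raw_output: str) -> list[str]:
    """Layer-one checks: deterministic, fast, no model calls.
    Returns a list of violations; an empty list means pass."""
    violations = []
    if len(raw_output) > MAX_CHARS:
        violations.append(f"too long: {len(raw_output)} > {MAX_CHARS} chars")
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return violations + ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return violations + ["output JSON is not an object"]
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    return violations
```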
LLM-as-Judge
Using one model to evaluate another. Powerful but biased.
| Bias | Pattern | Mitigation |
|---|---|---|
| Verbosity | Longer outputs score higher | Control for length in rubric |
| Self-preference | Model prefers its own style | Use a different model family as judge |
| Position | First option scores higher | Randomize output order |
| Sycophancy | Polite outputs score higher | Include "correct but blunt" in golden set |
Validation cadence: Re-calibrate the LLM judge against human scores monthly. If agreement drops below 75%, retrain or replace the judge.
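Two of the mitigations are purely mechanical. A sketch of position-bias control for pairwise judging plus the monthly judge-vs-human check; `call_judge` is a placeholder for your own wrapper around a judge model from a different family, not a real API:

```python
import random

def judge_pair(call_judge, prompt: str, output_a: str, output_b: str) -> str:
    """Present the two outputs in random order so position bias cancels out
    across a large eval run. `call_judge(prompt, first, second)` is assumed
    to return "first" or "second"."""
    flipped = random.random() < 0.5
    first, second = (output_b, output_a) if flipped else (output_a, output_b)
    winner_is_first = call_judge(prompt, first, second) == "first"
    if flipped:
        return "b" if winner_is_first else "a"
    return "a" if winner_is_first else "b"

def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Monthly calibration: exact-match agreement on a shared sample.
    Below 0.75, retrain or replace the judge."""
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(human_scores)
```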
CI/CD Integration
Evals should run on every change, like tests:
```
Code Change → Automated Eval Suite → Pass/Warn/Fail
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
        Format       Safety      Quality
        (fast)       (fast)      (slower)
```
| Tier | Speed | Runs When | Blocks Deploy? |
|---|---|---|---|
| Format + Safety | Seconds | Every commit | Yes |
| Quality (smoke) | Minutes | Every PR | Yes, if regression |
| Quality (full) | Hours | Nightly / pre-release | Advisory |
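Encoding the table as a deploy gate keeps the decision consistent on every commit. A sketch, assuming the eval suite emits a results dict with these (invented) keys; the regression tolerance is an arbitrary example:

```python
def deploy_gate(results: dict, tolerance: float = 0.02) -> str:
    """Map eval-suite results onto the tiers above: fast checks block every
    commit, a smoke-eval regression blocks the PR, small dips only warn.

    Assumed keys: format_failures, safety_failures (ints),
    smoke_score, baseline_smoke_score (floats in [0, 1])."""
    if results["format_failures"] or results["safety_failures"]:
        return "fail"
    regression = results["baseline_smoke_score"] - results["smoke_score"]
    if regression > tolerance:
        return "fail"   # quality regression beyond tolerance blocks the deploy
    if regression > 0:
        return "warn"   # small dip: ship, but flag it for the nightly full eval
    return "pass"
```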
The Eval Loop
Evaluation isn't a gate — it's a feedback loop:
```
DEFINE RUBRIC → BUILD DATASET → RUN EVAL → ANALYZE GAPS → IMPROVE → RE-EVAL
      ↑                                                                │
      └──────────── Update rubric as understanding deepens ────────────┘
```
| Stage | Key Question |
|---|---|
| Define | Are we measuring what users actually value? |
| Build | Does the dataset represent production traffic? |
| Run | Are scores reproducible and trustworthy? |
| Analyze | Where are failures concentrated? |
| Improve | What's the highest-leverage fix? |
| Re-eval | Did the improvement actually work? |
Common Failures
| Failure | Pattern | Fix |
|---|---|---|
| False confidence | High scores on easy dataset | Add hard cases, adversarial inputs |
| Stale evals | Production drifted, evals didn't | Refresh dataset quarterly |
| One rubric | Same eval for chat, code, search | Different rubrics per use case |
| No baseline | Can't tell if changes help or hurt | Establish baseline before iterating |
| Eval debt | "We'll add evals later" | Evals in the PRD, not an afterthought |
| Score worship | Optimizing eval score, not user value | Validate scores against user satisfaction regularly |
Context
- SMART-BF Checklist — The pattern CRAFT extends
- AI Product Principles — Define "good" before you measure it
- AI Requirements — Eval strategy starts in the PRD
- AI Observability — From eval scores to production insights
- VVFL Loop — Evaluation IS the feedback loop