AI Evaluation
How do you know your AI product is getting better?
Not by gut feel. Not by cherry-picked examples. By systematic evaluation against a defined standard, run consistently, tracked over time. Evaluation is to AI products what prediction scoring is to forecasting — the discipline that separates signal from noise.
The CRAFT Checklist
Five dimensions, scored 0-2 each. Total: 0-10 points. Inspired by the SMART-BF pattern.
| Dimension | Question | 0 | 1 | 2 |
|---|---|---|---|---|
| Correctness | Is the output factually right? | Hallucinating, wrong answers | Mostly right, minor errors | Accurate, verifiable |
| Reliability | Same input, consistent quality? | Wild variance across runs | Manageable spread | Predictable range |
| Alignment | Does it match user intent? | Misses the job entirely | Partially useful | Nails the job |
| Failsafe | What happens when wrong? | Silent failure, user misled | User can detect the error | Graceful recovery, signals uncertainty |
| Trust | Safe to ship at this quality? | Harmful potential, brand risk | Needs supervision, caveats | Confident release |
Scoring Guide
| Score | Quality | Action |
|---|---|---|
| 9-10 | Excellent | Ship, celebrate, study what made it work |
| 7-8 | Good | Ship with monitoring, improve over time |
| 5-6 | Marginal | Improve before scaling, limit exposure |
| 3-4 | Poor | Don't ship — fix correctness or safety first |
| 0-2 | Failing | Fundamental rethink needed |
Worked Example
Product: Code review assistant
Input: Pull request diff with a subtle null pointer bug
| Dimension | Score | Reasoning |
|---|---|---|
| Correctness | 2 | Identified the null pointer, correct explanation |
| Reliability | 1 | Catches this class of bug ~70% of the time |
| Alignment | 2 | User wanted code review, got actionable feedback |
| Failsafe | 1 | Doesn't flag when it's uncertain about a finding |
| Trust | 1 | Good enough to assist, not replace human review |
Total: 7/10 — Good. Ship with monitoring. Priority fix: add confidence signaling (Failsafe).
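The scoring and banding above are mechanical enough to encode. A minimal Python sketch, assuming one 0-2 score per dimension; function and variable names are illustrative, not from any existing tooling:

```python
ACTION_BANDS = [  # (minimum total, quality, action) from the Scoring Guide
    (9, "Excellent", "Ship, celebrate, study what made it work"),
    (7, "Good", "Ship with monitoring, improve over time"),
    (5, "Marginal", "Improve before scaling, limit exposure"),
    (3, "Poor", "Don't ship until correctness or safety is fixed"),
    (0, "Failing", "Fundamental rethink needed"),
]

DIMENSIONS = {"correctness", "reliability", "alignment", "failsafe", "trust"}

def craft_score(scores: dict[str, int]) -> tuple[int, str, str]:
    """Sum five 0-2 dimension scores and map the total to an action band."""
    assert set(scores) == DIMENSIONS, "score all five dimensions"
    assert all(0 <= s <= 2 for s in scores.values()), "each dimension is 0-2"
    total = sum(scores.values())
    for floor, quality, action in ACTION_BANDS:
        if total >= floor:
            return total, quality, action

# The code review assistant above:
print(craft_score(
    {"correctness": 2, "reliability": 1, "alignment": 2, "failsafe": 1, "trust": 1}
))  # (7, 'Good', 'Ship with monitoring, improve over time')
```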
Golden Datasets
The foundation of evaluation. Garbage dataset, garbage scores.
Building
| Principle | Why | Anti-Pattern |
|---|---|---|
| Representative | Must reflect real traffic | Only testing "clean" inputs that flatter the AI |
| Diverse | Cover all input categories | Over-indexing on common cases |
| Tagged | Metadata per example (category, difficulty, risk) | Untagged blob of examples |
| Versioned | Track changes over time | Editing in place without history |
| Validated | Expected outputs reviewed by qualified humans | One person's opinion as ground truth |
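One way to make the Tagged, Versioned, and Validated principles concrete is a small per-example schema. A hypothetical sketch; the field names are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One golden dataset entry: input, reviewed expectation, and metadata."""
    example_id: str
    input_text: str
    expected_output: str          # reviewed by a qualified human, not one person's opinion
    category: str                 # e.g. "billing_question", "code_review"
    difficulty: str               # "easy" | "hard" | "adversarial"
    risk: str                     # "routine" | "high_stakes"
    source: str                   # "production_sample" | "synthetic" | "incident"
    dataset_version: str = "v1"   # bump whenever golden answers change
    tags: list[str] = field(default_factory=list)
```

Storing entries like this in a versioned file (JSONL in git works) makes "editing in place without history" impossible by construction.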
Size Guide
| Purpose | Minimum Examples | Why |
|---|---|---|
| Smoke test | 20-50 | Quick sanity check on every change |
| Regression | 100-300 | Enough statistical power to tell real regressions from noise |
| Comprehensive | 500+ | Full coverage of input universe |
| Safety | As many adversarial inputs as possible | The tail matters more than the mean |
Maintenance
Golden datasets rot. Production traffic drifts. Refresh quarterly at minimum.
| Signal | Action |
|---|---|
| Production inputs look different from dataset | Add new examples |
| "Expected outputs" are now wrong (world changed) | Update golden answers |
| New failure pattern discovered | Add to adversarial set |
| New user persona emerges | Add representative inputs |
| Eval scores rising but users unhappy | Dataset doesn't reflect reality — rebuild |
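The first signal in the table can be checked mechanically rather than by eye. A rough sketch, assuming every golden example and every sampled production request carries a category tag; the 10% threshold is an arbitrary illustration, not a statistical test:

```python
from collections import Counter

def category_drift(dataset_categories: list[str],
                   production_categories: list[str],
                   threshold: float = 0.10) -> dict[str, float]:
    """Return categories whose share of production traffic differs from their
    share of the golden dataset by more than `threshold`."""
    def shares(items: list[str]) -> dict[str, float]:
        counts = Counter(items)
        total = sum(counts.values())
        return {cat: n / total for cat, n in counts.items()}

    ds, prod = shares(dataset_categories), shares(production_categories)
    # A non-empty result means: add new examples or rebuild the dataset.
    return {
        cat: round(abs(ds.get(cat, 0.0) - prod.get(cat, 0.0)), 3)
        for cat in set(ds) | set(prod)
        if abs(ds.get(cat, 0.0) - prod.get(cat, 0.0)) > threshold
    }
```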
Eval Rubrics
Rubrics make scoring reproducible. Without them, "4 out of 5" means different things to different people.
Calibration Test
Before using a rubric at scale, run an inter-rater agreement test:
- Three people score the same 20 outputs independently
- Compare scores
- Where they disagree, discuss and refine the rubric
- Repeat until agreement exceeds 80%
If you can't get 80% agreement, the rubric is too subjective. Split the dimension or add examples.
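"Agreement exceeds 80%" deserves a precise definition. A minimal sketch using raw pairwise agreement; a chance-corrected statistic such as Cohen's or Fleiss' kappa is stricter and worth switching to once the rubric stabilizes:

```python
from itertools import combinations

def pairwise_agreement(ratings: list[list[int]]) -> float:
    """Fraction of (rater pair, output) comparisons with identical scores.

    `ratings` holds one list of scores per rater, covering the same outputs
    in the same order."""
    matches = comparisons = 0
    for rater_a, rater_b in combinations(ratings, 2):
        for a, b in zip(rater_a, rater_b):
            comparisons += 1
            matches += (a == b)
    return matches / comparisons

# Three raters, the same 20 outputs, scored 0-2:
# pairwise_agreement([scores_a, scores_b, scores_c]) >= 0.80 before scaling up.
```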
Rubric Template
For each quality dimension:
DIMENSION: [Name]
WEIGHT: [1-3, how much this matters]
Score 2 (Excellent): [Specific description + example]
Score 1 (Acceptable): [Specific description + example]
Score 0 (Unacceptable): [Specific description + example]
BLOCKING: [Yes/No — does a 0 here override all other scores?]
Blocking dimensions (where a 0 means don't ship regardless of total score):
- Safety violations
- Factual errors in high-stakes domains
- PII exposure
- Legal/compliance violations
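The template plus the blocking rule maps directly onto a small data structure. A sketch under the assumptions above; names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str
    weight: int             # 1-3, how much this dimension matters
    blocking: bool = False  # a 0 here overrides every other score

def apply_rubric(dimensions: list[RubricDimension], scores: dict[str, int]) -> dict:
    """Weighted total (each dimension scored 0-2), with blocking dimensions
    able to veto the release regardless of the total."""
    for dim in dimensions:
        if dim.blocking and scores[dim.name] == 0:
            return {"ship": False, "reason": f"blocking failure on {dim.name}"}
    total = sum(dim.weight * scores[dim.name] for dim in dimensions)
    maximum = sum(dim.weight * 2 for dim in dimensions)
    return {"ship": True, "total": total, "max": maximum}
```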
Automated Eval
Not everything needs human judgment. Automate what you can, save humans for what you must.
| Layer | What It Catches | Tool Type | Human Needed? |
|---|---|---|---|
| Format | Length, structure, required fields | Code-based checks | No |
| Safety | PII, harmful content, refusal adherence | Classifier + rules | No (for clear violations) |
| Factual | Known-answer questions, verifiable claims | Code + reference data | For edge cases |
| Quality | Relevance, tone, helpfulness | LLM-as-judge | For calibration |
| Nuance | Surprising insight, cultural sensitivity, taste | Human review | Yes |
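The Format row is the cheapest layer to automate and a good first eval to wire into CI. A sketch of code-based checks, assuming the product emits JSON with a few required fields; the field names and limits are invented for illustration:

```python
import json

MAX_CHARS = 2000                                          # illustrative limit
REQUIRED_FIELDS = {"summary", "findings", "confidence"}   # illustrative schema

def format_check(raw_output: str) -> list[str]:
    """Layer-one checks: deterministic, fast, no model calls.
    Returns a list of violations; an empty list means pass."""
    violations = []
    if len(raw_output) > MAX_CHARS:
        violations.append(f"too long: {len(raw_output)} > {MAX_CHARS} chars")
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return violations + ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return violations + ["output JSON is not an object"]
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    return violations
```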
LLM-as-Judge
Using one model to evaluate another. Powerful but biased.
| Bias | Pattern | Mitigation |
|---|---|---|
| Verbosity | Longer outputs score higher | Control for length in rubric |
| Self-preference | Model prefers its own style | Use a different model family as judge |
| Position | First option scores higher | Randomize output order |
| Sycophancy | Polite outputs score higher | Include "correct but blunt" in golden set |
Validation cadence: Re-calibrate the LLM judge against human scores monthly. If agreement drops below 75%, retrain or replace the judge.
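Two of the mitigations are purely mechanical. A sketch of position-bias control for pairwise judging plus the monthly judge-vs-human check; `call_judge` is a placeholder for your own wrapper around a judge model from a different family, not a real API:

```python
import random

def judge_pair(call_judge, prompt: str, output_a: str, output_b: str) -> str:
    """Present the two outputs in random order so position bias cancels out
    across a large eval run. `call_judge(prompt, first, second)` is assumed
    to return "first" or "second"."""
    flipped = random.random() < 0.5
    first, second = (output_b, output_a) if flipped else (output_a, output_b)
    winner_is_first = call_judge(prompt, first, second) == "first"
    if flipped:
        return "b" if winner_is_first else "a"
    return "a" if winner_is_first else "b"

def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Monthly calibration: exact-match agreement on a shared sample.
    Below 0.75, retrain or replace the judge."""
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(human_scores)
```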
CI/CD Integration
Evals should run on every change, like tests:
```
Code Change → Automated Eval Suite → Pass/Warn/Fail
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
        Format       Safety      Quality
        (fast)       (fast)      (slower)
```
| Tier | Speed | Runs When | Blocks Deploy? |
|---|---|---|---|
| Format + Safety | Seconds | Every commit | Yes |
| Quality (smoke) | Minutes | Every PR | Yes, if regression |
| Quality (full) | Hours | Nightly / pre-release | Advisory |
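Encoding the table as a deploy gate keeps the decision consistent on every commit. A sketch, assuming the eval suite emits a results dict with these (invented) keys; the regression tolerance is an arbitrary example:

```python
def deploy_gate(results: dict, tolerance: float = 0.02) -> str:
    """Map eval-suite results onto the tiers above: fast checks block every
    commit, a smoke-eval regression blocks the PR, small dips only warn.

    Assumed keys: format_failures, safety_failures (ints),
    smoke_score, baseline_smoke_score (floats in [0, 1])."""
    if results["format_failures"] or results["safety_failures"]:
        return "fail"
    regression = results["baseline_smoke_score"] - results["smoke_score"]
    if regression > tolerance:
        return "fail"   # quality regression beyond tolerance blocks the deploy
    if regression > 0:
        return "warn"   # small dip: ship, but flag it for the nightly full eval
    return "pass"
```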
The Eval Loop
Evaluation isn't a gate — it's a feedback loop:
```
DEFINE RUBRIC → BUILD DATASET → RUN EVAL → ANALYZE GAPS → IMPROVE → RE-EVAL
      ↑                                                                │
      └──────────── Update rubric as understanding deepens ────────────┘
```
| Stage | Key Question |
|---|---|
| Define | Are we measuring what users actually value? |
| Build | Does the dataset represent production traffic? |
| Run | Are scores reproducible and trustworthy? |
| Analyze | Where are failures concentrated? |
| Improve | What's the highest-leverage fix? |
| Re-eval | Did the improvement actually work? |
Common Failures
| Failure | Pattern | Fix |
|---|---|---|
| False confidence | High scores on easy dataset | Add hard cases, adversarial inputs |
| Stale evals | Production drifted, evals didn't | Refresh dataset quarterly |
| One rubric | Same eval for chat, code, search | Different rubrics per use case |
| No baseline | Can't tell if changes help or hurt | Establish baseline before iterating |
| Eval debt | "We'll add evals later" | Evals in the PRD, not an afterthought |
| Score worship | Optimizing eval score, not user value | Validate scores against user satisfaction regularly |
Context
- SMART-BF Checklist — The pattern CRAFT extends
- AI Product Principles — Define "good" before you measure it
- AI Requirements — Eval strategy starts in the PRD
- AI Observability — From eval scores to production insights
- VVFL Loop — Evaluation IS the feedback loop