
AI Evaluation

How do you know your AI product is getting better?

Not by gut feel. Not by cherry-picked examples. By systematic evaluation against a defined standard, run consistently, tracked over time. Evaluation is to AI products what prediction scoring is to forecasting — the discipline that separates signal from noise.

The CRAFT Checklist

Five dimensions, scored 0-2 each. Total: 0-10 points. Inspired by the SMART-BF pattern.

| Dimension | Question | 0 | 1 | 2 |
| --- | --- | --- | --- | --- |
| Correctness | Is the output factually right? | Hallucinating, wrong answers | Mostly right, minor errors | Accurate, verifiable |
| Reliability | Same input, consistent quality? | Wild variance across runs | Manageable spread | Predictable range |
| Alignment | Does it match user intent? | Misses the job entirely | Partially useful | Nails the job |
| Failsafe | What happens when wrong? | Silent failure, user misled | User can detect the error | Graceful recovery, signals uncertainty |
| Trust | Safe to ship at this quality? | Harmful potential, brand risk | Needs supervision, caveats | Confident release |

Scoring Guide

| Score | Quality | Action |
| --- | --- | --- |
| 9-10 | Excellent | Ship, celebrate, study what made it work |
| 7-8 | Good | Ship with monitoring, improve over time |
| 5-6 | Marginal | Improve before scaling, limit exposure |
| 3-4 | Poor | Don't ship — fix correctness or safety first |
| 0-2 | Failing | Fundamental rethink needed |
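
To make the checklist mechanical, here is a minimal sketch in Python, assuming you record each dimension as a 0-2 integer. The class name, field names, and the mapping to the actions above are illustrative, not a prescribed implementation.

```python
from dataclasses import dataclass

# Sketch of CRAFT scoring: five dimensions, each 0-2, summed to a 0-10 total
# and mapped to the scoring-guide actions above. Illustrative only.
DIMENSIONS = ("correctness", "reliability", "alignment", "failsafe", "trust")

@dataclass
class CraftScore:
    correctness: int
    reliability: int
    alignment: int
    failsafe: int
    trust: int

    def __post_init__(self) -> None:
        for name in DIMENSIONS:
            value = getattr(self, name)
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value}")

    @property
    def total(self) -> int:
        return sum(getattr(self, name) for name in DIMENSIONS)

    @property
    def action(self) -> str:
        total = self.total
        if total >= 9:
            return "Ship, celebrate, study what made it work"
        if total >= 7:
            return "Ship with monitoring, improve over time"
        if total >= 5:
            return "Improve before scaling, limit exposure"
        if total >= 3:
            return "Don't ship: fix correctness or safety first"
        return "Fundamental rethink needed"
```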

Worked Example

Product: Code review assistant
Input: Pull request diff with a subtle null pointer bug

| Dimension | Score | Reasoning |
| --- | --- | --- |
| Correctness | 2 | Identified the null pointer, correct explanation |
| Reliability | 1 | Catches this class of bug ~70% of the time |
| Alignment | 2 | User wanted code review, got actionable feedback |
| Failsafe | 1 | Doesn't flag when it's uncertain about a finding |
| Trust | 1 | Good enough to assist, not replace human review |

Total: 7/10 — Good. Ship with monitoring. Priority fix: add confidence signaling (Failsafe).
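
Run through the sketch above, the worked example lands where the scoring guide says it should:

```python
# The code review assistant from the worked example, scored with the sketch.
review_assistant = CraftScore(
    correctness=2, reliability=1, alignment=2, failsafe=1, trust=1
)
print(review_assistant.total)   # 7
print(review_assistant.action)  # Ship with monitoring, improve over time
```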

Golden Datasets

The foundation of evaluation. Garbage dataset, garbage scores.

Building

| Principle | Why | Anti-Pattern |
| --- | --- | --- |
| Representative | Must reflect real traffic | Only testing "clean" inputs that flatter the AI |
| Diverse | Cover all input categories | Over-indexing on common cases |
| Tagged | Metadata per example (category, difficulty, risk) | Untagged blob of examples |
| Versioned | Track changes over time | Editing in place without history |
| Validated | Expected outputs reviewed by qualified humans | One person's opinion as ground truth |
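
One way to make tagged, versioned, validated concrete is a per-example record along these lines, sketched in Python. The field names (category, difficulty, risk, version, reviewers) are assumptions, not a required schema.

```python
from dataclasses import dataclass, field

# Illustrative schema for one golden-dataset example. Field names follow the
# principles above; they are assumptions, not a required format.
@dataclass
class GoldenExample:
    example_id: str
    input_text: str
    expected_output: str              # reviewed by a qualified human
    category: str                     # e.g. "billing question", "code review"
    difficulty: str                   # e.g. "easy", "hard", "adversarial"
    risk: str                         # e.g. "low", "high-stakes"
    version: int = 1                  # bump when the golden answer changes
    reviewers: list[str] = field(default_factory=list)
```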

Size Guide

| Purpose | Minimum Examples | Why |
| --- | --- | --- |
| Smoke test | 20-50 | Quick sanity check on every change |
| Regression | 100-300 | Statistical significance for comparison |
| Comprehensive | 500+ | Full coverage of input universe |
| Safety | As many adversarial inputs as possible | The tail matters more than the mean |

Maintenance

Golden datasets rot. Production traffic drifts. Refresh quarterly at minimum.

| Signal | Action |
| --- | --- |
| Production inputs look different from dataset | Add new examples |
| "Expected outputs" are now wrong (world changed) | Update golden answers |
| New failure pattern discovered | Add to adversarial set |
| New user persona emerges | Add representative inputs |
| Eval scores rising but users unhappy | Dataset doesn't reflect reality — rebuild |
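
A rough way to catch the first signal (production inputs drifting away from the dataset) is to compare category mixes, sketched below in Python. The example categories and the 10-point threshold are arbitrary placeholders.

```python
from collections import Counter

# Compare the category mix of recent production inputs against the golden
# dataset and flag categories whose share differs by more than 10 points.
def category_drift(production: list[str], dataset: list[str]) -> dict[str, float]:
    prod, gold = Counter(production), Counter(dataset)
    prod_total, gold_total = sum(prod.values()), sum(gold.values())
    return {
        category: abs(prod[category] / prod_total - gold[category] / gold_total)
        for category in set(prod) | set(gold)
    }

production_categories = ["billing", "billing", "refund", "refund", "fraud"]
dataset_categories = ["billing", "billing", "billing", "refund"]
drift = category_drift(production_categories, dataset_categories)
stale = {c for c, gap in drift.items() if gap > 0.10}
if stale:
    print("Add examples for:", sorted(stale))  # ['billing', 'fraud', 'refund']
```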

Eval Rubrics

Rubrics make scoring reproducible. Without them, "4 out of 5" means different things to different people.

Calibration Test

Before using a rubric at scale, run the inter-rater test:

  1. Three people score the same 20 outputs independently
  2. Compare scores
  3. Where they disagree, discuss and refine the rubric
  4. Repeat until agreement exceeds 80%

If you can't get 80% agreement, the rubric is too subjective. Split the dimension or add examples.
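
A sketch of the agreement check in Python, assuming simple percent agreement (not a chance-corrected statistic such as Cohen's kappa):

```python
from itertools import combinations

# Fraction of items on which every pair of raters gave exactly the same score.
def percent_agreement(scores_by_rater: list[list[int]]) -> float:
    n_items = len(scores_by_rater[0])
    agreed = 0
    for i in range(n_items):
        item_scores = [rater[i] for rater in scores_by_rater]
        if all(a == b for a, b in combinations(item_scores, 2)):
            agreed += 1
    return agreed / n_items

rater_a = [2, 1, 0, 2, 1]
rater_b = [2, 1, 1, 2, 1]
rater_c = [2, 1, 0, 2, 0]
agreement = percent_agreement([rater_a, rater_b, rater_c])
print(f"{agreement:.0%}")  # 60% here: discuss the disagreements, refine, repeat
```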

Rubric Template

For each quality dimension:

DIMENSION: [Name]
WEIGHT: [1-3, how much this matters]

Score 2 (Excellent): [Specific description + example]
Score 1 (Acceptable): [Specific description + example]
Score 0 (Unacceptable): [Specific description + example]

BLOCKING: [Yes/No — does a 0 here override all other scores?]

Blocking dimensions (where a 0 means don't ship regardless of total score):

  • Safety violations
  • Factual errors in high-stakes domains
  • PII exposure
  • Legal/compliance violations
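
The template translates naturally into data. Below is a Python sketch with weights and a blocking flag that overrides the total when a blocking dimension scores 0; the dimension names and weights are examples only.

```python
from dataclasses import dataclass

# Rubric-as-data sketch: each dimension carries a weight (1-3) and a blocking
# flag. A 0 on a blocking dimension overrides the weighted total.
@dataclass
class RubricDimension:
    name: str
    weight: int              # 1-3, how much this matters
    blocking: bool = False

def overall(scores: dict[str, int], rubric: list[RubricDimension]) -> tuple[float, bool]:
    """Return (weighted score on the 0-2 scale, shippable?)."""
    blocked = any(d.blocking and scores[d.name] == 0 for d in rubric)
    total_weight = sum(d.weight for d in rubric)
    weighted = sum(scores[d.name] * d.weight for d in rubric) / total_weight
    return weighted, not blocked

rubric = [
    RubricDimension("safety", weight=3, blocking=True),
    RubricDimension("factual accuracy", weight=3, blocking=True),
    RubricDimension("helpfulness", weight=2),
    RubricDimension("tone", weight=1),
]
score, shippable = overall(
    {"safety": 2, "factual accuracy": 0, "helpfulness": 2, "tone": 2}, rubric
)
print(round(score, 2), shippable)  # 1.33 False: decent total, but blocked
```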

Automated Eval

Not everything needs human judgment. Automate what you can; save human review for what you can't.

| Layer | What It Catches | Tool Type | Human Needed? |
| --- | --- | --- | --- |
| Format | Length, structure, required fields | Code-based checks | No |
| Safety | PII, harmful content, refusal adherence | Classifier + rules | No (for clear violations) |
| Factual | Known-answer questions, verifiable claims | Code + reference data | For edge cases |
| Quality | Relevance, tone, helpfulness | LLM-as-judge | For calibration |
| Nuance | Surprising insight, cultural sensitivity, taste | Human review | Yes |
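
The two cheapest layers reduce to code-based checks like the Python sketch below. The length limit, regexes, and required-field logic are illustrative placeholders, not a complete format or safety filter.

```python
import re

# Minimal format and safety checks that need no human judgment.
MAX_CHARS = 4000
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def format_check(output: str, required_fields: list[str]) -> list[str]:
    failures = []
    if len(output) > MAX_CHARS:
        failures.append("too long")
    for required in required_fields:
        if required not in output:
            failures.append(f"missing field: {required}")
    return failures

def safety_check(output: str) -> list[str]:
    failures = []
    if EMAIL_RE.search(output) or SSN_RE.search(output):
        failures.append("possible PII exposure")
    return failures

print(format_check("Summary: looks fine", ["Summary:"]))  # []
print(safety_check("Contact jane@example.com"))           # ['possible PII exposure']
```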

LLM-as-Judge

Using one model to evaluate another. Powerful but biased.

| Bias | Pattern | Mitigation |
| --- | --- | --- |
| Verbosity | Longer outputs score higher | Control for length in rubric |
| Self-preference | Model prefers its own style | Use a different model family as judge |
| Position | First option scores higher | Randomize output order |
| Sycophancy | Polite outputs score higher | Include "correct but blunt" in golden set |

Validation cadence: Re-calibrate LLM-as-judge against human scores monthly. If agreement drops below 75%, retrain or replace.
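
A sketch of a position-debiased pairwise judge in Python. The judge parameter stands in for whatever function calls your judge model (ideally a different model family than the one under test); no specific API is assumed.

```python
import random
from typing import Callable

# Pairwise comparison with randomized presentation order to counter position
# bias. `judge` takes a prompt and returns the judge model's raw text verdict.
def pairwise_judge(question: str, output_a: str, output_b: str,
                   judge: Callable[[str], str]) -> str:
    """Return 'A' or 'B', whichever the judge prefers."""
    first, second = ("A", "B") if random.random() < 0.5 else ("B", "A")
    texts = {"A": output_a, "B": output_b}
    prompt = (
        f"Question: {question}\n\n"
        f"Response 1:\n{texts[first]}\n\n"
        f"Response 2:\n{texts[second]}\n\n"
        "Which response better follows the rubric? Answer '1' or '2'. "
        "Judge substance, not length or politeness."
    )
    verdict = judge(prompt).strip()
    return first if verdict.startswith("1") else second
```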

CI/CD Integration

Evals should run on every change, like tests:

Code Change → Automated Eval Suite → Pass/Warn/Fail
                       │
            ┌──────────┼──────────┐
            ▼          ▼          ▼
         Format      Safety     Quality
         (fast)      (fast)     (slower)

| Tier | Speed | Runs When | Blocks Deploy? |
| --- | --- | --- | --- |
| Format + Safety | Seconds | Every commit | Yes |
| Quality (smoke) | Minutes | Every PR | Yes, if regression |
| Quality (full) | Hours | Nightly / pre-release | Advisory |
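
A sketch of the tier gating as it might run in a CI job, in Python. The pass rates, thresholds, and baseline are placeholder numbers; in practice they come from your eval suite's output.

```python
import sys

# Tiered gating: fast tiers block the deploy, the full quality run is advisory.
def gate(tier: str, pass_rate: float, threshold: float, blocking: bool) -> bool:
    ok = pass_rate >= threshold
    status = "PASS" if ok else ("FAIL" if blocking else "WARN")
    print(f"{status} {tier}: {pass_rate:.1%} vs threshold {threshold:.1%}")
    return ok or not blocking

tiers = [
    # (tier, pass_rate, threshold, blocking)
    ("format + safety", 1.00, 1.00, True),   # every commit, seconds
    ("quality smoke",   0.84, 0.82, True),   # every PR, vs. last release baseline
    ("quality full",    0.79, 0.82, False),  # nightly / pre-release, advisory
]

outcomes = [gate(*t) for t in tiers]
if not all(outcomes):
    sys.exit(1)  # block the deploy
```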

The Eval Loop

Evaluation isn't a gate — it's a feedback loop:

DEFINE RUBRIC → BUILD DATASET → RUN EVAL → ANALYZE GAPS → IMPROVE → RE-EVAL
      ↑                                                               │
      └─────────── Update rubric as understanding deepens ────────────┘

| Stage | Key Question |
| --- | --- |
| Define | Are we measuring what users actually value? |
| Build | Does the dataset represent production traffic? |
| Run | Are scores reproducible and trustworthy? |
| Analyze | Where are failures concentrated? |
| Improve | What's the highest-leverage fix? |
| Re-eval | Did the improvement actually work? |

Common Failures

| Failure | Pattern | Fix |
| --- | --- | --- |
| False confidence | High scores on easy dataset | Add hard cases, adversarial inputs |
| Stale evals | Production drifted, evals didn't | Refresh dataset quarterly |
| One rubric | Same eval for chat, code, search | Different rubrics per use case |
| No baseline | Can't tell if changes help or hurt | Establish baseline before iterating |
| Eval debt | "We'll add evals later" | Evals in the PRD, not an afterthought |
| Score worship | Optimizing eval score, not user value | Validate scores against user satisfaction regularly |
