
AI Products

What changes when your product thinks?

Traditional products are deterministic. Same input, same output. Test it once, ship it. AI products produce a distribution of outcomes: the same input can generate a different output on each run. This changes everything about how you build, test, and ship.

The gap between "AI demo" and "AI product" is an evaluation gap. Demos impress with best-case outputs. Products must handle the full distribution — including the tail where things go wrong.
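A concrete way to read "handle the full distribution" is to score many sampled outputs for the same input and compare the pass rate against a failure budget, rather than expecting one correct answer. This is a minimal sketch, not a prescribed implementation: `generate` and `passes_rubric` are hypothetical stand-ins for whatever model call and checker you actually use, and the 5% budget is an arbitrary example.

```python
from typing import Callable

def pass_rate(prompt: str,
              generate: Callable[[str], str],
              passes_rubric: Callable[[str], bool],
              n_samples: int = 50) -> float:
    """Sample the model repeatedly and measure how often the output is acceptable."""
    passes = sum(passes_rubric(generate(prompt)) for _ in range(n_samples))
    return passes / n_samples

# Quality as an acceptable distribution, not a single correct answer:
FAILURE_BUDGET = 0.05  # example: at most 5% of outputs may fall outside the rubric

def within_budget(prompt: str,
                  generate: Callable[[str], str],
                  passes_rubric: Callable[[str], bool]) -> bool:
    return (1.0 - pass_rate(prompt, generate, passes_rubric)) <= FAILURE_BUDGET
```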

The Shift

| Deterministic Product | AI Product |
| --- | --- |
| Bug = broken code | Bad output = expected variance |
| Test once, ship | Evaluate continuously |
| Binary: works or doesn't | Spectrum: how often, how good |
| Reproduce every issue | Some failures are statistical |
| 100% quality possible | Quality = acceptable distribution |
| Spec defines behavior | Spec defines boundaries |

AI Product Tight Five

The same five questions applied to building with AI:

| # | Question | AI Product Translation | Where |
| --- | --- | --- | --- |
| 1 | Why does this matter? | What job does the AI do that wasn't possible before? | Principles |
| 2 | What truths guide you? | What does "good" mean for this output? | Principles |
| 3 | What do you control? | What can you measure, test, and improve? | Evaluation |
| 4 | What do you see? | Where is the model failing that users haven't reported yet? | Observability |
| 5 | How do you know? | Are eval scores improving AND users happier? | Observability |

The Loop

The VVFL applied to AI products:

DEFINE "GOOD" → BUILD EVALS → SHIP → MEASURE → LEARN → REDEFINE "GOOD"
   ↑                                                        │
   └────────────────────────────────────────────────────────┘

Every cycle tightens the distribution. Quality isn't a destination — it's a feedback loop.

| Stage | Activity | Output |
| --- | --- | --- |
| Define | Set quality principles | Dimensions, rubrics, failure budgets |
| Build | Write requirements | AI PRD with eval criteria |
| Measure | Run evaluations | Scores across golden dataset |
| See | Analyze traces | Where it fails, why, how often |
| Learn | Close the gap | Tighter prompts, better data, updated evals |
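The Measure and See stages can be made concrete as a single eval run over a golden dataset: score every example on the defined dimensions, then cluster failures to feed the Learn stage. A rough sketch under assumptions: `generate` and `score_output` are hypothetical hooks for your model call and judge, and the 0.7 pass threshold is an example value.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class GoldenExample:
    prompt: str
    tags: list[str]        # e.g. ["refund", "edge-case"], used for failure clustering
    expectations: str      # what "good" looks like for this input

# Hypothetical hooks: replace with your real model call and judge.
def generate(prompt: str) -> str:
    raise NotImplementedError

def score_output(output: str, expectations: str) -> dict[str, float]:
    """Return a score per quality dimension, e.g. {"accuracy": 0.9, "tone": 0.7}."""
    raise NotImplementedError

def run_eval(dataset: list[GoldenExample]) -> dict:
    """Measure: score every golden example. See: group failures by tag."""
    dimension_scores = defaultdict(list)
    failures_by_tag = defaultdict(int)
    for ex in dataset:
        scores = score_output(generate(ex.prompt), ex.expectations)
        for dim, s in scores.items():
            dimension_scores[dim].append(s)
        if min(scores.values()) < 0.7:      # assumed pass threshold
            for tag in ex.tags:
                failures_by_tag[tag] += 1
    return {
        "dimension_means": {d: sum(v) / len(v) for d, v in dimension_scores.items()},
        "failures_by_tag": dict(failures_by_tag),   # input to the Learn stage
    }
```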

Work Chart

Who does what in AI product development? The work chart pattern:

| Activity | Human Role | AI Role | AI % | Trend |
| --- | --- | --- | --- | --- |
| Define quality | Sets dimensions, judges edge cases | Generates rubric variations | 25% | |
| Build golden datasets | Curates, validates, tags | Generates synthetic examples | 50% | ↑↑ |
| Write eval rubrics | Defines scoring criteria | Scores outputs against rubric | 60% | ↑↑ |
| Trace analysis | Pattern recognition, root cause | Surfaces anomalies, clusters failures | 45% | |
| Ship decisions | Final judgment, risk tolerance | Presents evidence, scores confidence | 30% | |

Aggregate AI %: 42%. Human edge: judgment on what "good" means. AI edge: consistent scoring at scale.
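"Scores outputs against rubric" and "consistent scoring at scale" usually take the form of a model-as-judge step. A minimal sketch, assuming a hypothetical `call_model(prompt) -> str` wrapper around whatever model API you use; the rubric dimensions and JSON format here are illustrative, not prescribed.

```python
import json

RUBRIC = """Score the answer from 1-5 on each dimension and reply with JSON only:
- accuracy: is the answer factually and logically correct?
- completeness: does it cover everything the question asked for?
- tone: is it appropriate for an end user?
Example reply: {"accuracy": 4, "completeness": 5, "tone": 3}"""

def call_model(prompt: str) -> str:
    """Placeholder for your actual model API call."""
    raise NotImplementedError

def judge(question: str, answer: str) -> dict[str, int]:
    """Ask a judge model to score one answer against the rubric."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer to score:\n{answer}"
    return json.loads(call_model(prompt))

# The human still defines the rubric and reviews disagreements; the model
# supplies the consistent scoring at scale noted above.
```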
