AI Products
What changes when your product thinks?
Traditional products are deterministic. Same input, same output. Test it once, ship it. AI products produce a distribution of outcomes: the same input can produce a different output on every run. This changes everything about how you build, test, and ship.
The gap between "AI demo" and "AI product" is an evaluation gap. Demos impress with best-case outputs. Products must handle the full distribution — including the tail where things go wrong.
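To make the shift concrete, here is the smallest version of the difference: a deterministic product gets one assertion, an AI product gets a sampled pass rate checked against a threshold. A minimal sketch in Python, where `generate` is a stand-in for your model call and `passes` is a toy grader, both hypothetical:

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical model call: the same prompt yields different outputs across runs."""
    return random.choices(
        [
            "Refund issued. A confirmation is on its way to your email.",
            "I've started the refund; you'll see it in 3-5 business days.",
            "Sorry, I can't help with that.",  # the tail where things go wrong
        ],
        weights=[0.6, 0.37, 0.03],
    )[0]

def passes(output: str) -> bool:
    """Toy grader; a real product would score against a rubric."""
    return "refund" in output.lower()

# Deterministic product: one input, one assertion, done.
# AI product: sample the distribution and judge the pass rate instead.
N = 200
pass_rate = sum(passes(generate("Please refund my order")) for _ in range(N)) / N
threshold = 0.95
verdict = "ship" if pass_rate >= threshold else "hold: outside the failure budget"
print(f"pass rate over {N} samples: {pass_rate:.0%} -> {verdict}")
```

The demo is the best line in that list; the product is the whole distribution, including the 3% tail.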
The Shift
| Deterministic Product | AI Product |
|---|---|
| Bug = broken code | Bad output = expected variance |
| Test once, ship | Evaluate continuously |
| Binary: works or doesn't | Spectrum: how often, how good |
| Reproduce every issue | Some failures are statistical |
| 100% quality possible | Quality = acceptable distribution |
| Spec defines behavior | Spec defines boundaries |
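The last two rows, quality as an acceptable distribution and a spec that defines boundaries, can be written down directly as a failure budget. A minimal sketch with made-up dimensions and numbers:

```python
from dataclasses import dataclass

@dataclass
class Boundary:
    """One spec line: the worst failure rate we accept on a quality dimension."""
    dimension: str
    max_failure_rate: float  # e.g. 0.02 = at most 2% of outputs may fail here

# Illustrative failure budget; the dimensions and numbers are placeholders.
SPEC = [
    Boundary("factual_accuracy", 0.02),
    Boundary("tone", 0.05),
    Boundary("formatting", 0.10),
]

def within_boundaries(observed: dict[str, float]) -> bool:
    """Observed failure rates come from an eval run; ship only if every dimension is inside budget."""
    return all(observed.get(b.dimension, 1.0) <= b.max_failure_rate for b in SPEC)

# Example eval results (made up): formatting is over budget, so this build doesn't ship.
print(within_boundaries({"factual_accuracy": 0.01, "tone": 0.03, "formatting": 0.12}))  # False
```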
AI Product Tight Five
The same five questions applied to building with AI:
| # | Question | AI Product Translation | Where |
|---|---|---|---|
| 1 | Why does this matter? | What job does the AI do that wasn't possible before? | Principles |
| 2 | What truths guide you? | What does "good" mean for this output? | Principles |
| 3 | What do you control? | What can you measure, test, and improve? | Evaluation |
| 4 | What do you see? | Where is the model failing that users haven't reported yet? | Observability |
| 5 | How do you know? | Are eval scores improving AND users happier? | Observability |
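Question 5 is the one teams most often answer with half the evidence: eval scores can climb while users get less happy, or the reverse. A small sketch of checking both signals together, with illustrative metric names and thresholds:

```python
def healthy_release(prev: dict, curr: dict) -> bool:
    """Question 5 as a check: offline evals improving AND online users happier.
    Metric names and the >= comparison are illustrative, not a standard."""
    eval_improving = curr["eval_score"] >= prev["eval_score"]
    users_happier = curr["thumbs_up_rate"] >= prev["thumbs_up_rate"]
    return eval_improving and users_happier

prev = {"eval_score": 0.81, "thumbs_up_rate": 0.72}
curr = {"eval_score": 0.86, "thumbs_up_rate": 0.69}  # evals up, users down: not a win
print(healthy_release(prev, curr))  # False
```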
The Loop
The VVFL applied to AI products:
DEFINE "GOOD" → BUILD EVALS → SHIP → MEASURE → LEARN → REDEFINE "GOOD"
↑ |
└────────────────────────────────────────────────────────┘
Every cycle tightens the distribution. Quality isn't a destination — it's a feedback loop.
| Stage | Activity | Output |
|---|---|---|
| Define | Set quality principles | Dimensions, rubrics, failure budgets |
| Build | Write requirements | AI PRD with eval criteria |
| Measure | Run evaluations | Scores across golden dataset |
| See | Analyze traces | Where it fails, why, how often |
| Learn | Close the gap | Tighter prompts, better data, updated evals |
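The stage table maps onto a small eval harness: define dimensions and a failure budget, measure over a golden dataset, see which failure labels are over budget, and carry that into the next cycle. A minimal sketch, where `generate` stands in for a real model call and `grade` for a real rubric scorer, all names hypothetical:

```python
from collections import Counter

# DEFINE: failure labels and the budget for each (placeholder numbers).
FAILURE_BUDGET = {"wrong_answer": 0.02, "off_tone": 0.05}

# Golden dataset: curated inputs with the behavior we expect (tiny, illustrative).
GOLDEN = [
    {"input": "Cancel my subscription", "must_mention": "cancel"},
    {"input": "What's your refund policy?", "must_mention": "refund"},
]

def generate(prompt: str) -> str:
    """Hypothetical model call; swap in your real inference client."""
    return f"Happy to help with that: {prompt.lower()}"

def grade(case: dict, output: str) -> str | None:
    """Toy grader: returns a failure label or None. A real one scores a rubric."""
    return None if case["must_mention"] in output.lower() else "wrong_answer"

def run_cycle() -> dict[str, float]:
    """MEASURE + SEE: score every golden case, then report failure rates by label."""
    failures = Counter()
    for case in GOLDEN:
        label = grade(case, generate(case["input"]))
        if label:
            failures[label] += 1
    rates = {label: failures[label] / len(GOLDEN) for label in FAILURE_BUDGET}
    # LEARN: anything over budget becomes next cycle's prompt, data, or eval work.
    over_budget = {l: r for l, r in rates.items() if r > FAILURE_BUDGET[l]}
    print("failure rates:", rates, "| over budget:", over_budget or "none")
    return rates

run_cycle()
```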
Work Chart
Who does what in AI product development? The work chart pattern:
| Activity | Human Role | AI Role | AI % | Trend |
|---|---|---|---|---|
| Define quality | Sets dimensions, judges edge cases | Generates rubric variations | 25% | ↑ |
| Build golden datasets | Curates, validates, tags | Generates synthetic examples | 50% | ↑↑ |
| Write eval rubrics | Defines scoring criteria | Scores outputs against rubric | 60% | ↑↑ |
| Trace analysis | Pattern recognition, root cause | Surfaces anomalies, clusters failures | 45% | ↑ |
| Ship decisions | Final judgment, risk tolerance | Presents evidence, scores confidence | 30% | ↑ |
Aggregate AI %: 42%. Human edge: judgment on what "good" means. AI edge: consistent scoring at scale.
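The rubric row is where the AI share is highest: a human writes the scoring criteria once, and a model applies them consistently at scale. A minimal LLM-as-judge sketch; `call_model` is a placeholder for whatever inference client you use, and the rubric text is illustrative:

```python
import json

RUBRIC = """Score the response 1-5 on each dimension and return JSON:
- accuracy: are the facts correct?
- tone: is it appropriately helpful and concise?
Return: {"accuracy": <1-5>, "tone": <1-5>, "reason": "<one sentence>"}"""

def call_model(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned judgment here."""
    return '{"accuracy": 4, "tone": 5, "reason": "Correct and friendly, slightly verbose."}'

def judge(user_input: str, response: str) -> dict:
    """Human-defined rubric, model-applied scoring."""
    prompt = f"{RUBRIC}\n\nUser input:\n{user_input}\n\nResponse to score:\n{response}"
    return json.loads(call_model(prompt))

print(judge("What's your refund policy?", "Refunds are available within 30 days."))
```

The division of labor mirrors the chart: defining the rubric stays human, applying it to every trace is where the model earns its percentage.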
Subjects
📄️ Principles
What does "good" mean when the same input produces different outputs?
📄️ Requirements
How do you spec a product that never gives the same answer twice?
📄️ Evaluation
How do you know your AI product is getting better?
📄️ Observability
When a user reports a bad experience, can you even reproduce it?
Context
- VVFL Loop — The feedback loop everything builds on
- Prediction Evaluation — The SMART-BF pattern this section extends
- Work Charts — Human/AI capability mapping
- Jobs To Be Done — What job is the AI hired for?
- AI Frameworks — The infrastructure layer beneath products
- Product Design — Design principles that still apply