
AI Products

What changes when your product thinks?

Traditional products are deterministic. Same input, same output. Test it once, ship it. AI products produce a distribution of outcomes: the same input can generate a different output on each run. This changes everything about how you build, test, and ship.

The gap between "AI demo" and "AI product" is an evaluation gap. Demos impress with best-case outputs. Products must handle the full distribution — including the tail where things go wrong.
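A concrete way to read "handle the full distribution" is to score many sampled outputs for the same input and compare the pass rate against a failure budget, rather than expecting one correct answer. This is a minimal sketch, not a prescribed implementation: `generate` and `passes_rubric` are hypothetical stand-ins for whatever model call and checker you actually use, and the 5% budget is an arbitrary example.

```python
from typing import Callable

def pass_rate(prompt: str,
              generate: Callable[[str], str],
              passes_rubric: Callable[[str], bool],
              n_samples: int = 50) -> float:
    """Sample the model repeatedly and measure how often the output is acceptable."""
    passes = sum(passes_rubric(generate(prompt)) for _ in range(n_samples))
    return passes / n_samples

# Quality as an acceptable distribution, not a single correct answer:
FAILURE_BUDGET = 0.05  # example: at most 5% of outputs may fall outside the rubric

def within_budget(prompt: str,
                  generate: Callable[[str], str],
                  passes_rubric: Callable[[str], bool]) -> bool:
    return (1.0 - pass_rate(prompt, generate, passes_rubric)) <= FAILURE_BUDGET
```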

The Shift

| Deterministic Product | AI Product |
| --- | --- |
| Bug = broken code | Bad output = expected variance |
| Test once, ship | Evaluate continuously |
| Binary: works or doesn't | Spectrum: how often, how good |
| Reproduce every issue | Some failures are statistical |
| 100% quality possible | Quality = acceptable distribution |
| Spec defines behavior | Spec defines boundaries |

AI Product Tight Five

The same five questions applied to building with AI:

| # | Question | AI Product Translation | Where |
| --- | --- | --- | --- |
| 1 | Why does this matter? | What job does the AI do that wasn't possible before? | Principles |
| 2 | What truths guide you? | What does "good" mean for this output? | Principles |
| 3 | What do you control? | What can you measure, test, and improve? | Evaluation |
| 4 | What do you see? | Where is the model failing that users haven't reported yet? | Observability |
| 5 | How do you know? | Are eval scores improving AND users happier? | Observability |

The Loop

The VVFL applied to AI products:

DEFINE "GOOD" → BUILD EVALS → SHIP → MEASURE → LEARN → REDEFINE "GOOD"
   ↑                                                        │
   └────────────────────────────────────────────────────────┘

Every cycle tightens the distribution. Quality isn't a destination — it's a feedback loop.

| Stage | Activity | Output |
| --- | --- | --- |
| Define | Set quality principles | Dimensions, rubrics, failure budgets |
| Build | Write requirements | AI PRD with eval criteria |
| Measure | Run evaluations | Scores across golden dataset |
| See | Analyze traces | Where it fails, why, how often |
| Learn | Close the gap | Tighter prompts, better data, updated evals |
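The Measure and See stages can be made concrete as a single eval run over a golden dataset: score every example on the defined dimensions, then cluster failures to feed the Learn stage. A rough sketch under assumptions: `generate` and `score_output` are hypothetical hooks for your model call and judge, and the 0.7 pass threshold is an example value.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class GoldenExample:
    prompt: str
    tags: list[str]        # e.g. ["refund", "edge-case"], used for failure clustering
    expectations: str      # what "good" looks like for this input

# Hypothetical hooks: replace with your real model call and judge.
def generate(prompt: str) -> str:
    raise NotImplementedError

def score_output(output: str, expectations: str) -> dict[str, float]:
    """Return a score per quality dimension, e.g. {"accuracy": 0.9, "tone": 0.7}."""
    raise NotImplementedError

def run_eval(dataset: list[GoldenExample]) -> dict:
    """Measure: score every golden example. See: group failures by tag."""
    dimension_scores = defaultdict(list)
    failures_by_tag = defaultdict(int)
    for ex in dataset:
        scores = score_output(generate(ex.prompt), ex.expectations)
        for dim, s in scores.items():
            dimension_scores[dim].append(s)
        if min(scores.values()) < 0.7:      # assumed pass threshold
            for tag in ex.tags:
                failures_by_tag[tag] += 1
    return {
        "dimension_means": {d: sum(v) / len(v) for d, v in dimension_scores.items()},
        "failures_by_tag": dict(failures_by_tag),   # input to the Learn stage
    }
```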

Work Chart

Who does what in AI product development? The work chart pattern:

| Activity | Human Role | AI Role | AI % | Trend |
| --- | --- | --- | --- | --- |
| Define quality | Sets dimensions, judges edge cases | Generates rubric variations | 25% | |
| Build golden datasets | Curates, validates, tags | Generates synthetic examples | 50% | ↑↑ |
| Write eval rubrics | Defines scoring criteria | Scores outputs against rubric | 60% | ↑↑ |
| Trace analysis | Pattern recognition, root cause | Surfaces anomalies, clusters failures | 45% | |
| Ship decisions | Final judgment, risk tolerance | Presents evidence, scores confidence | 30% | |

Aggregate AI %: 42%. Human edge: judgment on what "good" means. AI edge: consistent scoring at scale.
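"Scores outputs against rubric" and "consistent scoring at scale" usually take the form of a model-as-judge step. A minimal sketch, assuming a hypothetical `call_model(prompt) -> str` wrapper around whatever model API you use; the rubric dimensions and JSON format here are illustrative, not prescribed.

```python
import json

RUBRIC = """Score the answer from 1-5 on each dimension and reply with JSON only:
- accuracy: is the answer factually and logically correct?
- completeness: does it cover everything the question asked for?
- tone: is it appropriate for an end user?
Example reply: {"accuracy": 4, "completeness": 5, "tone": 3}"""

def call_model(prompt: str) -> str:
    """Placeholder for your actual model API call."""
    raise NotImplementedError

def judge(question: str, answer: str) -> dict[str, int]:
    """Ask a judge model to score one answer against the rubric."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer to score:\n{answer}"
    return json.loads(call_model(prompt))

# The human still defines the rubric and reviews disagreements; the model
# supplies the consistent scoring at scale noted above.
```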
