
AI Product Principles

What does "good" mean when the same input produces different outputs?

This is the foundational question of AI product development. Until you answer it — with specificity, not hand-waving — everything downstream is guesswork. Your evaluation scores, your ship decisions, your user trust: all depend on a clear definition of quality that accounts for non-determinism.

Distribution Thinking

The most dangerous assumption from traditional PM: that quality is binary. AI quality is a distribution.

TRADITIONAL:  Pass ─── or ─── Fail
AI PRODUCT:   Harmful ─── Poor ─── Acceptable ─── Good ─── Excellent
                 ↑                                             ↑
           Eliminate this                              Celebrate this

Your job isn't to make every output excellent. It's to shift the distribution right and cut off the left tail.

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Mean quality | Average output score | Overall user experience |
| Variance | Spread between best and worst | User trust and predictability |
| Left tail | Worst-case outputs | Brand risk, safety, trust destruction |
| Right tail | Best-case outputs | Delight, word-of-mouth, competitive edge |
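
All four metrics fall out of the same batch of graded outputs. A minimal sketch in Python, assuming each sampled output has already been scored 0–100 by a rubric or grader; the example scores are illustrative:

```python
import statistics

def distribution_report(scores: list[float]) -> dict:
    """Summarize a batch of graded outputs as a distribution, not a pass/fail count."""
    ordered = sorted(scores)
    n = len(ordered)

    def pct(p: float) -> float:
        # Nearest-rank percentile; good enough for a sketch.
        return ordered[min(n - 1, int(p * n))]

    return {
        "mean": statistics.mean(ordered),            # overall user experience
        "variance": statistics.pvariance(ordered),   # predictability and trust
        "left_tail_p5": pct(0.05),                   # worst-case risk
        "right_tail_p95": pct(0.95),                 # delight, competitive edge
    }

# Two batches with the same mean but very different left tails.
print(distribution_report([82, 85, 80, 84, 81, 83]))
print(distribution_report([95, 96, 40, 97, 95, 72]))
```

The second batch looks identical on a mean-only dashboard; only the left-tail number exposes the risk.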

Quality Dimensions

Quality is not a single dimension. Different products weight the dimensions below differently.

| Dimension | Question | Example: Chatbot | Example: Code Gen |
|---|---|---|---|
| Correctness | Is it factually right? | Critical | Critical |
| Completeness | Does it cover the full request? | Important | Critical |
| Relevance | Does it match the user's intent? | Critical | Important |
| Conciseness | No unnecessary content? | Important | Less important |
| Safety | Could it cause harm? | Critical | Moderate |
| Tone | Does it match context? | Important | Low |
| Latency | How fast does it respond? | Important | Moderate |
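
One way to keep the weighting honest is to make it explicit per product. A minimal sketch, assuming each dimension is scored 0–1; the weights are illustrative assumptions, not values from the table:

```python
# Hypothetical per-product weights, loosely mirroring the table above.
WEIGHTS = {
    "chatbot":  {"correctness": 0.30, "relevance": 0.25, "safety": 0.20,
                 "completeness": 0.10, "conciseness": 0.05, "tone": 0.05, "latency": 0.05},
    "code_gen": {"correctness": 0.35, "completeness": 0.30, "relevance": 0.15,
                 "safety": 0.05, "conciseness": 0.05, "tone": 0.05, "latency": 0.05},
}

def weighted_quality(product: str, dim_scores: dict[str, float]) -> float:
    """Aggregate per-dimension scores (0-1) using the product's weighting."""
    weights = WEIGHTS[product]
    return sum(weights[d] * dim_scores.get(d, 0.0) for d in weights)

# A response that is fast and polite but subtly wrong scores poorly where it matters.
print(weighted_quality("chatbot", {"correctness": 0.2, "relevance": 0.9, "safety": 1.0,
                                   "completeness": 0.8, "conciseness": 1.0, "tone": 1.0,
                                   "latency": 1.0}))
```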

The trap: measuring what's easy instead of what matters. Latency is easy to measure. "Does it match the user's intent?" requires judgment.

The Three Tiers

Every AI product needs three defined quality levels:

| Tier | Definition | Example (Search) | Response |
|---|---|---|---|
| Excellent | Exceeds expectations, would share with others | Perfect result, surprising depth | Celebrate, study what made it work |
| Acceptable | Gets the job done, no complaints | Relevant results, right ballpark | Ship, improve over time |
| Unacceptable | Harms user trust or safety | Wrong answers presented confidently | Block, alert, investigate |
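
In practice the tiers become score thresholds wired to actions, with the unacceptable band routed to a block-and-alert path rather than a dashboard. A minimal sketch with illustrative cutoffs; calibrate them per product and stakes:

```python
from enum import Enum

class Tier(Enum):
    EXCELLENT = "excellent"
    ACCEPTABLE = "acceptable"
    UNACCEPTABLE = "unacceptable"

# Illustrative cutoffs on a 0-100 quality score.
EXCELLENT_CUTOFF = 90
ACCEPTABLE_CUTOFF = 60

def classify(score: float) -> Tier:
    if score >= EXCELLENT_CUTOFF:
        return Tier.EXCELLENT
    if score >= ACCEPTABLE_CUTOFF:
        return Tier.ACCEPTABLE
    return Tier.UNACCEPTABLE

def respond(score: float, output_id: str) -> str:
    tier = classify(score)
    if tier is Tier.UNACCEPTABLE:
        # Block, alert, investigate: the left tail never reaches the user silently.
        return f"BLOCKED {output_id}: alert on-call, log for review"
    if tier is Tier.EXCELLENT:
        # Celebrate and study: feed back into prompts, examples, training data.
        return f"SHIPPED {output_id}: flagged for what-made-this-work analysis"
    return f"SHIPPED {output_id}"
```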

The gap between tiers varies by stakes:

| Stakes | Acceptable Range | Left Tail Tolerance |
|---|---|---|
| Low (casual chat, suggestions) | Wide | Moderate |
| Medium (productivity, content) | Moderate | Low |
| High (medical, financial, legal) | Narrow | Zero |

Personas & Quality

Different users have different quality bars for the same feature.

| User Type | Quality Priority | Failure Tolerance | What "Good" Means |
|---|---|---|---|
| Expert | Correctness, depth | Low for their domain | Saves time, not effort |
| Novice | Clarity, guidance | Higher for minor errors | Unlocks capability |
| High-stakes | Accuracy, safety | Near zero | Trustworthy enough to act on |
| Casual | Speed, tone | Moderate | Entertaining, useful enough |

The question: are you evaluating with inputs from your actual user base, or inputs from your own mental model?
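
One concrete way to answer that question is to build the eval set from logged production traffic, stratified by persona, rather than from hand-written prompts. A minimal sketch; the persona mix and input fields are assumptions, not figures from this page:

```python
import random

# Hypothetical share of real traffic per persona (replace with your own telemetry).
TRAFFIC_MIX = {"expert": 0.25, "novice": 0.45, "high_stakes": 0.10, "casual": 0.20}

def build_eval_set(logged_inputs: list[dict], size: int, seed: int = 0) -> list[dict]:
    """Sample logged user inputs so the eval set mirrors who actually uses the product.

    Each logged input is assumed to carry a 'persona' label, e.g.
    {"persona": "novice", "text": "how do i ..."}.
    """
    rng = random.Random(seed)
    eval_set: list[dict] = []
    for persona, share in TRAFFIC_MIX.items():
        pool = [x for x in logged_inputs if x["persona"] == persona]
        k = min(len(pool), round(size * share))
        eval_set.extend(rng.sample(pool, k))
    return eval_set
```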

The Invisible Failure

The most dangerous AI failure: users can't tell when the output is wrong.

| Failure Type | User Notices? | Risk |
|---|---|---|
| Obvious error | Yes — formatting, gibberish | Low — user works around it |
| Confident wrong answer | Sometimes | High — user may act on bad info |
| Subtle omission | Rarely | Very high — user doesn't know what's missing |
| Plausible hallucination | Almost never | Critical — trust destruction when discovered |

If your users can't detect errors, your evaluation system must. You can't rely on user feedback to find quality problems — by the time they notice, trust is already damaged.

Common Traps

| Trap | Pattern | Fix |
|---|---|---|
| Binary thinking | "It works" or "it's broken" | Think in distributions, score on spectrum |
| Demo bias | Showing best-case outputs | Evaluate against representative traffic |
| Metric gaming | Optimizing easy metrics | Weight dimensions by user value |
| One rubric | Same eval for all use cases | Different rubrics for different jobs |
| Frozen definition | "Good" defined once, never updated | Redefine as user expectations evolve |
| Builder's goggles | Testing with clean, ideal inputs | Use real user inputs (messy, ambiguous, adversarial) |

The Inversion Test

Before shipping, ask: what would make this AI product worse?

If the answer is:

  • "Less consistent outputs" → Reliability is your constraint
  • "More confident wrong answers" → Safety is your constraint
  • "Slower responses" → You're latency-bound, not quality-bound
  • "Works for experts but not novices" → Persona coverage gap
  • "Great today, degrades over time" → You need drift monitoring
