AI Product Principles
What does "good" mean when the same input produces different outputs?
This is the foundational question of AI product development. Until you answer it — with specificity, not hand-waving — everything downstream is guesswork. Your evaluation scores, your ship decisions, your user trust: all depend on a clear definition of quality that accounts for non-determinism.
Distribution Thinking
The most dangerous assumption carried over from traditional product management: that quality is binary. AI quality is a distribution.
TRADITIONAL: Pass ─── or ─── Fail
AI PRODUCT: Harmful (eliminate this) ─── Poor ─── Acceptable ─── Good ─── Excellent (celebrate this)
Your job isn't to make every output excellent. It's to shift the distribution right and cut off the left tail.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Mean quality | Average output score | Overall user experience |
| Variance | Spread between best and worst | User trust and predictability |
| Left tail | Worst-case outputs | Brand risk, safety, trust destruction |
| Right tail | Best-case outputs | Delight, word-of-mouth, competitive edge |
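A minimal sketch of how these four numbers can be computed from a batch of evaluation scores. The 1-to-5 scale and the tail thresholds are assumptions for illustration, not a standard.

```python
from statistics import mean, pvariance

def distribution_report(scores, left_tail_max=2.0, right_tail_min=4.5):
    """Summarise a batch of eval scores (assumed 1-5 scale) as a distribution.

    Thresholds are illustrative: scores <= left_tail_max count as the left
    tail (worst cases), scores >= right_tail_min as the right tail (delight).
    """
    n = len(scores)
    return {
        "mean_quality": mean(scores),      # overall user experience
        "variance": pvariance(scores),     # predictability and trust
        "left_tail_rate": sum(s <= left_tail_max for s in scores) / n,
        "right_tail_rate": sum(s >= right_tail_min for s in scores) / n,
    }

# Two batches with the same mean but very different left tails.
print(distribution_report([4, 4, 4, 4, 4]))
print(distribution_report([5, 5, 5, 4, 1]))
```

Both batches average 4.0, but to users they are different products: one is predictable, the other occasionally fails badly.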
Quality Dimensions
Not all quality is the same. Different products weight the dimensions below differently.
| Dimension | Question | Example: Chatbot | Example: Code Gen |
|---|---|---|---|
| Correctness | Is it factually right? | Critical | Critical |
| Completeness | Does it cover the full request? | Important | Critical |
| Relevance | Does it match the user's intent? | Critical | Important |
| Conciseness | Does it avoid unnecessary content? | Important | Less important |
| Safety | Could it cause harm? | Critical | Moderate |
| Tone | Does it match context? | Important | Low |
| Latency | How fast does it respond? | Important | Moderate |
The trap: measuring what's easy instead of what matters. Latency is easy to measure. "Does it match the user's intent?" requires judgment.
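One way to keep the easy metric from dominating: score every dimension, then combine them with weights that reflect user value for the specific product. The weights and scores below are invented for illustration; set real ones from user research.

```python
# Hypothetical per-product weights for the dimensions above (each row sums to 1.0).
WEIGHTS = {
    "chatbot":  {"correctness": 0.30, "relevance": 0.25, "safety": 0.20,
                 "completeness": 0.10, "conciseness": 0.05, "tone": 0.05, "latency": 0.05},
    "code_gen": {"correctness": 0.35, "completeness": 0.30, "relevance": 0.15,
                 "latency": 0.10, "safety": 0.05, "conciseness": 0.05, "tone": 0.00},
}

def weighted_quality(product: str, dimension_scores: dict) -> float:
    """Combine per-dimension scores (0-1) into a single weighted quality score."""
    weights = WEIGHTS[product]
    return sum(weights[dim] * dimension_scores.get(dim, 0.0) for dim in weights)

# A fast but off-topic answer: great latency, poor overall quality.
print(weighted_quality("chatbot", {"latency": 1.0, "correctness": 0.4, "relevance": 0.3}))
```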
The Three Tiers
Every AI product needs three defined quality levels:
| Tier | Definition | Example (Search) | Response |
|---|---|---|---|
| Excellent | Exceeds expectations, would share with others | Perfect result, surprising depth | Celebrate, study what made it work |
| Acceptable | Gets the job done, no complaints | Relevant results, right ballpark | Ship, improve over time |
| Unacceptable | Harms user trust or safety | Wrong answers presented confidently | Block, alert, investigate |
The gap between tiers varies by stakes:
| Stakes | Acceptable Range | Left Tail Tolerance |
|---|---|---|
| Low (casual chat, suggestions) | Wide | Moderate |
| Medium (productivity, content) | Moderate | Low |
| High (medical, financial, legal) | Narrow | Zero |
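A sketch of how the tier boundaries might tighten as stakes rise. The 0-to-1 score and the cutoffs are assumptions chosen to show the narrowing acceptable range, not calibrated values.

```python
# Illustrative tier cutoffs on a 0-1 quality score. As stakes rise, the
# unacceptable floor rises and the acceptable band narrows.
TIER_CUTOFFS = {              # (unacceptable_below, excellent_at_or_above)
    "low":    (0.30, 0.85),
    "medium": (0.50, 0.90),
    "high":   (0.80, 0.95),
}

def tier(score: float, stakes: str) -> str:
    """Map a quality score to excellent / acceptable / unacceptable."""
    floor, ceiling = TIER_CUTOFFS[stakes]
    if score < floor:
        return "unacceptable"   # block, alert, investigate
    if score >= ceiling:
        return "excellent"      # celebrate, study what made it work
    return "acceptable"         # ship, improve over time

print(tier(0.7, "low"))    # acceptable
print(tier(0.7, "high"))   # unacceptable: same output, different stakes
```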
Personas & Quality
Different users have different quality bars for the same feature.
| User Type | Quality Priority | Failure Tolerance | What "Good" Means |
|---|---|---|---|
| Expert | Correctness, depth | Low for their domain | Saves time, not effort |
| Novice | Clarity, guidance | Higher for minor errors | Unlocks capability |
| High-stakes | Accuracy, safety | Near zero | Trustworthy enough to act on |
| Casual | Speed, tone | Moderate | Entertaining, useful enough |
The question: are you evaluating with inputs from your actual user base, or inputs from your own mental model?
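A sketch of building the eval set from the real user mix instead of the builder's mental model. The persona shares and the prompt log are placeholders; the point is that the proportions come from traffic analytics, not intuition.

```python
import random

# Hypothetical persona mix, taken from traffic analytics rather than
# the team's assumption about who the "typical" user is.
PERSONA_SHARE = {"expert": 0.15, "novice": 0.55, "high_stakes": 0.10, "casual": 0.20}

def build_eval_set(prompts_by_persona: dict, size: int, seed: int = 0) -> list:
    """Sample logged prompts in proportion to actual persona traffic."""
    rng = random.Random(seed)
    eval_set = []
    for persona, share in PERSONA_SHARE.items():
        pool = prompts_by_persona[persona]   # real, messy logged prompts
        eval_set += rng.sample(pool, min(round(size * share), len(pool)))
    return eval_set
```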
The Invisible Failure
The most dangerous AI failure: users can't tell when the output is wrong.
| Failure Type | User Notices? | Risk |
|---|---|---|
| Obvious error | Yes — formatting, gibberish | Low — user works around it |
| Confident wrong answer | Sometimes | High — user may act on bad info |
| Subtle omission | Rarely | Very high — user doesn't know what's missing |
| Plausible hallucination | Almost never | Critical — trust destruction when discovered |
If your users can't detect errors, your evaluation system must. You can't rely on user feedback to find quality problems — by the time they notice, trust is already damaged.
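That means the eval pipeline needs automated checks that run on every sampled output, not just the ones users complain about. The checks below are illustrative placeholders (every function here is hypothetical); real detectors range from citation verification and schema validation to a second model grading the first.

```python
# Sketch of an automated check layer for failures users won't notice.

def cites_a_known_source(output: str, sources: list) -> bool:
    """Hypothetical check: does the answer reference at least one retrieved source?"""
    return any(src.lower() in output.lower() for src in sources)

def covers_required_points(output: str, required: list) -> bool:
    """Hypothetical check: are all must-mention items present? Catches subtle omissions."""
    return all(point.lower() in output.lower() for point in required)

def flag_invisible_failures(output: str, sources: list, required: list) -> list:
    """Return the names of the checks an output failed, for triage."""
    failed = []
    if not cites_a_known_source(output, sources):
        failed.append("possible_hallucination")
    if not covers_required_points(output, required):
        failed.append("subtle_omission")
    return failed
```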
Common Traps
| Trap | Pattern | Fix |
|---|---|---|
| Binary thinking | "It works" or "it's broken" | Think in distributions, score on spectrum |
| Demo bias | Showing best-case outputs | Evaluate against representative traffic |
| Metric gaming | Optimizing easy metrics | Weight dimensions by user value |
| One rubric | Same eval for all use cases | Different rubrics for different jobs |
| Frozen definition | "Good" defined once, never updated | Redefine as user expectations evolve |
| Builder's goggles | Testing with clean, ideal inputs | Use real user inputs (messy, ambiguous, adversarial) |
The Inversion Test
Before shipping, ask: what would make this AI product worse?
If the answer is:
- "Less consistent outputs" → Reliability is your constraint
- "More confident wrong answers" → Safety is your constraint
- "Slower responses" → You're latency-bound, not quality-bound
- "Works for experts but not novices" → Persona coverage gap
- "Great today, degrades over time" → You need drift monitoring
Context
- AI Evaluation — Score quality with the CRAFT checklist
- AI Requirements — Write PRDs that account for distributions
- Prediction Evaluation — The scoring pattern this extends
- Behavioural Economics — Why users can't always judge quality
- Product Design — Design principles that compound with AI