AI Product Principles
What does "good" mean when the same input produces different outputs?
This is the foundational question of AI product development. Until you answer it — with specificity, not hand-waving — everything downstream is guesswork. Your evaluation scores, your ship decisions, your user trust: all depend on a clear definition of quality that accounts for non-determinism.
Distribution Thinking
The most dangerous assumption carried over from traditional product management: that quality is binary. AI quality is a distribution.
TRADITIONAL: Pass ─── or ─── Fail
AI PRODUCT: Harmful (eliminate this) ─── Poor ─── Acceptable ─── Good ─── Excellent (celebrate this)
Your job isn't to make every output excellent. It's to shift the distribution right and cut off the left tail.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Mean quality | Average output score | Overall user experience |
| Variance | Spread between best and worst | User trust and predictability |
| Left tail | Worst-case outputs | Brand risk, safety, trust destruction |
| Right tail | Best-case outputs | Delight, word-of-mouth, competitive edge |
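A minimal sketch of how these four numbers can be computed from a batch of evaluation scores. The 1-to-5 scale and the tail thresholds are assumptions for illustration, not a standard.

```python
from statistics import mean, pvariance

def distribution_report(scores, left_tail_max=2.0, right_tail_min=4.5):
    """Summarise a batch of eval scores (assumed 1-5 scale) as a distribution.

    Thresholds are illustrative: scores <= left_tail_max count as the left
    tail (worst cases), scores >= right_tail_min as the right tail (delight).
    """
    n = len(scores)
    return {
        "mean_quality": mean(scores),      # overall user experience
        "variance": pvariance(scores),     # predictability and trust
        "left_tail_rate": sum(s <= left_tail_max for s in scores) / n,
        "right_tail_rate": sum(s >= right_tail_min for s in scores) / n,
    }

# Two batches with the same mean but very different left tails.
print(distribution_report([4, 4, 4, 4, 4]))
print(distribution_report([5, 5, 5, 4, 1]))
```

Both batches average 4.0, but to users they are different products: one is predictable, the other occasionally fails badly.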
Quality Dimensions
Not all quality is the same. Different products weight the dimensions below differently.
| Dimension | Question | Example: Chatbot | Example: Code Gen |
|---|---|---|---|
| Correctness | Is it factually right? | Critical | Critical |
| Completeness | Does it cover the full request? | Important | Critical |
| Relevance | Does it match the user's intent? | Critical | Important |
| Conciseness | Does it avoid unnecessary content? | Important | Less important |
| Safety | Could it cause harm? | Critical | Moderate |
| Tone | Does it match context? | Important | Low |
| Latency | How fast does it respond? | Important | Moderate |
The trap: measuring what's easy instead of what matters. Latency is easy to measure. "Does it match the user's intent?" requires judgment.
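One way to keep the easy metric from dominating: score every dimension, then combine them with weights that reflect user value for the specific product. The weights and scores below are invented for illustration; set real ones from user research.

```python
# Hypothetical per-product weights for the dimensions above (each row sums to 1.0).
WEIGHTS = {
    "chatbot":  {"correctness": 0.30, "relevance": 0.25, "safety": 0.20,
                 "completeness": 0.10, "conciseness": 0.05, "tone": 0.05, "latency": 0.05},
    "code_gen": {"correctness": 0.35, "completeness": 0.30, "relevance": 0.15,
                 "latency": 0.10, "safety": 0.05, "conciseness": 0.05, "tone": 0.00},
}

def weighted_quality(product: str, dimension_scores: dict) -> float:
    """Combine per-dimension scores (0-1) into a single weighted quality score."""
    weights = WEIGHTS[product]
    return sum(weights[dim] * dimension_scores.get(dim, 0.0) for dim in weights)

# A fast but off-topic answer: great latency, poor overall quality.
print(weighted_quality("chatbot", {"latency": 1.0, "correctness": 0.4, "relevance": 0.3}))
```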
The Three Tiers
Every AI product needs three defined quality levels:
| Tier | Definition | Example (Search) | Response |
|---|---|---|---|
| Excellent | Exceeds expectations, would share with others | Perfect result, surprising depth | Celebrate, study what made it work |
| Acceptable | Gets the job done, no complaints | Relevant results, right ballpark | Ship, improve over time |
| Unacceptable | Harms user trust or safety | Wrong answers presented confidently | Block, alert, investigate |
The gap between tiers varies by stakes:
| Stakes | Acceptable Range | Left Tail Tolerance |
|---|---|---|
| Low (casual chat, suggestions) | Wide | Moderate |
| Medium (productivity, content) | Moderate | Low |
| High (medical, financial, legal) | Narrow | Zero |
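A sketch of how the tier boundaries might tighten as stakes rise. The 0-to-1 score and the cutoffs are assumptions chosen to show the narrowing acceptable range, not calibrated values.

```python
# Illustrative tier cutoffs on a 0-1 quality score. As stakes rise, the
# unacceptable floor rises and the acceptable band narrows.
TIER_CUTOFFS = {              # (unacceptable_below, excellent_at_or_above)
    "low":    (0.30, 0.85),
    "medium": (0.50, 0.90),
    "high":   (0.80, 0.95),
}

def tier(score: float, stakes: str) -> str:
    """Map a quality score to excellent / acceptable / unacceptable."""
    floor, ceiling = TIER_CUTOFFS[stakes]
    if score < floor:
        return "unacceptable"   # block, alert, investigate
    if score >= ceiling:
        return "excellent"      # celebrate, study what made it work
    return "acceptable"         # ship, improve over time

print(tier(0.7, "low"))    # acceptable
print(tier(0.7, "high"))   # unacceptable: same output, different stakes
```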
Personas & Quality
Different users have different quality bars for the same feature.
| User Type | Quality Priority | Failure Tolerance | What "Good" Means |
|---|---|---|---|
| Expert | Correctness, depth | Low for their domain | Saves time, not effort |
| Novice | Clarity, guidance | Higher for minor errors | Unlocks capability |
| High-stakes | Accuracy, safety | Near zero | Trustworthy enough to act on |
| Casual | Speed, tone | Moderate | Entertaining, useful enough |
The question: are you evaluating with inputs from your actual user base, or inputs from your own mental model?
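A sketch of building the eval set from the real user mix instead of the builder's mental model. The persona shares and the prompt log are placeholders; the point is that the proportions come from traffic analytics, not intuition.

```python
import random

# Hypothetical persona mix, taken from traffic analytics rather than
# the team's assumption about who the "typical" user is.
PERSONA_SHARE = {"expert": 0.15, "novice": 0.55, "high_stakes": 0.10, "casual": 0.20}

def build_eval_set(prompts_by_persona: dict, size: int, seed: int = 0) -> list:
    """Sample logged prompts in proportion to actual persona traffic."""
    rng = random.Random(seed)
    eval_set = []
    for persona, share in PERSONA_SHARE.items():
        pool = prompts_by_persona[persona]   # real, messy logged prompts
        eval_set += rng.sample(pool, min(round(size * share), len(pool)))
    return eval_set
```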
The Invisible Failure
The most dangerous AI failure: users can't tell when the output is wrong.
| Failure Type | User Notices? | Risk |
|---|---|---|
| Obvious error | Yes — formatting, gibberish | Low — user works around it |
| Confident wrong answer | Sometimes | High — user may act on bad info |
| Subtle omission | Rarely | Very high — user doesn't know what's missing |
| Plausible hallucination | Almost never | Critical — trust destruction when discovered |
If your users can't detect errors, your evaluation system must. You can't rely on user feedback to find quality problems — by the time they notice, trust is already damaged.
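That means the eval pipeline needs automated checks that run on every sampled output, not just the ones users complain about. The checks below are illustrative placeholders (every function here is hypothetical); real detectors range from citation verification and schema validation to a second model grading the first.

```python
# Sketch of an automated check layer for failures users won't notice.

def cites_a_known_source(output: str, sources: list) -> bool:
    """Hypothetical check: does the answer reference at least one retrieved source?"""
    return any(src.lower() in output.lower() for src in sources)

def covers_required_points(output: str, required: list) -> bool:
    """Hypothetical check: are all must-mention items present? Catches subtle omissions."""
    return all(point.lower() in output.lower() for point in required)

def flag_invisible_failures(output: str, sources: list, required: list) -> list:
    """Return the names of the checks an output failed, for triage."""
    failed = []
    if not cites_a_known_source(output, sources):
        failed.append("possible_hallucination")
    if not covers_required_points(output, required):
        failed.append("subtle_omission")
    return failed
```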
Common Traps
| Trap | Pattern | Fix |
|---|---|---|
| Binary thinking | "It works" or "it's broken" | Think in distributions, score on spectrum |
| Demo bias | Showing best-case outputs | Evaluate against representative traffic |
| Metric gaming | Optimizing easy metrics | Weight dimensions by user value |
| One rubric | Same eval for all use cases | Different rubrics for different jobs |
| Frozen definition | "Good" defined once, never updated | Redefine as user expectations evolve |
| Builder's goggles | Testing with clean, ideal inputs | Use real user inputs (messy, ambiguous, adversarial) |
The Inversion Test
Before shipping, ask: what would make this AI product worse?
If the answer is:
- "Less consistent outputs" → Reliability is your constraint
- "More confident wrong answers" → Safety is your constraint
- "Slower responses" → You're latency-bound, not quality-bound
- "Works for experts but not novices" → Persona coverage gap
- "Great today, degrades over time" → You need drift monitoring
Context
- AI Evaluation — Score quality with the CRAFT checklist
- AI Requirements — Write PRDs that account for distributions
- Prediction Evaluation — The scoring pattern this extends
- Behavioural Economics — Why users can't always judge quality
- Product Design — Design principles that compound with AI