AI Product Requirements
How do you spec a product that never gives the same answer twice?
Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."
If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.
What Changes
| Traditional PRD | AI PRD |
|---|---|
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
The AI PRD Template
Every AI feature needs these sections beyond the standard PRD:
1. Job Definition
What job is the AI hired for? Be specific.
| Field | Bad | Good |
|---|---|---|
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
2. Quality Targets
Replace binary pass/fail with distribution targets:
For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests
Worked example:
For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
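One way to make targets like these concrete is to express them as executable checks over scored eval results. A minimal sketch, assuming you already have per-request scores from your eval pipeline; the dimension names and thresholds mirror the worked example above and are otherwise placeholders:

```python
from dataclasses import dataclass

@dataclass
class QualityTarget:
    dimension: str        # e.g. "relevance"
    min_score: float      # minimum acceptable score for a single output
    min_pass_rate: float  # fraction of requests that must meet min_score

# Targets copied from the worked example above.
TARGETS = [
    QualityTarget("relevance", min_score=4.0, min_pass_rate=0.85),
    QualityTarget("brand_voice", min_score=3.0, min_pass_rate=0.90),
]

def check_targets(scored_outputs: list[dict[str, float]]) -> dict[str, bool]:
    """scored_outputs: one dict of {dimension: score} per evaluated request."""
    results = {}
    for target in TARGETS:
        passing = sum(
            1 for scores in scored_outputs
            if scores.get(target.dimension, 0.0) >= target.min_score
        )
        pass_rate = passing / len(scored_outputs)
        results[target.dimension] = pass_rate >= target.min_pass_rate
    return results
```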
3. Input Universe
Map the full range of what your AI will receive. Not just the happy path.
| Input Category | Examples | Expected % | Quality Expectation |
|---|---|---|---|
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |
Questions to ask for each category:
- What happens with sarcasm, slang, typos?
- What happens when the user provides correct information that conflicts with training data?
- What happens when the same question is asked five different ways?
- What happens when the input is in a language you didn't optimize for?
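To keep the input universe honest over time, it helps to tag incoming requests (or eval samples) against these categories and compare the observed mix with the assumptions in the table. A rough sketch, with categories and expected shares taken from the table above; how inputs get labeled (classifier, manual review) is left out:

```python
from collections import Counter

# Expected share of traffic per category, from the input-universe table.
EXPECTED_MIX = {
    "clean": 0.60,
    "ambiguous": 0.20,
    "edge_case": 0.10,
    "adversarial": 0.05,
    "out_of_scope": 0.05,
}

def compare_mix(labeled_inputs: list[str], tolerance: float = 0.05) -> dict[str, str]:
    """labeled_inputs: one category label per observed request."""
    counts = Counter(labeled_inputs)
    total = len(labeled_inputs)
    report = {}
    for category, expected in EXPECTED_MIX.items():
        observed = counts.get(category, 0) / total
        drift = " (drift)" if abs(observed - expected) > tolerance else ""
        report[category] = f"observed {observed:.0%} vs expected {expected:.0%}{drift}"
    return report
```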
4. Failure Budget
How many bad outputs are acceptable? Define this before building, not after.
| Failure Type | Budget | Response |
|---|---|---|
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |
The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."
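The budget only works if you can measure spend against it. A sketch of that comparison, using the budget figures from the table; where the failure labels come from (evals, incident reports) is assumed to exist elsewhere:

```python
# Failure budgets from the table above, as maximum allowed rates.
FAILURE_BUDGET = {
    "harmful": 0.0001,   # 0.01% tolerated
    "wrong": 0.05,
    "unhelpful": 0.15,
    "imperfect": 0.40,
}

def budget_report(failure_labels: list[str], total_requests: int) -> dict[str, dict]:
    """failure_labels: one failure-type label per bad output observed."""
    report = {}
    for failure_type, budget in FAILURE_BUDGET.items():
        observed = failure_labels.count(failure_type) / total_requests
        report[failure_type] = {
            "observed_rate": observed,
            "budget": budget,
            "over_budget": observed > budget,
        }
    return report
```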
5. Refusal Spec
What should the AI refuse to do? This is as important as what it should do.
| Category | Action | Response to User |
|---|---|---|
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
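The refusal spec translates naturally into a small routing map that prompt logic or post-processing can share with the PRD. A sketch mirroring the table above; the step that classifies a request into one of these categories is assumed to live elsewhere:

```python
from enum import Enum

class RefusalCategory(Enum):
    OUT_OF_SCOPE = "out_of_scope"
    HARMFUL = "harmful"
    UNCERTAIN = "uncertain"
    AMBIGUOUS = "ambiguous"

# Category -> (action, user-facing response), mirroring the refusal table.
REFUSAL_SPEC = {
    RefusalCategory.OUT_OF_SCOPE: ("refuse_with_redirect",
                                   "I can help with X. For Y, try..."),
    RefusalCategory.HARMFUL: ("refuse_firmly", "Standard refusal"),
    RefusalCategory.UNCERTAIN: ("acknowledge_uncertainty",
                                "I'm not confident about this. Here's what I know..."),
    RefusalCategory.AMBIGUOUS: ("ask_for_clarity",
                                "Could you clarify whether you mean A or B?"),
}

def respond(category: RefusalCategory) -> str:
    action, response = REFUSAL_SPEC[category]
    return response
```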
6. Human Fallback
Every AI feature needs an escape hatch. What happens when the AI can't handle it?
| Trigger | Escalation Path | SLA |
|---|---|---|
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
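At the level of a single request, the fallback rules reduce to a small routing decision. A sketch assuming you have a confidence score and a safety-filter flag for each output; the threshold and return labels are illustrative placeholders, not part of the template:

```python
def route_output(output: str, confidence: float, safety_flagged: bool,
                 confidence_threshold: float = 0.7) -> str:
    """Decide whether an output ships, routes to a human, or is blocked."""
    if safety_flagged:
        # Safety filter triggers: block, log, review immediately.
        return "blocked_for_review"
    if confidence < confidence_threshold:
        # Low-confidence outputs route to a human in real time.
        return "routed_to_human"
    return "shipped"
```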
7. Eval Strategy

Written into the PRD, not treated as an afterthought.
| What | How | When |
|---|---|---|
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |
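Because automated evals run on every change, they are usually expressed as tests the CI pipeline can fail. A hedged sketch in pytest style; `run_model` and `score_relevance` are explicit placeholders for whatever your eval harness provides, and the threshold mirrors the worked quality target:

```python
import json

def run_model(prompt: str) -> str:
    """Placeholder: call your model or pipeline here."""
    raise NotImplementedError

def score_relevance(output: str, case: dict) -> float:
    """Placeholder: apply your relevance rubric (1-5) here."""
    raise NotImplementedError

def test_relevance_target():
    with open("golden_dataset.jsonl") as f:
        cases = [json.loads(line) for line in f]
    scores = [score_relevance(run_model(c["input"]), c) for c in cases]
    pass_rate = sum(s >= 4.0 for s in scores) / len(scores)
    # Mirrors the PRD target: relevance >= 4/5 for 85% of requests.
    assert pass_rate >= 0.85, f"Relevance pass rate {pass_rate:.0%} below 85% target"
```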
Example Pairs
Every AI PRD should include input/output examples at each quality tier:
| Quality | Input | Expected Output | Why This Score |
|---|---|---|---|
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |
Include at least 5 examples per tier. These become the seed of your golden dataset.
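Since the example pairs seed the golden dataset, it helps to store them in a structured, machine-readable form from day one. A minimal sketch of one possible record format (field names are assumptions, not a standard):

```python
import json

# One golden-dataset record per example pair, tagged with its quality tier.
example_pair = {
    "input": "Email for SaaS founders about our new analytics feature",
    "expected_output": "Specific, brand-voice email with relevant CTA",
    "tier": "excellent",  # excellent | acceptable | unacceptable
    "rationale": "Matches intent, voice, actionable",
    "dimensions": {"relevance": 5, "brand_voice": 5},
}

# Append to a JSONL file so the eval harness can load records line by line.
with open("golden_dataset.jsonl", "a") as f:
    f.write(json.dumps(example_pair) + "\n")
```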
The PRD Checklist
Before signing off on an AI PRD:
- Quality targets use distributions, not binary pass/fail
- Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
- Failure budget defined and stakeholder-approved
- Refusal spec covers what the AI should NOT do
- Human fallback path defined
- Eval strategy included with timeline
- Example pairs at three quality tiers
- Cost, latency, and quality tradeoffs documented
- Engineers, designers, and data scientists contributed (not a PM monologue)
Context
- AI Product Principles — What "good" means
- AI Evaluation — How to measure quality
- Jobs To Be Done — Define the job before the PRD
- JTBD Interviews — Understand real user needs
- Product Design — Design principles for the interface layer