AI Product Requirements
How do you spec a product that never gives the same answer twice?
Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."
If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.
What Changes
| Traditional PRD | AI PRD |
|---|---|
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
The AI PRD Template
Every AI feature needs these sections beyond the standard PRD:
1. Job Definition
What job is the AI hired for? Be specific.
| Field | Bad | Good |
|---|---|---|
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
2. Quality Targets
Replace binary pass/fail with distribution targets:
For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests
Worked example:
For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
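One way to make targets like these concrete is to express them as executable checks over scored eval results. A minimal sketch, assuming you already have per-request scores from your eval pipeline; the dimension names and thresholds mirror the worked example above and are otherwise placeholders:

```python
from dataclasses import dataclass

@dataclass
class QualityTarget:
    dimension: str        # e.g. "relevance"
    min_score: float      # minimum acceptable score for a single output
    min_pass_rate: float  # fraction of requests that must meet min_score

# Targets copied from the worked example above.
TARGETS = [
    QualityTarget("relevance", min_score=4.0, min_pass_rate=0.85),
    QualityTarget("brand_voice", min_score=3.0, min_pass_rate=0.90),
]

def check_targets(scored_outputs: list[dict[str, float]]) -> dict[str, bool]:
    """scored_outputs: one dict of {dimension: score} per evaluated request."""
    results = {}
    for target in TARGETS:
        passing = sum(
            1 for scores in scored_outputs
            if scores.get(target.dimension, 0.0) >= target.min_score
        )
        pass_rate = passing / len(scored_outputs)
        results[target.dimension] = pass_rate >= target.min_pass_rate
    return results
```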
3. Input Universe
Map the full range of what your AI will receive. Not just the happy path.
| Input Category | Examples | Expected % | Quality Expectation |
|---|---|---|---|
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |
Questions to ask for each category:
- What happens with sarcasm, slang, typos?
- What happens when the user provides correct information that conflicts with training data?
- What happens when the same question is asked five different ways?
- What happens when the input is in a language you didn't optimize for?
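To keep the input universe honest over time, it helps to tag incoming requests (or eval samples) against these categories and compare the observed mix with the assumptions in the table. A rough sketch, with categories and expected shares taken from the table above; how inputs get labeled (classifier, manual review) is left out:

```python
from collections import Counter

# Expected share of traffic per category, from the input-universe table.
EXPECTED_MIX = {
    "clean": 0.60,
    "ambiguous": 0.20,
    "edge_case": 0.10,
    "adversarial": 0.05,
    "out_of_scope": 0.05,
}

def compare_mix(labeled_inputs: list[str], tolerance: float = 0.05) -> dict[str, str]:
    """labeled_inputs: one category label per observed request."""
    counts = Counter(labeled_inputs)
    total = len(labeled_inputs)
    report = {}
    for category, expected in EXPECTED_MIX.items():
        observed = counts.get(category, 0) / total
        drift = " (drift)" if abs(observed - expected) > tolerance else ""
        report[category] = f"observed {observed:.0%} vs expected {expected:.0%}{drift}"
    return report
```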
4. Failure Budget
How many bad outputs are acceptable? Define this before building, not after.
| Failure Type | Budget | Response |
|---|---|---|
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |
The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."
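The budget only works if you can measure spend against it. A sketch of that comparison, using the budget figures from the table; where the failure labels come from (evals, incident reports) is assumed to exist elsewhere:

```python
# Failure budgets from the table above, as maximum allowed rates.
FAILURE_BUDGET = {
    "harmful": 0.0001,   # 0.01% tolerated
    "wrong": 0.05,
    "unhelpful": 0.15,
    "imperfect": 0.40,
}

def budget_report(failure_labels: list[str], total_requests: int) -> dict[str, dict]:
    """failure_labels: one failure-type label per bad output observed."""
    report = {}
    for failure_type, budget in FAILURE_BUDGET.items():
        observed = failure_labels.count(failure_type) / total_requests
        report[failure_type] = {
            "observed_rate": observed,
            "budget": budget,
            "over_budget": observed > budget,
        }
    return report
```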
5. Refusal Spec
What should the AI refuse to do? This is as important as what it should do.
| Category | Action | Response to User |
|---|---|---|
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
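The refusal spec translates naturally into a small routing map that prompt logic or post-processing can share with the PRD. A sketch mirroring the table above; the step that classifies a request into one of these categories is assumed to live elsewhere:

```python
from enum import Enum

class RefusalCategory(Enum):
    OUT_OF_SCOPE = "out_of_scope"
    HARMFUL = "harmful"
    UNCERTAIN = "uncertain"
    AMBIGUOUS = "ambiguous"

# Category -> (action, user-facing response), mirroring the refusal table.
REFUSAL_SPEC = {
    RefusalCategory.OUT_OF_SCOPE: ("refuse_with_redirect",
                                   "I can help with X. For Y, try..."),
    RefusalCategory.HARMFUL: ("refuse_firmly", "Standard refusal"),
    RefusalCategory.UNCERTAIN: ("acknowledge_uncertainty",
                                "I'm not confident about this. Here's what I know..."),
    RefusalCategory.AMBIGUOUS: ("ask_for_clarity",
                                "Could you clarify whether you mean A or B?"),
}

def respond(category: RefusalCategory) -> str:
    action, response = REFUSAL_SPEC[category]
    return response
```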
6. Human Fallback
Every AI feature needs an escape hatch. What happens when the AI can't handle it?
| Trigger | Escalation Path | SLA |
|---|---|---|
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
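At the level of a single request, the fallback rules reduce to a small routing decision. A sketch assuming you have a confidence score and a safety-filter flag for each output; the threshold and return labels are illustrative placeholders, not part of the template:

```python
def route_output(output: str, confidence: float, safety_flagged: bool,
                 confidence_threshold: float = 0.7) -> str:
    """Decide whether an output ships, routes to a human, or is blocked."""
    if safety_flagged:
        # Safety filter triggers: block, log, review immediately.
        return "blocked_for_review"
    if confidence < confidence_threshold:
        # Low-confidence outputs route to a human in real time.
        return "routed_to_human"
    return "shipped"
```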
7. Eval Strategy

Written into the PRD, not treated as an afterthought.
| What | How | When |
|---|---|---|
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |
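Because automated evals run on every change, they are usually expressed as tests the CI pipeline can fail. A hedged sketch in pytest style; `run_model` and `score_relevance` are explicit placeholders for whatever your eval harness provides, and the threshold mirrors the worked quality target:

```python
import json

def run_model(prompt: str) -> str:
    """Placeholder: call your model or pipeline here."""
    raise NotImplementedError

def score_relevance(output: str, case: dict) -> float:
    """Placeholder: apply your relevance rubric (1-5) here."""
    raise NotImplementedError

def test_relevance_target():
    with open("golden_dataset.jsonl") as f:
        cases = [json.loads(line) for line in f]
    scores = [score_relevance(run_model(c["input"]), c) for c in cases]
    pass_rate = sum(s >= 4.0 for s in scores) / len(scores)
    # Mirrors the PRD target: relevance >= 4/5 for 85% of requests.
    assert pass_rate >= 0.85, f"Relevance pass rate {pass_rate:.0%} below 85% target"
```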
Example Pairs
Every AI PRD should include input/output examples at each quality tier:
| Quality | Input | Expected Output | Why This Score |
|---|---|---|---|
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |
Include at least 5 examples per tier. These become the seed of your golden dataset.
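Since the example pairs seed the golden dataset, it helps to store them in a structured, machine-readable form from day one. A minimal sketch of one possible record format (field names are assumptions, not a standard):

```python
import json

# One golden-dataset record per example pair, tagged with its quality tier.
example_pair = {
    "input": "Email for SaaS founders about our new analytics feature",
    "expected_output": "Specific, brand-voice email with relevant CTA",
    "tier": "excellent",  # excellent | acceptable | unacceptable
    "rationale": "Matches intent, voice, actionable",
    "dimensions": {"relevance": 5, "brand_voice": 5},
}

# Append to a JSONL file so the eval harness can load records line by line.
with open("golden_dataset.jsonl", "a") as f:
    f.write(json.dumps(example_pair) + "\n")
```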
The PRD Checklist
Before signing off on an AI PRD:
- Quality targets use distributions, not binary pass/fail
- Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
- Failure budget defined and stakeholder-approved
- Refusal spec covers what the AI should NOT do
- Human fallback path defined
- Eval strategy included with timeline
- Example pairs at three quality tiers
- Cost, latency, and quality tradeoffs documented
- Engineers, designers, and data scientists contributed (not a PM monologue)
Context
- AI Product Principles — What "good" means
- AI Evaluation — How to measure quality
- Jobs To Be Done — Define the job before the PRD
- JTBD Interviews — Understand real user needs
- Product Design — Design principles for the interface layer