AI Product Requirements
How do you spec a product that never gives the same answer twice?
Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."
If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.
What Changes
| Traditional PRD | AI PRD |
|---|---|
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
| Spec is for humans | Spec is for humans AND agents |
| Ship once | Commission in stages |
| Features: done or not | Features: Install → Test → Operational → Optimize |
Strategic Gate
Don't write requirements for something you shouldn't build. Answer these before any PRD work begins.
Build Decision
| Question | Red Flag | Green Light |
|---|---|---|
| What job does this do that wasn't possible before? | "It's faster" (speed isn't a new job) | "It eliminates a 20h/week manual process entirely" |
| Does using it make it better? | Value is static — v1 = v100 | Compound flywheel — every use improves the next |
| Who loses if we don't build it? | "We'd miss a trend" | "Customer X loses 40h/month and will churn" |
| What's our unfair edge? | "We'll use AI" (everyone can) | "We have proprietary data / domain workflow / distribution" |
| Does this fit the build order? | Dependencies unresolved upstream | All blockers cleared or this IS the blocker |
Priority Score
Rate each dimension 1–5 with specific evidence. No number without a reason.
| Dimension | Question | Score | Evidence |
|---|---|---|---|
| Pain | How badly does the status quo hurt? | /5 | |
| Demand | Are people actively seeking this? | /5 | |
| Edge | Do we have an unfair advantage? | /5 | |
| Trend | Is the tailwind growing or dying? | /5 | |
| Conversion | Can we reach buyers efficiently? | /5 | |
| Composite | Weighted average | /5 | |
Kill signal: If composite < 3.0, stop. Revisit when conditions change.
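The composite and kill signal can be computed mechanically. A minimal sketch in TypeScript, assuming equal weights (the table says "weighted average" without fixing weights; an equal-weight mean matches the `priority_score` example in the frontmatter contract later in this guide). The type and function names are illustrative.

```typescript
// Priority scoring sketch. Equal weights are an assumption; swap in
// real weights if your team has agreed on them.
type PriorityScores = {
  pain: number;
  demand: number;
  edge: number;
  trend: number;
  conversion: number;
};

const KILL_THRESHOLD = 3.0;

function composite(s: PriorityScores): number {
  const values = [s.pain, s.demand, s.edge, s.trend, s.conversion];
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.round(mean * 10) / 10; // one decimal, like priority_score
}

function shouldKill(s: PriorityScores): boolean {
  return composite(s) < KILL_THRESHOLD;
}
```

A score of `{ pain: 4, demand: 3, edge: 5, trend: 4, conversion: 3 }` yields 3.8 and passes the gate; anything averaging below 3.0 triggers the kill signal.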
PRD Sections
Every AI feature needs these sections beyond the standard PRD. Sections 1–4 define the problem. Sections 5–9 define the solution. Sections 10–12 define the build contract.
1. Job Definition
What job is the AI hired for? Be specific.
| Field | Bad | Good |
|---|---|---|
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
| Compound | Not mentioned | "Each sent email trains brand voice model — 100th email needs 5% edits" |
The compound test: Does using this product make this product better? If yes, describe the flywheel. If no, you're building a tool, not a moat.
2. Quality Targets
Replace binary pass/fail with distribution targets:
For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests
Worked example:
For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
See AI Product Principles for quality dimensions, three tiers, and distribution thinking.
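A distribution target is straightforward to check against a batch of scored outputs. A minimal sketch, with illustrative names (`QualityTarget`, `meetsTarget` are not from any library):

```typescript
// Distribution-target check: a target passes when at least
// minPassRate of the scored requests meet or exceed minScore.
type QualityTarget = {
  dimension: string;
  minScore: number;    // e.g. 4 on a 1-5 scale
  minPassRate: number; // e.g. 0.85 -> 85% of requests
};

function meetsTarget(scores: number[], t: QualityTarget): boolean {
  if (scores.length === 0) return false; // no data, no claim
  const passing = scores.filter((s) => s >= t.minScore).length;
  return passing / scores.length >= t.minPassRate;
}
```

So a batch where 9 of 10 outputs score ≥4 passes the "≥4/5 for 85%" relevance target, but would fail if the target were tightened to 95%.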
3. Input Universe
Map the full range of what your AI will receive. Not just the happy path.
| Input Category | Examples | Expected % | Quality Expectation |
|---|---|---|---|
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |
Questions to ask for each category:
- What happens with sarcasm, slang, typos?
- What happens when the user provides correct information that conflicts with training data?
- What happens when the same question is asked five different ways?
- What happens when the input is in a language you didn't optimize for?
4. Failure Budget
How many bad outputs are acceptable? Define this before building, not after.
| Failure Type | Budget | Response |
|---|---|---|
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |
The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."
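The budget table translates directly into a monitoring check. A sketch, with the budget values copied from the table above and helper names that are assumptions:

```typescript
// Failure-budget check: returns the failure types whose observed rate
// meets or exceeds their budget (budgets are "less than" limits).
const FAILURE_BUDGETS: Record<string, number> = {
  harmful: 0.0001, // <0.01% tolerated
  wrong: 0.05,     // <5%
  unhelpful: 0.15, // <15%
  imperfect: 0.4,  // <40%
};

function overBudget(
  counts: Record<string, number>,
  total: number
): string[] {
  return Object.entries(FAILURE_BUDGETS)
    .filter(([type, budget]) => (counts[type] ?? 0) / total >= budget)
    .map(([type]) => type);
}
```

Run against a window of recent traffic, this gives stakeholders the "here's our plan for the 5%" conversation a concrete trigger.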
5. Refusal Spec
What should the AI refuse to do? This is as important as what it should do.
| Category | Action | Response to User |
|---|---|---|
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
| Confidence below threshold | Route to human | "Let me connect you with someone who can help." |
6. Human Fallback
Every AI feature needs an escape hatch. What happens when the AI can't handle it?
| Trigger | Escalation Path | SLA |
|---|---|---|
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
7. Example Pairs
Every AI PRD must include input/output examples at each quality tier. Minimum 5 per tier. These become the seed of your golden dataset.
| Quality | Input | Expected Output | Why This Score |
|---|---|---|---|
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |
The quality of your examples determines the quality of your evals. Garbage examples, garbage evals, shipping blind.
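Since these pairs seed the golden dataset, it helps to give them a shape that tooling can check. An illustrative sketch (field names are assumptions, not a standard):

```typescript
// Golden-dataset entry shape seeded from the example-pair table,
// plus a check that every tier meets the minimum of 5 examples.
type Tier = "excellent" | "acceptable" | "unacceptable";

type GoldenExample = {
  tier: Tier;
  input: string;
  expectedOutput: string;
  rationale: string; // the "Why This Score" column
};

const MIN_PER_TIER = 5;

function tierCoverage(examples: GoldenExample[]): Record<Tier, number> {
  const counts: Record<Tier, number> = {
    excellent: 0,
    acceptable: 0,
    unacceptable: 0,
  };
  for (const e of examples) counts[e.tier]++;
  return counts;
}

function meetsMinimum(examples: GoldenExample[]): boolean {
  return Object.values(tierCoverage(examples)).every((n) => n >= MIN_PER_TIER);
}
```

Gating the PRD sign-off on `meetsMinimum` makes the "minimum 5 per tier" rule enforceable rather than aspirational.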
8. Eval Strategy
Written into the PRD, not treated as afterthought.
| What | How | When |
|---|---|---|
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |
9. Cost, Latency, Quality Triangle
You can't have all three. Make the tradeoff explicit.
| Constraint | Implication | Decision |
|---|---|---|
| Cost ceiling | Limits model size, call frequency | "Max $0.02 per request → use cached responses for top 100 queries" |
| Latency ceiling | Limits model complexity, chain length | "Must respond in <2s → single call, no chain-of-thought" |
| Quality floor | Limits cost savings | "Accuracy ≥85% on golden dataset → cannot use smallest model" |
The honest question: which constraint will you relax when two conflict? Write it down. If you don't decide now, the engineer will decide for you.
Build Contract
The PRD must contain the full feature set in a parseable format. This is the handoff that engineering builds from.
10. Feature / Function / Outcome Table
Every feature has a function (what it does), an outcome (why it matters), a job (which user need), and a state.
| # | Feature | Function | Outcome | Job | State |
|---|---|---|---|---|---|
| 1 | Answer Library | Store approved RFP answers by category | Auto-fill future bids from past wins | Reduce bid prep time | Gap |
| 2 | Brand Voice Model | Learn tone from sent emails | Generated content matches company voice | Maintain consistency at scale | Stub |
| 3 | Confidence Score | Display AI certainty per output | User knows when to review vs. trust | Reduce review burden | Not verified |
State values (exact enum — parseable by downstream tooling):
| State | Meaning |
|---|---|
| Live | Deployed, tested, in production |
| Built | Code complete, not yet deployed |
| Dormant | Built but not wired / activated |
| Partial | Some functionality working |
| Not verified | Deployed but not tested against acceptance criteria |
| Gap | Not built, needed |
| Stub | Placeholder exists, no real implementation |
| Broken | Was working, now failing |
11. Build Order
Features don't ship in parallel. Dependencies determine sequence. Each sprint references features by # from the table above.
| Sprint | Features | What | Effort | Acceptance |
|---|---|---|---|---|
| 0 | #3 | Wire confidence scoring to all AI outputs | 2d | Score displays on every generated output |
| 1 | #1 | Seed answer library with 50 entries from past bids | 3d | User can search and retrieve answers by category |
| 2 | #2 | Train brand voice on 100 sent emails | 5d | Generated emails score ≥3/5 on brand voice eval |
The build order encodes the dependency graph. If Sprint 2 depends on Sprint 1, say so. If sprints can run in parallel, mark them.
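Encoding the graph explicitly lets tooling catch ordering mistakes. A minimal sketch, with an illustrative structure (the real PRD parser may differ):

```typescript
// Sprint dependency check: sprints are listed in execution order, and
// every dependency must appear earlier in the list.
type Sprint = {
  id: number;
  features: number[];  // "#" refs into the feature table
  dependsOn: number[]; // sprint ids that must finish first
};

function validOrder(sprints: Sprint[]): boolean {
  const done = new Set<number>();
  for (const s of sprints) {
    if (!s.dependsOn.every((d) => done.has(d))) return false;
    done.add(s.id);
  }
  return true;
}
```

Sprints with empty `dependsOn` arrays are the ones that can run in parallel.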
12. Commissioning Stages
Every feature progresses through four stages. This replaces the vague "MVP → iterate" pattern.
| Stage | Definition | Gate to Next |
|---|---|---|
| Install | Code deployed, feature exists | Can be invoked without errors |
| Test | Running against eval suite | Passes acceptance criteria from build order |
| Operational | Handling real traffic, monitored | Quality targets met for 7 consecutive days |
| Optimize | Tuning for cost, latency, edge cases | Improvement rate < threshold (diminishing returns) |
Track commissioning per feature:
| # | Feature | Install | Test | Operational | Optimize |
|---|---|---|---|---|---|
| 1 | Answer Library | Pass | Pass | — | — |
| 2 | Brand Voice Model | Pass | — | — | — |
| 3 | Confidence Score | — | — | — | — |
Aggregate commissioning % = (completed stages / total stages) × 100. This is the real progress metric — not "features shipped."
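The aggregate metric falls out of the tracking table directly. A small sketch (the record shape is an assumption):

```typescript
// Aggregate commissioning: completed stages / total stages, across
// all features in the tracking table.
const STAGES = ["install", "test", "operational", "optimize"] as const;

type Commissioning = Record<(typeof STAGES)[number], boolean>;

function commissioningPercent(features: Commissioning[]): number {
  const total = features.length * STAGES.length;
  const done = features
    .flatMap((f) => STAGES.map((stage) => f[stage]))
    .filter(Boolean).length;
  return (done / total) * 100;
}
```

The tracking table above (Answer Library through Test, Brand Voice Model through Install, Confidence Score not started) works out to 3 of 12 stages, or 25%.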
Agent-Facing Spec
If AI agents (coding agents, orchestration agents) will build from or operate within this PRD, the spec must be machine-readable, not just human-readable.
Frontmatter Contract
The PRD index file must include structured frontmatter that downstream parsers and agents can read:
```yaml
---
title: "Feature Name"
slug: feature-slug
status: planning | building | testing | operational | optimizing
priority_pain: 4
priority_demand: 3
priority_edge: 5
priority_trend: 4
priority_conversion: 3
priority_score: 3.8
kill_date: 2025-06-01
blocked_by: [identity-access]
last_scored: 2025-03-15
---
```
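A downstream agent can validate this contract after parsing (parsing itself would use a YAML/frontmatter library such as gray-matter; it is omitted here). A sketch of the consistency checks, assuming equal-weight scoring so `priority_score` should equal the mean of the five dimension fields:

```typescript
// Frontmatter consistency check: status must be in the enum, and
// priority_score must match the mean of the five dimension scores.
const STATUSES = ["planning", "building", "testing", "operational", "optimizing"];

type Frontmatter = {
  title: string;
  slug: string;
  status: string;
  priority_pain: number;
  priority_demand: number;
  priority_edge: number;
  priority_trend: number;
  priority_conversion: number;
  priority_score: number;
};

function validateFrontmatter(fm: Frontmatter): string[] {
  const errors: string[] = [];
  if (!STATUSES.includes(fm.status)) errors.push(`bad status: ${fm.status}`);
  const mean =
    (fm.priority_pain + fm.priority_demand + fm.priority_edge +
     fm.priority_trend + fm.priority_conversion) / 5;
  if (Math.abs(mean - fm.priority_score) > 0.05) {
    errors.push(`priority_score ${fm.priority_score} != mean ${mean}`);
  }
  return errors;
}
```

The example above validates cleanly: (4 + 3 + 5 + 4 + 3) / 5 = 3.8.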
Commands Block
If agents will build or test this feature, include executable commands:
```markdown
## Commands

- Build: `nx run sales-crm:build`
- Test: `nx run sales-crm:test --coverage`
- Lint: `nx run sales-crm:lint`
- E2E: `nx run sales-crm-e2e:e2e`
- Eval: `nx run sales-crm:eval --dataset=golden`
```
Boundaries
What agents must never do:
```markdown
## Boundaries

- ✅ Always: Run tests before commits, follow naming conventions
- ✅ Always: Check feature state before modifying — don't break Live features
- ⚠️ Ask first: Database schema changes, adding dependencies, modifying shared libs
- 🚫 Never: Commit secrets, modify auth middleware without review, skip eval suite
- 🚫 Never: Change API contracts without updating downstream consumers
```
Test Contract
Map each feature to its acceptance test:
| # | Feature | Test File | Assertion |
|---|---|---|---|
| 1 | Answer Library | apps/sales-crm/tests/answer-library.spec.ts | Returns top 3 matches with score ≥0.8 |
| 2 | Brand Voice Model | apps/sales-crm/tests/brand-voice.eval.ts | ≥3/5 on brand dimension for 90% of golden set |
| 3 | Confidence Score | apps/sales-crm/tests/confidence.spec.ts | Score renders on all AI output components |
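For concreteness, here is a hypothetical sketch of the logic behind the Answer Library assertion ("top 3 matches with score ≥0.8"). The `Match` type and `topMatches` helper are invented for illustration; the real implementation and its retrieval scoring live in the app:

```typescript
// Hypothetical retrieval helper: keep matches at or above the score
// threshold, sort best-first, return at most k.
type Match = { answer: string; score: number };

function topMatches(all: Match[], k = 3, minScore = 0.8): Match[] {
  return [...all]
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The acceptance test then asserts on exactly what the contract table states: three results, all scoring at or above 0.8, in descending order.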
The PRD Checklist
Before signing off on an AI PRD:
Problem definition:
- Job definition uses compound test — does usage improve the product?
- Strategic gate passed — priority score ≥ 3.0 with evidence
- Problem stated as SIO (Situation, Intention, Obstacle) plus the Hardest Thing
Quality specification:
- Quality targets use distributions, not binary pass/fail
- Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
- Failure budget defined and stakeholder-approved
- Refusal spec covers what the AI should NOT do
- Human fallback path defined
- Example pairs at three quality tiers (minimum 5 per tier)
- Cost, latency, quality tradeoffs documented
Build contract:
- Feature/Function/Outcome table with State column (exact enum values)
- Build order with sprint sequence, feature # refs, and acceptance criteria
- Commissioning stages tracked per feature (Install → Test → Operational → Optimize)
- Eval strategy included with timeline
Agent readiness:
- Frontmatter contract with scoring fields and status
- Commands block with exact executable commands
- Boundaries defined (always / ask first / never)
- Test contract mapping features to test files and assertions
Cross-checks:
- Engineers, designers, and data scientists contributed (not PM monologue)
- Dependencies on other PRDs declared
- Kill date set — when does this stop making sense?
Positioning
How This Guide Differs. Current PRD templates optimize for human alignment — getting stakeholders to agree on what to build. This guide adds three layers they don't address:
Distribution thinking — AI outputs are stochastic. Quality targets, failure budgets, and input universe mapping account for variance that traditional PRDs treat as bugs. See AI Product Principles.
Build contract — Feature/Function/Outcome tables with State enums, sprint sequencing with feature # refs, and commissioning stages per feature. The PRD isn't just an alignment doc — it's what engineering builds from.
Agent readiness — Frontmatter contracts, commands blocks, boundary definitions, and test contracts. The PRD is consumed by coding agents, not just human engineers. This is the "Agent Experience" layer that Addy Osmani names but doesn't fully prescribe.
Context
- Trust Architecture — Why boundaries must be structural, not intentional — the theory behind failure budgets, refusal specs, and agent boundaries
- AI Product Principles — What "good" means for non-deterministic output
- Positioning — Key point of difference
- AI Evaluation — How to measure quality with CRAFT
- AI Observability — Where it fails and why
- Jobs To Be Done — Define the job before the PRD
- JTBD Interviews — Understand real user needs
- Product Design — Design principles for the interface layer
- Pictures Templates — Pre-flight maps that feed the PRD
- Commissioning — The Install → Test → Operational → Optimize model
Links
- Lenny Rachitsky — PRD Templates — The starting point most PMs use. Problem before solution, one page, non-goals section. No AI awareness, no build contract.
- Miqdad Jaffer (OpenAI) — AI PRD Template — Battle-tested AI PRD from someone who shipped at Shopify and advises at OpenAI. Strategic alignment, AI-specific considerations, GTM. Human-only document — no agent-facing spec or commissioning model.
- Addy Osmani (Google) — Specs for AI Agents — Names the concept: "Agent Experience (AX)." Six core areas, modular context, boundaries. Focused on coding agents consuming specs, not on the PRD production process itself.
- GitHub Spec Kit — Spec-Driven Development — Four-phase gated workflow. Specs as executable artifacts. Open source, agent-agnostic. No quality distributions, no failure budgets, no commissioning.
- Edo van Royen — PRD Template Meta-Analysis — Reviews 13+ templates (Intercom, Shape Up, Miro, Airbnb). Every good template forces problem clarity before solutions.
Questions
If AI outputs are distributions, not deterministic — when is a spec complete enough to build from?
- What's the minimum number of example pairs per tier before evals become statistically meaningful?
- Should the commissioning model use four stages or five — is there a gate between "Operational" and "Trusted"?
- How do you score Edge (dimension 3) when the edge is speed-to-market rather than proprietary data?
- What happens when the strategic gate kills a PRD that engineering has already started building?
- Is the agent-facing spec section premature for teams without coding agents — or does writing it force the right discipline regardless?