AI Product Requirements
How do you spec a product that never gives the same answer twice?
Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."
If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.
What Changes
| Traditional PRD | AI PRD |
|---|---|
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
| Spec is for humans | Spec is for humans AND agents |
| Ship once | Commission in stages |
| Features: done or not | Features: Install → Test → Operational → Optimize |
Strategic Gate
Don't write requirements for something you shouldn't build. Answer these before any PRD work begins.
Build Decision
| Question | Red Flag | Green Light |
|---|---|---|
| What job does this do that wasn't possible before? | "It's faster" (speed isn't a new job) | "It eliminates a 20h/week manual process entirely" |
| Does using it make it better? | Value is static — v1 = v100 | Compound flywheel — every use improves the next |
| Who loses if we don't build it? | "We'd miss a trend" | "Customer X loses 40h/month and will churn" |
| What's our unfair edge? | "We'll use AI" (everyone can) | "We have proprietary data / domain workflow / distribution" |
| Does this fit the build order? | Dependencies unresolved upstream | All blockers cleared or this IS the blocker |
Priority Score
Rate each dimension 1–5 with specific evidence. No number without a reason.
| Dimension | Question | Score | Evidence |
|---|---|---|---|
| Pain | How badly does the status quo hurt? | /5 | |
| Demand | Are people actively seeking this? | /5 | |
| Edge | Do we have an unfair advantage? | /5 | |
| Trend | Is the tailwind growing or dying? | /5 | |
| Conversion | Can we reach buyers efficiently? | /5 | |
| Composite | Weighted average | /5 | |
Kill signal: If composite < 3.0, stop. Revisit when conditions change.
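The composite and kill signal can be computed mechanically. A minimal sketch in TypeScript, assuming equal weights (the table says "weighted average" without fixing weights; an equal-weight mean matches the `priority_score` example in the frontmatter contract later in this guide). The type and function names are illustrative.

```typescript
// Priority scoring sketch. Equal weights are an assumption; swap in
// real weights if your team has agreed on them.
type PriorityScores = {
  pain: number;
  demand: number;
  edge: number;
  trend: number;
  conversion: number;
};

const KILL_THRESHOLD = 3.0;

function composite(s: PriorityScores): number {
  const values = [s.pain, s.demand, s.edge, s.trend, s.conversion];
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.round(mean * 10) / 10; // one decimal, like priority_score
}

function shouldKill(s: PriorityScores): boolean {
  return composite(s) < KILL_THRESHOLD;
}
```

A score of `{ pain: 4, demand: 3, edge: 5, trend: 4, conversion: 3 }` yields 3.8 and passes the gate; anything averaging below 3.0 triggers the kill signal.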
PRD Sections
Every AI feature needs these sections beyond the standard PRD. Sections 1–4 define the problem. Sections 5–9 define the solution. Sections 10–12 define the build contract.
1. Job Definition
What job is the AI hired for? Be specific.
| Field | Bad | Good |
|---|---|---|
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
| Compound | Not mentioned | "Each sent email trains brand voice model — 100th email needs 5% edits" |
The compound test: Does using this product make this product better? If yes, describe the flywheel. If no, you're building a tool, not a moat.
2. Quality Targets
Replace binary pass/fail with distribution targets:
For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests
Worked example:
For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
See AI Product Principles for quality dimensions, three tiers, and distribution thinking.
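A distribution target is straightforward to check against a batch of scored outputs. A minimal sketch, with illustrative names (`QualityTarget`, `meetsTarget` are not from any library):

```typescript
// Distribution-target check: a target passes when at least
// minPassRate of the scored requests meet or exceed minScore.
type QualityTarget = {
  dimension: string;
  minScore: number;    // e.g. 4 on a 1-5 scale
  minPassRate: number; // e.g. 0.85 -> 85% of requests
};

function meetsTarget(scores: number[], t: QualityTarget): boolean {
  if (scores.length === 0) return false; // no data, no claim
  const passing = scores.filter((s) => s >= t.minScore).length;
  return passing / scores.length >= t.minPassRate;
}
```

So a batch where 9 of 10 outputs score ≥4 passes the "≥4/5 for 85%" relevance target, but would fail if the target were tightened to 95%.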
3. Input Universe
Map the full range of what your AI will receive. Not just the happy path.
| Input Category | Examples | Expected % | Quality Expectation |
|---|---|---|---|
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |
Questions to ask for each category:
- What happens with sarcasm, slang, typos?
- What happens when the user provides correct information that conflicts with training data?
- What happens when the same question is asked five different ways?
- What happens when the input is in a language you didn't optimize for?
4. Failure Budget
How many bad outputs are acceptable? Define this before building, not after.
| Failure Type | Budget | Response |
|---|---|---|
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |
The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."
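The budget table translates directly into a monitoring check. A sketch, with the budget values copied from the table above and helper names that are assumptions:

```typescript
// Failure-budget check: returns the failure types whose observed rate
// meets or exceeds their budget (budgets are "less than" limits).
const FAILURE_BUDGETS: Record<string, number> = {
  harmful: 0.0001, // <0.01% tolerated
  wrong: 0.05,     // <5%
  unhelpful: 0.15, // <15%
  imperfect: 0.4,  // <40%
};

function overBudget(
  counts: Record<string, number>,
  total: number
): string[] {
  return Object.entries(FAILURE_BUDGETS)
    .filter(([type, budget]) => (counts[type] ?? 0) / total >= budget)
    .map(([type]) => type);
}
```

Run against a window of recent traffic, this gives stakeholders the "here's our plan for the 5%" conversation a concrete trigger.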
5. Refusal Spec
What should the AI refuse to do? This is as important as what it should do.
| Category | Action | Response to User |
|---|---|---|
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
| Confidence below threshold | Route to human | "Let me connect you with someone who can help." |
6. Human Fallback
Every AI feature needs an escape hatch. What happens when the AI can't handle it?
| Trigger | Escalation Path | SLA |
|---|---|---|
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
7. Example Pairs
Every AI PRD must include input/output examples at each quality tier. Minimum 5 per tier. These become the seed of your golden dataset.
| Quality | Input | Expected Output | Why This Score |
|---|---|---|---|
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |
The quality of your examples determines the quality of your evals. Garbage examples, garbage evals, shipping blind.
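Since these pairs seed the golden dataset, it helps to give them a shape that tooling can check. An illustrative sketch (field names are assumptions, not a standard):

```typescript
// Golden-dataset entry shape seeded from the example-pair table,
// plus a check that every tier meets the minimum of 5 examples.
type Tier = "excellent" | "acceptable" | "unacceptable";

type GoldenExample = {
  tier: Tier;
  input: string;
  expectedOutput: string;
  rationale: string; // the "Why This Score" column
};

const MIN_PER_TIER = 5;

function tierCoverage(examples: GoldenExample[]): Record<Tier, number> {
  const counts: Record<Tier, number> = {
    excellent: 0,
    acceptable: 0,
    unacceptable: 0,
  };
  for (const e of examples) counts[e.tier]++;
  return counts;
}

function meetsMinimum(examples: GoldenExample[]): boolean {
  return Object.values(tierCoverage(examples)).every((n) => n >= MIN_PER_TIER);
}
```

Gating the PRD sign-off on `meetsMinimum` makes the "minimum 5 per tier" rule enforceable rather than aspirational.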
8. Eval Strategy
Written into the PRD, not treated as afterthought.
| What | How | When |
|---|---|---|
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |
9. Cost, Latency, Quality Triangle
You can't have all three. Make the tradeoff explicit.
| Constraint | Implication | Decision |
|---|---|---|
| Cost ceiling | Limits model size, call frequency | "Max $0.02 per request → use cached responses for top 100 queries" |
| Latency ceiling | Limits model complexity, chain length | "Must respond in <2s → single call, no chain-of-thought" |
| Quality floor | Limits cost savings | "Accuracy ≥85% on golden dataset → cannot use smallest model" |
The honest question: which constraint will you relax when two conflict? Write it down. If you don't decide now, the engineer will decide for you.
Build Contract
The PRD must contain the full feature set in a parseable format. This is the handoff that engineering builds from.
10. Feature / Function / Outcome Table
Every feature has a function (what it does), an outcome (why it matters), a job (which user need), and a state.
| # | Feature | Function | Outcome | Job | State |
|---|---|---|---|---|---|
| 1 | Answer Library | Store approved RFP answers by category | Auto-fill future bids from past wins | Reduce bid prep time | Gap |
| 2 | Brand Voice Model | Learn tone from sent emails | Generated content matches company voice | Maintain consistency at scale | Stub |
| 3 | Confidence Score | Display AI certainty per output | User knows when to review vs. trust | Reduce review burden | Not verified |
State values (exact enum — parseable by downstream tooling):
| State | Meaning |
|---|---|
| Live | Deployed, tested, in production |
| Built | Code complete, not yet deployed |
| Dormant | Built but not wired / activated |
| Partial | Some functionality working |
| Not verified | Deployed but not tested against acceptance criteria |
| Gap | Not built, needed |
| Stub | Placeholder exists, no real implementation |
| Broken | Was working, now failing |
11. Build Order
Features don't ship in parallel. Dependencies determine sequence. Each sprint references features by # from the table above.
| Sprint | Features | What | Effort | Acceptance |
|---|---|---|---|---|
| 0 | #3 | Wire confidence scoring to all AI outputs | 2d | Score displays on every generated output |
| 1 | #1 | Seed answer library with 50 entries from past bids | 3d | User can search and retrieve answers by category |
| 2 | #2 | Train brand voice on 100 sent emails | 5d | Generated emails score ≥3/5 on brand voice eval |
The build order encodes the dependency graph. If Sprint 2 depends on Sprint 1, say so. If sprints can run in parallel, mark them.
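Encoding the graph explicitly lets tooling catch ordering mistakes. A minimal sketch, with an illustrative structure (the real PRD parser may differ):

```typescript
// Sprint dependency check: sprints are listed in execution order, and
// every dependency must appear earlier in the list.
type Sprint = {
  id: number;
  features: number[];  // "#" refs into the feature table
  dependsOn: number[]; // sprint ids that must finish first
};

function validOrder(sprints: Sprint[]): boolean {
  const done = new Set<number>();
  for (const s of sprints) {
    if (!s.dependsOn.every((d) => done.has(d))) return false;
    done.add(s.id);
  }
  return true;
}
```

Sprints with empty `dependsOn` arrays are the ones that can run in parallel.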
12. Commissioning Stages
Every feature progresses through four stages. This replaces the vague "MVP → iterate" pattern.
| Stage | Definition | Gate to Next |
|---|---|---|
| Install | Code deployed, feature exists | Can be invoked without errors |
| Test | Running against eval suite | Passes acceptance criteria from build order |
| Operational | Handling real traffic, monitored | Quality targets met for 7 consecutive days |
| Optimize | Tuning for cost, latency, edge cases | Improvement rate < threshold (diminishing returns) |
Track commissioning per feature:
| # | Feature | Install | Test | Operational | Optimize |
|---|---|---|---|---|---|
| 1 | Answer Library | Pass | Pass | — | — |
| 2 | Brand Voice Model | Pass | — | — | — |
| 3 | Confidence Score | — | — | — | — |
Aggregate commissioning % = (completed stages / total stages) × 100. This is the real progress metric — not "features shipped."
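The aggregate metric falls out of the tracking table directly. A small sketch (the record shape is an assumption):

```typescript
// Aggregate commissioning: completed stages / total stages, across
// all features in the tracking table.
const STAGES = ["install", "test", "operational", "optimize"] as const;

type Commissioning = Record<(typeof STAGES)[number], boolean>;

function commissioningPercent(features: Commissioning[]): number {
  const total = features.length * STAGES.length;
  const done = features
    .flatMap((f) => STAGES.map((stage) => f[stage]))
    .filter(Boolean).length;
  return (done / total) * 100;
}
```

The tracking table above (Answer Library through Test, Brand Voice Model through Install, Confidence Score not started) works out to 3 of 12 stages, or 25%.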
Agent-Facing Spec
If AI agents (coding agents, orchestration agents) will build from or operate within this PRD, the spec must be machine-readable, not just human-readable.
Frontmatter Contract
The PRD index file must include structured frontmatter that downstream parsers and agents can read:
```yaml
---
title: "Feature Name"
slug: feature-slug
status: planning | building | testing | operational | optimizing
priority_pain: 4
priority_demand: 3
priority_edge: 5
priority_trend: 4
priority_conversion: 3
priority_score: 3.8
kill_date: 2025-06-01
blocked_by: [identity-access]
last_scored: 2025-03-15
---
```
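A downstream agent can validate this contract after parsing (parsing itself would use a YAML/frontmatter library such as gray-matter; it is omitted here). A sketch of the consistency checks, assuming equal-weight scoring so `priority_score` should equal the mean of the five dimension fields:

```typescript
// Frontmatter consistency check: status must be in the enum, and
// priority_score must match the mean of the five dimension scores.
const STATUSES = ["planning", "building", "testing", "operational", "optimizing"];

type Frontmatter = {
  title: string;
  slug: string;
  status: string;
  priority_pain: number;
  priority_demand: number;
  priority_edge: number;
  priority_trend: number;
  priority_conversion: number;
  priority_score: number;
};

function validateFrontmatter(fm: Frontmatter): string[] {
  const errors: string[] = [];
  if (!STATUSES.includes(fm.status)) errors.push(`bad status: ${fm.status}`);
  const mean =
    (fm.priority_pain + fm.priority_demand + fm.priority_edge +
     fm.priority_trend + fm.priority_conversion) / 5;
  if (Math.abs(mean - fm.priority_score) > 0.05) {
    errors.push(`priority_score ${fm.priority_score} != mean ${mean}`);
  }
  return errors;
}
```

The example above validates cleanly: (4 + 3 + 5 + 4 + 3) / 5 = 3.8.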
Commands Block
If agents will build or test this feature, include executable commands:
```markdown
## Commands

- Build: `nx run sales-crm:build`
- Test: `nx run sales-crm:test --coverage`
- Lint: `nx run sales-crm:lint`
- E2E: `nx run sales-crm-e2e:e2e`
- Eval: `nx run sales-crm:eval --dataset=golden`
```
Boundaries
What agents must never do:
```markdown
## Boundaries

- ✅ Always: Run tests before commits, follow naming conventions
- ✅ Always: Check feature state before modifying — don't break Live features
- ⚠️ Ask first: Database schema changes, adding dependencies, modifying shared libs
- 🚫 Never: Commit secrets, modify auth middleware without review, skip eval suite
- 🚫 Never: Change API contracts without updating downstream consumers
```
Test Contract
Map each feature to its acceptance test:
| # | Feature | Test File | Assertion |
|---|---|---|---|
| 1 | Answer Library | apps/sales-crm/tests/answer-library.spec.ts | Returns top 3 matches with score ≥0.8 |
| 2 | Brand Voice Model | apps/sales-crm/tests/brand-voice.eval.ts | ≥3/5 on brand dimension for 90% of golden set |
| 3 | Confidence Score | apps/sales-crm/tests/confidence.spec.ts | Score renders on all AI output components |
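For concreteness, here is a hypothetical sketch of the logic behind the Answer Library assertion ("top 3 matches with score ≥0.8"). The `Match` type and `topMatches` helper are invented for illustration; the real implementation and its retrieval scoring live in the app:

```typescript
// Hypothetical retrieval helper: keep matches at or above the score
// threshold, sort best-first, return at most k.
type Match = { answer: string; score: number };

function topMatches(all: Match[], k = 3, minScore = 0.8): Match[] {
  return [...all]
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The acceptance test then asserts on exactly what the contract table states: three results, all scoring at or above 0.8, in descending order.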
The PRD Checklist
Before signing off on an AI PRD:
Problem definition:
- Job definition uses compound test — does usage improve the product?
- Strategic gate passed — priority score ≥ 3.0 with evidence
- Problem stated as SIO (Situation, Intention, Obstacle) plus the Hardest Thing
Quality specification:
- Quality targets use distributions, not binary pass/fail
- Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
- Failure budget defined and stakeholder-approved
- Refusal spec covers what the AI should NOT do
- Human fallback path defined
- Example pairs at three quality tiers (minimum 5 per tier)
- Cost, latency, quality tradeoffs documented
Build contract:
- Feature/Function/Outcome table with State column (exact enum values)
- Build order with sprint sequence, feature # refs, and acceptance criteria
- Commissioning stages tracked per feature (Install → Test → Operational → Optimize)
- Eval strategy included with timeline
Agent readiness:
- Frontmatter contract with scoring fields and status
- Commands block with exact executable commands
- Boundaries defined (always / ask first / never)
- Test contract mapping features to test files and assertions
Cross-checks:
- Engineers, designers, and data scientists contributed (not PM monologue)
- Dependencies on other PRDs declared
- Kill date set — when does this stop making sense?
Positioning
How This Guide Differs. Current PRD templates optimize for human alignment — getting stakeholders to agree on what to build. This guide adds three layers they don't address:
Distribution thinking — AI outputs are stochastic. Quality targets, failure budgets, and input universe mapping account for variance that traditional PRDs treat as bugs. See AI Product Principles.
Build contract — Feature/Function/Outcome tables with State enums, sprint sequencing with feature # refs, and commissioning stages per feature. The PRD isn't just an alignment doc — it's what engineering builds from.
Agent readiness — Frontmatter contracts, commands blocks, boundary definitions, and test contracts. The PRD is consumed by coding agents, not just human engineers. This is the "Agent Experience" layer that Addy Osmani names but doesn't fully prescribe.
Context
- Trust Architecture — Why boundaries must be structural, not intentional — the theory behind failure budgets, refusal specs, and agent boundaries
- AI Product Principles — What "good" means for non-deterministic output
- Positioning — Key point of difference
- AI Evaluation — How to measure quality with CRAFT
- AI Observability — Where it fails and why
- Jobs To Be Done — Define the job before the PRD
- JTBD Interviews — Understand real user needs
- Product Design — Design principles for the interface layer
- Pictures Templates — Pre-flight maps that feed the PRD
- Commissioning — The Install → Test → Operational → Optimize model
Links
- Lenny Rachitsky — PRD Templates — The starting point most PMs use. Problem before solution, one page, non-goals section. No AI awareness, no build contract.
- Miqdad Jaffer (OpenAI) — AI PRD Template — Battle-tested AI PRD from someone who shipped at Shopify and advises at OpenAI. Strategic alignment, AI-specific considerations, GTM. Human-only document — no agent-facing spec or commissioning model.
- Addy Osmani (Google) — Specs for AI Agents — Names the concept: "Agent Experience (AX)." Six core areas, modular context, boundaries. Focused on coding agents consuming specs, not on the PRD production process itself.
- GitHub Spec Kit — Spec-Driven Development — Four-phase gated workflow. Specs as executable artifacts. Open source, agent-agnostic. No quality distributions, no failure budgets, no commissioning.
- Edo van Royen — PRD Template Meta-Analysis — Reviews 13+ templates (Intercom, Shape Up, Miro, Airbnb). Every good template forces problem clarity before solutions.
Questions
If AI outputs are distributions, not deterministic — when is a spec complete enough to build from?
- What's the minimum number of example pairs per tier before evals become statistically meaningful?
- Should the commissioning model use four stages or five — is there a gate between "Operational" and "Trusted"?
- How do you score Edge (dimension 3) when the edge is speed-to-market rather than proprietary data?
- What happens when the strategic gate kills a PRD that engineering has already started building?
- Is the agent-facing spec section premature for teams without coding agents — or does writing it force the right discipline regardless?