
Create PRD Stories

How do you spec a product that never gives the same answer twice?

Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."

If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.

What Changes

| Traditional PRD | AI PRD |
| --- | --- |
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
| Spec is for humans | Spec is for humans AND agents |
| Ship once | Commission in stages |
| Features: done or not | Features: Install → Test → Operational → Optimize |

Strategic Gate

Don't write requirements for something you shouldn't build. Answer these before any PRD work begins.

Build Decision

| Question | Red Flag | Green Light |
| --- | --- | --- |
| What job does this do that wasn't possible before? | "It's faster" (speed isn't a new job) | "It eliminates a 20h/week manual process entirely" |
| Does using it make it better? | Value is static — v1 = v100 | Compound flywheel — every use improves the next |
| Who loses if we don't build it? | "We'd miss a trend" | "Customer X loses 40h/month and will churn" |
| What's our unfair edge? | "We'll use AI" (everyone can) | "We have proprietary data / domain workflow / distribution" |
| Does this fit the build order? | Dependencies unresolved upstream | All blockers cleared or this IS the blocker |

Priority Score

Rate each dimension 1–5 with specific evidence. No number without a reason.

| Dimension | Question | Score | Evidence |
| --- | --- | --- | --- |
| Pain | How badly does the status quo hurt? | /5 | |
| Demand | Are people actively seeking this? | /5 | |
| Edge | Do we have an unfair advantage? | /5 | |
| Trend | Is the tailwind growing or dying? | /5 | |
| Conversion | Can we reach buyers efficiently? | /5 | |
| Composite | Product of all five | /3125 | |

Kill signal: If composite < 50, park it. Revisit when conditions change. Bands: 500+ build now, 200-499 strong, 50-199 promising, <50 park.
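The composite and bands can be computed rather than hand-entered, which keeps the dashboard honest. A minimal sketch; the type and function names are illustrative, not prescribed by this guide:

```typescript
// Priority scoring sketch. Dimension names and band cutoffs follow
// the tables above; function and type names are illustrative.

type PriorityScores = {
  pain: number;       // 1-5, each with written evidence
  demand: number;     // 1-5
  edge: number;       // 1-5
  trend: number;      // 1-5
  conversion: number; // 1-5
};

// Composite = product of all five dimensions (max 5^5 = 3125)
function compositeScore(s: PriorityScores): number {
  return s.pain * s.demand * s.edge * s.trend * s.conversion;
}

// Bands: 500+ build now, 200-499 strong, 50-199 promising, <50 park
function band(composite: number): "build now" | "strong" | "promising" | "park" {
  if (composite >= 500) return "build now";
  if (composite >= 200) return "strong";
  if (composite >= 50) return "promising";
  return "park"; // kill signal: revisit when conditions change
}
```

For example, scores of 4/3/5/4/3 multiply to a composite of 720, which lands in "build now".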


Build Contract

The deliverable, not part of the framework. The Tight Five sections below (Principles, Performance, Platform, Protocols, Players) justify what's in this table. This table is what engineering builds from and what the commissioning dashboard reads.

Every feature has a function (what it does), an outcome (why it matters), a job (which user need), and a state.

Feature Table

| # | Feature | Function | Outcome | Job | State |
| --- | --- | --- | --- | --- | --- |
| 1 | Answer Library | Store approved RFP answers by category | Auto-fill future bids from past wins | Reduce bid prep time | Gap |
| 2 | Brand Voice Model | Learn tone from sent emails | Generated content matches company voice | Maintain consistency at scale | Stub |
| 3 | Confidence Score | Display AI certainty per output | User knows when to review vs. trust | Reduce review burden | Not verified |

State Enum

Exact values — parseable by downstream tooling:

| State | Meaning |
| --- | --- |
| Live | Deployed, tested, in production |
| Built | Code complete, not yet deployed |
| Dormant | Built but not wired / activated |
| Partial | Some functionality working |
| Not verified | Deployed but not tested against acceptance criteria |
| Gap | Not built, needed |
| Stub | Placeholder exists, no real implementation |
| Broken | Was working, now failing |
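Since the enum is meant to be parseable by downstream tooling, it can be pinned down as a type. A sketch in TypeScript; the identifier names are illustrative:

```typescript
// The state enum as a literal union type. The string values must
// match the table exactly so downstream tooling can parse them.

const FEATURE_STATES = [
  "Live",         // deployed, tested, in production
  "Built",        // code complete, not yet deployed
  "Dormant",      // built but not wired / activated
  "Partial",      // some functionality working
  "Not verified", // deployed but not tested against acceptance criteria
  "Gap",          // not built, needed
  "Stub",         // placeholder exists, no real implementation
  "Broken",       // was working, now failing
] as const;

type FeatureState = (typeof FEATURE_STATES)[number];

// Type guard for values read out of a PRD feature table
function isFeatureState(value: string): value is FeatureState {
  return (FEATURE_STATES as readonly string[]).includes(value);
}
```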

The five sections below are the Tight Five applied to PRD writing. Each justifies what's in the Build Contract above.

Principles

What truths guide the design? The job is the foundational truth. Refusal is a design constraint.

Job Definition

What job is the AI hired for? Be specific.

| Field | Bad | Good |
| --- | --- | --- |
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
| Compound | Not mentioned | "Each sent email trains brand voice model — 100th email needs 5% edits" |

The compound test: Does using this product make this product better? If yes, describe the flywheel. If no, you're building a tool, not a moat.

Refusal Spec

What should the AI refuse to do? This is as important as what it should do.

| Category | Action | Response to User |
| --- | --- | --- |
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
| Confidence below threshold | Route to human | "Let me connect you with someone who can help." |

Performance

How do you know it's working? What good looks like, what bad costs, how you verify.

Quality Targets

Replace binary pass/fail with distribution targets:

For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests

Worked example:

For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
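An eval harness can check a batch of scored outputs against these targets directly. A sketch; the field names, rubric scales, and thresholds mirror the worked example and are illustrative assumptions:

```typescript
// Distribution targets, not binary pass/fail: each target passes
// or fails based on the percentage of outputs meeting it.

type ScoredOutput = {
  relevance: number;           // 1-5, from eval rubric
  brandVoice: number;          // 1-5, from eval rubric
  factuallyIncorrect: boolean; // NEVER-class failure
};

type TargetResult = { target: string; actual: number; pass: boolean };

function checkTargets(outputs: ScoredOutput[]): TargetResult[] {
  const pct = (pred: (o: ScoredOutput) => boolean): number =>
    outputs.filter(pred).length / outputs.length;

  const atLeast = (target: string, actual: number, required: number): TargetResult =>
    ({ target, actual, pass: actual >= required });
  const atMost = (target: string, actual: number, budget: number): TargetResult =>
    ({ target, actual, pass: actual <= budget });

  return [
    atLeast("relevance ≥ 4/5 for 85%", pct(o => o.relevance >= 4), 0.85),
    atLeast("brand voice ≥ 3/5 for 90%", pct(o => o.brandVoice >= 3), 0.90),
    atMost("factually incorrect ≤ 1%", pct(o => o.factuallyIncorrect), 0.01),
  ];
}
```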

See AI Product Principles for quality dimensions, three tiers, and distribution thinking.

Failure Budget

How many bad outputs are acceptable? Define this before building, not after.

| Failure Type | Budget | Response |
| --- | --- | --- |
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |

The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."

Eval Strategy

Written into the PRD, not treated as an afterthought.

| What | How | When |
| --- | --- | --- |
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |

Platform

What do you control? What you receive, what you own, what you trade off.

Input Universe

Map the full range of what your AI will receive. Not just the happy path.

| Input Category | Examples | Expected % | Quality Expectation |
| --- | --- | --- | --- |
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |

Questions to ask for each category:

  • What happens with sarcasm, slang, typos?
  • What happens when the user provides correct information that conflicts with training data?
  • What happens when the same question is asked five different ways?
  • What happens when the input is in a language you didn't optimize for?

Human Fallback

Every AI feature needs an escape hatch. What happens when the AI can't handle it?

| Trigger | Escalation Path | SLA |
| --- | --- | --- |
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
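The real-time triggers collapse into a routing decision per output. A sketch; the 0.7 confidence threshold is an assumed placeholder that the PRD should set explicitly:

```typescript
// Routing an AI output through the real-time fallback triggers.
// Type names and the default threshold are illustrative.

type OutputCheck = {
  confidence: number;     // 0-1, per-output confidence
  safetyFlagged: boolean; // safety filter result
};

type Route = "deliver" | "route-to-human" | "block-and-review";

function routeOutput(check: OutputCheck, confidenceThreshold = 0.7): Route {
  if (check.safetyFlagged) return "block-and-review"; // immediate: block, log, review
  if (check.confidence < confidenceThreshold) return "route-to-human"; // real-time escalation
  return "deliver";
}
```

Slower triggers (user reports, eval regressions) live in monitoring, not in the request path.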

Cost, Latency, Quality

You can't have all three. Make the tradeoff explicit.

| Constraint | Implication | Decision |
| --- | --- | --- |
| Cost ceiling | Limits model size, call frequency | "Max $0.02 per request → use cached responses for top 100 queries" |
| Latency ceiling | Limits model complexity, chain length | "Must respond in <2s → single call, no chain-of-thought" |
| Quality floor | Limits cost savings | "Accuracy ≥85% on golden dataset → cannot use smallest model" |

The honest question: which constraint will you relax when two conflict? Write it down. If you don't decide now, the engineer will decide for you.


Protocols

How do you coordinate? Sequencing, verification, and agent handoff.

Build Order

Features don't ship in parallel. Dependencies determine sequence. Each sprint references features by # from the feature table.

| Sprint | Features | What | Effort | Acceptance |
| --- | --- | --- | --- | --- |
| 0 | #3 | Wire confidence scoring to all AI outputs | 2d | Score displays on every generated output |
| 1 | #1 | Seed answer library with 50 entries from past bids | 3d | User can search and retrieve answers by category |
| 2 | #2 | Train brand voice on 100 sent emails | 5d | Generated emails score ≥3/5 on brand voice eval |

The build order encodes the dependency graph. If Sprint 2 depends on Sprint 1, say so. If sprints can run in parallel, mark them.
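One way to make the dependency graph checkable: each sprint declares what it depends on, and a validator confirms the order. A sketch; the `dependsOn` values below assume a strict chain, which the example plan does not actually state:

```typescript
// Verifying that sprint order respects declared dependencies.
// Schema is illustrative; the guide prescribes no format.

type Sprint = { id: number; features: number[]; dependsOn: number[] };

function validOrder(sprints: Sprint[]): boolean {
  const done = new Set<number>();
  for (const s of sprints) {
    if (!s.dependsOn.every(d => done.has(d))) return false; // dependency not yet built
    done.add(s.id);
  }
  return true;
}

// The example plan, assuming a strict chain 0 → 1 → 2
const plan: Sprint[] = [
  { id: 0, features: [3], dependsOn: [] },
  { id: 1, features: [1], dependsOn: [0] },
  { id: 2, features: [2], dependsOn: [1] },
];
```

Sprints with disjoint `dependsOn` sets can run in parallel; the validator only rejects orders that build on top of something not yet built.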

Commissioning Stages

Every feature progresses through four stages. This replaces the vague "MVP → iterate" pattern.

| Stage | Definition | Gate to Next |
| --- | --- | --- |
| Install | Code deployed, feature exists | Can be invoked without errors |
| Test | Running against eval suite | Passes acceptance criteria from build order |
| Operational | Handling real traffic, monitored | Quality targets met for 7 consecutive days |
| Optimize | Tuning for cost, latency, edge cases | Improvement rate < threshold (diminishing returns) |

Track commissioning per feature:

| # | Feature | Install | Test | Operational | Optimize |
| --- | --- | --- | --- | --- | --- |
| 1 | Answer Library | Pass | Pass | | |
| 2 | Brand Voice Model | Pass | | | |
| 3 | Confidence Score | | | | |

Aggregate commissioning % = (completed stages / total stages) × 100. This is the real progress metric — not "features shipped."
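The aggregate calculation is trivial to automate from the tracking table. A sketch; stage names follow the four-stage model and the data shapes are illustrative:

```typescript
// Aggregate commissioning % across all features.

const STAGES = ["Install", "Test", "Operational", "Optimize"] as const;
type Stage = (typeof STAGES)[number];

type FeatureProgress = { feature: string; completed: Stage[] };

function commissioningPercent(features: FeatureProgress[]): number {
  const total = features.length * STAGES.length;
  const done = features.reduce((n, f) => n + f.completed.length, 0);
  return Math.round((done / total) * 100);
}

// The tracking table above: 3 stages passed out of 12 → 25%
const snapshot: FeatureProgress[] = [
  { feature: "Answer Library", completed: ["Install", "Test"] },
  { feature: "Brand Voice Model", completed: ["Install"] },
  { feature: "Confidence Score", completed: [] },
];
```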

Agent-Facing Spec

If AI agents (coding agents, orchestration agents) will build from or operate within this PRD, the spec must be machine-readable, not just human-readable.

Frontmatter Contract:

The PRD index file must include structured frontmatter that downstream parsers and agents can read:

```yaml
---
title: "Feature Name"
slug: feature-slug
status: planning | building | testing | operational | optimizing
priority_pain: 4
priority_demand: 3
priority_edge: 5
priority_trend: 4
priority_conversion: 3
priority_score: 720
kill_date: 2025-06-01
blocked_by: [identity-access]
last_scored: 2025-03-15
---
```
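What "downstream parsers can read" might look like in practice: minimal frontmatter parsing plus a consistency check that `priority_score` equals the product of the five dimension scores. A real implementation would use a YAML library; this line-based version is an illustrative sketch:

```typescript
// Parse frontmatter key/value fields and validate the score contract.

function parseFrontmatter(text: string): Record<string, string> {
  const match = text.match(/^---\n([\s\S]*?)\n---/);
  if (!match) throw new Error("no frontmatter block");
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const i = line.indexOf(":");
    if (i > 0) fields[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return fields;
}

function priorityScoreConsistent(fields: Record<string, string>): boolean {
  const dims = ["pain", "demand", "edge", "trend", "conversion"];
  const product = dims.reduce((p, d) => p * Number(fields[`priority_${d}`]), 1);
  return product === Number(fields["priority_score"]);
}
```

An agent reading the PRD index can then refuse to build anything whose score fields don't multiply out, or whose `kill_date` has passed.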

Commands Block:

If agents will build or test this feature, include executable commands:

## Commands
- Build: `nx run sales-crm:build`
- Test: `nx run sales-crm:test --coverage`
- Lint: `nx run sales-crm:lint`
- E2E: `nx run sales-crm-e2e:e2e`
- Eval: `nx run sales-crm:eval --dataset=golden`

Boundaries:

What agents must never do:

## Boundaries
- Always: Run tests before commits, follow naming conventions
- Always: Check feature state before modifying — don't break Live features
- Ask first: Database schema changes, adding dependencies, modifying shared libs
- Never: Commit secrets, modify auth middleware without review, skip eval suite
- Never: Change API contracts without updating downstream consumers

Test Contract:

Map each feature to its acceptance test:

| # | Feature | Test File | Assertion |
| --- | --- | --- | --- |
| 1 | Answer Library | `apps/sales-crm/tests/answer-library.spec.ts` | Returns top 3 matches with score ≥0.8 |
| 2 | Brand Voice Model | `apps/sales-crm/tests/brand-voice.eval.ts` | ≥3/5 on brand dimension for 90% of golden set |
| 3 | Confidence Score | `apps/sales-crm/tests/confidence.spec.ts` | Score renders on all AI output components |

Players

Who creates harmony? For Product and Agent PRDs, this is the heaviest section — who you serve, what they struggle with, what triggers switching. The depth comes from JTBD interviews.

Demand-Side Jobs

Every PRD must define at least one demand-side job. Each job captures the struggling moment that drives someone to seek your product — not a feature request, but a circumstance. See Validate Demand for the interview method and Four Moments framework.

| Element | Bad | Good |
| --- | --- | --- |
| Struggling moment | "They need better analytics" | "Month-end report takes 3 days of copy-paste from 6 spreadsheets" |
| Current workaround | "Manual process" | "Junior analyst builds deck in Excel, emailed to 4 stakeholders" |
| What progress looks like | "Faster reports" | "Report generated in 10 minutes, stakeholders self-serve" |
| Hidden objection | "Cost" | "If AI generates wrong numbers, I get fired — Excel I can audit" |
| Switch trigger | "When they see a demo" | "When the board asks why reporting takes 3 FTEs" |

The hidden objection is what they won't tell you. It's the real reason they haven't switched yet. Sutherland's insight: the objection is never the stated objection. "It's too expensive" usually means "I don't trust it enough to justify the risk."

Features that serve this job: Map each demand-side job to specific features from the Build Contract by # reference. If a feature doesn't serve any job, question whether it belongs.

Example Pairs

Every AI PRD must include input/output examples at each quality tier. Minimum 1 per tier (3 calibrated pairs beat 15 generic ones). These become the seed of your golden dataset.

| Quality | Input | Expected Output | Why This Score |
| --- | --- | --- | --- |
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |

Include at least 1 per tier, scaling to 5+ for AI-heavy features. The quality of your examples determines the quality of your evals. Garbage examples, garbage evals, shipping blind.
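Since example pairs seed the golden dataset, it helps to store them in a typed structure with a coverage check for the minimum-per-tier rule. A sketch; the shapes are illustrative:

```typescript
// Example pairs as golden-dataset seeds, with tier coverage check.

type Tier = "excellent" | "acceptable" | "unacceptable";

type ExamplePair = {
  input: string;
  expectedOutput: string;
  tier: Tier;
  why: string; // the "Why This Score" calibration rationale
};

function tierCoverage(
  pairs: ExamplePair[],
  minPerTier = 1, // raise to 5+ for AI-heavy features
): { tier: Tier; count: number; ok: boolean }[] {
  const tiers: Tier[] = ["excellent", "acceptable", "unacceptable"];
  return tiers.map(tier => {
    const count = pairs.filter(p => p.tier === tier).length;
    return { tier, count, ok: count >= minPerTier };
  });
}
```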


PRD Checklist

Before signing off on an AI PRD:

Build contract:

  • Feature/Function/Outcome table with State column (exact enum values)

Principles:

  • Job definition uses compound test — does usage improve the product?
  • Problem stated as SIO: Situation, Intention, Obstacle, Hardest Thing
  • Refusal spec covers what the AI should NOT do

Performance:

  • Strategic gate passed — composite priority score above the kill threshold (≥ 50), with evidence per dimension
  • Quality targets use distributions, not binary pass/fail
  • Failure budget defined and stakeholder-approved
  • Eval strategy included with timeline

Platform:

  • Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
  • Human fallback path defined
  • Cost, latency, quality tradeoffs documented

Protocols:

  • Build order with sprint sequence, feature # refs, and acceptance criteria
  • Commissioning stages tracked per feature (Install → Test → Operational → Optimize)
  • Frontmatter contract with scoring fields and status
  • Commands block with exact executable commands
  • Boundaries defined (always / ask first / never)
  • Test contract mapping features to test files and assertions

Players:

  • At least 1 demand-side job with all 5 elements (struggling moment, workaround, progress, hidden objection, switch trigger)
  • Each demand-side job maps to features from Build Contract by # reference
  • Example pairs at three quality tiers (minimum 1 per tier, 5+ for AI-heavy features)

Cross-checks:

  • Engineers, designers, and data scientists contributed (not PM monologue)
  • Dependencies on other PRDs declared
  • Kill date set — when does this stop making sense?

Positioning

How This Guide Differs. Conventional PRDs optimise for human alignment — getting stakeholders to agree on what to build. This guide adds three layers they don't address:

Distribution thinking — AI outputs are stochastic. Quality targets, failure budgets, and input universe mapping account for variance that traditional PRDs treat as bugs. See AI Product Principles.

Build contract — Feature/Function/Outcome tables with State enums, sprint sequencing with feature # refs, and commissioning stages per feature. The PRD isn't just an alignment doc — it's what engineering builds from.

Agent readiness — Frontmatter contracts, commands blocks, boundary definitions, and test contracts. The PRD is consumed by coding agents, not just human engineers. This is the "Agent Experience" layer that Addy Osmani names but doesn't fully prescribe.

Context

Questions

If AI outputs are distributions, not deterministic — when is a spec complete enough to build from?

  • What's the minimum number of example pairs per tier before evals become statistically meaningful?
  • Should the commissioning model use four stages or five — is there a gate between "Operational" and "Trusted"?
  • How do you score Edge (dimension 3) when the edge is speed-to-market rather than proprietary data?
  • What happens when the strategic gate kills a PRD that engineering has already started building?
  • Is the agent-facing spec section premature for teams without coding agents — or does writing it force the right discipline regardless?