
Create PRD Stories

How do you spec a product that never gives the same answer twice?

Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."

If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.

What Changes

| Traditional PRD | AI PRD |
| --- | --- |
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
| Spec is for humans | Spec is for humans AND agents |
| Ship once | Commission in stages |
| Features: done or not | Features: Install → Test → Operational → Optimize |

Strategic Gate

Don't write requirements for something you shouldn't build. Answer these before any PRD work begins.

Build Decision

| Question | Red Flag | Green Light |
| --- | --- | --- |
| What job does this do that wasn't possible before? | "It's faster" (speed isn't a new job) | "It eliminates a 20h/week manual process entirely" |
| Does using it make it better? | Value is static — v1 = v100 | Compound flywheel — every use improves the next |
| Who loses if we don't build it? | "We'd miss a trend" | "Customer X loses 40h/month and will churn" |
| What's our unfair edge? | "We'll use AI" (everyone can) | "We have proprietary data / domain workflow / distribution" |
| Does this fit the build order? | Dependencies unresolved upstream | All blockers cleared or this IS the blocker |

Priority Score

Rate each dimension 1–5 with specific evidence. No number without a reason.

| Dimension | Question | Score | Evidence |
| --- | --- | --- | --- |
| Pain | How badly does the status quo hurt? | /5 | |
| Demand | Are people actively seeking this? | /5 | |
| Edge | Do we have an unfair advantage? | /5 | |
| Trend | Is the tailwind growing or dying? | /5 | |
| Conversion | Can we reach buyers efficiently? | /5 | |
| Composite | Product of all five | /3125 | |

Kill signal: If composite < 50, park it. Revisit when conditions change. Bands: 500+ build now, 200-499 strong, 50-199 promising, <50 park.
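The composite and bands can be computed rather than hand-entered, which keeps the dashboard honest. A minimal sketch; the type and function names are illustrative, not prescribed by this guide:

```typescript
// Priority scoring sketch. Dimension names and band cutoffs follow
// the tables above; function and type names are illustrative.

type PriorityScores = {
  pain: number;       // 1-5, each with written evidence
  demand: number;     // 1-5
  edge: number;       // 1-5
  trend: number;      // 1-5
  conversion: number; // 1-5
};

// Composite = product of all five dimensions (max 5^5 = 3125)
function compositeScore(s: PriorityScores): number {
  return s.pain * s.demand * s.edge * s.trend * s.conversion;
}

// Bands: 500+ build now, 200-499 strong, 50-199 promising, <50 park
function band(composite: number): "build now" | "strong" | "promising" | "park" {
  if (composite >= 500) return "build now";
  if (composite >= 200) return "strong";
  if (composite >= 50) return "promising";
  return "park"; // kill signal: revisit when conditions change
}
```

For example, scores of 4/3/5/4/3 multiply to a composite of 720, which lands in "build now".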


Build Contract

The deliverable, not part of the framework. The Tight Five sections below (Principles, Performance, Platform, Protocols, Players) justify what's in this table. This table is what engineering builds from and what the commissioning dashboard reads.

Every feature has a function (what it does), an outcome (why it matters), a job (which user need), and a state.

Feature Table

| # | Feature | Function | Outcome | Job | State |
| --- | --- | --- | --- | --- | --- |
| 1 | Answer Library | Store approved RFP answers by category | Auto-fill future bids from past wins | Reduce bid prep time | Gap |
| 2 | Brand Voice Model | Learn tone from sent emails | Generated content matches company voice | Maintain consistency at scale | Stub |
| 3 | Confidence Score | Display AI certainty per output | User knows when to review vs. trust | Reduce review burden | Not verified |

State Enum

Exact values — parseable by downstream tooling:

| State | Meaning |
| --- | --- |
| Live | Deployed, tested, in production |
| Built | Code complete, not yet deployed |
| Dormant | Built but not wired / activated |
| Partial | Some functionality working |
| Not verified | Deployed but not tested against acceptance criteria |
| Gap | Not built, needed |
| Stub | Placeholder exists, no real implementation |
| Broken | Was working, now failing |
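Since the enum is meant to be parseable by downstream tooling, it can be pinned down as a type. A sketch in TypeScript; the identifier names are illustrative:

```typescript
// The state enum as a literal union type. The string values must
// match the table exactly so downstream tooling can parse them.

const FEATURE_STATES = [
  "Live",         // deployed, tested, in production
  "Built",        // code complete, not yet deployed
  "Dormant",      // built but not wired / activated
  "Partial",      // some functionality working
  "Not verified", // deployed but not tested against acceptance criteria
  "Gap",          // not built, needed
  "Stub",         // placeholder exists, no real implementation
  "Broken",       // was working, now failing
] as const;

type FeatureState = (typeof FEATURE_STATES)[number];

// Type guard for values read out of a PRD feature table
function isFeatureState(value: string): value is FeatureState {
  return (FEATURE_STATES as readonly string[]).includes(value);
}
```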

The five sections below are the Tight Five applied to PRD writing. Each justifies what's in the Build Contract above.

Principles

What truths guide the design? The job is the foundational truth. Refusal is a design constraint.

Job Definition

What job is the AI hired for? Be specific.

| Field | Bad | Good |
| --- | --- | --- |
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
| Compound | Not mentioned | "Each sent email trains brand voice model — 100th email needs 5% edits" |

The compound test: Does using this product make this product better? If yes, describe the flywheel. If no, you're building a tool, not a moat.

Refusal Spec

What should the AI refuse to do? This is as important as what it should do.

| Category | Action | Response to User |
| --- | --- | --- |
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
| Confidence below threshold | Route to human | "Let me connect you with someone who can help." |

Performance

How do you know it's working? What good looks like, what bad costs, how you verify.

Quality Targets

Replace binary pass/fail with distribution targets:

For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests

Worked example:

For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
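An eval harness can check a batch of scored outputs against these targets directly. A sketch; the field names, rubric scales, and thresholds mirror the worked example and are illustrative assumptions:

```typescript
// Distribution targets, not binary pass/fail: each target passes
// or fails based on the percentage of outputs meeting it.

type ScoredOutput = {
  relevance: number;           // 1-5, from eval rubric
  brandVoice: number;          // 1-5, from eval rubric
  factuallyIncorrect: boolean; // NEVER-class failure
};

type TargetResult = { target: string; actual: number; pass: boolean };

function checkTargets(outputs: ScoredOutput[]): TargetResult[] {
  const pct = (pred: (o: ScoredOutput) => boolean): number =>
    outputs.filter(pred).length / outputs.length;

  const atLeast = (target: string, actual: number, required: number): TargetResult =>
    ({ target, actual, pass: actual >= required });
  const atMost = (target: string, actual: number, budget: number): TargetResult =>
    ({ target, actual, pass: actual <= budget });

  return [
    atLeast("relevance ≥ 4/5 for 85%", pct(o => o.relevance >= 4), 0.85),
    atLeast("brand voice ≥ 3/5 for 90%", pct(o => o.brandVoice >= 3), 0.90),
    atMost("factually incorrect ≤ 1%", pct(o => o.factuallyIncorrect), 0.01),
  ];
}
```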

See AI Product Principles for quality dimensions, three tiers, and distribution thinking.

Failure Budget

How many bad outputs are acceptable? Define this before building, not after.

| Failure Type | Budget | Response |
| --- | --- | --- |
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |

The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."

Eval Strategy

Written into the PRD, not treated as an afterthought.

| What | How | When |
| --- | --- | --- |
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |

Platform

What do you control? What you receive, what you own, what you trade off.

Input Universe

Map the full range of what your AI will receive. Not just the happy path.

| Input Category | Examples | Expected % | Quality Expectation |
| --- | --- | --- | --- |
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |

Questions to ask for each category:

  • What happens with sarcasm, slang, typos?
  • What happens when the user provides correct information that conflicts with training data?
  • What happens when the same question is asked five different ways?
  • What happens when the input is in a language you didn't optimize for?

Human Fallback

Every AI feature needs an escape hatch. What happens when the AI can't handle it?

| Trigger | Escalation Path | SLA |
| --- | --- | --- |
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
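The real-time triggers collapse into a routing decision per output. A sketch; the 0.7 confidence threshold is an assumed placeholder that the PRD should set explicitly:

```typescript
// Routing an AI output through the real-time fallback triggers.
// Type names and the default threshold are illustrative.

type OutputCheck = {
  confidence: number;     // 0-1, per-output confidence
  safetyFlagged: boolean; // safety filter result
};

type Route = "deliver" | "route-to-human" | "block-and-review";

function routeOutput(check: OutputCheck, confidenceThreshold = 0.7): Route {
  if (check.safetyFlagged) return "block-and-review"; // immediate: block, log, review
  if (check.confidence < confidenceThreshold) return "route-to-human"; // real-time escalation
  return "deliver";
}
```

Slower triggers (user reports, eval regressions) live in monitoring, not in the request path.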

Cost, Latency, Quality

You can't have all three. Make the tradeoff explicit.

| Constraint | Implication | Decision |
| --- | --- | --- |
| Cost ceiling | Limits model size, call frequency | "Max $0.02 per request → use cached responses for top 100 queries" |
| Latency ceiling | Limits model complexity, chain length | "Must respond in <2s → single call, no chain-of-thought" |
| Quality floor | Limits cost savings | "Accuracy ≥85% on golden dataset → cannot use smallest model" |

The honest question: which constraint will you relax when two conflict? Write it down. If you don't decide now, the engineer will decide for you.


Protocols

How do you coordinate? Sequencing, verification, and agent handoff.

Build Order

Features don't ship in parallel. Dependencies determine sequence. Each sprint references features by # from the feature table.

| Sprint | Features | What | Effort | Acceptance |
| --- | --- | --- | --- | --- |
| 0 | #3 | Wire confidence scoring to all AI outputs | 2d | Score displays on every generated output |
| 1 | #1 | Seed answer library with 50 entries from past bids | 3d | User can search and retrieve answers by category |
| 2 | #2 | Train brand voice on 100 sent emails | 5d | Generated emails score ≥3/5 on brand voice eval |

The build order encodes the dependency graph. If Sprint 2 depends on Sprint 1, say so. If sprints can run in parallel, mark them.
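One way to make the dependency graph checkable: each sprint declares what it depends on, and a validator confirms the order. A sketch; the `dependsOn` values below assume a strict chain, which the example plan does not actually state:

```typescript
// Verifying that sprint order respects declared dependencies.
// Schema is illustrative; the guide prescribes no format.

type Sprint = { id: number; features: number[]; dependsOn: number[] };

function validOrder(sprints: Sprint[]): boolean {
  const done = new Set<number>();
  for (const s of sprints) {
    if (!s.dependsOn.every(d => done.has(d))) return false; // dependency not yet built
    done.add(s.id);
  }
  return true;
}

// The example plan, assuming a strict chain 0 → 1 → 2
const plan: Sprint[] = [
  { id: 0, features: [3], dependsOn: [] },
  { id: 1, features: [1], dependsOn: [0] },
  { id: 2, features: [2], dependsOn: [1] },
];
```

Sprints with disjoint `dependsOn` sets can run in parallel; the validator only rejects orders that build on top of something not yet built.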

Commissioning Stages

Every feature progresses through four stages. This replaces the vague "MVP → iterate" pattern.

| Stage | Definition | Gate to Next |
| --- | --- | --- |
| Install | Code deployed, feature exists | Can be invoked without errors |
| Test | Running against eval suite | Passes acceptance criteria from build order |
| Operational | Handling real traffic, monitored | Quality targets met for 7 consecutive days |
| Optimize | Tuning for cost, latency, edge cases | Improvement rate < threshold (diminishing returns) |

Track commissioning per feature:

| # | Feature | Install | Test | Operational | Optimize |
| --- | --- | --- | --- | --- | --- |
| 1 | Answer Library | Pass | Pass | | |
| 2 | Brand Voice Model | Pass | | | |
| 3 | Confidence Score | | | | |

Aggregate commissioning % = (completed stages / total stages) × 100. This is the real progress metric — not "features shipped."
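The aggregate calculation is trivial to automate from the tracking table. A sketch; stage names follow the four-stage model and the data shapes are illustrative:

```typescript
// Aggregate commissioning % across all features.

const STAGES = ["Install", "Test", "Operational", "Optimize"] as const;
type Stage = (typeof STAGES)[number];

type FeatureProgress = { feature: string; completed: Stage[] };

function commissioningPercent(features: FeatureProgress[]): number {
  const total = features.length * STAGES.length;
  const done = features.reduce((n, f) => n + f.completed.length, 0);
  return Math.round((done / total) * 100);
}

// The tracking table above: 3 stages passed out of 12 → 25%
const snapshot: FeatureProgress[] = [
  { feature: "Answer Library", completed: ["Install", "Test"] },
  { feature: "Brand Voice Model", completed: ["Install"] },
  { feature: "Confidence Score", completed: [] },
];
```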

Agent-Facing Spec

If AI agents (coding agents, orchestration agents) will build from or operate within this PRD, the spec must be machine-readable, not just human-readable.

Frontmatter Contract:

The PRD index file must include structured frontmatter that downstream parsers and agents can read:

```yaml
---
title: "Feature Name"
slug: feature-slug
status: planning | building | testing | operational | optimizing
priority_pain: 4
priority_demand: 3
priority_edge: 5
priority_trend: 4
priority_conversion: 3
priority_score: 720
kill_date: 2025-06-01
blocked_by: [identity-access]
last_scored: 2025-03-15
---
```
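What "downstream parsers can read" might look like in practice: minimal frontmatter parsing plus a consistency check that `priority_score` equals the product of the five dimension scores. A real implementation would use a YAML library; this line-based version is an illustrative sketch:

```typescript
// Parse frontmatter key/value fields and validate the score contract.

function parseFrontmatter(text: string): Record<string, string> {
  const match = text.match(/^---\n([\s\S]*?)\n---/);
  if (!match) throw new Error("no frontmatter block");
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const i = line.indexOf(":");
    if (i > 0) fields[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return fields;
}

function priorityScoreConsistent(fields: Record<string, string>): boolean {
  const dims = ["pain", "demand", "edge", "trend", "conversion"];
  const product = dims.reduce((p, d) => p * Number(fields[`priority_${d}`]), 1);
  return product === Number(fields["priority_score"]);
}
```

An agent reading the PRD index can then refuse to build anything whose score fields don't multiply out, or whose `kill_date` has passed.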

Commands Block:

If agents will build or test this feature, include executable commands:

## Commands
- Build: `nx run sales-crm:build`
- Test: `nx run sales-crm:test --coverage`
- Lint: `nx run sales-crm:lint`
- E2E: `nx run sales-crm-e2e:e2e`
- Eval: `nx run sales-crm:eval --dataset=golden`

Boundaries:

What agents must never do:

## Boundaries
- Always: Run tests before commits, follow naming conventions
- Always: Check feature state before modifying — don't break Live features
- Ask first: Database schema changes, adding dependencies, modifying shared libs
- Never: Commit secrets, modify auth middleware without review, skip eval suite
- Never: Change API contracts without updating downstream consumers

Test Contract:

Map each feature to its acceptance test:

| # | Feature | Test File | Assertion |
| --- | --- | --- | --- |
| 1 | Answer Library | `apps/sales-crm/tests/answer-library.spec.ts` | Returns top 3 matches with score ≥0.8 |
| 2 | Brand Voice Model | `apps/sales-crm/tests/brand-voice.eval.ts` | ≥3/5 on brand dimension for 90% of golden set |
| 3 | Confidence Score | `apps/sales-crm/tests/confidence.spec.ts` | Score renders on all AI output components |

Players

Who creates harmony? For Product and Agent PRDs, this is the heaviest section — who you serve, what they struggle with, what triggers switching. The depth comes from JTBD interviews.

Demand-Side Jobs

Every PRD must define at least one demand-side job. Each job captures the struggling moment that drives someone to seek your product — not a feature request, but a circumstance. See Validate Demand for the interview method and Four Moments framework.

| Element | Bad | Good |
| --- | --- | --- |
| Struggling moment | "They need better analytics" | "Month-end report takes 3 days of copy-paste from 6 spreadsheets" |
| Current workaround | "Manual process" | "Junior analyst builds deck in Excel, emailed to 4 stakeholders" |
| What progress looks like | "Faster reports" | "Report generated in 10 minutes, stakeholders self-serve" |
| Hidden objection | "Cost" | "If AI generates wrong numbers, I get fired — Excel I can audit" |
| Switch trigger | "When they see a demo" | "When the board asks why reporting takes 3 FTEs" |

The hidden objection is what they won't tell you. It's the real reason they haven't switched yet. Sutherland's insight: the objection is never the stated objection. "It's too expensive" usually means "I don't trust it enough to justify the risk."

Features that serve this job: Map each demand-side job to specific features from the Build Contract by # reference. If a feature doesn't serve any job, question whether it belongs.

Example Pairs

Every AI PRD must include input/output examples at each quality tier. Minimum 1 per tier (3 calibrated pairs beat 15 generic ones). These become the seed of your golden dataset.

| Quality | Input | Expected Output | Why This Score |
| --- | --- | --- | --- |
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |

Include at least 1 per tier, scaling to 5+ for AI-heavy features. The quality of your examples determines the quality of your evals. Garbage examples, garbage evals, shipping blind.
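Since example pairs seed the golden dataset, it helps to store them in a typed structure with a coverage check for the minimum-per-tier rule. A sketch; the shapes are illustrative:

```typescript
// Example pairs as golden-dataset seeds, with tier coverage check.

type Tier = "excellent" | "acceptable" | "unacceptable";

type ExamplePair = {
  input: string;
  expectedOutput: string;
  tier: Tier;
  why: string; // the "Why This Score" calibration rationale
};

function tierCoverage(
  pairs: ExamplePair[],
  minPerTier = 1, // raise to 5+ for AI-heavy features
): { tier: Tier; count: number; ok: boolean }[] {
  const tiers: Tier[] = ["excellent", "acceptable", "unacceptable"];
  return tiers.map(tier => {
    const count = pairs.filter(p => p.tier === tier).length;
    return { tier, count, ok: count >= minPerTier };
  });
}
```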


PRD Checklist

Before signing off on an AI PRD:

Build contract:

  • Feature/Function/Outcome table with State column (exact enum values)

Principles:

  • Job definition uses compound test — does usage improve the product?
  • Problem stated as SIO: Situation, Intention, Obstacle, Hardest Thing
  • Refusal spec covers what the AI should NOT do

Performance:

  • Strategic gate passed — composite priority score above the kill threshold (≥ 50), with evidence per dimension
  • Quality targets use distributions, not binary pass/fail
  • Failure budget defined and stakeholder-approved
  • Eval strategy included with timeline

Platform:

  • Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
  • Human fallback path defined
  • Cost, latency, quality tradeoffs documented

Protocols:

  • Build order with sprint sequence, feature # refs, and acceptance criteria
  • Commissioning stages tracked per feature (Install → Test → Operational → Optimize)
  • Frontmatter contract with scoring fields and status
  • Commands block with exact executable commands
  • Boundaries defined (always / ask first / never)
  • Test contract mapping features to test files and assertions

Players:

  • At least 1 demand-side job with all 5 elements (struggling moment, workaround, progress, hidden objection, switch trigger)
  • Each demand-side job maps to features from Build Contract by # reference
  • Example pairs at three quality tiers (minimum 1 per tier, 5+ for AI-heavy features)

Cross-checks:

  • Engineers, designers, and data scientists contributed (not PM monologue)
  • Dependencies on other PRDs declared
  • Kill date set — when does this stop making sense?

Positioning

How This Guide Differs. Conventional PRDs optimise for human alignment — getting stakeholders to agree on what to build. This guide adds three layers they don't address:

Distribution thinking — AI outputs are stochastic. Quality targets, failure budgets, and input universe mapping account for variance that traditional PRDs treat as bugs. See AI Product Principles.

Build contract — Feature/Function/Outcome tables with State enums, sprint sequencing with feature # refs, and commissioning stages per feature. The PRD isn't just an alignment doc — it's what engineering builds from.

Agent readiness — Frontmatter contracts, commands blocks, boundary definitions, and test contracts. The PRD is consumed by coding agents, not just human engineers. This is the "Agent Experience" layer that Addy Osmani names but doesn't fully prescribe.

Context

Questions

If AI outputs are distributions, not deterministic — when is a spec complete enough to build from?

  • What's the minimum number of example pairs per tier before evals become statistically meaningful?
  • Should the commissioning model use four stages or five — is there a gate between "Operational" and "Trusted"?
  • How do you score Edge (dimension 3) when the edge is speed-to-market rather than proprietary data?
  • What happens when the strategic gate kills a PRD that engineering has already started building?
  • Is the agent-facing spec section premature for teams without coding agents — or does writing it force the right discipline regardless?