
AI Product Requirements

How do you spec a product that never gives the same answer twice?

Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."

If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.

What Changes

| Traditional PRD | AI PRD |
| --- | --- |
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |
| Spec is for humans | Spec is for humans AND agents |
| Ship once | Commission in stages |
| Features: done or not | Features: Install → Test → Operational → Optimize |

Strategic Gate

Don't write requirements for something you shouldn't build. Answer these before any PRD work begins.

Build Decision

| Question | Red Flag | Green Light |
| --- | --- | --- |
| What job does this do that wasn't possible before? | "It's faster" (speed isn't a new job) | "It eliminates a 20h/week manual process entirely" |
| Does using it make it better? | Value is static — v1 = v100 | Compound flywheel — every use improves the next |
| Who loses if we don't build it? | "We'd miss a trend" | "Customer X loses 40h/month and will churn" |
| What's our unfair edge? | "We'll use AI" (everyone can) | "We have proprietary data / domain workflow / distribution" |
| Does this fit the build order? | Dependencies unresolved upstream | All blockers cleared or this IS the blocker |

Priority Score

Rate each dimension 1–5 with specific evidence. No number without a reason.

| Dimension | Question | Score | Evidence |
| --- | --- | --- | --- |
| Pain | How badly does the status quo hurt? | /5 | |
| Demand | Are people actively seeking this? | /5 | |
| Edge | Do we have an unfair advantage? | /5 | |
| Trend | Is the tailwind growing or dying? | /5 | |
| Conversion | Can we reach buyers efficiently? | /5 | |
| Composite | Weighted average | /5 | |

Kill signal: If composite < 3.0, stop. Revisit when conditions change.
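As a sketch, the composite and the kill signal can be computed mechanically. Equal weights are an assumption here (the weighting is a per-team choice); with equal weights, the example scores 4, 3, 5, 4, 3 give a composite of 3.8.

```typescript
// Priority scoring sketch. Equal weights are an assumption; adjust per team.
type PriorityScores = {
  pain: number;
  demand: number;
  edge: number;
  trend: number;
  conversion: number;
};

const WEIGHTS: Record<keyof PriorityScores, number> = {
  pain: 1,
  demand: 1,
  edge: 1,
  trend: 1,
  conversion: 1,
};

// Weighted average across the five dimensions.
function composite(scores: PriorityScores): number {
  const keys = Object.keys(scores) as (keyof PriorityScores)[];
  const totalWeight = keys.reduce((sum, k) => sum + WEIGHTS[k], 0);
  const weighted = keys.reduce((sum, k) => sum + scores[k] * WEIGHTS[k], 0);
  return weighted / totalWeight;
}

// Kill signal: composite below 3.0 means stop.
function shouldKill(scores: PriorityScores): boolean {
  return composite(scores) < 3.0;
}
```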


PRD Sections

Every AI feature needs these sections beyond the standard PRD. Sections 1–4 define the problem. Sections 5–9 define the solution. Sections 10–12 form the build contract.

1. Job Definition

What job is the AI hired for? Be specific.

| Field | Bad | Good |
| --- | --- | --- |
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
| Compound | Not mentioned | "Each sent email trains brand voice model — 100th email needs 5% edits" |

The compound test: Does using this product make this product better? If yes, describe the flywheel. If no, you're building a tool, not a moat.

2. Quality Targets

Replace binary pass/fail with distribution targets:

```
For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests
```

Worked example:

```
For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
```

See AI Product Principles for quality dimensions, three tiers, and distribution thinking.
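Targets written this way translate directly into an automatable check. A minimal sketch, assuming an eval harness already produces per-request scores for each dimension (the harness itself is not shown):

```typescript
// One distribution target: "score at least minScore for at least
// minFraction of requests" on a given quality dimension.
type Target = { dimension: string; minScore: number; minFraction: number };

// Returns true if the batch of per-request scores meets the target.
function meetsTarget(scores: number[], target: Target): boolean {
  if (scores.length === 0) return false;
  const passing = scores.filter((s) => s >= target.minScore).length;
  return passing / scores.length >= target.minFraction;
}

// From the worked example above: "Relevance: ≥ 4/5 for 85% of requests".
const relevance: Target = { dimension: "relevance", minScore: 4, minFraction: 0.85 };
```

A NEVER clause is the same check inverted: the fraction of requests showing the unacceptable outcome must stay at or below the budget.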

3. Input Universe

Map the full range of what your AI will receive. Not just the happy path.

| Input Category | Examples | Expected % | Quality Expectation |
| --- | --- | --- | --- |
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |

Questions to ask for each category:

  • What happens with sarcasm, slang, typos?
  • What happens when the user provides correct information that conflicts with training data?
  • What happens when the same question is asked five different ways?
  • What happens when the input is in a language you didn't optimize for?

4. Failure Budget

How many bad outputs are acceptable? Define this before building, not after.

| Failure Type | Budget | Response |
| --- | --- | --- |
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |

The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."
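The budget table can also be enforced mechanically. A sketch, assuming failure counts come from eval runs or production monitoring (the categorization of failures is upstream of this check):

```typescript
// Budgets from the table above, expressed as maximum allowed fractions.
const BUDGET: Record<string, number> = {
  harmful: 0.0001, // <0.01% tolerated
  wrong: 0.05,
  unhelpful: 0.15,
  imperfect: 0.4,
};

// Returns the failure types whose observed rate exceeds their budget.
function overBudget(
  failureCounts: Record<string, number>,
  totalRequests: number
): string[] {
  return Object.entries(BUDGET)
    .filter(([type, budget]) => (failureCounts[type] ?? 0) / totalRequests > budget)
    .map(([type]) => type);
}
```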

5. Refusal Spec

What should the AI refuse to do? This is as important as what it should do.

| Category | Action | Response to User |
| --- | --- | --- |
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
| Confidence below threshold | Route to human | "Let me connect you with someone who can help." |
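The refusal spec reads as a routing table, and it can be implemented as one. A sketch; the upstream classifier, its output shape, and the 0.7 threshold are all assumptions:

```typescript
const CONFIDENCE_THRESHOLD = 0.7; // assumption; tune per feature

// Output shape of a hypothetical request classifier.
type Classified =
  | { kind: "in_scope"; confidence: number }
  | { kind: "out_of_scope" }
  | { kind: "harmful" }
  | { kind: "ambiguous" };

// Maps each category to the action from the refusal spec.
function route(c: Classified): string {
  switch (c.kind) {
    case "harmful":
      return "refuse"; // standard refusal
    case "out_of_scope":
      return "redirect"; // "I can help with X. For Y, try..."
    case "ambiguous":
      return "clarify"; // ask an A-or-B question
    case "in_scope":
      return c.confidence < CONFIDENCE_THRESHOLD ? "human" : "answer";
  }
}
```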

6. Human Fallback

Every AI feature needs an escape hatch. What happens when the AI can't handle it?

| Trigger | Escalation Path | SLA |
| --- | --- | --- |
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |

7. Example Pairs

Every AI PRD must include input/output examples at each quality tier. Minimum 5 per tier. These become the seed of your golden dataset.

| Quality | Input | Expected Output | Why This Score |
| --- | --- | --- | --- |
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |

The quality of your examples determines the quality of your evals: garbage examples, garbage evals, shipping blind.

8. Eval Strategy

Written into the PRD, not treated as afterthought.

| What | How | When |
| --- | --- | --- |
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |

9. Cost, Latency, Quality Triangle

You can't have all three. Make the tradeoff explicit.

| Constraint | Implication | Decision |
| --- | --- | --- |
| Cost ceiling | Limits model size, call frequency | "Max $0.02 per request → use cached responses for top 100 queries" |
| Latency ceiling | Limits model complexity, chain length | "Must respond in <2s → single call, no chain-of-thought" |
| Quality floor | Limits cost savings | "Accuracy ≥85% on golden dataset → cannot use smallest model" |

The honest question: which constraint will you relax when two conflict? Write it down. If you don't decide now, the engineer will decide for you.


Build Contract

The PRD must contain the full feature set in a parseable format. This is the handoff that engineering builds from.

10. Feature / Function / Outcome Table

Every feature has a function (what it does), an outcome (why it matters), a job (which user need), and a state.

| # | Feature | Function | Outcome | Job | State |
| --- | --- | --- | --- | --- | --- |
| 1 | Answer Library | Store approved RFP answers by category | Auto-fill future bids from past wins | Reduce bid prep time | Gap |
| 2 | Brand Voice Model | Learn tone from sent emails | Generated content matches company voice | Maintain consistency at scale | Stub |
| 3 | Confidence Score | Display AI certainty per output | User knows when to review vs. trust | Reduce review burden | Not verified |

State values (exact enum — parseable by downstream tooling):

| State | Meaning |
| --- | --- |
| Live | Deployed, tested, in production |
| Built | Code complete, not yet deployed |
| Dormant | Built but not wired / activated |
| Partial | Some functionality working |
| Not verified | Deployed but not tested against acceptance criteria |
| Gap | Not built, needed |
| Stub | Placeholder exists, no real implementation |
| Broken | Was working, now failing |
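Because the enum is exact, downstream tooling can type-check feature tables against it. A TypeScript sketch; the row shape mirrors the table above, but its field names are an assumption:

```typescript
// The exact state enum as a string-literal union. Any other value is a
// compile-time error in tooling that consumes the feature table.
type FeatureState =
  | "Live"
  | "Built"
  | "Dormant"
  | "Partial"
  | "Not verified"
  | "Gap"
  | "Stub"
  | "Broken";

// One row of the Feature / Function / Outcome table.
interface FeatureRow {
  id: number;
  feature: string;
  function: string;
  outcome: string;
  job: string;
  state: FeatureState;
}
```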

11. Build Order

Features don't ship in parallel. Dependencies determine sequence. Each sprint references features by # from the table above.

| Sprint | Features | What | Effort | Acceptance |
| --- | --- | --- | --- | --- |
| 0 | #3 | Wire confidence scoring to all AI outputs | 2d | Score displays on every generated output |
| 1 | #1 | Seed answer library with 50 entries from past bids | 3d | User can search and retrieve answers by category |
| 2 | #2 | Train brand voice on 100 sent emails | 5d | Generated emails score ≥3/5 on brand voice eval |

The build order encodes the dependency graph. If Sprint 2 depends on Sprint 1, say so. If sprints can run in parallel, mark them.

12. Commissioning Stages

Every feature progresses through four stages. This replaces the vague "MVP → iterate" pattern.

| Stage | Definition | Gate to Next |
| --- | --- | --- |
| Install | Code deployed, feature exists | Can be invoked without errors |
| Test | Running against eval suite | Passes acceptance criteria from build order |
| Operational | Handling real traffic, monitored | Quality targets met for 7 consecutive days |
| Optimize | Tuning for cost, latency, edge cases | Improvement rate < threshold (diminishing returns) |

Track commissioning per feature:

| # | Feature | Install | Test | Operational | Optimize |
| --- | --- | --- | --- | --- | --- |
| 1 | Answer Library | Pass | Pass | | |
| 2 | Brand Voice Model | Pass | | | |
| 3 | Confidence Score | | | | |

Aggregate commissioning % = (completed stages / total stages) × 100. This is the real progress metric — not "features shipped."
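A sketch of that metric, where each feature is represented by the list of stages it has passed (the tracking table above would give 2 + 1 + 0 = 3 of 12 stages, i.e. 25%):

```typescript
// The four commissioning stages, in order.
const STAGES = ["Install", "Test", "Operational", "Optimize"] as const;

// Each inner array lists the stages one feature has completed.
// Aggregate % = completed stages / (features × 4 stages) × 100.
function commissioningPercent(features: string[][]): number {
  const completed = features.reduce((sum, passed) => sum + passed.length, 0);
  return (completed / (features.length * STAGES.length)) * 100;
}
```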


Agent-Facing Spec

If AI agents (coding agents, orchestration agents) will build from or operate within this PRD, the spec must be machine-readable, not just human-readable.

Frontmatter Contract

The PRD index file must include structured frontmatter that downstream parsers and agents can read:

```yaml
---
title: "Feature Name"
slug: feature-slug
status: planning | building | testing | operational | optimizing
priority_pain: 4
priority_demand: 3
priority_edge: 5
priority_trend: 4
priority_conversion: 3
priority_score: 3.8
kill_date: 2025-06-01
blocked_by: [identity-access]
last_scored: 2025-03-15
---
```
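A minimal sketch of reading these fields, assuming simple `key: value` lines; a real implementation would use a YAML parser, and the function name is hypothetical:

```typescript
// Extracts the frontmatter block (between leading --- fences) into a
// flat string map so tooling can gate on status, priority_score, etc.
function parseFrontmatter(doc: string): Record<string, string> {
  const match = doc.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}
```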

Commands Block

If agents will build or test this feature, include executable commands:

```markdown
## Commands
- Build: `nx run sales-crm:build`
- Test: `nx run sales-crm:test --coverage`
- Lint: `nx run sales-crm:lint`
- E2E: `nx run sales-crm-e2e:e2e`
- Eval: `nx run sales-crm:eval --dataset=golden`
```

Boundaries

What agents may do freely, must ask about first, and must never do:

```markdown
## Boundaries
- ✅ Always: Run tests before commits, follow naming conventions
- ✅ Always: Check feature state before modifying — don't break Live features
- ⚠️ Ask first: Database schema changes, adding dependencies, modifying shared libs
- 🚫 Never: Commit secrets, modify auth middleware without review, skip eval suite
- 🚫 Never: Change API contracts without updating downstream consumers
```

Test Contract

Map each feature to its acceptance test:

| # | Feature | Test File | Assertion |
| --- | --- | --- | --- |
| 1 | Answer Library | apps/sales-crm/tests/answer-library.spec.ts | Returns top 3 matches with score ≥0.8 |
| 2 | Brand Voice Model | apps/sales-crm/tests/brand-voice.eval.ts | ≥3/5 on brand dimension for 90% of golden set |
| 3 | Confidence Score | apps/sales-crm/tests/confidence.spec.ts | Score renders on all AI output components |

The PRD Checklist

Before signing off on an AI PRD:

Problem definition:

  • Job definition uses compound test — does usage improve the product?
  • Strategic gate passed — priority score ≥ 3.0 with evidence
  • Problem stated as SIO: Situation, Intention, Obstacle, Hardest Thing

Quality specification:

  • Quality targets use distributions, not binary pass/fail
  • Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
  • Failure budget defined and stakeholder-approved
  • Refusal spec covers what the AI should NOT do
  • Human fallback path defined
  • Example pairs at three quality tiers (minimum 5 per tier)
  • Cost, latency, quality tradeoffs documented

Build contract:

  • Feature/Function/Outcome table with State column (exact enum values)
  • Build order with sprint sequence, feature # refs, and acceptance criteria
  • Commissioning stages tracked per feature (Install → Test → Operational → Optimize)
  • Eval strategy included with timeline

Agent readiness:

  • Frontmatter contract with scoring fields and status
  • Commands block with exact executable commands
  • Boundaries defined (always / ask first / never)
  • Test contract mapping features to test files and assertions

Cross-checks:

  • Engineers, designers, and data scientists contributed (not PM monologue)
  • Dependencies on other PRDs declared
  • Kill date set — when does this stop making sense?

Positioning

How This Guide Differs. Current PRDs optimise for human alignment — getting stakeholders to agree on what to build. This guide adds three layers they don't address:

Distribution thinking — AI outputs are stochastic. Quality targets, failure budgets, and input universe mapping account for variance that traditional PRDs treat as bugs. See AI Product Principles.

Build contract — Feature/Function/Outcome tables with State enums, sprint sequencing with feature # refs, and commissioning stages per feature. The PRD isn't just an alignment doc — it's what engineering builds from.

Agent readiness — Frontmatter contracts, commands blocks, boundary definitions, and test contracts. The PRD is consumed by coding agents, not just human engineers. This is the "Agent Experience" layer that Addy Osmani names but doesn't fully prescribe.

Context

Questions

If AI outputs are distributions, not deterministic — when is a spec complete enough to build from?

  • What's the minimum number of example pairs per tier before evals become statistically meaningful?
  • Should the commissioning model use four stages or five — is there a gate between "Operational" and "Trusted"?
  • How do you score Edge (dimension 3) when the edge is speed-to-market rather than proprietary data?
  • What happens when the strategic gate kills a PRD that engineering has already started building?
  • Is the agent-facing spec section premature for teams without coding agents — or does writing it force the right discipline regardless?