
AI Product Requirements

How do you spec a product that never gives the same answer twice?

Traditional PRDs define behavior: "When user clicks X, system does Y." AI PRDs define boundaries and quality distributions: "Given input type X, output should score above Y on dimension Z at least N% of the time."

If your PRD reads like a traditional spec with "AI" sprinkled in, you're building on the wrong foundation.

What Changes

| Traditional PRD | AI PRD |
| --- | --- |
| Acceptance criteria: pass/fail | Quality targets: percentage above threshold |
| Define exact behavior | Define acceptable range |
| Edge cases are bugs | Edge cases are statistical certainties |
| Test before ship | Evaluate continuously |
| User stories | User stories + failure stories |
| "The system shall..." | "The system shall... N% of the time..." |

The AI PRD Template

Every AI feature needs these sections beyond the standard PRD:

1. Job Definition

What job is the AI hired for? Be specific.

| Field | Bad | Good |
| --- | --- | --- |
| Job | "Help users write better" | "Generate first draft of marketing emails that match brand voice and include a CTA" |
| Trigger | "When they need help" | "When user provides product name, audience, and goal" |
| Success | "User is happy" | "User edits less than 30% of generated text before sending" |
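
A job definition in the "Good" column is concrete enough to encode. A minimal sketch, with hypothetical field names, that keeps the job, trigger, and success metric in one structure your evals and dashboards can share:

```python
from dataclasses import dataclass

# Hypothetical structure: the PRD's job definition as data, so evals and
# dashboards reference the same success metric the document names.
@dataclass(frozen=True)
class JobDefinition:
    job: str             # what the AI is hired to do
    trigger: str         # the inputs that invoke it
    success_metric: str  # the observable signal that the job was done

EMAIL_DRAFTER = JobDefinition(
    job="Generate first draft of marketing emails that match brand voice "
        "and include a CTA",
    trigger="User provides product name, audience, and goal",
    success_metric="User edits less than 30% of generated text before sending",
)
```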

2. Quality Targets

Replace binary pass/fail with distribution targets:

For [feature], outputs must score:
- [Dimension A]: ≥ [score] for [percentage]% of requests
- [Dimension B]: ≥ [score] for [percentage]% of requests
- NEVER: [unacceptable outcome] for more than [N]% of requests

Worked example:

For email generation, outputs must score:
- Relevance: ≥ 4/5 for 85% of requests
- Brand voice: ≥ 3/5 for 90% of requests
- NEVER: factually incorrect claims for more than 1% of requests
- NEVER: competitor mentions for more than 0.1% of requests
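
Distribution targets like these can be checked mechanically. A minimal sketch, assuming each request produces a scored record from your eval pipeline; every field name is an assumption for illustration:

```python
# Check a batch of scored outputs against the targets above. Each record
# in `scores` is assumed to come from your eval pipeline.
def meets_targets(scores: list[dict]) -> bool:
    n = len(scores)
    relevance = sum(s["relevance"] >= 4 for s in scores) / n
    voice = sum(s["brand_voice"] >= 3 for s in scores) / n
    factual_bad = sum(s["factually_incorrect"] for s in scores) / n
    competitor = sum(s["mentions_competitor"] for s in scores) / n
    return (
        relevance >= 0.85        # Relevance: >= 4/5 for 85% of requests
        and voice >= 0.90        # Brand voice: >= 3/5 for 90% of requests
        and factual_bad <= 0.01  # NEVER budget: 1%
        and competitor <= 0.001  # NEVER budget: 0.1%
    )
```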

3. Input Universe

Map the full range of what your AI will receive. Not just the happy path.

| Input Category | Examples | Expected % | Quality Expectation |
| --- | --- | --- | --- |
| Clean, standard | Well-formed request, clear intent | 60% | Excellent |
| Ambiguous | Vague request, multiple interpretations | 20% | Acceptable (ask clarifying question) |
| Edge case | Extremely long, empty, wrong language | 10% | Graceful degradation |
| Adversarial | Jailbreak attempts, off-topic abuse | 5% | Safe refusal |
| Out of scope | Unrelated to product purpose | 5% | Clear redirect |

Questions to ask for each category:

  • What happens with sarcasm, slang, typos?
  • What happens when the user provides correct information that conflicts with training data?
  • What happens when the same question is asked five different ways?
  • What happens when the input is in a language you didn't optimize for?
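
The expected percentages are testable assumptions. One way to keep them honest, sketched with hypothetical category labels, is to compare the mix you observe in production against the mix the PRD assumed:

```python
from collections import Counter

# Hypothetical check: compare the input-category mix observed in
# production against the shares the PRD assumed. Large drift means the
# input universe was mapped wrong and quality targets need revisiting.
EXPECTED_MIX = {
    "clean": 0.60,
    "ambiguous": 0.20,
    "edge": 0.10,
    "adversarial": 0.05,
    "out_of_scope": 0.05,
}

def mix_drift(observed: list[str]) -> dict[str, float]:
    """Return per-category delta between observed and expected share."""
    counts = Counter(observed)
    total = len(observed)
    return {cat: counts.get(cat, 0) / total - share
            for cat, share in EXPECTED_MIX.items()}
```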

4. Failure Budget

How many bad outputs are acceptable? Define this before building, not after.

| Failure Type | Budget | Response |
| --- | --- | --- |
| Harmful (safety, legal, PII) | 0% target, <0.01% tolerated | Automated blocking, human review |
| Wrong (factually incorrect) | <5% | Flag for eval, improve pipeline |
| Unhelpful (misses intent) | <15% | Monitor, iterate on prompts |
| Imperfect (could be better) | <40% | Track trend, improve over time |

The failure budget is a leadership alignment tool. Stakeholders who think in binary need to see: "We're targeting 95% acceptable, and here's our plan for the 5%."
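
A budget written this way can also gate releases automatically. A minimal sketch, assuming you already measure per-type failure rates; the keys are hypothetical:

```python
# The failure budget as an automated gate. Each entry mirrors a row of
# the table above; all rates are fractions of total requests.
FAILURE_BUDGET = {
    "harmful":   0.0001,  # 0% target, <0.01% tolerated
    "wrong":     0.05,
    "unhelpful": 0.15,
    "imperfect": 0.40,
}

def over_budget(observed_rates: dict[str, float]) -> list[str]:
    """Return failure types whose observed rate exceeds the budget."""
    return [ftype for ftype, budget in FAILURE_BUDGET.items()
            if observed_rates.get(ftype, 0.0) > budget]
```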

5. Refusal Spec

What should the AI refuse to do? This is as important as what it should do.

| Category | Action | Response to User |
| --- | --- | --- |
| Out of scope | Refuse with redirect | "I can help with X. For Y, try..." |
| Harmful request | Refuse firmly | Standard refusal |
| Uncertain | Acknowledge uncertainty | "I'm not confident about this. Here's what I know..." |
| Ambiguous | Ask for clarity | "Could you clarify whether you mean A or B?" |
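
The refusal spec translates naturally into a routing table. A minimal sketch; the classifier below is a stand-in for whatever intent/safety model or rules engine you actually use:

```python
# Refusal spec as a routing table. Responses mirror the table above.
REFUSAL_RESPONSES = {
    "out_of_scope": "I can help with X. For Y, try...",
    "harmful": "I can't help with that request.",
    "uncertain": "I'm not confident about this. Here's what I know...",
    "ambiguous": "Could you clarify whether you mean A or B?",
}

def classify(request: str) -> str:
    # Stand-in classification so the sketch runs; replace with a real
    # classifier in practice.
    return "ambiguous" if not request.strip() else "in_scope"

def respond(request: str) -> str:
    category = classify(request)
    # Refusal categories get the PRD-specified response; everything
    # else proceeds to generation (stubbed here).
    return REFUSAL_RESPONSES.get(category, f"[generate draft for: {request}]")
```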

6. Human Fallback

Every AI feature needs an escape hatch. What happens when the AI can't handle it?

| Trigger | Escalation Path | SLA |
| --- | --- | --- |
| User reports bad output | Flag for review | 24h |
| Automated eval catches regression | Alert team, pause if critical | 1h |
| Output confidence below threshold | Route to human | Real-time |
| Safety filter triggers | Block output, log, review | Immediate |
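
Keeping the escalation table as data prevents the SLAs in the PRD and the ones the system enforces from drifting apart. A sketch, with assumed field names:

```python
from dataclasses import dataclass

# Escalation paths as data; field names are assumptions.
# sla_hours = 0 stands for real-time/immediate handling.
@dataclass(frozen=True)
class Escalation:
    path: str
    sla_hours: float

ESCALATIONS = {
    "user_report":     Escalation("flag_for_review", sla_hours=24),
    "eval_regression": Escalation("alert_team_pause_if_critical", sla_hours=1),
    "low_confidence":  Escalation("route_to_human", sla_hours=0),
    "safety_filter":   Escalation("block_log_review", sla_hours=0),
}
```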

7. Eval Strategy

Written into the PRD, not treated as an afterthought.

| What | How | When |
| --- | --- | --- |
| Golden dataset | Representative inputs with scored outputs | Before build |
| Rubric | Scoring criteria per dimension | Before build |
| Automated evals | CI/CD-integrated checks | Every change |
| Trace review | Manual analysis of failures | Weekly |
| User signal | Satisfaction, edit rate, retry rate | Continuous |
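
The "automated evals, every change" row usually means a CI gate that replays the golden dataset. A minimal sketch; `run_model` and `score` are stubs standing in for your model call and rubric scorer:

```python
# Sketch of a CI eval gate: replay the golden dataset and fail the
# build if the pass rate drops below the PRD's quality target.
def run_model(prompt: str) -> str:
    return f"draft for: {prompt}"  # stub; real code calls your model

def score(output: str, dimension: str) -> int:
    return 5  # stub; real code applies the rubric, e.g. LLM-as-judge

def eval_gate(golden_set: list[dict], threshold: float = 0.85) -> bool:
    passed = sum(
        score(run_model(case["input"]), "relevance") >= case["min_score"]
        for case in golden_set
    )
    return passed / len(golden_set) >= threshold

# Usage: eval_gate([{"input": "Email for SaaS founders...", "min_score": 4}])
```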

Example Pairs

Every AI PRD should include input/output examples at each quality tier:

| Quality | Input | Expected Output | Why This Score |
| --- | --- | --- | --- |
| Excellent | "Email for SaaS founders about our new analytics feature" | Specific, brand-voice email with relevant CTA | Matches intent, voice, actionable |
| Acceptable | "Write something about analytics" | Generic but relevant email, may need voice editing | Right direction, needs polish |
| Unacceptable | "Write something about analytics" | Email about a competitor's product | Wrong subject, could embarrass |

Include at least 5 examples per tier. These become the seed of your golden dataset.
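
Capturing each pair as a structured record from day one makes "seed of your golden dataset" literal. A sketch, with assumed field names; tiers mirror the table above:

```python
# Example pairs as structured records, ready to seed the golden dataset.
EXAMPLE_PAIRS = [
    {
        "tier": "excellent",
        "input": "Email for SaaS founders about our new analytics feature",
        "expected": "Specific, brand-voice email with relevant CTA",
        "why": "Matches intent, voice, actionable",
    },
    {
        "tier": "unacceptable",
        "input": "Write something about analytics",
        "expected": "Email about a competitor's product",
        "why": "Wrong subject, could embarrass",
    },
]
```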

The PRD Checklist

Before signing off on an AI PRD:

  • Quality targets use distributions, not binary pass/fail
  • Input universe covers all five categories (clean, ambiguous, edge, adversarial, out-of-scope)
  • Failure budget defined and stakeholder-approved
  • Refusal spec covers what the AI should NOT do
  • Human fallback path defined
  • Eval strategy included with timeline
  • Example pairs at three quality tiers
  • Cost, latency, and quality tradeoffs documented
  • Engineers, designers, and data scientists contributed (not PM monologue)
