Validate External Outcomes
A commissioning protocol: verifying shipped results against PRD expectations
When is a capability ready to ship — and how do you prove it?
The team that builds a system is never the team that commissions it. The builder knows what they intended. The commissioner checks what actually shipped.
Maturity Levels
Every Mycelium capability is scored on a 5-level maturity scale:
| Level | Meaning | Evidence Required |
|---|---|---|
| L0 | Spec only | PRD written, no build |
| L1 | Schema + API | Backend exists, no interface |
| L2 | UI connected | Users can interact |
| L3 | Tested | Automated verification + intent spec passes |
| L4 | Commissioned | Independent verification against PRD criteria |
Status Vocabulary
Exact definitions. No synonyms. If engineering and the Dream Team use different words for the same state, the report lies.
| Status | Meaning | Evidence | NOT the same as |
|---|---|---|---|
| Gap | Identified need, no PRD | Mentioned in another PRD or index | Spec |
| Spec | PRD written, features undefined or unscored | PRD index.md exists | Spec draft |
| Spec draft | PRD exists, features listed but incomplete | Feature table present, gaps noted | L0 |
| Spec complete | PRD fully specified, ready for engineering | All sections filled, scored | L0 |
| L0 | Features assessed, all scored as Gap | Feature table with Gap/Done per feature | Spec |
| L1 | Schema + API deployed | Backend responds, no UI | L0 |
| L2 | UI connected, users can interact | Pages render, forms submit | L1 |
| L3 | Tested, automated + intent verification | E2E tests pass, intent spec verified, evidence captured | L2 |
| L4 | Commissioned by independent verification | Commissioner grants after browser walkthrough | L3 |
The critical distinction: "Spec" means features haven't been individually assessed. "L0" means they HAVE been assessed and scored as Gap. A PRD at L0 has more structure than one at Spec — it knows exactly what's missing.
Feature vs capability maturity: Feature commissioning (Install → Test → Operational → Optimize) tracks individual features within a capability. Capability maturity (L0-L4) tracks the aggregate. A capability at L2 may have features at Install, Test, and Gap simultaneously.
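The two scales can be made concrete in code. A minimal TypeScript sketch: the stage and level names come from this page, but the aggregation rule (a critical feature still at Gap caps the capability at L2; L4 requires independent sign-off) is an illustrative assumption, not the documented algorithm.

```typescript
// Feature commissioning stages and capability maturity, as named on this page.
type FeatureStage = "Gap" | "Install" | "Test" | "Operational" | "Optimize";
type CapabilityLevel = 0 | 1 | 2 | 3 | 4;

interface Feature {
  name: string;
  stage: FeatureStage;
  critical: boolean;
}

// Illustrative aggregation rule (an assumption): any critical feature still at
// Gap caps the capability at L2; without commissioning the ceiling is L3.
function maxClaimableLevel(features: Feature[], commissioned: boolean): CapabilityLevel {
  const hasCriticalGap = features.some(f => f.critical && f.stage === "Gap");
  if (hasCriticalGap) return 2;
  return commissioned ? 4 : 3;
}
```

A capability with features at Install, Test, and Gap simultaneously reports L2 here, matching the example above.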
The Process
How a capability moves from L0 to L4:
- L0: Spec only. PRD written, features defined, success criteria set, kill signal set.
- L1: Schema + API. Backend exists, schema deployed, API endpoints live, integration tests pass.
- L2: UI connected. Users can interact, CRUD works, workflows complete, manual QA passes.
- L3: Tested. Automated tests in place, E2E suite passes, performance gates met, regression suite green.
- L4: Commissioned. Independent verification: the commissioner reads the spec, opens the browser, checks each feature, and records Pass / Fail with evidence.
The Protocol
- Commissioner reads the PRD (not the code)
- Commissioner opens the live application
- For each row in the Feature / Function / Outcome table:
  - Can the feature be found?
  - Does the function work as specified?
  - Does the outcome match the success criteria?
- Record Pass / Fail with evidence (screenshot, recording, measurement)
- Update the PRD commissioning table
- If all critical features Pass: capability is L4 Commissioned
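The protocol reduces to data plus two guards. A hedged TypeScript sketch; the record shape and helper names are illustrative, not an existing API:

```typescript
type Verdict = "Pass" | "Fail";

interface CommissioningRecord {
  feature: string;   // row in the Feature / Function / Outcome table
  verdict: Verdict;
  evidence: string;  // link to screenshot, recording, or measurement
}

// The commissioner is never the builder.
function assertIndependent(builder: string, commissioner: string): void {
  if (builder === commissioner) {
    throw new Error("commissioner must be independent of the builder");
  }
}

// L4 requires every critical feature to Pass, with evidence attached.
function isL4(records: CommissioningRecord[], criticalFeatures: string[]): boolean {
  return criticalFeatures.every(name => {
    const rec = records.find(r => r.feature === name);
    return rec !== undefined && rec.verdict === "Pass" && rec.evidence.length > 0;
  });
}
```

Note that `isL4` checks coverage as well as verdicts: a critical feature with no record at all blocks commissioning, not just a failing one.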
The commissioner is never the builder. The builder knows what they intended. The commissioner checks what actually shipped.
The Loop
Read SPEC-MAP (what should pass, what tests exist)
-> Check test coverage: any empty Test File cells = BLOCKER before L4
-> Navigate to deployed URL
-> Walk each feature row
-> Verify pass/fail with evidence (screenshot, GIF, console, network)
-> Update SPEC-MAP columns 5-6 (L-Level, Last Verified)
-> Update commissioning dashboard with findings
-> Gap between spec and reality drives next priority
The SPEC-MAP adds a step before browser verification: check that engineering has test coverage for every Story Contract row. A feature that works on the deployed site but has no automated test is L3 at best — one deploy away from invisible regression. The SPEC-MAP makes this gap visible before commissioning starts.
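The coverage check can run automatically before the walkthrough begins. A sketch, assuming a SPEC-MAP row shaped like the table described here (field names are assumptions):

```typescript
interface SpecMapRow {
  story: string;
  testFile: string;     // empty = no automated test yet
  level: string;        // column 5: L-Level
  lastVerified: string; // column 6: Last Verified
}

// Any row without a test file is a blocker: the feature may work on the
// deployed site today, but it is one deploy away from invisible regression.
function coverageBlockers(rows: SpecMapRow[]): string[] {
  return rows.filter(r => r.testFile.trim() === "").map(r => r.story);
}
```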
Evidence Gates
Each level transition requires specific evidence. The builder claims, the commissioner verifies.
| Transition | Evidence Type | Minimum | Who Claims | Who Verifies |
|---|---|---|---|---|
| L0→L1 | Schema matches spec | DB introspection output | Engineering | Engineering (schema is binary) |
| L1→L2 | UI renders and connects | Screenshot/GIF of CRUD flow | Engineering | Commissioner (Dream Team) |
| L2→L3 | Tests pass against spec | CI output, e2e suite green | Engineering | CI pipeline (automated) |
| L3→L4 | Independent verification | Commissioner walkthrough + evidence link | Commissioner | Commissioner (different from builder) |
| Any→Broken | Reproduction evidence | Bug report with steps | Anyone | Engineering confirms |
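The gate table can be enforced mechanically. A TypeScript sketch (the claim shape is an assumption) checking the two invariants the transitions share: evidence attached, and for the L4 transition a verifier who is not the builder:

```typescript
type Level = "L0" | "L1" | "L2" | "L3" | "L4";

interface GateClaim {
  from: Level;
  to: Level;
  evidence: string;   // link: CI output, screenshot, walkthrough notes
  builtBy: string;
  verifiedBy: string;
}

function gateErrors(claim: GateClaim): string[] {
  const errors: string[] = [];
  if (claim.evidence.trim() === "") {
    errors.push("every transition requires evidence");
  }
  // L3->L4 is independent verification: the verifier cannot be the builder.
  if (claim.to === "L4" && claim.verifiedBy === claim.builtBy) {
    errors.push("L4 verifier must differ from the builder");
  }
  return errors;
}
```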
Commissioning is Dream Team's final responsibility. You defined what creates value (steps 1-3 in the pipeline). Engineering built it (step 4). Now you verify the delivery matches the spec. The gap between what was specified and what was built is the honest error signal — it feeds back into the next cycle.
Per-Feature Checklist
For each row in a PRD's commissioning table:
- Navigate — Can you reach the feature from the expected entry point?
- Happy path — Does the primary workflow complete successfully?
- Output correct — Does the result match the PRD's stated outcome?
- Error handling — Does a bad input produce a clear error, not a crash?
- Intent verified — If agentic: agent action stayed within declared scope (constraints, budget, permissions)
- Evidence captured — GIF or screenshot proving the above
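The checklist maps naturally onto a record with one field per item. A sketch; field names are illustrative:

```typescript
interface FeatureChecklist {
  navigate: boolean;              // reachable from the expected entry point
  happyPath: boolean;             // primary workflow completes
  outputCorrect: boolean;         // result matches the PRD's stated outcome
  errorHandling: boolean;         // bad input yields a clear error, not a crash
  intentVerified: boolean | null; // null when the feature is not agentic
  evidenceUrl: string;            // GIF or screenshot proving the above
}

function checklistPasses(c: FeatureChecklist): boolean {
  const intentOk = c.intentVerified !== false; // non-agentic features skip the check
  return c.navigate && c.happyPath && c.outputCorrect &&
    c.errorHandling && intentOk && c.evidenceUrl.length > 0;
}
```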
Verification Channels
Each channel gets validated differently:
| Channel | What to Verify |
|---|---|
| Web UI | Features work as specified in PRD |
| API routes | Endpoints return correct data, response shape + status codes |
| A2A protocol | Agent Card discoverable, Task Cards accepted, task lifecycle responds correctly |
| Console health | No errors, no warnings in critical paths |
| Agent intent | Agent action matches declared scope — constraints, budget, permissions |
For guidance on which browser tool to use for each channel, see the tool selection guide.
Flight Readiness
Before any capability ships to production, it must pass eight gates. Adapted from factory pre-flight inspection.
| Gate | Criteria | Test | Applies To |
|---|---|---|---|
| G1: Config | Version locked, zero uncommitted changes | git status clean on deploy branch | All |
| G2: Types | Zero TypeScript errors, strict mode | pnpm nx typecheck [app] | All |
| G3: Security | Auth + rate limits + CSP configured | Action validation audit | All |
| G4: Tests | Pass rate above threshold, documented skips | pnpm nx test [app] | All |
| G5: Performance | P95 response time within budget | Latency measurement under load | All with UI |
| G6: Observability | Four Golden Signals monitored | Latency, Traffic, Errors, Saturation | Production apps |
| G7: AI Safety | Prompt injection mitigated, hallucination bounded | Validation layer audit | AI capabilities |
| G8: Ops Ready | Rollback tested, runbook exists | Deployment verification | Production apps |
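One way to drive the eight gates from code, shown as a hedged sketch. Each gate's actual check (git status, typecheck, test run) happens elsewhere; this only aggregates results and encodes the Applies To column:

```typescript
interface Capability { hasUI: boolean; isAI: boolean; production: boolean; }

interface GateResult {
  id: string;                          // G1..G8
  applies: (c: Capability) => boolean; // encodes the "Applies To" column
  passed: boolean;                     // outcome of the real check, run elsewhere
}

// A capability is flight-ready only when every applicable gate passed.
function flightReady(results: GateResult[], cap: Capability): { ready: boolean; failures: string[] } {
  const failures = results
    .filter(g => g.applies(cap) && !g.passed)
    .map(g => g.id);
  return { ready: failures.length === 0, failures };
}
```

A backend-only capability skips G5 this way: `applies: c => c.hasUI` keeps a failed performance gate from blocking a capability with no UI.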
Golden Signals
Four signals for G6 observability:
| Signal | Metric | Threshold |
|---|---|---|
| Latency | P95 response time | Under 3s API, under 10s AI |
| Traffic | Concurrent users | Over 50 supported |
| Errors | Error rate | Under 5% |
| Saturation | Function timeout | Under 80% |
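The four thresholds are simple enough to check in code. A sketch; the measurement field names are assumptions:

```typescript
interface Signals {
  latencyP95Ms: number;             // measured P95 response time
  isAI: boolean;                    // AI endpoints get the 10s budget
  supportedConcurrentUsers: number;
  errorRate: number;                // fraction, 0..1
  saturationPct: number;            // function timeout utilisation, 0..100
}

// Returns the names of any Golden Signals outside their thresholds.
function signalViolations(s: Signals): string[] {
  const v: string[] = [];
  if (s.latencyP95Ms >= (s.isAI ? 10_000 : 3_000)) v.push("Latency");
  if (s.supportedConcurrentUsers <= 50) v.push("Traffic");
  if (s.errorRate >= 0.05) v.push("Errors");
  if (s.saturationPct >= 80) v.push("Saturation");
  return v;
}
```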
Phase to Level
How the venture algorithm maps to engineering maturity:
| Algorithm Phase | Typical L-Level | What's Happening |
|---|---|---|
| SCAN-DISCOVER | -- | No build. Exploring. |
| VALIDATE | L0 | Spec written, scored, kill signals identified |
| MODEL-FINANCE | L0-L1 | Business model selected, financial models built |
| STRATEGY | L1 | Positioning defined, GTM planned |
| PITCH-SELL | L1-L2 | Persuasion assets created, users can interact |
| MEASURE | L2+ | Feedback loop operational, scorecard active |
Verifiable Intent
Commissioning IS verifiable intent applied to software delivery. The delegation chain maps directly:
| VI Layer | Commissioning | What It Proves |
|---|---|---|
| L1 Identity | PRD author | Who specified the capability |
| L2 Intent | PRD spec + success criteria | What was authorized to be built |
| L3 Action | Engineering build | What was actually shipped |
| Verification | Commissioner walkthrough | Did action match intent? |
The builder (agent) acts within the PRD (intent). The commissioner (verifier) checks the delegation chain: the spec matches the build, and the build matches the outcome. When agents build features, the same three-layer proof applies: L2 intent constraints become machine-verifiable acceptance criteria. A capability without an intent spec cannot reach L3.
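The three-layer proof becomes checkable once the L2 constraints are encoded as data. A sketch with illustrative field names; the real constraint vocabulary would come from the intent spec:

```typescript
// L2 intent: the scope the PRD author authorized.
interface IntentScope {
  allowedActions: string[];
  budgetUsd: number;
  permissions: string[];
}

// L3 action: what the building agent actually did.
interface AgentAction {
  action: string;
  costUsd: number;
  permissionsUsed: string[];
}

// Verification layer: did the action stay inside the declared scope?
function withinScope(scope: IntentScope, act: AgentAction): boolean {
  return scope.allowedActions.includes(act.action)
    && act.costUsd <= scope.budgetUsd
    && act.permissionsUsed.every(p => scope.permissions.includes(p));
}
```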
Context
- Verifiable Intent — The authorization proof protocol that commissioning implements
- Commissioning — The principle: why independent verification matters, across domains
- AI Browser Tools — Tool selection for browser-based commissioning
- Commissioning Dashboard — Live status of every capability
- Work Prioritisation — Scoring algorithm, rubrics, gates
- Business Factory Requirements — The capability catalogue
- PRDs — How to spec capabilities
- Benchmark Standards — Trigger-based benchmark protocol
- Flow Engineering — After L4, stories become maps that produce code artifacts
- Cost of Quality — Enforcement tier metrics
- Cost Escalation — The 10x multiplier
Questions
- When is a capability ready to ship — and how do you prove it without building it yourself?
- At what maturity level does a capability start generating revenue — and is L4 even necessary for first customers?
- Should flight readiness gates differ by capability type (platform vs product vs agent)?
- What's the cost of skipping L3 (tested) and going straight from L2 (UI connected) to L4 (commissioned)?
- How do you commission an AI capability when its outputs are distributions, not deterministic?