Automated Commissioning
When the feature matrix has 210 features and states drift within days of manual editing — feature states must be computed from test results, not hand-edited.
Why should I care?
Five cards that sell the dream
Same five positions. Different seat. The commissioner asks "can I trust it?" The builder asks "does it catch my regressions?"
How do we build this?
Five cards that sell the process
210 features hand-edited in feature-matrix.json. States drift. CRM claims L3 with no Story Contract. Manual tracking hits ceiling at 50 features. Every PRD review requires manual state checks that take 30+ minutes and are usually skipped.
Feature states are a function of test results. Run the script, get the truth. feature-matrix.json updated by script within 5 minutes of merge to main. Every state change has an evidence trail.
No bridge connects PRD Build Contract artifacts to test execution. Test files don't declare which feature they verify. Three FAVV formats coexist. The mapping fidelity is the hardest thing — weak mapping produces states that look computed but are just as wrong as manual edits.
A weak mapping (wrong test files mapped to wrong features) produces states that look computed but are just as wrong as manual edits — except now with false confidence. Under-report over over-report.
Priority (5P)
Readiness (5R)
What Exists
| Component | State | Gap |
|---|---|---|
| PRD specs with FAVV Build Contracts | Working | 3 PRDs have full Story+Build. Not parsed by any commissioning script. |
| feature-matrix.json | Working | Powers feature-matrix.jsx page. Hand-edited. |
| Vitest in engineering repo | Working | Runs in CI but not scoped to feature IDs. |
| FAVV parser in project-from-prd | Working | Reads Build Contracts for engineering tasks. Doesn't extract feature-to-test mapping. |
| Playwright in Nx monorepo | Working | Nx plugin installed, e2e targets inferred. Not connected to commissioning. |
| Agent receipt schema | Stub | v1.0 defined. No commissioning script emits receipts. |
Kill Signal
Script states contradict manual commissioner judgment on >20% of features after 3 runs.
Questions
What breaks first when the script disagrees with a human commissioner?
- If 30% of features are unmapped, is the script's output trustworthy enough to replace manual edits?
- Should the script refuse to write results when unmapped percentage exceeds a threshold?
- Is the Safety Test gate (blocking L3) too conservative — or not conservative enough?