Value Stories
Nine stories across five groups. Each story is a test contract — RED before implementation, GREEN when value is delivered. Each story lists the pain today, its GREEN contract, what the CONTROLLER verifies, and its RED signals.
Does the matrix update itself?
PR merged to main. feature-matrix.json still shows states from last manual edit. No idea if they're still true. Have to open the app, check each feature, type the new level.
GREEN: feature-matrix.json updates automatically within 5 minutes of merge. Only features whose mapped tests were touched have state changes.
Verify: Git blame shows script commit, not manual edit. No feature reaches L3 when its test results are failing.
RED: States change for untouched features. Feature marked L3 with failing tests.
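The "no L3 with failing tests" invariant is easiest to hold if the script recomputes states from the latest results instead of patching the previous matrix. A minimal sketch of that next-state rule (names like `FeatureRun` and `nextLevel` are illustrative, not the script's actual API):

```typescript
// Hypothetical per-feature summary from one commission run.
interface FeatureRun {
  featureId: string;
  testsTouched: boolean;    // did the merge change any mapped test files?
  allTestsPassing: boolean; // latest result of the mapped tests
}

// Untouched features keep their state; a feature with any failing
// mapped test can never land on L3, whatever level was proposed.
function nextLevel(prev: number, run: FeatureRun, proposed: number): number {
  if (!run.testsTouched) return prev; // only touched features change state
  return run.allTestsPassing ? proposed : Math.min(proposed, 2);
}
```

Because the rule is a pure function, both RED conditions (untouched features changing, L3 with failing tests) are unit-testable before the script ever writes feature-matrix.json.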
Commission script run completes. 60 of 210 features have no test file mapping. They silently sit at L0. Report says nothing about them.
GREEN: Report shows unmapped count with specific feature IDs. No unmapped feature silently stays at L0 without a flag.
Verify: Script does not report 100% coverage when mappings are missing. Every unmapped feature visible in report.
RED: Unmapped features hidden. Coverage report lies.
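A report that cannot claim 100% coverage while mappings are missing falls out of counting the unmapped set explicitly. A sketch, with all names assumed:

```typescript
// A feature's mapping entry; testFile is absent when no mapping exists.
interface FeatureMapping {
  featureId: string;
  testFile?: string;
}

interface CoverageReport {
  mappedCount: number;
  unmappedIds: string[]; // specific IDs, so no feature hides at L0 silently
  coveragePct: number;   // share of features with a mapping, not a pass rate
}

function buildCoverageReport(features: FeatureMapping[]): CoverageReport {
  const unmappedIds = features
    .filter((f) => !f.testFile)
    .map((f) => f.featureId);
  const mappedCount = features.length - unmappedIds.length;
  const coveragePct =
    features.length === 0 ? 0 : Math.round((100 * mappedCount) / features.length);
  return { mappedCount, unmappedIds, coveragePct };
}
```

In the 60-of-210 scenario above, this reports 71% mapping coverage with all 60 IDs listed, instead of a silent 100%.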
Can I trust what the matrix says?
Safety Test assertion fails in test run. Feature still reaches L3 because only Success Test is checked.
GREEN: Feature state capped at L2 regardless of Success Test results when Safety Test fails.
Verify: Safety Test failure logged with feature ID. Feature cannot reach L3 with active safety violation.
RED: Feature reaches L3 with active safety violation. Failures silently ignored.
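The safety gate is small enough to state as a pure function: a failed Safety Test caps the proposed level at L2 and records the feature ID. A sketch with hypothetical names:

```typescript
interface SafetyResult {
  featureId: string;
  safetyPassed: boolean;
}

// A failing Safety Test caps the feature at L2 regardless of the
// Success Test, and the violation is logged with the feature ID.
function applySafetyGate(
  proposed: number,
  result: SafetyResult,
  log: string[],
): number {
  if (result.safetyPassed) return proposed;
  log.push(`safety violation: ${result.featureId} capped at L2`);
  return Math.min(proposed, 2);
}
```

The RED condition "failures silently ignored" then reduces to asserting the log is non-empty whenever the cap fired.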
Previously-passing test starts failing after a refactor. Feature state stays at L3 because states only go up, never down.
GREEN: Feature state moves from L3 to L2 (or lower) in next commission run. Demotion logged with evidence.
Verify: States can go down. No one-way ratchet. Demotion logged with previous state, new state, and failing test file.
RED: State stays at L3 after tests break. Demotion applied silently without logging.
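Demotion with evidence can be produced by diffing the previous matrix against the freshly computed one. An illustrative sketch (the record shape is an assumption, not the script's schema):

```typescript
interface DemotionRecord {
  featureId: string;
  previousLevel: number;
  newLevel: number;
  failingTestFile: string;
}

// Emits one record per feature whose level went down, carrying the
// evidence the contract demands: previous state, new state, failing file.
function collectDemotions(
  previous: Record<string, number>,
  next: Record<string, number>,
  failingFileByFeature: Record<string, string>,
): DemotionRecord[] {
  const records: DemotionRecord[] = [];
  for (const [featureId, previousLevel] of Object.entries(previous)) {
    const newLevel = next[featureId] ?? previousLevel;
    if (newLevel < previousLevel) {
      records.push({
        featureId,
        previousLevel,
        newLevel,
        failingTestFile: failingFileByFeature[featureId] ?? "unknown",
      });
    }
  }
  return records;
}
```

Because the diff is computed before the write, "demotion applied silently" becomes impossible by construction: the write and the log come from the same records.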
Unit test with mocked DB passes. Commission script counts it as integration evidence. Feature reaches L3 without real infrastructure tested.
GREEN: Unit tests (mocked DB/server) logged but not used for L-level state changes. Only integration and e2e results feed computation.
Verify: L3 means real infrastructure was tested, not mocks. Unit test exclusion does not affect unit test reporting elsewhere.
RED: Unit test with mocked DB counted as integration evidence. Hook test with stubbed API affects L-level.
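One way to keep unit results visible but inert is to filter evidence by kind at the boundary of the level computation. A sketch (type names assumed):

```typescript
type EvidenceKind = "unit" | "integration" | "e2e";

interface TestEvidence {
  kind: EvidenceKind;
  passed: boolean;
}

// Unit results (mocked DB/server) stay in the full list for reporting;
// only integration and e2e evidence reaches the L-level computation.
function levelEvidence(all: TestEvidence[]): TestEvidence[] {
  return all.filter((e) => e.kind !== "unit");
}
```

Keeping the filter at the boundary leaves unit reporting elsewhere untouched, which is exactly the second verification clause above.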
Does the parser handle all PRD formats?
PRD uses FFO format (6 columns) or FAVV v2.0. Parser only handles v2.1 (9 columns). Older PRDs silently skipped.
GREEN: Parser extracts feature-to-test mappings from FFO, v2.0, and v2.1 formats.
Verify: v2.0 and FFO PRDs produce mappings. Partial results returned when table is malformed — no crash, log warning.
RED: Crash on FFO. Silent skip of v2.0. Only v2.1 produces mappings.
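Format dispatch plus skip-with-warning row handling covers both contract clauses. In the sketch below, the 6- and 9-column widths come from the story; using a declared version marker to spot v2.0 is an assumption, since its column count is not given:

```typescript
type PrdFormat = "ffo" | "v2.0" | "v2.1" | "unknown";

// Dispatch on a declared version marker first, then on column count.
function detectFormat(headerCells: string[], declaredVersion?: string): PrdFormat {
  if (declaredVersion === "2.0") return "v2.0";
  if (headerCells.length === 6) return "ffo";
  if (headerCells.length === 9) return "v2.1";
  return "unknown";
}

// Partial parsing: malformed rows are skipped with a warning, never a crash.
function keepWellFormedRows(
  rows: string[][],
  expectedWidth: number,
  warnings: string[],
): string[][] {
  return rows.filter((row, i) => {
    if (row.length === expectedWidth) return true;
    warnings.push(`row ${i}: ${row.length} cells, expected ${expectedWidth}`);
    return false;
  });
}
```

"Unknown" is returned rather than guessed, so an unrecognized PRD surfaces in the report instead of being silently skipped.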
Does browser verification work?
Feature has e2e Playwright spec. Unit tests pass. But e2e spec fails because the UI is broken. Feature still shows L3.
GREEN: Feature with passing unit tests but failing e2e is capped at L2. Playwright runs spec headlessly with trace + screenshot.
Verify: E2e results feed L-level computation alongside integration results. Traces archived as commission evidence.
RED: Feature reaches L3 from unit tests alone when e2e spec exists but fails. Playwright runs headed in CI.
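The headless + trace + screenshot requirement maps onto options Playwright exposes in its shared `use` block. A minimal config sketch, with project-specific settings omitted:

```typescript
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    headless: true,   // the RED condition above: headed runs in CI are a bug
    trace: "on",      // record a trace per test, archived as commission evidence
    screenshot: "on", // capture screenshots alongside the trace
  },
});
```

`trace: "retain-on-failure"` is the cheaper variant if archiving every trace proves too heavy; the contract above only demands that failing e2e evidence survives the run.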
E2e spec uses page.route() to intercept real API calls with canned responses. Test passes but proves nothing about real behavior.
GREEN: AST scan flags page.route() and page.fulfill() usage. Report lists counterfeit specs with file path and line number.
Verify: Mock-route specs not counted as passing evidence. Scanner does not modify spec files.
RED: Mock-route specs pass the scanner. Scanner false-positives on legitimate test setup.
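The reporting shape can be prototyped as a line scan before the AST pass exists. Note the regexes below can false-positive on comments and string literals, which is precisely why the RED clause pushes the shipped scanner toward matching AST call expressions. Names are illustrative:

```typescript
interface CounterfeitFlag {
  file: string;
  line: number;   // 1-based, as the report demands
  pattern: string;
}

// Patterns named in the story; a real scanner should match these as
// AST call expressions to avoid flagging comments or string literals.
const MOCK_ROUTE_PATTERNS = [/\bpage\.route\s*\(/, /\bpage\.fulfill\s*\(/];

// Read-only scan: flags counterfeit evidence, never edits the spec file.
function scanSpec(file: string, source: string): CounterfeitFlag[] {
  const flags: CounterfeitFlag[] = [];
  source.split("\n").forEach((text, i) => {
    for (const pattern of MOCK_ROUTE_PATTERNS) {
      if (pattern.test(text)) {
        flags.push({ file, line: i + 1, pattern: pattern.source });
      }
    }
  });
  return flags;
}
```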
Does the dashboard show commissioning health?
Commission script completes. Project dashboard in Plans UI shows no commissioning data. Have to read feature-matrix.json manually.
GREEN: commissioning_results Convex table updated. listProjectsWithStats returns commissioningLevel per project.
Verify: Project with no commission run returns commissioningLevel: null — not 0. Dashboard reads from Convex, not filesystem.
RED: Dashboard shows commissioningLevel: 0 when no commission run occurred. Script writes to Convex in dry-run mode.
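The null-versus-0 distinction is worth pinning down in one place: null means the script has never commissioned the project, 0 means it ran and nothing passed. A sketch of that aggregation (treating the project level as the weakest feature level is an assumption, not the documented rule):

```typescript
// featureLevels is empty when no commission run has written results yet.
function commissioningLevel(featureLevels: number[]): number | null {
  if (featureLevels.length === 0) return null; // never commissioned, not L0
  return Math.min(...featureLevels);           // assumed: weakest feature wins
}
```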
Kill Signal
Script states contradict manual commissioner judgment on >20% of features after 3 runs.
Who
- Commissioner — verifies feature states after engineering ships. 30+ min manual inspection, usually skipped.
- Engineering agent — needs matrix updated automatically after merge. Forgets to update JSON manually.
- PRD author — needs unmapped report to fix missing Artifact paths in Build Contract.
Questions
- What breaks first when the script disagrees with a human commissioner?
- If 30% of features are unmapped, is the script's output trustworthy enough to replace manual edits?
- Should the script refuse to write results when unmapped percentage exceeds a threshold?
- Is the Safety Test gate (blocking L3) too conservative — or not conservative enough?