Value Stories
Six stories across four groups. Each story is a test contract — RED before implementation, GREEN when value is delivered. The CONDUCTOR wires the factory.
Can the loop measure what it's optimising?
Pain: creating a PRD and defining its North Star. The current metric-definer accepts prose formulas with no named data source, no threshold, and no unit. The resulting metric is uncheckable.
Fix: a metric-definer that enforces a computable formula, with a named data source, threshold, and unit, at creation time.
GREEN: a PRD is blocked until its metric is queryable, so every new PRD enters with a measurable target. Currently: 0/7 Active PRDs have queryable metrics.
Still RED: a prose metric is accepted without a formula, or a metric with no data source passes the gate.
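A minimal sketch of that creation-time gate, assuming the metric arrives as a dict parsed from PRD frontmatter. The type and function names here are illustrative, not the metric-definer's actual API:

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("formula", "data_source", "threshold", "unit")

@dataclass
class NorthStarMetric:
    formula: str       # computable expression, not prose
    data_source: str   # named source the formula runs against
    threshold: float   # target the loop optimises toward
    unit: str          # e.g. "percent", "ms", "count/week"

def validate_metric(raw: dict) -> NorthStarMetric:
    """Creation-time gate: reject any metric missing a required field,
    so a PRD can't enter the loop with an uncheckable target."""
    missing = [f for f in REQUIRED_FIELDS if not raw.get(f)]
    if missing:
        raise ValueError(f"metric not queryable; missing: {', '.join(missing)}")
    return NorthStarMetric(
        formula=raw["formula"],
        data_source=raw["data_source"],
        threshold=float(raw["threshold"]),
        unit=raw["unit"],
    )
```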
Pain: running measure-north-star for a PRD. The CLI namespace exists but returns nothing, so an agent can't determine whether its iteration improved or regressed the metric.
Fix: a CLI command that returns a scalar metric value for any PRD with a queryable formula.
GREEN: an agent runs measure-north-star --prd=sales-crm and gets a number. Currently: the command returns 'not implemented'.
Still RED: the command returns a number, but it is hardcoded or cached rather than computed from the named data source.
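A sketch of the command's contract, with argparse standing in for the real CLI layer. Both helpers are hypothetical stubs; the hardcoded values in evaluate mark exactly what a real implementation must replace with a query against the named data source:

```python
import argparse

def load_metric(prd_id: str) -> dict:
    # Stand-in for the real frontmatter loader; in the loop this would
    # parse the PRD file and return its validated metric definition.
    return {"formula": "won_deals / all_deals", "data_source": "crm.deals"}

def evaluate(metric: dict) -> float:
    # Stand-in for running the formula against the named data source.
    # Returning constants here is precisely the 'still RED' failure mode:
    # the real command must compute this from metric["data_source"].
    won_deals, all_deals = 12, 80
    return won_deals / all_deals

def main() -> None:
    parser = argparse.ArgumentParser(prog="measure-north-star")
    parser.add_argument("--prd", required=True, help="PRD id, e.g. sales-crm")
    args = parser.parse_args()
    print(evaluate(load_metric(args.prd)))  # exactly one scalar on stdout

if __name__ == "__main__":
    main()
```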
Do validated outcomes compound into the next cycle?
Pain: engineering validates a PRD (RED to GREEN) and proof-to-story writes a meta article, but the actual number never flows back to the frontmatter where score-prds can read it.
Fix: proof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmatter.
GREEN: the next score-prds run reads the validated numbers from frontmatter, so proven demand compounds into priority.
Still RED: the fields are written but score-prds never reads them; the numbers flow but don't influence ranking.
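A sketch of the write-back, assuming PRDs are markdown files with YAML frontmatter and PyYAML is available. Only the three field names come from the story; the file layout and function name are assumptions:

```python
import re
import yaml  # PyYAML

def write_validation(prd_path: str, actual: float, outcome: str, delta: float) -> None:
    """Merge the validated numbers into the PRD's YAML frontmatter so the
    next score-prds run can read them."""
    text = open(prd_path, encoding="utf-8").read()
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.S)
    front = (yaml.safe_load(match.group(1)) or {}) if match else {}
    body = match.group(2) if match else text
    front.update(
        actual_metric_value=actual,
        validation_outcome=outcome,      # "pass" or "fail"
        pain_reduction_delta=delta,
    )
    with open(prd_path, "w", encoding="utf-8") as f:
        f.write("---\n" + yaml.safe_dump(front) + "---\n" + body)
```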
Pain: score-prds runs on a PRD with validation_outcome in its frontmatter, yet validated and unvalidated PRDs score identically. Evidence quality is ignored.
Fix: validated PRDs receive a confidence boost; failed PRDs are flagged for re-evaluation.
GREEN: a PRD with validation_outcome: pass scores higher than an identical PRD without it. Currently: both score the same.
Still RED: the boost is applied but trivial, failed PRDs go unflagged, or the validation status is present but the scoring formula is unchanged.
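One way the boost could look. The 1.25 and 0.5 factors are illustrative placeholders, not tuned values; only the frontmatter field names come from the story:

```python
def score_prd(base_score: float, frontmatter: dict) -> tuple[float, list[str]]:
    """Weight a PRD's base score by evidence quality."""
    flags: list[str] = []
    outcome = frontmatter.get("validation_outcome")
    if outcome == "pass":
        return base_score * 1.25, flags     # proven demand compounds into priority
    if outcome == "fail":
        flags.append("re-evaluate")         # failed proof must surface, not vanish
        return base_score * 0.5, flags
    return base_score, flags                # unvalidated: scored as before
```

The GREEN check falls out directly: two otherwise identical PRDs diverge in score exactly when one carries validation_outcome: pass.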
Can the operator trust what happened overnight?
Pain: an agent session completes overnight and the results are scattered across 20+ Comms messages with no aggregation. I wake to noise, not signal.
Fix: session-experiment-logger reads Comms, aggregates deltas, and writes one structured log per session.
GREEN: the morning report is diff-readable: what changed, what improved, what regressed. One file per session.
Still RED: a log is produced, but it copies Comms messages verbatim with no aggregation and no delta computation.
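A sketch of the aggregation step, the part the 'still RED' line guards against skipping. The message shape ({'metric', 'value', 'ts'}) and the 'higher is better' reading of regression are assumptions:

```python
from collections import defaultdict

def summarise_session(messages: list[dict]) -> dict:
    """Collapse one session's Comms messages into a single structured
    record: first value, last value, and delta per metric."""
    series: dict[str, list[float]] = defaultdict(list)
    for msg in sorted(messages, key=lambda m: m["ts"]):
        series[msg["metric"]].append(msg["value"])
    return {
        name: {
            "start": vals[0],
            "end": vals[-1],
            "delta": vals[-1] - vals[0],
            "regressed": vals[-1] < vals[0],  # assumes higher is better
        }
        for name, vals in series.items()
    }

# One diff-readable file per session, e.g.:
# json.dump(summarise_session(msgs), open(f"logs/{session_id}.json", "w"), indent=2)
```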
Does one trigger run the full cycle?
Pain: the operator triggers 'run loop' on the top uncommissioned PRDs, but the cycle currently requires running 7 separate skills manually, in sequence.
Fix: loop-orchestrator chains scaffold, activate, validate, measure, and story, with a budget cap and a regression halt.
GREEN: one trigger runs the full cycle; the loop stops at budget, halts on regression, and checks the metric before each iteration.
Still RED: the loop runs but skips the metric check, continues past budget, ignores regressions, or produces false greens.
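A sketch of the chaining logic with the skills injected as callables, since their real interfaces aren't specified here. The skill names come from the story; the signatures and the 'higher is better' metric direction are assumptions:

```python
from typing import Callable

def run_loop(
    prd_id: str,
    skills: dict[str, Callable[[str], None]],
    measure: Callable[[str], float],
    budget: int,
) -> list[float]:
    """One trigger, full cycle: scaffold, activate, then validate and
    measure under a hard budget cap, halting on any metric regression."""
    skills["scaffold"](prd_id)
    skills["activate"](prd_id)
    history = [measure(prd_id)]          # metric check before iterating
    for _ in range(budget):              # budget cap: never iterate past it
        skills["validate"](prd_id)
        value = measure(prd_id)
        if value < history[-1]:          # regression halt, no false greens
            raise RuntimeError(f"regression: {value} < {history[-1]}")
        history.append(value)
    skills["story"](prd_id)              # write the proof back into the loop
    return history
```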
Kill Signal
The loop runs for 90 days with zero PRDs completing a full cycle, or metrics are defined but never queried.
Who
- Platform operator — wants overnight sessions to produce structured experiment logs
- PRD author — wants validated outcomes to automatically boost the next scoring cycle
- Agent — wants to measure a PRD's North Star via CLI to know if iteration improved or regressed
Questions
- What compounds faster: building new skills, or wiring existing ones into a loop?
- If every overnight session produced a structured experiment log, what would you learn by morning?
- Which dormant algorithm would benefit most from 100 automated iterations?
- Can LOOP-001 + LOOP-003 deliver value without the full orchestrator (LOOP-005)?