Skip to main content

Intent Contract

DimensionQuestionAutoresearch Loop
ObjectiveWhat problem, and why now?Seven skills span pain-to-proof but none chain. Metrics aren't queryable. Every step needs a human trigger. The factory is designed but dormant.
Desired OutcomesObservable state changes proving success (2-4)1. North Stars return scalars via CLI. 2. Validated outcomes propagate back to PRD frontmatter. 3. One PRD completes a full pain-to-proof cycle per month.
Health MetricsWhat must not degrade?Existing skill execution time. Comms CLI reliability. PRD frontmatter integrity.
Hard ConstraintsWhat the system blocks regardless of agent intentMax 3 iterations per task before escalation. Budget cap per overnight session. No metric definition without named data source.
SteeringHow it should think and trade offActivation over creation. Wire what exists before building new. Evidence over hope.
AutonomyWhat it may do alone / must escalate / must never doAllowed: run tests, measure metrics, log experiments. Escalate: when metric regresses or Forbidden Outcome occurs. Never: push to production without validation.
Stop RulesComplete when... Halt & escalate when...Complete when prd_full_cycles_per_month >= 1. Halt when any Forbidden Outcome occurs or metric regresses 2 consecutive iterations.
Counter-metricsWhat gets worse if we over-optimize?Speed without understanding. Running loops that produce green tests but miss the actual pain. Metric gaming.
Blast RadiusWhat breaks if this goes wrong?PRD frontmatter corruption. False commissioning. Noise in Comms channels drowning real signals.
RollbackHow to undo if it failsRevert frontmatter fields. Delete session logs. Skills revert to manual trigger mode.

Story Contract

#WHENTHENARTIFACTTest TypeFORBIDDENOUTCOME
S1Agent creates a PRD and defines the Intent Contract outcomeMetric-definer enforces a queryable formula with named data source, threshold, and unittests/skills/metric-definer.spec.tsintegrationProse metric accepted without formula. Metric with no data source passes gate.Every PRD enters the factory with a measurable North Star.
S2Engineering validates a PRD (RED to GREEN) and outcome-validator posts resultsProof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmattertests/skills/proof-return-signal.spec.tsintegrationValidation results stored only in spec prose, not in machine-readable frontmatter.Validated outcomes compound into the next scoring cycle automatically.
S3Score-prds runs on a PRD with validation_outcome in frontmatterValidated PRDs get boosted confidence. Failed PRDs get flagged for re-evaluation.tests/skills/score-validated.spec.tsunitValidated and unvalidated PRDs scored identically. Evidence quality ignored.Proven demand compounds — real evidence weights higher than estimates.
S4Agent session completes (overnight or manual)Session-experiment-logger reads Comms, aggregates deltas, writes structured logtests/skills/session-logger.spec.tsintegrationSession results scattered across Comms with no aggregation. No morning report.Every session produces a diff-readable experiment record.
S5Operator triggers "run loop" on top uncommissioned PRDsLoop-orchestrator chains scaffold > activate > validate > measure > story for each PRD within budgettests/skills/loop-orchestrator.spec.tse2eLoop runs without metric check. Loop continues past budget. Loop ignores regressions.The full pain-to-proof cycle runs with one trigger instead of seven.

Build Contract

Metric Enforcement

#FeatureIDFeatureFunctionArtifactSuccess TestSafety TestRegression TestValueState
1LOOP-001Queryable metric enforcerEnforce computable North Star formula at PRD creation time.agents/skills/metric-definer/SKILL.mdS1F1create-prd gatesEvery PRD enters with a measurable targetStub
2LOOP-002North Star measurement CLIReturn scalar metric value for any PRD via terminaltools/scripts/measure-north-star.tsS1F1agent-comms latencyAgents can measure without human interventionStub

Return Signal

#FeatureIDFeatureFunctionArtifactSuccess TestSafety TestRegression TestValueState
3LOOP-003Return signal propagationWrite validated outcome numbers back to PRD frontmatter.agents/skills/proof-to-story/SKILL.mdS2F2proof-to-story outputSame number proves engineering, seeds marketingStub

Experiment Infrastructure

#FeatureIDFeatureFunctionArtifactSuccess TestSafety TestRegression TestValueState
4LOOP-004Session experiment loggerAggregate overnight results into structured morning report.agents/skills/session-experiment-logger/SKILL.mdS4F4Comms CLI read perfStructured experiment records, not scattered noiseStub
5LOOP-005Autonomous loop orchestratorChain scaffold > activate > validate > measure > storystackmates/.agents/skills/loop-orchestrator/SKILL.mdS5F5Plan CLI stabilityOne trigger runs the full pain-to-proof cycleStub

Principles

SIO:

  • Situation: When the priority stack has scored PRDs with Red tests waiting, and an agent has overnight compute budget available.
  • Intention: The pain-to-proof cycle should run autonomously — measure, iterate, validate, propagate results — without seven separate human triggers.
  • Obstacle: Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills.
  • Hardest Thing: Making the loop trustworthy enough to run unsupervised. A bad loop that produces false greens is worse than no loop.

Why Now: The skill chain exists. The CLIs exist. Karpathy proved the pattern works at research scale. The gap is three wires, not a new architecture.

Design Constraints: Activation over creation. Wire existing skills before building new ones. Every metric must name its data source. Every loop iteration must log its experiment.

Performance

  • Priority Score: 1200 (Prepare)
  • Quality Targets: prd_full_cycles_per_month >= 1. North Star queryable for all Active PRDs within 30 days.
  • Kill Signal: Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.

Platform

  • Current State: 7 skills built, 0 wired into a loop. 7/7 Active PRDs have Queryable: No.
  • Build Ratio: 30% new (3 skills + 1 CLI), 70% wiring (3 skill upgrades + integration).

Protocols

  • Build Order: Phase 1 (metric-definer + measure-north-star.ts) → Phase 2 (proof-to-story upgrade + score-prds upgrade) → Phase 3 (session-logger + loop-orchestrator)
  • Commissioning: L2 when metric-definer enforces on one PRD. L3 when one PRD completes full cycle. L4 when prd_full_cycles_per_month >= 1 sustained for 60 days.
  • First Target: Sales CRM (L2-L3, clearest metric: won_deals / total_deals > 0.30, real data in Postgres)

Players

  • Demand-Side Jobs:
    • As a platform operator, I want overnight agent sessions to produce structured experiment logs so I wake to a readable diff instead of scattered Comms.
    • As a PRD author, I want validated outcomes to automatically boost the next scoring cycle so proven demand compounds without manual re-scoring.
    • As an agent, I want to measure a PRD's North Star via CLI so I can determine whether my iteration improved or regressed the metric.