

Intent Contract

  • Objective (What problem, and why now?): Seven skills span pain-to-proof but none chain. Metrics aren't queryable. Every step needs a human trigger. The factory is designed but dormant.
  • Desired Outcomes (Observable state changes proving success, 2-4): 1. North Stars return scalars via CLI. 2. Validated outcomes propagate back to PRD frontmatter. 3. One PRD completes a full pain-to-proof cycle per month.
  • Health Metrics (What must not degrade?): Existing skill execution time. Comms CLI reliability. PRD frontmatter integrity.
  • Hard Constraints (What the system blocks regardless of agent intent): Max 3 iterations per task before escalation. Budget cap per overnight session. No metric definition without a named data source.
  • Steering (How it should think and trade off): Activation over creation. Wire what exists before building new. Evidence over hope.
  • Autonomy (What it may do alone / must escalate / must never do): Allowed: run tests, measure metrics, log experiments. Escalate: when a metric regresses or a Forbidden Outcome occurs. Never: push to production without validation.
  • Stop Rules (Complete when... / Halt & escalate when...): Complete when prd_full_cycles_per_month >= 1. Halt when any Forbidden Outcome occurs or a metric regresses 2 consecutive iterations.
  • Counter-metrics (What gets worse if we over-optimize?): Speed without understanding. Running loops that produce green tests but miss the actual pain. Metric gaming.
  • Blast Radius (What breaks if this goes wrong?): PRD frontmatter corruption. False commissioning. Noise in Comms channels drowning real signals.
  • Rollback (How to undo if it fails): Revert frontmatter fields. Delete session logs. Skills revert to manual trigger mode.

Story Contract

Stories are test contracts. Tests must be RED before implementation starts. GREEN = value delivered.


S1 — PRD metric enforced at creation

Trigger: Agent creates a PRD and defines the Intent Contract outcome

Checklist:

  • Metric-definer enforces a queryable formula (not prose) with named data source, threshold, and unit
  • PRD without a formula is blocked — not silently accepted
  • PRD without a named data source fails the gate

Forbidden: Prose metric accepted without formula. Metric with no data source passes gate.

Evidence: integration — tests/skills/metric-definer.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
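The S1 gate can be sketched as a small validator. This is an illustrative assumption, not the real metric-definer: the `MetricDef` shape, the `gateMetric` name, and the formula heuristic are all hypothetical; the real skill lives in `agents/skills/metric-definer/SKILL.md`.

```typescript
// Hypothetical shape for a queryable North Star metric (assumption,
// not the actual PRD frontmatter schema).
interface MetricDef {
  formula: string;    // computable, e.g. "won_deals / total_deals"
  dataSource: string; // named source, e.g. "postgres"
  threshold: number;
  unit: string;
}

type GateResult = { ok: true } | { ok: false; reason: string };

// Reject prose masquerading as a metric, and any metric missing a named
// data source, threshold, or unit. The "contains an operator" check is a
// crude stand-in for real formula parsing.
function gateMetric(def: Partial<MetricDef>): GateResult {
  const formulaChars = /^[\w\s+\-*/().]+$/;
  const hasOperator = /[+\-*/]/;
  if (!def.formula || !formulaChars.test(def.formula) || !hasOperator.test(def.formula)) {
    return { ok: false, reason: "metric must be a computable formula, not prose" };
  }
  if (!def.dataSource) {
    return { ok: false, reason: "metric must name its data source" };
  }
  if (def.threshold === undefined || !def.unit) {
    return { ok: false, reason: "metric must declare threshold and unit" };
  }
  return { ok: true };
}
```

The point of returning a reason instead of a boolean is that a blocked PRD should fail loudly at creation, not be silently accepted.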


S2 — Validation outcome propagates to frontmatter

Trigger: Engineering validates a PRD (RED → GREEN) and outcome-validator posts results

Checklist:

  • proof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmatter
  • Fields are machine-readable — not buried in spec prose
  • Next score-prds run reads the new frontmatter fields

Forbidden: Validation results stored only in spec prose. No frontmatter update.

Evidence: integration — tests/skills/proof-return-signal.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
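The S2 write can be sketched as a pure merge into parsed frontmatter. The field names come from the checklist above; the function name and the assumption that frontmatter is already parsed into a record are hypothetical.

```typescript
// Validated outcome numbers, as posted by outcome-validator
// (shape is an assumption for illustration).
interface ValidationResult {
  actualMetricValue: number;
  validationOutcome: "pass" | "fail";
  painReductionDelta: number;
}

// Merge results into PRD frontmatter as machine-readable fields,
// rather than burying them in spec prose.
function propagateToFrontmatter(
  frontmatter: Record<string, unknown>,
  result: ValidationResult,
): Record<string, unknown> {
  return {
    ...frontmatter,
    actual_metric_value: result.actualMetricValue,
    validation_outcome: result.validationOutcome,
    pain_reduction_delta: result.painReductionDelta,
  };
}
```

Because the output is plain key/value frontmatter, the next score-prds run can read the fields without parsing prose.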


S3 — Validated PRDs score higher

Trigger: Score-prds runs on a PRD with validation_outcome in frontmatter

Checklist:

  • Validated PRDs receive a confidence boost in priority score
  • Failed PRDs flagged for re-evaluation — not silently kept at current score
  • Unvalidated PRDs score differently from validated PRDs

Forbidden: Validated and unvalidated PRDs scored identically. Evidence quality ignored.

Evidence: unit — tests/skills/score-validated.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
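The S3 rule can be sketched as a three-way branch on `validation_outcome`. The 1.25 boost factor and the `reevaluate` flag name are illustrative assumptions, not the real score-prds constants.

```typescript
interface ScoredPrd {
  baseScore: number;
  validationOutcome?: "pass" | "fail"; // absent = unvalidated
}

// Validated PRDs get a confidence boost; failed PRDs keep their score but
// are flagged for re-evaluation; unvalidated PRDs pass through unchanged,
// so the three cases are never scored identically in aggregate.
function priorityScore(prd: ScoredPrd): { score: number; reevaluate: boolean } {
  if (prd.validationOutcome === "pass") {
    return { score: prd.baseScore * 1.25, reevaluate: false };
  }
  if (prd.validationOutcome === "fail") {
    return { score: prd.baseScore, reevaluate: true };
  }
  return { score: prd.baseScore, reevaluate: false };
}
```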


S4 — Session experiment logged automatically

Trigger: Agent session completes (overnight or manual)

Checklist:

  • Session-experiment-logger reads Comms, aggregates deltas, writes structured log
  • Morning report is diff-readable — not scattered across channel messages
  • Log produced for every session, not only sessions with changes

Forbidden: Session results scattered across Comms with no aggregation. No morning report.

Evidence: integration — tests/skills/session-logger.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
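The S4 aggregation step can be sketched as a fold over Comms messages. The message shape is a hypothetical assumption; the real skill reads the Comms CLI.

```typescript
// Hypothetical per-PRD delta message pulled from a Comms channel.
interface CommsMessage {
  prdId: string;
  metricDelta: number;
}

interface SessionLog {
  sessionId: string;
  deltas: Record<string, number>; // prdId -> aggregated delta
}

// Collapse scattered channel messages into one structured, diff-readable
// log entry. A log is produced for every session: an empty message list
// still yields a (empty-delta) record.
function aggregateSession(sessionId: string, messages: CommsMessage[]): SessionLog {
  const deltas: Record<string, number> = {};
  for (const m of messages) {
    deltas[m.prdId] = (deltas[m.prdId] ?? 0) + m.metricDelta;
  }
  return { sessionId, deltas };
}
```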


S5 — Full loop runs from one trigger

Trigger: Operator triggers "run loop" on top uncommissioned PRDs

Checklist:

  • Loop-orchestrator chains scaffold → activate → validate → measure → story for each PRD
  • Loop stops at budget cap — does not continue past limit
  • Loop halts on metric regression — does not continue to next PRD
  • Metric check runs before each iteration — not skipped

Forbidden: Loop runs without metric check. Loop continues past budget. Loop ignores regressions.

Evidence: e2e — tests/skills/loop-orchestrator.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
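The S5 control flow can be sketched as a loop with the two stop rules from the checklist: the budget cap and the regression halt. The budget unit, `costPerPrd`, and the return shape are assumptions; the real orchestrator would invoke the skill chain where the placeholder comment sits.

```typescript
interface LoopConfig {
  budget: number;                     // cost units per session (assumed unit)
  costPerPrd: number;
  measure: (prdId: string) => number; // North Star scalar, e.g. via the measurement CLI
}

// Process PRDs in priority order until a stop rule fires.
function runLoop(
  prdIds: string[],
  cfg: LoopConfig,
): { processed: string[]; halted: string | null } {
  let spent = 0;
  const processed: string[] = [];
  for (const prdId of prdIds) {
    if (spent + cfg.costPerPrd > cfg.budget) {
      return { processed, halted: "budget cap" }; // stop at the cap, never past it
    }
    const before = cfg.measure(prdId); // metric check before each iteration, never skipped
    // ... scaffold -> activate -> validate -> measure -> story runs here ...
    spent += cfg.costPerPrd;
    const after = cfg.measure(prdId);
    if (after < before) {
      return { processed, halted: "metric regression" }; // halt, do not continue to next PRD
    }
    processed.push(prdId);
  }
  return { processed, halted: null };
}
```

Returning the halt reason rather than throwing keeps the escalation path explicit: the caller can post "budget cap" or "metric regression" to Comms as the stop-rule evidence.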

Build Contract

Metric Enforcement

1. LOOP-001 — Queryable metric enforcer
  • Function: Enforce computable North Star formula at PRD creation time.
  • Artifact: agents/skills/metric-definer/SKILL.md
  • Success Test: S1 · Safety Test: F1 · Regression Test: create-prd gates
  • Value: Every PRD enters with a measurable target
  • State: Stub

2. LOOP-002 — North Star measurement CLI
  • Function: Return scalar metric value for any PRD via terminal.
  • Artifact: tools/scripts/measure-north-star.ts
  • Success Test: S1 · Safety Test: F1 · Regression Test: agent-comms latency
  • Value: Agents can measure without human intervention
  • State: Stub

Return Signal

3. LOOP-003 — Return signal propagation
  • Function: Write validated outcome numbers back to PRD frontmatter.
  • Artifact: agents/skills/proof-to-story/SKILL.md
  • Success Test: S2 · Safety Test: F2 · Regression Test: proof-to-story output
  • Value: Same number proves engineering, seeds marketing
  • State: Stub

Experiment Infrastructure

4. LOOP-004 — Session experiment logger
  • Function: Aggregate overnight results into structured morning report.
  • Artifact: agents/skills/session-experiment-logger/SKILL.md
  • Success Test: S4 · Safety Test: F4 · Regression Test: Comms CLI read perf
  • Value: Structured experiment records, not scattered noise
  • State: Stub

5. LOOP-005 — Autonomous loop orchestrator
  • Function: Chain scaffold → activate → validate → measure → story.
  • Artifact: stackmates/.agents/skills/loop-orchestrator/SKILL.md
  • Success Test: S5 · Safety Test: F5 · Regression Test: Plan CLI stability
  • Value: One trigger runs the full pain-to-proof cycle
  • State: Stub

Principles

SIO:

  • Situation: When the priority stack has scored PRDs with Red tests waiting, and an agent has overnight compute budget available.
  • Intention: The pain-to-proof cycle should run autonomously — measure, iterate, validate, propagate results — without seven separate human triggers.
  • Obstacle: Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills.
  • Hardest Thing: Making the loop trustworthy enough to run unsupervised. A bad loop that produces false greens is worse than no loop.

Why Now: The skill chain exists. The CLIs exist. Karpathy proved the pattern works at research scale. The gap is three wires, not a new architecture.

Design Constraints: Activation over creation. Wire existing skills before building new ones. Every metric must name its data source. Every loop iteration must log its experiment.

Performance

  • Priority Score: 1200 (Prepare)
  • Quality Targets: prd_full_cycles_per_month >= 1. North Star queryable for all Active PRDs within 30 days.
  • Kill Signal: Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.

Platform

  • Current State: 7 skills built, 0 wired into a loop. 7/7 Active PRDs have Queryable: No.
  • Build Ratio: 30% new (3 skills + 1 CLI), 70% wiring (3 skill upgrades + integration).

Protocols

  • Build Order: Phase 1 (metric-definer + measure-north-star.ts) → Phase 2 (proof-to-story upgrade + score-prds upgrade) → Phase 3 (session-logger + loop-orchestrator)
  • Commissioning: L2 when metric-definer enforces on one PRD. L3 when one PRD completes full cycle. L4 when prd_full_cycles_per_month >= 1 sustained for 60 days.
  • First Target: Sales CRM (L2-L3, clearest metric: won_deals / total_deals > 0.30, real data in Postgres)
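The first target's North Star reduces to a single ratio. A minimal sketch of the scalar the measurement CLI would return, assuming the deal counts have already been queried from Postgres (the function name and zero-deal convention are illustrative):

```typescript
// Sales CRM North Star: won_deals / total_deals, target > 0.30.
// Returning 0 for an empty pipeline is an assumed convention, so the
// CLI always yields a comparable scalar.
function salesCrmNorthStar(wonDeals: number, totalDeals: number): number {
  if (totalDeals === 0) return 0;
  return wonDeals / totalDeals;
}
```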

Players

  • Demand-Side Jobs:
    • As a platform operator, I want overnight agent sessions to produce structured experiment logs so I wake to a readable diff instead of scattered Comms.
    • As a PRD author, I want validated outcomes to automatically boost the next scoring cycle so proven demand compounds without manual re-scoring.
    • As an agent, I want to measure a PRD's North Star via CLI so I can determine whether my iteration improved or regressed the metric.

Context