Intent Contract
| Dimension | Question | Autoresearch Loop |
|---|---|---|
| Objective | What problem, and why now? | Seven skills span pain-to-proof but none chain. Metrics aren't queryable. Every step needs a human trigger. The factory is designed but dormant. |
| Desired Outcomes | Observable state changes proving success (2-4) | 1. North Stars return scalars via CLI. 2. Validated outcomes propagate back to PRD frontmatter. 3. One PRD completes a full pain-to-proof cycle per month. |
| Health Metrics | What must not degrade? | Existing skill execution time. Comms CLI reliability. PRD frontmatter integrity. |
| Hard Constraints | What the system blocks regardless of agent intent | Max 3 iterations per task before escalation. Budget cap per overnight session. No metric definition without named data source. |
| Steering | How it should think and trade off | Activation over creation. Wire what exists before building new. Evidence over hope. |
| Autonomy | What it may do alone / must escalate / must never do | Allowed: run tests, measure metrics, log experiments. Escalate: when metric regresses or Forbidden Outcome occurs. Never: push to production without validation. |
| Stop Rules | Complete when... Halt & escalate when... | Complete when prd_full_cycles_per_month >= 1. Halt when any Forbidden Outcome occurs or metric regresses 2 consecutive iterations. |
| Counter-metrics | What gets worse if we over-optimize? | Speed without understanding. Running loops that produce green tests but miss the actual pain. Metric gaming. |
| Blast Radius | What breaks if this goes wrong? | PRD frontmatter corruption. False commissioning. Noise in Comms channels drowning real signals. |
| Rollback | How to undo if it fails | Revert frontmatter fields. Delete session logs. Skills revert to manual trigger mode. |
Story Contract
Stories are test contracts. Tests must be RED before implementation starts. GREEN = value delivered.
S1 — PRD metric enforced at creation
Trigger: Agent creates a PRD and defines the Intent Contract outcome
Checklist:
- Metric-definer enforces a queryable formula (not prose) with named data source, threshold, and unit
- PRD without a formula is blocked — not silently accepted
- PRD without a named data source fails the gate
Forbidden: Prose metric accepted without formula. Metric with no data source passes gate.
Evidence: integration — tests/skills/metric-definer.spec.ts
Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
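S1's gate can be sketched as a small validator over the PRD's metric frontmatter. This is an illustrative sketch, not the real metric-definer skill: the field names (formula, data_source, threshold, unit) and the operator-based "formula, not prose" heuristic are assumptions.

```typescript
// Assumed shape of a North Star metric in PRD frontmatter.
interface MetricDefinition {
  formula?: string;      // e.g. "won_deals / total_deals"
  data_source?: string;  // e.g. "postgres"
  threshold?: number;    // e.g. 0.30
  unit?: string;         // e.g. "ratio"
}

// Gate: block any PRD whose metric is prose instead of a computable formula,
// or that fails to name a data source, threshold, or unit.
function gateMetric(metric: MetricDefinition): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  // Crude heuristic for "queryable formula": require an arithmetic operator.
  if (!metric.formula || !/[-+*\/]/.test(metric.formula)) {
    errors.push("metric must be a computable formula, not prose");
  }
  if (!metric.data_source) errors.push("metric must name its data source");
  if (metric.threshold === undefined) errors.push("metric must set a threshold");
  if (!metric.unit) errors.push("metric must declare a unit");
  return { ok: errors.length === 0, errors };
}
```

The key property is that rejection is explicit: a prose metric returns errors rather than being silently accepted.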
S2 — Validation outcome propagates to frontmatter
Trigger: Engineering validates a PRD (RED → GREEN) and outcome-validator posts results
Checklist:
- proof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmatter
- Fields are machine-readable — not buried in spec prose
- Next score-prds run reads the new frontmatter fields
Forbidden: Validation results stored only in spec prose. No frontmatter update.
Evidence: integration — tests/skills/proof-return-signal.spec.ts
Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
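The S2 writeback can be sketched as a pure merge into the frontmatter object — a sketch only, assuming the three field names above and a generic key-value frontmatter shape rather than the real proof-to-story implementation.

```typescript
// Fields proof-to-story would write back after engineering validates the PRD.
interface ValidationResult {
  actual_metric_value: number;
  validation_outcome: "pass" | "fail";
  pain_reduction_delta: number;
}

// Merge validation results into existing frontmatter so the next score-prds
// run can read them as machine-readable fields, not spec prose.
function writeBackOutcome(
  frontmatter: Record<string, unknown>,
  result: ValidationResult,
): Record<string, unknown> {
  return {
    ...frontmatter,
    actual_metric_value: result.actual_metric_value,
    validation_outcome: result.validation_outcome,
    pain_reduction_delta: result.pain_reduction_delta,
  };
}
```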
S3 — Validated PRDs score higher
Trigger: Score-prds runs on a PRD with validation_outcome in frontmatter
Checklist:
- Validated PRDs receive a confidence boost in priority score
- Failed PRDs flagged for re-evaluation — not silently kept at current score
- Unvalidated PRDs score differently from validated PRDs
Forbidden: Validated and unvalidated PRDs scored identically. Evidence quality ignored.
Evidence: unit — tests/skills/score-validated.spec.ts
Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
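The S3 scoring rule can be sketched as a small function over the base priority score. The boost factor and the re-evaluation flag name are assumptions for illustration, not the real score-prds weights.

```typescript
// Validated PRDs get a confidence boost; failed PRDs are flagged for
// re-evaluation instead of silently kept at their current score.
function scoreWithValidation(
  baseScore: number,
  outcome?: "pass" | "fail",
): { score: number; flag?: "re-evaluate" } {
  if (outcome === "pass") return { score: baseScore * 1.25 }; // boost factor is an assumption
  if (outcome === "fail") return { score: baseScore, flag: "re-evaluate" };
  return { score: baseScore }; // unvalidated: no boost, so it scores below a validated PRD
}
```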
S4 — Session experiment logged automatically
Trigger: Agent session completes (overnight or manual)
Checklist:
- Session-experiment-logger reads Comms, aggregates deltas, writes structured log
- Morning report is diff-readable — not scattered across channel messages
- Log produced for every session, not only sessions with changes
Forbidden: Session results scattered across Comms with no aggregation. No morning report.
Evidence: integration — tests/skills/session-logger.spec.ts
Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
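The S4 aggregation step can be sketched as a fold over Comms messages into one diff-readable report. The message shape is a hypothetical — the real Comms CLI output format may differ.

```typescript
// Hypothetical Comms message shape for one session event.
interface CommsMessage {
  prd: string;
  metricDelta: number; // change in the PRD's North Star during this session
}

// Aggregate scattered per-message deltas into one morning report.
function buildMorningReport(sessionId: string, messages: CommsMessage[]): string {
  const byPrd = new Map<string, number>();
  for (const m of messages) {
    byPrd.set(m.prd, (byPrd.get(m.prd) ?? 0) + m.metricDelta);
  }
  const lines = [...byPrd.entries()].map(
    ([prd, delta]) => `${prd}: ${delta >= 0 ? "+" : ""}${delta.toFixed(2)}`,
  );
  // A log is produced for every session, including sessions with no changes.
  return [`session ${sessionId}`, ...(lines.length ? lines : ["no changes"])].join("\n");
}
```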
S5 — Full loop runs from one trigger
Trigger: Operator triggers "run loop" on top uncommissioned PRDs
Checklist:
- Loop-orchestrator chains scaffold → activate → validate → measure → story for each PRD
- Loop stops at budget cap — does not continue past limit
- Loop halts on metric regression — does not continue to next PRD
- Metric check runs before each iteration — not skipped
Forbidden: Loop runs without metric check. Loop continues past budget. Loop ignores regressions.
Evidence: e2e — tests/skills/loop-orchestrator.spec.ts
Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
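S5's halt conditions — budget cap, regression stop, metric check before each iteration — can be sketched as a guarded loop. The dependency names (measure, runIteration) are assumptions, not the real skill APIs.

```typescript
// Injected dependencies stand in for the measure and iterate skills.
interface LoopDeps {
  measure: (prd: string) => number;       // current North Star value
  runIteration: (prd: string) => number;  // runs one iteration, returns its cost
}

function runLoop(prds: string[], budget: number, deps: LoopDeps): string[] {
  const completed: string[] = [];
  let spent = 0;
  for (const prd of prds) {
    const before = deps.measure(prd);      // metric check before each iteration — never skipped
    spent += deps.runIteration(prd);
    if (deps.measure(prd) < before) break; // halt on regression — do not continue to next PRD
    completed.push(prd);
    if (spent >= budget) break;            // stop at budget cap — do not continue past limit
  }
  return completed;
}
```

The design choice worth noting: both halts are hard breaks, so a regression or exhausted budget can never be "continued past" by a later PRD in the stack.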
Build Contract
Metric Enforcement
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LOOP-001 | Queryable metric enforcer | Enforce computable North Star formula at PRD creation time | .agents/skills/metric-definer/SKILL.md | S1 | F1 | create-prd gates | Every PRD enters with a measurable target | Stub |
| 2 | LOOP-002 | North Star measurement CLI | Return scalar metric value for any PRD via terminal | tools/scripts/measure-north-star.ts | S1 | F1 | agent-comms latency | Agents can measure without human intervention | Stub |
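LOOP-002's contract — return one scalar for a PRD's formula — can be sketched as a tiny evaluator. A sketch only: the real measure-north-star.ts would query the named data source, whereas here the variable values are passed in directly and only single-operator formulas are handled.

```typescript
// Evaluate a "lhs op rhs" formula (e.g. "won_deals / total_deals") against
// values already fetched from the metric's named data source.
function evaluateFormula(formula: string, values: Record<string, number>): number {
  const [lhs, op, rhs] = formula.split(/\s+/);
  const a = values[lhs];
  const b = values[rhs];
  if (a === undefined || b === undefined) {
    throw new Error("unknown variable in formula");
  }
  switch (op) {
    case "/": return a / b;
    case "*": return a * b;
    case "+": return a + b;
    case "-": return a - b;
    default: throw new Error(`unsupported operator: ${op}`);
  }
}
```

Returning a bare scalar (not prose) is what lets an agent compare before/after values without human interpretation.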
Return Signal
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 3 | LOOP-003 | Return signal propagation | Write validated outcome numbers back to PRD frontmatter | .agents/skills/proof-to-story/SKILL.md | S2 | F2 | proof-to-story output | Same number proves engineering, seeds marketing | Stub |
Experiment Infrastructure
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 4 | LOOP-004 | Session experiment logger | Aggregate overnight results into structured morning report | .agents/skills/session-experiment-logger/SKILL.md | S4 | F4 | Comms CLI read perf | Structured experiment records, not scattered noise | Stub |
| 5 | LOOP-005 | Autonomous loop orchestrator | Chain scaffold → activate → validate → measure → story | stackmates/.agents/skills/loop-orchestrator/SKILL.md | S5 | F5 | Plan CLI stability | One trigger runs the full pain-to-proof cycle | Stub |
Principles
SIO:
- Situation: When the priority stack has scored PRDs with Red tests waiting, and an agent has overnight compute budget available.
- Intention: The pain-to-proof cycle should run autonomously — measure, iterate, validate, propagate results — without seven separate human triggers.
- Obstacle: Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills.
- Hardest Thing: Making the loop trustworthy enough to run unsupervised. A bad loop that produces false greens is worse than no loop.
Why Now: The skill chain exists. The CLIs exist. Karpathy proved the pattern works at research scale. The gap is three wires, not a new architecture.
Design Constraints: Activation over creation. Wire existing skills before building new ones. Every metric must name its data source. Every loop iteration must log its experiment.
Performance
- Priority Score: 1200 (Prepare)
- Quality Targets: prd_full_cycles_per_month >= 1. North Star queryable for all Active PRDs within 30 days.
- Kill Signal: Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.
Platform
- Current State: 7 skills built, 0 wired into a loop. 7/7 Active PRDs have Queryable: No.
- Build Ratio: 30% new (3 skills + 1 CLI), 70% wiring (3 skill upgrades + integration).
Protocols
- Build Order: Phase 1 (metric-definer + measure-north-star.ts) → Phase 2 (proof-to-story upgrade + score-prds upgrade) → Phase 3 (session-logger + loop-orchestrator)
- Commissioning: L2 when metric-definer enforces on one PRD. L3 when one PRD completes full cycle. L4 when prd_full_cycles_per_month >= 1 sustained for 60 days.
- First Target: Sales CRM (L2-L3, clearest metric: won_deals / total_deals > 0.30, real data in Postgres)
Players
- Demand-Side Jobs:
  - As a platform operator, I want overnight agent sessions to produce structured experiment logs so I wake to a readable diff instead of scattered Comms.
  - As a PRD author, I want validated outcomes to automatically boost the next scoring cycle so proven demand compounds without manual re-scoring.
  - As an agent, I want to measure a PRD's North Star via CLI so I can determine whether my iteration improved or regressed the metric.
Context
- PRD Index — Autoresearch Loop
- Prompt Deck — 5-card pitch
- Pictures — Pre-flight maps