

Intent Contract

  • Objective (What problem, and why now?): Seven skills span pain-to-proof but none chain. Metrics aren't queryable. Every step needs a human trigger. The factory is designed but dormant.
  • Desired Outcomes (Observable state changes proving success, 2-4): 1. North Stars return scalars via CLI. 2. Validated outcomes propagate back to PRD frontmatter. 3. One PRD completes a full pain-to-proof cycle per month.
  • Health Metrics (What must not degrade?): Existing skill execution time. Comms CLI reliability. PRD frontmatter integrity.
  • Hard Constraints (What the system blocks regardless of agent intent): Max 3 iterations per task before escalation. Budget cap per overnight session. No metric definition without a named data source.
  • Steering (How it should think and trade off): Activation over creation. Wire what exists before building new. Evidence over hope.
  • Autonomy (What it may do alone / must escalate / must never do): Allowed: run tests, measure metrics, log experiments. Escalate: when a metric regresses or a Forbidden Outcome occurs. Never: push to production without validation.
  • Stop Rules (Complete when... / Halt & escalate when...): Complete when prd_full_cycles_per_month >= 1. Halt when any Forbidden Outcome occurs or a metric regresses 2 consecutive iterations.
  • Counter-metrics (What gets worse if we over-optimize?): Speed without understanding. Running loops that produce green tests but miss the actual pain. Metric gaming.
  • Blast Radius (What breaks if this goes wrong?): PRD frontmatter corruption. False commissioning. Noise in Comms channels drowning real signals.
  • Rollback (How to undo if it fails): Revert frontmatter fields. Delete session logs. Skills revert to manual trigger mode.

Story Contract

Stories are test contracts. Tests must be RED before implementation starts. GREEN = value delivered.


S1 — PRD metric enforced at creation

Trigger: Agent creates a PRD and defines the Intent Contract outcome

Checklist:

  • Metric-definer enforces a queryable formula (not prose) with named data source, threshold, and unit
  • PRD without a formula is blocked — not silently accepted
  • PRD without a named data source fails the gate

Forbidden: Prose metric accepted without formula. Metric with no data source passes gate.

Evidence: integration — tests/skills/metric-definer.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
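The S1 gate can be sketched as a small validator. This is an illustrative assumption, not the real metric-definer: the `MetricDef` shape, the `gateMetric` name, and the formula heuristic are all hypothetical; the real skill lives in `agents/skills/metric-definer/SKILL.md`.

```typescript
// Hypothetical shape for a queryable North Star metric (assumption,
// not the actual PRD frontmatter schema).
interface MetricDef {
  formula: string;    // computable, e.g. "won_deals / total_deals"
  dataSource: string; // named source, e.g. "postgres"
  threshold: number;
  unit: string;
}

type GateResult = { ok: true } | { ok: false; reason: string };

// Reject prose masquerading as a metric, and any metric missing a named
// data source, threshold, or unit. The "contains an operator" check is a
// crude stand-in for real formula parsing.
function gateMetric(def: Partial<MetricDef>): GateResult {
  const formulaChars = /^[\w\s+\-*/().]+$/;
  const hasOperator = /[+\-*/]/;
  if (!def.formula || !formulaChars.test(def.formula) || !hasOperator.test(def.formula)) {
    return { ok: false, reason: "metric must be a computable formula, not prose" };
  }
  if (!def.dataSource) {
    return { ok: false, reason: "metric must name its data source" };
  }
  if (def.threshold === undefined || !def.unit) {
    return { ok: false, reason: "metric must declare threshold and unit" };
  }
  return { ok: true };
}
```

The point of returning a reason instead of a boolean is that a blocked PRD should fail loudly at creation, not be silently accepted.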


S2 — Validation outcome propagates to frontmatter

Trigger: Engineering validates a PRD (RED → GREEN) and outcome-validator posts results

Checklist:

  • proof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmatter
  • Fields are machine-readable — not buried in spec prose
  • Next score-prds run reads the new frontmatter fields

Forbidden: Validation results stored only in spec prose. No frontmatter update.

Evidence: integration — tests/skills/proof-return-signal.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
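The S2 write can be sketched as a pure merge into parsed frontmatter. The field names come from the checklist above; the function name and the assumption that frontmatter is already parsed into a record are hypothetical.

```typescript
// Validated outcome numbers, as posted by outcome-validator
// (shape is an assumption for illustration).
interface ValidationResult {
  actualMetricValue: number;
  validationOutcome: "pass" | "fail";
  painReductionDelta: number;
}

// Merge results into PRD frontmatter as machine-readable fields,
// rather than burying them in spec prose.
function propagateToFrontmatter(
  frontmatter: Record<string, unknown>,
  result: ValidationResult,
): Record<string, unknown> {
  return {
    ...frontmatter,
    actual_metric_value: result.actualMetricValue,
    validation_outcome: result.validationOutcome,
    pain_reduction_delta: result.painReductionDelta,
  };
}
```

Because the output is plain key/value frontmatter, the next score-prds run can read the fields without parsing prose.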


S3 — Validated PRDs score higher

Trigger: Score-prds runs on a PRD with validation_outcome in frontmatter

Checklist:

  • Validated PRDs receive a confidence boost in priority score
  • Failed PRDs flagged for re-evaluation — not silently kept at current score
  • Unvalidated PRDs score differently from validated PRDs

Forbidden: Validated and unvalidated PRDs scored identically. Evidence quality ignored.

Evidence: unit — tests/skills/score-validated.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
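The S3 rule can be sketched as a three-way branch on `validation_outcome`. The 1.25 boost factor and the `reevaluate` flag name are illustrative assumptions, not the real score-prds constants.

```typescript
interface ScoredPrd {
  baseScore: number;
  validationOutcome?: "pass" | "fail"; // absent = unvalidated
}

// Validated PRDs get a confidence boost; failed PRDs keep their score but
// are flagged for re-evaluation; unvalidated PRDs pass through unchanged,
// so the three cases are never scored identically in aggregate.
function priorityScore(prd: ScoredPrd): { score: number; reevaluate: boolean } {
  if (prd.validationOutcome === "pass") {
    return { score: prd.baseScore * 1.25, reevaluate: false };
  }
  if (prd.validationOutcome === "fail") {
    return { score: prd.baseScore, reevaluate: true };
  }
  return { score: prd.baseScore, reevaluate: false };
}
```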


S4 — Session experiment logged automatically

Trigger: Agent session completes (overnight or manual)

Checklist:

  • Session-experiment-logger reads Comms, aggregates deltas, writes structured log
  • Morning report is diff-readable — not scattered across channel messages
  • Log produced for every session, not only sessions with changes

Forbidden: Session results scattered across Comms with no aggregation. No morning report.

Evidence: integration — tests/skills/session-logger.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
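The S4 aggregation step can be sketched as a fold over Comms messages. The message shape is a hypothetical assumption; the real skill reads the Comms CLI.

```typescript
// Hypothetical per-PRD delta message pulled from a Comms channel.
interface CommsMessage {
  prdId: string;
  metricDelta: number;
}

interface SessionLog {
  sessionId: string;
  deltas: Record<string, number>; // prdId -> aggregated delta
}

// Collapse scattered channel messages into one structured, diff-readable
// log entry. A log is produced for every session: an empty message list
// still yields a (empty-delta) record.
function aggregateSession(sessionId: string, messages: CommsMessage[]): SessionLog {
  const deltas: Record<string, number> = {};
  for (const m of messages) {
    deltas[m.prdId] = (deltas[m.prdId] ?? 0) + m.metricDelta;
  }
  return { sessionId, deltas };
}
```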


S5 — Full loop runs from one trigger

Trigger: Operator triggers "run loop" on top uncommissioned PRDs

Checklist:

  • Loop-orchestrator chains scaffold → activate → validate → measure → story for each PRD
  • Loop stops at budget cap — does not continue past limit
  • Loop halts on metric regression — does not continue to next PRD
  • Metric check runs before each iteration — not skipped

Forbidden: Loop runs without metric check. Loop continues past budget. Loop ignores regressions.

Evidence: e2e — tests/skills/loop-orchestrator.spec.ts

Commission Result: ⬜ PASS / ⬜ FAIL Notes: (findings)
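The S5 control flow can be sketched as a loop with the two stop rules from the checklist: the budget cap and the regression halt. The budget unit, `costPerPrd`, and the return shape are assumptions; the real orchestrator would invoke the skill chain where the placeholder comment sits.

```typescript
interface LoopConfig {
  budget: number;                     // cost units per session (assumed unit)
  costPerPrd: number;
  measure: (prdId: string) => number; // North Star scalar, e.g. via the measurement CLI
}

// Process PRDs in priority order until a stop rule fires.
function runLoop(
  prdIds: string[],
  cfg: LoopConfig,
): { processed: string[]; halted: string | null } {
  let spent = 0;
  const processed: string[] = [];
  for (const prdId of prdIds) {
    if (spent + cfg.costPerPrd > cfg.budget) {
      return { processed, halted: "budget cap" }; // stop at the cap, never past it
    }
    const before = cfg.measure(prdId); // metric check before each iteration, never skipped
    // ... scaffold -> activate -> validate -> measure -> story runs here ...
    spent += cfg.costPerPrd;
    const after = cfg.measure(prdId);
    if (after < before) {
      return { processed, halted: "metric regression" }; // halt, do not continue to next PRD
    }
    processed.push(prdId);
  }
  return { processed, halted: null };
}
```

Returning the halt reason rather than throwing keeps the escalation path explicit: the caller can post "budget cap" or "metric regression" to Comms as the stop-rule evidence.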

Build Contract

Metric Enforcement

1. LOOP-001 — Queryable metric enforcer
  • Function: Enforce computable North Star formula at PRD creation time.
  • Artifact: agents/skills/metric-definer/SKILL.md
  • Success Test: S1 · Safety Test: F1 · Regression Test: create-prd gates
  • Value: Every PRD enters with a measurable target
  • State: Stub

2. LOOP-002 — North Star measurement CLI
  • Function: Return scalar metric value for any PRD via terminal.
  • Artifact: tools/scripts/measure-north-star.ts
  • Success Test: S1 · Safety Test: F1 · Regression Test: agent-comms latency
  • Value: Agents can measure without human intervention
  • State: Stub

Return Signal

3. LOOP-003 — Return signal propagation
  • Function: Write validated outcome numbers back to PRD frontmatter.
  • Artifact: agents/skills/proof-to-story/SKILL.md
  • Success Test: S2 · Safety Test: F2 · Regression Test: proof-to-story output
  • Value: Same number proves engineering, seeds marketing
  • State: Stub

Experiment Infrastructure

4. LOOP-004 — Session experiment logger
  • Function: Aggregate overnight results into structured morning report.
  • Artifact: agents/skills/session-experiment-logger/SKILL.md
  • Success Test: S4 · Safety Test: F4 · Regression Test: Comms CLI read perf
  • Value: Structured experiment records, not scattered noise
  • State: Stub

5. LOOP-005 — Autonomous loop orchestrator
  • Function: Chain scaffold → activate → validate → measure → story.
  • Artifact: stackmates/.agents/skills/loop-orchestrator/SKILL.md
  • Success Test: S5 · Safety Test: F5 · Regression Test: Plan CLI stability
  • Value: One trigger runs the full pain-to-proof cycle
  • State: Stub

Principles

SIO:

  • Situation: When the priority stack has scored PRDs with Red tests waiting, and an agent has overnight compute budget available.
  • Intention: The pain-to-proof cycle should run autonomously — measure, iterate, validate, propagate results — without seven separate human triggers.
  • Obstacle: Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills.
  • Hardest Thing: Making the loop trustworthy enough to run unsupervised. A bad loop that produces false greens is worse than no loop.

Why Now: The skill chain exists. The CLIs exist. Karpathy proved the pattern works at research scale. The gap is three wires, not a new architecture.

Design Constraints: Activation over creation. Wire existing skills before building new ones. Every metric must name its data source. Every loop iteration must log its experiment.

Performance

  • Priority Score: 1200 (Prepare)
  • Quality Targets: prd_full_cycles_per_month >= 1. North Star queryable for all Active PRDs within 30 days.
  • Kill Signal: Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.

Platform

  • Current State: 7 skills built, 0 wired into a loop. 7/7 Active PRDs have Queryable: No.
  • Build Ratio: 30% new (3 skills + 1 CLI), 70% wiring (3 skill upgrades + integration).

Protocols

  • Build Order: Phase 1 (metric-definer + measure-north-star.ts) → Phase 2 (proof-to-story upgrade + score-prds upgrade) → Phase 3 (session-logger + loop-orchestrator)
  • Commissioning: L2 when metric-definer enforces on one PRD. L3 when one PRD completes full cycle. L4 when prd_full_cycles_per_month >= 1 sustained for 60 days.
  • First Target: Sales CRM (L2-L3, clearest metric: won_deals / total_deals > 0.30, real data in Postgres)
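The first target's North Star reduces to a single ratio. A minimal sketch of the scalar the measurement CLI would return, assuming the deal counts have already been queried from Postgres (the function name and zero-deal convention are illustrative):

```typescript
// Sales CRM North Star: won_deals / total_deals, target > 0.30.
// Returning 0 for an empty pipeline is an assumed convention, so the
// CLI always yields a comparable scalar.
function salesCrmNorthStar(wonDeals: number, totalDeals: number): number {
  if (totalDeals === 0) return 0;
  return wonDeals / totalDeals;
}
```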

Players

  • Demand-Side Jobs:
    • As a platform operator, I want overnight agent sessions to produce structured experiment logs so I wake to a readable diff instead of scattered Comms.
    • As a PRD author, I want validated outcomes to automatically boost the next scoring cycle so proven demand compounds without manual re-scoring.
    • As an agent, I want to measure a PRD's North Star via CLI so I can determine whether my iteration improved or regressed the metric.

Context