Intent Contract
| Dimension | Question | Autoresearch Loop |
|---|---|---|
| Objective | What problem, and why now? | Seven skills span pain-to-proof but none chain. Metrics aren't queryable. Every step needs a human trigger. The factory is designed but dormant. |
| Desired Outcomes | Observable state changes proving success (2-4) | 1. North Stars return scalars via CLI. 2. Validated outcomes propagate back to PRD frontmatter. 3. One PRD completes a full pain-to-proof cycle per month. |
| Health Metrics | What must not degrade? | Existing skill execution time. Comms CLI reliability. PRD frontmatter integrity. |
| Hard Constraints | What the system blocks regardless of agent intent | Max 3 iterations per task before escalation. Budget cap per overnight session. No metric definition without named data source. |
| Steering | How it should think and trade off | Activation over creation. Wire what exists before building new. Evidence over hope. |
| Autonomy | What it may do alone / must escalate / must never do | Allowed: run tests, measure metrics, log experiments. Escalate: when metric regresses or Forbidden Outcome occurs. Never: push to production without validation. |
| Stop Rules | Complete when... Halt & escalate when... | Complete when prd_full_cycles_per_month >= 1. Halt when any Forbidden Outcome occurs or metric regresses 2 consecutive iterations. |
| Counter-metrics | What gets worse if we over-optimize? | Speed without understanding. Running loops that produce green tests but miss the actual pain. Metric gaming. |
| Blast Radius | What breaks if this goes wrong? | PRD frontmatter corruption. False commissioning. Noise in Comms channels drowning real signals. |
| Rollback | How to undo if it fails | Revert frontmatter fields. Delete session logs. Skills revert to manual trigger mode. |
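The Stop Rules and Autonomy rows above can be collapsed into a single session-level predicate. A minimal sketch, assuming illustrative names (`SessionState`, `stopRule`) that are not part of any existing skill:

```typescript
// Illustrative encoding of the Stop Rules row: halt on a Forbidden Outcome
// or two consecutive metric regressions, complete when the North Star
// target is met, otherwise keep iterating. All identifiers are assumptions.
interface SessionState {
  prdFullCyclesPerMonth: number;
  forbiddenOutcomeHit: boolean;
  consecutiveRegressions: number;
}

function stopRule(s: SessionState): "complete" | "halt" | "continue" {
  // Halt conditions take precedence over completion.
  if (s.forbiddenOutcomeHit || s.consecutiveRegressions >= 2) return "halt";
  if (s.prdFullCyclesPerMonth >= 1) return "complete";
  return "continue";
}
```

Checking halt before complete matters: a session that hits a Forbidden Outcome must escalate even if the metric target was also reached.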
Story Contract
| # | WHEN | THEN | ARTIFACT | Test Type | FORBIDDEN | OUTCOME |
|---|---|---|---|---|---|---|
| S1 | Agent creates a PRD and defines the Intent Contract outcome | Metric-definer enforces a queryable formula with named data source, threshold, and unit | tests/skills/metric-definer.spec.ts | integration | Prose metric accepted without formula. Metric with no data source passes gate. | Every PRD enters the factory with a measurable North Star. |
| S2 | Engineering validates a PRD (RED to GREEN) and outcome-validator posts results | Proof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmatter | tests/skills/proof-return-signal.spec.ts | integration | Validation results stored only in spec prose, not in machine-readable frontmatter. | Validated outcomes compound into the next scoring cycle automatically. |
| S3 | Score-prds runs on a PRD with validation_outcome in frontmatter | Validated PRDs get boosted confidence. Failed PRDs get flagged for re-evaluation. | tests/skills/score-validated.spec.ts | unit | Validated and unvalidated PRDs scored identically. Evidence quality ignored. | Proven demand compounds — real evidence weights higher than estimates. |
| S4 | Agent session completes (overnight or manual) | Session-experiment-logger reads Comms, aggregates deltas, writes structured log | tests/skills/session-logger.spec.ts | integration | Session results scattered across Comms with no aggregation. No morning report. | Every session produces a diff-readable experiment record. |
| S5 | Operator triggers "run loop" on top uncommissioned PRDs | Loop-orchestrator chains scaffold > activate > validate > measure > story for each PRD within budget | tests/skills/loop-orchestrator.spec.ts | e2e | Loop runs without metric check. Loop continues past budget. Loop ignores regressions. | The full pain-to-proof cycle runs with one trigger instead of seven. |
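The S1 gate can be sketched as a structural check on the metric definition. The shape below (`MetricDefinition`, `passesMetricGate`) is an assumption for illustration, not the actual metric-definer interface:

```typescript
// Hypothetical shape of a North Star metric as S1 would require it:
// a computable formula, a named data source, a threshold, and a unit.
interface MetricDefinition {
  formula: string;    // computable expression, e.g. "won_deals / total_deals"
  dataSource: string; // named source, e.g. "postgres:crm"
  threshold: number;  // target value the formula must reach
  unit: string;       // e.g. "ratio", "cycles/month"
}

// Gate: reject prose-only metrics and metrics with no data source
// (both listed as FORBIDDEN in S1).
function passesMetricGate(m: Partial<MetricDefinition>): boolean {
  return (
    typeof m.formula === "string" && m.formula.trim().length > 0 &&
    typeof m.dataSource === "string" && m.dataSource.trim().length > 0 &&
    typeof m.threshold === "number" && Number.isFinite(m.threshold) &&
    typeof m.unit === "string" && m.unit.trim().length > 0
  );
}
```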
Build Contract
Metric Enforcement
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LOOP-001 | Queryable metric enforcer | Enforce computable North Star formula at PRD creation time | .agents/skills/metric-definer/SKILL.md | S1 | F1 | create-prd gates | Every PRD enters with a measurable target | Stub |
| 2 | LOOP-002 | North Star measurement CLI | Return scalar metric value for any PRD via terminal | tools/scripts/measure-north-star.ts | S1 | F1 | agent-comms latency | Agents can measure without human intervention | Stub |
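LOOP-002's core job is turning a stored formula into a scalar. A minimal sketch of what the evaluator inside `measure-north-star.ts` might look like, under the assumption that early formulas are simple two-term ratios (real formula parsing would need to be richer):

```typescript
// Hypothetical evaluator: compute an "a / b" ratio formula against counts
// fetched from the metric's named data source. Anything more complex than
// a two-term ratio is rejected rather than guessed at.
function evaluateRatio(formula: string, counts: Record<string, number>): number {
  const m = formula.match(/^(\w+)\s*\/\s*(\w+)$/);
  if (!m) throw new Error(`unsupported formula: ${formula}`);
  const [, num, den] = m;
  if (!(num in counts) || !(den in counts)) throw new Error("missing counts for formula terms");
  if (counts[den] === 0) throw new Error("division by zero: denominator count is 0");
  return counts[num] / counts[den];
}
```

Failing loudly on prose formulas keeps the CLI honest: a metric that cannot be evaluated should never silently return a number.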
Return Signal
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 3 | LOOP-003 | Return signal propagation | Write validated outcome numbers back to PRD frontmatter | .agents/skills/proof-to-story/SKILL.md | S2 | F2 | proof-to-story output | Same number proves engineering, seeds marketing | Stub |
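The S2 write-back can be sketched as a frontmatter merge. This is a simplified model, assuming flat `key: value` YAML frontmatter delimited by `---`; the field names come from the Story Contract, the function name is illustrative:

```typescript
// Minimal sketch: merge validated outcome fields into a PRD's frontmatter
// block so results land in machine-readable form, not spec prose
// (the S2 FORBIDDEN case). Handles flat key: value lines only.
interface ReturnSignal {
  actual_metric_value: number;
  validation_outcome: "validated" | "failed";
  pain_reduction_delta: number;
}

function writeReturnSignal(prd: string, signal: ReturnSignal): string {
  const lines = prd.split("\n");
  const close = lines.indexOf("---", 1); // closing frontmatter delimiter
  if (lines[0] !== "---" || close < 0) throw new Error("PRD has no frontmatter");
  const fields = Object.entries(signal).map(([k, v]) => `${k}: ${v}`);
  // Drop stale copies of the same fields, then append fresh values.
  const keep = lines.slice(1, close).filter(
    (l) => !Object.keys(signal).some((k) => l.startsWith(`${k}:`))
  );
  return ["---", ...keep, ...fields, "---", ...lines.slice(close + 1)].join("\n");
}
```

Replacing stale fields rather than appending duplicates is what lets the next scoring cycle trust a single value per key.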
Experiment Infrastructure
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 4 | LOOP-004 | Session experiment logger | Aggregate overnight results into structured morning report | .agents/skills/session-experiment-logger/SKILL.md | S4 | F4 | Comms CLI read perf | Structured experiment records, not scattered noise | Stub |
| 5 | LOOP-005 | Autonomous loop orchestrator | Chain scaffold > activate > validate > measure > story | stackmates/.agents/skills/loop-orchestrator/SKILL.md | S5 | F5 | Plan CLI stability | One trigger runs the full pain-to-proof cycle | Stub |
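LOOP-005's guard rails matter more than its chaining. A sketch of the control loop under assumed names (`runLoop`, a unit-based budget), encoding the Intent Contract's 3-iteration cap, the budget cap, and the two-consecutive-regressions halt; the F5 forbidden cases are exactly the guards it must not skip:

```typescript
// Illustrative orchestrator core: each pass would run
// scaffold > activate > validate > measure > story, then check the
// North Star. Stops on budget exhaustion, completes when the threshold
// is met, halts after two consecutive regressions, escalates at the cap.
interface LoopResult { completed: boolean; reason: string; iterations: number }

function runLoop(
  measure: () => number,          // returns the North Star scalar after a pass
  threshold: number,              // target from the PRD's metric definition
  budget: { unitsLeft: number },  // one unit spent per full pass
  maxIterations = 3               // Intent Contract hard constraint
): LoopResult {
  let prev = -Infinity;
  let regressions = 0;
  for (let i = 1; i <= maxIterations; i++) {
    if (budget.unitsLeft <= 0) return { completed: false, reason: "budget exhausted", iterations: i - 1 };
    budget.unitsLeft -= 1;
    // ...skill chain runs here; measure() reads the resulting metric...
    const value = measure();
    if (value >= threshold) return { completed: true, reason: "North Star met", iterations: i };
    regressions = value < prev ? regressions + 1 : 0;
    if (regressions >= 2) return { completed: false, reason: "metric regressed twice, halt", iterations: i };
    prev = value;
  }
  return { completed: false, reason: "iteration cap reached, escalate", iterations: maxIterations };
}
```

Every exit path names its reason, which is what the session-experiment-logger would need to write a useful morning report.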
Principles
SIO:
- Situation: When the priority stack has scored PRDs with Red tests waiting, and an agent has overnight compute budget available.
- Intention: The pain-to-proof cycle should run autonomously — measure, iterate, validate, propagate results — without seven separate human triggers.
- Obstacle: Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills.
- Hardest Thing: Making the loop trustworthy enough to run unsupervised. A bad loop that produces false greens is worse than no loop.
Why Now: The skill chain exists. The CLIs exist. Karpathy proved the pattern works at research scale. The gap is three wires, not a new architecture.
Design Constraints: Activation over creation. Wire existing skills before building new ones. Every metric must name its data source. Every loop iteration must log its experiment.
Performance
- Priority Score: 1200 (Prepare)
- Quality Targets: prd_full_cycles_per_month >= 1. North Star queryable for all Active PRDs within 30 days.
- Kill Signal: Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.
Platform
- Current State: 7 skills built, 0 wired into a loop. 7/7 Active PRDs have Queryable: No.
- Build Ratio: 30% new (3 skills + 1 CLI), 70% wiring (3 skill upgrades + integration).
Protocols
- Build Order: Phase 1 (metric-definer + measure-north-star.ts) → Phase 2 (proof-to-story upgrade + score-prds upgrade) → Phase 3 (session-logger + loop-orchestrator)
- Commissioning: L2 when metric-definer enforces on one PRD. L3 when one PRD completes a full cycle. L4 when prd_full_cycles_per_month >= 1 sustained for 60 days.
- First Target: Sales CRM (L2-L3, clearest metric: won_deals / total_deals > 0.30, real data in Postgres)
Players
- Demand-Side Jobs:
- As a platform operator, I want overnight agent sessions to produce structured experiment logs so I wake to a readable diff instead of scattered Comms.
- As a PRD author, I want validated outcomes to automatically boost the next scoring cycle so proven demand compounds without manual re-scoring.
- As an agent, I want to measure a PRD's North Star via CLI so I can determine whether my iteration improved or regressed the metric.