Intent Contract
| Dimension | Question | Autoresearch Loop |
|---|---|---|
| Objective | What problem, and why now? | Seven skills span pain-to-proof but none chain. Metrics aren't queryable. Every step needs a human trigger. The factory is designed but dormant. |
| Desired Outcomes | Observable state changes proving success (2-4) | 1. North Stars return scalars via CLI. 2. Validated outcomes propagate back to PRD frontmatter. 3. One PRD completes a full pain-to-proof cycle per month. |
| Health Metrics | What must not degrade? | Existing skill execution time. Comms CLI reliability. PRD frontmatter integrity. |
| Hard Constraints | What the system blocks regardless of agent intent | Max 3 iterations per task before escalation. Budget cap per overnight session. No metric definition without named data source. |
| Steering | How it should think and trade off | Activation over creation. Wire what exists before building new. Evidence over hope. |
| Autonomy | What it may do alone / must escalate / must never do | Allowed: run tests, measure metrics, log experiments. Escalate: when metric regresses or Forbidden Outcome occurs. Never: push to production without validation. |
| Stop Rules | Complete when... Halt & escalate when... | Complete when prd_full_cycles_per_month >= 1. Halt when any Forbidden Outcome occurs or metric regresses 2 consecutive iterations. |
| Counter-metrics | What gets worse if we over-optimize? | Speed without understanding. Running loops that produce green tests but miss the actual pain. Metric gaming. |
| Blast Radius | What breaks if this goes wrong? | PRD frontmatter corruption. False commissioning. Noise in Comms channels drowning real signals. |
| Rollback | How to undo if it fails | Revert frontmatter fields. Delete session logs. Skills revert to manual trigger mode. |
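The Stop Rules and Autonomy rows above can be collapsed into a single session-level predicate. A minimal sketch, assuming illustrative names (`SessionState`, `stopRule`) that are not part of any existing skill:

```typescript
// Illustrative encoding of the Stop Rules row: halt on a Forbidden Outcome
// or two consecutive metric regressions, complete when the North Star
// target is met, otherwise keep iterating. All identifiers are assumptions.
interface SessionState {
  prdFullCyclesPerMonth: number;
  forbiddenOutcomeHit: boolean;
  consecutiveRegressions: number;
}

function stopRule(s: SessionState): "complete" | "halt" | "continue" {
  // Halt conditions take precedence over completion.
  if (s.forbiddenOutcomeHit || s.consecutiveRegressions >= 2) return "halt";
  if (s.prdFullCyclesPerMonth >= 1) return "complete";
  return "continue";
}
```

Checking halt before complete matters: a session that hits a Forbidden Outcome must escalate even if the metric target was also reached.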
Story Contract
| # | WHEN | THEN | ARTIFACT | Test Type | FORBIDDEN | OUTCOME |
|---|---|---|---|---|---|---|
| S1 | Agent creates a PRD and defines the Intent Contract outcome | Metric-definer enforces a queryable formula with named data source, threshold, and unit | tests/skills/metric-definer.spec.ts | integration | Prose metric accepted without formula. Metric with no data source passes gate. | Every PRD enters the factory with a measurable North Star. |
| S2 | Engineering validates a PRD (RED to GREEN) and outcome-validator posts results | Proof-to-story writes actual_metric_value, validation_outcome, and pain_reduction_delta to PRD frontmatter | tests/skills/proof-return-signal.spec.ts | integration | Validation results stored only in spec prose, not in machine-readable frontmatter. | Validated outcomes compound into the next scoring cycle automatically. |
| S3 | Score-prds runs on a PRD with validation_outcome in frontmatter | Validated PRDs get boosted confidence. Failed PRDs get flagged for re-evaluation. | tests/skills/score-validated.spec.ts | unit | Validated and unvalidated PRDs scored identically. Evidence quality ignored. | Proven demand compounds — real evidence weights higher than estimates. |
| S4 | Agent session completes (overnight or manual) | Session-experiment-logger reads Comms, aggregates deltas, writes structured log | tests/skills/session-logger.spec.ts | integration | Session results scattered across Comms with no aggregation. No morning report. | Every session produces a diff-readable experiment record. |
| S5 | Operator triggers "run loop" on top uncommissioned PRDs | Loop-orchestrator chains scaffold > activate > validate > measure > story for each PRD within budget | tests/skills/loop-orchestrator.spec.ts | e2e | Loop runs without metric check. Loop continues past budget. Loop ignores regressions. | The full pain-to-proof cycle runs with one trigger instead of seven. |
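The S1 gate can be sketched as a structural check on the metric definition. The shape below (`MetricDefinition`, `passesMetricGate`) is an assumption for illustration, not the actual metric-definer interface:

```typescript
// Hypothetical shape of a North Star metric as S1 would require it:
// a computable formula, a named data source, a threshold, and a unit.
interface MetricDefinition {
  formula: string;    // computable expression, e.g. "won_deals / total_deals"
  dataSource: string; // named source, e.g. "postgres:crm"
  threshold: number;  // target value the formula must reach
  unit: string;       // e.g. "ratio", "cycles/month"
}

// Gate: reject prose-only metrics and metrics with no data source
// (both listed as FORBIDDEN in S1).
function passesMetricGate(m: Partial<MetricDefinition>): boolean {
  return (
    typeof m.formula === "string" && m.formula.trim().length > 0 &&
    typeof m.dataSource === "string" && m.dataSource.trim().length > 0 &&
    typeof m.threshold === "number" && Number.isFinite(m.threshold) &&
    typeof m.unit === "string" && m.unit.trim().length > 0
  );
}
```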
Build Contract
Metric Enforcement
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LOOP-001 | Queryable metric enforcer | Enforce computable North Star formula at PRD creation time | .agents/skills/metric-definer/SKILL.md | S1 | F1 | create-prd gates | Every PRD enters with a measurable target | Stub |
| 2 | LOOP-002 | North Star measurement CLI | Return scalar metric value for any PRD via terminal | tools/scripts/measure-north-star.ts | S1 | F1 | agent-comms latency | Agents can measure without human intervention | Stub |
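LOOP-002's core job is turning a stored formula into a scalar. A minimal sketch of what the evaluator inside `measure-north-star.ts` might look like, under the assumption that early formulas are simple two-term ratios (real formula parsing would need to be richer):

```typescript
// Hypothetical evaluator: compute an "a / b" ratio formula against counts
// fetched from the metric's named data source. Anything more complex than
// a two-term ratio is rejected rather than guessed at.
function evaluateRatio(formula: string, counts: Record<string, number>): number {
  const m = formula.match(/^(\w+)\s*\/\s*(\w+)$/);
  if (!m) throw new Error(`unsupported formula: ${formula}`);
  const [, num, den] = m;
  if (!(num in counts) || !(den in counts)) throw new Error("missing counts for formula terms");
  if (counts[den] === 0) throw new Error("division by zero: denominator count is 0");
  return counts[num] / counts[den];
}
```

Failing loudly on prose formulas keeps the CLI honest: a metric that cannot be evaluated should never silently return a number.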
Return Signal
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 3 | LOOP-003 | Return signal propagation | Write validated outcome numbers back to PRD frontmatter | .agents/skills/proof-to-story/SKILL.md | S2 | F2 | proof-to-story output | Same number proves engineering, seeds marketing | Stub |
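The S2 write-back can be sketched as a frontmatter merge. This is a simplified model, assuming flat `key: value` YAML frontmatter delimited by `---`; the field names come from the Story Contract, the function name is illustrative:

```typescript
// Minimal sketch: merge validated outcome fields into a PRD's frontmatter
// block so results land in machine-readable form, not spec prose
// (the S2 FORBIDDEN case). Handles flat key: value lines only.
interface ReturnSignal {
  actual_metric_value: number;
  validation_outcome: "validated" | "failed";
  pain_reduction_delta: number;
}

function writeReturnSignal(prd: string, signal: ReturnSignal): string {
  const lines = prd.split("\n");
  const close = lines.indexOf("---", 1); // closing frontmatter delimiter
  if (lines[0] !== "---" || close < 0) throw new Error("PRD has no frontmatter");
  const fields = Object.entries(signal).map(([k, v]) => `${k}: ${v}`);
  // Drop stale copies of the same fields, then append fresh values.
  const keep = lines.slice(1, close).filter(
    (l) => !Object.keys(signal).some((k) => l.startsWith(`${k}:`))
  );
  return ["---", ...keep, ...fields, "---", ...lines.slice(close + 1)].join("\n");
}
```

Replacing stale fields rather than appending duplicates is what lets the next scoring cycle trust a single value per key.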
Experiment Infrastructure
| # | FeatureID | Feature | Function | Artifact | Success Test | Safety Test | Regression Test | Value | State |
|---|---|---|---|---|---|---|---|---|---|
| 4 | LOOP-004 | Session experiment logger | Aggregate overnight results into structured morning report | .agents/skills/session-experiment-logger/SKILL.md | S4 | F4 | Comms CLI read perf | Structured experiment records, not scattered noise | Stub |
| 5 | LOOP-005 | Autonomous loop orchestrator | Chain scaffold > activate > validate > measure > story | stackmates/.agents/skills/loop-orchestrator/SKILL.md | S5 | F5 | Plan CLI stability | One trigger runs the full pain-to-proof cycle | Stub |
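LOOP-005's guard rails matter more than its chaining. A sketch of the control loop under assumed names (`runLoop`, a unit-based budget), encoding the Intent Contract's 3-iteration cap, the budget cap, and the two-consecutive-regressions halt; the F5 forbidden cases are exactly the guards it must not skip:

```typescript
// Illustrative orchestrator core: each pass would run
// scaffold > activate > validate > measure > story, then check the
// North Star. Stops on budget exhaustion, completes when the threshold
// is met, halts after two consecutive regressions, escalates at the cap.
interface LoopResult { completed: boolean; reason: string; iterations: number }

function runLoop(
  measure: () => number,          // returns the North Star scalar after a pass
  threshold: number,              // target from the PRD's metric definition
  budget: { unitsLeft: number },  // one unit spent per full pass
  maxIterations = 3               // Intent Contract hard constraint
): LoopResult {
  let prev = -Infinity;
  let regressions = 0;
  for (let i = 1; i <= maxIterations; i++) {
    if (budget.unitsLeft <= 0) return { completed: false, reason: "budget exhausted", iterations: i - 1 };
    budget.unitsLeft -= 1;
    // ...skill chain runs here; measure() reads the resulting metric...
    const value = measure();
    if (value >= threshold) return { completed: true, reason: "North Star met", iterations: i };
    regressions = value < prev ? regressions + 1 : 0;
    if (regressions >= 2) return { completed: false, reason: "metric regressed twice, halt", iterations: i };
    prev = value;
  }
  return { completed: false, reason: "iteration cap reached, escalate", iterations: maxIterations };
}
```

Every exit path names its reason, which is what the session-experiment-logger would need to write a useful morning report.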
Principles
SIO:
- Situation: When the priority stack has scored PRDs with Red tests waiting, and an agent has overnight compute budget available.
- Intention: The pain-to-proof cycle should run autonomously — measure, iterate, validate, propagate results — without seven separate human triggers.
- Obstacle: Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills.
- Hardest Thing: Making the loop trustworthy enough to run unsupervised. A bad loop that produces false greens is worse than no loop.
Why Now: The skill chain exists. The CLIs exist. Karpathy proved the pattern works at research scale. The gap is three wires, not a new architecture.
Design Constraints: Activation over creation. Wire existing skills before building new ones. Every metric must name its data source. Every loop iteration must log its experiment.
Performance
- Priority Score: 1200 (Prepare)
- Quality Targets: prd_full_cycles_per_month >= 1. North Star queryable for all Active PRDs within 30 days.
- Kill Signal: Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.
Platform
- Current State: 7 skills built, 0 wired into a loop. 7/7 Active PRDs have Queryable: No.
- Build Ratio: 30% new (3 skills + 1 CLI), 70% wiring (3 skill upgrades + integration).
Protocols
- Build Order: Phase 1 (metric-definer + measure-north-star.ts) → Phase 2 (proof-to-story upgrade + score-prds upgrade) → Phase 3 (session-logger + loop-orchestrator)
- Commissioning: L2 when metric-definer enforces on one PRD. L3 when one PRD completes a full cycle. L4 when prd_full_cycles_per_month >= 1 sustained for 60 days.
- First Target: Sales CRM (L2-L3, clearest metric: won_deals / total_deals > 0.30, real data in Postgres)
Players
- Demand-Side Jobs:
- As a platform operator, I want overnight agent sessions to produce structured experiment logs so I wake to a readable diff instead of scattered Comms.
- As a PRD author, I want validated outcomes to automatically boost the next scoring cycle so proven demand compounds without manual re-scoring.
- As an agent, I want to measure a PRD's North Star via CLI so I can determine whether my iteration improved or regressed the metric.