
Autoresearch Loop

When the priority stack holds scored PRDs with Red tests waiting and an agent has overnight compute budget, the pain-to-proof cycle should run autonomously, without seven separate human triggers.

Priority Score: 1,200 (Pain × Demand × Edge × Trend × Conversion)
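The score is a straight product of the five 5P factors scored later on this page. A minimal sketch (the function name is illustrative, not the real scorer):

```python
def priority_score(pain: int, demand: int, edge: int, trend: int, conversion: int) -> int:
    """Multiply the five 5P factors, each scored 1-5. Illustrative helper only."""
    return pain * demand * edge * trend * conversion

# The 5P scores from the Priority section: Pain 4, Demand 4, Edge 5, Trend 5, Convert 3
print(priority_score(4, 4, 5, 5, 3))  # → 1200
```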
Customer Journey

Why should I care?

Five cards that sell the dream

1. Why

Seven skills, zero wiring.

What's the cost of seven skills that don't compound?

The friction: Pain-signal-extractor through proof-to-story exist across two repos. Every step requires a human trigger. The chain never runs.

The desire: One trigger that chains scaffold, activate, validate, measure, story. Overnight compute turns dormant PRDs into proven demand.

The proof: Karpathy's autoresearch validated the pattern at research scale. The gap is three wires, not a new architecture.


Same five positions. Different seat. The customer asks "will it run overnight?" The builder asks "will the morning report be trustworthy?"

Feature Dev Journey

How do we build this?

Five cards that sell the process

1. Job

Activation over creation.

70% wiring, 30% new. What exists already?

7 skills built, 0 wired into a loop. The build contract is 6 rows: 3 skill upgrades, 1 new CLI command, 1 new skill, 1 orchestrator.

Situation

Seven skills span pain-to-proof across two repos but run in isolation. Every step requires a human trigger. Every North Star in the priority index reads 'Queryable: No'. The factory is designed but dormant.

Intention

One trigger runs scaffold, activate, validate, measure, story for the top uncommissioned PRDs. Metrics are queryable. Validated outcomes propagate back to frontmatter. One PRD completes a full pain-to-proof cycle per month.
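The one-trigger chain described above can be sketched as a sequential dispatcher. Stage names come from this page; the function and PRD identifier are illustrative, not real skill APIs:

```python
# Hypothetical orchestrator sketch: one trigger chains the five stages in order.
STAGES = ["scaffold", "activate", "validate", "measure", "story"]

def run_cycle(prd_id: str, stages=STAGES) -> dict:
    """Run each stage in sequence for one PRD, recording the chain."""
    results = {}
    for stage in stages:
        # A real loop would dispatch to the corresponding skill here.
        results[stage] = f"{stage} completed for {prd_id}"
    return results

report = run_cycle("PRD-042")
print(list(report))  # → ['scaffold', 'activate', 'validate', 'measure', 'story']
```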

Obstacle

Metrics aren't queryable (prose, not formulas). Validated outcomes don't propagate back to frontmatter. No conductor chains the skills. Trust gap: a bad loop that produces false greens is worse than no loop.

Hardest Thing

Making the loop trustworthy enough to run unsupervised. Budget caps, metric regression halts, and experiment logging are the safety rails; without them the loop runs fast in the wrong direction.
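The two hard-stop rails named above (budget cap, metric regression halt) reduce to a single guard check. The thresholds and parameter names here are assumptions, not the real config:

```python
def should_halt(spent_usd: float, budget_usd: float,
                metric_now: float, metric_baseline: float,
                regression_tolerance: float = 0.05) -> bool:
    """Halt the overnight loop if the budget is exhausted or the metric regressed."""
    over_budget = spent_usd >= budget_usd
    regressed = metric_now < metric_baseline * (1 - regression_tolerance)
    return over_budget or regressed

print(should_halt(9.0, 10.0, 0.93, 0.90))  # within budget, metric improved → False
print(should_halt(9.0, 10.0, 0.80, 0.90))  # metric regressed past 5% tolerance → True
```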

Priority (5P)

Pain: 4/5
Demand: 4/5
Edge: 5/5
Trend: 5/5
Convert: 3/5

Readiness (5R)

Principles: 4/5
Performance: 2/5
Platform: 3/5
Process: 2/5
Players: 2/5

What Exists

Component | State | Gap
pain-signal-extractor skill | Working | Extracts pain from interviews. No chaining to next step.
create-prd skill | Working | Creates PRDs with 5P scoring. No queryable metric enforcement.
engineering-handoff skill | Working | 9-gate pre-flight. Gate 8 can't check queryable metrics.
proof-to-story skill | Working | Writes meta article. Doesn't propagate numbers to frontmatter.
score-prds skill | Working | Scores by 5P. No confidence boost for validated PRDs.
measure-north-star CLI | Stub | Namespace exists. No scalar measurement implemented.
session-experiment-logger | Missing | No overnight aggregation. Results scattered across Comms.
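The "queryable metric" gap recurs across three rows of the table. A hedged sketch of what queryable could mean, assuming frontmatter carries a formula string over named counters (keys and formula syntax are assumptions):

```python
# Sketch: a queryable North Star is a formula over counters, not prose.
frontmatter = {
    "north_star": "activated_sessions / total_sessions",
    "queryable": True,
}
counters = {"activated_sessions": 42, "total_sessions": 120}

def measure(formula: str, counters: dict) -> float:
    """Evaluate the formula against counters. eval() is fine for a sketch, not production."""
    return eval(formula, {"__builtins__": {}}, counters)

print(round(measure(frontmatter["north_star"], counters), 2))  # → 0.35
```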

Kill Signal

Loop runs for 90 days with zero PRDs completing a full cycle. Or: metrics are defined but never queried.

Questions

What compounds faster — building new skills or wiring existing ones into a loop?

  • If every overnight session produced a structured experiment log, what would you learn by morning?
  • Which dormant algorithm would benefit most from 100 automated iterations?
  • Can LOOP-001 + LOOP-003 deliver value without the full orchestrator (LOOP-005)?
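On the first question above: if every overnight session emitted one structured record, the morning report is a query, not archaeology. The page defines no schema, so every field name below is illustrative:

```python
import json
from datetime import datetime, timezone

# Hypothetical overnight experiment log entry; field names are assumptions.
entry = {
    "prd": "LOOP-001",
    "stage": "validate",
    "started": datetime(2025, 1, 15, 2, 0, tzinfo=timezone.utc).isoformat(),
    "tests": {"red_before": 7, "green_after": 6},
    "budget_spent_usd": 1.42,
    "halted": False,
}
print(json.dumps(entry, indent=2))
```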