Benchmark Standards
How do you know progress is real instead of narrative?
Benchmarks convert standards from opinion into operational evidence.
Why Benchmarks
| Without Benchmarks | With Benchmarks |
|---|---|
| Claims compete | Evidence compares |
| Work drifts | Variance is visible |
| Decisions are political | Decisions are thresholded |
| Quality depends on heroics | Quality depends on protocol |
Content Patterns
Templates for the content this repo produces. Each pattern has a job, a source of truth, and a flow.
| Pattern | Job (JTBD) | Source | Flow |
|---|---|---|---|
| Page Flow | Structure any page for readable, linked content | .claude/rules/page-flow.md | Opening → Visual → Insight → Content → Context → Links → Questions |
| Profile | Extract a person's wisdom as reusable models | .claude/rules/profile-pattern.md | Conviction → Visual → Thesis → Wisdom → Context → Questions |
| PRD | Specify a product from dream to engineering | .agents/skills/create-prd/ | Pictures → Index → Prompt Deck → Spec |
| FACT Hub | Define WHAT something is, link implementations | .claude/rules/fact-and-star-architecture.md | Definition → Dig Deeper → Stars |
| Decision | Decide with visible reasoning and accumulated learning | control-system.md | Outcome Map → Gates → Decision Log (with Learned) |
| Venture | Decision surface, not documentation | .claude/rules/matter-first-pages.md | Picture → Questions → Headline → Slow/Fast |
| Landing | Sell outcomes, not features | .claude/rules/src-pages-gates.md | Hero → Problem → Solution → Proof → CTA |
| Template | Structured gap that pulls thinking toward answers | docs/pictures/templates/ | See → Dream → Build |
| Industry | 5P scan of any domain | .agents/skills/deep-research/ | Principles → Performance → Platform → Protocols → Players |
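Since every pattern reduces to a job, a source, and a flow, the table itself is machine-checkable. A minimal sketch of that shape in TypeScript; the `ContentPattern` interface and the sample entry are illustrative, not something the repo defines:

```ts
// Illustrative only; the repo's actual source of truth is the rule files above.
interface ContentPattern {
  name: string;
  job: string;       // the Job (JTBD) column
  source: string;    // path to the rule or skill that owns the pattern
  flow: string[];    // ordered sections a conforming page must contain
}

const pageFlow: ContentPattern = {
  name: "Page Flow",
  job: "Structure any page for readable, linked content",
  source: ".claude/rules/page-flow.md",
  flow: ["Opening", "Visual", "Insight", "Content", "Context", "Links", "Questions"],
};
```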
Benchmark Families
Domain-specific benchmark standards for each layer:
| Family | Focus | Primary Use |
|---|---|---|
| AI/LLM | Model and workflow performance | Reliability, cost, latency, safety |
| Blockchain | Settlement and interoperability performance | Transaction quality and network utility |
| Wallet Safety | Wallet UX and architectural safety | Key protection, transaction transparency, destructive operation prevention |
| Information Architecture | Navigation and findability quality | Information retrieval speed and clarity |
| UI Design | Render, usability, and accessibility quality | Human-visible quality gates |
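A family is just a named set of metrics with explicit thresholds, where each threshold is marked critical or advisory. A minimal sketch of what such a definition could look like; the `BenchmarkFamily` and `Threshold` shapes and the sample numbers are assumptions, not definitions from this repo:

```ts
// Hypothetical types; illustrative only, not part of this repo.
type Direction = "atMost" | "atLeast";

interface Threshold {
  metric: string;      // e.g. "taskPassRate", "p95LatencyMs"
  direction: Direction;
  limit: number;
  critical: boolean;   // a critical miss means Fail; a non-critical miss means Warn
}

interface BenchmarkFamily {
  name: string;        // e.g. "AI/LLM"
  focus: string;
  thresholds: Threshold[];
}

// Example: an AI/LLM family with one critical and one advisory threshold.
const aiLlm: BenchmarkFamily = {
  name: "AI/LLM",
  focus: "Model and workflow performance",
  thresholds: [
    { metric: "taskPassRate", direction: "atLeast", limit: 0.95, critical: true },
    { metric: "p95LatencyMs", direction: "atMost", limit: 2000, critical: false },
  ],
};
```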
Trigger Loop
Benchmarks only matter if they trigger operating decisions:
| State | Trigger | Decision |
|---|---|---|
| Pass | Meets all required thresholds | Promote current standard |
| Warn | Misses a non-critical threshold | Run corrective loop and re-test |
| Fail | Misses a critical threshold | Hold rollout or rollback |
No trigger, no benchmark discipline.
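The state machine above is small enough to encode directly. A sketch of the trigger logic, reusing the hypothetical `Threshold` shape from the family sketch above; the function name and the treatment of missing metrics are assumptions:

```ts
type TriggerState = "Pass" | "Warn" | "Fail";

function meets(t: Threshold, value: number): boolean {
  return t.direction === "atMost" ? value <= t.limit : value >= t.limit;
}

// Map measured values to a trigger state: any critical miss is Fail,
// any non-critical miss is Warn, otherwise Pass.
// A missing metric is treated as a miss (assumption: unmeasured is unproven).
function trigger(
  thresholds: Threshold[],
  results: Record<string, number>,
): TriggerState {
  let warned = false;
  for (const t of thresholds) {
    const value = results[t.metric];
    if (value === undefined || !meets(t, value)) {
      if (t.critical) return "Fail"; // hold rollout or rollback
      warned = true;                 // run corrective loop and re-test
    }
  }
  return warned ? "Warn" : "Pass";   // Pass promotes the current standard
}
```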
Use Sequence
- Select the benchmark family for the system you are evaluating
- Define thresholds before execution
- Run evaluation with reproducible protocol
- Trigger decision workflow from result state
- Record outcome and update standard
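Tying the sequence together, an end-to-end sketch built on the hypothetical types above. Thresholds are fixed before the run, the run produces results, the trigger decides, and the outcome is recorded. `runEvaluation` and `recordOutcome` are placeholders for whatever protocol and decision log a team actually uses:

```ts
// Placeholders; the real evaluation protocol and decision log live elsewhere.
declare function runEvaluation(family: BenchmarkFamily): Record<string, number>;
declare function recordOutcome(family: string, state: TriggerState): void;

// 1. Select the family. 2. Thresholds were defined before execution.
// 3. Run the reproducible protocol. 4. Trigger the decision. 5. Record it.
const results = runEvaluation(aiLlm);
const state = trigger(aiLlm.thresholds, results);
recordOutcome(aiLlm.name, state);

if (state === "Fail") {
  // Hold rollout or rollback; critical thresholds are non-negotiable.
} else if (state === "Warn") {
  // Corrective loop, then re-test before promoting.
}
```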
Context
- Standards: define repeatable outcomes
- Process Optimisation: the PDCA operating loop
- Performance: measurement and decision discipline
- Network Protocols: the coordination layer across systems
- Pictures: templates and stories, the potential and the proof
- Tight Five: the meta-pattern every benchmark family instantiates
Questions
- What benchmark would change your behavior if you actually measured it?
- Which content pattern on this page has no benchmark family — and what would measuring it reveal?
- When a benchmark triggers "Warn" but the team promotes anyway, what broke — the threshold or the discipline?
- If quality depends on protocol, not heroics, which protocol on this site still depends on heroics?