Benchmark Standards
How do you know progress is real instead of narrative?
Benchmarks convert standards from opinion into operational evidence.
Why Benchmarks
| Without Benchmarks | With Benchmarks |
|---|---|
| Claims compete | Evidence compares |
| Work drifts | Variance is visible |
| Decisions are political | Decisions are thresholded |
| Quality depends on heroics | Quality depends on protocol |
Content Patterns
Templates for the content this repo produces. Each pattern has a job, a source of truth, and a flow.
| Pattern | Job (JTBD) | Source | Flow |
|---|---|---|---|
| Page Flow | Structure any page for readable, linked content | .claude/rules/page-flow.md | Opening → Visual → Insight → Content → Context → Links → Questions |
| Profile | Extract a person's wisdom as reusable models | .claude/rules/profile-pattern.md | Conviction → Visual → Thesis → Wisdom → Context → Questions |
| PRD | Specify a product from dream to engineering | .agents/skills/create-prd/ | Pictures → Index → Prompt Deck → Spec |
| FACT Hub | Define WHAT something is, link implementations | .claude/rules/fact-and-star-architecture.md | Definition → Dig Deeper → Stars |
| Decision | Decide with visible reasoning and accumulated learning | control-system.md | Outcome Map → Gates → Decision Log (with Learned) |
| Venture | Decision surface, not documentation | .claude/rules/matter-first-pages.md | Picture → Questions → Headline → Slow/Fast |
| Landing | Sell outcomes, not features | .claude/rules/src-pages-gates.md | Hero → Problem → Solution → Proof → CTA |
| Template | Structured gap that pulls thinking toward answers | docs/pictures/templates/ | See → Dream → Build |
| Industry | 5P scan of any domain | .agents/skills/deep-research/ | Principles → Performance → Platform → Protocols → Players |
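Since every pattern reduces to a job, a source, and a flow, the table itself is machine-checkable. A minimal sketch of that shape in TypeScript; the `ContentPattern` interface and the sample entry are illustrative, not something the repo defines:

```ts
// Illustrative only; the repo's actual source of truth is the rule files above.
interface ContentPattern {
  name: string;
  job: string;       // the Job (JTBD) column
  source: string;    // path to the rule or skill that owns the pattern
  flow: string[];    // ordered sections a conforming page must contain
}

const pageFlow: ContentPattern = {
  name: "Page Flow",
  job: "Structure any page for readable, linked content",
  source: ".claude/rules/page-flow.md",
  flow: ["Opening", "Visual", "Insight", "Content", "Context", "Links", "Questions"],
};
```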
Benchmark Families
Domain-specific benchmark standards for each layer:
| Family | Focus | Primary Use |
|---|---|---|
| AI/LLM | Model and workflow performance | Reliability, cost, latency, safety |
| Blockchain | Settlement and interoperability performance | Transaction quality and network utility |
| Wallet Safety | Wallet UX and architectural safety | Key protection, transaction transparency, destructive operation prevention |
| Information Architecture | Navigation and findability quality | Information retrieval speed and clarity |
| UI Design | Render, usability, and accessibility quality | Human-visible quality gates |
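A family is just a named set of metrics with explicit thresholds, where each threshold is marked critical or advisory. A minimal sketch of what such a definition could look like; the `BenchmarkFamily` and `Threshold` shapes and the sample numbers are assumptions, not definitions from this repo:

```ts
// Hypothetical types; illustrative only, not part of this repo.
type Direction = "atMost" | "atLeast";

interface Threshold {
  metric: string;      // e.g. "taskPassRate", "p95LatencyMs"
  direction: Direction;
  limit: number;
  critical: boolean;   // a critical miss means Fail; a non-critical miss means Warn
}

interface BenchmarkFamily {
  name: string;        // e.g. "AI/LLM"
  focus: string;
  thresholds: Threshold[];
}

// Example: an AI/LLM family with one critical and one advisory threshold.
const aiLlm: BenchmarkFamily = {
  name: "AI/LLM",
  focus: "Model and workflow performance",
  thresholds: [
    { metric: "taskPassRate", direction: "atLeast", limit: 0.95, critical: true },
    { metric: "p95LatencyMs", direction: "atMost", limit: 2000, critical: false },
  ],
};
```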
Trigger Loop
Benchmarks only matter if they trigger operating decisions:
| State | Trigger | Decision |
|---|---|---|
| Pass | Meets all required thresholds | Promote current standard |
| Warn | Misses a non-critical threshold | Run corrective loop and re-test |
| Fail | Misses a critical threshold | Hold rollout or rollback |
No trigger, no benchmark discipline.
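The state machine above is small enough to encode directly. A sketch of the trigger logic, reusing the hypothetical `Threshold` shape from the family sketch above; the function name and the treatment of missing metrics are assumptions:

```ts
type TriggerState = "Pass" | "Warn" | "Fail";

function meets(t: Threshold, value: number): boolean {
  return t.direction === "atMost" ? value <= t.limit : value >= t.limit;
}

// Map measured values to a trigger state: any critical miss is Fail,
// any non-critical miss is Warn, otherwise Pass.
// A missing metric is treated as a miss (assumption: unmeasured is unproven).
function trigger(
  thresholds: Threshold[],
  results: Record<string, number>,
): TriggerState {
  let warned = false;
  for (const t of thresholds) {
    const value = results[t.metric];
    if (value === undefined || !meets(t, value)) {
      if (t.critical) return "Fail"; // hold rollout or rollback
      warned = true;                 // run corrective loop and re-test
    }
  }
  return warned ? "Warn" : "Pass";   // Pass promotes the current standard
}
```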
Use Sequence
- Select the benchmark family for the system you are evaluating
- Define thresholds before execution
- Run evaluation with reproducible protocol
- Trigger decision workflow from result state
- Record outcome and update standard
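Tying the sequence together, an end-to-end sketch built on the hypothetical types above. Thresholds are fixed before the run, the run produces results, the trigger decides, and the outcome is recorded. `runEvaluation` and `recordOutcome` are placeholders for whatever protocol and decision log a team actually uses:

```ts
// Placeholders; the real evaluation protocol and decision log live elsewhere.
declare function runEvaluation(family: BenchmarkFamily): Record<string, number>;
declare function recordOutcome(family: string, state: TriggerState): void;

// 1. Select the family. 2. Thresholds were defined before execution.
// 3. Run the reproducible protocol. 4. Trigger the decision. 5. Record it.
const results = runEvaluation(aiLlm);
const state = trigger(aiLlm.thresholds, results);
recordOutcome(aiLlm.name, state);

if (state === "Fail") {
  // Hold rollout or rollback; critical thresholds are non-negotiable.
} else if (state === "Warn") {
  // Corrective loop, then re-test before promoting.
}
```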
Context
- Standards: define repeatable outcomes
- Process Optimisation: the PDCA operating loop
- Performance: measurement and decision discipline
- Network Protocols: the coordination layer across systems
- Pictures: templates and stories, the potential and the proof
- Tight Five: the meta-pattern every benchmark family instantiates
Questions
- What benchmark would change your behavior if you actually measured it?
- Which content pattern on this page has no benchmark family — and what would measuring it reveal?
- When a benchmark triggers "Warn" but the team promotes anyway, what broke — the threshold or the discipline?
- If quality depends on protocol, not heroics, which protocol on this site still depends on heroics?