Pilot Selection Scorecard

The scorecard is the analysis layer that converts traced workflows into a ranked pilot recommendation. It is the companion to the Flow Discovery Kick-off: the kick-off produces the trace; the scorecard turns the trace into a defensible first-pilot pick.

The scorecard runs on every Workflow Row the kick-off has traced. Today's kick-off traces one; subsequent kick-offs add rows; the scorecard ranks across all of them. The first pilot is the row with the highest Pilot Fit: the product of Pain, Waste (flow shape), Readiness, and Confidence, adjusted for Risk.

Why this exists

Without the scorecard, the AI Strategy Meeting argues about which workflow to attack. With it, the argument is settled by the table — and any push-back has to attack the data, not the recommender. That is what turns the kick-off from a workshop into a decision instrument.

Four pillars, seventeen columns

Pain — what today costs (lagging):

Volume / frequency (units per week / month)
Time per unit (clock time from trigger to delivery)
Annual hours consumed (volume × time)
Senior-time share (% of those hours that are senior capacity)
Error / rework rate (% of units that loop back for correction)
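
The Pain columns are simple arithmetic on the raw units. A minimal sketch, assuming volume is recorded per week (52 periods) or per month (12 periods) and that senior-time share is held as a percentage; the function names and the sample numbers are illustrative, not part of the scorecard:

```python
def annual_hours(units_per_period: float, hours_per_unit: float,
                 periods_per_year: int = 52) -> float:
    """Annual hours consumed = volume x time per unit, scaled to a year."""
    return units_per_period * hours_per_unit * periods_per_year


def senior_hours(total_hours: float, senior_share_pct: float) -> float:
    """Hours of senior capacity inside the annual total."""
    return total_hours * senior_share_pct / 100


# Hypothetical workflow: 20 units per week, half an hour each, 25% senior time.
total = annual_hours(20, 0.5)          # weekly volume, default 52 periods
senior = senior_hours(total, 25)
print(total, senior)
```

For a monthly volume, pass `periods_per_year=12` so the two cadences stay comparable across rows.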

Flow shape — where the leverage is:

Hop count (total, from the trace)
Artifact hops (eliminable) vs Real hops (preserved)
Waste ratio (Artifact hops ÷ total hops — higher = more redesignable surface)
Current tool count (number of platforms one unit passes through)
SSOT gap count (number of places that claim authority for the same data)
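
The waste ratio above is a single division over the traced hops. A minimal sketch, assuming the trace has already classified each hop; the function name and the sample counts are hypothetical:

```python
def waste_ratio(artifact_hops: int, total_hops: int) -> float:
    """Waste ratio = Artifact hops / total hops (higher = more redesignable surface)."""
    if total_hops <= 0:
        raise ValueError("a traced workflow needs at least one hop")
    if not 0 <= artifact_hops <= total_hops:
        raise ValueError("artifact hops must be between 0 and total hops")
    return artifact_hops / total_hops


# Hypothetical trace: 8 hops in total, 6 of them classified as Artifact hops.
print(waste_ratio(6, 8))
```

Total hops is passed explicitly rather than derived from Artifact plus Real hops, so the sketch makes no assumption about how other hop classifications are counted.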

Readiness — can we actually ship this:

Logic status — documented / tacit / unknown
Data availability — does the signal exist in any system?
Integration difficulty — how exposed are the systems involved?
Trust / customer-risk level — what failure mode does the workflow tolerate?

Expected outcome — the bet:

Expected cycle-time delta (current → target, from hop-count delta)
Expected capacity reclaimed (senior hours per week)
Confidence level / source evidence (trace evidence, owner attestation, prior engagement data)

Each pillar reads from one of the four maps

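The pillars are not arbitrary. Each reads from the maps the kick-off populates, and the mapping is close to one-to-one: Pain and Waste both read from the Value Stream Map, and Confidence draws on two maps. The formula and rubric below refer to the Flow shape pillar as Waste and to the Expected outcome pillar as Confidence: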

Pain reads from the Value Stream Map metrics (Cycle Time, Wait Time, Flow Efficiency)
Waste reads from the Value Stream Map Seven Wastes + the hop count
Readiness reads from the Capability Map maturity levels
Confidence reads from the Outcome Map evidence requirements + Dependency Map unowned-blocker check

The four maps were designed to diagnose these dimensions before the scorecard existed. The scorecard names the synthesis the maps already made possible.

How to use the scorecard

Fill raw units first. Hours, hop counts, tool counts, error percentages — actual measured or owner-attested numbers. Raw evidence is the source of truth. Blanks are data; they reveal where the trace did not go deep enough.
Score each column 1–5 inside its pillar against the rubric below. Equal-weighting across all seventeen columns is wrong — whichever pillar has more columns silently dominates. Per-pillar scoring fixes that.
Compute Pilot Fit. The formula is explicit:
```
Pilot Fit = Pain × Waste × Readiness × Confidence × Risk Modifier
```
Each pillar score is the 1–5 average within its grouping. The Risk Modifier is a single selection:
```
Risk Modifier = 1.0 (low) | 0.75 (medium) | 0.5 (high)
```
The rule the formula encodes: high pain is not enough. The best first pilot is painful, waste-heavy, ready, evidenced, and not existentially risky. A high-pain row that fails on Readiness or carries existential customer risk loses to a slightly lower-pain row that ships cleanly.
Re-score after every kick-off. The ranking is a live document. New traces add rows; reconsidered evidence updates scores.
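
The computation in the steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and the column scores for the two rows are invented to show the rule that a high-pain, high-risk row can lose to a slightly lower-pain row that ships cleanly.

```python
from statistics import mean

# Risk Modifier values as given in the formula above.
RISK_MODIFIER = {"low": 1.0, "medium": 0.75, "high": 0.5}


def pillar_score(column_scores):
    """Average of the 1-5 column scores inside one pillar."""
    if not column_scores or any(not 1 <= s <= 5 for s in column_scores):
        raise ValueError("each column score must be between 1 and 5")
    return mean(column_scores)


def pilot_fit(pain, waste, readiness, confidence, risk):
    """Pilot Fit = Pain x Waste x Readiness x Confidence x Risk Modifier."""
    return (pillar_score(pain) * pillar_score(waste)
            * pillar_score(readiness) * pillar_score(confidence)
            * RISK_MODIFIER[risk])


# Invented rows: maximum pain but weak readiness and high risk,
# versus solid scores across the board and low risk.
painful_but_risky = pilot_fit([5, 5, 5, 5, 5], [4, 4, 4, 4], [2, 2, 2, 2], [3, 3, 3], "high")
ships_cleanly = pilot_fit([4, 4, 4, 4, 4], [4, 4, 4, 4], [4, 4, 4, 4], [4, 4, 4], "low")
print(painful_but_risky, ships_cleanly)
```

Because the pillars are multiplied rather than summed, a weak pillar drags the whole product down instead of being offset by a strong one.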

Scoring Rubric

Without thresholds the 1–5 scores are opinion. The rubric below makes them defensible. Use the highest-applicable score per column — a workflow that hits the 5 threshold on any sub-axis scores 5 for that column.

Pain (per column, take the highest matching threshold)

5 — annual hours >500h, OR senior-time share >40%, OR error rate >25%, OR per-unit time consumes >25% of an FTE day
4 — annual hours 200–500h, OR senior-share 20–40%, OR error rate 15–25%, OR per-unit time 10–25% of an FTE day
3 — annual hours 50–200h, OR senior-share 10–20%, OR error rate 5–15%, OR per-unit time 5–10% of an FTE day
2 — annual hours 10–50h, OR senior-share 5–10%, OR error rate 1–5%
1 — annual hours <10h, AND senior-share <5%, AND error rate <1%
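
The Pain bands can be applied mechanically. The sketch below makes two assumptions that the rubric does not state: band edges are inclusive, so a value on a shared edge such as 200h takes the higher score, and each column is scored on its own. Per-unit time is left out because the rubric gives it no bands for scores 2 and 1. The names are hypothetical.

```python
# Bands transcribed from the Pain rubric; shares and rates are percentages.
PAIN_BANDS = {
    "annual_hours": [(5, lambda v: v > 500), (4, lambda v: 200 <= v <= 500),
                     (3, lambda v: 50 <= v <= 200), (2, lambda v: 10 <= v <= 50),
                     (1, lambda v: v < 10)],
    "senior_share_pct": [(5, lambda v: v > 40), (4, lambda v: 20 <= v <= 40),
                         (3, lambda v: 10 <= v <= 20), (2, lambda v: 5 <= v <= 10),
                         (1, lambda v: v < 5)],
    "error_rate_pct": [(5, lambda v: v > 25), (4, lambda v: 15 <= v <= 25),
                       (3, lambda v: 5 <= v <= 15), (2, lambda v: 1 <= v <= 5),
                       (1, lambda v: v < 1)],
}


def score_pain_column(column: str, value: float) -> int:
    """Return the highest rubric score whose band contains the value."""
    return max(score for score, matches in PAIN_BANDS[column] if matches(value))


# Hypothetical raw units for one workflow row.
print(score_pain_column("annual_hours", 520))
print(score_pain_column("senior_share_pct", 25))
print(score_pain_column("error_rate_pct", 3))
```

Keeping the bands in one table means a calibration change to a threshold is a one-line edit rather than a change to scoring logic.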

Waste (per column)

5 — waste ratio >0.8, OR hop count >8, OR tool count >5, OR SSOT gap count ≥3
4 — waste ratio 0.6–0.8, OR hop count 5–8, OR tool count 4–5, OR SSOT gap count = 2
3 — waste ratio 0.4–0.6, OR hop count 3–5, OR tool count 3, OR SSOT gap count = 1
2 — waste ratio 0.2–0.4, OR hop count 2–3, OR tool count 2
1 — waste ratio <0.2, AND hop count <2, AND tool count 1

Readiness (per column)

5 — logic documented in writing, AND data 100% available in queryable systems, AND every required integration exists, AND no major capability gaps
4 — logic tacit but a one-hour interview surfaces it, AND data 80%+ available, AND integrations available with light work, AND Core capabilities at maturity 3+
3 — logic partially documented, AND data 50–80% available, AND some integration build required, AND Core capabilities at maturity 2
2 — logic unknown to the room, AND data <50% available, AND major integration gaps, AND Core capabilities at maturity 1
1 — logic missing entirely, AND data unavailable, AND no integrations, AND Core capabilities at maturity 0

Confidence (per column)

5 — every number owner-attested AND trace evidence on the page AND prior engagement data corroborates
4 — every number owner-attested AND trace evidence on the page
3 — every number owner-attested (no trace evidence yet)
2 — numbers are team estimates without owner attestation
1 — numbers are guesses, no source
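
The Confidence levels read as a ladder in which each level adds one kind of evidence to the level below. A minimal sketch under that reading; the function and parameter names are hypothetical, and evidence that skips a rung (for example prior engagement data without trace evidence) is scored at the last rung fully satisfied, which is an assumption:

```python
def confidence_score(has_source: bool, owner_attested: bool,
                     trace_evidence: bool, prior_data_corroborates: bool) -> int:
    """Walk the Confidence ladder from the bottom and stop at the first missing rung."""
    if not has_source:
        return 1  # guesses, no source
    if not owner_attested:
        return 2  # team estimates without owner attestation
    if not trace_evidence:
        return 3  # owner-attested, no trace evidence yet
    if not prior_data_corroborates:
        return 4  # owner-attested with trace evidence on the page
    return 5      # attested, traced, and corroborated by prior engagement data


print(confidence_score(True, True, True, False))
```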

Risk Modifier (one selection per workflow)

1.0 (low) — internal-only workflow, failure recoverable in hours, no customer or compliance exposure
0.75 (medium) — customer-facing or revenue-adjacent, failure visible but recoverable in days, no regulatory exposure
0.5 (high) — compliance, financial, safety-critical, or trust-critical; failure has regulatory, contractual, or brand consequences that survive the recovery

How to compute the pillar score

Take the average of the column scores within the pillar. For example, a Pain pillar with column scores (5, 4, 5, 4, 3) averages 21 / 5 = 4.2. That pillar score is the value that enters the Pilot Fit formula.

Calibration check

If two reasonable analysts running the rubric on the same trace produce per-pillar scores that differ by more than 1.0, the rubric thresholds need sharpening. Surface the gap as a kick-off lesson and feed it into the next version of this page.
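
The check can be run directly on two analysts' pillar scores. A minimal sketch; the function name and the sample scores are invented for illustration, and the comparison is strict because the text says "more than 1.0":

```python
def calibration_gaps(analyst_a: dict, analyst_b: dict, tolerance: float = 1.0) -> dict:
    """Return the pillars where the two analysts differ by more than the tolerance."""
    gaps = {}
    for pillar, score_a in analyst_a.items():
        gap = abs(score_a - analyst_b[pillar])
        if gap > tolerance:
            gaps[pillar] = gap
    return gaps


# Invented per-pillar scores from two analysts working from the same trace.
first = {"Pain": 4.2, "Waste": 3.0, "Readiness": 2.5, "Confidence": 4.0}
second = {"Pain": 4.0, "Waste": 4.5, "Readiness": 2.5, "Confidence": 3.0}
print(calibration_gaps(first, second))
```

Here only Waste is flagged; the Confidence gap of exactly 1.0 stays within tolerance.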

This page changes as we learn. That is the point.

Where the scorecard sits in the workflow

The scorecard is the output of the analyst pass. The in-room session produces rough pillar scores; within 48 hours the analyst pass normalises them against this rubric, computes the final Pilot Fit, and packages the result for the AI Strategy Meeting.

Critical distinction — Pilot Fit ranks Workflows, not Tasks

The Pilot Fit formula scores a Workflow Row — the entire JTBD as a single bet. The first-target Task Row is then chosen inside the highest-ranked workflow as the highest-waste Artifact-or-Hybrid step on the critical path. Tasks do not carry their own Pain × Waste × Readiness × Confidence × Risk profile in the same shape; only workflows do. Rank workflows; nominate tasks within the winner.
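
The two-stage selection can be sketched as below. The data shapes are assumptions: the `Task` and `Workflow` classes, the `waste_hours` field used to compare tasks, and the sample rows are all hypothetical, since this page does not define how waste is measured at task level.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    kind: str                 # "Artifact", "Hybrid", or "Real"
    on_critical_path: bool
    waste_hours: float        # hypothetical per-task waste measure


@dataclass
class Workflow:
    name: str
    pilot_fit: float
    tasks: list = field(default_factory=list)


def nominate(workflows):
    """Rank workflows by Pilot Fit, then pick the first-target task inside the winner."""
    winner = max(workflows, key=lambda w: w.pilot_fit)
    candidates = [t for t in winner.tasks
                  if t.on_critical_path and t.kind in ("Artifact", "Hybrid")]
    target = max(candidates, key=lambda t: t.waste_hours) if candidates else None
    return winner, target


rows = [
    Workflow("Workflow A", 60.0, [Task("re-key form", "Artifact", True, 9.0)]),
    Workflow("Workflow B", 256.0, [
        Task("approve exception", "Real", True, 12.0),
        Task("copy status to tracker", "Artifact", True, 6.0),
        Task("format weekly export", "Hybrid", False, 8.0),
    ]),
]
winner, target = nominate(rows)
print(winner.name, target.name)
```

Tasks are only compared inside the winning workflow, so a wasteful task in a lower-ranked workflow never competes for the nomination.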

Context

Flow Discovery Kick-off — The kick-off this scorecard belongs to
Worked Examples — AEO + client onboarding with the rubric applied
AI Strategy Meeting — The decision meeting that consumes the Pilot Fit ranking
Outcome Map · Value Stream Map · Dependency Map · Capability Map — The four maps each pillar reads from

Questions

If two analysts disagree on a pillar score by more than one point — is the rubric wrong, or is the evidence missing, or is the workflow harder to classify than the kick-off assumed?

Which pillar has the most contested threshold in your engagement so far — and what would sharpen the language?
When the room's intuition picks one first-pilot and the scorecard picks another, which one ships — and what does the reconciliation teach you about the rubric?
If Pilot Fit is the headline number, why does the scorecard keep raw evidence as the source of truth, and what fails if you skip the raw fill and go straight to scoring?
