ETL Data Tool — Value Stream Map
Where does time die in the data acquisition pipeline?
Current State (Manual)
Google search NZ business (5 min)
→ Copy name + address to spreadsheet (2 min)
→ Search Companies Office website manually (5 min)
→ Copy directors to spreadsheet (3 min)
→ Visit company website for context (5 min)
→ Classify industry by gut feel (1 min)
→ No trust score, no provenance
Total per entity: ~21 minutes of manual work. For 100 entities: ~35 hours.
Future State (Automated)
NZBN API bulk (100 entities in < 60s)
→ Companies Office API enrich (100 entities in < 5 min)
→ Crawl4AI website extraction (100 sites in < 20 min)
→ Trust scoring engine (100 records in < 1 min)
→ PostgreSQL load via agent-etl-cli (100 records in < 30s)
Total for 100 entities: ~30 minutes of automated runtime. Savings: ~34.5 hours.
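The totals above can be sanity-checked with quick arithmetic (all durations in minutes; the automated figures are the upper bounds quoted per stage):

```typescript
// Manual path: search, copy name+address, registry lookup, copy directors,
// website visit, gut-feel classification.
const manualPerEntity = 5 + 2 + 5 + 3 + 5 + 1;
const manualHours100 = (manualPerEntity * 100) / 60;

// Automated path for a 100-entity batch: NZBN bulk, Companies Office enrich,
// Crawl4AI extraction, trust scoring, PostgreSQL load.
const automatedMinutes100 = 1 + 5 + 20 + 1 + 0.5;
const savedHours = manualHours100 - automatedMinutes100 / 60;

console.log(manualPerEntity);       // 21
console.log(manualHours100);        // 35
console.log(automatedMinutes100);   // 27.5
console.log(savedHours.toFixed(1)); // "34.5"
```

The stage bounds sum to ~27.5 minutes; "~30 minutes" leaves a small margin for orchestration overhead between stages.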
Waste Identification
| Waste Type | Current | Future | Eliminated By |
|---|---|---|---|
| Manual search | 5 min/entity | 0 | NZBN API (Feature #1) |
| Manual copy | 5 min/entity | 0 | Direct API → PostgreSQL |
| Manual classification | 1 min/entity | 0 | ANZSIC codes from NZBN (Feature #5) |
| No trust signal | Unmeasured risk | 0-100 per record | Trust scoring (Feature #4) |
| No refresh | Stale forever | Weekly delta | Scheduled extraction (Feature #6) |
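The 0-100 trust score could look something like the sketch below. The signals and weights here are purely illustrative assumptions; the actual Feature #4 formula is not specified in this map.

```typescript
// Hypothetical per-record trust signals (names are illustrative, not the
// real schema).
interface EntityRecord {
  fromRegistry: boolean;    // sourced from NZBN / Companies Office APIs
  hasDirectors: boolean;    // director enrichment succeeded
  websiteReachable: boolean; // Crawl4AI fetched the site
  ageDays: number;          // days since last refresh
}

// Assumed weighting: authoritative source dominates; freshness decays
// roughly one point per week.
function trustScore(r: EntityRecord): number {
  let score = 0;
  if (r.fromRegistry) score += 50;
  if (r.hasDirectors) score += 20;
  if (r.websiteReachable) score += 20;
  score += Math.max(0, 10 - Math.floor(r.ageDays / 7));
  return Math.min(100, score);
}
```

A freshly enriched, fully verified record scores 100; a stale registry-only record scores 50, which is the kind of graded risk signal the manual process never produced.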
Bottleneck Analysis
| Stage | Throughput | Bottleneck? | Mitigation |
|---|---|---|---|
| NZBN API | 700K entities available | No — free, bulk | Rate limit: respect API terms |
| Companies Office API | Per-entity lookup | Yes — sequential calls | Batch + cache directors |
| Crawl4AI | ~3-5s per site | Yes — I/O bound | Docker parallelism, 10 concurrent |
| Trust scoring | CPU-bound calculation | No — simple formula | In-process, no external calls |
| PostgreSQL load | Drizzle batch insert | No — existing infra | Transaction batches of 50 |
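The two mitigations in the table — 10 concurrent crawls and transaction batches of 50 — reduce to a concurrency gate and a chunking helper. The sketch below is generic; the callback and variable names are assumptions, not the agent-etl-cli or Drizzle API.

```typescript
// Run an async fn over items with at most `limit` in flight at once.
async function mapConcurrent<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// Split rows into fixed-size batches for transactional inserts.
function chunk<T>(rows: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Usage (hypothetical crawlSite / insertBatch helpers):
// const pages = await mapConcurrent(urls, 10, crawlSite);
// for (const batch of chunk(records, 50)) await insertBatch(batch);
```

The worker-pool shape keeps exactly 10 sites in flight regardless of per-site latency, which is what makes the I/O-bound Crawl4AI stage scale; fixed batches of 50 keep each PostgreSQL transaction small enough to retry cheaply on failure.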
Context
- Outcome Map — What success looks like
- Dependency Map — What must happen first