ETL Data Tool — Value Stream Map

Where does time die in the data acquisition pipeline?

Current State (Manual)

Google search NZ business (5 min)
→ Copy name + address to spreadsheet (2 min)
→ Search Companies Office website manually (5 min)
→ Copy directors to spreadsheet (3 min)
→ Visit company website for context (5 min)
→ Classify industry by gut feel (1 min)
→ No trust score, no provenance

Total per entity: ~21 minutes of manual work. For 100 entities: ~35 hours.
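The manual-effort totals follow directly from the six step timings listed above; a quick check:

```python
# Sum the six manual steps (5 + 2 + 5 + 3 + 5 + 1 minutes) and
# project the effort across 100 entities.
minutes_per_entity = 5 + 2 + 5 + 3 + 5 + 1   # = 21 min/entity
hours_for_100 = minutes_per_entity * 100 / 60  # = 35.0 hours
```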

Future State (Automated)

NZBN API bulk (100 entities in < 60s)
→ Companies Office API enrich (100 entities in < 5 min)
→ Crawl4AI website extraction (100 sites in < 20 min)
→ Trust scoring engine (100 records in < 1 min)
→ PostgreSQL load via agent-etl-cli (100 records in < 30s)

Total for 100 entities: ~30 minutes of automated runtime. Savings: ~34.5 hours.
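The five automated stages above chain into a single pass per entity. A minimal sketch follows; every function is a hypothetical stand-in for the real integrations (NZBN API, Companies Office API, Crawl4AI, the trust engine, and the agent-etl-cli load step), not their actual APIs:

```python
# Hedged sketch of the future-state pipeline. Stage functions are
# illustrative stubs, not the real NZBN / Companies Office / Crawl4AI calls.

def fetch_nzbn_bulk(limit: int) -> list[dict]:
    """Stand-in for the NZBN bulk API: returns entity stubs."""
    return [{"nzbn": f"94290{i:08d}", "name": f"Entity {i}"} for i in range(limit)]

def enrich_directors(entity: dict) -> dict:
    """Stand-in for the Companies Office director lookup."""
    entity["directors"] = []
    return entity

def extract_website(entity: dict) -> dict:
    """Stand-in for Crawl4AI website extraction."""
    entity["website_text"] = ""
    return entity

def score_trust(entity: dict) -> dict:
    """Stand-in for the trust scoring engine (0-100)."""
    entity["trust_score"] = 50
    return entity

def run_pipeline(limit: int = 100) -> list[dict]:
    records = []
    for entity in fetch_nzbn_bulk(limit):
        entity = score_trust(extract_website(enrich_directors(entity)))
        records.append(entity)
    return records  # the real tool loads these into PostgreSQL via agent-etl-cli
```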

Waste Identification

| Waste Type | Current | Future | Eliminated By |
|---|---|---|---|
| Manual search | 5 min/entity | 0 | NZBN API (Feature #1) |
| Manual copy | 5 min/entity | 0 | Direct API → PostgreSQL |
| Manual classification | 1 min/entity | 0 | ANZSIC codes from NZBN (Feature #5) |
| No trust signal | Unmeasured risk | 0-100 per record | Trust scoring (Feature #4) |
| No refresh | Stale forever | Weekly delta | Scheduled extraction (Feature #6) |
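The "0-100 per record" trust signal could be as simple as a weighted sum over provenance checks. The signal names and weights below are illustrative assumptions, not the tool's actual formula:

```python
# Hypothetical trust-score sketch: each provenance signal present in a
# record contributes a fixed weight; weights sum to 100.
WEIGHTS = {
    "has_nzbn_match": 40,     # assumed: verified against the NZBN register
    "has_directors": 25,      # assumed: enriched from Companies Office
    "website_reachable": 20,  # assumed: Crawl4AI extraction succeeded
    "anzsic_classified": 15,  # assumed: industry code present
}

def trust_score(record: dict) -> int:
    """Return a 0-100 score: sum of weights for signals the record has."""
    return sum(w for signal, w in WEIGHTS.items() if record.get(signal))
```

A record matched in NZBN with directors enriched, but no crawled website or ANZSIC code, would score 65 under these assumed weights.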

Bottleneck Analysis

| Stage | Throughput | Bottleneck? | Mitigation |
|---|---|---|---|
| NZBN API | 700K entities available | No — free, bulk | Rate limit: respect API terms |
| Companies Office API | Per-entity lookup | Yes — sequential calls | Batch + cache directors |
| Crawl4AI | ~3-5s per site | Yes — I/O bound | Docker parallelism, 10 concurrent |
| Trust scoring | CPU-bound calculation | No — simple formula | In-process, no external calls |
| PostgreSQL load | Drizzle batch insert | No — existing infra | Transaction batches of 50 |
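The two mitigations for the I/O-bound stages can be sketched with standard asyncio primitives: a semaphore capping crawls at 10 concurrent, and a generator yielding insert batches of 50. The crawl function is a placeholder, not the Crawl4AI API:

```python
import asyncio

async def crawl_site(url: str) -> str:
    """Placeholder for a 3-5s Crawl4AI fetch of one site."""
    await asyncio.sleep(0)
    return f"content of {url}"

async def crawl_all(urls: list[str], concurrency: int = 10) -> list[str]:
    """Crawl all URLs with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:
            return await crawl_site(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

def batches(records: list, size: int = 50):
    """Yield chunks of `size` records, one per insert transaction."""
    for i in range(0, len(records), size):
        yield records[i:i + size]
```

With 10 concurrent crawls at ~3-5s each, 100 sites finish in roughly 30-50s of wall time per wave of 10, consistent with the "< 20 min" budget above.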

Context