ETL Data Tool — Value Stream Map
Where does time die in the data acquisition pipeline?
Current State (Manual)
Google search NZ business (5 min)
→ Copy name + address to spreadsheet (2 min)
→ Search Companies Office website manually (5 min)
→ Copy directors to spreadsheet (3 min)
→ Visit company website for context (5 min)
→ Classify industry by gut feel (1 min)
→ No trust score, no provenance
Total per entity: ~21 minutes of manual work. For 100 entities: ~35 hours.
Future State (Automated)
NZBN API bulk (100 entities in < 60s)
→ Companies Office API enrich (100 entities in < 5 min)
→ Crawl4AI website extraction (100 sites in < 20 min)
→ Trust scoring engine (100 records in < 1 min)
→ PostgreSQL load via agent-etl-cli (100 records in < 30s)
Total for 100 entities: ~30 minutes of automated runtime. Savings: ~34.5 hours.
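The totals above can be sanity-checked with quick arithmetic (all durations in minutes; the automated figures are the upper bounds quoted per stage):

```typescript
// Manual path: search, copy name+address, registry lookup, copy directors,
// website visit, gut-feel classification.
const manualPerEntity = 5 + 2 + 5 + 3 + 5 + 1;
const manualHours100 = (manualPerEntity * 100) / 60;

// Automated path for a 100-entity batch: NZBN bulk, Companies Office enrich,
// Crawl4AI extraction, trust scoring, PostgreSQL load.
const automatedMinutes100 = 1 + 5 + 20 + 1 + 0.5;
const savedHours = manualHours100 - automatedMinutes100 / 60;

console.log(manualPerEntity);       // 21
console.log(manualHours100);        // 35
console.log(automatedMinutes100);   // 27.5
console.log(savedHours.toFixed(1)); // "34.5"
```

The stage bounds sum to ~27.5 minutes; "~30 minutes" leaves a small margin for orchestration overhead between stages.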
Waste Identification
| Waste Type | Current | Future | Eliminated By |
|---|---|---|---|
| Manual search | 5 min/entity | 0 | NZBN API (Feature #1) |
| Manual copy | 5 min/entity | 0 | Direct API → PostgreSQL |
| Manual classification | 1 min/entity | 0 | ANZSIC codes from NZBN (Feature #5) |
| No trust signal | Unmeasured risk | 0-100 per record | Trust scoring (Feature #4) |
| No refresh | Stale forever | Weekly delta | Scheduled extraction (Feature #6) |
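The 0-100 trust score could look something like the sketch below. The signals and weights here are purely illustrative assumptions; the actual Feature #4 formula is not specified in this map.

```typescript
// Hypothetical per-record trust signals (names are illustrative, not the
// real schema).
interface EntityRecord {
  fromRegistry: boolean;    // sourced from NZBN / Companies Office APIs
  hasDirectors: boolean;    // director enrichment succeeded
  websiteReachable: boolean; // Crawl4AI fetched the site
  ageDays: number;          // days since last refresh
}

// Assumed weighting: authoritative source dominates; freshness decays
// roughly one point per week.
function trustScore(r: EntityRecord): number {
  let score = 0;
  if (r.fromRegistry) score += 50;
  if (r.hasDirectors) score += 20;
  if (r.websiteReachable) score += 20;
  score += Math.max(0, 10 - Math.floor(r.ageDays / 7));
  return Math.min(100, score);
}
```

A freshly enriched, fully verified record scores 100; a stale registry-only record scores 50, which is the kind of graded risk signal the manual process never produced.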
Bottleneck Analysis
| Stage | Throughput | Bottleneck? | Mitigation |
|---|---|---|---|
| NZBN API | 700K entities available | No — free, bulk | Rate limit: respect API terms |
| Companies Office API | Per-entity lookup | Yes — sequential calls | Batch + cache directors |
| Crawl4AI | ~3-5s per site | Yes — I/O bound | Docker parallelism, 10 concurrent |
| Trust scoring | CPU-bound calculation | No — simple formula | In-process, no external calls |
| PostgreSQL load | Drizzle batch insert | No — existing infra | Transaction batches of 50 |
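The two mitigations in the table — 10 concurrent crawls and transaction batches of 50 — reduce to a concurrency gate and a chunking helper. The sketch below is generic; the callback and variable names are assumptions, not the agent-etl-cli or Drizzle API.

```typescript
// Run an async fn over items with at most `limit` in flight at once.
async function mapConcurrent<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// Split rows into fixed-size batches for transactional inserts.
function chunk<T>(rows: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) out.push(rows.slice(i, i + size));
  return out;
}

// Usage (hypothetical crawlSite / insertBatch helpers):
// const pages = await mapConcurrent(urls, 10, crawlSite);
// for (const batch of chunk(records, 50)) await insertBatch(batch);
```

The worker-pool shape keeps exactly 10 sites in flight regardless of per-site latency, which is what makes the I/O-bound Crawl4AI stage scale; fixed batches of 50 keep each PostgreSQL transaction small enough to retry cheaply on failure.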
Context
- Outcome Map — What success looks like
- Dependency Map — What must happen first