ETL Data Tool Spec
How do we turn empty repos into trusted data without hiring a data team?
Intent Contract
| Dimension | Statement |
|---|---|
| Objective | Unified data acquisition pipeline that connects government APIs, web enrichment, and trust scoring — so CRM and nowcasting have real data without bespoke engineering per source. |
| Outcomes | 1. 100 NZ businesses ingested from NZBN with trust scores within 7 days of deployment. 2. Companies Office directors linked to NZBN entities automatically. 3. Web enrichment adds business model classification to 50%+ of ingested entities. |
| Health Metrics | Pipeline run success rate >95%. Zero data corruption on failed runs (transaction rollback). Trust score computation completes for every ingested record. |
| Counter-metrics | Pipeline must not slow CRM page loads (async ingestion only). Trust scores must not create false confidence — low-evidence records must score low. Enrichment must not scrape personal data (NZ Privacy Act 2020). |
| Constraints | Hard: Free data sources first (NZBN, Companies Office, data.govt.nz). Hard: No LinkedIn scraping (Proxycurl sued Jan 2026). Hard: Trust score on every record — no data enters repos without provenance. Steering: TypeScript pipeline matching existing agent-etl-cli.ts patterns. |
| Autonomy | Allowed: API client implementation, extraction scheduling, trust score formula tuning. Escalate: New paid data sources, schema changes to business entity tables. Never: Store personal contact data without lawful purpose, bypass trust scoring, delete source records. |
| Blast Radius | PostgreSQL business entity tables (new INSERTs). CRM contact enrichment (new data flowing in). Scheduled jobs (new cron entries). Docker (Crawl4AI container). |
| Rollback | Each pipeline is independent — disable by removing cron entry. Ingested records can be soft-deleted by source batch ID. Trust scores can be recomputed from raw data. Crawl4AI container can be stopped without affecting other pipelines. |
| Stop Rules | Complete when: 100 businesses ingested with trust scores, Companies Office directors linked, web enrichment classifying 50%+ of entities. Halt when: NZBN API access revoked, or trust scores show no discrimination (all records score the same). |
Story Contract
Stories are test contracts. Each row converts to test files. Tests must be RED before implementation starts. Tests going GREEN = value delivered.
Every story follows the P2P arc: Pain (who hurts, measurable cost) → Performance Target (the number that proves it works) → Validated Outcome (the Tight Five dimension this serves).
| # | WHEN (Trigger + Precondition) | THEN (Exact Assertion) | ARTIFACT | Test Type | FORBIDDEN (Must not happen) | OUTCOME (Performance Number) |
|---|---|---|---|---|---|---|
| S1 | Operator runs npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 with valid NZBN API credentials configured | 100 business entity records inserted into PostgreSQL. Each record has: NZBN number (non-null), entity name, status, registration date, BIC codes. Duplicate NZBNs rejected on re-run (upsert, not duplicate). Run completes within 5 minutes. | BLOCKER — test path TBD by engineering | integration | Command returns success but inserts 0 records. Records inserted with null NZBN numbers. Re-run creates duplicate records instead of upserting. API rate limit hit with no retry/backoff. | 100 businesses ingested in <5 min vs manual NZBN lookup at ~3 min per entity (300 min saved) |
| S2 | 100 NZBN entities exist in DB AND operator runs Companies Office enrichment pipeline | Directors and shareholders linked to their NZBN entities. Each linked person has: name, role (director/shareholder), appointment date. Person records deduplicated across entities (same director at 2 companies = 1 person record, 2 role links). | BLOCKER — test path TBD by engineering | integration | Enrichment runs but creates zero person-entity links. Same person duplicated as separate records per company. Personal data stored without lawful_purpose flag. | People + roles enriched automatically vs manual Companies Office searches at ~5 min per entity |
| S3 | 50+ NZBN entities exist in DB AND Crawl4AI Docker container running AND operator triggers web enrichment | Crawl4AI extracts structured data from company websites found via NZBN entity URLs. Business model, services list, and industry classification written to entity record. source_authority set to 0.6 for web-sourced fields. Entities without discoverable websites marked as web_enrichment_failed (not silently skipped). | BLOCKER — test path TBD by engineering | integration | Enrichment silently skips entities with no website (no failure record). Web-sourced data given source_authority = 1.0 (same as government API). Crawl4AI container crash leaves pipeline in broken state with no cleanup. | 50%+ of entities classified by business model vs 0% classification today |
| S4 | Entity has data from 2+ sources (NZBN + Companies Office, or NZBN + web enrichment) | Trust score computed as weighted composite: completeness (0.25) + source_authority (0.25) + freshness (0.20) + cross_reference (0.20) + consistency (0.10). Score stored with provenance trail (which sources contributed, when). Government-only entity scores higher than web-only entity for same completeness. | BLOCKER — test path TBD by engineering | unit + integration | All entities receive the same trust score regardless of source quality. Entity with only web data scores equal to entity with government + web data. Trust score computed but provenance trail missing — can't explain why a score is low. | Every record has a 0-100 trust score with provenance vs current "hope the data is good" |
| S5 | 8 ETL pipelines configured AND operator navigates to pipeline dashboard | Dashboard renders all 8 pipelines with: name, last run timestamp, last run status (success/failed/running), record count. Operator can trigger dry-run or live-run from dashboard. Failed pipeline shows error summary without requiring log file access. | BLOCKER — test path TBD by engineering | e2e | Dashboard shows "0 pipelines" when 8 are configured. Run button triggers pipeline but dashboard doesn't update status until page refresh. Failed run shows generic "error" with no actionable detail. | Pipeline status visible in <10s vs SSH + grep logs at ~5 min per pipeline check |
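The S1 upsert guarantee (re-run updates in place, never duplicates, and null NZBNs are rejected) can be sketched as a pure function plus a re-run check. Names here (`BusinessEntity`, `upsertByNzbn`) are illustrative, not the real Drizzle schema; the actual test path is TBD per the table above.

```typescript
// Sketch of the S1 upsert contract: keyed by NZBN, second run must not grow
// the record count. Types and field names are assumptions for illustration.
interface BusinessEntity {
  nzbn: string;            // non-null by contract
  entityName: string;
  status: string;
  registrationDate: string;
  bicCodes: string[];
}

function upsertByNzbn(
  store: Map<string, BusinessEntity>,
  incoming: BusinessEntity[],
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const entity of incoming) {
    if (!entity.nzbn) continue; // reject null/empty NZBNs at the door
    if (store.has(entity.nzbn)) {
      updated += 1;            // a re-run hits this branch, not a duplicate
    } else {
      inserted += 1;
    }
    store.set(entity.nzbn, entity);
  }
  return { inserted, updated };
}

// Re-run with the same batch: record count must not grow.
const store = new Map<string, BusinessEntity>();
const batch: BusinessEntity[] = [
  { nzbn: "9429000000000", entityName: "Acme Ltd", status: "Registered", registrationDate: "2020-01-15", bicCodes: ["A011100"] },
];
const first = upsertByNzbn(store, batch);
const second = upsertByNzbn(store, batch);
```

The real implementation would be a PostgreSQL `INSERT ... ON CONFLICT (nzbn) DO UPDATE`; the in-memory map only expresses the assertion the test must make RED first.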
Tight Five coverage check:
| P Dimension | Stories covering it | Gap? |
|---|---|---|
| Pain (CRM has 0 NZ businesses, repos empty) | S1 (NZBN ingestion fills the void), S5 (visibility into pipeline health) | No |
| Demand (NZBN API free, 700K entities waiting) | S1 (government data), S2 (Companies Office directors) | No |
| Edge (trust scoring on every record) | S4 (composite trust score with provenance) | No |
| Trend (AI needs trusted data, not garbage) | S3 (web enrichment feeds AI classification), S4 (trust prevents confident-on-garbage) | No |
| Conversion (100 businesses, $0 cost) | S1 (100 entities target), S3 (50%+ classification rate) | No |
Build Contract
| # | Feature | Function | Outcome | Job | State |
|---|---|---|---|---|---|
| 1 | NZBN API wrapper | TypeScript client over NZBN REST API — search, retrieve, bulk | NZ businesses ingested from authoritative source | Lead generation | Gap |
| 2 | Companies Office integration | Directors, shareholders, filings via CO API | People + roles for every business entity | Contact enrichment | Gap |
| 3 | Web enrichment pipeline | Crawl4AI (Docker) extracts structured data from company websites | Business model, services, supply chain classified per entity | Business intelligence | Gap |
| 4 | Trust scoring engine | Composite score: completeness, authority, freshness, cross-ref, consistency | Every record has a 0-100 trust score with provenance | Data quality | Gap |
| 5 | ANZSIC classification | Map NZBN BIC codes + AI enrichment to ANZSIC06 industry codes | Every business classified by industry, model type, supply chain | Industry mapping | Gap |
| 6 | Scheduled extraction | Cron: monthly bulk refresh, weekly delta, on-demand enrichment triggers | Data stays fresh without manual intervention | Automation | Gap |
| 7 | Universal source connector | Connect files, APIs, databases, SaaS tools via configuration | No source is unsupported — connect anything | Data readiness | Partial |
| 8 | Security-first extraction | Validate paths, generate checksums, track lineage per record | Every byte has a provenance trail | Compliance | Built |
| 9 | Configuration-driven transforms | Map fields, clean data, infer types without writing code | Non-engineers can define data pipelines | Self-service | Built |
| 10 | Quality metrics engine | Score data quality per field and per source on every run | Bad data caught before it reaches AI | Data quality | Built |
| 11 | Type-safe loading | Validated schemas, transaction rollback, event emission | Load failures don't corrupt the database | Reliability | Built |
| 12 | ETL pipeline dashboard | 8 pipelines with run/dry-run in UI | Operators see pipeline status and trigger runs | Visibility | Live |
Principles
What truths constrain the design?
The Job
| Element | Detail |
|---|---|
| Situation | CRM and nowcasting systems have schemas and repos but zero data flowing through them. The intake valve doesn't exist. |
| Intention | Any data source — government API, web scrape, bulk file — connects through one unified pipeline framework with trust scoring at ingestion. |
| Obstacle | No extraction strategy. Repos are built but empty. Each new data source requires bespoke engineering. |
| Hardest Thing | Trust. Scraped data mixed with government data mixed with manual entry — without trust scoring, the AI makes confident decisions on garbage. |
Why Now
- CRM has 0 NZ businesses classified. Sales pipeline is empty because lead data doesn't exist.
- Nowcasting requires real-time business signals. No signals flowing = no predictions.
- NZBN API is free (700K+ NZ entities). Companies Office API is free (directors, shareholders). The authoritative data is sitting there uncollected.
- Crawl4AI is open-source (Apache 2.0, 58K+ GitHub stars). Self-hosted = $0.
- Every month without data acquisition is a month of manual work the platform should eliminate.
Design Constraints
| Constraint | Rationale |
|---|---|
| Free data sources first | NZBN + Companies Office + data.govt.nz before any paid service |
| Trust score on every record | No data enters repos without a composite trust score |
| Scheduled + triggered | Monthly bulk, weekly delta, on-demand enrichment — not just cron |
| TypeScript pipeline | Matches existing agent-etl-cli.ts patterns and Drizzle repos |
| NZ Privacy Act 2020 compliant | Business data freely; personal contact data with lawful purpose only |
| No LinkedIn scraping | Proxycurl sued Jan 2026. Use Companies Office for directors, not scraping |
Data Acquisition Architecture
Three-layer pipeline. Each layer has a different trust level.
```
Layer 1: STRUCTURED APIs (free, high trust)
  NZBN API + Companies Office API → PostgreSQL
  ↓ trust_score: source_authority = 1.0

Layer 2: AI ENRICHMENT (low cost, medium trust)
  Crawl4AI (Docker) → company websites → structured extraction → PostgreSQL
  ↓ trust_score: source_authority = 0.6

Layer 3: VALIDATION & SCORING (compute cost only)
  LLM classification + cross-reference → composite trust score → PostgreSQL
  ↓ trust_score: composite of all 5 dimensions
```
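The per-layer authority values can live in one constant map so that a web-sourced field can never inherit government-level authority by accident. Type and field names below are illustrative, not the real schema.

```typescript
// Illustrative mapping of ingestion layer to source_authority, matching the
// three-layer architecture above. Per-field provenance keeps scores explainable.
type SourceLayer = "government_api" | "web_enrichment" | "directory";

const SOURCE_AUTHORITY: Record<SourceLayer, number> = {
  government_api: 1.0, // Layer 1: NZBN / Companies Office
  web_enrichment: 0.6, // Layer 2: Crawl4AI extraction
  directory: 0.4,      // lower-trust aggregators
};

interface FieldProvenance {
  field: string;         // e.g. "businessModel"
  source: SourceLayer;
  extractedAt: string;   // ISO timestamp, feeds freshness decay
}

// Authority is looked up, never hand-set per record (S3's FORBIDDEN case).
function authorityFor(p: FieldProvenance): number {
  return SOURCE_AUTHORITY[p.source];
}
```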
Trust Score Formula
```
trust_score = (
  completeness     × 0.25 +  # % fields populated
  source_authority × 0.25 +  # govt=1.0, website=0.6, directory=0.4
  freshness        × 0.20 +  # decay from last_verified_at
  cross_reference  × 0.20 +  # confirmed by N sources
  consistency      × 0.10    # no contradictions
) × 100
```

Thresholds: <40 = Low, 40-70 = Medium, >70 = High
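A minimal sketch of the formula as a pure function, with each dimension normalized to [0, 1]. Interface and function names are illustrative; the real engine must also persist the provenance trail (S4).

```typescript
// Direct transcription of the trust score formula. Inputs in [0, 1], output 0-100.
interface TrustDimensions {
  completeness: number;    // % fields populated
  sourceAuthority: number; // govt=1.0, website=0.6, directory=0.4
  freshness: number;       // decay from last_verified_at
  crossReference: number;  // confirmed by N sources
  consistency: number;     // no contradictions
}

function trustScore(d: TrustDimensions): number {
  const raw =
    d.completeness * 0.25 +
    d.sourceAuthority * 0.25 +
    d.freshness * 0.2 +
    d.crossReference * 0.2 +
    d.consistency * 0.1;
  return Math.round(raw * 100);
}

type TrustBand = "Low" | "Medium" | "High";

function trustBand(score: number): TrustBand {
  if (score < 40) return "Low";
  if (score <= 70) return "Medium";
  return "High";
}
```

Note the S4 invariant falls out of the weights: two entities with identical completeness differ by up to 10 points on source authority alone (0.4 × 0.25 × 100), so a government-only record always outranks a web-only one.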
NZ Data Sources
| Source | Data | Cost | Trust | Access |
|---|---|---|---|---|
| NZBN API | Entity name, status, trading names, addresses, industry codes, contacts | Free | High | REST API (subscription key) |
| NZBN Bulk Data | Complete business dataset (700K+) | Free | High | JSON/CSV monthly dumps |
| Companies Office API | Company registration, directors, shareholders, filings | Free | High | REST API |
| data.govt.nz | ANZSIC06 classification tables | Free | High | CSV/JSON download |
| Crawl4AI | Structured extraction from company websites | Free (self-hosted) | Medium | Docker, Apache 2.0 |
| Firecrawl | Fallback scraping (95.3% reliability) | 500 pages free/mo | Medium | API + MCP server |
Tool Selection
| Need | Tool | Why |
|---|---|---|
| NZ business data | NZBN API + Bulk Data | Free, authoritative, 700K+ entities |
| Industry classification | ANZSIC06 tables + NZBN BIC codes | Free, standardized |
| People + roles | Companies Office API (directors) | Free, public record |
| Website enrichment | Crawl4AI (Docker, self-hosted) | Free, 89.7% success, LLM extraction |
| Fallback scraping | Firecrawl free tier (500 pages/mo) | 95.3% reliability, MCP server for agent integration |
| Future agent integration | Firecrawl MCP server | Agent-triggered enrichment |
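A hedged sketch of the NZBN search client. The base URL, API version, query parameter names, and the subscription-key header are assumptions based on api.business.govt.nz gateway conventions — verify all of them against the current NZBN API docs before implementing.

```typescript
// Assumed endpoint shape -- confirm version and parameter names in the NZBN API docs.
const NZBN_BASE = "https://api.business.govt.nz/gateway/nzbn/v5"; // version is an assumption

function searchUrl(params: { searchTerm?: string; page?: number; pageSize?: number }): string {
  const url = new URL(`${NZBN_BASE}/entities`);
  if (params.searchTerm) url.searchParams.set("search-term", params.searchTerm); // assumed param name
  url.searchParams.set("page", String(params.page ?? 0));
  url.searchParams.set("page-size", String(params.pageSize ?? 100));
  return url.toString();
}

async function searchEntities(apiKey: string, searchTerm: string): Promise<unknown> {
  const res = await fetch(searchUrl({ searchTerm }), {
    headers: { "Ocp-Apim-Subscription-Key": apiKey }, // assumed header for the gateway
  });
  if (!res.ok) throw new Error(`NZBN search failed: ${res.status}`);
  return res.json();
}
```

Keeping URL construction pure (`searchUrl`) lets the S1 spec test assert request shape without hitting the live API.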
Performance
How do we know it's working?
Priority Score
PRIORITY = Pain × Demand × Edge × Trend × Conversion
| Dimension | Score (1-5) | Evidence |
|---|---|---|
| Pain | 5 | CRM has 0 classified NZ businesses. Nowcast has 0 signals. Every data-consuming PRD blocked. Empty repos = zero value from existing schemas. |
| Demand | 4 | Internal engineering dependency for Sales CRM, Sales Dev, Nowcast, Business Idea Generator. No external demand yet (max 4 per scoring guide). |
| Edge | 4 | Domain-specific schemas (73 types) + NZBN/CO APIs (free, most competitors pay for data) + existing Drizzle repos ready to receive. 6+ months to replicate. |
| Trend | 5 | AI data readiness is structural shift. 73% of AI projects fail on data quality (Gartner). MCP scraping tools exploding (9 servers benchmarked 2026). Crawl4AI 58K stars. |
| Conversion | 3 | Clear internal path. External: $2K-50K/month tiers plausible but untested. No external paying customers yet. |
| Composite | 1200 | 5 × 4 × 4 × 5 × 3 |
Quality Targets
| Metric | Target | Now |
|---|---|---|
| NZ businesses ingested | 100 (first batch) | 0 |
| Trust score coverage | 100% of records | 0% |
| Pipeline execution time | < 30 min for 100 entities | N/A |
| Query latency on loaded data | < 2s | N/A |
| Freshness: time since last refresh | < 7 days | N/A |
Kill signal: If data loads but nobody queries it within 14 days, the pipeline is extraction theater. Track: queries per entity per week. If zero after 2 weeks, the downstream consumers don't need this data.
Platform
What do we control?
Current State
| Component | Built | Wired | Working | Notes |
|---|---|---|---|---|
| PostgreSQL schemas (73 types) | Yes | Yes | Yes | Drizzle repos ready |
| agent-etl-cli.ts | Yes | Yes | Yes | DDD load order, dry-run, allowSafeSource |
| ETL pipeline dashboard | Yes | Yes | Yes | 8 pipelines, run/dry-run |
| File source connectors | Yes | Yes | Yes | 90% complete |
| API source connectors | Yes | Yes | Yes | 85% complete |
| Database connectors | Yes | Yes | Partial | 75% complete |
| NZBN API wrapper | No | No | No | Gap — Sprint 0 |
| Companies Office integration | No | No | No | Gap — Sprint 0 |
| Crawl4AI enrichment | No | No | No | Gap — Sprint 1 |
| Trust scoring engine | No | No | No | Gap — Sprint 1 |
| ANZSIC classification | No | No | No | Gap — Sprint 1 |
| Scheduled extraction | No | No | No | Gap — Sprint 2 |
Build Ratio
~60% composition (existing ETL infra, Drizzle repos, CLI patterns), ~40% new code (NZBN wrapper, trust scoring, Crawl4AI integration, scheduling).
Protocols
How do we coordinate?
Build Order
| Sprint | Features | What | Effort | Acceptance |
|---|---|---|---|---|
| 0 | #1, #2, #5 | NZBN API wrapper + Companies Office + ANZSIC tables | 1 week | 100 NZ businesses in PostgreSQL with industry codes + directors |
| 1 | #3, #4 | Crawl4AI enrichment + trust scoring | 1 week | 100 entities enriched from websites, all records trust-scored |
| 2 | #6 | Scheduled extraction (cron + triggers) | 3 days | Monthly bulk runs unattended, weekly delta detects changes |
| 3 | — | First downstream query: Sales Dev uses business data for leads | 2 days | Sales Dev agent queries 100 businesses, filters by industry + trust score |
Commissioning
| # | Feature | Install | Test | Operational | Optimize |
|---|---|---|---|---|---|
| 1 | NZBN API wrapper | — | — | — | — |
| 2 | Companies Office integration | — | — | — | — |
| 3 | Web enrichment pipeline | — | — | — | — |
| 4 | Trust scoring engine | — | — | — | — |
| 5 | ANZSIC classification | — | — | — | — |
| 6 | Scheduled extraction | — | — | — | — |
| 7 | Universal source connector | L2 | L2 | — | — |
| 8 | Security-first extraction | L3 | L3 | — | — |
| 9 | Config-driven transforms | L3 | L3 | — | — |
| 10 | Quality metrics engine | L2 | L2 | — | — |
| 11 | Type-safe loading | L3 | L3 | — | — |
| 12 | ETL pipeline dashboard | L2 | L2 | L2 | — |
Scheduling Pattern
```
CRON  (monthly):   NZBN bulk refresh → full re-score
CRON  (weekly):    Companies Office change events API → incremental update
EVENT (on-demand): Website enrichment for new/priority entities → Crawl4AI
EVENT (trigger):   Trust score drops below threshold → re-verify
```
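The pattern above can be held as one config table so cron entries and event triggers share a single source of truth, which also serves the rollback story (disable a pipeline by removing its row). Expressions and pipeline names below are illustrative.

```typescript
// Scheduling pattern as data. Cron expressions here are illustrative defaults,
// not agreed run times.
type Trigger =
  | { kind: "cron"; expression: string }
  | { kind: "event"; name: string };

interface ScheduledPipeline {
  pipeline: string;
  trigger: Trigger;
  action: string;
}

const SCHEDULE: ScheduledPipeline[] = [
  { pipeline: "nzbn-bulk", trigger: { kind: "cron", expression: "0 3 1 * *" }, action: "full refresh + re-score" },
  { pipeline: "co-delta", trigger: { kind: "cron", expression: "0 4 * * 1" }, action: "incremental update" },
  { pipeline: "web-enrich", trigger: { kind: "event", name: "entity.priority" }, action: "Crawl4AI enrichment" },
  { pipeline: "trust-reverify", trigger: { kind: "event", name: "trust.below_threshold" }, action: "re-verify sources" },
];

// Rollback: a pipeline disappears from cron by being filtered out of this table.
function cronEntries(schedule: ScheduledPipeline[]): string[] {
  return schedule
    .filter((s) => s.trigger.kind === "cron")
    .map((s) => s.pipeline);
}
```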
Agent-Facing Spec
Commands:
```sh
# NZBN ingest
npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --dry-run
npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --allowSafeSource

# Companies Office enrichment
npx tsx tools/scripts/etl/companies-office/enrich.ts --source=nzbn --dry-run

# Web enrichment
npx tsx tools/scripts/etl/web-enrich/crawl.ts --source=nzbn --limit=100 --dry-run

# Trust scoring
npx tsx tools/scripts/etl/trust/score.ts --rescore-all
```
Boundaries:
| Action | Rule |
|---|---|
| Always | Use --dry-run first, respect rate limits, log provenance |
| Ask first | Bulk operations > 1000 entities, schema changes |
| Never | Scrape LinkedIn, store personal contact details without lawful basis, bypass robots.txt |
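The "respect rate limits" rule (and S1's forbidden "rate limit hit with no retry/backoff") implies a shared retry wrapper. A minimal exponential-backoff sketch, with illustrative defaults rather than tuned values:

```typescript
// Exponential backoff: retries when the caller's predicate says the error is
// retryable (e.g. HTTP 429); delay doubles per attempt. Defaults are illustrative.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay sequence for maxAttempts tries: base, 2*base, 4*base, ...
function backoffDelays(maxAttempts: number, baseDelayMs: number): number[] {
  return Array.from({ length: maxAttempts - 1 }, (_, i) => baseDelayMs * 2 ** i);
}

async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts || !isRetryable(err)) throw err;
      await sleep(baseDelayMs * 2 ** attempt); // 500ms, 1s, 2s, 4s...
    }
  }
}
```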
Test Contract:
| # | Feature | Test File | Assertion |
|---|---|---|---|
| 1 | NZBN API wrapper | tools/scripts/etl/nzbn/ingest.spec.ts | Returns 100 entities with valid NZBN numbers and status |
| 2 | Companies Office integration | tools/scripts/etl/companies-office/enrich.spec.ts | Each entity gets directors array with name + appointment date |
| 3 | Web enrichment pipeline | tools/scripts/etl/web-enrich/crawl.spec.ts | Structured extraction returns business model + services fields |
| 4 | Trust scoring engine | tools/scripts/etl/trust/score.spec.ts | Every record gets 0-100 score, no nulls, thresholds enforced |
| 5 | ANZSIC classification | tools/scripts/etl/nzbn/anzsic.spec.ts | BIC codes mapped to ANZSIC06, unmapped codes flagged |
| 6 | Scheduled extraction | tools/scripts/etl/scheduler/cron.spec.ts | Monthly bulk and weekly delta fire on schedule, idempotent |
| 7 | Universal source connector | libs/infrastructure/data-pipelines/src/connectors.spec.ts | File, API, and DB sources connect and return typed records |
| 8 | Security-first extraction | libs/infrastructure/data-pipelines/src/security.spec.ts | Path traversal blocked, checksums match, lineage recorded |
| 9 | Config-driven transforms | libs/infrastructure/data-pipelines/src/transforms.spec.ts | Field mapping produces expected output without code changes |
| 10 | Quality metrics engine | libs/infrastructure/data-pipelines/src/quality.spec.ts | Bad fields flagged, quality score ≥ threshold to proceed |
| 11 | Type-safe loading | libs/infrastructure/data-pipelines/src/loader.spec.ts | Schema validation rejects bad records, transaction rolls back |
| 12 | ETL pipeline dashboard | apps/stackmates/e2e/etl-dashboard.spec.ts | 8 pipelines visible, run/dry-run buttons functional |
Legal Compliance
| Data Type | NZ Privacy Act 2020 | Action |
|---|---|---|
| Business name, address, NZBN, industry | NOT personal information | Collect freely from public sources |
| Director/officer names + roles | Public record (Companies Office) | Low risk — collect with attribution |
| Contact email/phone of individuals | Personal information | Must have lawful purpose, inform individual |
| Scraped personal data at scale | Triggers Privacy Act obligations | Minimize, document lawful basis, implement retention limits |
Players
Who creates harmony?
Demand-Side Jobs
Job 1: Lead Generation Data
Situation: Sales Dev agent needs qualified NZ businesses to prospect, but CRM has zero classified entities.
| Element | Detail |
|---|---|
| Struggling moment | Agent opens CRM, finds 0 businesses matching target industry/region criteria |
| Current workaround | Manual Google searches, copy-paste from directories, no systematic classification |
| What progress looks like | 100 NZ businesses classified by industry + model + supply chain, with directors and trust scores, queryable in PostgreSQL |
| Hidden objection | "Scraped data is always garbage — we'll spend more time cleaning than using" |
| Switch trigger | First query returns 10 qualified leads with trust score > 70 in under 2 seconds |
Features that serve this job: #1, #2, #3, #4, #5
Job 2: Nowcast Signal Feed
Situation: Pipeline Nowcast needs real-time business signals to predict variance, but no signals are flowing.
| Element | Detail |
|---|---|
| Struggling moment | Nowcast model has no input — predictions impossible without data |
| Current workaround | Manual assessment, gut feel, delayed reaction |
| What progress looks like | Weekly delta feed of NZ business changes (new registrations, status changes, director moves) flowing to Nowcast |
| Hidden objection | "By the time scraped data arrives it's already stale" |
| Switch trigger | NZBN change events API delivers same-day updates for monitored entities |
Features that serve this job: #1, #6
Job 3: AI-Ready Data for Ventures
Situation: Every venture on the platform needs clean, structured data but each builds its own extraction pipeline.
| Element | Detail |
|---|---|
| Struggling moment | New venture launches, immediately blocked by "where do we get data?" |
| Current workaround | Each venture writes bespoke scripts, duplicating 80% of the work |
| What progress looks like | Any venture queries the unified data layer, gets trusted records in < 2s |
| Hidden objection | "Universal tools never work for my specific domain" |
| Switch trigger | First venture (Stackmates) uses ETL-loaded data without writing any extraction code |
Features that serve this job: #7, #8, #9, #10, #11
Relationship to Other PRDs
| PRD | Relationship | Data Flow |
|---|---|---|
| Sales CRM & RFP | Peer (consumer) | ETL → business profiles + contacts → CRM entities |
| Sales Dev Agent | Peer (consumer) | ETL → qualified leads → prospecting pipeline |
| Pipeline Nowcast | Peer (consumer) | ETL → business signals → variance prediction |
| Business Idea Generator | Peer (consumer) | ETL → market data → idea validation |
| Agent Platform | Peer (infra) | Agent triggers ETL enrichment via MCP/A2A |
| Data Interface | Child (access layer) | ETL loads → Data Interface queries |
Context
- PRD Index — ETL Data Tool
- Prompt Deck — 5-card pitch
- Pictures — Pre-flight maps