ETL Data Tool Spec
How do we turn empty repos into trusted data without hiring a data team?
Intent Contract
| Dimension | Statement |
|---|---|
| Objective | Unified data acquisition pipeline that connects government APIs, web enrichment, and trust scoring — so CRM and nowcasting have real data without bespoke engineering per source. |
| Outcomes | 1. 100 NZ businesses ingested from NZBN with trust scores within 7 days of deployment. 2. Companies Office directors linked to NZBN entities automatically. 3. Web enrichment adds business model classification to 50%+ of ingested entities. |
| Health Metrics | Pipeline run success rate >95%. Zero data corruption on failed runs (transaction rollback). Trust score computation completes for every ingested record. |
| Counter-metrics | Pipeline must not slow CRM page loads (async ingestion only). Trust scores must not create false confidence — low-evidence records must score low. Enrichment must not scrape personal data (NZ Privacy Act 2020). |
| Constraints | Hard: Free data sources first (NZBN, Companies Office, data.govt.nz). Hard: No LinkedIn scraping (Proxycurl sued Jan 2026). Hard: Trust score on every record — no data enters repos without provenance. Steering: TypeScript pipeline matching existing agent-etl-cli.ts patterns. |
| Autonomy | Allowed: API client implementation, extraction scheduling, trust score formula tuning. Escalate: New paid data sources, schema changes to business entity tables. Never: Store personal contact data without lawful purpose, bypass trust scoring, delete source records. |
| Blast Radius | PostgreSQL business entity tables (new INSERTs). CRM contact enrichment (new data flowing in). Scheduled jobs (new cron entries). Docker (Crawl4AI container). |
| Rollback | Each pipeline is independent — disable by removing cron entry. Ingested records can be soft-deleted by source batch ID. Trust scores can be recomputed from raw data. Crawl4AI container can be stopped without affecting other pipelines. |
| Stop Rules | Complete when: 100 businesses ingested with trust scores, Companies Office directors linked, web enrichment classifying 50%+ of entities. Halt when: NZBN API access revoked, or trust scores show no discrimination (all records score the same). |
Story Contract
Stories are test contracts. Each row converts to test files. Tests must be RED before implementation starts. Tests going GREEN = value delivered.
Every story follows the P2P arc: Pain (who hurts, measurable cost) → Performance Target (the number that proves it works) → Validated Outcome (the Tight Five dimension this serves).
| # | WHEN (Trigger + Precondition) | THEN (Exact Assertion) | ARTIFACT | Test Type | FORBIDDEN (Must not happen) | OUTCOME (Performance Number) |
|---|---|---|---|---|---|---|
| S1 | Operator runs npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 with valid NZBN API credentials configured | 100 business entity records inserted into PostgreSQL. Each record has: NZBN number (non-null), entity name, status, registration date, BIC codes. Duplicate NZBNs rejected on re-run (upsert, not duplicate). Run completes within 5 minutes. | BLOCKER — test path TBD by engineering | integration | Command returns success but inserts 0 records. Records inserted with null NZBN numbers. Re-run creates duplicate records instead of upserting. API rate limit hit with no retry/backoff. | 100 businesses ingested in <5 min vs manual NZBN lookup at ~3 min per entity (300 min saved) |
| S2 | 100 NZBN entities exist in DB AND operator runs Companies Office enrichment pipeline | Directors and shareholders linked to their NZBN entities. Each linked person has: name, role (director/shareholder), appointment date. Person records deduplicated across entities (same director at 2 companies = 1 person record, 2 role links). | BLOCKER — test path TBD by engineering | integration | Enrichment runs but creates zero person-entity links. Same person duplicated as separate records per company. Personal data stored without lawful_purpose flag. | People + roles enriched automatically vs manual Companies Office searches at ~5 min per entity |
| S3 | 50+ NZBN entities exist in DB AND Crawl4AI Docker container running AND operator triggers web enrichment | Crawl4AI extracts structured data from company websites found via NZBN entity URLs. Business model, services list, and industry classification written to entity record. source_authority set to 0.6 for web-sourced fields. Entities without discoverable websites marked as web_enrichment_failed (not silently skipped). | BLOCKER — test path TBD by engineering | integration | Enrichment silently skips entities with no website (no failure record). Web-sourced data given source_authority = 1.0 (same as government API). Crawl4AI container crash leaves pipeline in broken state with no cleanup. | 50%+ of entities classified by business model vs 0% classification today |
| S4 | Entity has data from 2+ sources (NZBN + Companies Office, or NZBN + web enrichment) | Trust score computed as weighted composite: completeness (0.25) + source_authority (0.25) + freshness (0.20) + cross_reference (0.20) + consistency (0.10). Score stored with provenance trail (which sources contributed, when). Government-only entity scores higher than web-only entity for same completeness. | BLOCKER — test path TBD by engineering | unit + integration | All entities receive the same trust score regardless of source quality. Entity with only web data scores equal to entity with government + web data. Trust score computed but provenance trail missing — can't explain why a score is low. | Every record has a 0-100 trust score with provenance vs current "hope the data is good" |
| S5 | 8 ETL pipelines configured AND operator navigates to pipeline dashboard | Dashboard renders all 8 pipelines with: name, last run timestamp, last run status (success/failed/running), record count. Operator can trigger dry-run or live-run from dashboard. Failed pipeline shows error summary without requiring log file access. | BLOCKER — test path TBD by engineering | e2e | Dashboard shows "0 pipelines" when 8 are configured. Run button triggers pipeline but dashboard doesn't update status until page refresh. Failed run shows generic "error" with no actionable detail. | Pipeline status visible in <10s vs SSH + grep logs at ~5 min per pipeline check |
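The S1 upsert guarantee (re-run updates in place, never duplicates, and null NZBNs are rejected) can be sketched as a pure function plus a re-run check. Names here (`BusinessEntity`, `upsertByNzbn`) are illustrative, not the real Drizzle schema; the actual test path is TBD per the table above.

```typescript
// Sketch of the S1 upsert contract: keyed by NZBN, second run must not grow
// the record count. Types and field names are assumptions for illustration.
interface BusinessEntity {
  nzbn: string;            // non-null by contract
  entityName: string;
  status: string;
  registrationDate: string;
  bicCodes: string[];
}

function upsertByNzbn(
  store: Map<string, BusinessEntity>,
  incoming: BusinessEntity[],
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const entity of incoming) {
    if (!entity.nzbn) continue; // reject null/empty NZBNs at the door
    if (store.has(entity.nzbn)) {
      updated += 1;            // a re-run hits this branch, not a duplicate
    } else {
      inserted += 1;
    }
    store.set(entity.nzbn, entity);
  }
  return { inserted, updated };
}

// Re-run with the same batch: record count must not grow.
const store = new Map<string, BusinessEntity>();
const batch: BusinessEntity[] = [
  { nzbn: "9429000000000", entityName: "Acme Ltd", status: "Registered", registrationDate: "2020-01-15", bicCodes: ["A011100"] },
];
const first = upsertByNzbn(store, batch);
const second = upsertByNzbn(store, batch);
```

The real implementation would be a PostgreSQL `INSERT ... ON CONFLICT (nzbn) DO UPDATE`; the in-memory map only expresses the assertion the test must make RED first.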
Tight Five coverage check:
| P Dimension | Stories covering it | Gap? |
|---|---|---|
| Pain (CRM has 0 NZ businesses, repos empty) | S1 (NZBN ingestion fills the void), S5 (visibility into pipeline health) | No |
| Demand (NZBN API free, 700K entities waiting) | S1 (government data), S2 (Companies Office directors) | No |
| Edge (trust scoring on every record) | S4 (composite trust score with provenance) | No |
| Trend (AI needs trusted data, not garbage) | S3 (web enrichment feeds AI classification), S4 (trust prevents confident-on-garbage) | No |
| Conversion (100 businesses, $0 cost) | S1 (100 entities target), S3 (50%+ classification rate) | No |
Build Contract
| # | Feature | Function | Outcome | Job | State |
|---|---|---|---|---|---|
| 1 | NZBN API wrapper | TypeScript client over NZBN REST API — search, retrieve, bulk | NZ businesses ingested from authoritative source | Lead generation | Gap |
| 2 | Companies Office integration | Directors, shareholders, filings via CO API | People + roles for every business entity | Contact enrichment | Gap |
| 3 | Web enrichment pipeline | Crawl4AI (Docker) extracts structured data from company websites | Business model, services, supply chain classified per entity | Business intelligence | Gap |
| 4 | Trust scoring engine | Composite score: completeness, authority, freshness, cross-ref, consistency | Every record has a 0-100 trust score with provenance | Data quality | Gap |
| 5 | ANZSIC classification | Map NZBN BIC codes + AI enrichment to ANZSIC06 industry codes | Every business classified by industry, model type, supply chain | Industry mapping | Gap |
| 6 | Scheduled extraction | Cron: monthly bulk refresh, weekly delta, on-demand enrichment triggers | Data stays fresh without manual intervention | Automation | Gap |
| 7 | Universal source connector | Connect files, APIs, databases, SaaS tools via configuration | No source is unsupported — connect anything | Data readiness | Partial |
| 8 | Security-first extraction | Validate paths, generate checksums, track lineage per record | Every byte has a provenance trail | Compliance | Built |
| 9 | Configuration-driven transforms | Map fields, clean data, infer types without writing code | Non-engineers can define data pipelines | Self-service | Built |
| 10 | Quality metrics engine | Score data quality per field and per source on every run | Bad data caught before it reaches AI | Data quality | Built |
| 11 | Type-safe loading | Validated schemas, transaction rollback, event emission | Load failures don't corrupt the database | Reliability | Built |
| 12 | ETL pipeline dashboard | 8 pipelines with run/dry-run in UI | Operators see pipeline status and trigger runs | Visibility | Live |
Principles
What truths constrain the design?
The Job
| Element | Detail |
|---|---|
| Situation | CRM and nowcasting systems have schemas and repos but zero data flowing through them. The intake valve doesn't exist. |
| Intention | Any data source — government API, web scrape, bulk file — connects through one unified pipeline framework with trust scoring at ingestion. |
| Obstacle | No extraction strategy. Repos are built but empty. Each new data source requires bespoke engineering. |
| Hardest Thing | Trust. Scraped data mixed with government data mixed with manual entry — without trust scoring, the AI makes confident decisions on garbage. |
Why Now
- CRM has 0 NZ businesses classified. Sales pipeline is empty because lead data doesn't exist.
- Nowcasting requires real-time business signals. No signals flowing = no predictions.
- NZBN API is free (700K+ NZ entities). Companies Office API is free (directors, shareholders). The authoritative data is sitting there uncollected.
- Crawl4AI is open-source (Apache 2.0, 58K+ GitHub stars). Self-hosted = $0.
- Every month without data acquisition is a month of manual work the platform should eliminate.
Design Constraints
| Constraint | Rationale |
|---|---|
| Free data sources first | NZBN + Companies Office + data.govt.nz before any paid service |
| Trust score on every record | No data enters repos without a composite trust score |
| Scheduled + triggered | Monthly bulk, weekly delta, on-demand enrichment — not just cron |
| TypeScript pipeline | Matches existing agent-etl-cli.ts patterns and Drizzle repos |
| NZ Privacy Act 2020 compliant | Business data freely; personal contact data with lawful purpose only |
| No LinkedIn scraping | Proxycurl sued Jan 2026. Use Companies Office for directors, not scraping |
Data Acquisition Architecture
Three-layer pipeline. Each layer has a different trust level.
```
Layer 1: STRUCTURED APIs (free, high trust)
  NZBN API + Companies Office API → PostgreSQL
  ↓ trust_score: source_authority = 1.0

Layer 2: AI ENRICHMENT (low cost, medium trust)
  Crawl4AI (Docker) → company websites → structured extraction → PostgreSQL
  ↓ trust_score: source_authority = 0.6

Layer 3: VALIDATION & SCORING (compute cost only)
  LLM classification + cross-reference → composite trust score → PostgreSQL
  ↓ trust_score: composite of all 5 dimensions
```
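The per-layer authority values can live in one constant map so that a web-sourced field can never inherit government-level authority by accident. Type and field names below are illustrative, not the real schema.

```typescript
// Illustrative mapping of ingestion layer to source_authority, matching the
// three-layer architecture above. Per-field provenance keeps scores explainable.
type SourceLayer = "government_api" | "web_enrichment" | "directory";

const SOURCE_AUTHORITY: Record<SourceLayer, number> = {
  government_api: 1.0, // Layer 1: NZBN / Companies Office
  web_enrichment: 0.6, // Layer 2: Crawl4AI extraction
  directory: 0.4,      // lower-trust aggregators
};

interface FieldProvenance {
  field: string;         // e.g. "businessModel"
  source: SourceLayer;
  extractedAt: string;   // ISO timestamp, feeds freshness decay
}

// Authority is looked up, never hand-set per record (S3's FORBIDDEN case).
function authorityFor(p: FieldProvenance): number {
  return SOURCE_AUTHORITY[p.source];
}
```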
Trust Score Formula
```
trust_score = (
  completeness     × 0.25 +  # % fields populated
  source_authority × 0.25 +  # govt=1.0, website=0.6, directory=0.4
  freshness        × 0.20 +  # decay from last_verified_at
  cross_reference  × 0.20 +  # confirmed by N sources
  consistency      × 0.10    # no contradictions
) × 100
```

Thresholds: <40 = Low, 40-70 = Medium, >70 = High
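A minimal sketch of the formula as a pure function, with each dimension normalized to [0, 1]. Interface and function names are illustrative; the real engine must also persist the provenance trail (S4).

```typescript
// Direct transcription of the trust score formula. Inputs in [0, 1], output 0-100.
interface TrustDimensions {
  completeness: number;    // % fields populated
  sourceAuthority: number; // govt=1.0, website=0.6, directory=0.4
  freshness: number;       // decay from last_verified_at
  crossReference: number;  // confirmed by N sources
  consistency: number;     // no contradictions
}

function trustScore(d: TrustDimensions): number {
  const raw =
    d.completeness * 0.25 +
    d.sourceAuthority * 0.25 +
    d.freshness * 0.2 +
    d.crossReference * 0.2 +
    d.consistency * 0.1;
  return Math.round(raw * 100);
}

type TrustBand = "Low" | "Medium" | "High";

function trustBand(score: number): TrustBand {
  if (score < 40) return "Low";
  if (score <= 70) return "Medium";
  return "High";
}
```

Note the S4 invariant falls out of the weights: two entities with identical completeness differ by up to 10 points on source authority alone (0.4 × 0.25 × 100), so a government-only record always outranks a web-only one.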
NZ Data Sources
| Source | Data | Cost | Trust | Access |
|---|---|---|---|---|
| NZBN API | Entity name, status, trading names, addresses, industry codes, contacts | Free | High | REST API (subscription key) |
| NZBN Bulk Data | Complete business dataset (700K+) | Free | High | JSON/CSV monthly dumps |
| Companies Office API | Company registration, directors, shareholders, filings | Free | High | REST API |
| data.govt.nz | ANZSIC06 classification tables | Free | High | CSV/JSON download |
| Crawl4AI | Structured extraction from company websites | Free (self-hosted) | Medium | Docker, Apache 2.0 |
| Firecrawl | Fallback scraping (95.3% reliability) | 500 pages free/mo | Medium | API + MCP server |
Tool Selection
| Need | Tool | Why |
|---|---|---|
| NZ business data | NZBN API + Bulk Data | Free, authoritative, 700K+ entities |
| Industry classification | ANZSIC06 tables + NZBN BIC codes | Free, standardized |
| People + roles | Companies Office API (directors) | Free, public record |
| Website enrichment | Crawl4AI (Docker, self-hosted) | Free, 89.7% success, LLM extraction |
| Fallback scraping | Firecrawl free tier (500 pages/mo) | 95.3% reliability, MCP server for agent integration |
| Future agent integration | Firecrawl MCP server | Agent-triggered enrichment |
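A hedged sketch of the NZBN search client. The base URL, API version, query parameter names, and the subscription-key header are assumptions based on api.business.govt.nz gateway conventions — verify all of them against the current NZBN API docs before implementing.

```typescript
// Assumed endpoint shape -- confirm version and parameter names in the NZBN API docs.
const NZBN_BASE = "https://api.business.govt.nz/gateway/nzbn/v5"; // version is an assumption

function searchUrl(params: { searchTerm?: string; page?: number; pageSize?: number }): string {
  const url = new URL(`${NZBN_BASE}/entities`);
  if (params.searchTerm) url.searchParams.set("search-term", params.searchTerm); // assumed param name
  url.searchParams.set("page", String(params.page ?? 0));
  url.searchParams.set("page-size", String(params.pageSize ?? 100));
  return url.toString();
}

async function searchEntities(apiKey: string, searchTerm: string): Promise<unknown> {
  const res = await fetch(searchUrl({ searchTerm }), {
    headers: { "Ocp-Apim-Subscription-Key": apiKey }, // assumed header for the gateway
  });
  if (!res.ok) throw new Error(`NZBN search failed: ${res.status}`);
  return res.json();
}
```

Keeping URL construction pure (`searchUrl`) lets the S1 spec test assert request shape without hitting the live API.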
Performance
How do we know it's working?
Priority Score
PRIORITY = Pain × Demand × Edge × Trend × Conversion
| Dimension | Score (1-5) | Evidence |
|---|---|---|
| Pain | 5 | CRM has 0 classified NZ businesses. Nowcast has 0 signals. Every data-consuming PRD blocked. Empty repos = zero value from existing schemas. |
| Demand | 4 | Internal engineering dependency for Sales CRM, Sales Dev, Nowcast, Business Idea Generator. No external demand yet (max 4 per scoring guide). |
| Edge | 4 | Domain-specific schemas (73 types) + NZBN/CO APIs (free, most competitors pay for data) + existing Drizzle repos ready to receive. 6+ months to replicate. |
| Trend | 5 | AI data readiness is structural shift. 73% of AI projects fail on data quality (Gartner). MCP scraping tools exploding (9 servers benchmarked 2026). Crawl4AI 58K stars. |
| Conversion | 3 | Clear internal path. External: $2K-50K/month tiers plausible but untested. No external paying customers yet. |
| Composite | 1200 | 5 × 4 × 4 × 5 × 3 |
Quality Targets
| Metric | Target | Now |
|---|---|---|
| NZ businesses ingested | 100 (first batch) | 0 |
| Trust score coverage | 100% of records | 0% |
| Pipeline execution time | < 30 min for 100 entities | N/A |
| Query latency on loaded data | < 2s | N/A |
| Freshness: time since last refresh | < 7 days | N/A |
Kill signal: If data loads but nobody queries it within 14 days, the pipeline is extraction theater. Track: queries per entity per week. If zero after 2 weeks, the downstream consumers don't need this data.
Platform
What do we control?
Current State
| Component | Built | Wired | Working | Notes |
|---|---|---|---|---|
| PostgreSQL schemas (73 types) | Yes | Yes | Yes | Drizzle repos ready |
| agent-etl-cli.ts | Yes | Yes | Yes | DDD load order, dry-run, allowSafeSource |
| ETL pipeline dashboard | Yes | Yes | Yes | 8 pipelines, run/dry-run |
| File source connectors | Yes | Yes | Yes | 90% complete |
| API source connectors | Yes | Yes | Yes | 85% complete |
| Database connectors | Yes | Yes | Partial | 75% complete |
| NZBN API wrapper | No | No | No | Gap — Sprint 0 |
| Companies Office integration | No | No | No | Gap — Sprint 0 |
| Crawl4AI enrichment | No | No | No | Gap — Sprint 1 |
| Trust scoring engine | No | No | No | Gap — Sprint 1 |
| ANZSIC classification | No | No | No | Gap — Sprint 1 |
| Scheduled extraction | No | No | No | Gap — Sprint 2 |
Build Ratio
~60% composition (existing ETL infra, Drizzle repos, CLI patterns), ~40% new code (NZBN wrapper, trust scoring, Crawl4AI integration, scheduling).
Protocols
How do we coordinate?
Build Order
| Sprint | Features | What | Effort | Acceptance |
|---|---|---|---|---|
| 0 | #1, #2, #5 | NZBN API wrapper + Companies Office + ANZSIC tables | 1 week | 100 NZ businesses in PostgreSQL with industry codes + directors |
| 1 | #3, #4 | Crawl4AI enrichment + trust scoring | 1 week | 100 entities enriched from websites, all records trust-scored |
| 2 | #6 | Scheduled extraction (cron + triggers) | 3 days | Monthly bulk runs unattended, weekly delta detects changes |
| 3 | — | First downstream query: Sales Dev uses business data for leads | 2 days | Sales Dev agent queries 100 businesses, filters by industry + trust score |
Commissioning
| # | Feature | Install | Test | Operational | Optimize |
|---|---|---|---|---|---|
| 1 | NZBN API wrapper | — | — | — | — |
| 2 | Companies Office integration | — | — | — | — |
| 3 | Web enrichment pipeline | — | — | — | — |
| 4 | Trust scoring engine | — | — | — | — |
| 5 | ANZSIC classification | — | — | — | — |
| 6 | Scheduled extraction | — | — | — | — |
| 7 | Universal source connector | L2 | L2 | — | — |
| 8 | Security-first extraction | L3 | L3 | — | — |
| 9 | Config-driven transforms | L3 | L3 | — | — |
| 10 | Quality metrics engine | L2 | L2 | — | — |
| 11 | Type-safe loading | L3 | L3 | — | — |
| 12 | ETL pipeline dashboard | L2 | L2 | L2 | — |
Scheduling Pattern
```
CRON  (monthly):   NZBN bulk refresh → full re-score
CRON  (weekly):    Companies Office change events API → incremental update
EVENT (on-demand): Website enrichment for new/priority entities → Crawl4AI
EVENT (trigger):   Trust score drops below threshold → re-verify
```
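The pattern above can be held as one config table so cron entries and event triggers share a single source of truth, which also serves the rollback story (disable a pipeline by removing its row). Expressions and pipeline names below are illustrative.

```typescript
// Scheduling pattern as data. Cron expressions here are illustrative defaults,
// not agreed run times.
type Trigger =
  | { kind: "cron"; expression: string }
  | { kind: "event"; name: string };

interface ScheduledPipeline {
  pipeline: string;
  trigger: Trigger;
  action: string;
}

const SCHEDULE: ScheduledPipeline[] = [
  { pipeline: "nzbn-bulk", trigger: { kind: "cron", expression: "0 3 1 * *" }, action: "full refresh + re-score" },
  { pipeline: "co-delta", trigger: { kind: "cron", expression: "0 4 * * 1" }, action: "incremental update" },
  { pipeline: "web-enrich", trigger: { kind: "event", name: "entity.priority" }, action: "Crawl4AI enrichment" },
  { pipeline: "trust-reverify", trigger: { kind: "event", name: "trust.below_threshold" }, action: "re-verify sources" },
];

// Rollback: a pipeline disappears from cron by being filtered out of this table.
function cronEntries(schedule: ScheduledPipeline[]): string[] {
  return schedule
    .filter((s) => s.trigger.kind === "cron")
    .map((s) => s.pipeline);
}
```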
Agent-Facing Spec
Commands:
```sh
# NZBN ingest
npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --dry-run
npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --allowSafeSource

# Companies Office enrichment
npx tsx tools/scripts/etl/companies-office/enrich.ts --source=nzbn --dry-run

# Web enrichment
npx tsx tools/scripts/etl/web-enrich/crawl.ts --source=nzbn --limit=100 --dry-run

# Trust scoring
npx tsx tools/scripts/etl/trust/score.ts --rescore-all
```
Boundaries:
| Action | Rule |
|---|---|
| Always | Use --dry-run first, respect rate limits, log provenance |
| Ask first | Bulk operations > 1000 entities, schema changes |
| Never | Scrape LinkedIn, store personal contact details without lawful basis, bypass robots.txt |
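The "respect rate limits" rule (and S1's forbidden "rate limit hit with no retry/backoff") implies a shared retry wrapper. A minimal exponential-backoff sketch, with illustrative defaults rather than tuned values:

```typescript
// Exponential backoff: retries when the caller's predicate says the error is
// retryable (e.g. HTTP 429); delay doubles per attempt. Defaults are illustrative.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay sequence for maxAttempts tries: base, 2*base, 4*base, ...
function backoffDelays(maxAttempts: number, baseDelayMs: number): number[] {
  return Array.from({ length: maxAttempts - 1 }, (_, i) => baseDelayMs * 2 ** i);
}

async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts || !isRetryable(err)) throw err;
      await sleep(baseDelayMs * 2 ** attempt); // 500ms, 1s, 2s, 4s...
    }
  }
}
```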
Test Contract:
| # | Feature | Test File | Assertion |
|---|---|---|---|
| 1 | NZBN API wrapper | tools/scripts/etl/nzbn/ingest.spec.ts | Returns 100 entities with valid NZBN numbers and status |
| 2 | Companies Office integration | tools/scripts/etl/companies-office/enrich.spec.ts | Each entity gets directors array with name + appointment date |
| 3 | Web enrichment pipeline | tools/scripts/etl/web-enrich/crawl.spec.ts | Structured extraction returns business model + services fields |
| 4 | Trust scoring engine | tools/scripts/etl/trust/score.spec.ts | Every record gets 0-100 score, no nulls, thresholds enforced |
| 5 | ANZSIC classification | tools/scripts/etl/nzbn/anzsic.spec.ts | BIC codes mapped to ANZSIC06, unmapped codes flagged |
| 6 | Scheduled extraction | tools/scripts/etl/scheduler/cron.spec.ts | Monthly bulk and weekly delta fire on schedule, idempotent |
| 7 | Universal source connector | libs/infrastructure/data-pipelines/src/connectors.spec.ts | File, API, and DB sources connect and return typed records |
| 8 | Security-first extraction | libs/infrastructure/data-pipelines/src/security.spec.ts | Path traversal blocked, checksums match, lineage recorded |
| 9 | Config-driven transforms | libs/infrastructure/data-pipelines/src/transforms.spec.ts | Field mapping produces expected output without code changes |
| 10 | Quality metrics engine | libs/infrastructure/data-pipelines/src/quality.spec.ts | Bad fields flagged, quality score ≥ threshold to proceed |
| 11 | Type-safe loading | libs/infrastructure/data-pipelines/src/loader.spec.ts | Schema validation rejects bad records, transaction rolls back |
| 12 | ETL pipeline dashboard | apps/stackmates/e2e/etl-dashboard.spec.ts | 8 pipelines visible, run/dry-run buttons functional |
Legal Compliance
| Data Type | NZ Privacy Act 2020 | Action |
|---|---|---|
| Business name, address, NZBN, industry | NOT personal information | Collect freely from public sources |
| Director/officer names + roles | Public record (Companies Office) | Low risk — collect with attribution |
| Contact email/phone of individuals | Personal information | Must have lawful purpose, inform individual |
| Scraped personal data at scale | Triggers Privacy Act obligations | Minimize, document lawful basis, implement retention limits |
Players
Who creates harmony?
Demand-Side Jobs
Job 1: Lead Generation Data
Situation: Sales Dev agent needs qualified NZ businesses to prospect, but CRM has zero classified entities.
| Element | Detail |
|---|---|
| Struggling moment | Agent opens CRM, finds 0 businesses matching target industry/region criteria |
| Current workaround | Manual Google searches, copy-paste from directories, no systematic classification |
| What progress looks like | 100 NZ businesses classified by industry + model + supply chain, with directors and trust scores, queryable in PostgreSQL |
| Hidden objection | "Scraped data is always garbage — we'll spend more time cleaning than using" |
| Switch trigger | First query returns 10 qualified leads with trust score > 70 in under 2 seconds |
Features that serve this job: #1, #2, #3, #4, #5
Job 2: Nowcast Signal Feed
Situation: Pipeline Nowcast needs real-time business signals to predict variance, but no signals are flowing.
| Element | Detail |
|---|---|
| Struggling moment | Nowcast model has no input — predictions impossible without data |
| Current workaround | Manual assessment, gut feel, delayed reaction |
| What progress looks like | Weekly delta feed of NZ business changes (new registrations, status changes, director moves) flowing to Nowcast |
| Hidden objection | "By the time scraped data arrives it's already stale" |
| Switch trigger | NZBN change events API delivers same-day updates for monitored entities |
Features that serve this job: #1, #6
Job 3: AI-Ready Data for Ventures
Situation: Every venture on the platform needs clean, structured data but each builds its own extraction pipeline.
| Element | Detail |
|---|---|
| Struggling moment | New venture launches, immediately blocked by "where do we get data?" |
| Current workaround | Each venture writes bespoke scripts, duplicating 80% of the work |
| What progress looks like | Any venture queries the unified data layer, gets trusted records in < 2s |
| Hidden objection | "Universal tools never work for my specific domain" |
| Switch trigger | First venture (Stackmates) uses ETL-loaded data without writing any extraction code |
Features that serve this job: #7, #8, #9, #10, #11
Relationship to Other PRDs
| PRD | Relationship | Data Flow |
|---|---|---|
| Sales CRM & RFP | Peer (consumer) | ETL → business profiles + contacts → CRM entities |
| Sales Dev Agent | Peer (consumer) | ETL → qualified leads → prospecting pipeline |
| Pipeline Nowcast | Peer (consumer) | ETL → business signals → variance prediction |
| Business Idea Generator | Peer (consumer) | ETL → market data → idea validation |
| Agent Platform | Peer (infra) | Agent triggers ETL enrichment via MCP/A2A |
| Data Interface | Child (access layer) | ETL loads → Data Interface queries |
Context
- PRD Index — ETL Data Tool
- Prompt Deck — 5-card pitch
- Pictures — Pre-flight maps