
ETL Data Tool Spec

How do we turn empty repos into trusted data without hiring a data team?

Intent Contract

| Dimension | Statement |
| --- | --- |
| Objective | Unified data acquisition pipeline that connects government APIs, web enrichment, and trust scoring — so CRM and nowcasting have real data without bespoke engineering per source. |
| Outcomes | 1. 100 NZ businesses ingested from NZBN with trust scores within 7 days of deployment. 2. Companies Office directors linked to NZBN entities automatically. 3. Web enrichment adds business model classification to 50%+ of ingested entities. |
| Health Metrics | Pipeline run success rate >95%. Zero data corruption on failed runs (transaction rollback). Trust score computation completes for every ingested record. |
| Counter-metrics | Pipeline must not slow CRM page loads (async ingestion only). Trust scores must not create false confidence — low-evidence records must score low. Enrichment must not scrape personal data (NZ Privacy Act 2020). |
| Constraints | Hard: Free data sources first (NZBN, Companies Office, data.govt.nz). Hard: No LinkedIn scraping (Proxycurl sued Jan 2026). Hard: Trust score on every record — no data enters repos without provenance. Steering: TypeScript pipeline matching existing agent-etl-cli.ts patterns. |
| Autonomy | Allowed: API client implementation, extraction scheduling, trust score formula tuning. Escalate: new paid data sources, schema changes to business entity tables. Never: store personal contact data without lawful purpose, bypass trust scoring, delete source records. |
| Blast Radius | PostgreSQL business entity tables (new INSERTs). CRM contact enrichment (new data flowing in). Scheduled jobs (new cron entries). Docker (Crawl4AI container). |
| Rollback | Each pipeline is independent — disable by removing its cron entry. Ingested records can be soft-deleted by source batch ID. Trust scores can be recomputed from raw data. Crawl4AI container can be stopped without affecting other pipelines. |
| Stop Rules | Complete when: 100 businesses ingested with trust scores, Companies Office directors linked, web enrichment classifying 50%+ of entities. Halt when: NZBN API access is revoked, or trust scores show no discrimination (all records score the same). |

Story Contract

Stories are test contracts. Each row converts to a test file. Tests must be RED before implementation starts. Tests going GREEN = value delivered.

Every story follows the P2P arc: Pain (who hurts, measurable cost) → Performance Target (the number that proves it works) → Validated Outcome (the Tight Five dimension this serves).

| # | WHEN (Trigger + Precondition) | THEN (Exact Assertion) | ARTIFACT | Test Type | FORBIDDEN (Must not happen) | OUTCOME (Performance Number) |
| --- | --- | --- | --- | --- | --- | --- |
| S1 | Operator runs npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 with valid NZBN API credentials configured | 100 business entity records inserted into PostgreSQL. Each record has: NZBN number (non-null), entity name, status, registration date, BIC codes. Duplicate NZBNs rejected on re-run (upsert, not duplicate). Run completes within 5 minutes. | BLOCKER — test path TBD by engineering | integration | Command returns success but inserts 0 records. Records inserted with null NZBN numbers. Re-run creates duplicate records instead of upserting. API rate limit hit with no retry/backoff. | 100 businesses ingested in <5 min vs manual NZBN lookup at ~3 min per entity (300 min saved) |
| S2 | 100 NZBN entities exist in DB AND operator runs Companies Office enrichment pipeline | Directors and shareholders linked to their NZBN entities. Each linked person has: name, role (director/shareholder), appointment date. Person records deduplicated across entities (same director at 2 companies = 1 person record, 2 role links). | BLOCKER — test path TBD by engineering | integration | Enrichment runs but creates zero person-entity links. Same person duplicated as separate records per company. Personal data stored without lawful_purpose flag. | People + roles enriched automatically vs manual Companies Office searches at ~5 min per entity |
| S3 | 50+ NZBN entities exist in DB AND Crawl4AI Docker container running AND operator triggers web enrichment | Crawl4AI extracts structured data from company websites found via NZBN entity URLs. Business model, services list, and industry classification written to entity record. source_authority set to 0.6 for web-sourced fields. Entities without discoverable websites marked as web_enrichment_failed (not silently skipped). | BLOCKER — test path TBD by engineering | integration | Enrichment silently skips entities with no website (no failure record). Web-sourced data given source_authority = 1.0 (same as government API). Crawl4AI container crash leaves pipeline in broken state with no cleanup. | 50%+ of entities classified by business model vs 0% classification today |
| S4 | Entity has data from 2+ sources (NZBN + Companies Office, or NZBN + web enrichment) | Trust score computed as weighted composite: completeness (0.25) + source_authority (0.25) + freshness (0.20) + cross_reference (0.20) + consistency (0.10). Score stored with provenance trail (which sources contributed, when). Government-only entity scores higher than web-only entity for same completeness. | BLOCKER — test path TBD by engineering | unit + integration | All entities receive the same trust score regardless of source quality. Entity with only web data scores equal to entity with government + web data. Trust score computed but provenance trail missing — can't explain why a score is low. | Every record has a 0-100 trust score with provenance vs current "hope the data is good" |
| S5 | 8 ETL pipelines configured AND operator navigates to pipeline dashboard | Dashboard renders all 8 pipelines with: name, last run timestamp, last run status (success/failed/running), record count. Operator can trigger dry-run or live-run from dashboard. Failed pipeline shows error summary without requiring log file access. | BLOCKER — test path TBD by engineering | e2e | Dashboard shows "0 pipelines" when 8 are configured. Run button triggers pipeline but dashboard doesn't update status until page refresh. Failed run shows generic "error" with no actionable detail. | Pipeline status visible in <10s vs SSH + grep logs at ~5 min per pipeline check |
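S1's idempotency requirement (re-runs upsert on NZBN number rather than duplicating) can be sketched in TypeScript. This is an illustrative in-memory version keyed by NZBN number; the real pipeline would target PostgreSQL (e.g. an ON CONFLICT upsert through the existing Drizzle repos), and the shape and function names here are hypothetical.

```typescript
// Hypothetical entity shape; the Map stands in for the PostgreSQL table.
interface BusinessEntity {
  nzbn: string;          // non-null NZBN number (unique key)
  name: string;
  status: string;
  registeredAt: string;  // ISO date
  bicCodes: string[];
}

// Upsert semantics from S1: re-running the ingest with the same NZBN
// updates the existing record instead of creating a duplicate, and
// records without an NZBN number never enter the store.
function upsertEntities(
  store: Map<string, BusinessEntity>,
  incoming: BusinessEntity[],
): { inserted: number; updated: number; rejected: number } {
  let inserted = 0, updated = 0, rejected = 0;
  for (const e of incoming) {
    if (!e.nzbn) { rejected++; continue; }  // null/empty NZBN rejected
    if (store.has(e.nzbn)) updated++;
    else inserted++;
    store.set(e.nzbn, e);
  }
  return { inserted, updated, rejected };
}
```

Running the same batch twice leaves one stored record: the second pass reports an update, not a duplicate, which is exactly the re-run behavior S1 asserts.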

Tight Five coverage check:

| P Dimension | Stories covering it | Gap? |
| --- | --- | --- |
| Pain (CRM has 0 NZ businesses, repos empty) | S1 (NZBN ingestion fills the void), S5 (visibility into pipeline health) | No |
| Demand (NZBN API free, 700K entities waiting) | S1 (government data), S2 (Companies Office directors) | No |
| Edge (trust scoring on every record) | S4 (composite trust score with provenance) | No |
| Trend (AI needs trusted data, not garbage) | S3 (web enrichment feeds AI classification), S4 (trust prevents confident-on-garbage) | No |
| Conversion (100 businesses, $0 cost) | S1 (100 entities target), S3 (50%+ classification rate) | No |

Build Contract

| # | Feature | Function | Outcome | Job | State |
| --- | --- | --- | --- | --- | --- |
| 1 | NZBN API wrapper | TypeScript client over NZBN REST API — search, retrieve, bulk | NZ businesses ingested from authoritative source | Lead generation | Gap |
| 2 | Companies Office integration | Directors, shareholders, filings via CO API | People + roles for every business entity | Contact enrichment | Gap |
| 3 | Web enrichment pipeline | Crawl4AI (Docker) extracts structured data from company websites | Business model, services, supply chain classified per entity | Business intelligence | Gap |
| 4 | Trust scoring engine | Composite score: completeness, authority, freshness, cross-ref, consistency | Every record has a 0-100 trust score with provenance | Data quality | Gap |
| 5 | ANZSIC classification | Map NZBN BIC codes + AI enrichment to ANZSIC06 industry codes | Every business classified by industry, model type, supply chain | Industry mapping | Gap |
| 6 | Scheduled extraction | Cron: monthly bulk refresh, weekly delta, on-demand enrichment triggers | Data stays fresh without manual intervention | Automation | Gap |
| 7 | Universal source connector | Connect files, APIs, databases, SaaS tools via configuration | No source is unsupported — connect anything | Data readiness | Partial |
| 8 | Security-first extraction | Validate paths, generate checksums, track lineage per record | Every byte has a provenance trail | Compliance | Built |
| 9 | Configuration-driven transforms | Map fields, clean data, infer types without writing code | Non-engineers can define data pipelines | Self-service | Built |
| 10 | Quality metrics engine | Score data quality per field and per source on every run | Bad data caught before it reaches AI | Data quality | Built |
| 11 | Type-safe loading | Validated schemas, transaction rollback, event emission | Load failures don't corrupt the database | Reliability | Built |
| 12 | ETL pipeline dashboard | 8 pipelines with run/dry-run in UI | Operators see pipeline status and trigger runs | Visibility | Live |

Principles

What truths constrain the design?

The Job

| Element | Detail |
| --- | --- |
| Situation | CRM and nowcasting systems have schemas and repos but zero data flowing through them. The intake valve doesn't exist. |
| Intention | Any data source — government API, web scrape, bulk file — connects through one unified pipeline framework with trust scoring at ingestion. |
| Obstacle | No extraction strategy. Repos are built but empty. Each new data source requires bespoke engineering. |
| Hardest Thing | Trust. Scraped data mixed with government data mixed with manual entry — without trust scoring, the AI makes confident decisions on garbage. |

Why Now

  • CRM has 0 NZ businesses classified. Sales pipeline is empty because lead data doesn't exist.
  • Nowcasting requires real-time business signals. No signals flowing = no predictions.
  • NZBN API is free (700K+ NZ entities). Companies Office API is free (directors, shareholders). The authoritative data is sitting there uncollected.
  • Crawl4AI is open-source (Apache 2.0, 58K+ GitHub stars). Self-hosted = $0.
  • Every month without data acquisition is a month of manual work the platform should eliminate.

Design Constraints

| Constraint | Rationale |
| --- | --- |
| Free data sources first | NZBN + Companies Office + data.govt.nz before any paid service |
| Trust score on every record | No data enters repos without a composite trust score |
| Scheduled + triggered | Monthly bulk, weekly delta, on-demand enrichment — not just cron |
| TypeScript pipeline | Matches existing agent-etl-cli.ts patterns and Drizzle repos |
| NZ Privacy Act 2020 compliant | Business data freely; personal contact data with lawful purpose only |
| No LinkedIn scraping | Proxycurl sued Jan 2026. Use Companies Office for directors, not scraping |

Data Acquisition Architecture

Three-layer pipeline. Each layer has a different trust level.

```
Layer 1: STRUCTURED APIs (free, high trust)
  NZBN API + Companies Office API → PostgreSQL
  ↓ trust_score: source_authority = 1.0

Layer 2: AI ENRICHMENT (low cost, medium trust)
  Crawl4AI (Docker) → company websites → structured extraction → PostgreSQL
  ↓ trust_score: source_authority = 0.6

Layer 3: VALIDATION & SCORING (compute cost only)
  LLM classification + cross-reference → composite trust score → PostgreSQL
  ↓ trust_score: composite of all 5 dimensions
```

Trust Score Formula

```
trust_score = (
  completeness     × 0.25 +   # % of fields populated
  source_authority × 0.25 +   # govt = 1.0, website = 0.6, directory = 0.4
  freshness        × 0.20 +   # decay from last_verified_at
  cross_reference  × 0.20 +   # confirmed by N sources
  consistency      × 0.10     # no contradictions
) × 100
```

Thresholds: <40 = Low, 40-70 = Medium, >70 = High
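A minimal TypeScript sketch of this formula and its thresholds, assuming each dimension has already been normalized to [0, 1] upstream. The interface and function names are illustrative, not the actual engine's API.

```typescript
// Each dimension normalized to [0, 1] before scoring (an assumption here).
interface TrustInputs {
  completeness: number;      // fraction of fields populated
  sourceAuthority: number;   // govt = 1.0, website = 0.6, directory = 0.4
  freshness: number;         // decay from last_verified_at
  crossReference: number;    // confirmed by N sources
  consistency: number;       // no contradictions
}

// Weighted composite, exactly the formula above, scaled to 0-100.
function trustScore(t: TrustInputs): number {
  const raw =
    t.completeness    * 0.25 +
    t.sourceAuthority * 0.25 +
    t.freshness       * 0.20 +
    t.crossReference  * 0.20 +
    t.consistency     * 0.10;
  return Math.round(raw * 100);
}

// Threshold bands: <40 Low, 40-70 Medium, >70 High.
function trustBand(score: number): "Low" | "Medium" | "High" {
  if (score < 40) return "Low";
  if (score <= 70) return "Medium";
  return "High";
}
```

With identical inputs except source_authority, a government-sourced entity (1.0) scores 100 while a web-sourced one (0.6) scores 90, satisfying S4's requirement that government-only data outranks web-only data at equal completeness.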

NZ Data Sources

| Source | Data | Cost | Trust | Access |
| --- | --- | --- | --- | --- |
| NZBN API | Entity name, status, trading names, addresses, industry codes, contacts | Free | High | REST API (subscription key) |
| NZBN Bulk Data | Complete business dataset (700K+) | Free | High | JSON/CSV monthly dumps |
| Companies Office API | Company registration, directors, shareholders, filings | Free | High | REST API |
| data.govt.nz | ANZSIC06 classification tables | Free | High | CSV/JSON download |
| Crawl4AI | Structured extraction from company websites | Free (self-hosted) | Medium | Docker, Apache 2.0 |
| Firecrawl | Fallback scraping (95.3% reliability) | 500 pages free/mo | Medium | API + MCP server |

Tool Selection

| Need | Tool | Why |
| --- | --- | --- |
| NZ business data | NZBN API + Bulk Data | Free, authoritative, 700K+ entities |
| Industry classification | ANZSIC06 tables + NZBN BIC codes | Free, standardized |
| People + roles | Companies Office API (directors) | Free, public record |
| Website enrichment | Crawl4AI (Docker, self-hosted) | Free, 89.7% success, LLM extraction |
| Fallback scraping | Firecrawl free tier (500 pages/mo) | 95.3% reliability, MCP server for agent integration |
| Future agent integration | Firecrawl MCP server | Agent-triggered enrichment |

Performance

How do we know it's working?

Priority Score

PRIORITY = Pain x Demand x Edge x Trend x Conversion

| Dimension | Score (1-5) | Evidence |
| --- | --- | --- |
| Pain | 5 | CRM has 0 classified NZ businesses. Nowcast has 0 signals. Every data-consuming PRD blocked. Empty repos = zero value from existing schemas. |
| Demand | 4 | Internal engineering dependency for Sales CRM, Sales Dev, Nowcast, Business Idea Generator. No external demand yet (max 4 per scoring guide). |
| Edge | 4 | Domain-specific schemas (73 types) + NZBN/CO APIs (free, most competitors pay for data) + existing Drizzle repos ready to receive. 6+ months to replicate. |
| Trend | 5 | AI data readiness is a structural shift. 73% of AI projects fail on data quality (Gartner). MCP scraping tools exploding (9 servers benchmarked 2026). Crawl4AI 58K stars. |
| Conversion | 3 | Clear internal path. External: $2K-50K/month tiers plausible but untested. No external paying customers yet. |
| Composite | 1200 | 5 x 4 x 4 x 5 x 3 |

Quality Targets

| Metric | Target | Now |
| --- | --- | --- |
| NZ businesses ingested | 100 (first batch) | 0 |
| Trust score coverage | 100% of records | 0% |
| Pipeline execution time | < 30 min for 100 entities | N/A |
| Query latency on loaded data | < 2s | N/A |
| Freshness: time since last refresh | < 7 days | N/A |

Kill signal: If data loads but nobody queries it within 14 days, the pipeline is extraction theater. Track: queries per entity per week. If zero after 2 weeks, the downstream consumers don't need this data.

Platform

What do we control?

Current State

| Component | Built | Wired | Working | Notes |
| --- | --- | --- | --- | --- |
| PostgreSQL schemas (73 types) | Yes | Yes | Yes | Drizzle repos ready |
| agent-etl-cli.ts | Yes | Yes | Yes | DDD load order, dry-run, allowSafeSource |
| ETL pipeline dashboard | Yes | Yes | Yes | 8 pipelines, run/dry-run |
| File source connectors | Yes | Yes | Yes | 90% complete |
| API source connectors | Yes | Yes | Yes | 85% complete |
| Database connectors | Yes | Yes | Partial | 75% complete |
| NZBN API wrapper | No | No | No | Gap — Sprint 0 |
| Companies Office integration | No | No | No | Gap — Sprint 0 |
| Crawl4AI enrichment | No | No | No | Gap — Sprint 1 |
| Trust scoring engine | No | No | No | Gap — Sprint 1 |
| ANZSIC classification | No | No | No | Gap — Sprint 1 |
| Scheduled extraction | No | No | No | Gap — Sprint 2 |

Build Ratio

~60% composition (existing ETL infra, Drizzle repos, CLI patterns), ~40% new code (NZBN wrapper, trust scoring, Crawl4AI integration, scheduling).

Protocols

How do we coordinate?

Build Order

| Sprint | Features | What | Effort | Acceptance |
| --- | --- | --- | --- | --- |
| 0 | #1, #2, #5 | NZBN API wrapper + Companies Office + ANZSIC tables | 1 week | 100 NZ businesses in PostgreSQL with industry codes + directors |
| 1 | #3, #4 | Crawl4AI enrichment + trust scoring | 1 week | 100 entities enriched from websites, all records trust-scored |
| 2 | #6 | Scheduled extraction (cron + triggers) | 3 days | Monthly bulk runs unattended, weekly delta detects changes |
| 3 | — | First downstream query: Sales Dev uses business data for leads | 2 days | Sales Dev agent queries 100 businesses, filters by industry + trust score |

Commissioning

| # | Feature | Install | Test | Operational | Optimize |
| --- | --- | --- | --- | --- | --- |
| 1 | NZBN API wrapper | | | | |
| 2 | Companies Office integration | | | | |
| 3 | Web enrichment pipeline | | | | |
| 4 | Trust scoring engine | | | | |
| 5 | ANZSIC classification | | | | |
| 6 | Scheduled extraction | | | | |
| 7 | Universal source connector | L2 | L2 | | |
| 8 | Security-first extraction | L3 | L3 | | |
| 9 | Config-driven transforms | L3 | L3 | | |
| 10 | Quality metrics engine | L2 | L2 | | |
| 11 | Type-safe loading | L3 | L3 | | |
| 12 | ETL pipeline dashboard | L2 | L2 | L2 | |

Scheduling Pattern

```
CRON (monthly):    NZBN bulk refresh → full re-score
CRON (weekly):     Companies Office change events API → incremental update
EVENT (on-demand): Website enrichment for new/priority entities → Crawl4AI
EVENT (trigger):   Trust score drops below threshold → re-verify
```
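The two CRON rows could be expressed as ordinary crontab entries. A hedged sketch: the times are placeholders, and the commands are taken from the agent-facing spec in this document.

```
# Monthly (1st, 02:00): NZBN bulk refresh, then full re-score
0 2 1 * * npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --allowSafeSource && npx tsx tools/scripts/etl/trust/score.ts --rescore-all

# Weekly (Mon, 03:00): Companies Office incremental update
0 3 * * 1 npx tsx tools/scripts/etl/companies-office/enrich.ts --source=nzbn
```

The two EVENT rows are triggers rather than cron entries: they would be fired by the dashboard or by a trust-score watcher, per the on-demand enrichment design.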

Agent-Facing Spec

Commands:

```
# NZBN ingest
npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --dry-run
npx tsx tools/scripts/etl/nzbn/ingest.ts --count=100 --region=all --allowSafeSource

# Companies Office enrichment
npx tsx tools/scripts/etl/companies-office/enrich.ts --source=nzbn --dry-run

# Web enrichment
npx tsx tools/scripts/etl/web-enrich/crawl.ts --source=nzbn --limit=100 --dry-run

# Trust scoring
npx tsx tools/scripts/etl/trust/score.ts --rescore-all
```

Boundaries:

| Action | Rule |
| --- | --- |
| Always | Use --dry-run first, respect rate limits, log provenance |
| Ask first | Bulk operations > 1000 entities, schema changes |
| Never | Scrape LinkedIn, store personal contact details without lawful basis, bypass robots.txt |
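The "respect rate limits" rule implies retry with backoff, which S1's FORBIDDEN column also calls out. A generic TypeScript helper as a sketch; the names (withBackoff, isRetryable) are hypothetical, not from the existing codebase.

```typescript
// Retry a failing async call with exponential backoff. The caller supplies
// a predicate deciding which errors are retryable (e.g. HTTP 429).
async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-retryable errors or when attempts are exhausted.
      if (!isRetryable(err) || attempt + 1 >= maxAttempts) throw err;
      // Exponential delay: 500 ms, 1 s, 2 s, 4 s, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Wrapping each NZBN page fetch in a helper like this turns a rate-limit response into a delayed retry instead of a failed run.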

Test Contract:

| # | Feature | Test File | Assertion |
| --- | --- | --- | --- |
| 1 | NZBN API wrapper | tools/scripts/etl/nzbn/ingest.spec.ts | Returns 100 entities with valid NZBN numbers and status |
| 2 | Companies Office integration | tools/scripts/etl/companies-office/enrich.spec.ts | Each entity gets directors array with name + appointment date |
| 3 | Web enrichment pipeline | tools/scripts/etl/web-enrich/crawl.spec.ts | Structured extraction returns business model + services fields |
| 4 | Trust scoring engine | tools/scripts/etl/trust/score.spec.ts | Every record gets 0-100 score, no nulls, thresholds enforced |
| 5 | ANZSIC classification | tools/scripts/etl/nzbn/anzsic.spec.ts | BIC codes mapped to ANZSIC06, unmapped codes flagged |
| 6 | Scheduled extraction | tools/scripts/etl/scheduler/cron.spec.ts | Monthly bulk and weekly delta fire on schedule, idempotent |
| 7 | Universal source connector | libs/infrastructure/data-pipelines/src/connectors.spec.ts | File, API, and DB sources connect and return typed records |
| 8 | Security-first extraction | libs/infrastructure/data-pipelines/src/security.spec.ts | Path traversal blocked, checksums match, lineage recorded |
| 9 | Config-driven transforms | libs/infrastructure/data-pipelines/src/transforms.spec.ts | Field mapping produces expected output without code changes |
| 10 | Quality metrics engine | libs/infrastructure/data-pipelines/src/quality.spec.ts | Bad fields flagged, quality score ≥ threshold to proceed |
| 11 | Type-safe loading | libs/infrastructure/data-pipelines/src/loader.spec.ts | Schema validation rejects bad records, transaction rolls back |
| 12 | ETL pipeline dashboard | apps/stackmates/e2e/etl-dashboard.spec.ts | 8 pipelines visible, run/dry-run buttons functional |
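The ANZSIC assertion (row 5) — mapped codes returned, unmapped codes flagged — can be illustrated with a small TypeScript sketch. The BIC and ANZSIC values below are made-up placeholders; the real mapping table comes from data.govt.nz.

```typescript
// Placeholder lookup table standing in for the data.govt.nz ANZSIC06 data.
// These codes are illustrative, not real BIC or ANZSIC values.
const bicToAnzsic = new Map<string, string>([
  ["BIC-0001", "A011"],
  ["BIC-0002", "C113"],
]);

// Map each BIC code; anything without a mapping is flagged, never
// silently dropped (matching the anzsic.spec.ts assertion).
function classify(bicCodes: string[]): { anzsic: string[]; unmapped: string[] } {
  const anzsic: string[] = [];
  const unmapped: string[] = [];
  for (const code of bicCodes) {
    const mapped = bicToAnzsic.get(code);
    if (mapped) anzsic.push(mapped);
    else unmapped.push(code);
  }
  return { anzsic, unmapped };
}
```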
Privacy Classification:

| Data Type | NZ Privacy Act 2020 | Action |
| --- | --- | --- |
| Business name, address, NZBN, industry | NOT personal information | Collect freely from public sources |
| Director/officer names + roles | Public record (Companies Office) | Low risk — collect with attribution |
| Contact email/phone of individuals | Personal information | Must have lawful purpose, inform individual |
| Scraped personal data at scale | Triggers Privacy Act obligations | Minimize, document lawful basis, implement retention limits |
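The contact-data rule above implies a gate at ingestion: strip personal contact fields unless a lawful purpose is recorded. A sketch with hypothetical field names (lawfulPurpose mirrors the lawful_purpose flag from S2's FORBIDDEN column):

```typescript
// Hypothetical person record shape; email/phone are personal information
// under the NZ Privacy Act 2020, name + role are public record.
interface PersonRecord {
  name: string;
  role: string;
  email?: string;
  phone?: string;
  lawfulPurpose?: string;  // e.g. documented lawful basis for collection
}

// Keep personal contact fields only when a lawful purpose is documented;
// otherwise return just the public-record fields.
function applyPrivacyGate(p: PersonRecord): PersonRecord {
  if (p.lawfulPurpose) return p;
  const { email, phone, ...publicFields } = p;
  return publicFields;
}
```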

Players

Who creates harmony?

Demand-Side Jobs

Job 1: Lead Generation Data

Situation: Sales Dev agent needs qualified NZ businesses to prospect, but CRM has zero classified entities.

| Element | Detail |
| --- | --- |
| Struggling moment | Agent opens CRM, finds 0 businesses matching target industry/region criteria |
| Current workaround | Manual Google searches, copy-paste from directories, no systematic classification |
| What progress looks like | 100 NZ businesses classified by industry + model + supply chain, with directors and trust scores, queryable in PostgreSQL |
| Hidden objection | "Scraped data is always garbage — we'll spend more time cleaning than using" |
| Switch trigger | First query returns 10 qualified leads with trust score > 70 in under 2 seconds |

Features that serve this job: #1, #2, #3, #4, #5

Job 2: Nowcast Signal Feed

Situation: Pipeline Nowcast needs real-time business signals to predict variance, but no signals are flowing.

| Element | Detail |
| --- | --- |
| Struggling moment | Nowcast model has no input — predictions impossible without data |
| Current workaround | Manual assessment, gut feel, delayed reaction |
| What progress looks like | Weekly delta feed of NZ business changes (new registrations, status changes, director moves) flowing to Nowcast |
| Hidden objection | "By the time scraped data arrives it's already stale" |
| Switch trigger | NZBN change events API delivers same-day updates for monitored entities |

Features that serve this job: #1, #6

Job 3: AI-Ready Data for Ventures

Situation: Every venture on the platform needs clean, structured data but each builds its own extraction pipeline.

| Element | Detail |
| --- | --- |
| Struggling moment | New venture launches, immediately blocked by "where do we get data?" |
| Current workaround | Each venture writes bespoke scripts, duplicating 80% of the work |
| What progress looks like | Any venture queries the unified data layer, gets trusted records in < 2s |
| Hidden objection | "Universal tools never work for my specific domain" |
| Switch trigger | First venture (Stackmates) uses ETL-loaded data without writing any extraction code |

Features that serve this job: #7, #8, #9, #10, #11

Relationship to Other PRDs

| PRD | Relationship | Data Flow |
| --- | --- | --- |
| Sales CRM & RFP | Peer (consumer) | ETL → business profiles + contacts → CRM entities |
| Sales Dev Agent | Peer (consumer) | ETL → qualified leads → prospecting pipeline |
| Pipeline Nowcast | Peer (consumer) | ETL → business signals → variance prediction |
| Business Idea Generator | Peer (consumer) | ETL → market data → idea validation |
| Agent Platform | Peer (infra) | Agent triggers ETL enrichment via MCP/A2A |
| Data Interface | Child (access layer) | ETL loads → Data Interface queries |

Context