Data Footprint
All we are is state machines driven by data flow.
What routine checks and balances need to be set up to manage this process optimally?
Subject Expertise
- Data governance: The oversight to ensure that data brings value and supports your business strategy.
- Interoperability and usability: The ability to interact with multiple user profiles and other systems.
- Integrity: The operational processes that keep the data warehouse running accurately in production.
- Security, privacy, compliance: Protect the data from threats.
- Reliability: The ability of a system to recover from failures and continue to function.
- Performance efficiency: The ability of a system to adapt to changes in load.
- Cost optimization: Reduce costs while maximizing the value delivered.
- ETL
- Standards Compliance
The intelligence you gain is only as good as the truth of your data.
Data Business Rules
What are you legally required to do before collecting personal information?
You're in "real privacy" territory as soon as you collect names and contact details in NZ. LinkedIn also expects a public privacy policy URL before users authorise your app.
NZ Privacy Act 2020
Any NZ business or solo operator collecting personal information is an "agency" under the Privacy Act 2020. Thirteen privacy principles apply.
| Principle | Rule | Sales Dev Impact |
|---|---|---|
| Purpose | State why you collect it | Sales outreach, CRM logging, campaign tracking |
| Necessity | Only collect what's reasonably needed | Name, email, LinkedIn URL — not everything available |
| Security | Store securely | Supabase (encrypted at rest), RBAC access controls |
| Disclosure | Name who you share it with | Resend, CRM, hosting providers, analytics |
| Access | Let people see and correct their data | Provide a mechanism on request |
| Deletion | Delete on request | Honour within reasonable time |
LinkedIn API
LinkedIn requires a public privacy policy URL before users authorise your app. Your policy must be at least as strong as LinkedIn's.
| Requirement | What's Needed |
|---|---|
| Privacy Policy URL | Publicly accessible, linked in app config |
| Data handling | Explain what member data you store and why |
| Consent | Show policy to users before they grant access |
| Revocation | Delete data when consent is revoked |
| Storage limits | Follow LinkedIn's limits on member data retention |
Privacy Policy
Host it at a stable URL (/privacy on your domain, or a public doc) and reference it from the LinkedIn app config and the product website. A guide for NZ businesses is linked below.
| Section | Content |
|---|---|
| What you collect | Names, emails, LinkedIn profile URLs, messages, event logs |
| Why you collect it | Sales pipeline, campaign tracking, AI-assisted outreach |
| Where it's stored | Supabase, Vercel, Resend, LinkedIn API, analytics |
| Who processes it | Third-party services with data access |
| Retention | How long per data type, deletion on request |
| Contact | How to access, correct, or delete |
Commissioning Gate
No Sales Dev Agent feature that touches personal data ships without:
- Privacy policy hosted at public URL
- LinkedIn app config references that URL
- Data collection limited to stated purposes
- Deletion pathway tested end-to-end
- Third-party processors named in policy
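The deletion pathway in the gate above can be expressed as executable logic: every processor named in the privacy policy must have a registered deletion handler, and a single request fans out to all of them. This is a minimal sketch, the processor names and handler shapes are illustrative, not a real integration with Supabase or Resend.

```typescript
// Hypothetical sketch: each processor named in the privacy policy registers
// a deletion handler; one request fans out to all of them and fails loudly
// if any declared processor is unaccounted for.
type DeletionReceipt = { processor: string; deleted: boolean };

const processors: Record<string, (email: string) => DeletionReceipt> = {
  // In a real system these would call the Supabase, Resend, etc. APIs.
  supabase: (_email) => ({ processor: "supabase", deleted: true }),
  resend: (_email) => ({ processor: "resend", deleted: true }),
  analytics: (_email) => ({ processor: "analytics", deleted: true }),
};

function handleDeletionRequest(
  email: string,
  declaredProcessors: string[]
): DeletionReceipt[] {
  // "Deletion pathway tested end-to-end" in executable form: a processor
  // named in the policy without a handler is a hard failure.
  for (const name of declaredProcessors) {
    if (!(name in processors)) {
      throw new Error(`No deletion handler for declared processor: ${name}`);
    }
  }
  return declaredProcessors.map((name) => processors[name](email));
}

const receipts = handleDeletionRequest("jane@example.com", [
  "supabase",
  "resend",
  "analytics",
]);
console.log(receipts.every((r) => r.deleted)); // true
```

The useful property is the failure mode: adding a processor to the policy without wiring its handler breaks the gate immediately, rather than silently at a user's deletion request.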
Playbook
Here's an optimized framework for AI-driven business data analysis operations, synthesized from industry best practices and emerging agent architectures:
| Role | Job → Workflow | Prompt Example | Context | Best Fit Model | Essential Tools |
|---|---|---|---|---|---|
| Data Collector | Automate data ingestion → ETL pipelines | "Act as an ETL pipeline architect. Identify all relevant data sources in [department], establish automated ingestion via APIs/web scraping, and structure into [format] for analysis. Document schema relationships." | Cross-system data integration, real-time streaming | GPT-4 Turbo + CodeLlama-70B | Python (BeautifulSoup/Scrapy), Airflow, Fivetran, Snowflake, AWS Glue |
| Quality Sentinel | Clean/validate datasets → Anomaly detection | "Act as data integrity agent. Analyze [dataset] for missing values, outliers, and formatting errors. Implement automated correction rules while maintaining audit trail. Generate data health report." | Regulatory compliance, ML model readiness | Claude 3 Sonnet | Great Expectations, Pandera, Trifacta, Dataiku |
| Insight Miner | Exploratory analysis → Hypothesis testing | "Act as principal analyst. Using [dataset], test whether [business hypothesis] holds. Apply appropriate statistical methods (p < 0.05). Visualize relationships between [variables]. Suggest 3 actionable next steps." | Strategic decision support, market trend analysis | GPT-4 Omni + Wolfram | SQL, R/Python statsmodels, Jupyter, Tableau |
| Forecast Architect | Predictive modeling → Scenario planning | "Act as quantitative modeler. Develop ARIMA/prophet model for [metric] forecasting. Backtest against last 5 years data. Show 3 scenarios (+/- 15% variance). Output interactive visualizations with confidence intervals." | Financial planning, inventory optimization | Google Gemini Advanced | Prophet, TensorFlow, Azure ML, Power BI |
| Storyteller | Insight synthesis → Executive reporting | "Act as VP Strategy. Transform [analysis findings] into C-suite presentation: 5 slides max. Highlight 3 key opportunities, 2 risks, and 1 recommended initiative with ROI projection. Use automotive industry benchmarks." | Board-level communication, investor relations | Claude 3 Opus | Think-Cell, PowerPoint, Domo, Looker |
| Process Optimizer | Continuous improvement → AutoML | "Act as ML Ops engineer. Implement automated feature engineering on [dataset]. Deploy best-performing model from H2O AutoML. Monitor drift with [threshold]. Create retraining pipeline in AWS SageMaker." | Operational efficiency, real-time decision systems | Mistral-Large + SAS Viya | H2O.ai, SageMaker, MLflow, Kubeflow |
| Compliance Guard | GDPR/PII protection → Audit trails | "Act as data governance officer. Anonymize [sensitive fields] using k-anonymity (k=5). Implement role-based access controls. Generate compliance certificate with embedded watermarks for audit purposes." | Regulatory requirements, ethical AI practices | IBM Watsonx | Immuta, Privitar, Collibra, Azure Purview |
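The Compliance Guard row asks for k-anonymity (k=5). The check itself is simple: group records by their quasi-identifier values and verify every group has at least k members. A minimal sketch, with field names (ageBand, region) as illustrative quasi-identifiers:

```typescript
// Minimal k-anonymity check: a dataset is k-anonymous over a set of
// quasi-identifiers if every combination of those values appears in at
// least k records.
type Row = Record<string, string>;

function isKAnonymous(rows: Row[], quasiIdentifiers: string[], k: number): boolean {
  const counts = new Map<string, number>();
  for (const row of rows) {
    // Group key is the tuple of quasi-identifier values.
    const key = quasiIdentifiers.map((q) => row[q]).join("|");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.values()].every((n) => n >= k);
}

const sample: Row[] = [
  { ageBand: "30-39", region: "Auckland" },
  { ageBand: "30-39", region: "Auckland" },
  { ageBand: "40-49", region: "Wellington" },
];
console.log(isKAnonymous(sample, ["ageBand", "region"], 2)); // false: one group has 1 record
```

In practice the remediation step (generalising or suppressing the offending group) is the hard part; this only detects the violation.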
Platform
Toolkit and tech stack to maximize gains from your data footprint.
- Agent Stack Architecture: Layer LangChain/Microsoft AutoGen for orchestration between specialized models
- Tool Integration: Use vector databases (Pinecone/Weaviate) for contextual memory across workflows
- Validation: Implement human-in-the-loop review points pre-execution of critical business decisions
- Security: Zero-trust design with encrypted data vaults (Vault12/Anjuna) for sensitive financial data
This framework enables autonomous operation while maintaining necessary governance controls, combining the analytical depth of tools like Databricks with the conversational interface of LAMBDA agents. For maximum ROI, prioritize implementations that bridge departmental silos through unified data lakes on Snowflake/Redshift.
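The validation bullet above, human-in-the-loop review before critical decisions execute, can be sketched as a small gate. This is a shape illustration only; the action type and approval flow are assumptions, not a LangChain or AutoGen API:

```typescript
// Sketch of a human-in-the-loop gate: critical actions queue for review
// instead of executing immediately; only explicit approval releases them.
type Action = { id: string; critical: boolean; run: () => string };

const pendingReview: Action[] = [];

function submit(action: Action): string {
  if (action.critical) {
    pendingReview.push(action); // park it for a human reviewer
    return `queued:${action.id}`;
  }
  return action.run(); // non-critical actions execute autonomously
}

function approve(id: string): string {
  const idx = pendingReview.findIndex((a) => a.id === id);
  if (idx === -1) throw new Error(`No pending action: ${id}`);
  const [action] = pendingReview.splice(idx, 1);
  return action.run();
}

console.log(submit({ id: "report", critical: false, run: () => "sent" })); // sent
console.log(submit({ id: "refund", critical: true, run: () => "refunded" })); // queued:refund
console.log(approve("refund")); // refunded
```

The design choice worth noting: criticality is a property of the action, decided at submission time, so the agent cannot reclassify its way around the gate mid-workflow.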
AI Strategy Checklist
A comprehensive Data and AI strategy checklist that addresses all aspects of data management across SaaS products, internal systems, and manual processes such as spreadsheets. It is designed to be valuable for any business owner or director, regardless of technical background.
Data Inventory and Ecosystem Mapping
- What is the complete inventory of your data sources? (SaaS platforms, internal databases, spreadsheets, documents, etc.)
- Which systems contain your most business-critical data?
- For each data source, who "owns" the data and is responsible for its accuracy?
- What types of data do you collect and store? (customer, financial, operational, etc.)
- Which systems serve as your "source of truth" for different types of reference data?
- How much of your critical business data resides in unstructured formats (emails, documents) or personal tools (spreadsheets)?
- Are there any "shadow IT" systems or unauthorized data repositories in use?
Data Quality and Governance
- How do you measure and ensure data quality across all systems?
- What processes exist for data validation and cleaning?
- How do you handle incomplete or inaccurate data?
- What governance policies are in place regarding data access, usage, and modification?
- How do you comply with relevant data regulations (GDPR, CCPA, industry-specific)?
- Do you have a data dictionary or catalog that defines key data elements across systems?
- How do you track the lineage of data as it moves between systems?
Data Integration and Flow
- How is data currently shared or synchronized between your various systems?
- What integration challenges exist between your SaaS platforms and internal systems?
- Are you using any middleware or iPaaS solutions (Zapier, MuleSoft, etc.) to connect systems?
- How much manual effort is required to move data between systems?
- What are your biggest pain points regarding data duplication or inconsistency?
- Do you have real-time data needs, and if so, are they being met?
- How do reference data updates propagate across your ecosystem?
Data Access and Utilization
- Who has access to what data, and how is this access controlled?
- What tools do different stakeholders use to access and analyze data?
- How long does it typically take to gather the data needed for key decisions?
- Are there bottlenecks in accessing or retrieving data when needed?
- Do business users have self-service analytics capabilities, or are they dependent on IT?
- What skills gaps exist in your organization regarding data analysis?
Reporting and Analytics Capabilities
- What reports are regularly generated from your data, and how are they created?
- How much time is spent preparing reports versus analyzing them?
- Are your current analytics capabilities descriptive, diagnostic, predictive, or prescriptive?
- What visualization tools are being used (Tableau, Power BI, custom dashboards)?
- How frequently is data refreshed in your reports and dashboards?
- What metrics or KPIs do you track most closely?
- Are there any insights you wish you could extract but currently cannot?
Decision-Making Processes
- Which decisions are currently data-driven versus intuition-based?
- How quickly can you act on insights derived from your data?
- What is the typical lag time between data collection and decision-making?
- How do you measure the effectiveness of data-driven decisions?
- Are there recurring decisions that could be automated?
- What operational inefficiencies could be addressed through better data usage?
AI Readiness and Opportunities
- What specific business problems are you trying to solve that AI could potentially address?
- Have you identified high-value use cases for AI implementation?
- Is your data of sufficient quality and quantity to support AI initiatives?
- What experiments or proofs of concept have you conducted with AI?
- Do you have clear objectives and measurable outcomes for potential AI implementations?
- Which areas of your business would benefit most from predictive analytics?
- Are there repetitive cognitive tasks that could be augmented or automated with AI?
Implementation and Resource Considerations
- What is your current technology infrastructure's ability to support advanced analytics or AI?
- Do you have the necessary in-house expertise, or will you need external support?
- What is your budget allocation for data and AI initiatives?
- How would you prioritize potential data and AI projects?
- What change management considerations should be addressed?
- How will you measure ROI on data and AI investments?
- What timeline would be realistic for implementing key initiatives?
Strategy and Future Planning
- What is your long-term vision for leveraging data and AI?
- How does your data strategy align with your overall business strategy?
- What competitive advantages could be gained through better data utilization?
- How do you plan to scale your data capabilities as your business grows?
- What emerging technologies or approaches should be on your radar?
- How will you continuously improve your data ecosystem?
- What skills development is needed to support your future data and AI initiatives?
The Commissioning Instrument
The data footprint is not an audit. It's a commissioning instrument — a mechanical check that reveals the gap between what's built and what's connected.
Every table, every API route, every UI entry point has a maturity state:
| State | What It Means | Signal |
|---|---|---|
| Schema exists | Structure is defined | Drawings done |
| Data exists | Records populated | Material procured |
| API exists | Programmatic access | Wired |
| UI exists | Human can reach it | Controls proven |
| Feedback loop closes | Usage improves the system | Operating |
A system can score perfectly on schema and data while having zero UI or API entry points. Structure without connection is inert. The data footprint reveals this gap mechanically — not through judgment, but through counting entry points.
The question: for each table, can a human reach it? Can an agent use it? If neither, the capability exists in theory only.
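The maturity ladder above can be computed mechanically, per table, from yes/no evidence. A sketch, where the field names (schemaExists, hasApi, hasUi, feedbackLoopCloses) are assumed stand-ins for the real introspection fields:

```typescript
// Derive a table's maturity state from mechanical evidence: each rung
// requires all the rungs before it, and the first missing rung is where
// the table stops.
type TableEvidence = {
  name: string;
  schemaExists: boolean;
  recordCount: number;
  hasApi: boolean;
  hasUi: boolean;
  feedbackLoopCloses: boolean;
};

const LADDER = ["none", "schema", "data", "api", "ui", "operating"] as const;

function maturity(t: TableEvidence): (typeof LADDER)[number] {
  const rungs = [
    t.schemaExists,
    t.recordCount > 0,
    t.hasApi,
    t.hasUi,
    t.feedbackLoopCloses,
  ];
  let reached = 0;
  for (const ok of rungs) {
    if (!ok) break;
    reached++;
  }
  return LADDER[reached];
}

console.log(
  maturity({ name: "deals", schemaExists: true, recordCount: 3,
             hasApi: false, hasUi: false, feedbackLoopCloses: false })
); // data
```

This encodes the "structure without connection is inert" claim: a table with schema and data but no API or UI stops at "data", and counting entry points is all the judgment required.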
The State Machine Score
165 tables. 169 schema files. 71 repository implementations. This is the current footprint.
| State | Name | Evidence Mechanism | Tables at State |
|---|---|---|---|
| 0 | Idea | Concept named in schema file | 165 |
| 1 | Schema | Table in information_schema | 165 |
| 2 | Migration | Table exists in DB | 165 |
| 3 | Data | record_count > 0 in meta_database_introspection | ~12 (estimated; not yet measured) |
| 4 | Repository | Repo class exists in repositories/src/lib/ | 71 |
| 5 | Server Actions | API route responds | Not yet measured |
| 6 | CRUD UI | has_crud_interface = true in meta_table_documentation | 7 |
| 7 | ETL Pipeline | pipeline_in populated in meta_table_documentation | 0 |
| 8 | A2A API | has_agent_interface = true in meta_table_documentation | 3 |
| 9 | E2E Tests | Intent/E2E test pass in CI | Not yet measured |
| 10 | Commissioned | All prior states verified | 0 |
The gap: meta_table_documentation has 0 rows. The instrument exists but has never been seeded. Until it is seeded, states 6, 7, and 8 cannot be read deterministically.
Domain Breakdown
| Domain | Tables | Has Repo | CRUD UI | A2A API | State 3 (data) |
|---|---|---|---|---|---|
| agent | 32 | ~20 | 2 | 2 | agent_profiles: 6, deals: 3, contacts: 5 |
| std | 22 | ~6 | 0 | 0 | 0 |
| priority | 13 | 5 | 0 | 0 | problems: 1 |
| platform | 9 | 4 | 1 | 0 | 0 |
| planning | 9 | 1 | 0 | 0 | plans: 26, tasks: 150, events: 216 |
| tech | 9 | 0 | 0 | 0 | 0 |
| value | 8 | 1 | 0 | 0 | 0 |
| venture | 8 | 3 | 1 | 0 | 0 |
| job | 7 | 0 | 0 | 0 | 0 |
| governance | 4 | 1 | 0 | 0 | permissions: 70, role_perms: 86 |
| knowledge | 3 | 0 | 0 | 0 | 0 |
| other | 32 | ~10 | 3 | 1 | org: 1, system_user: 1 |
The Four Wiring Gaps
To achieve 100% deterministic commissioning:
| # | Gap | File to Wire | What "Done" Looks Like |
|---|---|---|---|
| 1 | meta_table_documentation empty | tools/scripts/etl/database-introspection/run.ts | One row per table, auto-seeded from information_schema |
| 2 | meta_database_introspection never run | Same script | Row counts, column counts, FK graph populated for all 165 tables |
| 3 | CRUD + API detection not writing to DB | tools/scripts/analysis/commissioning-detection/detect-crud-pages.ts + detect-api-endpoints.ts | hasCrudInterface and hasAgentInterface flags written to meta_table_documentation |
| 4 | Dashboard not querying the instrument | GET /api/commissioning/status | Endpoint reads meta_table_documentation — commisioning-status.md becomes a view, not a manual file |
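Gap #1, one documentation row per table auto-seeded from information_schema, could look like the sketch below. The real implementation lives in the run.ts script named above; this is an illustration of the shape only, with column defaults assumed:

```typescript
// Illustrative seeding step for gap #1: one documentation row per table,
// with all flags defaulting to false/null. The flags are flipped later by
// the detection scripts and the introspection run.
type DocRow = {
  table_name: string;
  has_crud_interface: boolean;
  has_agent_interface: boolean;
  pipeline_in: string | null;
  record_count: number;
};

function seedRows(tableNames: string[]): DocRow[] {
  return tableNames.map((table_name) => ({
    table_name,
    has_crud_interface: false, // flipped by detect-crud-pages.ts
    has_agent_interface: false, // flipped by detect-api-endpoints.ts
    pipeline_in: null, // populated by the ETL pipeline registry
    record_count: 0, // filled by the introspection run
  }));
}

const rows = seedRows(["agent_profiles", "deals", "contacts"]);
console.log(rows.length); // 3
```

Seeding with explicit false defaults rather than nulls matters: "not measured" and "measured absent" become distinguishable once the detection scripts start writing.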
The deterministic query (once seeded):
```sql
SELECT
  table_name,
  has_crud_interface,
  has_agent_interface,
  pipeline_in IS NOT NULL AS has_etl,
  record_count > 0 AS has_data,
  meta_score
FROM meta_table_documentation
ORDER BY meta_score DESC, table_name;
```
This single query answers: data footprint, human access, agent access, ETL state — for every table. No manual interpretation required.
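Gap #4's status endpoint then reduces to aggregating this query's rows into per-state counts. A sketch, with the response shape assumed rather than taken from the actual route:

```typescript
// Sketch for gap #4: the status endpoint aggregates the deterministic
// query's rows into counts, so the status page is a view over the
// instrument rather than a hand-maintained file.
type QueryRow = {
  table_name: string;
  has_crud_interface: boolean;
  has_agent_interface: boolean;
  has_etl: boolean;
  has_data: boolean;
};

function commissioningStatus(rows: QueryRow[]) {
  return {
    tables: rows.length,
    withData: rows.filter((r) => r.has_data).length,
    withCrudUi: rows.filter((r) => r.has_crud_interface).length,
    withAgentApi: rows.filter((r) => r.has_agent_interface).length,
    withEtl: rows.filter((r) => r.has_etl).length,
  };
}

const status = commissioningStatus([
  { table_name: "deals", has_crud_interface: true, has_agent_interface: false, has_etl: false, has_data: true },
  { table_name: "std_units", has_crud_interface: false, has_agent_interface: false, has_etl: false, has_data: false },
]);
console.log(status.withData); // 1
```

Because the counts come from the same rows the deterministic query returns, the dashboard can never disagree with the database.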
Context
- Platform — The accumulated assets the footprint measures
- Legal Operations — Broader legal framework this page's business rules sit within
- Sales Dev Agent — First PRD gated on privacy compliance
- Identity & Access — Auth and RBAC that enforces data access controls
- Work Charts — Who does what, and receipts that prove it
- Flow Engineering — The maps that produce this instrument
- Commissioning — Where maturity states are tracked
- Standard KPI Metrics
- Decision Making
Links
- Data Management Strategy
- NZ Privacy Act 2020 — Privacy Principles
- LinkedIn API — Getting Started
- NZ Privacy Policy Guide
- LinkedIn GDPR Compliance
- LinkedIn API Limits
Questions
- If structure without connection is inert, how many of your tables exist in theory only — and what does that reveal about what you've actually built?
- Which data domain in your business has the widest gap between schema maturity and feedback loop maturity?
- When the commissioning instrument shows a table with data but no API, is that a build priority or an intentional boundary?
- Does having a privacy policy change what you collect — or does it just document what you were already doing?
- What breaks first when a user requests deletion and your data flows span five services?