CI Strategy Audit

Is our E2E testing sound — or are we bleeding silently?

Verdict

The CI pipeline is structurally sound. Four workflows cover merge gates, E2E smoke, architecture enforcement, and repo health. The foundation is correct — the remaining gaps are instrumentation and speed.

Dimension	Grade	Evidence
Test structure	Good	66 specs, 8 domains, tiered tags, fixture patterns
CI enforcement	Shipped	`e2e-smoke.yml` gates PRs — API integration + Playwright browser
Architecture gate	Shipped	`pr-quality-gate.yml` — tsconfig validation, architecture audit, module boundary lint
Health loop	Shipped	`monorepo-typecheck.yml` — daily cron at 07:00 NZDT, full typecheck
Auth pattern	Correct	Storage state via Better Auth HTTP — matches best practice
DB isolation	Good	pgvector service container on port 5433, schema push via drizzle-kit
Server warmup	Partial	curl loop + auth endpoint validation — not a proper `/api/health`
Flakiness tracking	Missing	`retries: 2` hides flakes — real failure rate unknown
Sharding	Missing	66 specs on 2 workers with no sharding: slow feedback
Preview deploy testing	Missing	`isProduction` flag exists, not wired to Vercel preview URLs

What's Shipped

Four workflows, each answering one question.

`e2e-smoke.yml` — Merge Loop (E2E)

Two parallel jobs on every PR and push to main:

Job	Time	What It Proves
API Integration	~5 min	Agent contract + DB proof (no browser, no Next.js build)
E2E Browser	~15 min	Full Next.js standalone build + Playwright `@smoke` against real browser

Features: pgvector service container, NX cache, Playwright browser cache, auth endpoint pre-validation, paths-ignore for docs, concurrency with cancel-in-progress, workflow_dispatch with custom grep filter.

Cost: ~2,640 min/month (API ~660 + E2E ~1,980). This workflow alone exceeds the 2,000 min/month GitHub Actions free tier at current push/PR volume.

`pr-quality-gate.yml` — Merge Loop (Architecture)

Two sequential jobs on every non-draft PR:

Job	Time	What It Catches
Structural	~30s	tsconfig `rootDir`/spec exclude violations (no install needed)
Quality Gate	~6 min	Architecture layer violations + module boundary lint (affected only)

Features: draft PR skip, architecture audit with PR comments, artifact upload.

Cost: ~440 min/month.

`monorepo-typecheck.yml` — Health Loop

Daily cron (07:00 NZDT / 18:00 UTC). Full monorepo typecheck across all projects. Not a merge gate — catches cross-project TS drift that individual PRs miss.

Cost: ~300 min/month.

`claude.yml` — Agent Loop

Claude Code action triggered by @claude mentions in issues and PR comments. Gives the AI agent read/write access to the repo for code reviews and issue resolution.

What Works

Do not change these. They match industry best practice.

Pattern	Implementation	Why it's correct
Two-loop separation	Merge gate (PR) vs health check (cron)	Fast PRs, thorough health — never combined
Storage state auth	Single setup project, saved state, reused across tests	Eliminates per-test login overhead
Service container DB	pgvector on port 5433, schema push via drizzle-kit	Real DB, real extensions, real constraints
Auth pre-validation	curl checks server + `/api/auth/get-session` before Playwright	Catches auth startup failures early
Architecture PR comments	`github-script` posts audit failures to PR	Developer sees violations without opening CI logs
Draft PR skip	`github.event.pull_request.draft == false`	Saves CI minutes on WIP
Concurrency cancel	`cancel-in-progress: true` per workflow + ref	Rapid pushes don't queue — latest wins
Paths ignore	`.md`, `.claude/`, `docs/`, `LICENSE`	Docs-only changes skip CI entirely
NX + Playwright caching	Separate cache keys per workflow + browser cache	Reduces install time on repeat runs
Preflight validation	`env-validation.ts` before all tests	Catches env misconfig before timeout
Test tier tags	smoke/crm/rfp/insights/value/comprehensive	Structure exists and `@smoke` is enforced
Constants for routes	`routes.ts`, `selectors.ts` as single source	Prevents hard-coded selectors

Two Remaining Gaps

Gap 1: Retries Hide Flakes (HIGH)

`retries: 2` in the E2E browser job means a test can fail twice, pass on the third attempt, and CI still reports green. Nothing records which tests needed retries, so flaky tests accumulate invisibly.

The 31 failing E2E tests from 2026-03-19 may include tests that sometimes pass on retry — the real failure rate is unknown.

Fix: Parse Playwright results after each test run. Flag any test with `retry > 0` that eventually passed. Track flagged tests in a quarantine file. Target: flakiness rate at or below 5% per CI Pipeline Health benchmarks.

Gap 2: No Health Endpoint (MEDIUM)

The E2E browser job uses a curl loop checking for HTTP 200/302/307, then validates the auth endpoint. This works but is fragile — it checks "server responds" and "auth responds" but not "app is fully ready" (DB migrations complete, critical tables seeded, all middleware initialized).

A proper `/api/health` endpoint would check all readiness conditions in one call and return structured status.

Fix: Create an `/api/health` route. Check DB connection, auth middleware, critical seeds. Return 200 with `{ status: 'ok' }` when fully ready, 503 with `{ status: 'starting', missing: [...] }` when not. Replace the curl loop with a single health check wait.

Engineering Task Spec

Two phases for immediate gaps. Two phases for speed and preview testing when ready.

Phase 1: Flakiness Tracking

Files to create/modify:

Action	File	Purpose
Create	`apps/dreamineering/drmg-sales-e2e/scripts/parse-flaky.ts`	Parse Playwright results, flag retried-then-passed tests
Create	`apps/dreamineering/drmg-sales-e2e/quarantine.json`	Known flaky test registry
Modify	`.github/workflows/e2e-smoke.yml`	Add post-test step to run parse-flaky and upload artifact

Parser spec:

Read the Playwright JSON reporter output after the test run (the JSON report is the machine-readable one; the HTML report is meant for people)
Flag any test with retry > 0 that eventually passed
Output flaky test list as artifact
Optional: CI can exclude quarantined tests from gate

Acceptance: After 10 smoke runs, `quarantine.json` is empty or lists no more than 5% of tests, matching the flakiness target in Gap 1.
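The parser spec above can be sketched as follows. This is a minimal sketch, not the project's `parse-flaky.ts`: the local types are an assumption about the shape of Playwright's JSON reporter output (nested `suites`, each with `specs`, `tests`, and per-attempt `results` carrying a `retry` index), and the names `findFlaky` and `flakinessRate` are hypothetical.

```typescript
// Minimal local types for the fields this sketch reads.
// Assumption: they mirror the Playwright JSON reporter output.
interface ReportResult { retry: number; status: string }
interface ReportTest { results: ReportResult[] }
interface ReportSpec { title: string; file?: string; tests: ReportTest[] }
interface ReportSuite { title: string; specs?: ReportSpec[]; suites?: ReportSuite[] }
interface Report { suites: ReportSuite[] }

interface FlakyTest { file: string; title: string; attempts: number }

// Return the attempt with the highest retry index, i.e. the final attempt.
function finalAttempt(results: ReportResult[]): ReportResult | undefined {
  let last: ReportResult | undefined;
  for (const r of results) {
    if (last === undefined || r.retry > last.retry) last = r;
  }
  return last;
}

// Collect tests that passed only after at least one retry.
function findFlaky(report: Report): FlakyTest[] {
  const flaky: FlakyTest[] = [];
  const visit = (suite: ReportSuite): void => {
    for (const spec of suite.specs ?? []) {
      for (const test of spec.tests) {
        const last = finalAttempt(test.results);
        if (last !== undefined && last.status === "passed" && last.retry > 0) {
          flaky.push({
            file: spec.file ?? suite.title,
            title: spec.title,
            attempts: last.retry + 1,
          });
        }
      }
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  for (const suite of report.suites) visit(suite);
  return flaky;
}

// Flaky tests as a percentage of all tests in the run.
function flakinessRate(flakyCount: number, totalTests: number): number {
  return totalTests === 0 ? 0 : (flakyCount * 100) / totalTests;
}
```

If all 66 specs ran, the 5% target would allow at most 3 flaky tests (3 of 66 is about 4.5%, 4 of 66 is about 6.1%). The smoke subset is smaller, so the allowance shrinks with it.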

Phase 2: Health Endpoint

Files to create/modify:

Action	File	Purpose
Create	`apps/dreamineering/drmg-sales/app/api/health/route.ts`	Structured readiness check
Modify	`.github/workflows/e2e-smoke.yml`	Replace curl loop with health endpoint wait

Health endpoint spec:

// GET /api/health
// Checks:
//   1. DB connection alive (SELECT 1)
//   2. Auth middleware initialized (session endpoint responds)
//   3. Critical tables exist (schema push completed)
// Returns:
//   200 { status: 'ok', timestamp } — fully ready
//   503 { status: 'starting', missing: ['db'|'auth'|'schema'] } — not ready

Acceptance: The E2E browser job waits for `/api/health` to return 200 instead of running the curl loop. This should remove flakes caused by tests starting before the app is ready.
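The response contract in the health endpoint spec can be sketched as a pure function. This is a sketch under stated assumptions: `evaluateReadiness` and its types are hypothetical names, and the three probes themselves (the `SELECT 1`, the session endpoint call, the table check) are left out, since how they are run depends on the app's DB client and auth setup.

```typescript
type Check = "db" | "auth" | "schema";

interface HealthResult {
  httpStatus: 200 | 503;
  body:
    | { status: "ok"; timestamp: string }
    | { status: "starting"; missing: Check[] };
}

// Map probe outcomes to the response described in the spec.
// Each flag is the result of a probe the route handler would run first;
// running the probes is out of scope for this sketch.
function evaluateReadiness(
  probes: Record<Check, boolean>,
  now: Date = new Date(),
): HealthResult {
  const order: Check[] = ["db", "auth", "schema"];
  const missing = order.filter((name) => !probes[name]);
  if (missing.length === 0) {
    return { httpStatus: 200, body: { status: "ok", timestamp: now.toISOString() } };
  }
  return { httpStatus: 503, body: { status: "starting", missing } };
}
```

In `route.ts`, a `GET` handler could run the probes, call this function, and return the `body` with `httpStatus` as the response status. Keeping the mapping pure makes the 200/503 behaviour testable without a running server.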

Phase 3: Sharding (when E2E exceeds 10 min)

Files to modify:

Action	File	Change
Modify	`.github/workflows/e2e-smoke.yml`	Add matrix strategy: `shard: [1/4, 2/4, 3/4, 4/4]`
Modify	`playwright.config.ts`	Increase workers to 4 for CI

Spec: Pass `--shard=${{ matrix.shard }}` to Playwright. Each shard writes a blob report; after all shards finish, merge them with the Playwright `merge-reports` CLI before the status check.

Trigger: Only implement when smoke suite wall-clock time consistently exceeds 10 minutes. Current ~15 min E2E job includes build time — measure Playwright-only duration first.

Acceptance: Smoke suite Playwright-only time under 5 minutes with sharding.
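The `playwright.config.ts` change for this phase might look like the sketch below. It returns a plain object rather than calling Playwright's `defineConfig`, so it stays self-contained. `workers: 4` and `retries: 2` come from this audit; the `ciSettings` name, the local values, and the choice of the `html` reporter outside CI are assumptions.

```typescript
interface CiSettings {
  workers: number | undefined;
  retries: number;
  reporter: "blob" | "html";
}

// Settings that would be spread into the Playwright config object.
function ciSettings(isCI: boolean): CiSettings {
  return {
    // undefined leaves the worker count to Playwright's default locally
    workers: isCI ? 4 : undefined,
    // assumption: no retries locally, so flakes surface during development
    retries: isCI ? 2 : 0,
    // blob output per shard is what the merge-reports step consumes
    reporter: isCI ? "blob" : "html",
  };
}
```

The worker count applies per shard, so four shards with four workers each is only useful if the runner has the cores to back it.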

Phase 4: Preview Deploy Testing (when Vercel is active)

Files to create/modify:

Action	File	Purpose
Create	`.github/workflows/e2e-preview.yml`	Run E2E against Vercel preview URL
Modify	`playwright.config.ts`	Skip `webServer` when `BASE_URL` is set (already handled by `isProduction` flag)

Spec: Trigger on Vercel deployment success. Wait for the preview URL. Run smoke + domain tests against `BASE_URL=<preview-url>`.

Trigger: Only implement when Vercel preview deploys are active and stable.

Acceptance: PR deploys to Vercel. E2E runs against preview URL. Tests pass against real infrastructure.
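A sketch of the `BASE_URL` switch described for `playwright.config.ts`. How the existing `isProduction` flag is derived is not shown in this audit, so the sketch keys directly off `BASE_URL`; the `targetSettings` name, the local URL, and the server command are hypothetical.

```typescript
interface WebServer { command: string; url: string; reuseExistingServer: boolean }
interface TargetSettings { baseURL: string; webServer: WebServer | undefined }

const LOCAL_URL = "http://localhost:3000"; // hypothetical local address
const LOCAL_COMMAND = "node server.js";    // hypothetical standalone start command

// Pick the test target: a remote preview URL when BASE_URL is set,
// otherwise a locally started server.
function targetSettings(baseUrlEnv: string | undefined): TargetSettings {
  if (baseUrlEnv !== undefined && baseUrlEnv !== "") {
    // Remote target: Playwright must not start a local server.
    return { baseURL: baseUrlEnv, webServer: undefined };
  }
  return {
    baseURL: LOCAL_URL,
    webServer: { command: LOCAL_COMMAND, url: LOCAL_URL, reuseExistingServer: false },
  };
}
```

The returned `baseURL` would go under `use` and `webServer` at the top level of the config, with `process.env.BASE_URL` passed in as the argument.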

Phase Gate Rule

Never add the next phase while the current one is flaky. Each phase must prove stability (green rate above 95% over 7 days) before the next ships. This matches the phase-based rollout principle.
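The gate can be checked mechanically. The sketch below computes the green rate over a 7-day window from a list of run outcomes. The `WorkflowRun` shape is hypothetical, and excluding cancelled runs is an assumption, made because `cancel-in-progress` cancels superseded runs that say nothing about stability.

```typescript
type Conclusion = "success" | "failure" | "cancelled";
interface WorkflowRun { finishedAt: Date; conclusion: Conclusion }

const DAY_MS = 24 * 60 * 60 * 1000;

// Percentage of non-cancelled runs inside the window that succeeded.
// Returns undefined when there is no data to judge.
function greenRate(
  runs: WorkflowRun[],
  now: Date,
  windowDays: number = 7,
): number | undefined {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  let counted = 0;
  let green = 0;
  for (const run of runs) {
    if (run.finishedAt.getTime() < cutoff) continue;
    if (run.conclusion === "cancelled") continue;
    counted += 1;
    if (run.conclusion === "success") green += 1;
  }
  return counted === 0 ? undefined : (green * 100) / counted;
}

// The rule requires a rate strictly above 95%; no data counts as not stable.
function phaseIsStable(runs: WorkflowRun[], now: Date): boolean {
  const rate = greenRate(runs, now);
  return rate !== undefined && rate > 95;
}
```

With 19 green runs out of 20 the rate is exactly 95%, which does not pass a gate defined as "above 95%".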

Cost Reality

Current monthly CI budget (estimated at 5 pushes/day, 22 working days):

Workflow	Per Run	Monthly Estimate
`e2e-smoke.yml` (API + E2E)	~20 min	~2,640 min
`pr-quality-gate.yml`	~6 min	~440 min
`monorepo-typecheck.yml`	~10 min	~300 min
Total		~3,380 min/month

GitHub Actions free tier is 2,000 min/month. The current estimate of ~3,380 min is roughly 1.7x that allowance, about 1,380 min over. Monitor against CI Pipeline Health benchmarks. If usage exceeds 85% of the monthly budget, reduce push frequency, increase paths-ignore scope, or evaluate cheaper runners (Ubicloud at ~0.3x cost, self-hosted at ~0x).
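The arithmetic behind these figures, as a small sketch. The monthly estimates and the 2,000-minute allowance are taken from the table and text above; the function and constant names are illustrative, and the 85% threshold is applied here against the free-tier allowance as an assumption.

```typescript
interface WorkflowCost { workflow: string; minutesPerMonth: number }

const FREE_TIER_MINUTES = 2000;
const WARNING_PERCENT = 85; // assumption: measured against the free-tier allowance

const estimates: WorkflowCost[] = [
  { workflow: "e2e-smoke.yml", minutesPerMonth: 2640 },
  { workflow: "pr-quality-gate.yml", minutesPerMonth: 440 },
  { workflow: "monorepo-typecheck.yml", minutesPerMonth: 300 },
];

// Sum the monthly estimates across workflows.
function totalMinutes(rows: WorkflowCost[]): number {
  let sum = 0;
  for (const row of rows) sum += row.minutesPerMonth;
  return sum;
}

// Usage as a percentage of the allowance.
function usagePercent(used: number, allowance: number): number {
  return (used * 100) / allowance;
}

const total = totalMinutes(estimates);                    // 3380
const percent = usagePercent(total, FREE_TIER_MINUTES);   // 169
const overWarning = percent > WARNING_PERCENT;            // true
const overage = Math.max(0, total - FREE_TIER_MINUTES);   // 1380
```

One thing worth confirming: at ~20 min per run, the `e2e-smoke.yml` estimate implies about 132 runs a month (2,640 / 20), more than the 110 runs that 5 pushes/day over 22 working days would produce on their own.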

Benchmark Alignment

These benchmarks from Engineering Quality measure pipeline health:

Benchmark	Threshold	Current Status
8.1 Merge loop duration	10 minutes or less	~15 min wall clock (API ~5 min and E2E ~15 min run in parallel; ~20 billed min): needs sharding or scope reduction
8.2 CI green rate	95% or higher over 7 days	Unknown — need 7 days of data
8.3 E2E flakiness rate	5% or less	Unknown — no tracking (Gap 1)
8.4 Push to merge-ready	15 minutes or less	~15 min wall clock at best (jobs run in parallel): at the limit with no headroom
8.5 Health loop runs daily	0 missed days	Configured — cron at 07:00 NZDT
8.6 Cache hit rate	80% or higher	Configured — NX + Playwright caching

Context

CI Testing Infrastructure — Two-loop pipeline design
Engineering Quality Benchmarks — Pass/fail thresholds for CI health
Testing Strategy — Layer model and selection rules
Testing Economics — Cost-benefit per test layer
Developer Experience — CI SLOs from developer perspective

Questions

What's the real failure rate when retries are stripped away — and what does that reveal about test quality?

The merge loop takes ~15 minutes of wall-clock time (~20 billed) against a 10-minute benchmark: is the build step or the test step the bottleneck?
At ~3,380 min/month on a 2,000 min free tier, what's the actual bill — and is it worth switching to Ubicloud or self-hosted?
Which of the 31 failing tests from 2026-03-19 are flaky versus genuinely broken — and does the answer change the fix priority?
