CI Strategy Audit

Is our E2E testing sound — or are we bleeding silently?

Verdict

The CI pipeline is structurally sound. Four workflows cover merge gates, E2E smoke, architecture enforcement, and repo health. The foundation is correct — the remaining gaps are instrumentation and speed.

| Dimension | Grade | Evidence |
| --- | --- | --- |
| Test structure | Good | 66 specs, 8 domains, tiered tags, fixture patterns |
| CI enforcement | Shipped | e2e-smoke.yml gates PRs: API integration + Playwright browser |
| Architecture gate | Shipped | pr-quality-gate.yml: tsconfig validation, architecture audit, module boundary lint |
| Health loop | Shipped | monorepo-typecheck.yml: daily cron at 07:00 NZDT, full typecheck |
| Auth pattern | Correct | Storage state via Better Auth HTTP, matches best practice |
| DB isolation | Good | pgvector service container on port 5433, schema push via drizzle-kit |
| Server warmup | Partial | curl loop + auth endpoint validation, not a proper /api/health |
| Flakiness tracking | Missing | retries: 2 hides flakes; real failure rate unknown |
| Sharding | Missing | 66 tests unsharded on 2 workers: slow feedback |
| Preview deploy testing | Missing | isProduction flag exists, not wired to Vercel preview URLs |

What's Shipped

Four workflows, each answering one question.

e2e-smoke.yml — Merge Loop (E2E)

Two parallel jobs on every PR and push to main:

| Job | Time | What It Proves |
| --- | --- | --- |
| API Integration | ~5 min | Agent contract + DB proof (no browser, no Next.js build) |
| E2E Browser | ~15 min | Full Next.js standalone build + Playwright @smoke against real browser |

Features: pgvector service container, NX cache, Playwright browser cache, auth endpoint pre-validation, paths-ignore for docs, concurrency with cancel-in-progress, workflow_dispatch with custom grep filter.

Cost: ~2,640 min/month (API ~660 + E2E ~1,980). This workflow alone exceeds the 2,000 min/month GitHub Actions free tier at current push/PR volume.

pr-quality-gate.yml — Merge Loop (Architecture)

Two sequential jobs on every non-draft PR:

| Job | Time | What It Catches |
| --- | --- | --- |
| Structural | ~30 s | tsconfig rootDir/spec exclude violations (no install needed) |
| Quality Gate | ~6 min | Architecture layer violations + module boundary lint (affected only) |

Features: draft PR skip, architecture audit with PR comments, artifact upload.

Cost: ~440 min/month.

monorepo-typecheck.yml — Health Loop

Daily cron (07:00 NZDT / 18:00 UTC). Full monorepo typecheck across all projects. Not a merge gate — catches cross-project TS drift that individual PRs miss.

Cost: ~300 min/month.

claude.yml — Agent Loop

Claude Code action triggered by @claude mentions in issues and PR comments. Gives the AI agent read/write access to the repo for code reviews and issue resolution.

What Works

Do not change these. They match industry best practice.

| Pattern | Implementation | Why it's correct |
| --- | --- | --- |
| Two-loop separation | Merge gate (PR) vs health check (cron) | Fast PRs, thorough health; never combined |
| Storage state auth | Single setup project, saved state, reused across tests | Eliminates per-test login overhead |
| Service container DB | pgvector on port 5433, schema push via drizzle-kit | Real DB, real extensions, real constraints |
| Auth pre-validation | curl checks server + /api/auth/get-session before Playwright | Catches auth startup failures early |
| Architecture PR comments | github-script posts audit failures to PR | Developer sees violations without opening CI logs |
| Draft PR skip | github.event.pull_request.draft == false | Saves CI minutes on WIP |
| Concurrency cancel | cancel-in-progress: true per workflow + ref | Rapid pushes don't queue; latest wins |
| Paths ignore | .md, .claude/, docs/, LICENSE | Docs-only changes skip CI entirely |
| NX + Playwright caching | Separate cache keys per workflow + browser cache | Reduces install time on repeat runs |
| Preflight validation | env-validation.ts before all tests | Catches env misconfig before timeout |
| Test tier tags | smoke/crm/rfp/insights/value/comprehensive | Structure exists and @smoke is enforced |
| Constants for routes | routes.ts, selectors.ts as single source | Prevents hard-coded selectors |
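
As an illustration of the storage state auth row, here is a minimal sketch of how a setup project and a dependent browser project are typically declared in playwright.config.ts. The helper name `authProjects`, the project names, and the state file path are assumptions for illustration, not the repo's actual values; only the Playwright option names (`testMatch`, `dependencies`, `use.storageState`) are real.

```typescript
// Minimal local type so the sketch compiles without @playwright/test installed.
interface ProjectLike {
  name: string;
  testMatch?: RegExp;
  dependencies?: string[];
  use?: { storageState?: string };
}

// Hypothetical helper: returns the `projects` array for playwright.config.ts.
// The "setup" project signs in once and writes `stateFile`; every browser
// project depends on it and starts already authenticated.
function authProjects(stateFile: string): ProjectLike[] {
  return [
    { name: "setup", testMatch: /.*\.setup\.ts/ },
    {
      name: "chromium",
      dependencies: ["setup"],
      use: { storageState: stateFile },
    },
  ];
}
```

In a real config the array would be passed as `projects: authProjects("playwright/.auth/user.json")`, where the path is again only an example.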

Two Remaining Gaps

Gap 1: Retries Hide Flakes (HIGH)

retries: 2 in the E2E browser job means a test can fail twice, pass on the third attempt, and CI reports green. No tracking of which tests needed retries. Flaky tests accumulate invisibly.

The 31 failing E2E tests from 2026-03-19 may include tests that sometimes pass on retry — the real failure rate is unknown.

Fix: Parse Playwright results after test run. Flag any test with retry > 0 that eventually passed. Track in a quarantine file. Target: flakiness rate at or below 5% per CI Pipeline Health benchmarks.

Gap 2: No Health Endpoint (MEDIUM)

The E2E browser job uses a curl loop checking for HTTP 200/302/307, then validates the auth endpoint. This works but is fragile — it checks "server responds" and "auth responds" but not "app is fully ready" (DB migrations complete, critical tables seeded, all middleware initialized).

A proper /api/health endpoint would check all readiness conditions in one call and return structured status.

Fix: Create /api/health route. Check DB connection, auth middleware, critical seeds. Return 200 with { status: 'ok' } when fully ready, 503 with { status: 'starting', missing: [...] } when not. Replace the curl loop with a single health check wait.
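
The "single health check wait" could look roughly like the sketch below: poll a probe until it reports HTTP 200 or a deadline passes. The function name `waitForHealthy`, its options, and the injected `probe` are hypothetical; the probe is injected so the loop can be exercised without a running server.

```typescript
// A probe resolves to an HTTP status code, or rejects if the server is unreachable.
type Probe = () => Promise<number>;

interface WaitOptions {
  timeoutMs: number; // total budget before giving up
  intervalMs: number; // pause between attempts
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  now?: () => number; // injectable clock for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves with the number of attempts once the probe returns 200.
// Rejects when the deadline passes without a 200.
async function waitForHealthy(probe: Probe, opts: WaitOptions): Promise<number> {
  const sleep = opts.sleep ?? defaultSleep;
  const now = opts.now ?? Date.now;
  const deadline = now() + opts.timeoutMs;
  let attempts = 0;
  while (true) {
    attempts += 1;
    try {
      if ((await probe()) === 200) return attempts;
    } catch {
      // Connection refused while the server boots counts as "not ready yet".
    }
    if (now() >= deadline) {
      throw new Error(`health check not ready after ${attempts} attempts`);
    }
    await sleep(opts.intervalMs);
  }
}
```

A CI step might call it as `waitForHealthy(async () => (await fetch(healthUrl)).status, { timeoutMs: 120000, intervalMs: 2000 })`, where `healthUrl` points at /api/health on whatever host and port the job starts the server on; the timeout values here are placeholders.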

Engineering Task Spec

Two phases for immediate gaps. Two phases for speed and preview testing when ready.

Phase 1: Flakiness Tracking

Files to create:

| Action | File | Purpose |
| --- | --- | --- |
| Create | apps/dreamineering/drmg-sales-e2e/scripts/parse-flaky.ts | Parse Playwright results, flag retried-then-passed tests |
| Create | apps/dreamineering/drmg-sales-e2e/quarantine.json | Known flaky test registry |
| Modify | .github/workflows/e2e-smoke.yml | Add post-test step to run parse-flaky and upload artifact |

Parser spec:

  • Read the Playwright JSON reporter output after the test run (simpler to parse than the HTML report)
  • Flag any test with retry > 0 that eventually passed
  • Output flaky test list as artifact
  • Optional: CI can exclude quarantined tests from gate

Acceptance: After 10 smoke runs, quarantine.json is empty or contains fewer than 5% of tests.
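
A minimal sketch of the parser core, assuming the input follows the shape of Playwright's JSON reporter output (nested `suites`, each with `specs`, each with `tests` whose `results` carry a `retry` index and a `status`). The interfaces below are a hand-written subset and should be checked against the installed Playwright version; `findFlaky`, `flakeRate`, and `FlakyEntry` are hypothetical names.

```typescript
// Assumed subset of the Playwright JSON report structure.
interface ResultLike { retry: number; status: string }
interface TestLike { projectName?: string; results: ResultLike[] }
interface SpecLike { title: string; file?: string; tests: TestLike[] }
interface SuiteLike { title: string; specs?: SpecLike[]; suites?: SuiteLike[] }
interface ReportLike { suites: SuiteLike[] }

interface FlakyEntry { title: string; file?: string; retriesNeeded: number }

// Flags every test whose passing attempt had retry > 0,
// i.e. it failed at least once and then passed.
function findFlaky(report: ReportLike): FlakyEntry[] {
  const flaky: FlakyEntry[] = [];
  const visit = (suite: SuiteLike): void => {
    for (const spec of suite.specs ?? []) {
      for (const test of spec.tests) {
        for (const result of test.results) {
          if (result.status === "passed" && result.retry > 0) {
            flaky.push({ title: spec.title, file: spec.file, retriesNeeded: result.retry });
            break; // one entry per test is enough
          }
        }
      }
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  for (const suite of report.suites) visit(suite);
  return flaky;
}

// Share of tests that needed a retry; 0 when there are no tests.
function flakeRate(flakyCount: number, totalTests: number): number {
  return totalTests === 0 ? 0 : flakyCount / totalTests;
}
```

If the denominator is the full 66-test suite, "fewer than 5%" allows at most 3 quarantined tests: 3/66 is about 4.5%, while 4/66 is about 6.1%.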

Phase 2: Health Endpoint

Files to create/modify:

| Action | File | Purpose |
| --- | --- | --- |
| Create | apps/dreamineering/drmg-sales/app/api/health/route.ts | Structured readiness check |
| Modify | .github/workflows/e2e-smoke.yml | Replace curl loop with health endpoint wait |

Health endpoint spec:

```typescript
// GET /api/health
// Checks:
// 1. DB connection alive (SELECT 1)
// 2. Auth middleware initialized (session endpoint responds)
// 3. Critical tables exist (schema push completed)
// Returns:
// 200 { status: 'ok', timestamp } when fully ready
// 503 { status: 'starting', missing: ['db'|'auth'|'schema'] } when not ready
```
Acceptance: E2E browser job waits for /api/health returning 200 instead of the curl loop. This removes server-startup timing as a source of flakes.

Phase 3: Sharding (when E2E exceeds 10 min)

Files to modify:

| Action | File | Change |
| --- | --- | --- |
| Modify | .github/workflows/e2e-smoke.yml | Add matrix strategy: shard: [1/4, 2/4, 3/4, 4/4] |
| Modify | playwright.config.ts | Increase workers to 4 for CI |

Spec: Pass --shard=${{ matrix.shard }} to Playwright. Merge results from all shards before status check. Use Playwright merge-reports CLI.
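
A small sketch of the two pieces this spec implies: the shard labels fed to the matrix, and the CI-only overrides in playwright.config.ts. The `blob` reporter is the format that Playwright's merge-reports command consumes; the helper names `shardMatrix` and `ciOverrides` are hypothetical, and choosing `html` as the local reporter is an assumption.

```typescript
// Produces the matrix values for N shards, e.g. ["1/4", "2/4", "3/4", "4/4"].
function shardMatrix(total: number): string[] {
  if (Math.floor(total) !== total || total < 1) {
    throw new RangeError("total must be a positive integer");
  }
  const shards: string[] = [];
  for (let i = 1; i <= total; i += 1) shards.push(`${i}/${total}`);
  return shards;
}

// Overrides to spread into the Playwright config.
// On CI: 4 workers (per the table above) and blob output so shards can be merged.
function ciOverrides(isCi: boolean): { workers?: number; reporter: "blob" | "html" } {
  return isCi ? { workers: 4, reporter: "blob" } : { reporter: "html" };
}
```

Each matrix job would then run `npx playwright test --shard=<label>`, and a final job would download the blob reports and run `npx playwright merge-reports --reporter html <directory>` before the status check.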

Trigger: Only implement when smoke suite wall-clock time consistently exceeds 10 minutes. Current ~15 min E2E job includes build time — measure Playwright-only duration first.

Acceptance: Smoke suite Playwright-only time under 5 minutes with sharding.

Phase 4: Preview Deploy Testing (when Vercel is active)

Files to create/modify:

| Action | File | Purpose |
| --- | --- | --- |
| Create | .github/workflows/e2e-preview.yml | Run E2E against Vercel preview URL |
| Modify | playwright.config.ts | Skip webServer when BASE_URL is set (already handled by isProduction flag) |

Spec: Trigger on Vercel deployment success. Wait for preview URL. Run smoke + domain tests against BASE_URL=<preview-url>.
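
The config side of this can be sketched as a single decision: if BASE_URL is provided, test against it and start no local server; otherwise fall back to the local webServer. How the repo's existing isProduction flag is actually derived is not shown in this audit, so the logic below is an assumption, and `resolveTarget` is a hypothetical name. The `webServer` fields used are real Playwright options.

```typescript
interface WebServerConfig {
  command: string;
  url: string;
  reuseExistingServer: boolean;
}

interface TargetConfig {
  baseURL: string;
  webServer?: WebServerConfig; // omitted when testing a deployed preview
}

// Picks the test target from the BASE_URL environment value.
function resolveTarget(baseUrlEnv: string | undefined, local: WebServerConfig): TargetConfig {
  const remote = (baseUrlEnv ?? "").trim();
  if (remote !== "") {
    // Preview or production run: Playwright must not start a local server.
    return { baseURL: remote };
  }
  return { baseURL: local.url, webServer: local };
}
```

In playwright.config.ts the result would feed `use.baseURL` and `webServer`, with the environment value read from `process.env.BASE_URL`.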

Trigger: Only implement when Vercel preview deploys are active and stable.

Acceptance: PR deploys to Vercel. E2E runs against preview URL. Tests pass against real infrastructure.

Phase Gate Rule

Never add the next phase while the current one is flaky. Each phase must prove stability (green rate above 95% over 7 days) before the next ships. This matches the phase-based rollout principle.
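
The gate can be made mechanical. The sketch below computes the green rate over a trailing window and applies the "above 95%" rule as a strict comparison; the `Run` shape and function names are hypothetical, and the run history would have to come from wherever CI results are recorded.

```typescript
interface Run {
  finishedAt: Date;
  green: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Green rate over the trailing window, or null when there is no data yet.
function greenRate(runs: Run[], now: Date, windowDays: number = 7): number | null {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  let total = 0;
  let green = 0;
  for (const run of runs) {
    const t = run.finishedAt.getTime();
    if (t < cutoff || t > now.getTime()) continue;
    total += 1;
    if (run.green) green += 1;
  }
  return total === 0 ? null : green / total;
}

// The next phase may ship only when the rate is strictly above 95%.
// No data means the gate stays closed.
function phaseGateOpen(rate: number | null): boolean {
  return rate !== null && rate > 0.95;
}
```

Treating "no data" as closed matches the benchmark table's position that the green rate is unknown until 7 days of runs exist.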

Cost Reality

Current monthly CI budget (estimated at 5 pushes/day, 22 working days):

| Workflow | Per Run | Monthly Estimate |
| --- | --- | --- |
| e2e-smoke.yml (API + E2E) | ~20 min | ~2,640 min |
| pr-quality-gate.yml | ~6 min | ~440 min |
| monorepo-typecheck.yml | ~10 min | ~300 min |
| Total | | ~3,380 min/month |

GitHub Actions free tier is 2,000 min/month. The current setup exceeds it by about 1,380 min, roughly 169% of the allowance. Monitor against CI Pipeline Health benchmarks. If budget exceeds 85%, reduce push frequency, increase paths-ignore scope, or evaluate cheaper runners (Ubicloud at ~0.3x cost, self-hosted at ~0x).
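
The arithmetic behind these figures, worked through using only the numbers stated in this audit. One thing it surfaces: at ~20 min per run, the e2e-smoke estimate corresponds to about 132 runs a month, somewhat more than the 110 that 5 pushes/day over 22 working days would give. The difference presumably comes from PR-triggered runs, which the estimate does not break down.

```typescript
const FREE_TIER_MINUTES = 2000;

// Monthly estimates per workflow, as stated in the cost table.
const monthlyMinutes: Array<[string, number]> = [
  ["e2e-smoke.yml", 660 + 1980], // API + E2E
  ["pr-quality-gate.yml", 440],
  ["monorepo-typecheck.yml", 300],
];

let totalMinutes = 0;
for (const [, minutes] of monthlyMinutes) totalMinutes += minutes; // 3380

const overageMinutes = totalMinutes - FREE_TIER_MINUTES; // 1380
const utilisation = totalMinutes / FREE_TIER_MINUTES; // 1.69
const impliedSmokeRuns = (660 + 1980) / 20; // 132 runs at ~20 min each
const statedPushRuns = 5 * 22; // 110 runs from 5 pushes/day over 22 days

console.log(totalMinutes, overageMinutes, utilisation, impliedSmokeRuns, statedPushRuns);
```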

Benchmark Alignment

These benchmarks from Engineering Quality measure pipeline health:

| Benchmark | Threshold | Current Status |
| --- | --- | --- |
| 8.1 Merge loop duration | 10 minutes or less | ~20 min summed (API 5 + E2E 15); since the two jobs run in parallel, wall-clock is closer to ~15 min. Needs sharding or scope reduction |
| 8.2 CI green rate | 95% or higher over 7 days | Unknown: need 7 days of data |
| 8.3 E2E flakiness rate | 5% or less | Unknown: no tracking (Gap 1) |
| 8.4 Push to merge-ready | 15 minutes or less | ~15 to 20 min (same caveat as 8.1): at or above target |
| 8.5 Health loop runs daily | 0 missed days | Configured: cron at 07:00 NZDT |
| 8.6 Cache hit rate | 80% or higher | Configured: NX + Playwright caching |

Context

Questions

What's the real failure rate when retries are stripped away — and what does that reveal about test quality?

  • The merge loop takes ~20 minutes but the benchmark says 10 — is the build step or the test step the bottleneck?
  • At ~3,380 min/month on a 2,000 min free tier, what's the actual bill — and is it worth switching to Ubicloud or self-hosted?
  • Which of the 31 failing tests from 2026-03-19 are flaky versus genuinely broken — and does the answer change the fix priority?