CI Strategy Audit
Is our E2E testing sound — or are we bleeding silently?
Verdict
The CI pipeline is structurally sound. Four workflows cover E2E smoke tests, architecture enforcement, repo health, and agent automation. The foundation is correct; the remaining gaps are instrumentation and speed.
| Dimension | Grade | Evidence |
|---|---|---|
| Test structure | Good | 66 specs, 8 domains, tiered tags, fixture patterns |
| CI enforcement | Shipped | e2e-smoke.yml gates PRs — API integration + Playwright browser |
| Architecture gate | Shipped | pr-quality-gate.yml — tsconfig validation, architecture audit, module boundary lint |
| Health loop | Shipped | monorepo-typecheck.yml — daily cron at 07:00 NZDT, full typecheck |
| Auth pattern | Correct | Storage state via Better Auth HTTP — matches best practice |
| DB isolation | Good | pgvector service container on port 5433, schema push via drizzle-kit |
| Server warmup | Partial | curl loop + auth endpoint validation — not a proper /api/health |
| Flakiness tracking | Missing | retries: 2 hides flakes — real failure rate unknown |
| Sharding | Missing | 66 specs on one runner with 2 workers, no sharding, so feedback is slow |
| Preview deploy testing | Missing | isProduction flag exists, not wired to Vercel preview URLs |
What's Shipped
Four workflows, each answering one question.
e2e-smoke.yml — Merge Loop (E2E)
Two parallel jobs on every PR and push to main:
| Job | Time | What It Proves |
|---|---|---|
| API Integration | ~5 min | Agent contract + DB proof (no browser, no Next.js build) |
| E2E Browser | ~15 min | Full Next.js standalone build + Playwright @smoke against real browser |
Features: pgvector service container, NX cache, Playwright browser cache, auth endpoint pre-validation, paths-ignore for docs, concurrency with cancel-in-progress, workflow_dispatch with custom grep filter.
Cost: ~2,640 min/month (API ~660 + E2E ~1,980). This workflow alone exceeds the 2,000 min/month GitHub Actions free tier at current push/PR volume.
pr-quality-gate.yml — Merge Loop (Architecture)
Two sequential jobs on every non-draft PR:
| Job | Time | What It Catches |
|---|---|---|
| Structural | ~30s | tsconfig rootDir/spec exclude violations (no install needed) |
| Quality Gate | ~6 min | Architecture layer violations + module boundary lint (affected only) |
Features: draft PR skip, architecture audit with PR comments, artifact upload.
Cost: ~440 min/month.
monorepo-typecheck.yml — Health Loop
Daily cron (07:00 NZDT / 18:00 UTC). Full monorepo typecheck across all projects. Not a merge gate — catches cross-project TS drift that individual PRs miss.
Cost: ~300 min/month.
claude.yml — Agent Loop
Claude Code action triggered by @claude mentions in issues and PR comments. Gives the AI agent read/write access to the repo for code reviews and issue resolution.
What Works
Do not change these. They match industry best practice.
| Pattern | Implementation | Why it's correct |
|---|---|---|
| Two-loop separation | Merge gate (PR) vs health check (cron) | Fast PRs, thorough health — never combined |
| Storage state auth | Single setup project, saved state, reused across tests | Eliminates per-test login overhead |
| Service container DB | pgvector on port 5433, schema push via drizzle-kit | Real DB, real extensions, real constraints |
| Auth pre-validation | curl checks server + /api/auth/get-session before Playwright | Catches auth startup failures early |
| Architecture PR comments | github-script posts audit failures to PR | Developer sees violations without opening CI logs |
| Draft PR skip | github.event.pull_request.draft == false | Saves CI minutes on WIP |
| Concurrency cancel | cancel-in-progress: true per workflow + ref | Rapid pushes don't queue — latest wins |
| Paths ignore | .md, .claude/, docs/, LICENSE | Docs-only changes skip CI entirely |
| NX + Playwright caching | Separate cache keys per workflow + browser cache | Reduces install time on repeat runs |
| Preflight validation | env-validation.ts before all tests | Catches env misconfig before timeout |
| Test tier tags | smoke/crm/rfp/insights/value/comprehensive | Structure exists and @smoke is enforced |
| Constants for routes and selectors | routes.ts, selectors.ts as single source | Prevents hard-coded routes and selectors |
Two Remaining Gaps
Gap 1: Retries Hide Flakes (HIGH)
retries: 2 in the E2E browser job means a test can fail twice, pass on the third attempt, and CI reports green. No tracking of which tests needed retries. Flaky tests accumulate invisibly.
The 31 failing E2E tests from 2026-03-19 may include tests that sometimes pass on retry — the real failure rate is unknown.
Fix: Parse Playwright results after test run. Flag any test with retry > 0 that eventually passed. Track in a quarantine file. Target: flakiness rate at or below 5% per CI Pipeline Health benchmarks.
Gap 2: No Health Endpoint (MEDIUM)
The E2E browser job uses a curl loop checking for HTTP 200/302/307, then validates the auth endpoint. This works but is fragile — it checks "server responds" and "auth responds" but not "app is fully ready" (DB migrations complete, critical tables seeded, all middleware initialized).
A proper /api/health endpoint would check all readiness conditions in one call and return structured status.
Fix: Create /api/health route. Check DB connection, auth middleware, critical seeds. Return 200 with { status: 'ok' } when fully ready, 503 with { status: 'starting', missing: [...] } when not. Replace the curl loop with a single health check wait.
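The "single health check wait" could look roughly like the sketch below. It is a hypothetical helper, not the existing workflow step: the name waitForReady, the 30-attempt limit, and the 2-second delay are all assumptions, and the probe is injected so the loop can be exercised without a running server.

```typescript
// Hypothetical helper: polls a readiness probe until it reports HTTP 200.
type Probe = () => Promise<number>; // resolves to an HTTP status code

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForReady(
  probe: Probe,
  maxAttempts: number = 30, // assumed limit, tune to the real boot time
  delayMs: number = 2000, // assumed delay between attempts
  sleep: (ms: number) => Promise<void> = realSleep,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if ((await probe()) === 200) return attempt; // ready: report attempts used
    } catch {
      // A refused connection while the server boots counts as "not ready yet".
    }
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  throw new Error(`app not ready after ${maxAttempts} attempts`);
}
```

In CI the probe would typically wrap a fetch of /api/health and return the response status. Unlike the current check, which accepts 200/302/307, only a 200 counts as ready here, so a redirect from a half-started app does not pass.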
Engineering Task Spec
Two phases for immediate gaps. Two phases for speed and preview testing when ready.
Phase 1: Flakiness Tracking
Files to create:
| Action | File | Purpose |
|---|---|---|
| Create | apps/dreamineering/drmg-sales-e2e/scripts/parse-flaky.ts | Parse Playwright results, flag retried-then-passed tests |
| Create | apps/dreamineering/drmg-sales-e2e/quarantine.json | Known flaky test registry |
| Modify | .github/workflows/e2e-smoke.yml | Add post-test step to run parse-flaky and upload artifact |
Parser spec:
- Read Playwright HTML report or JSON output after test run
- Flag any test with retry > 0 that eventually passed
- Output flaky test list as artifact
- Optional: CI can exclude quarantined tests from gate
Acceptance: After 10 smoke runs, quarantine.json is empty or contains fewer than 5% of tests.
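A minimal sketch of the parser core, assuming the job writes a report with the Playwright JSON reporter. The interfaces below cover only the fields this sketch reads (nested suites, specs, tests, and per-attempt results with retry and status); the function names and the FlakyEntry shape are hypothetical.

```typescript
// Subset of the Playwright JSON reporter output that this sketch reads.
interface JsonResult { retry: number; status: string }
interface JsonTest { projectName: string; status: string; results: JsonResult[] }
interface JsonSpec { title: string; file: string; tests: JsonTest[] }
interface JsonSuite { title: string; specs?: JsonSpec[]; suites?: JsonSuite[] }
interface JsonReport { suites: JsonSuite[] }

// Hypothetical shape for one flaky finding.
interface FlakyEntry { file: string; title: string; project: string; attempts: number }

// Walk nested suites and collect every spec into one flat list.
function collectSpecs(suites: JsonSuite[], out: JsonSpec[] = []): JsonSpec[] {
  for (const suite of suites) {
    for (const spec of suite.specs ?? []) out.push(spec);
    collectSpecs(suite.suites ?? [], out);
  }
  return out;
}

// A test is flaky when its final attempt passed and that attempt was a retry.
function findFlaky(report: JsonReport): FlakyEntry[] {
  const flaky: FlakyEntry[] = [];
  for (const spec of collectSpecs(report.suites)) {
    for (const test of spec.tests) {
      const last = test.results[test.results.length - 1];
      if (last !== undefined && last.retry > 0 && last.status === "passed") {
        flaky.push({
          file: spec.file,
          title: spec.title,
          project: test.projectName,
          attempts: last.retry + 1,
        });
      }
    }
  }
  return flaky;
}

// Share of tests that needed a retry to pass, for the 5% target.
function flakinessRate(report: JsonReport): number {
  let total = 0;
  for (const spec of collectSpecs(report.suites)) total += spec.tests.length;
  return total === 0 ? 0 : findFlaky(report).length / total;
}
```

The real parse-flaky.ts would read the report file from disk, call findFlaky, and write the list out as the CI artifact. Tests that fail on every attempt are deliberately not counted: they are broken, not flaky.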
Phase 2: Health Endpoint
Files to create/modify:
| Action | File | Purpose |
|---|---|---|
| Create | apps/dreamineering/drmg-sales/app/api/health/route.ts | Structured readiness check |
| Modify | .github/workflows/e2e-smoke.yml | Replace curl loop with health endpoint wait |
Health endpoint spec:
// GET /api/health
// Checks:
// 1. DB connection alive (SELECT 1)
// 2. Auth middleware initialized (session endpoint responds)
// 3. Critical tables exist (schema push completed)
// Returns:
// 200 { status: 'ok', timestamp } — fully ready
// 503 { status: 'starting', missing: ['db'|'auth'|'schema'] } — not ready
Acceptance: The E2E browser job waits for /api/health to return 200 instead of running the curl loop. This should remove flakes caused by tests starting before the app is ready.
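The response contract in the spec can be sketched as a pure function plus a probe runner. This is an illustration under assumptions, not the route itself: summarizeHealth, runChecks, and the probe signatures are hypothetical names, and the actual DB, auth, and schema probes are left to the caller.

```typescript
type CheckName = "db" | "auth" | "schema";
type CheckResults = Record<CheckName, boolean>;

interface HealthResponse {
  httpStatus: 200 | 503;
  body:
    | { status: "ok"; timestamp: string }
    | { status: "starting"; missing: CheckName[] };
}

// Map individual check results onto the 200/503 contract from the spec.
function summarizeHealth(results: CheckResults, now: Date = new Date()): HealthResponse {
  const order: CheckName[] = ["db", "auth", "schema"];
  const missing = order.filter((name) => !results[name]);
  if (missing.length === 0) {
    return { httpStatus: 200, body: { status: "ok", timestamp: now.toISOString() } };
  }
  return { httpStatus: 503, body: { status: "starting", missing } };
}

// Run the three probes in parallel; a probe that throws counts as not ready.
async function runChecks(
  probes: Record<CheckName, () => Promise<boolean>>,
): Promise<CheckResults> {
  const safe = async (probe: () => Promise<boolean>): Promise<boolean> => {
    try {
      return await probe();
    } catch {
      return false;
    }
  };
  const [db, auth, schema] = await Promise.all([
    safe(probes.db),
    safe(probes.auth),
    safe(probes.schema),
  ]);
  return { db, auth, schema };
}
```

A GET handler in route.ts would call runChecks with the real probes (for example a SELECT 1 for db) and return the body with the matching status code. Keeping summarizeHealth pure makes the 200/503 mapping unit-testable without a database.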
Phase 3: Sharding (when E2E exceeds 10 min)
Files to modify:
| Action | File | Change |
|---|---|---|
| Modify | .github/workflows/e2e-smoke.yml | Add matrix strategy: shard: [1/4, 2/4, 3/4, 4/4] |
| Modify | playwright.config.ts | Increase workers to 4 for CI |
Spec: Pass --shard=${{ matrix.shard }} to Playwright. Have each shard write a blob report and upload it as an artifact, then merge the reports from all shards with the Playwright merge-reports CLI before the status check.
Trigger: Only implement when smoke suite wall-clock time consistently exceeds 10 minutes. Current ~15 min E2E job includes build time — measure Playwright-only duration first.
Acceptance: Smoke suite Playwright-only time under 5 minutes with sharding.
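The pieces of the sharding change can be sketched as three small helpers. They are hypothetical (none exist in the repo): shardMatrix produces the matrix values, ciOverrides shows the settings a CI run would layer over the base Playwright config, and shardArgs builds the CLI arguments for one shard, assuming the smoke tier is selected with --grep @smoke.

```typescript
// Values for the workflow matrix, e.g. 4 shards gives "1/4" through "4/4".
function shardMatrix(total: number): string[] {
  if (total < 1 || total % 1 !== 0) {
    throw new RangeError("total must be a positive integer");
  }
  const shards: string[] = [];
  for (let i = 1; i <= total; i++) shards.push(`${i}/${total}`);
  return shards;
}

// Settings a CI run would layer over the base config: 4 workers, and the
// blob reporter so that merge-reports can combine the shard outputs later.
function ciOverrides(isCI: boolean): { workers?: number; reporter: string } {
  return isCI ? { workers: 4, reporter: "blob" } : { reporter: "html" };
}

// CLI arguments for one shard of the smoke suite (grep filter is assumed).
function shardArgs(shard: string): string[] {
  return ["playwright", "test", "--grep", "@smoke", `--shard=${shard}`];
}
```

In playwright.config.ts the override object would be spread into the config when the CI environment variable is set. A final workflow job then downloads every blob artifact and runs merge-reports once to produce the single report used for the status check.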
Phase 4: Preview Deploy Testing (when Vercel is active)
Files to create/modify:
| Action | File | Purpose |
|---|---|---|
| Create | .github/workflows/e2e-preview.yml | Run E2E against Vercel preview URL |
| Modify | playwright.config.ts | Skip webServer when BASE_URL is set (already handled by isProduction flag) |
Spec: Trigger on Vercel deployment success. Wait for preview URL. Run smoke + domain tests against BASE_URL=<preview-url>.
Trigger: Only implement when Vercel preview deploys are active and stable.
Acceptance: PR deploys to Vercel. E2E runs against preview URL. Tests pass against real infrastructure.
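The BASE_URL switch could be expressed as a small resolver like the one below. This is a sketch, not the existing isProduction logic: resolveTarget is a hypothetical name, and the local fallback URL http://localhost:3000 is an assumption.

```typescript
interface TargetConfig {
  baseURL: string;
  startWebServer: boolean; // false when testing an already-deployed preview
}

// Decide where tests point. A non-empty BASE_URL means "remote deployment",
// so the local webServer must not be started.
function resolveTarget(
  env: { BASE_URL?: string },
  localURL: string = "http://localhost:3000", // assumed local default
): TargetConfig {
  const remote = (env.BASE_URL ?? "").trim();
  if (remote.length > 0) {
    // Strip trailing slashes so route constants join cleanly.
    return { baseURL: remote.replace(/\/+$/, ""), startWebServer: false };
  }
  return { baseURL: localURL, startWebServer: true };
}
```

The config would pass the environment object to resolveTarget, use baseURL for the use.baseURL setting, and only define webServer when startWebServer is true.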
Phase Gate Rule
Never add the next phase while the current one is flaky. Each phase must prove stability (green rate above 95% over 7 days) before the next ships. This matches the phase-based rollout principle.
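The gate rule can be checked mechanically. The sketch below assumes run outcomes are available as one record per day; the DailyRuns shape and function names are hypothetical, and fewer than 7 days of history is treated as "not proven yet".

```typescript
interface DailyRuns {
  date: string; // e.g. "2026-03-19"
  passed: number; // green runs that day
  total: number; // all runs that day
}

// Fraction of green runs across the given days (0 when there were no runs).
function greenRate(days: DailyRuns[]): number {
  let passed = 0;
  let total = 0;
  for (const day of days) {
    passed += day.passed;
    total += day.total;
  }
  return total === 0 ? 0 : passed / total;
}

// The next phase may ship only with a full window of history and a green
// rate strictly above the threshold (95% over 7 days per the rule above).
function mayShipNextPhase(
  days: DailyRuns[],
  threshold: number = 0.95,
  windowDays: number = 7,
): boolean {
  if (days.length < windowDays) return false;
  return greenRate(days.slice(-windowDays)) > threshold;
}
```

Pooling runs across the window, rather than averaging daily rates, stops a quiet day with a single red run from dominating the result.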
Cost Reality
Current monthly CI budget (estimated at 5 pushes/day, 22 working days):
| Workflow | Per Run | Monthly Estimate |
|---|---|---|
| e2e-smoke.yml (API + E2E) | ~20 min | ~2,640 min |
| pr-quality-gate.yml | ~6 min | ~440 min |
| monorepo-typecheck.yml | ~10 min | ~300 min |
| Total | | ~3,380 min/month |
GitHub Actions free tier is 2,000 min/month. Current setup exceeds it. Monitor against CI Pipeline Health benchmarks. If budget exceeds 85%, reduce push frequency, increase paths-ignore scope, or evaluate cheaper runners (Ubicloud at ~0.3x cost, self-hosted at ~0x).
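The arithmetic behind the totals, as a small worked example. It uses only the monthly estimates from the table and the 2,000-minute free tier; the helper names are hypothetical.

```typescript
const FREE_TIER_MINUTES = 2000;

// Monthly estimates copied from the table above.
const monthlyMinutes: Record<string, number> = {
  "e2e-smoke.yml": 2640,
  "pr-quality-gate.yml": 440,
  "monorepo-typecheck.yml": 300,
};

// Sum the per-workflow estimates.
function totalMinutes(usage: Record<string, number>): number {
  let total = 0;
  for (const name of Object.keys(usage)) total += usage[name];
  return total;
}

const total = totalMinutes(monthlyMinutes); // 2640 + 440 + 300 = 3380
const utilization = total / FREE_TIER_MINUTES; // 3380 / 2000 = 1.69
const overage = total - FREE_TIER_MINUTES; // 3380 - 2000 = 1380
```

At these estimates the pipeline uses about 169% of the free tier, roughly 1,380 minutes over. Since e2e-smoke.yml accounts for 2,640 of the 3,380 minutes, that is the workflow where any reduction has the largest effect.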
Benchmark Alignment
These benchmarks from Engineering Quality measure pipeline health:
| Benchmark | Threshold | Current Status |
|---|---|---|
| 8.1 Merge loop duration | 10 minutes or less | ~20 min of job time (API 5 + E2E 15); the jobs run in parallel, so wall-clock is about 15 min. Needs sharding or scope reduction |
| 8.2 CI green rate | 95% or higher over 7 days | Unknown — need 7 days of data |
| 8.3 E2E flakiness rate | 5% or less | Unknown — no tracking (Gap 1) |
| 8.4 Push to merge-ready | 15 minutes or less | ~20 min — exceeds target |
| 8.5 Health loop runs daily | 0 missed days | Configured — cron at 07:00 NZDT |
| 8.6 Cache hit rate | 80% or higher | Configured — NX + Playwright caching |
Context
- CI Testing Infrastructure — Two-loop pipeline design
- Engineering Quality Benchmarks — Pass/fail thresholds for CI health
- Testing Strategy — Layer model and selection rules
- Testing Economics — Cost-benefit per test layer
- Developer Experience — CI SLOs from developer perspective
Questions
- What's the real failure rate when retries are stripped away, and what does that reveal about test quality?
- The merge loop takes ~20 minutes but the benchmark says 10 — is the build step or the test step the bottleneck?
- At ~3,380 min/month on a 2,000 min free tier, what's the actual bill — and is it worth switching to Ubicloud or self-hosted?
- Which of the 31 failing tests from 2026-03-19 are flaky versus genuinely broken — and does the answer change the fix priority?