AI Design Quality
How do you make an AI build pages that look right — not just compile right?
"Present in the DOM" is not "visible to a human." Every technique here closes that gap by making the AI evaluate its own output against measurable thresholds before shipping.
Five technique families. Each standalone. Use one or stack them.
Self-Refine Loop
The same AI plays three roles in sequence: builder, critic, fixer. Feed the output back as input until the rubric passes.
GENERATE → Build the component against a spec
CRITIQUE → Score it against a rubric with measurable thresholds
REFINE → Fix each failure, quoting the old line and the new
REPEAT → Re-run critique. Zero failures = stop. Max 3 iterations.
Source: Madaan et al. (2023). 20% average quality improvement. 30% code error reduction (Google Research 2025).
What makes it work:
- The critique step needs a concrete rubric — vague feedback ("improve the design") produces identical output each iteration
- "No issues found" on first critique means the rubric is too soft
- Rotate rubric focus per iteration (visual, then semantic, then performance) to break score plateaus
- Costs 3x tokens. Use only for high-stakes outputs — landing pages, hero sections, CTA blocks
Example critique rubric:
| # | Criterion | Threshold |
|---|---|---|
| 1 | Contrast ratio | >= 4.5:1 text vs background |
| 2 | CTA touch target | >= 44x44px |
| 3 | CTA color exclusivity | Accent color on CTA only |
| 4 | Value prop visibility | Visible without scroll at 768px |
| 5 | Semantic HTML | No div-as-button |
| 6 | Design system compliance | Zero arbitrary bracket values |
| 7 | Five-second test | Reader identifies what/who/action |
| 8 | Mobile-first | Plain utilities = mobile, breakpoint prefixes = larger |
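The generate → critique → refine loop above can be sketched as plain control flow. The `generate`, `critique`, and `refine` callables stand in for model calls and are hypothetical; only the loop shape and the hard iteration cap are the point.

```python
def self_refine(spec, generate, critique, refine, max_iters=3):
    """Run generate -> critique -> refine until the rubric passes.

    generate(spec)            -> artifact
    critique(artifact)        -> list of failure strings (empty = rubric passed)
    refine(artifact, failures) -> revised artifact
    """
    artifact = generate(spec)
    for _ in range(max_iters):
        failures = critique(artifact)
        if not failures:  # zero failures = stop
            break
        artifact = refine(artifact, failures)
    return artifact
```

The cap matters: without it, a rubric the model cannot satisfy loops forever; with it, the cost is bounded at roughly (1 + 2 × max_iters) model calls.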
Four-Perspective Audit
Four specialized evaluators review the same page from different angles. Same artifact, different seat.
| Evaluator | Focus |
|---|---|
| Design Critic | Visual hierarchy, CTA placement, mobile, trust signals |
| Copywriter | Headlines, value props, emotional triggers, awareness level |
| SEO Specialist | Meta tags, heading structure, performance |
| Growth Strategist | Synthesizes findings, assigns A-F grade, prioritizes actions |
Each evaluator scores independently with evidence. The synthesis pass resolves conflicts and produces a single prioritized action list.
Design Critic scoring weights:
| Dimension | Weight |
|---|---|
| Visual hierarchy | 25% |
| CTA design | 25% |
| Trust signals | 20% |
| Mobile experience | 15% |
| Spacing + typography | 15% |
Source: n8n workflow #9545 — four-agent landing page analysis. Full audit in under 90 seconds.
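Combining the Design Critic's per-dimension scores into one number is a straightforward weighted sum. A minimal sketch, assuming each dimension is scored 1-5 and the weights from the table above (the dictionary keys are illustrative names, not part of the workflow):

```python
# Design Critic weights from the table above (must sum to 1.0).
WEIGHTS = {
    "visual_hierarchy": 0.25,
    "cta_design": 0.25,
    "trust_signals": 0.20,
    "mobile_experience": 0.15,
    "spacing_typography": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (1-5) into one weighted 1-5 score."""
    assert scores.keys() == WEIGHTS.keys(), "score every dimension exactly once"
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())
```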
Browser Verification
Three layers of automated verification, from cheapest to richest.
Layer 1: Accessibility Tree
The browser's accessibility tree is a structured, text-based representation of the page. 2-5KB vs 100KB+ for screenshots. Deterministic, fast.
Check:
- All interactive elements have accessible names
- Heading hierarchy (h1 then h2 then h3, no skips)
- Landmark regions present (main, nav, footer)
- All images have alt text
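The heading-hierarchy check reduces to a pure function once the heading levels are extracted from the accessibility tree (extraction via a browser tool such as Playwright is omitted here). A sketch:

```python
def heading_skips(levels):
    """Given heading levels in document order (e.g. [1, 2, 3, 2]),
    return (index, from_level, to_level) for every place the level
    jumps by more than one -- e.g. an h1 followed directly by an h3."""
    skips = []
    for i in range(1, len(levels)):
        if levels[i] > levels[i - 1] + 1:
            skips.append((i, levels[i - 1], levels[i]))
    return skips
```

Going back up (h3 to h2) is legal; only downward jumps of more than one level count as skips.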
Layer 2: Computed Styles
Read actual rendered values from the browser — not what the code says, what the browser computed.
Check:
- Every text element: computed color differs from computed background-color
- Every interactive element: computed width and height >= 44px
- CTA background-color appears on zero non-clickable elements
- document.documentElement.scrollWidth <= window.innerWidth (no horizontal overflow)
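The contrast check is the one most worth implementing exactly, since "colors differ" is not "colors are readable." A self-contained WCAG 2.x contrast-ratio calculation (reading the computed colors from the browser, e.g. via `getComputedStyle`, is omitted):

```python
def _channel(c: int) -> float:
    # sRGB channel (0-255) to linear-light value, per the WCAG 2.x formula.
    s = c / 255
    return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb) -> float:
    r, g, b = (_channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio between two RGB colors.
    4.5:1 is the AA threshold for normal-size text."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)
```

Black on white yields the maximum ratio of 21:1; identical colors yield 1:1, which is exactly the "invisible text" failure Layer 2 exists to catch.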
Layer 3: Screenshot Comparison
Capture at four breakpoints (375px, 768px, 1024px, 1440px). Evaluate each:
- What is the single most prominent element? Is it the value prop or CTA?
- Can you identify what this page offers in 5 seconds?
- Is there visual clutter competing for attention?
- Does the content hierarchy make sense at this width?
- Are there obvious rendering failures (invisible text, broken layout)?
Compare 375px vs 1440px — is the information architecture preserved?
The loop:
Deploy to preview →
Layer 1: Accessibility audit (5 seconds) →
Layer 2: Computed styles audit (10 seconds) →
Layer 3: Screenshot critique at 4 breakpoints (30 seconds) →
Synthesize failures →
Fix in code →
Re-deploy →
Re-verify
Scoring Rubric
A 100-point rubric that produces a letter grade. Makes audits comparable across pages and over time.
| Dimension | Weight | Score 1 | Score 3 | Score 5 |
|---|---|---|---|---|
| Value Clarity | 20 | Cannot identify offer in 10s | Offer clear, benefit vague | Offer + benefit + audience clear in 5s |
| Visual Hierarchy | 15 | Multiple competing focal points | Clear hierarchy, CTA buried | Single focal point per viewport, CTA dominant |
| CTA Design | 15 | CTA below fold, same style as body | CTA visible, color not reserved | CTA above fold, unique color, 44px+ target, centered |
| Social Proof | 10 | None | Generic ("trusted by thousands") | Specific numbers + recognizable names |
| Mobile Experience | 10 | Broken at 375px | Functional but cramped | Full experience preserved, touch-optimized |
| Load Performance | 10 | LCP > 4s | LCP 2.5-4s | LCP < 2.5s |
| Accessibility | 10 | Fails keyboard nav or contrast | Passes basic contrast, partial keyboard | Full WCAG 2.2 AA |
| Copy Quality | 5 | Generic or AI-sounding | Clear but not compelling | Specific, benefit-led, addresses objection |
| Trust Signals | 3 | No indicators | Basic (copyright, privacy link) | Security badges, social proof, money-back |
| Page Weight | 2 | > 3MB | 1-3MB | < 1MB |
Grading: A (90+), B (80-89), C (70-79), D (60-69), F (<60).
Rules: Score 5 requires quoted evidence. Score 1 requires quoted evidence of failure. "Looks good" is not evidence.
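One way to turn the table into the 100-point total and letter grade, assuming a score of 5 earns a dimension's full weight and lower scores earn a proportional share (the dimension names in the test are abbreviations of the table rows):

```python
def rubric_points(scores: dict, weights: dict) -> float:
    """scores: dimension -> 1-5; weights: dimension -> points (sum to 100).
    A score of 5 earns the dimension's full weight."""
    return sum(weights[d] * scores[d] / 5 for d in weights)

def letter_grade(points: float) -> str:
    """Map a 0-100 total to a grade: A 90+, B 80-89, C 70-79, D 60-69, F below 60."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if points >= cutoff:
            return grade
    return "F"
```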
Severity-Rated Audit
Evaluate against UX principles that linters cannot catch. Rate each finding by severity. Fix from the top.
| Severity | Definition | Action |
|---|---|---|
| CRITICAL | Blocks core user task (broken CTA, invisible text, no keyboard nav) | Fix before shipping |
| HIGH | Degrades experience significantly (poor contrast, tiny targets) | Fix before shipping |
| MEDIUM | Noticeable quality issue (inconsistent spacing, missing states) | Fix in next iteration |
| LOW | Polish item (animation timing, micro-interactions) | Backlog |
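"Fix from the top" is a sort plus a split: CRITICAL and HIGH block the ship, everything else queues. A sketch, assuming findings are dictionaries with a `severity` key:

```python
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def triage(findings):
    """Sort findings worst-first, then split into ship blockers
    (CRITICAL / HIGH) and everything that can wait."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])
    blockers = [f for f in ordered if f["severity"] in ("CRITICAL", "HIGH")]
    later = [f for f in ordered if f["severity"] not in ("CRITICAL", "HIGH")]
    return blockers, later
```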
UX principles to evaluate:
- Loading states present
- Visual hierarchy (not everything same weight)
- Keyboard navigation complete
- Spacing consistent
- Error states handled
- Empty states designed
- Touch targets adequate
- Color not sole status indicator
The workflow: Read the code, evaluate against principles, present severity-rated report with line references, fix CRITICAL and HIGH items, re-evaluate to confirm resolution.
Prompt Patterns
Five reusable critique patterns. Copy and adapt.
Rubric-Anchored
Review [artifact] against this rubric. For each criterion:
1. Score 1-5
2. Quote the specific evidence (line number, element, computed value)
3. State what change would raise the score by 1
[Insert rubric table]
"No issues found" means the rubric is too soft. Look harder.
Perspective Rotation
Evaluate this page from three perspectives. Same page, different seat.
FIRST-TIME VISITOR: Can you understand what this offers in 5 seconds?
RETURN VISITOR: Can you find what you came back for in 2 clicks?
COMPETITOR: What would you copy? What would you attack?
For each perspective, list 3 specific observations with element references.
Inversion
You are trying to make this landing page FAIL. List the top 5 ways it
could lose a visitor. For each:
- The specific element or absence that causes the failure
- How a real user would experience it
- The fix (one sentence)
Threshold Gatekeeper
You are a quality gate. This page must pass ALL of these to ship:
[List of measurable thresholds]
For each threshold:
- PASS: quote the evidence
- FAIL: quote the violation and state the fix
If ANY threshold fails, output "HOLD — [count] failures" at the top.
If all pass, output "SHIP" at the top.
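The gate's verdict logic is trivial to make deterministic in code rather than trusting the model to count its own failures. A sketch, assuming each threshold result is a `(name, passed, evidence)` tuple:

```python
def gate(results):
    """results: list of (name, passed, evidence) tuples.
    Returns the verdict line the prompt asks for."""
    failures = [name for name, passed, _ in results if not passed]
    if failures:
        return f"HOLD — {len(failures)} failures"
    return "SHIP"
```

A useful split: let the model produce the evidence, let code produce the verdict.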
Progressive Disclosure
Score this page on progressive disclosure:
1. HERO (0-2s): Does the visitor know WHAT + WHO + ACTION?
2. SCROLL 1 (2-10s): Does the visitor understand the PROBLEM you solve?
3. SCROLL 2 (10-30s): Does the visitor see PROOF it works?
4. SCROLL 3 (30-60s): Is the OBJECTION handled before the final CTA?
5. FINAL CTA: Is it the same action as the hero CTA (consistency)?
Score each stage 1-5. A page that scores 5 on stages 1-2 but 1 on
stages 4-5 is a funnel leak. Identify where the leak is.
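Locating the leak is mechanical once the five stage scores exist. A sketch, treating any stage scoring below 3 as the start of the leak (the threshold of 3 is an assumption, not part of the pattern):

```python
def funnel_leak(stage_scores):
    """stage_scores: five 1-5 scores for HERO through FINAL CTA, in order.
    Returns the 1-based stage where the leak starts, or None if no leak."""
    for stage, score in enumerate(stage_scores, start=1):
        if score < 3:
            return stage
    return None
```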
Context
- Product Design — The audit sequence and situation router
- Rendering Verification — Technical rendering audit
- Landing Page Review — Conversion-focused audit
- Design Review — Solo and team review workflows
- Process Quality Assurance — Deming's 14 points
Links
- Self-Refine: Iterative Refinement with Self-Feedback (paper) — The foundational self-refine pattern
- Self-Refine — Learn Prompting — Practical guide to the pattern
- The Iteration Loop: Generate, Critique, Refine — Applied iteration loop
- Prompt Engineering: The Refinement Loop — Refinement loop mechanics
- Meta-Prompting: LLMs Crafting Their Own Prompts — Meta-prompting techniques
- Meta Prompting — Prompt Engineering Guide — Meta-prompting reference
- From Art to Engineering: A Practical Rubric for GPT-4.1 Prompt Design — Rubric-based prompt evaluation
- Landing Page Analysis with Claude's 4-Agent System — n8n — Multi-agent audit workflow
- I Built a Plugin to Fix AI-Generated Interfaces — XIN HU — UX audit plugin pattern
- I Built a Plugin to Fix AI-Generated Interfaces — DEV Community — Same, community discussion
- Playwright MCP: AI-Powered Browser Automation Guide — Browser automation for AI testing
- Playwright DevTools MCP — Chrome DevTools Protocol for AI
- AI-Driven Design: Playwright MCP Screenshots, Visual Diffs — egghead.io — Visual diff workflow
- AI Is Blind: How Playwright MCP Revolutionizes Visual Testing — Why code-level checks miss visual bugs
- Using Claude Code to Debug Visual Regressions — Vizzly — Visual regression debugging
- Landing Page Design Best Practices — Fermat Commerce — Conversion design patterns
- Landing Page Optimization Guide — Conversion Sciences — CRO methodology
- 12 Days of AI: Landing Page Optimization — Trust Insights — AI-specific landing page optimization
- Landing Page Conversion Tips — landing.report — Data-driven conversion tips
- Landing Page Optimization Best Practices — Prismic — Comprehensive optimization guide
- Visual Hierarchy Mobile Conversion — Mobile visual hierarchy
- UX Optimization for Landing Pages — Landingi — UX-focused conversion
- CRO with AI — Landingi — AI-powered conversion optimization
- CLIP-Driven CTA Optimization Pipeline — Computer vision for CTA optimization
- The Self-Improving Prompt System — Prompts that improve themselves
- Claude Code UX Researcher: Automated Competitor Benchmarking — Automated UX benchmarking
- Enhancing LLM Planning through Intrinsic Self-Critique (paper) — Self-critique for planning quality
- 6 Playwright MCP Servers for AI Testing — Bug0 — MCP server comparison
- Playwright MCP Changes Build vs Buy — Bug0 — Impact on testing strategy
Questions
- Can an AI consistently judge visual quality when it cannot see the rendered page — and if not, what is the minimum verification layer that closes the gap?
- Which layer of browser verification (accessibility tree, computed styles, or screenshots) catches the most failures per second of evaluation time?
- At what point does adding critique iterations produce diminishing returns — and does rotating the rubric focus actually break the plateau?
- If the scoring rubric produces an A grade but real users bounce, what is the rubric missing?