AI Design Quality

How do you make an AI build pages that look right — not just compile right?

"Present in the DOM" is not "visible to a human." Every technique here closes that gap by making the AI evaluate its own output against measurable thresholds before shipping.

Five technique families. Each standalone. Use one or stack them.

Self-Refine Loop

The same AI plays three roles in sequence: builder, critic, fixer. Feed the output back as input until the rubric passes.

GENERATE → Build the component against a spec
CRITIQUE → Score it against a rubric with measurable thresholds
REFINE   → Fix each failure, quoting the old line and the new
REPEAT   → Re-run critique. Zero failures = stop. Max 3 iterations.

Source: Madaan et al. (2023), who report a 20% average quality improvement; Google Research (2025) reports a 30% reduction in code errors.
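The loop fits in a few lines. A minimal sketch, assuming a hypothetical `llm(prompt)` helper that returns model text; the prompt wording and the "PASS" convention are illustrative, not a fixed API:

```python
# Self-refine: the same model generates, critiques, and fixes its own output.
# llm is any callable str -> str; MAX_ITERATIONS caps the token cost at ~3x.

MAX_ITERATIONS = 3

def self_refine(spec: str, rubric: str, llm) -> str:
    draft = llm(f"Build the component against this spec:\n{spec}")
    for _ in range(MAX_ITERATIONS):
        critique = llm(
            f"Score this output against the rubric. "
            f"List each failure with evidence, or say PASS.\n"
            f"Rubric:\n{rubric}\n\nOutput:\n{draft}"
        )
        if "PASS" in critique:  # zero failures: stop early
            break
        draft = llm(
            f"Fix each failure, quoting the old line and the new.\n"
            f"Failures:\n{critique}\n\nOutput:\n{draft}"
        )
    return draft
```

The hard cap matters as much as the loop: without it, a too-soft rubric converges instantly and a too-strict one never terminates.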

What makes it work:

  • The critique step needs a concrete rubric — vague feedback ("improve the design") produces identical output each iteration
  • "No issues found" on first critique means the rubric is too soft
  • Rotate rubric focus per iteration (visual, then semantic, then performance) to break score plateaus
  • Costs 3x tokens. Use only for high-stakes outputs — landing pages, hero sections, CTA blocks

Example critique rubric:

# | Criterion | Threshold
1 | Contrast ratio | >= 4.5:1 text vs background
2 | CTA touch target | >= 44x44px
3 | CTA color exclusivity | Accent color on CTA only
4 | Value prop visibility | Visible without scroll at 768px
5 | Semantic HTML | No div-as-button
6 | Design system compliance | Zero arbitrary bracket values
7 | Five-second test | Reader identifies what/who/action
8 | Mobile-first | Plain utilities = mobile, breakpoint prefixes = larger
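Criterion 1 is fully mechanical. A sketch of the WCAG 2.x contrast computation (per-channel relative luminance, then the ratio), usable as-is inside a critique harness:

```python
# WCAG 2.x contrast ratio between two sRGB colors given as (r, g, b) 0-255.

def relative_luminance(rgb: tuple[int, int, int]) -> float:
    def channel(c: int) -> float:
        c = c / 255
        # Linearize sRGB: low values divide, the rest gamma-expand.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

def passes_contrast(fg, bg, threshold: float = 4.5) -> bool:
    return contrast_ratio(fg, bg) >= threshold
```

Black on white scores 21:1; mid-gray `#777` on white lands just under 4.5:1, which is exactly the kind of near-miss a human eyeball passes and this check fails.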

Four-Perspective Audit

Four specialized evaluators review the same page from different angles. Same artifact, different seat.

Evaluator | Focus
Design Critic | Visual hierarchy, CTA placement, mobile, trust signals
Copywriter | Headlines, value props, emotional triggers, awareness level
SEO Specialist | Meta tags, heading structure, performance
Growth Strategist | Synthesizes findings, assigns A-F grade, prioritizes actions

Each evaluator scores independently with evidence. The synthesis pass resolves conflicts and produces a single prioritized action list.

Design Critic scoring weights:

Dimension | Weight
Visual hierarchy | 25%
CTA design | 25%
Trust signals | 20%
Mobile experience | 15%
Spacing + typography | 15%
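A minimal sketch of the weighted roll-up, assuming each dimension is scored 1-5; the dictionary keys are illustrative names for the table rows:

```python
# Design Critic roll-up: weighted average of 1-5 dimension scores.

CRITIC_WEIGHTS = {
    "visual_hierarchy": 0.25,
    "cta_design": 0.25,
    "trust_signals": 0.20,
    "mobile_experience": 0.15,
    "spacing_typography": 0.15,
}

def critic_score(scores: dict[str, int]) -> float:
    """Weighted average, still on the 1-5 scale."""
    return sum(CRITIC_WEIGHTS[dim] * scores[dim] for dim in CRITIC_WEIGHTS)
```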

Source: n8n workflow #9545 — four-agent landing page analysis. Full audit in under 90 seconds.

Browser Verification

Three layers of automated verification, from cheapest to richest.

Layer 1: Accessibility Tree

The browser's accessibility tree is a structured, text-based representation of the page. 2-5KB vs 100KB+ for screenshots. Deterministic, fast.

Check:

  • All interactive elements have accessible names
  • Heading hierarchy (h1 then h2 then h3, no skips)
  • Landmark regions present (main, nav, footer)
  • All images have alt text
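The checks above run directly on the tree dump. A sketch assuming a simplified node shape (`role`, `name`) and a flat list of heading levels; real tools such as Playwright or axe-core emit richer structures:

```python
# Layer 1 checks over a simplified accessibility-tree dump.

def heading_skips(levels: list[int]) -> list[tuple[int, int]]:
    """Return (previous, current) pairs where the hierarchy skips a level."""
    return [
        (prev, cur)
        for prev, cur in zip(levels, levels[1:])
        if cur > prev + 1
    ]

def unnamed_interactive(nodes: list[dict]) -> list[dict]:
    """Interactive elements missing an accessible name."""
    interactive = {"button", "link", "textbox", "checkbox"}
    return [n for n in nodes if n["role"] in interactive and not n.get("name")]
```

Going back up (h3 to an h2) is legal; only downward jumps count as skips.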

Layer 2: Computed Styles

Read actual rendered values from the browser — not what the code says, what the browser computed.

Check:

  • Every text element: computed color differs from computed background-color
  • Every interactive element: computed width and height >= 44px
  • CTA background-color appears on zero non-clickable elements
  • document.scrollWidth <= window.innerWidth (no horizontal overflow)
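Once the computed values are read out of the browser (via `getComputedStyle` and `getBoundingClientRect`), the thresholds are plain comparisons. A sketch assuming an illustrative element dict, not a real browser API; the overflow check is a single `scrollWidth` vs `innerWidth` comparison and is omitted:

```python
# Layer 2 thresholds applied to values already extracted from the browser.
# Element shape (id, color, background, width, height, flags) is assumed.

def style_failures(elements: list[dict], cta_color: str) -> list[str]:
    failures = []
    for el in elements:
        if el.get("text") and el.get("color") == el.get("background"):
            failures.append(f"{el['id']}: text color equals background")
        if el.get("interactive") and (
            el.get("width", 0) < 44 or el.get("height", 0) < 44
        ):
            failures.append(f"{el['id']}: touch target below 44x44px")
        if not el.get("clickable") and el.get("background") == cta_color:
            failures.append(f"{el['id']}: CTA accent color on non-clickable element")
    return failures
```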

Layer 3: Screenshot Comparison

Capture at four breakpoints (375px, 768px, 1024px, 1440px). Evaluate each:

  1. What is the single most prominent element? Is it the value prop or CTA?
  2. Can you identify what this page offers in 5 seconds?
  3. Is there visual clutter competing for attention?
  4. Does the content hierarchy make sense at this width?
  5. Are there obvious rendering failures (invisible text, broken layout)?

Compare 375px vs 1440px — is the information architecture preserved?

The loop:

Deploy to preview →
Layer 1: Accessibility audit (5 seconds) →
Layer 2: Computed styles audit (10 seconds) →
Layer 3: Screenshot critique at 4 breakpoints (30 seconds) →
Synthesize failures →
Fix in code →
Re-deploy →
Re-verify

Scoring Rubric

A 100-point rubric that produces a letter grade. Makes audits comparable across pages and over time.

Dimension | Weight | Score 1 | Score 3 | Score 5
Value Clarity | 20 | Cannot identify offer in 10s | Offer clear, benefit vague | Offer + benefit + audience clear in 5s
Visual Hierarchy | 15 | Multiple competing focal points | Clear hierarchy, CTA buried | Single focal point per viewport, CTA dominant
CTA Design | 15 | CTA below fold, same style as body | CTA visible, color not reserved | CTA above fold, unique color, 44px+ target, centered
Social Proof | 10 | None | Generic ("trusted by thousands") | Specific numbers + recognizable names
Mobile Experience | 10 | Broken at 375px | Functional but cramped | Full experience preserved, touch-optimized
Load Performance | 10 | LCP > 4s | LCP 2.5-4s | LCP < 2.5s
Accessibility | 10 | Fails keyboard nav or contrast | Passes basic contrast, partial keyboard | Full WCAG 2.2 AA
Copy Quality | 5 | Generic or AI-sounding | Clear but not compelling | Specific, benefit-led, addresses objection
Trust Signals | 3 | No indicators | Basic (copyright, privacy link) | Security badges, social proof, money-back
Page Weight | 2 | > 3MB | 1-3MB | < 1MB

Grading: A (90+), B (80-89), C (70-79), D (60-69), F (<60).

Rules: Score 5 requires quoted evidence. Score 1 requires quoted evidence of failure. "Looks good" is not evidence.
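The roll-up to a letter grade, assuming each 1-5 score scales linearly into its dimension's weight so that a perfect sheet totals 100 points:

```python
# 100-point rubric: 1-5 scores scaled by weight, then mapped to a grade.

RUBRIC_WEIGHTS = {
    "value_clarity": 20, "visual_hierarchy": 15, "cta_design": 15,
    "social_proof": 10, "mobile_experience": 10, "load_performance": 10,
    "accessibility": 10, "copy_quality": 5, "trust_signals": 3,
    "page_weight": 2,
}

def rubric_total(scores: dict[str, int]) -> float:
    """Sum of weight * score/5 per dimension; max is 100."""
    return sum(RUBRIC_WEIGHTS[d] * scores[d] / 5 for d in RUBRIC_WEIGHTS)

def grade(total: float) -> str:
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if total >= cutoff:
            return letter
    return "F"
```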

Severity-Rated Audit

Evaluate against UX principles that linters cannot catch. Rate each finding by severity. Fix from the top.

Severity | Definition | Action
CRITICAL | Blocks core user task (broken CTA, invisible text, no keyboard nav) | Fix before shipping
HIGH | Degrades experience significantly (poor contrast, tiny targets) | Fix before shipping
MEDIUM | Noticeable quality issue (inconsistent spacing, missing states) | Fix in next iteration
LOW | Polish item (animation timing, micro-interactions) | Backlog

UX principles to evaluate:

  • Loading states present
  • Visual hierarchy (not everything same weight)
  • Keyboard navigation complete
  • Spacing consistent
  • Error states handled
  • Empty states designed
  • Touch targets adequate
  • Color not sole status indicator

The workflow: Read the code, evaluate against principles, present severity-rated report with line references, fix CRITICAL and HIGH items, re-evaluate to confirm resolution.
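The triage order and the ship gate reduce to a sort and a predicate. A sketch with an illustrative finding shape:

```python
# Severity gate: order findings worst-first, block shipping while any
# CRITICAL or HIGH item remains.

SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def triage(findings: list[dict]) -> list[dict]:
    """Findings sorted worst-first, so fixes start from the top."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])

def can_ship(findings: list[dict]) -> bool:
    return not any(f["severity"] in ("CRITICAL", "HIGH") for f in findings)
```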

Prompt Patterns

Five reusable critique patterns. Copy and adapt.

Rubric-Anchored

Review [artifact] against this rubric. For each criterion:
1. Score 1-5
2. Quote the specific evidence (line number, element, computed value)
3. State what change would raise the score by 1

[Insert rubric table]

"No issues found" means the rubric is too soft. Look harder.

Perspective Rotation

Evaluate this page from three perspectives. Same page, different seat.

FIRST-TIME VISITOR: Can you understand what this offers in 5 seconds?
RETURN VISITOR: Can you find what you came back for in 2 clicks?
COMPETITOR: What would you copy? What would you attack?

For each perspective, list 3 specific observations with element references.

Inversion

You are trying to make this landing page FAIL. List the top 5 ways it
could lose a visitor. For each:
- The specific element or absence that causes the failure
- How a real user would experience it
- The fix (one sentence)

Threshold Gatekeeper

You are a quality gate. This page must pass ALL of these to ship:
[List of measurable thresholds]

For each threshold:
- PASS: quote the evidence
- FAIL: quote the violation and state the fix

If ANY threshold fails, output "HOLD — [count] failures" at the top.
If all pass, output "SHIP" at the top.

Progressive Disclosure

Score this page on progressive disclosure:
1. HERO (0-2s): Does the visitor know WHAT + WHO + ACTION?
2. SCROLL 1 (2-10s): Does the visitor understand the PROBLEM you solve?
3. SCROLL 2 (10-30s): Does the visitor see PROOF it works?
4. SCROLL 3 (30-60s): Is the OBJECTION handled before the final CTA?
5. FINAL CTA: Is it the same action as the hero CTA (consistency)?

Score each stage 1-5. A page that scores 5 on stages 1-2 but 1 on
stages 4-5 is a funnel leak. Identify where the leak is.

Context

Questions

Can an AI consistently judge visual quality when it cannot see the rendered page — and if not, what is the minimum verification layer that closes the gap?

  • Which layer of browser verification (accessibility tree, computed styles, or screenshots) catches the most failures per second of evaluation time?
  • At what point does adding critique iterations produce diminishing returns — and does rotating the rubric focus actually break the plateau?
  • If the scoring rubric produces an A grade but real users bounce, what is the rubric missing?