AI Design Quality

How do you make an AI build pages that look right — not just compile right?

"Present in the DOM" is not "visible to a human." Every technique here closes that gap by making the AI evaluate its own output against measurable thresholds before shipping.

Five technique families. Each standalone. Use one or stack them.

Self-Refine Loop

The same AI plays three roles in sequence: builder, critic, fixer. Feed the output back as input until the rubric passes.

GENERATE → Build the component against a spec
CRITIQUE → Score it against a rubric with measurable thresholds
REFINE   → Fix each failure, quoting the old line and the new
REPEAT   → Re-run critique. Zero failures = stop. Max 3 iterations.

Source: Madaan et al. (2023), who report a 20% average quality improvement; Google Research (2025) reports a 30% reduction in code errors.
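The loop fits in a few lines. A minimal sketch, assuming a hypothetical `llm(prompt)` helper that returns model text; the prompt wording and the "PASS" convention are illustrative, not a fixed API:

```python
# Self-refine: the same model generates, critiques, and fixes its own output.
# llm is any callable str -> str; MAX_ITERATIONS caps the token cost at ~3x.

MAX_ITERATIONS = 3

def self_refine(spec: str, rubric: str, llm) -> str:
    draft = llm(f"Build the component against this spec:\n{spec}")
    for _ in range(MAX_ITERATIONS):
        critique = llm(
            f"Score this output against the rubric. "
            f"List each failure with evidence, or say PASS.\n"
            f"Rubric:\n{rubric}\n\nOutput:\n{draft}"
        )
        if "PASS" in critique:  # zero failures: stop early
            break
        draft = llm(
            f"Fix each failure, quoting the old line and the new.\n"
            f"Failures:\n{critique}\n\nOutput:\n{draft}"
        )
    return draft
```

The hard cap matters as much as the loop: without it, a too-soft rubric converges instantly and a too-strict one never terminates.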

What makes it work:

  • The critique step needs a concrete rubric — vague feedback ("improve the design") produces identical output each iteration
  • "No issues found" on first critique means the rubric is too soft
  • Rotate rubric focus per iteration (visual, then semantic, then performance) to break score plateaus
  • Costs 3x tokens. Use only for high-stakes outputs — landing pages, hero sections, CTA blocks

Example critique rubric:

# | Criterion | Threshold
1 | Contrast ratio | >= 4.5:1 text vs background
2 | CTA touch target | >= 44x44px
3 | CTA color exclusivity | Accent color on CTA only
4 | Value prop visibility | Visible without scroll at 768px
5 | Semantic HTML | No div-as-button
6 | Design system compliance | Zero arbitrary bracket values
7 | Five-second test | Reader identifies what/who/action
8 | Mobile-first | Plain utilities = mobile, breakpoint prefixes = larger
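Criterion 1 is fully mechanical. A sketch of the WCAG 2.x contrast computation (per-channel relative luminance, then the ratio), usable as-is inside a critique harness:

```python
# WCAG 2.x contrast ratio between two sRGB colors given as (r, g, b) 0-255.

def relative_luminance(rgb: tuple[int, int, int]) -> float:
    def channel(c: int) -> float:
        c = c / 255
        # Linearize sRGB: low values divide, the rest gamma-expand.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

def passes_contrast(fg, bg, threshold: float = 4.5) -> bool:
    return contrast_ratio(fg, bg) >= threshold
```

Black on white scores 21:1; mid-gray `#777` on white lands just under 4.5:1, which is exactly the kind of near-miss a human eyeball passes and this check fails.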

Four-Perspective Audit

Four specialized evaluators review the same page from different angles. Same artifact, different seat.

Evaluator | Focus
Design Critic | Visual hierarchy, CTA placement, mobile, trust signals
Copywriter | Headlines, value props, emotional triggers, awareness level
SEO Specialist | Meta tags, heading structure, performance
Growth Strategist | Synthesizes findings, assigns A-F grade, prioritizes actions

Each evaluator scores independently with evidence. The synthesis pass resolves conflicts and produces a single prioritized action list.

Design Critic scoring weights:

Dimension | Weight
Visual hierarchy | 25%
CTA design | 25%
Trust signals | 20%
Mobile experience | 15%
Spacing + typography | 15%
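A minimal sketch of the weighted roll-up, assuming each dimension is scored 1-5; the dictionary keys are illustrative names for the table rows:

```python
# Design Critic roll-up: weighted average of 1-5 dimension scores.

CRITIC_WEIGHTS = {
    "visual_hierarchy": 0.25,
    "cta_design": 0.25,
    "trust_signals": 0.20,
    "mobile_experience": 0.15,
    "spacing_typography": 0.15,
}

def critic_score(scores: dict[str, int]) -> float:
    """Weighted average, still on the 1-5 scale."""
    return sum(CRITIC_WEIGHTS[dim] * scores[dim] for dim in CRITIC_WEIGHTS)
```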

Source: n8n workflow #9545 — four-agent landing page analysis. Full audit in under 90 seconds.

Browser Verification

Three layers of automated verification, from cheapest to richest.

Layer 1: Accessibility Tree

The browser's accessibility tree is a structured, text-based representation of the page. 2-5KB vs 100KB+ for screenshots. Deterministic, fast.

Check:

  • All interactive elements have accessible names
  • Heading hierarchy (h1 then h2 then h3, no skips)
  • Landmark regions present (main, nav, footer)
  • All images have alt text
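The checks above run directly on the tree dump. A sketch assuming a simplified node shape (`role`, `name`) and a flat list of heading levels; real tools such as Playwright or axe-core emit richer structures:

```python
# Layer 1 checks over a simplified accessibility-tree dump.

def heading_skips(levels: list[int]) -> list[tuple[int, int]]:
    """Return (previous, current) pairs where the hierarchy skips a level."""
    return [
        (prev, cur)
        for prev, cur in zip(levels, levels[1:])
        if cur > prev + 1
    ]

def unnamed_interactive(nodes: list[dict]) -> list[dict]:
    """Interactive elements missing an accessible name."""
    interactive = {"button", "link", "textbox", "checkbox"}
    return [n for n in nodes if n["role"] in interactive and not n.get("name")]
```

Going back up (h3 to an h2) is legal; only downward jumps count as skips.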

Layer 2: Computed Styles

Read actual rendered values from the browser — not what the code says, what the browser computed.

Check:

  • Every text element: computed color differs from computed background-color
  • Every interactive element: computed width and height >= 44px
  • CTA background-color appears on zero non-clickable elements
  • document.scrollWidth <= window.innerWidth (no horizontal overflow)
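Once the computed values are read out of the browser (via `getComputedStyle` and `getBoundingClientRect`), the thresholds are plain comparisons. A sketch assuming an illustrative element dict, not a real browser API; the overflow check is a single `scrollWidth` vs `innerWidth` comparison and is omitted:

```python
# Layer 2 thresholds applied to values already extracted from the browser.
# Element shape (id, color, background, width, height, flags) is assumed.

def style_failures(elements: list[dict], cta_color: str) -> list[str]:
    failures = []
    for el in elements:
        if el.get("text") and el.get("color") == el.get("background"):
            failures.append(f"{el['id']}: text color equals background")
        if el.get("interactive") and (
            el.get("width", 0) < 44 or el.get("height", 0) < 44
        ):
            failures.append(f"{el['id']}: touch target below 44x44px")
        if not el.get("clickable") and el.get("background") == cta_color:
            failures.append(f"{el['id']}: CTA accent color on non-clickable element")
    return failures
```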

Layer 3: Screenshot Comparison

Capture at four breakpoints (375px, 768px, 1024px, 1440px). Evaluate each:

  1. What is the single most prominent element? Is it the value prop or CTA?
  2. Can you identify what this page offers in 5 seconds?
  3. Is there visual clutter competing for attention?
  4. Does the content hierarchy make sense at this width?
  5. Are there obvious rendering failures (invisible text, broken layout)?

Compare 375px vs 1440px — is the information architecture preserved?

The loop:

Deploy to preview →
Layer 1: Accessibility audit (5 seconds) →
Layer 2: Computed styles audit (10 seconds) →
Layer 3: Screenshot critique at 4 breakpoints (30 seconds) →
Synthesize failures →
Fix in code →
Re-deploy →
Re-verify

Scoring Rubric

A 100-point rubric that produces a letter grade. Makes audits comparable across pages and over time.

Dimension | Weight | Score 1 | Score 3 | Score 5
Value Clarity | 20 | Cannot identify offer in 10s | Offer clear, benefit vague | Offer + benefit + audience clear in 5s
Visual Hierarchy | 15 | Multiple competing focal points | Clear hierarchy, CTA buried | Single focal point per viewport, CTA dominant
CTA Design | 15 | CTA below fold, same style as body | CTA visible, color not reserved | CTA above fold, unique color, 44px+ target, centered
Social Proof | 10 | None | Generic ("trusted by thousands") | Specific numbers + recognizable names
Mobile Experience | 10 | Broken at 375px | Functional but cramped | Full experience preserved, touch-optimized
Load Performance | 10 | LCP > 4s | LCP 2.5-4s | LCP < 2.5s
Accessibility | 10 | Fails keyboard nav or contrast | Passes basic contrast, partial keyboard | Full WCAG 2.2 AA
Copy Quality | 5 | Generic or AI-sounding | Clear but not compelling | Specific, benefit-led, addresses objection
Trust Signals | 3 | No indicators | Basic (copyright, privacy link) | Security badges, social proof, money-back
Page Weight | 2 | > 3MB | 1-3MB | < 1MB

Grading: A (90+), B (80-89), C (70-79), D (60-69), F (<60).

Rules: Score 5 requires quoted evidence. Score 1 requires quoted evidence of failure. "Looks good" is not evidence.
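The roll-up to a letter grade, assuming each 1-5 score scales linearly into its dimension's weight so that a perfect sheet totals 100 points:

```python
# 100-point rubric: 1-5 scores scaled by weight, then mapped to a grade.

RUBRIC_WEIGHTS = {
    "value_clarity": 20, "visual_hierarchy": 15, "cta_design": 15,
    "social_proof": 10, "mobile_experience": 10, "load_performance": 10,
    "accessibility": 10, "copy_quality": 5, "trust_signals": 3,
    "page_weight": 2,
}

def rubric_total(scores: dict[str, int]) -> float:
    """Sum of weight * score/5 per dimension; max is 100."""
    return sum(RUBRIC_WEIGHTS[d] * scores[d] / 5 for d in RUBRIC_WEIGHTS)

def grade(total: float) -> str:
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if total >= cutoff:
            return letter
    return "F"
```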

Severity-Rated Audit

Evaluate against UX principles that linters cannot catch. Rate each finding by severity. Fix from the top.

Severity | Definition | Action
CRITICAL | Blocks core user task (broken CTA, invisible text, no keyboard nav) | Fix before shipping
HIGH | Degrades experience significantly (poor contrast, tiny targets) | Fix before shipping
MEDIUM | Noticeable quality issue (inconsistent spacing, missing states) | Fix in next iteration
LOW | Polish item (animation timing, micro-interactions) | Backlog

UX principles to evaluate:

  • Loading states present
  • Visual hierarchy (not everything same weight)
  • Keyboard navigation complete
  • Spacing consistent
  • Error states handled
  • Empty states designed
  • Touch targets adequate
  • Color not sole status indicator

The workflow: Read the code, evaluate against principles, present severity-rated report with line references, fix CRITICAL and HIGH items, re-evaluate to confirm resolution.
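The triage order and the ship gate reduce to a sort and a predicate. A sketch with an illustrative finding shape:

```python
# Severity gate: order findings worst-first, block shipping while any
# CRITICAL or HIGH item remains.

SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def triage(findings: list[dict]) -> list[dict]:
    """Findings sorted worst-first, so fixes start from the top."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]])

def can_ship(findings: list[dict]) -> bool:
    return not any(f["severity"] in ("CRITICAL", "HIGH") for f in findings)
```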

Prompt Patterns

Five reusable critique patterns. Copy and adapt.

Rubric-Anchored

Review [artifact] against this rubric. For each criterion:
1. Score 1-5
2. Quote the specific evidence (line number, element, computed value)
3. State what change would raise the score by 1

[Insert rubric table]

"No issues found" means the rubric is too soft. Look harder.

Perspective Rotation

Evaluate this page from three perspectives. Same page, different seat.

FIRST-TIME VISITOR: Can you understand what this offers in 5 seconds?
RETURN VISITOR: Can you find what you came back for in 2 clicks?
COMPETITOR: What would you copy? What would you attack?

For each perspective, list 3 specific observations with element references.

Inversion

You are trying to make this landing page FAIL. List the top 5 ways it
could lose a visitor. For each:
- The specific element or absence that causes the failure
- How a real user would experience it
- The fix (one sentence)

Threshold Gatekeeper

You are a quality gate. This page must pass ALL of these to ship:
[List of measurable thresholds]

For each threshold:
- PASS: quote the evidence
- FAIL: quote the violation and state the fix

If ANY threshold fails, output "HOLD — [count] failures" at the top.
If all pass, output "SHIP" at the top.

Progressive Disclosure

Score this page on progressive disclosure:
1. HERO (0-2s): Does the visitor know WHAT + WHO + ACTION?
2. SCROLL 1 (2-10s): Does the visitor understand the PROBLEM you solve?
3. SCROLL 2 (10-30s): Does the visitor see PROOF it works?
4. SCROLL 3 (30-60s): Is the OBJECTION handled before the final CTA?
5. FINAL CTA: Is it the same action as the hero CTA (consistency)?

Score each stage 1-5. A page that scores 5 on stages 1-2 but 1 on
stages 4-5 is a funnel leak. Identify where the leak is.

Context

Questions

Can an AI consistently judge visual quality when it cannot see the rendered page — and if not, what is the minimum verification layer that closes the gap?

  • Which layer of browser verification (accessibility tree, computed styles, or screenshots) catches the most failures per second of evaluation time?
  • At what point does adding critique iterations produce diminishing returns — and does rotating the rubric focus actually break the plateau?
  • If the scoring rubric produces an A grade but real users bounce, what is the rubric missing?