Guide · comparison

screenshot to code vs real design tokens: why vision guesses miss the real button

July 4, 2026 · 7 min read

Drop a screenshot into an AI tool and ask for the code. You get something back in seconds, and it looks right at a glance. Then you notice the primary button is the wrong one, the blue is close but not the real blue, and the spacing between cards is a number the model made up. This is the core problem with screenshot to code: a flat image is a lossy record of a design, and the model has to guess back everything the pixels dropped. There is a better input available, and most tools ignore half of it.

what a screenshot actually throws away

A screenshot is the final render with all the reasoning stripped out. The research on this is blunt. One arXiv study of screenshot-to-code models catalogs three recurring failure modes: element omission, where components go missing; element distortion, where shape, size, or color comes out wrong; and element misarrangement, where things land in the wrong place. Those are not edge cases. They are what happens when a model has to reconstruct hierarchy, nesting, and spatial relationships that were never written down in the pixels.

The failure gets worse as the design gets richer. The MiroMiro team put it well: AI screenshot to code infers a UI from pixels, so the result drifts the more complex the design gets. For a clean, simple layout you get a usable rough draft in seconds. For a real product page with overlapping layers, states, and a considered type scale, you get something that is close but off, and off in ways you have to hand-fix line by line.

Three things a screenshot can never tell you on its own:

The real primary button. Two buttons can look similar in a still image. Which one is the actual call to action depends on fill, weight, contrast against its background, and where it sits in the flow. A guess picks wrong often enough to matter.
The exact values. A model reading pixels estimates #635bff as "some purple." It cannot read the true hex, the real font weight, or the shadow blur radius, because antialiasing and compression have already blurred them.
The true spacing scale. Good design systems run on a scale, such as 4px, 8px, 16px, 24px. Eyeballing gaps from an image gives you 15px here and 23px there, and the scale that made the page feel coherent is gone.
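To make the spacing point concrete, here is a minimal sketch of what "eyeballed" gaps look like next to a scale. It assumes the scale is already known (the 4, 8, 16, 24 steps from the example above), which is exactly the information a screenshot does not carry. The names SCALE and snapToScale are made up for illustration.

```typescript
// Hypothetical spacing scale, taken from the example values in the text.
const SCALE = [4, 8, 16, 24];

// Return the scale step closest to a measured gap.
// Ties keep the earlier (smaller) step. Expects a non-empty scale.
function snapToScale(px: number, scale: number[] = SCALE): number {
  return scale.reduce((best, step) =>
    Math.abs(step - px) < Math.abs(best - px) ? step : best
  );
}

// Gaps as a vision model might estimate them from pixels.
const eyeballed = [15, 23, 9, 5];
const snapped = eyeballed.map((px) => snapToScale(px));
console.log(snapped); // → [16, 24, 8, 4]
```

Snapping only works when you already have the scale. From pixels alone, 15px and 23px look like legitimate values, and nothing in the image says they should have been 16px and 24px.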

the other extreme: DOM-only extractors

The obvious fix is to stop guessing and read the real thing. Tools that walk the rendered DOM and pull computed styles get you exact values, because getComputedStyle returns what the browser actually resolved: the real color (reported as an rgb() string that converts exactly to hex), the real font stack, the real padding. This is the honest core of the anti-screenshot argument. Click a section, get the CSS the site actually ships, no drift, no inference. For grabbing a palette or a spacing scale, this is genuinely more accurate than any vision model, and the broader token-extraction space has settled on this method for good reason.
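A rough sketch of the DOM-side read, assuming nothing about how any particular extractor is built. The helper names (readTokens, rgbToHex, TOKEN_PROPS) and the property list are illustrative. The style lookup is passed in as a function so the logic can run outside a browser; in a page you would hand it the real getComputedStyle.

```typescript
// Minimal shape of what getComputedStyle returns; only the method used here.
interface StyleLike {
  getPropertyValue(name: string): string;
}

// Illustrative property list, not an exhaustive token set.
const TOKEN_PROPS = [
  "color",
  "background-color",
  "font-family",
  "font-weight",
  "padding-top",
  "border-radius",
  "box-shadow",
];

// Read the resolved value of each property for one element.
function readTokens<T>(
  el: T,
  getStyle: (el: T) => StyleLike,
  props: string[] = TOKEN_PROPS
): Record<string, string> {
  const style = getStyle(el);
  const tokens: Record<string, string> = {};
  for (const prop of props) {
    tokens[prop] = style.getPropertyValue(prop).trim();
  }
  return tokens;
}

// Computed colors come back as rgb()/rgba() strings, so convert to hex.
// Alpha is ignored in this sketch. Returns null for anything else.
function rgbToHex(rgb: string): string | null {
  const m = rgb.match(/^rgba?\(\s*(\d+)[,\s]+(\d+)[,\s]+(\d+)/);
  if (!m) return null;
  return (
    "#" +
    m
      .slice(1, 4)
      .map((n) => Number(n).toString(16).padStart(2, "0"))
      .join("")
  );
}

// In a browser:
//   const button = document.querySelector("button");
//   if (button) console.log(readTokens(button, (el) => getComputedStyle(el)));
```

Note what this does and does not give you: an exact value for every property asked about, and no hint about which element deserved the question.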

But reading only the DOM has its own blind spot, and it is the mirror image of the screenshot problem. The DOM tells you every value that exists. It does not tell you which values matter, or what the page looks like once it renders. A stylesheet lists forty colors, including the ones that never paint a visible pixel. It lists a "primary" class that a redesign quietly demoted. It has no opinion on whether the hero is centered or split, because that is a rendered outcome of flexbox and viewport width, not a single readable property. And it treats a decorative background illustration as just another node, when what you actually need is a description you can recreate as an original rather than an asset to copy.

So the two common approaches fail in opposite directions. Vision sees the render but guesses the values. DOM reads the values but misses the render.

read the render and the DOM together

uiscanner does both on purpose. It reads the rendered page (vision) and the DOM in the same pass, then reconciles them. The DOM owns the values: the exact colors, fonts, type scale, radii, spacing, shadows, and motion. Vision owns the semantics: which button is truly primary, whether the hero is centered or split, and which imagery is decorative. Each side covers the other's blind spot.

That reconciliation is where the real button gets identified correctly. The DOM might list three candidates; the render shows which one dominates. It is also how decorative imagery gets handled honestly. uiscanner describes it and writes a prompt to recreate it as an original, rather than pulling the asset bytes. That is a deliberate line: uiscanner is transformative by design. It describes and tokenizes a page and never copies or re-hosts the original site's assets. You rebuild in your own stack from the tokens and structure.
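One way to picture "the DOM lists the candidates, the render shows which one dominates" is a toy scoring pass. This is not uiscanner's actual algorithm, which the text does not describe. It is a hypothetical heuristic that combines an exact fill color (a DOM value) with what the button is painted on and how big it renders (render facts), using the WCAG 2.x contrast-ratio formula. All type and field names are invented for the sketch.

```typescript
type RGB = [number, number, number];

// One button candidate. Field names are hypothetical, not uiscanner's schema.
interface Candidate {
  label: string;
  fill: RGB;      // background color, an exact value read from the DOM
  backdrop: RGB;  // color the button is painted on, observed in the render
  width: number;  // rendered box size in CSS pixels
  height: number;
}

// WCAG 2.x relative luminance of an sRGB color.
function luminance([r, g, b]: RGB): number {
  const linear = (v: number): number => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b);
}

// WCAG contrast ratio, from 1 (identical colors) to 21 (black on white).
function contrastRatio(a: RGB, b: RGB): number {
  const la = luminance(a);
  const lb = luminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Illustrative score: contrast against the backdrop, weighted by rendered size.
function pickPrimary(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const score =
      contrastRatio(c.fill, c.backdrop) * Math.sqrt(c.width * c.height);
    if (score > bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best;
}

const white: RGB = [255, 255, 255];
const candidates: Candidate[] = [
  { label: "Contact sales", fill: [240, 240, 240], backdrop: white, width: 200, height: 48 },
  { label: "Start now", fill: [99, 91, 255], backdrop: white, width: 160, height: 44 },
  { label: "Sign in", fill: white, backdrop: white, width: 90, height: 36 },
];
console.log(pickPrimary(candidates)?.label); // → "Start now"
```

The larger gray button loses to the smaller filled one because contrast against the backdrop is a rendered outcome. Neither the class name nor the raw fill value settles it alone.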

Aspect	screenshot to code	DOM-only extractor	uiscanner
Reads	pixels only	computed styles only	the rendered page plus the DOM
Exact hex and font	guessed	exact	exact (DOM owns values)
Real primary button	often wrong	ambiguous	resolved from the render
Centered vs split hero	inferred	not visible in CSS	read from the render
Decorative imagery	redrawn or hallucinated	copied as an asset	described, recreated as an original
Output	code you fix by hand	a values panel	tokens, structure, and a build prompt

the deliverable is a brief, not a panel

The point of reading both sides is what you do with the result. A screenshot tool hands you code to repair. A DOM extractor hands you a panel of values to retype. uiscanner hands your agent a build brief: the tokens, a section-by-section map of how the page is assembled, and a build prompt tailored to an archetype (landing, dashboard, marketing, ecommerce, portfolio, mobile, or general). That last part matters for AI coding, because the thing writing your code never saw the panel. It needs structured input.

That is why uiscanner lives as an MCP server and a CLI, not just a web app. One line wires it into Claude Code, Cursor, or Codex:

claude mcp add uiscanner -- npx -y uiscanner-mcp@latest

Or run a one-off scan straight from the terminal:

npx -y uiscanner-mcp@latest scan stripe.com --target landing

The MCP surface gives your agent five tools: ui_teardown to scan a URL, ui_probe for a cheap preflight that spends no scan, get_teardown to re-fetch a finished one, retarget_teardown to re-tailor the build prompt to another archetype without re-scanning, and whoami to check your plan. Every scan returns an id and a shareable uiscanner.com/t/<id> link. Setup for every client lives at https://uiscanner.com/mcp, and there is a Chrome extension too.

where each approach actually fits

None of this makes screenshot to code useless. If you have only an image, a Dribbble shot or a competitor's marketing PDF with no live URL, a vision model is your only option, and for a simple layout it will save you real time. A DOM-only inspector is the right reach when you just want to eyeball a palette or copy one section's CSS and a human is doing the next step.

uiscanner is built for the case in between and beyond both: you have a live page, and the next step is an AI one. You want your agent to rebuild a design language from real values and a real structure, not from a guess and not from a flat dump of every class in the sheet. Point it at a reference, hand your agent the teardown, and it builds against a system instead of inventing generic defaults.

Want to see the output before wiring anything up? Browse example teardowns, including full breakdowns of the Stripe and Linear design systems, or read how uiscanner differs from a pure CSS inspector. The free plan includes 5 scans per rolling 30 days; Indie ($12/mo) gets 100 and Studio ($39/mo) gets 400, on one quota shared across the web app, MCP, and CLI.