A design in Figma is a pile of boxes at absolute coordinates. A working website is rows and columns that reflow on a phone. Getting from the first to the second is a thousand small judgment calls, and in 2021 I set out to teach a model to make them.
Why Builder needed it
The plan for Qwik had been the Vercel playbook. Win developers with an open-source framework, let them get to know Builder, and some of them become customers of the visual CMS. It’s a clean top-of-funnel story. The conversion never came in the numbers we wanted, so the company looked at where else it could grow, and the answer was to make the product easier to use.
The sharpest friction was the import step. Designers lived in Figma, where you place boxes and images in free-form absolute positions. The CMS needed responsive structure, which means thinking in HTML: columns, rows, alignment, things that stretch and wrap. We had hand-written heuristics for the translation, and they only went so far. I thought a model could learn the judgment calls the heuristics kept getting wrong.
The first instinct: a neural network
My first attempt was a neural network, and it ran into two walls fast.
A standard feed-forward network expects a fixed-size input. A design can have six elements or six hundred, and I needed something that handled both without padding the problem into a shape it wasn’t. On top of that, there was no big labeled dataset for “design to good structure.” We had a modest amount of data and a model class that’s hungry for a lot of it.
Where the data came from
Before solving the model, I solved the data, because real responsive websites are a free source of labeled examples. Take a site whose row-and-column structure you already know, strip it down to the absolute position of every box, and you have an input (flat boxes) paired with a correct output (the hierarchy that produced them).
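A minimal sketch of that stripping step, assuming a layout tree whose nodes store positions relative to their parent. The `Node` type and `flatten` helper are hypothetical names for illustration, not the original tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Position is relative to the parent node (an assumption of this sketch).
    x: float
    y: float
    w: float
    h: float
    children: list["Node"] = field(default_factory=list)


def flatten(node, ox=0.0, oy=0.0):
    """Return the leaves of a layout tree as absolute (x, y, w, h) boxes."""
    ax, ay = ox + node.x, oy + node.y
    if not node.children:
        return [(ax, ay, node.w, node.h)]
    boxes = []
    for child in node.children:
        boxes.extend(flatten(child, ax, ay))
    return boxes


# A page containing one row of two cards, inset 20px from the page edge.
page = Node(0, 0, 400, 200, [
    Node(20, 20, 360, 100, [Node(0, 0, 170, 100), Node(190, 0, 170, 100)]),
])

# One training example: the flat input paired with the hierarchy that produced it.
training_pair = (flatten(page), page)
```

The flat list keeps no trace of the row that grouped the two cards, which is the information the model has to recover.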
To go further, I leaned on genetic algorithms to generate and explore more layouts, keeping the ones we validated as correct. As a way to solve the structuring problem outright, the genetic approach hit a wall quickly. It needed a growing pile of hand-tuned parameters and stopped scaling. Its lasting value was the corpus of validated designs it left behind, which is what made the next approach possible.
The corpus did one more thing that turned out to matter. Because you already know the correct tree for each site, you also know, for any two merges the algorithm might consider, which one the real structure actually performed first. That is exactly the label the model needed, minted for free by replaying known-good trees.
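One way such labels could be derived, sketched under assumptions: each merge found in the known tree is tagged with the step at which a bottom-up replay performs it, and a merge the tree never makes gets `None`. The `label_pair` function, the tie rule for equal steps, and the sample step values are all hypothetical.

```python
# Class order follows the four verdicts: both good, both bad, first better, second better.
BOTH_GOOD, BOTH_BAD, FIRST_BETTER, SECOND_BETTER = range(4)


def label_pair(step_a, step_b):
    """Label a matchup from the replay step of each merge.

    A step of None means the known tree never made that merge.
    """
    if step_a is None and step_b is None:
        return BOTH_BAD
    if step_b is None:
        return FIRST_BETTER
    if step_a is None:
        return SECOND_BETTER
    if step_a == step_b:
        return BOTH_GOOD  # assumption: merges made in the same pass are equally good
    return FIRST_BETTER if step_a < step_b else SECOND_BETTER


# Steps recorded while replaying one known-good tree (made-up values).
replayed = {
    ("title", "subtitle"): 0,
    ("card1", "card2"): 0,
    ("header", "cards"): 1,
}

label = label_pair(replayed[("title", "subtitle")], replayed[("header", "cards")])
```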
The reframe: which of these two merges is better
The idea that unlocked everything was small, and a little smaller than I expected. Stop asking the model to understand a whole design at once. Don’t even ask it to judge one pair in isolation. Ask it the most modest comparative question there is: given two merges I could make right now, which one should happen first?
That single move erases the variable-size problem. A design with six hundred elements and a design with six are the same task to the model, because it only ever looks at two candidate merges and says which it prefers. And a candidate merge has a clean, fixed set of features, twenty-one numbers, always the same twenty-one: the two boxes it would join (their widths, heights, child counts), the box you’d get out, the gap and distance between them, whether their edges or centers line up, and a few cheap signals, like how many other merges are competing for the same boxes, and whether taking this one keeps a good move available on the next pass.
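To make the fixed-width idea concrete, here is a sketch that computes a subset of those geometric features for one candidate merge. It is not the actual twenty-one: the `Box` type, the `merge_features` name, the one-pixel alignment tolerance, and the choice of features are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float
    children: int = 0


def union(a, b):
    """Bounding box that results from merging a and b."""
    x0, y0 = min(a.x, b.x), min(a.y, b.y)
    x1 = max(a.x + a.w, b.x + b.w)
    y1 = max(a.y + a.h, b.y + b.h)
    return Box(x0, y0, x1 - x0, y1 - y0, children=2)


def merge_features(a, b, tol=1.0):
    """Fixed-length feature vector for one candidate merge (illustrative subset)."""
    out = union(a, b)
    # Gap along each axis: positive when the boxes are apart, 0 when they touch or overlap.
    gap_x = max(0.0, max(a.x, b.x) - min(a.x + a.w, b.x + b.w))
    gap_y = max(0.0, max(a.y, b.y) - min(a.y + a.h, b.y + b.h))
    return [
        a.w, a.h, a.children,
        b.w, b.h, b.children,
        out.w, out.h,
        gap_x, gap_y,
        float(abs(a.x - b.x) <= tol),                              # left edges line up
        float(abs(a.y - b.y) <= tol),                              # top edges line up
        float(abs((a.x + a.w / 2) - (b.x + b.w / 2)) <= tol),      # centers line up in x
        float(abs((a.y + a.h / 2) - (b.y + b.h / 2)) <= tol),      # centers line up in y
    ]


# Two cards side by side with a 20px gutter.
features = merge_features(Box(0, 0, 100, 50), Box(120, 0, 100, 50))
```

The vector has the same length whether the design holds six boxes or six hundred, which is what lets a tabular model consume it.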
The model is gradient-boosted trees (XGBoost), and for each pair of candidates it returns four probabilities: both merges are good, both are bad, the first is better, the second is better. Not a yes/no on one pair, but a verdict on the matchup.
The tournament
One comparison isn’t a decision; it’s a vote. So every legal merge plays every other legal merge in a round-robin, and the votes accumulate into a score:
def rank(options):
    for a, b in all_pairs(options):
        both_good, both_bad, a_wins, b_wins = model.compare(a, b)
        verdict = argmax([both_good, both_bad, a_wins, b_wins])
        if verdict == 0:    # both good: split the credit
            a.score += both_good / 2
            b.score += both_good / 2
        elif verdict == 1:  # both bad: split the penalty
            a.score -= both_bad / 2
            b.score -= both_bad / 2
        elif verdict == 2:  # first merge wins
            a.score += a_wins
        else:               # second merge wins
            b.score += b_wins
    return sorted(options, key=lambda o: o.score, reverse=True)
The merge that beats the most rivals floats to the top. The “both good” and “both bad” classes are what let it move fast: when a cluster of merges are all clearly right, they can be applied in the same pass. A small k-means split on the scores separates the obvious winners from the maybes, so a clean design collapses in a handful of rounds instead of one merge at a time.
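The k-means split can be sketched as a two-cluster pass over one-dimensional scores. This is a plain-Python approximation under assumptions: the `split_top_scores` name, the min/max initialization, and the iteration cap are illustrative, not the shipped code.

```python
def split_top_scores(scores, iterations=20):
    """Two-cluster k-means on 1-D scores; return indices in the higher cluster."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return list(range(len(scores)))  # all scores equal, nothing to separate
    for _ in range(iterations):
        # Assign each score to the nearer of the two centers.
        high = [s for s in scores if abs(s - hi) <= abs(s - lo)]
        low = [s for s in scores if abs(s - hi) > abs(s - lo)]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return [i for i, s in enumerate(scores) if abs(s - hi) <= abs(s - lo)]


# Three merges score clearly above the rest, so they form the winning cluster.
winners = split_top_scores([9.1, 8.7, 2.0, 1.5, 8.9, 0.3])
```

Starting the two centers at the minimum and maximum guarantees that neither cluster is ever empty, so the means are always defined.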
Building the hierarchy bottom-up
The ranking drives a greedy loop. Enumerate the legal merges, rank them, apply the best (or the whole top cluster), and run again on the smaller set. Repeat until a single root remains.
def structure(boxes):
    while len(boxes) > 1:
        options = legal_merges(boxes)         # skip any union that would
        winners = top_cluster(rank(options))  # swallow or cross a third box
        boxes = apply(boxes, winners)         # each winning pair becomes one new box
    return boxes[0]                           # the root of the hierarchy
The “legal merges” check is what keeps the output sane. A merge is off the table if the combined bounding box would swallow or cross a third box, which stops the model from gluing together things that visually belong apart.
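A minimal sketch of that check, assuming axis-aligned boxes and treating any shared interior area with a third box as a violation (boxes that merely touch at an edge are allowed). The `Box`, `union`, `intersects`, and `is_legal` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float


def union(a, b):
    """Bounding box that would result from merging a and b."""
    x0, y0 = min(a.x, b.x), min(a.y, b.y)
    x1 = max(a.x + a.w, b.x + b.w)
    y1 = max(a.y + a.h, b.y + b.h)
    return Box(x0, y0, x1 - x0, y1 - y0)


def intersects(a, b):
    """True when two boxes share interior area; touching edges do not count."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def is_legal(a, b, boxes):
    """A merge is legal if its bounding box touches no box other than a and b."""
    merged = union(a, b)
    return not any(intersects(merged, c) for c in boxes if c is not a and c is not b)


# Three cards in a row: neighbours may merge, but the outer two may not skip the middle.
left, middle, right = Box(0, 0, 50, 50), Box(60, 0, 50, 50), Box(120, 0, 50, 50)
cards = [left, middle, right]
neighbours_ok = is_legal(left, middle, cards)
skipping_ok = is_legal(left, right, cards)
```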
A design, becoming a tree
Boxes at absolute coordinates
A Figma frame is what you see here: a pile of boxes placed at fixed x/y positions. No rows, no columns, no idea what belongs with what. That's the input.
Only legal merges count
The engine only ever considers joining two boxes whose combined bounding box wouldn't swallow or cross a third. That single rule (dashed = allowed, crossed = illegal) keeps it from gluing together things that visually belong apart.
Score the matchups, merge the winner
Every legal merge plays every other in a round-robin, and an XGBoost model votes on each pairing. The merge that beats the most rivals wins the round. Here the tightest, best-aligned pairs go first: the two text lines, then the three cards.
Bottom-up, the tree falls out
Merge, then run again on the smaller set. Repeat until a single root remains. A blue box is a row (children side by side); a maroon box is a column (children stacked). The responsive structure assembles itself from the bottom up.
Why gradient-boosted trees, in the deep-learning era
What I trained was an ensemble of decision trees (XGBoost), not a single tree and not a neural net. Reaching for trees in 2021 felt unfashionable, and three things made it the right call.
The comparison framing made the model size-independent, so trees were enough; I didn’t need a network to swallow a whole variable-length design. Trees also need far less data than a neural net, which matched the corpus I had. And when the model merged something wrong, I could rank the features by how much each one drove its splits and read straight off which signal misled it, then fix the data behind it. That inspectability turned debugging from guesswork into a short investigation.
What it became
The model shipped as the structure stage of Visual Copilot, Builder’s one-click Figma-to-code tool. By launch it sat inside a three-stage pipeline: the merge model turns a design into a clean hierarchy, Mitosis turns that hierarchy into framework-agnostic components, and a fine-tuned language model specializes the output for React, Vue, Svelte, Angular, Qwik, Solid, or plain HTML. The interesting part is what didn’t happen: when LLMs arrived, they didn’t replace the tree model, they slotted in next to it. Structure stayed a job for the cheap, inspectable, size-independent thing, and the language model did the framework-specific writing on top. The project’s success is also what spun out a dedicated AI team at Builder, which I ended up leading.
The lesson I carry from it is the reframe. The fanciest model I could have trained wouldn’t have beaten a small one pointed at the right question. Turning “understand this whole design” into “which of these two merges is better” is the same instinct as doing the hard analysis in a compiler so the runtime stays simple. Find the framing where a simple tool is enough, and most of the difficulty quietly goes away.