Posted by samcollins 2 days ago

Using “underdrawings” for accurate text and numbers (samcollins.blog)
264 points | 86 comments
IdiotSavage 2 hours ago|
> Transform this image into a photographed claymation diorama of assorted artisan chocolates and candies […] viewed from a low-angle

Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.

Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?

8-prime 1 hour ago||
I have noticed the same thing. The few times I wanted to use image generation it always failed me in exactly these aspects. I always put it down to a lack of prompting skill on my end. Once you start to keep an eye out for these inconsistencies, they turn out to be very common.
ryanthedev 1 hour ago|||
I believe most detailed prompts are AI generated.
IdiotSavage 1 hour ago||
That's funny if it's true. I'd like to see the prompt which generates the prompt.
ErroneousBosh 1 hour ago||
I wonder how long it took to come up with all this?

Because if I wanted a spiral of little "buttons" like the last one at the end (and they don't look very much like sweets) I'd be able to knock that out in Blender in an afternoon, and I'm not very good at Blender.

Brendinooo 20 minutes ago|||
I remember opening Blender for the first time years ago and thinking it had the steepest learning curve of any software I'd ever used.
HotHotLava 1 hour ago|||
I think you're vastly overestimating the average person's ability to use Blender if you can do that in an afternoon; just figuring out how to place a colored cube and the camera probably takes an afternoon if you pick up Blender for the first time.
ErroneousBosh 1 hour ago||
I guess I'm coming at it from having used Blender for an afternoon or so, and already knowing Python.

If you were good at GLSL you could do it in that maybe.

Someone somewhere is going to write something that directly draws it to a framebuffer in Brainfuck, you just know it, don't you?

danpalmer 9 hours ago||
I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).

There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.

What I'd really like to see is a more well defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.

p-e-w 5 hours ago||
> due to fundamental limitations

People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.

dijit 5 hours ago|||
Character counting remains a huge issue without tools.

Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.

girvo 1 hour ago||||
The literal best public models still fail to count characters consistently in practice so I’m not sure what you mean. It’s literally a problem we’re still trying to solve at work
outofpaper 56 minutes ago||
What's amazing is that they can even fairly reliably appear to count characters. I mean we're talking about systems that infer sequences, not character counters or calculators. They are amazing in unrelated ways and we need to accept this so we can use them effectively.
coldtea 3 hours ago||||
>People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist

Some limitations haven't been rigorously demonstrated to be fundamental, but they have been continuously present since the earliest LLMs. Shouldn't the burden of proof be on those who say it can be done?

And some limitations are fundamental, and have been rigorously demonstrated, e.g.:

https://arxiv.org/abs/2401.11817?utm_source=chatgpt.com

p-e-w 2 hours ago||
That paper’s abstract doesn’t carry its title, to put it mildly.
coldtea 1 hour ago||
What part of "Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate if used as general problem solvers. " doesn't carry the title, to ask mildly?
3form 2 hours ago||||
Is character counting actually not an issue anymore? Do you know somewhere where I can read more about this?
mrob 1 hour ago||||
Character counting errors are a side effect of tokenization, which is a performance optimization. If we scaled the hardware big enough we could train on raw bytes and avoid it.
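Quick illustration of what the model actually "sees" (assumes tiktoken is installed; the exact split is from memory and may vary by encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE vocabulary
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])
    # something like ['str', 'aw', 'berry'] -- three opaque chunks rather
    # than ten characters, so "count the r's" is inference over spelling
    # the model never directly observed
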
3form 1 hour ago||||
Your comment, after removing the particulars, has a shape of:

People have an <opinion> which hasn't been rigorously proven, while <not rigorously proven counteropinion>.

As such, I am not sure what you're trying to achieve here.

raincole 1 hour ago||||
Drawing five fingered humans was a fundamental limitation... until it's not.
danpalmer 4 hours ago||||
This is kind of my point, we need to get better at describing the limitations and study them. It seems extremely clear that there are limitations, and not just temporary ones, but structural limitations that existed at the beginning and continue to persist.
ijidak 1 hour ago||
Yeah I think it was the word "fundamental" he took issue with.
Marazan 3 hours ago||||
If you remove the auxiliary tools and just leave the core LLM then strawberry still has an undefined number of `r`s in it.
p-e-w 2 hours ago||
That’s false. Larger LLMs learn token decompositions through their training, and in fact modern training pipelines are designed to occasionally produce uncommon tokenizations (including splitting words into individual characters) for this reason. Frontier models have no trouble spelling words even without tools. Even many mid-sized models can do that.
kilpikaarna 2 hours ago||
Wait, where can I learn more about this? I don't doubt that varying the tokenization during training improves results, but how does/would that enable token introspection?
rimliu 4 hours ago|||
of course, if you choose to ignore all the limitations they indeed have no limitations.
mkbosmans 4 hours ago||
Nobody says they have no limitations. The question is whether those limitations are fundamental, i.e. whether we can expect improvement, say, within a year.
coldtea 3 hours ago|||
As a general architecture, an LLM also has limitations that can't be improved unless we switch to another, fundamentally different AI design that's not LLM-based.

There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.

Here's one: https://arxiv.org/abs/2401.11817?utm_source=chatgpt.com

danpalmer 4 hours ago|||
When I talk about fundamental limitations, I mean limitations that can't be solved, even if they could be improved.

We have improved hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.

p-e-w 2 hours ago||
“Seems clear” based on what?
pegasus 2 hours ago||
For one, based on continuously frustrated hopes (and promises!) that hallucinations will go away.
locknitpicker 6 hours ago||
> There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions.

Not so long ago, this was what early adopters of LLM coding assistants claimed was the right way to use them for coding tasks: prompt to draft the outline, then prompt to implement each function. There were even a few posts on HN about blog posts showing off this approach, with terms inspired by animation work.

Sammi 2 hours ago|||
In short, LLMs are pretty great at working at a single level of abstraction at a time.

You can go from the highest level and all the way down to the lowest level with LLMs, you just have to work at it iteratively one level at a time.

danpalmer 5 hours ago||||
I'm not necessarily suggesting always getting down to literally the function level, although I think that gives you excellent quality control, but having a code-level understanding is clearly an important factor.
nullsanity 5 hours ago|||
[dead]
petercooper 11 minutes ago||
This seems analogous to how a human would do it accurately. If you asked an artist to paint stones in a large circular arrangement with the numbers in order in one shot, with no fixes or sketching allowed, it wouldn't be surprising to end up with problems in the arrangement.
samcollins 2 days ago||
I found a simple technique to get reliable text and numbers in AI generated images.

I’m surprised the image models aren’t already doing this, so I wanted to share since I’m finding it so useful.

samcollins 10 hours ago|
TLDR: use SVG to outline image correctly first, then send that image with your text prompt to get Gemini 3.0 Pro to render with correct numbers and text
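Rough sketch of the underdrawing half in Python (simplified from my actual code; the layout numbers are arbitrary, and the rasterize + Gemini request are only described in comments since those details depend on your setup):

    import math

    def numbered_circle_underdrawing(n=12, size=512, radius=200):
        cx = cy = size / 2
        parts = [
            f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">',
            f'<rect width="{size}" height="{size}" fill="white"/>',
        ]
        for i in range(n):
            a = -math.pi / 2 + 2 * math.pi * i / n  # put "1" at 12 o'clock
            x, y = cx + radius * math.cos(a), cy + radius * math.sin(a)
            parts.append(f'<circle cx="{x:.1f}" cy="{y:.1f}" r="30" fill="none" stroke="black"/>')
            parts.append(f'<text x="{x:.1f}" y="{y:.1f}" font-size="28" text-anchor="middle" '
                         f'dominant-baseline="central">{i + 1}</text>')
        parts.append('</svg>')
        return "\n".join(parts)

    with open("underdrawing.svg", "w") as f:
        f.write(numbered_circle_underdrawing())

    # Rasterize the SVG to PNG (cairosvg, resvg, or a browser screenshot all
    # work), then send that PNG together with the style prompt ("repaint this
    # as artisan chocolates on a marble slab, keep every number exactly where
    # it is") as an image+text request to the image model.
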
smusamashah 7 hours ago||
This is just img2img where first image with correct structure was generated by code.
vunderba 5 hours ago||
Yup, that’s exactly what this is. If you’ve been using generative models since the early Stable Diffusion days, it’s a pretty common (and useful!) technique: using a sketch (SVG, drawn, etc) as an ad-hoc "controlnet" to guide the generative model’s output.

Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.
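For anyone wanting to reproduce that today, the diffusers version of the depth-map trick looks roughly like this (untested sketch; the model IDs are the common public checkpoints, not necessarily what I used back then):

    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from PIL import Image

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    depth = Image.open("scene_depth.png").convert("RGB")  # exported from the 3D mockup
    result = pipe(
        "sunlit living room, mid-century couch under the window, photorealistic",
        image=depth,                        # the depth map constrains the layout
        controlnet_conditioning_scale=0.7,  # how strictly to follow it
    ).images[0]
    result.save("living_room.png")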

jasonjmcghee 7 hours ago|||
Pretty much what the author said - just gave some context for the uninitiated
philsnow 6 hours ago||
Right, but you can use a different (codegen) model to make that code.
Geonode 3 hours ago||
We've been doing this for a long time now, it's similar to using a depth map or a line drawing to control the silhouette.
sparuchuri 2 days ago||
This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short
manmal 7 hours ago|
Even the original Stable Diffusion app had image-to-image. It just didn't work as well. I'm not sure why this is supposed to be novel.
ludwik 6 hours ago|||
It’s obviously not a new model capability. But using this well-known, existing capability to solve this particular issue is only obvious after the fact.

It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.

Finbel 6 hours ago|||
It's not novel in the sense that nobody knew about img2img. It's novel in the sense that nobody thought of using img2img to solve this problem in this way.
manmal 23 minutes ago|||
Ok, it might just be me then. I view Nvidia's DLSS as a similar thing. There was even this meme that video games would in the future only output basic geometry and an AI layer would transform it into stunning graphics.
TeMPOraL 3 hours ago|||
It's novel if you never played with img2img, including especially several forms of (text+img)2img. Or, if you never tried editing images by text prompt in recent multimodal LLMs.

That said, I spent plenty of time doing both, and yet it would probably take me a while to arrive at this approach. For some reason, the "draw a sketch, have a model flesh it out" approach got bucketed with Stable Diffusion in my mind, and multimodal LLMs with "take detailed content, make targeted edits to it". So I'm glad the OP posted it.

xigoi 4 hours ago||
The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?
petercooper 9 minutes ago||
Because image models at the basic level are just text tokens in, image tokens out. You'd need an agentic process on top to come up with a strategy, review output, try again, and so on.

I believe Nano Banana and gpt-image-2 have a little of this going on, but it's like asking a model to one-shot some code vs having an agentic harness with tools do it. Even the most basic agent can produce better code than ChatGPT can.

Sharlin 1 hour ago|||
Because the LLM is more or less hardcoded to just pass "create image" style prompts to a separate model, possibly with some embellishment.
nine_k 4 hours ago|||
Nobody asked it to!
xigoi 4 hours ago||
If it's asked to generate an image, it should do everything in its powers to make the image good.
andruby 3 hours ago|||
> it should do everything in its powers

That's a scary thought.

Hey Claude, why haven't you finished yet? ... Because the human I'm holding hostage hasn't finished the drawing yet.

lacksjoian 2 hours ago|||
LLMs have no concept of what makes the output "good". Or to put it another way, if the LLM generates an image with jumbled numbers it's because that was the most likely output, hence it was a "good" image according to its weights.
pyrolistical 3 hours ago|||
You don’t know what you don’t know
cubefox 4 hours ago|||
Part of the problem is that it isn't the LLM making the image directly itself, it's the LLM repeatedly prompting edits for a separate edit diffusion model. The Gemini reasoning summary shows part of this. The style of some of the images makes it also clear that it uses an Imagen 4 derived diffusion model underneath.
jstanley 4 hours ago||
[flagged]
xigoi 4 hours ago||
Every decent human artist knows to draw a sketch before painting something.
jrapdx3 3 hours ago|||
Of course many, even most, painters do sketch what they intend to paint; likely that's the predominant technique.

But it's not universally true, particularly among artists working in the last 100 years or so. Certainly Jackson Pollock (whether one regards his work as good or not) didn't sketch out how he was going to distribute paint onto canvas. Another example is Morris Louis (and other "stain painters") who didn't sketch out how he applied paint to canvas.

Your comment is largely correct, just pointing out that more than a few "decent artists" didn't (or don't) work that way.

hirako2000 3 hours ago|||
Humans even have the creativity to come up with sketching.

Models don't have intelligence, even less so creative thinking.

xigoi 2 hours ago||
Exactly, that’s my point.
elil17 4 hours ago||
I wonder whether this could be used to fine-tune image models to provide better outputs. Something like this (with a rough code sketch of the first two steps after the list):

1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing)

2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.

3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.

4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
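A rough sketch of what steps 1 and 2 could look like (all names and layout logic here are placeholders; the LLM fuzzing pass and the image-model calls in steps 3 and 4 are omitted):

    import random

    SHAPES = ["square", "circle", "triangle"]
    REGIONS = {"top left": (0.0, 0.0), "top right": (0.75, 0.0),
               "bottom left": (0.0, 0.75), "bottom right": (0.75, 0.75)}

    def make_example(n_shapes=3, size=512):
        placements, captions = [], []
        for region, (fx, fy) in random.sample(list(REGIONS.items()), n_shapes):
            shape = random.choice(SHAPES)
            number = random.randint(0, 9)
            x = int((fx + random.uniform(0, 0.2)) * size)
            y = int((fy + random.uniform(0, 0.2)) * size)
            placements.append({"shape": shape, "number": number, "x": x, "y": y})
            captions.append(f"there is a {shape} with the number {number} "
                            f"in the {region} corner")
        return placements, "; ".join(captions)

    spec, description = make_example()
    # `spec` gets rendered into the SVG/PNG underdrawing (step 1), and
    # `description` is paired with the image+text-to-image output as a
    # training example for the text-to-image model (step 4).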

hirako2000 3 hours ago||
That would complicate the architecture of a model, to solve a finite set of cases. That's an argument for specialised/fine-tuned models though.
slickytail 3 hours ago||
[dead]
dllu 4 hours ago|
I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.