Using “underdrawings” for accurate text and numbers
- IdiotSavage - 9598 seconds ago
> Transform this image into a photographed claymation diorama of assorted artisan chocolates and candies […] viewed from a low-angle
Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.
Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?
- danpalmer - 32655 seconds ago
I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).
There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a better-defined taxonomy of work, and studies on which parts work well with LLMs and which don't. I understand some of this intuitively, but I'm still building my intuition, and I see people tripping up on this all the time.
- samcollins - 234203 seconds ago
I found a simple technique to get reliable text and numbers in AI-generated images.
I'm surprised the image models aren't already doing this, so I wanted to share since I'm finding it so useful.
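As a sketch of this underdrawing idea (my own illustration, not the article's code): render the exact text deterministically as SVG so a later image-to-image pass only has to restyle, not invent, the glyphs. The img2img call itself would come from whatever image model is in use and is not shown here.

```python
def make_underdrawing(labels: list[str], width: int = 400, height: int = 300) -> str:
    """Return an SVG string placing each label at a fixed position.

    Because the text is emitted by code, every character and digit is
    guaranteed correct; the image model only restyles the result.
    """
    rows = []
    for i, text in enumerate(labels):
        y = 40 + i * 50  # fixed vertical spacing keeps the labels legible
        rows.append(f'<text x="20" y="{y}" font-size="24">{text}</text>')
    body = "\n  ".join(rows)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">\n  {body}\n</svg>')

svg = make_underdrawing(["Total: 1,234", "Q3 revenue: $56.7M"])
# This SVG, rasterized, would then go to an img2img endpoint together with
# a style prompt (e.g. "hand-painted infographic") -- that call is hypothetical.
```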
- smusamashah - 25894 seconds ago
This is just img2img, where the first image with the correct structure was generated by code.
- Geonode - 9622 seconds ago
We've been doing this for a long time now; it's similar to using a depth map or a line drawing to control the silhouette.
- sparuchuri - 231120 seconds ago
This hack definitely falls in the "duh, why didn't I think of that" category of tricks, but I'm glad to now have it for next time imagegen comes up short.
- xigoi - 16794 seconds ago
The standard objection: if the LLM is supposedly intelligent, why can't it figure out on its own that this two-step process would achieve a better result?
- elil17 - 13607 seconds ago
I wonder whether this could be used to fine-tune image models to produce better outputs. Something like this:
1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing).
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
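Steps 1 and 2 above could be sketched like this (a minimal illustration under my own assumptions about the shape set and corner positions; steps 3 and 4 would plug into whatever image model and training stack is in use):

```python
import random

# Steps 1-2 of the proposed pipeline: algorithmically place a numbered
# shape in an SVG underdrawing and emit a matching text description.
SHAPES = ["square", "circle", "triangle"]
CORNERS = {(0, 0): "top left", (1, 0): "top right",
           (0, 1): "bottom left", (1, 1): "bottom right"}

def shape_elem(shape: str, x: int, y: int) -> str:
    # Outline-only marks keep the underdrawing easy for img2img to restyle.
    if shape == "square":
        return (f'<rect x="{x - 15}" y="{y - 15}" width="30" height="30" '
                f'fill="none" stroke="black"/>')
    if shape == "circle":
        return f'<circle cx="{x}" cy="{y}" r="18" fill="none" stroke="black"/>'
    return (f'<polygon points="{x},{y - 18} {x - 16},{y + 12} {x + 16},{y + 12}" '
            f'fill="none" stroke="black"/>')

def make_training_pair(rng: random.Random, size: int = 200):
    """Return one (underdrawing_svg, description) training pair."""
    shape = rng.choice(SHAPES)
    number = rng.randint(0, 9)
    (cx, cy), corner = rng.choice(list(CORNERS.items()))
    x = 40 + cx * (size - 80)
    y = 40 + cy * (size - 80)
    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
           f'{shape_elem(shape, x, y)}'
           f'<text x="{x - 6}" y="{y + 6}" font-size="20">{number}</text></svg>')
    description = f"there is a {shape} with the number {number} in the {corner} corner"
    return svg, description

pair_svg, pair_description = make_training_pair(random.Random(0))
```

The description is derived from the same random draws as the SVG, so the two are consistent by construction, which is the point of using them as training data.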
- dllu - 14543 seconds ago
I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.
- nottorp - 15026 seconds ago
LLMs are like a box of chocolates...
- utopiah - 6582 seconds ago
Love the concluding note: it works, but not really.
Such is the LLM/GenAI craze: an entire article to show that it's nearly there, yet it's not, despite convoluted effort to make it just so on a very, very niche example.
- docheinestages - 8902 seconds ago
And what happens if the model can't come up with a good enough SVG to begin with?
- nine_k - 16563 seconds ago
It's normal to first create a plan, then let agents write code. But it seems to be surprising for many to first create a draft/outline of a picture, then go for a final render.
- cheekyant - 12170 seconds ago
Has anyone built a platform with image-to-image pipelines that lets you use prompt-to-SVG generation from SOTA LLMs?
- BobbyTables2 - 28244 seconds ago
How is it that LLMs aren't good at rendering a sequence of numbers, but can reliably put the supplied pieces all in the right order?
- wg0 - 19839 seconds ago
Has anyone had good luck with making consistent game art and assets?
- choeger - 23381 seconds ago
Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best way to go.
It should be fairly trivial to fix any logic errors in the structured output, too.
- tracerbulletx - 33501 seconds ago
I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, and I could style it with a diffusion model. It's very useful for data viz.
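A minimal sketch of that workflow, assuming the chart is first built as plain SVG so the bars and numeric labels stay exact; the diffusion restyling step is not shown:

```python
def bar_chart_svg(values: dict[str, int], width: int = 320, height: int = 200) -> str:
    """Render a simple bar chart as SVG with exact numeric labels.

    Because the geometry and labels are computed, the data is guaranteed
    correct; a diffusion model would only be used to restyle the render.
    """
    max_v = max(values.values())
    bar_w = width // len(values)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{width}" height="{height}">']
    for i, (label, v) in enumerate(values.items()):
        h = int((height - 40) * v / max_v)  # scale bar to tallest value
        x = i * bar_w + 10
        parts.append(f'<rect x="{x}" y="{height - 20 - h}" '
                     f'width="{bar_w - 20}" height="{h}"/>')
        parts.append(f'<text x="{x}" y="{height - 5}" '
                     f'font-size="12">{label}: {v}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

chart = bar_chart_svg({"Q1": 12, "Q2": 30, "Q3": 18})
```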
- SomaticPirate - 17983 seconds ago
inb4 this technique is subsumed into the next MoE model release.
LLMs are evolving so fast I wouldn't be surprised if this technique were no longer needed in <6 months.
- globular-toast - 13717 seconds ago
Wait, where did it get the "Sweet Path//Trail of treats" thing from in the SVG? It wasn't about sweets at that point. Something is missing here, I think.
- Melamune - 17179 seconds ago
I wondered why I was losing all passion for creating. These tips and tricks are part of the answer.
- jeffrallen - 22562 seconds ago
I wish the opposite were true: that when I tell Gemini I want "a diagram of X", it immediately breaks out Python and matplotlib, instead of wasting my time with Nano Banana.
- nullc - 25527 seconds ago
Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care for what the output looked like.
- foxes - 13364 seconds ago
I feel sorry for the recipient.
- psychoslave - 14593 seconds ago
A few months ago I tried to make Mistral's Le Chat output French poetry in alexandrines (lines of 12 syllables). Disastrous at first. Then, after adding to the specifications that each line also had to be transposed into IPA with each syllable counted, it went better.
Still emotionally unrelatable, but it was definitely producing something that matched the specifications, as long as they were explicit and systematically enforced through deterministic means. For now my takeaway is that LLMs' limitations are such that they can't seize the ineffable, and they are so untrustworthy that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
- gwern - 29591 seconds ago
tldr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.