Posted by simedw 2 days ago
This is what I think is missing in most AI (broad sense) learning resources. They focus so much on the math that I miss the intuitive process behind it.
It’s purposefully high level and non-technical for a general audience - my theory was that most people who aren’t into tech/AI don’t care too much about training, or how the system got to be the way that it is.
But they do have some interest in how it actually operates once you’ve typed in a prompt.
Happy to answer any questions or take feedback on board.
Right now we are only seeing the denoising process after it's been morphed by the latent decoder, which looks a lot less intuitive than actual pixel diffusion.
If you can't find a suitable pixel-space model, then you can just trivially generate a forward process and play it backwards.
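For anyone who wants to try the "forward process played backwards" idea, here is a minimal sketch. It assumes the usual closed-form noising x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The cosine-style schedule, the function name `forward_frames`, and the five-value stand-in "image" are illustrative assumptions of mine, not anything from the article. Note that this only replays a recorded noising run for visualization; no model is denoising anything.

```python
import math
import random

def forward_frames(x0, num_steps=10, seed=0):
    """Return frames x_0 .. x_T that blend a clean image into Gaussian noise."""
    rng = random.Random(seed)
    # One fixed noise sample, reused at every step so the frames change smoothly.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    frames = []
    for t in range(num_steps + 1):
        # Assumed cosine-style schedule: signal weight falls from 1 (t=0) to ~0 (t=T).
        alpha_bar = math.cos(0.5 * math.pi * t / num_steps) ** 2
        a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        frames.append([a * p + b * e for p, e in zip(x0, eps)])
    return frames

clean = [0.0, 0.25, 0.5, 0.75, 1.0]   # stand-in for a flattened pixel image
noising = forward_frames(clean)        # image first, noise last
denoising_view = noising[::-1]         # "play it backwards": noise first, image last
```

Rendering `denoising_view` frame by frame gives a pixel-space animation that ends on the clean image, which is the effect described above.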
Found the manual latent space exploration part really interesting.
Too many LLM/diffusion explanations fall into the proverbial “how to draw an owl” meme without giving a taste as to what’s going on.
The interpolations between butterfly and snail were pretty horrifying. But with something like Z-Image you could basically concatenate the text and end up with a normal image of both. Is the latent space for "butterfly and snail" just well off the path between the two individually?
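One way to make that question concrete is to measure how far a combined-prompt embedding sits from the straight line between the two single-prompt embeddings. The sketch below is only a toy: the three vectors are made-up 3-D placeholders, not real text-encoder outputs, and `distance_to_segment` is a hypothetical helper name. With real embeddings you would substitute whatever your text encoder returns.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between vectors a and b at fraction t."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def distance_to_segment(p, a, b):
    """Distance from p to the segment a-b, plus the fraction t of the closest point."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    denom = sum(v * v for v in ab)
    # Project p onto the line through a and b, then clamp to the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    return math.dist(p, lerp(a, b, t)), t

# Made-up placeholder embeddings (assumption: real ones are much higher-dimensional).
butterfly = [1.0, 0.0, 0.0]
snail = [0.0, 1.0, 0.0]
both = [0.5, 0.5, 1.0]   # pretend embedding of "butterfly and snail"

off_path, t = distance_to_segment(both, butterfly, snail)
```

In this toy, `both` projects onto the midpoint of the path but sits a full unit away from it. A large distance like that in a real model would be consistent with the combined prompt living off the interpolation path, though the toy itself proves nothing about any actual model.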
It's hard to imagine what is nearby in latent space and how text contributes, so I did really like the section adding words to the prompt 1-by-1.
So different seeds lead to slightly different end points, because you’re just moving closer to the “consistent region” at each step, but approaching from a different angle.
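A toy version of that picture, under heavy assumptions: take the "consistent region" to be the unit circle in 2-D, let the seed pick a random starting point, and let each step move a fixed fraction of the way toward the nearest point on the circle. The function name `toy_sample`, the step size, and the step count are all hypothetical, and a real sampler follows a learned denoiser rather than a known target shape.

```python
import math
import random

def toy_sample(seed, num_steps=50, step_size=0.2):
    """Start from seeded Gaussian noise and step toward the unit circle."""
    rng = random.Random(seed)
    x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    for _ in range(num_steps):
        r = math.hypot(x, y)
        if r == 0.0:
            # Degenerate start exactly at the origin: pick an arbitrary direction.
            x, r = 1e-9, 1e-9
        # Nearest point on the circle lies along the current direction.
        tx, ty = x / r, y / r
        x += step_size * (tx - x)
        y += step_size * (ty - y)
    return x, y

end_a = toy_sample(seed=1)
end_b = toy_sample(seed=2)
```

Both runs finish essentially on the circle, but at different points, because each one approached from the direction its seed happened to start in.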