It's a different set of trade-offs.
* Theoretically; I don't own an iPhone.
Server side means shared resources, shared upgrades and shared costs. The privacy aspect matters, but at what cost?
The cost, so far, is greater.
How so, if efficiency is key for datacenters to be competitive? If anything, it's the other way around.
[0] https://interestingengineering.com/innovation/elon-musk-xai-...
I don’t see this happening in the next 5 years.
The Mac mini being shrunk down to phone size is probably the better bet, and we'd have to bring power consumption requirements down by a lot too. Edge hardware is a ways off.
Local doesn't cost the company anything, and increases the minimum hardware customers need to buy.
Not completely true: those models are harder to develop. The logistics are a hassle.
Their ImageNet model (4_1024_8_8_0.05[0]) is ~820M parameters, while the AFHQ one is ~472M. Prior to that there are DenseFlow[1] and MaCow[2], which are both <200M parameters. For further comparison, that makes DenseFlow and MaCow smaller than iDDPM[3] (270M params) and ADM[4] (553M for 256 unconditional). And now it isn't uncommon for modern diffusion models to have several billion parameters![5] (From [5] we get some numbers on ImageNet-256, which allows a direct comparison, putting TarFlow closer to MaskDiT/2 and much smaller than SimpleDiffusion and VDM++, both of which are in the billions. But note that this is 128 vs 256!)
Essentially, the argument here is that you can scale (Composable) Normalizing Flows just as well as diffusion models. There are a lot of extra benefits you get in the latent space too, but that's a much longer discussion. Honestly, the TarFlow method is simple and there are probably a lot of improvements that can be made. But don't take that as a knock on the paper! I actually really appreciated it, and it demonstrates exactly what it set out to show. The real point is just that no one had trained flows at this scale before, and that really needs to be highlighted.
The tl;dr: people have really just overlooked different model architectures.
[0] Used a third-party reproduction, so the numbers might differ slightly, but their AFHQ-256 model matches at 472M params: https://github.com/encoreus/GS-Jacobi_for_TarFlow
[1] https://arxiv.org/abs/2106.04627
[2] https://arxiv.org/abs/1902.04208
[3] https://arxiv.org/abs/2102.09672
[4] https://arxiv.org/abs/2105.05233
[5] https://arxiv.org/abs/2401.11605
[Side note] Hey, if the TarFlow team is hiring, I'd love to work with you guys
See here: https://github.com/homerjed/transformer_flow
I'm happy to see the return of normalising flows - exact likelihood models have many benefits. I found the model needed soft-clipping on some operations to ensure numerical stability.
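For anyone curious what that kind of soft-clipping can look like, here's a minimal sketch (my own generic version, not necessarily what that repo does): the predicted log-scale of an affine transform gets squashed with a scaled tanh instead of a hard clamp.

    import torch

    def soft_clip(x, bound=5.0):
        # Smoothly squash values into (-bound, bound); unlike a hard clamp,
        # the gradient never hits exactly zero and exp(log_scale) stays bounded.
        return bound * torch.tanh(x / bound)

    # e.g. applied to the log-scale predicted by a coupling / autoregressive block:
    # log_scale = soft_clip(raw_log_scale)
    # y = x * torch.exp(log_scale) + shift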
I wonder if transformers could also be added to the GLOW algorithm, since attention and 1x1 convolutions could be made to do the same operation (a quick sanity check of that equivalence is below).
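For the 1x1-convolution side of that: a 1x1 conv is just one shared channel-mixing matrix applied at every spatial position, i.e. a per-position linear map, which is the same kind of operation an attention layer applies to its values. A quick check of my own (not from either repo):

    import torch

    x = torch.randn(2, 8, 16, 16)                # (batch, channels, H, W)
    W = torch.randn(8, 8)                        # channel-mixing matrix, as in Glow's invertible 1x1 conv

    conv = torch.nn.functional.conv2d(x, W.view(8, 8, 1, 1))
    lin = torch.einsum('bchw,dc->bdhw', x, W)    # explicit per-pixel matrix multiply over channels

    assert torch.allclose(conv, lin, atol=1e-5)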
Claude's Summary: "Normalizing flows aren't dead, they just needed modern techniques"
My Summary: "Transformers aren't just for text"
1. SOTA likelihood on ImageNet 64×64: first ever sub-3.2 bits per dimension (definition below); the previous best was 2.99, by a hybrid diffusion model
2. Autoregressive (transformer) approach; right now diffusion is the most popular in this space (it's much faster, but a different approach)
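For reference, since the unit comes up constantly in these papers: bits per dimension is just the negative log-likelihood rescaled so models and image sizes are comparable,

    \text{BPD}(x) = \frac{-\log_2 p_\theta(x)}{D} = \frac{-\ln p_\theta(x)}{D \ln 2}

where D is the number of data dimensions (e.g. 3 x 64 x 64 for ImageNet 64×64).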
tl;dr of autoregressive vs diffusion (there are also other approaches; rough code sketch after the list)
Autoregression: step-based; generate a little, then more, then more
Diffusion: generate a lot of noise then try to clean it up
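Here's that difference in cartoon form; the model calls are placeholders just to show the shape of the two loops, not any particular library's API:

    import numpy as np

    def autoregressive_sample(model, length):
        tokens = []
        for _ in range(length):                  # generate a little, then more, then more
            nxt = model.predict_next(tokens)     # placeholder: condition on everything so far
            tokens.append(nxt)
        return tokens

    def diffusion_sample(model, shape, steps):
        x = np.random.randn(*shape)              # start from pure noise
        for t in reversed(range(steps)):         # then progressively clean it up
            x = model.denoise(x, t)              # placeholder: one denoising step
        return x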
The diffusion approach that is the baseline for sota is Flow Matching from Meta: https://arxiv.org/abs/2210.02747 -- lots of fun reading material if you throw both of these into an LLM and ask it to summarize the approaches!
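If it helps, here's a minimal sketch of a flow-matching training loss in its simplest linear-path (sigma_min = 0) form; `v` stands in for whatever network predicts the velocity field, and this is my simplification, not the paper's exact setup:

    import torch

    def flow_matching_loss(v, x1):
        # x1: a batch of real data; x0: noise; xt: a point on the straight line between them
        x0 = torch.randn_like(x1)
        t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
        xt = (1 - t) * x0 + t * x1
        target = x1 - x0                          # velocity of the linear path
        return ((v(xt, t) - target) ** 2).mean()  # regress the predicted vector field onto it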
> Diffusion: generate a lot of noise then try to clean it up
You could say this about Flows too. Their history is shared with diffusion and goes back to the Whitening Transform. Flows work via a coordinate transform, so we get an isomorphism, whereas diffusion works (for easier understanding) through a hierarchical mixture of Gaussians, which is a lossy process (it gets more confusing once we get into latent diffusion models, which are the primary type used). The goal of a Normalizing Flow is to turn your sampling distribution, which you don't have an explicit representation of, into a known probability distribution (typically normal noise, i.e. a Gaussian). So in effect, there are a lot of similarities here. I'd highly suggest learning about Flows if you want to better understand Diffusion Models.
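That "coordinate transform" is the change-of-variables formula, which is also where the exact likelihood comes from: for an invertible map f from data to the base distribution,

    \log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|

so nothing gets discarded along the way, which is the sense in which the transform is an isomorphism rather than a lossy process.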
> The diffusion approach that is the baseline for sota is Flow Matching from Meta

To be clear, Flow Matching is a Normalizing Flow. Specifically, it is a Continuous and Conditional Normalizing Flow. If you want to get into the nitty gritty, Ricky has a really good tutorial on the stuff.

StarFlow and the AR models are fixed, but DiT is being compared at different numbers of steps, and we don't really care if we generate garbage at blazing speeds[0]. Go look at Figure 10 (lol) from the DiT paper[1]; it compares FID to model sizes and sampling steps. It looks like StarFlow is comparing to DiT-XL/2-G. In [1] they do {16, 32, 64, 128, 256, 1024} steps, which corresponds to (roughly) a 10k-FID of 60, 35, 25, 22, 21, 20. Translating to StarFlow's graph, we'll guesstimate 21, 23, 50. There's a big difference between 50 and 23, but what might surprise you is that there's also a big difference between 25 and 20. Remember that this is a metric that is lower bounded, and that lower bound is not 0... You also start running into the limitations of the metric the closer you get to its lower bound, which adds another layer of complexity when comparing[2]
The images from the paper (I believe) are all at 250 steps, which StarFlow is beating at a batch size of 4. So let's look at batches and invert the data. It's imgs/sec, so let's do (1/<guesstimate of y-value>) * batch, i.e. seconds per batch. We get this (rough script after the table):
Batch   DiT    StarFlow
1       10s    20s
2       20s    30s
4       40s    30s
8       80s    30s
16      160s   30s
...
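For transparency, the table comes from something like the snippet below; the imgs/sec values are my eyeballed guesses off the throughput plot, not numbers quoted in the paper:

    # seconds to finish a whole batch = (1 / imgs_per_sec) * batch
    dit_imgs_per_sec = {1: 0.10, 2: 0.10, 4: 0.10, 8: 0.10, 16: 0.10}      # eyeballed: flat per-image rate
    sf_imgs_per_sec = {1: 0.05, 2: 0.066, 4: 0.133, 8: 0.266, 16: 0.533}   # eyeballed: rate grows with batch

    for b in (1, 2, 4, 8, 16):
        print(b, round(b / dit_imgs_per_sec[b]), round(b / sf_imgs_per_sec[b]))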
So what's happening here is that StarFlow is (roughly) invariant to the batch size while DiT is not. Obviously this won't hold forever, but DiT gets no advantage from batching. You could probably make up some of the difference by caching the model, because it looks like there's a turnover from model loading dominating to actual generation dominating, whereas StarFlow hits that turnover at batch size 2. And batching (even small batches) is going to be pretty common, especially when talking about industry. The scaling here is a huge win for them. It (roughly) costs you just as much to generate 64 images as it does 2. Worst case, you hand your customers batched outputs and they end up happier, because frankly, generating images is still an iterative process, and good luck getting the thing you want on the first shot even if you have all your parameters dialed in. So yeah, that makes a much better product.
I'll also add two things: 1) you can get WAY more compression out of Normalizing Flows, and 2) there's just a ton you can do with Flows that you can't with diffusion. The explicit density isn't helpful only for the math nerds; it's helpful for editing, concept segmentation, interpolation, interpretation, and so much more.
[0] https://tvgag.com/content/quotes/6004-jpg.jpg
[1] https://arxiv.org/abs/2212.09748
[2] Basically, place exponentially growing importance on FID gaps as the values get lower, then stop caring entirely near the floor, because the metric no longer matters there. As an example, take FFHQ-256 with FID-50k. The image quality difference between 50 and 20 is really not that big, visually. But there's a *HUGE* difference between 10 and 5. Visually, probably just as big as the difference between 5 and 3. But once you start going below 3 you really shouldn't rely on the metric anymore, and comparing a 2.5 model to a 2.7 is difficult.
I've decided to keep this thread on the front page, move the on-topic comments from that other thread to this one, and leave the rest of it in the past.
To get deterministic results, you fix the seed for your pseudorandom number generator and make sure not to execute any operations that produce different results on different hardware. There's no difference between the approaches in that respect.
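For concreteness, here's roughly what that looks like in PyTorch (one possible setup; other frameworks have equivalents):

    import os, random
    import numpy as np
    import torch

    def seed_everything(seed: int = 0):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)                   # seeds CPU and all CUDA devices
        torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic kernels
        torch.backends.cudnn.benchmark = False    # don't autotune conv algorithms per run
        # some CUDA ops additionally need this workspace setting to be deterministic
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"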