
Posted by rain1 1 day ago

How large are large language models? (gist.github.com)
253 points | 138 comments | page 2
simonw 1 day ago|
> There were projects to try to match it, but generally they operated by fine tuning things like small (70B) llama models on a bunch of GPT-3 generated texts (synthetic data - which can result in degeneration when AI outputs are fed back into AI training inputs).

That parenthetical doesn't quite work for me.

If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.

There's a paper that shows that if you very deliberately train a model on its own output in a loop, you can get worse performance. That's not what AI labs using synthetic data actually do.

That paper gets a lot of attention because the schadenfreude of models destroying themselves through eating their own tails is irresistible.

rybosome 1 day ago|
Agreed, especially when in this context of training a smaller model on a larger model’s outputs. Distillation is generally accepted as an effective technique.

This is exactly what I did in a previous role, fine-tuning Llama and Mistral models on a mix of human and GPT-4 data for a domain-specific task. Adding (good) synthetic data definitely increased the output quality for our tasks.

rain1 1 day ago||
Yes, but purely in terms of entropy, you can't make a model better than GPT-4 by training it on GPT-4 outputs. The limit you would converge towards is GPT-4.
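
Roughly, the entropy argument (a sketch in my own notation, with p as the teacher's output distribution and q_θ as the student):

    \[
      \mathbb{E}_{x \sim p}\!\left[-\log q_\theta(x)\right]
      = H(p) + D_{\mathrm{KL}}\!\left(p \,\|\, q_\theta\right)
      \ge H(p),
    \]

with equality only when q_θ = p, so maximum-likelihood training on teacher samples is optimized by matching the teacher, not surpassing it.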
simonw 1 day ago|||
A better way to think about synthetic data is to consider code. With code you can have an LLM generate code with tests, then confirm that the code compiles and the tests pass. Now you have semi-verified new code you can add to your training data, and training on that will help you get better results for code even though it was generated by a "less good" LLM.
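
A minimal sketch of that filtering loop in Python (names like llm_generate are placeholders I'm making up, not anyone's actual pipeline; the compile and pytest checks stand in for the "semi-verified" step):

    # Sketch: keep only generated code that byte-compiles and passes its own tests.
    import pathlib
    import subprocess
    import tempfile

    def llm_generate(prompt: str) -> str:
        # Placeholder: call whatever model/API you use to produce a solution
        # bundled with its tests in a single module.
        raise NotImplementedError("plug in your own LLM call")

    def passes_checks(source: str) -> bool:
        """Return True if the module byte-compiles and its bundled tests pass."""
        with tempfile.TemporaryDirectory() as tmp:
            path = pathlib.Path(tmp) / "candidate_test.py"
            path.write_text(source)
            # Compile check: does the file even parse?
            if subprocess.run(["python", "-m", "py_compile", str(path)]).returncode != 0:
                return False
            # Test check: do the bundled tests pass under pytest?
            return subprocess.run(["python", "-m", "pytest", "-q", str(path)], cwd=tmp).returncode == 0

    def build_synthetic_dataset(prompts: list[str]) -> list[str]:
        """Filter generations down to the semi-verified ones worth training on."""
        return [src for src in (llm_generate(p) for p in prompts) if passes_checks(src)]

The verification step is what makes the data worth keeping, even if the model that generated it was "less good".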
unwind 1 day ago||
Meta: The inclusion of the current year ("(2025)") in the title is strange. Even though it's in the actual title of the linked-to post, repeating it here makes me look around for the time machine controls.
bobsmooth 19 hours ago||
There have got to be tons of books that remain undigitized that can be mined for training data, haven't there?
christianqchung 1 day ago|
This is a bad article. Some of the information is wrong, and it's missing lots of context.

For example, it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth, falsely claiming that the former is stopping the latter from being released. It also claims 40B of internet text data is 10B tokens, which seems a little odd. Llama 405B was also trained on more than 15 trillion tokens[1], but the post claims only 3.67 trillion for some reason. It also doesn't mention Mistral Large, even though it's the first good European 100B+ dense model.

>The MoE arch. enabled larger models to be trained and used by more people - people without access to thousands of interconnected GPUs

You still need thousands of GPUs to train a MoE model of any actual use. This is true for inference in the sense that it's faster I guess, but even that has caveats because MoE models are less powerful than dense models of the same size, though the trade-off has apparently been worth it in many cases. You also didn't need thousands of GPUs to do inference before, even for the largest models.

The conclusion is all over the place, full of weird and incorrect implications. The title is about how big LLMs are, so why is there such a focus on token training count? There's also no mention of quantized size. This is a bad AI slop article (whoops, turns out the author said it wasn't AI generated, so it's a bad human slop article).

[1] https://ai.meta.com/blog/meta-llama-3-1/

rain1 1 day ago|
I can correct mistakes.

> it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth

I can clarify this part. I wrote 'There was a scandal as facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of llama-4 there and released a different model' which is true.

But it is inside the section about the Llama 4 model Behemoth, so I see how that could be confusing/misleading.

I could restructure that section a little to improve it.

> Llama 405B was also trained on more than 15 trillion tokens[1],

You're talking about Llama 405B instruct; I'm talking about Llama 405B base. Of course the instruct model has been trained on more tokens.

> why is there such a focus on token training count?

I tried to include the rough training token count for each model I wrote about - plus additional details about training data mixture if available. Training data is an important part of an LLM.