Posted by gpjt 9/2/2025
This all turned out to be mostly irrelevant in my subsequent programming career.
Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… cross-entropy! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.
It’s fun using this stuff again. All of it except the tensor calculus on curved spacetime; I haven’t had to reach for that yet.
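To make the “everything is matrix multiplications and it’s all differentiable” point concrete, here is a minimal PyTorch sketch (the shapes, batch size, and learning rate are invented purely for illustration, not taken from any real model): one matrix multiplication, a cross-entropy loss, and autograd doing the tensor calculus for a single gradient step.

    import torch

    # Toy "model": one matrix multiplication mapping 8-dim inputs to 4 classes.
    W = torch.randn(8, 4, requires_grad=True)

    x = torch.randn(16, 8)                # a batch of 16 made-up inputs
    targets = torch.randint(0, 4, (16,))  # made-up class labels

    logits = x @ W                        # the matrix multiplication
    loss = torch.nn.functional.cross_entropy(logits, targets)

    loss.backward()            # autograd: the big tensor calculus calculation
    with torch.no_grad():
        W -= 0.1 * W.grad      # one gradient-descent step on the loss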
The intro says that it "...serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other hand, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented."
Working all the way through that, besides relearning a lot of my undergrad EE math (some time in the previous century), I learned a whole new bunch of differential geometry that will help next time I open a General Relativity book for fun.
Thank you for sharing this paper!
https://www.khanacademy.org/math/linear-algebra
And any prereqs you need. I also find the math-is-fun site to be excellent when I need to brush up on something from long ago and want a concise explanation, i.e. a 10-minute review: more than a few pithy sentences, yet less than a dozen-hour diatribe.
The link is broken though and you may want to remove the `:` at the end.
Thanks Andrej for the time and effort you put into your videos.
https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9Gv...
[0] https://www.manning.com/books/build-a-large-language-model-f...
For the man in the street, inclined to view "AI" as some kind of artificial brain or sentient thing, the best explanation is that basically it's just matching inputs to training samples and regurgitating continuations. Not totally accurate of course, but for that audience at least it gives a good idea and is something they can understand, and perhaps gives them some insight into what it is, how it works/fails, and that it is NOT some scary sentient computer thingy.
For anyone in the remaining 1% (or much less: people who actually understand ANNs and machine learning), learning about the Transformer architecture and how a trained Transformer works (induction heads etc.) is what they need in order to understand what a (Transformer-based, as opposed to LSTM-based) LLM is and how it works.
Knowing about the "math" of Transformers/ANNs is only relevant to people who are actually implementing them from the ground up, not even to those who might just want to build one using PyTorch or some other framework/library where the math has already been done for you.
Finally, embeddings aren't about math - they are about representation, which is certainly important to understanding how Transformers and other ANNs work, but still a different topic.
* US population of ~300M has ~1M software developers, a large fraction of whom are going to be doing things like web development, and only at a marginal advantage over someone smart outside of development in terms of learning how ANNs etc. work.
An LLM is, at the end of the day, a next-word predictor, trying to predict according to training samples. We all understand that it's the depth/sophistication of context pattern matching that makes "stochastic parrot" an inadequate way to describe an LLM, but conceptually it is still more right than wrong, and it is the base level of understanding you need before beginning to understand why it is inadequate.
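As a rough sketch of what "next-word predictor" means mechanically (the vocabulary and the logits below are invented for illustration, not taken from any real model): the network assigns a score to every vocabulary item, softmax turns those scores into a probability distribution, and the next token is sampled from it.

    import torch

    vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary

    # Pretend a trained model produced these scores for the next token
    # given some context.
    logits = torch.tensor([0.1, 0.2, 0.1, 0.3, 4.0])

    probs = torch.softmax(logits, dim=0)          # probability distribution
    next_id = torch.multinomial(probs, 1).item()  # sample the next token

    print(dict(zip(vocab, probs.tolist())))
    print("predicted next token:", vocab[next_id])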
I think it's better for a non-technical person to understand "AI" as a stochastic parrot than to have zero understanding and think of it as a black box, or sentient computer, especially if that makes them afraid of it.
There's no magic here. Most of people's awestruck reactions are due to our brain's own pattern recognition abilities and our association of language use with intelligence. But there's really no intelligence here at all, just like the "face on Mars" is just a random feature of a desert planet's landscape, not an intelligent life form.
Do we understand the emergent properties of almost-intelligence they appear to present, and what that means about them and us, etc. etc.?
No.
And it happens to do something weirdly useful to our own minds based on the values in the registers.
The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think on them, can plan and act on the plan, can reason up to a point, in order to generate the next token. They learn all of this via that training scheme. They don't just learn the frequency of word relationships, unlike the old algorithms. Trillions of parameters do much more than that.
This sounds way over-blown to me. What we know is that LLMs generate sequences of tokens, and they do this by clever ways of processing the textual output of millions of humans.
You say that, in addition to this, LLMs model the world, understand, plan, think, etc.
I think it can look like that, because LLMs are averaging the behaviours of humans who are actually modelling, understanding, thinking, etc.
Why do you think that this behaviour is more than simply averaging the outputs of millions of humans who understand, think, plan, etc.?
This is why it’s important to make the distinction that Machine Learning is a different field than Statistics. Machine Learning models do not “average” anything. They learn to generalize. Deep Learning models can handle edge cases and unseen inputs very well.
In addition to that, OpenAI etc. probably use a specific post-training step (like RLHF or better) for planning, reasoning, following instructions step by step etc. This additional step doesn’t depend on the outputs of millions of humans.
An LLM is a language model, not a world model. It has never once had the opportunity to interact with the real world and see how it responds - to emit some sequence of words (the only type of action it is capable of generating), predict what will happen as a result, and see if it was correct.
During training the LLM will presumably have been exposed to some second-hand accounts (as well as fictional stories) of how the world works, mixed up with sections of Stack Overflow code and Reddit rantings, but even those occasional accounts of real-world interactions (context, action + result) only teach it, at best, about the context that someone else, at that point in their life, saw as relevant to mention as causal to the action's outcome. The LLM isn't even privy to the world model of the raconteur (let alone the actual complete real-world context in which the action was taken, or the detailed manner in which it was performed), so this is a massively impoverished source of second-hand experience from which to learn.
It would be like someone who had spent their whole life locked in a windowless room reading randomly ordered paragraphs from other people's diaries of daily experience (also randomly interspersed with chunks of fairy tales and Python code), without themselves ever having actually seen a tree or jumped in a lake, or ever having had the chance to test which parts of the mental model they had built of what was being described were actually correct or not, and how it aligned with the real outside world they had never laid eyes on.
When someone builds an AGI capable of continual learning and sets it loose in the world to interact with it, then it'll be reasonable to say it has its own model of how the world works. But as far as pre-trained language models go, it seems closer to the mark to say that they are indeed just language models, modelling the world of words which is all they know, and the only kind of model for which they had access to feedback (next-word prediction errors) to build.
Given the widely different natures of a theoretical "book smart" model vs a hands-on model informed by the dynamics of the real world and how it responds to your own actions, it doesn't seem useful to call these the same thing.
For sure the LLM has, in effect, some sort of distributed statistical model of its training material, but this is not the same as knowledge represented by someone/something that has hands-on world knowledge. You wouldn't train an autonomous car to drive by giving it an instruction manual and stories of people's near-miss experiences - you'd train it in a simulator (or better yet the real world), where it can learn a real world model - a model of the world you want it to know about and be effective in, not a WORD model of how drivers are likely to describe their encounters with black ice and deer on the road.
> The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think on them, can plan and act on the plan, can reason up to a point, in order to generate the next token.
You replied:
> How can an LLM model the world, in any meaningful way, when it has no experience of the world?
> An LLM is a language model, not a world model.
No one in this discussion has claimed that LLMs are effective general-purpose agents, able to throw a curve ball or drive a vehicle. The claim is that they do model the world in a meaningful sense.
You may be able to make a case for that being false, but the assumption that direct experience is required to form a model of a certain domain is not an assumption we make of humans. Some domains, such as mathematics, can only be accessed through abstract reasoning, but it's clear that mathematicians form models of mathematical objects and domains that cannot be directly experienced.
I feel like you are arguing against a claim much stronger than what is being made. No one is arguing that LLMs understand the world in the same way humans do. But they do form models of the world.
I think "The Platonic Representation Hypothesis" is also related: https://phillipi.github.io/prh/
Unfortunately, large LLMs like ChatGPT and Claude are black boxes for researchers, who can't probe what is going on inside them.
* So we find ourselves over and over again explaining that that might have been true once, but now there are (imperfect, messy, weird) models of large parts of the world inside that neural network.
* At the same time, the vector embedding math is still useful to learn if you want to get into LLMs. It’s just that the conclusions people draw from the architecture are often wrong.
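For anyone curious what that embedding math looks like in practice, here is a minimal sketch (the words and the 4-dimensional vectors are invented for illustration; a real model learns much higher-dimensional ones): each token maps to a vector, and similarity between tokens is just the cosine of the angle between those vectors.

    import torch
    import torch.nn.functional as F

    # Invented embeddings for three words (a real model learns these).
    emb = {
        "king":  torch.tensor([0.8, 0.1, 0.9, 0.2]),
        "queen": torch.tensor([0.7, 0.2, 0.9, 0.3]),
        "sofa":  torch.tensor([0.1, 0.9, 0.0, 0.8]),
    }

    def sim(a, b):
        # Cosine similarity: close to 1.0 means the vectors point the same way.
        return F.cosine_similarity(emb[a], emb[b], dim=0).item()

    print(sim("king", "queen"))  # high
    print(sim("king", "sofa"))   # low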
Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.
This is as if you started explaining how an ICE car works by diving into chemical properties of petrol. Yeah that really is the basis of it all, but no it is not where you start explaining how a car works.
But wouldn't explaining the chemistry actually be acceptable if the title was, "The chemistry you need to start understanding Internal Combustion Engines"
That's analogous to what the author did. The title was "The maths ..." -- and then the body of the article fulfills the title by explaining the math relevant to LLMs.
It seems like you wished the author wrote a different article that doesn't match the title.
You don't need that math to start understanding LLMs. In fact, I'd argue it's harmful to start there unless your goal is to 'take me on an epic journey of all the things mankind needed to figure out to make LLMs work from the absolute basics'.
Maybe this is the target group of people who would need particular "maths" to start understanding LLMs.
All that is kind of missing the point though. I think people being curious and sharpening their mental models of technology is generally a good thing. If you didn't know an LLM was a bunch of linear algebra, you might have some distorted views of what it can or can't accomplish.
Also: nobody who wants to run LLMs will write their own matrix multiplications. Nobody doing ML/AI comes close to that stuff ... it's all abstracted and not something anyone actually thinks about (except the few people who actually write the underlying libraries, e.g. at Nvidia).
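To illustrate what "abstracted away" means here, a toy sketch (nothing more than stock PyTorch, with invented sizes): the matrix multiplication is hidden behind nn.Linear, and the explicit matmul is spelled out only for comparison.

    import torch
    import torch.nn as nn

    layer = nn.Linear(8, 4)   # what practitioners actually write
    x = torch.randn(2, 8)

    y_framework = layer(x)                      # the library does the matmul
    y_manual = x @ layer.weight.T + layer.bias  # the same thing, spelled out

    print(torch.allclose(y_framework, y_manual))  # True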
Is the barrier to entry to the ML/AI field really that low? I think no one seasoned would consider fundamental linear algebra 'low level' math.
The barrier to entry is probably epically high, because to be actually useful you need to understand how to actually train a model in practice, how it is actually designed, and how existing practices (e.g. at OpenAI or wherever) can be built upon further ... and you need to be cutting edge at all of those things. This is not taught anywhere; you can't read about it in some book. It has absolutely nothing to do with linear algebra, or more accurately, you don't get better at those things by understanding linear algebra (or any math) better than the next guy. It is not as if 'if I were better at math, I would have been a better AI researcher or programmer or whatever' :-). This is just not what these people do or how that process works. Even the foundational research that sparked rapid LLM development (the 'Attention Is All You Need' paper) is not some math-heavy stuff. The whole thing is a conceptual idea that was tested and turned out to be spectacular.
This is the first time I've seen someone claim this. I don't know if it's a display of anti-intellectualism or plain ignorance. OTOH, most AI/ML papers' quality has deteriorated so much over the years that publications in different venues are essentially beautified PyTorch notebooks by people who just play around randomly with different parameters.
It appears to me some people have this special kind of naïveté about how foundational knowledge such as math actually gets used in practical applications. In practice, it just gets used (in software, usually through some library) and never gets thought about again. They are not trying to invent new ways to do math; they are trying to invent AI :-).
Also, those people understand LLMs already :-).
Most people’s educations right here probably didn’t even involve Linear Algebra (this is a bold claim, because the assumption is that everyone here is highly educated, no cap).
At some point I tried to create an introduction step-by-step, where people can interact with these concepts and see how to express it in PyTorch:
https://github.com/stared/thinking-in-tensors-writing-in-pyt...
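In that spirit, a minimal example of the kind of first step such an introduction can take (just stock PyTorch, not code from the repo itself): a scalar function, its gradient via autograd, and the same gradient checked by hand.

    import torch

    x = torch.tensor(3.0, requires_grad=True)

    y = x ** 2 + 2 * x   # y = x^2 + 2x
    y.backward()         # autograd computes dy/dx

    print(x.grad)        # tensor(8.) -- matches 2*x + 2 at x = 3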
Then there is being able to work at different levels of abstraction and to find analogies. But at this point, in my understanding, "understanding" is a never-ending well.
[0] https://www.coursera.org/specializations/mathematics-for-mac... [1] https://www.manning.com/books/math-and-architectures-of-deep...