I'm a developer but not very good at maths and I still don't understand any of it.
An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Recently I wanted a checkbox with a gradient flowing around its edge. It figured out it could use a radial gradient from the center of the checkbox and overlay that with a small inner div, so you only see the edge and it looks like the gradient is circling around the checkbox.
How is that "predicting the next word"?
Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".
What I mean is, the LLM is able to represent things in space. That part I don't understand.
I also still don't understand the relationship between the chat-based LLM and the multi-modal stuff. I think I read somewhere that when an image is generated it is also tokens?
At all times the LLM is, indeed, predicting the next token. Anything it does emerges from that.
It did not "figure anything out". It predicted that text describing the use of a radial gradient was likely to follow text describing your problem.
The point is that saying they're just "predicting the next token" is neither explanatory nor insightful. Saying the brain is just firing action potentials gives you no understanding of how the brain does what it does or what the space of its capabilities is. Similarly, "predicting the next token" tells you nothing about the capabilities of LLMs.
If you train the LLM on a corpus that shows people saying the sky is red, you get an LLM that is predisposed to say the sky is red. This is true even if it's also trained on all of the science that explains how and why the sky is blue.
If it were to "figure out" or "reason", it would not have such a predisposition to emit "red" after "the sky is" just because that matches the reward during training.
In other words, the token prediction is important because it both explains the successes AND the failures of the LLM. If there were situations in which a bird could fail to fly, then how it tried to fly would also be crucial knowledge.
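To make the "sky is red" point concrete, here is a minimal sketch that just counts which word follows a three-word context. It is a counting model, not a neural network, and the corpus, `train_next_word_counts` and `next_word_probs` are made up for illustration. The assumption is only that a trained LLM's next-token distribution is pulled toward training frequencies in a loosely similar way.

```python
from collections import Counter, defaultdict

def train_next_word_counts(corpus, context_len=3):
    """Count which word follows each context of `context_len` words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(context_len, len(words)):
            context = tuple(words[i - context_len:i])
            counts[context][words[i]] += 1
    return counts

def next_word_probs(counts, context):
    """Turn raw follower counts into a probability distribution."""
    followers = counts[tuple(context)]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

# Hypothetical corpus: one correct explanation, three repetitions of the false claim.
corpus = [
    "the sky is blue because of rayleigh scattering",
    "the sky is red",
    "the sky is red",
    "the sky is red",
]

counts = train_next_word_counts(corpus)
probs = next_word_probs(counts, ["the", "sky", "is"])
print(probs)
```

The explanation of why the sky is blue is in the corpus, but the prediction after "the sky is" still leans toward "red" because that continuation is simply more frequent.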
You're caught up on the mechanics of token processing (floating point matrix ALU math) and ignoring the context that p(next token) as a function being "computed" is doing so over a trillion parameters. You can poorly train a model, sure, but assuming you don't indoctrinate it too much, properties like cognition emerge - it learns to reason; why? Reasoning is more efficient and compact than memorizing answers.
Why do you think this is mutually exclusive with "LLM predicts the next token"?
If you tell someone from the 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason they can't.
If you tell someone without a math background that sums of smaller and smaller sine waves can approximate pretty much any signal, they might be really confused. But there is no reason they can't.
There is simply no reason that a next-token predictor can't generate a nice-looking checkbox.
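For anyone curious about the sine-wave claim, a quick sketch: the standard Fourier series of a square wave is $\frac{4}{\pi}\sum_{k\ \mathrm{odd}} \frac{\sin(kx)}{k}$, and adding more terms gets closer to the flat value +1. The function name and the choice of test point are mine, purely for illustration.

```python
import math

def square_wave_partial_sum(x, n_terms):
    """Sum the first n_terms odd harmonics of the square-wave Fourier series."""
    total = 0.0
    for i in range(n_terms):
        k = 2 * i + 1  # odd harmonics only: 1, 3, 5, ...
        total += math.sin(k * x) / k
    return 4.0 / math.pi * total

x = math.pi / 2  # a point where the ideal square wave equals +1
errors = [abs(1.0 - square_wave_partial_sum(x, n)) for n in (1, 10, 100, 1000)]
print(errors)  # the gap to +1 shrinks as more sine waves are added
```

Simple building blocks, summed in large numbers, approximate a shape that looks nothing like any one of them.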
Multi-modal models that can understand visual input do exist, but no such visual reasoning process happened in the example you mentioned. Not unless you have a visual feedback loop in the coding harness.
I'm not dismissing the capability of "predicting the next word" however. The vast amount of training data enables the extremely complex and useful behavior you just described.
For instance, I've written a few custom languages to learn how to write a VM and the lexer/parser/compiler/etc. It had never seen them before, simply because I made them and it had never been trained on them. I just gave it the syntax, which is different from anything it had seen.
After giving it my documentation, it was able to write code in the language just as it would in a language it had been trained on. I've also seen this behavior at work, where there are weird, definitely non-standard quirks to how things are done, and it can handle them.
But I think it will have difficulty crossing paradigm boundaries using documentation alone.
The exact syntax does not matter, only the grammar. If you give it the grammar, and then the keywords, it can find something that has similar grammar and then use your keywords.
As a for instance, back in the day some academics wrote a paper that compared GPT 3.5 to a couple of inductive programming systems (including one of mine) on solving programming problems in a certain well-known esoteric language which I shall call "L". The task was to solve those programming problems one-shot. The authors asserted that the "L" problem sets were unlikely to be in 3.5's training set, but I found them without much search in a public github repo. I mean the entire dataset was right there. In this case the researchers are colleagues and friends and I know they weren't simply negligent or malicious, they just missed the fact that their "unlikely to be in the training set" data was on the web.
So I'd always assume that if an LLM can perform a task that's because it's seen examples of the task during its training.
Without forgetting that LLMs have this really shockingly powerful ability to interpolate between examples and they can improve their performance on say Task A by training on Task B, where A and B are different but similar.
e.g. they seem to get better at translating between language pairs of which they have few examples of parallel text by training on other pairs of languages for which they have more parallel text; they seem to learn something about language translation in general by training on more examples of translation. I haven't got a good reference on that handy but it's well-known (and of course over-hyped and exaggerated by tech CEOs).
So without wanting to diminish your work, I'd guess that your new language's syntax is different and novel but everything else about it is more ordinary and the similarities are such that an LLM can wing it and write you a lexer etc. After all, the whole point about parser generators and similar tools is that the task can be abstracted and separated from syntax in the first place.
In fact LLMs are very good at that sort of thing, filling in the blanks as it were. I'm old enough to remember the excitement about GPT 3.5 being able to form syntactically correct sentences with nonsensical words given to it.
For example, I just asked Chat [1]:
Hey chat. The gostak distims the doshes. What happens to the doshes?
And it promptly answered: The doshes get distimmed.
See, it even got the spelling right!
_________________
[1] https://chatgpt.com/c/6a242b65-e248-83ed-9a6e-f238a1e871b6
Emergent properties of complex systems should not be diminished just because the underlying operating principle is simple.
It is imitating the text written by humans who can represent things in space.
If I can do my best to answer, Gemini is a multi-modal system. That means it's trained not only on text but also still images, video and also sound. The training happens in parallel and the representation of each modality is usually different, so the image recognition part is not trained on text tokens but pixels, the video part (probably) on video frames etc. There is some kind of integrated training that goes on so that text can be generated that is correlated to an image and so on, but I don't know the specifics about Gemini in particular. This kind of thing is not exactly new either, you can find systems that captioned images before the rise of LLMs simply by training on examples of images coupled to their textual descriptions.
In that sense it's not entirely correct to call Gemini an "LLM" because it's not only a "language" (or, more precisely, text) model. But LLM I guess becomes a bit of a shorthand for everything based on, or combined with, an LLM.
Anyway that's what's going on: it's not just predicting the next word. It's also predicting the next image frame or the next set of pixels etc associated with the next word.
It has read all of stackoverflow, so it has seen your kind of problem before. Try asking it something really unusual and it will shit the bed.
Can stochastic parrots understand irony?
No, they generate grammatically coherent text. That is because human language grammars are fundamentally mathematical structures that can be approximated with matrix operations.
They don't generate meaningful text because they have no inherent knowledge of the world.
If you've used LLMs for any amount of time you've already noticed how often they get confused about numeric quantities - like confusing notions of "bigger than" and "less than" or being unable to count letters in words.
This is because any meaning in their output is only accidental.
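    /* Pseudocode: decode(), token_to_text(), context and position are stand-ins, not real llama.cpp functions. */
    int n_tokens = 0;
    while (n_tokens < TOKENS_MAX) {
        /* Run the model over the context so far and get one more token. */
        int next_token = decode(context, ++position);
        print(token_to_text(next_token));
        ++n_tokens;
    }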
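A runnable toy version of the same loop, as a sketch only. Here `decode` is a hypothetical lookup table keyed on the last token. A real model would run the whole network over the full context and sample from a probability distribution instead.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "doshes",
    "doshes": "get",
    "get": "distimmed",
    "distimmed": "<end>",
}

def decode(context):
    """Return the predicted next token given everything generated so far."""
    return NEXT_TOKEN.get(context[-1], "<end>")

def generate(prompt, tokens_max=16):
    """Repeatedly predict one token and feed it back in as input."""
    context = list(prompt)
    output = []
    while len(output) < tokens_max:
        next_token = decode(context)
        if next_token == "<end>":
            break
        output.append(next_token)
        context.append(next_token)  # the prediction becomes part of the next input
    return output

generated = generate(["<start>"])
print(" ".join(generated))
```

The only thing the loop itself does is call the predictor and append the result. Whatever the model can or cannot do lives inside `decode`.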
If you don't believe me then just download llama.cpp and see for yourself.
But how does it learn this token-relationship?
All it has is many text samples, but still, nowhere does it say how the tokens relate to each other, so where does this information come from?
The model could just as well learn to predict next token from gibberish text as long as there were some statistical gibberish regularities to learn. However, if you train it on real meaningful text then the statistical regularities it needs to learn (and will, thanks to gradient descent, and the capable architecture) will be those reflecting "token relationships" - grammar, semantics, etc.
So, you can say the "token relationships" (incl word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities whatever they may be.
You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of words comes from how they are used. To a first approximation, that can be implemented by treating a word's meaning as defined by the words it appears next to(!), which is what the Word2Vec embedding training algorithm does. Famous examples such as "(king - man) + woman = queen" show that this is in fact learning the meanings of words.
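The arithmetic behind the "king - man + woman" example can be sketched like this. Big caveat: the vectors below are hand-made by me (three made-up dimensions for royalty, maleness and femaleness), not learned. Real Word2Vec vectors come out of training on co-occurrence and have hundreds of dimensions. `EMBEDDINGS`, `cosine` and `analogy` are illustrative names.

```python
import math

# Hand-made toy "embeddings": (royalty, maleness, femaleness). Not learned.
EMBEDDINGS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = (w for w in EMBEDDINGS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

result = analogy("king", "man", "woman")
print(result)
```

This only shows what the vector arithmetic does once such vectors exist. The interesting part of Word2Vec is that training produces vectors with this kind of structure without anyone labelling the dimensions.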
It's the same thing here: you start from random token-relationship values, nudge them, and the ones which are slightly better will be favoured.
it goes all over the place.
i'm not actually sure who your target audience is.
there's too many side tangents.
just like, structure it plz.
1. customer feels bad cuz they don't understand how llms work
2. provide high level abstracted explanation (don't dive into concepts yet)
3. provide breakdown guide of overall set of components.
4. walk through each component. don't side track. no need to explain RoPE, GQA etc... it just distracts.
i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.
at a high level llms take in words, do some math on them, and then produce words, one by one.
inside llms have these different components. we walk through them step by step.
1. tokenizer
2. embedding
3. attention
4. heads
5. ffn
6. sampling
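to show what i mean by a breakdown, here's a rough sketch of how the six components above could fit together in one prediction step. everything in it is a toy assumption: the vocabulary, the random untrained weights, a single attention head with no learned projections, and a bare ReLU standing in for the ffn. a real model learns all of these and stacks many layers.

```python
import math
import random

VOCAB = ["the", "gostak", "distims", "doshes", "get", "distimmed"]
DIM = 4
rng = random.Random(0)  # fixed seed so the untrained weights are reproducible

# 1. tokenizer: text -> token ids (real tokenizers split into subwords;
#    this toy raises ValueError on words outside VOCAB)
def tokenize(text):
    return [VOCAB.index(word) for word in text.lower().split()]

# 2. embedding: token id -> vector (random here, learned in a real model)
EMBED = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# 3./4. attention, one head: the last position takes a weighted mix of all positions
def attend(vectors):
    query = vectors[-1]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in vectors]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(DIM)]

# 5. ffn: a nonlinearity applied to the mixed vector (a real ffn has learned layers)
def ffn(vector):
    return [max(0.0, x) for x in vector]

# 6. sampling: score every vocabulary entry and turn the scores into probabilities
def next_token_probs(text):
    vectors = [EMBED[t] for t in tokenize(text)]
    hidden = ffn(attend(vectors))
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]
    return softmax(logits)

probs = next_token_probs("the gostak distims the doshes")
predicted = VOCAB[max(range(len(VOCAB)), key=lambda i: probs[i])]  # greedy pick
print(predicted, probs)
```

the weights are random so the predicted word is meaningless. the point is only the shape of the pipeline: words in, some math, a probability for every possible next token out.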
## tokenizer