How LLMs work - Hacker News

Posted by 0xkato 2 days ago

680 points | 188 comments

malwrar 13 hours ago|

Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step. I’d puzzle about each step along the way, and the triumph of completing the drawing came with a sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.

This is to say: the autoregressive decoder-only transformer llm architecture as pioneered by openai is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (uses video + handcrafted math to produce 3d mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.

This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.

ekunazanu 9 hours ago||

> This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities

Basically, the bitter lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...

williamstein 1 hour ago|||

This interview https://youtu.be/oWOz2htozfI?si=qdQ0uZRoZOYeThOn from 2 days ago with a top researcher from OpenAI directly addresses the bitter lesson argument and the importance of scaling for the history of their models.

xnx 1 hour ago|||

Isn't the bitter lesson basically the same as "The Unreasonable Effectiveness of Data" from 2009?

jfim 12 hours ago|||

Indeed. It's pretty interesting to realize after implementing GPT-2 that the frontier models are scaled up versions of that, with various tweaks to improve performance, model-wise.

The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.

gobdovan 10 hours ago|||

The secret sauce is also having the necessary 'creativity' to not get ceased and desisted into oblivion and jail from all the copyrighted material you trained your model on. Btw, not making a moral judgement, [0] shows Michael and Dalton from YC discussing why Ilya Sutskever had to leave Google to pursue what's now ChatGPT

[0] https://youtu.be/E8pvgN1j-Ck?t=748

root-parent 3 hours ago|||

There is a whole moral judgement to be made here...let's hope Ilya won't get too pissed off if somebody leaks the work of his new initiative...information wants to be free and all that...

Also would love to know if the same Legal team advised on Gemini...

someguyiguess 3 hours ago||||

And to make anyone who threatened to expose them “commit suicide”

miltonlost 2 hours ago|||

He's a massive massive thief that people who have stolen far less from a convenience store have gone to prison for. The man is a villain.

achrono 11 hours ago||||

How do we know that today's frontier models are merely scaled up versions of that? Genuine question, since the labs have narrowed what they share over the years to now almost nothing, in terms of how the model was trained and how it works under the hood.

HarHarVeryFunny 3 hours ago|||

We know for sure the architecture of the open weights models since llama.cpp understands the architecture it needs to build to plug the weights into to run them. It's always possible that the latest closed model is doing something architecturally different than the open weights ones we know about, but judging by how close the large open weight models such as DeepSeek are to SOTA performance, this seems unlikely. When OpenAI first came out with their near-mythical "Strawberry" (aka "o1") thinking model there was all sorts of speculation that they had made some sort of architectural breakthrough, but then DeepSeek replicated the capability and published how they did it, proving that it was just better training, not any architectural change.

There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks such as various types of attention (some pioneered in the open by DeepSeek) that better scale to large context lengths, and the confusingly named "mixture of experts" architecture, but what's more notable really is how little the architecture has changed. The capability gains have been coming from better training and better data.

gobdovan 10 hours ago||||

DeepSeek research:

- V3 https://arxiv.org/abs/2412.19437

- V2 https://arxiv.org/abs/2405.04434

- R1 https://arxiv.org/abs/2501.12948 (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)

Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's no one cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.

Of course, it's a bit of a verbal compression to claim simply 'scaled up'. They are recognisably scaled-up transformers, and most new models come with a few tricks, but we're at the point where those tricks are usually not an architectural rewrite; they are added to solve an explicit problem, like hallucination, not for big new capability gains.

matusp 10 hours ago||||

There are thousands of people working in top level labs. Somebody would leak it

ai_slop_hater 11 hours ago|||

No they are clearly not just scaled up versions of gpt 2; there are different LLM architectures like mixture of experts etc that appeared relatively recently. I am not an expert though, far from it.

otabdeveloper4 11 hours ago||

MoE and such are basically performance enhancements, they don't make the model smarter.

jmalicki 4 hours ago|||

Performance enhancements are huge though.

If you can make the existing model faster, you can then save your inference budget to then make your model bigger, which then makes it smarter.

A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.

otabdeveloper4 3 hours ago|||

> to then make your model bigger, which then makes it smarter

There's diminishing returns and at some point making a model bigger makes it dumber.

TheHalfDeafChef 4 hours ago|||

Not really “smarter” though? It’s just a big probability engine.

(Not trying to flame bait or anything. I just wouldn’t call LLM as exhibiting intelligence. It is great at making connections based on probability but doesn’t have a semantic understanding of what it is doing)

yababa_y 10 hours ago|||

separately trained experts can surpass performance in their activated regime and DOES result in a smarter model, the Claude system cards talk about this and eg there is https://openreview.net/forum?id=iydmH9boLb to read...

locknitpicker 2 hours ago|||

> The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.

ReAct loops and tool-calling are the critical development feature. They turn a model from something that generates text into something that can independently influence the world around them.

Without agent features, you have just a chatbot.

antirez 9 hours ago|||

There is a different way to look at this: the Transformer is actually a minimal complication of what the base model is. In theory the neural network could be just a huge FFN, which is anyway the part of the Transformer that does the heavy lifting. But this would be impossible to train both numerically and computationally, so the Transformer encodes enough priors for it to work: the causal attention, and math tricks like the residuals and so forth. But the bottom line of all this is that the Transformer works because of the incredible semantical power of simple/huge FFNs.

dist-epoch 6 hours ago|||

Isn't that over-simplifying it a bit too much?

You can go another step - a FFN can be simulated on a Turing machine, thus it just exemplifies the incredible semantical power of the Turing machine model of computation. (in fact you don't even need a Turing machine, since there is no looping in one forward pass).

In theory you can run a huge FFN on the tiniest Turing machine, in practice it's much better to run a Transformer on the latest NVIDIA hardware. Or as they say "quantity (performance) has a quality all its own"

musebox35 5 hours ago|||

I was about to post your last point / quote. Going multi-GPU is relatively not so tough, but once you go multi-node you have a distributed storage/IO/compute system, which is highly non-trivial. Add the long training times and now you have robustness/fault-tolerance concerns with hardware failures and restarts. Today’s training systems are engineering marvels.

zbendefy 6 hours ago|||

Good point!

There is also the case for Markov chains being theoretically able to do these if tuned well. Or even SAT problem.

CGMthrowaway 2 hours ago||

"LLM is just fancy autocomplete"

slickytail 7 hours ago|||

[dead]

forestsitter 1 hour ago|||

Same. I recall reading a paper by Stephen Wolfram after ChatGPT came out where he goes over how it works and what it does. Such a good piece and really got me going with this stuff. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

crossroadsguy 9 hours ago|||

What hopes/paths does a mere CS bachelor (not deep into stats/maths), and mid level dev (native mobile only; 10-15 years exp.), have about not only understanding it (maybe not fully) but getting possibly into this as a career? Not expecting churning out models and AI systems from the first weeks/months but entry/employment into this field?

(If I can be honest, and I am not being disparaging about anything lest it might seem so, I am looking at it from a career breakthrough/move perspective rather than an intellectual pursuit.)

malwrar 43 minutes ago|||

I’m also a mere mortal, and after putting a few years into it IMO I’d say people make it much more complicated than it actually is. I failed most of my math courses for lack of interest, but found passion later with the aforementioned SLAM stuff. I have no doubt you or any other programmer could learn this stuff, especially since you can ask ChatGPT clarifying questions.

I have no idea about careers at this point, I’m still doing fancy IT work as my day job and I look away from the future with dread. I also haven’t been looking for new roles on the open job market, so who knows, maybe there’s multimillion pay packages for anyone who can articulate how attention works in an interview.

2muchcoffeeman 7 hours ago||||

I think you need to ask what you actually want to do with the AI.

If you want to be a researcher and come out with the next breakthrough, get ready to go back to school and learn some math.

If you just need to learn how to use it well and build things with it, then you probably just need to have a high level understanding.

Same as programming. I’d bet most programmers have no idea about the physics that makes computers work.

bluerooibos 5 hours ago|||

> I think you need to ask what you actually want to do with the AI.

What about improving the efficiency of token consumption, etc., basically opportunities for improving cost/performance?

I keep thinking there has to be a better way to share context with models than dumping entire gigantic skill files of raw text or otherwise into them - I'm betting there's a bunch of low-hanging fruit there.

coliveira 4 hours ago||

There may be some low hanging fruit, but they're not available to people without deep understanding of how the math works. Well paid people already spend a lot of time thinking about this.

sirsinsalot 7 hours ago|||

You missed the third and most important reason to learn: fun.

Which sums up HN these days.

LatencyKills 4 hours ago|||

I have a BS in CS (and have been in the field for 25 years). I couldn't understand the transformer architecture until I built a few myself. Here are the books I worked through. I now feel I have a very good understanding of modern LLMs.

https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...

https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...

root-parent 3 hours ago|||

I had the same reaction as you, when I learned in detail, how all this works. But then I also learned about superposition and compressed sensing, and now...I am not so sure anymore...

"Beating Nyquist with Compressed Sensing" - https://youtu.be/A8W1I3mtjp8

wuschel 11 hours ago|||

Could you perhaps cite the core papers for LLMs beyond „Attention is all you need“?

sigmoid10 11 hours ago|||

"Attention is all you need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture while modern LLMs are decoder only. So it's an unnecessarily hard way to get into the subject. "Language Models are Unsupervised Multitask Learners" is probably what you are looking for (aka the GPT-2 paper). This was the first time LLMs really showed what is possible, i.e. they can learn to generalize very well from unstructured data. So no more human labelling necessary, which until then was the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today. This also highlights that there was more to it than just "scaling the transformer algorithm" like many people claim. Most developments since then were about improving training data, until "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference time compute (whatever you want to call it nowadays) are actually all about training again. They work using the exact same architecture.

redox99 8 hours ago||

Chain of Thought was kind of an obvious solution that everybody knew was necessary by the time chatgpt / gpt4 came out. It was just a matter of time that frontier labs actually shipped it.

MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably.

The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well.
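For anyone wondering what "1/32 active parameters" looks like mechanically, here is a rough sketch of sparse routing in plain Python. Everything in it is made up for illustration: the sizes, the `router` weights, and the scalar "experts" (`expert_scale`) stand in for real learned FFN blocks, and real models typically route each token to a few experts rather than exactly one.

```python
import math
import random

NUM_EXPERTS, TOP_K, DIM = 32, 1, 4
rng = random.Random(0)

# Hypothetical toy parameters: one router row and one scalar "expert" per slot.
router = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
expert_scale = [rng.uniform(0.5, 1.5) for _ in range(NUM_EXPERTS)]

def moe_layer(x):
    """Route x to the TOP_K highest-scoring experts and mix their outputs."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    # Softmax over the chosen experts only, so the gate weights sum to 1.
    m = max(scores[e] for e in chosen)
    exps = [math.exp(scores[e] - m) for e in chosen]
    gates = [v / sum(exps) for v in exps]
    out = [0.0] * DIM
    for e, g in zip(chosen, gates):
        # Only the chosen experts run; the others are never evaluated for this token.
        out = [o + g * expert_scale[e] * xi for o, xi in zip(out, x)]
    return out, chosen

y, used = moe_layer([0.2, -0.1, 0.7, 0.4])
print(f"experts used: {used}, active fraction: {len(used) / NUM_EXPERTS}")
```

The point of the design is that total parameter count grows with the number of experts while per-token compute grows only with `TOP_K`.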

blackbear_ 10 hours ago||||

The GPT3 paper is a good starting point

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471

sharma-arjun 10 hours ago|||

Not a core paper, but I found Formal Algorithms for Transformers [1] (a Google paper from 2022) to have a great pedagogical style.

[1] https://arxiv.org/abs/2207.09238

barrenko 10 hours ago||

I'll add in here https://web.stanford.edu/~jurafsky/slp3/, "Speech and Language Processing", with chapters that deal specifically with LLMs and transformers.

10GBps 12 hours ago|||

Yep. It's nearly identical to the neural nets we were using in the 90s. Back then even a supercomputer wasn't big enough or fast enough to do what we do today.

I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.

The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.

ctolsen 11 hours ago|||

No, it’s definitely not what a human brain is. That makes very little sense. The ways we interact with language (and thus conceptual memory) is completely and fundamentally different.

rfv6723 10 hours ago|||

Is it different though?

If we look beyond written languages, which are late inventions of human civilization, oral languages are continuous and built with blocks, not words.

Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.

uoaei 9 hours ago||

It is different, but there may be some universal principles that are relevant more abstractly among both cases. Of particular interest is the empirical notion that statistical models of a certain form will always tend to "average out noise" and "learn meaningful patterns" up to the capacity that those models have for representing said patterns. A parallel notion to this is the hypothesis dubbed "thermodynamic origins of life". The universal principle binding these two seemingly disparate topics is one that seems to underlie any sense of "learning" in physical systems: that semantics of those systems depend on their representational power, and the semantics they do come to represent are the results of adding up many pushes in one "direction" (phase space / state space / etc.) encoding a pattern, and adding up many random noise jiggles will cancel out but give you a first-order sense of variance of those semantic features as expressed by the environment.

As this description is so overly abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it is that it learns in the substrate that it inhabits.

zaphirplane 4 hours ago|||

But … how close a simulation is it. I can see why people are wondering

redox99 8 hours ago||||

In the 90s you didn't have norm layers, residuals, attention, and some more.

So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.

sirsinsalot 7 hours ago||

I think the attention mechanism is so simple but so revolutionary that people forget it.

Like the best leaps in thinking, once it is made, it is immediately obvious and intuitive.

redox99 5 hours ago||

Almost everything in ML is like that. It seems so obvious in hindsight. It's maybe what I love most.

Residual connections are so simple, so obvious and so vital. Yet nobody came up with them until 2015?
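A toy illustration of why the skip connection matters, assuming a stand-in "layer" that is just a small weight times tanh (the names `layer`, `layer_grad` and `run` are made up for this sketch, not taken from any library). Without the identity path, both the signal and the gradient collapse after a few dozen layers; with it, they survive.

```python
import math

def layer(x, w=0.1):
    """A tiny stand-in for a learned block: small weight, tanh nonlinearity."""
    return w * math.tanh(x)

def layer_grad(x, w=0.1):
    """Derivative of layer() with respect to its input."""
    return w * (1.0 - math.tanh(x) ** 2)

def run(x, depth, residual):
    """Return (output, d output / d input) after stacking `depth` layers."""
    grad = 1.0
    for _ in range(depth):
        g = layer_grad(x)        # local derivative at the current input
        if residual:
            grad *= 1.0 + g      # identity path keeps the gradient alive
            x = x + layer(x)
        else:
            grad *= g            # gradient shrinks by a factor of at most w per layer
            x = layer(x)
    return x, grad

plain_out, plain_grad = run(1.0, depth=50, residual=False)
res_out, res_grad = run(1.0, depth=50, residual=True)
print(f"plain:    out={plain_out:.3e} grad={plain_grad:.3e}")
print(f"residual: out={res_out:.3e} grad={res_grad:.3e}")
```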

sirsinsalot 2 hours ago||

I suspect it was considered many times, but the sheer computation scale would make it feel like obscene brute force. It feels like the right shape but too wild to think about implementing.

I think as time went on, and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I think was a widespread intuition anyone in ML had that everything's context is everything.

It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse to why it took so long.

bonoboTP 11 hours ago||||

Attention layers were not used in the 90s.

spacebacon 10 hours ago||||

LLMs are semiotic infrastructure. You won’t find a better analogy. The cognitive frame won’t hold.

otabdeveloper4 11 hours ago||||

> I mean a brain is not just neurons with simple connections to each other.

No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.

Clearly "neurons" is an oversimplification just-so story, not a scientific theory.

adammarples 8 hours ago|||

Apparently even single-celled protozoa can show learned trial and error behaviour.

formerly_proven 9 hours ago|||

Do you consider fungi animals or do you perhaps mean animals that don't have a brain/CNS?

otabdeveloper4 3 hours ago||

Yes, protozoans don't have brains and yet they exhibit complex behavior.

foxes 12 hours ago|||

Probably better to not simply reduce it by just saying X is Y then if it has all that extra complexity and capacity.

GardenLetter27 9 hours ago|||

It's not just the architecture but also the data - the decoder-only approach lets you train in parallel over blocks of text (no RNN serial waiting), which allows you to train on much, much more data.
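A small sketch of what "train in parallel over blocks of text" means in practice, in plain Python with made-up helper names (`training_pairs`, `causal_mask`). One block of text gives a prediction target at every position, and the causal mask stops each position from peeking at later tokens, so all positions can be scored in a single pass instead of one step at a time.

```python
def training_pairs(tokens):
    """Position i's target is token i+1: one block gives len(tokens)-1 examples."""
    inputs, targets = tokens[:-1], tokens[1:]
    return list(zip(inputs, targets))

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = training_pairs(tokens)
mask = causal_mask(len(tokens) - 1)

# Lower-triangular pattern: each row sees itself and everything before it.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(pairs)
```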

sesm 7 hours ago|||

I would argue that those are not emergent property of the model, but a property of how humans find insights in a plausible guess.

bluerooibos 5 hours ago|||

Since you spent a month digging into this, can you recommend any materials/projects to look into to get a decent grasp of how they work?

malwrar 33 minutes ago|||

I’d recommend my method of just drawing out the block diagram and drawing out + digging into the math at each step! I’m the kind of person who needs to take time to ask lots of questions before stuff clicks, and if you are too I strongly recommend it.

I picked it up from trying to teach myself that SLAM stuff. The papers are very short, but highly information dense and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me, I used them instead as references for my drawings.

Same happened when I tried learning this GPT stuff. karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture together on my whiteboard by focusing on drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.

LatencyKills 4 hours ago|||

Not OP but I worked through Sebastian Raschka's "Build a Large Language Model (From Scratch)" [0] and Raj Abhijit Dandekar's "Build a DeepSeek Model (From Scratch)" [1] books.

I don't think there is anything in a transformer I couldn't explain in the smallest detail now.

[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...

[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...

hackinthebochs 4 hours ago||

>I don't think there is anything in a transformer I couldn't explain in the smallest detail now.

If you're up for it I would love to know how and why positional encodings work

root-parent 3 hours ago|||

Learn about superposition and then you will see nobody really knows why this stuff works. It's actually a good interview question to set the bar....

LatencyKills 4 hours ago|||

Well, as I suggested, working through the implementation yourself will give you that intuition. That said, I think the simplest way to explain why positional encodings are useful is that it gives the transformer just enough information to make attention meaningful without negatively impacting any parallel, content-based comparisons.

A vanilla self-attention layer treats its input as just a set of token vectors. Without positional info, swapping two tokens only swaps the corresponding outputs, so attention cannot tell which order they came in. We can "fix" this problem by using positional encodings. Text that has meaning isn't just a set of characters; the location and order of those characters is what provides meaning.
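To make the order-blindness concrete, here is a minimal sketch in plain Python. It is heavily simplified: there are no learned projections (queries, keys and values are all just the raw embeddings), the three "embeddings" are made-up numbers, and the helper names are illustrative. The positional scheme is the sinusoidal one from "Attention is all you need"; many current models use other schemes.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(xs):
    """Single-head self-attention with Q = K = V = xs (no learned projections)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

def sinusoidal(pos, d):
    """Sinusoidal positional encoding for one position (d must be even)."""
    enc = []
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def add_positions(xs):
    d = len(xs[0])
    return [[x + p for x, p in zip(tok, sinusoidal(pos, d))] for pos, tok in enumerate(xs)]

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Made-up 4-dimensional embeddings for three tokens.
dog, bites, man = [1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [0.6, 0.4, 0.0, 1.0]
s1 = [dog, bites, man]   # "dog bites man"
s2 = [man, bites, dog]   # "man bites dog"

plain1, plain2 = attention(s1), attention(s2)
# Without positions: each token gets the same output vector in both sentences.
order_blind = close(plain1, list(reversed(plain2)))

pos1, pos2 = attention(add_positions(s1)), attention(add_positions(s2))
# With positions: the same token now gets a different output depending on where it sits.
order_aware = not close(pos1, list(reversed(pos2)))
print(order_blind, order_aware)
```

Note that the attention computation itself is untouched; the position information rides along inside the token vectors, which is why the parallel, content-based comparison still works.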

darksim905 12 hours ago|||

For anyone who is curious about the first paragraph here, this is actually a great video overview of how LLM works and the tokenization part.

Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.

cloche 1 hour ago||

> this is actually a great video overview of how LLM works and the tokenization part

Did you mean to link to the video? I would be interested.

pkoird 12 hours ago|||

aka "the bitter lesson"

Gmolomo 9 hours ago|||

Sooooo just because you are able to understand it, it's not worth anything?

It doesn't have any impact?

Ah wait it does. Mh weird.

Why are you not creating a startup and get rich?

sarjann 8 hours ago||

I mean there is a little something called compute. And other complexity that comes like writing code to efficiently distribute a model across machines.

dominotw 4 hours ago|||

> Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step. I’d puzzle about each step along the way, and the triumph of completing the drawing came with a sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.

How did you know what the steps were and that there was math involved? I am curious about your process and how you came up with what exactly to learn to unravel the mystery.

coliveira 4 hours ago|||

Don't forget the stolen data from books and papers. You'll never get anything intelligent without using the stolen data they had access to.

giardini 17 minutes ago||

WTF?

golergka 8 hours ago|||

After building some toy LLMs on my own I came to realise that architecture is not the hard part. Training is.

dist-epoch 6 hours ago||

That's easy to say AFTER you know the architecture.

Einstein's special relativity is taught these days in high schools. Doesn't mean it wasn't the very hard part at some point in time.

As they say, shoulders of giants.

faurroar 12 hours ago|||

Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures e.g. how do you implement them performantly on modern accelerators, how do you distribute the model across a set of accelerators, how do you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.

jumploops 12 hours ago||

Those are all just optimizations.

We still don’t really know why they work, we just know how to build them.

trollbridge 12 hours ago|||

We don't really know why language works with humans, either. If you raise a baby from birth, you kind of observe how it is learning language, but the process is also rather mysterious. My eldest son's first word was to actually imitate a cow mooing, and then after that to imitate a motor noise of a tractor or truck. And then after that a meow. (His first complete sentence was "King Graham fell"...)

My next child took a completely different path to language, including skipping all the non-verbal imitations.

And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.

jumploops 11 hours ago|||

Completely agree!

It’s interesting to me how similar attempting to understand LLMs is to neuroscience.

“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”

We’re basically just probing around and trying to reverse engineer an emergent system.

To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.

The comment I was responding to tried to belittle the OP’s understanding of transformers, by mentioning that running an LLM at scale is much harder than the simple white board diagram.

My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.

Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.

(On a side note, what other architectures can we scale to find similar emergent behavior?)

trollbridge 7 hours ago||

Computer vision ends up displaying emergent behaviour. It just "figures out" things.

ai_slop_hater 11 hours ago|||

Human brain capabilities are truly amazing, imagine if people didn’t treat their children as if they are stupid and didn’t constantly lie to them, because kids are stupid right, they wouldn’t understand. What heights could be reached.

baq 11 hours ago|||

We don’t treat children like they’re stupid, we treat children like they’re children. A stupid adult is treated very differently than any child.

Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.

ai_slop_hater 10 hours ago||

You may have been raised properly since you don’t get what I mean. I really envy kids with “Chinese parents” that had them learn math early on and not some bullshit like that if you put your tooth under your pillow, then a tooth fairy will come.

mejutoco 10 hours ago|||

I think those 2 are orthogonal. Math still works with Santa or the tooth fairy.

ai_slop_hater 9 hours ago||

Maybe math works but critical thinking doesn’t. There are people who have lived for many decades without ever questioning insane b.s. they were taught as kids.

beezlewax 10 hours ago||||

It is possible to have learned both things you know.

skydhash 5 hours ago|||

I had to learn maths early (not chinese or asian) and also a bunch of scary stories to make me behave. I would have been glad to learn about fairies.

trollbridge 7 hours ago||||

They aren't stupid, but they aren't quite ready to handle the full responsibilities of the world and worry about things they don't need to worry about.

My son is very worried about black holes lately when he learned anything that goes into one can't get out. He's pretty concerned astronauts could get stuck in one some day. So I explained to him that Hawking radiation does actually mean you can eventually get out; it just takes some time.

I didn't think it pertinent to mention spaghettification, the fact anywhere near a black hole will be really hot, or that cosmic censorship means whatever Hawking-radiates from a black hole wouldn't be an astronaut anymore.

It was also fun to hear Hawking speak. He wanted to know if Hawking was a robot. I said no, but he has a robot talk for him. Not quite true, but close enough.

pmg101 10 hours ago|||

Because god forbid that childhood, the one time in your life when you don't have any responsibilities, should be fun.

ai_slop_hater 10 hours ago||

Waste 22 years of life without learning anything and then slave away at a 9-5 job you hate. Brilliant strategy. At least you had “fun”. Then blame billionaires or something.

skydhash 5 hours ago||

Childhood only lasts 13 to 15 years where I am. By the time you’re in high school, you can be expected to be responsible in some matters. By 22 you have 7 years of experience in making decisions for yourself.

slopinthebag 11 hours ago||||

Hm, I wonder if it's more that we're shocked such a simple thing (relatively speaking) can work so well.

malwrar 27 minutes ago||

It was precisely that for me! Another commenter captures it well; “the bitter lesson” indeed.

otabdeveloper4 10 hours ago|||

We do know how they work. They predict the next statistically most likely token.

The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.

(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)

klempner 4 hours ago|||

Sufficiently good iterated next token prediction is an AI hard problem.

throw310822 10 hours ago||||

> statistically most likely token.

Statistically most likely in what context, given which preconditions? Because each prompt sequence is unique so the probability of any token following it is unknown.

skydhash 5 hours ago||

It’s not unknown because that’s what the model computes. It’s matrix multiplication just like shaders.

throw310822 5 hours ago||

And how do you know that the model computes it correctly?

skydhash 4 hours ago||

Correctness is based on axioms and rules. You need to define your axioms and rules first before you can determine correctness.

If you’re talking about matrix multiplication, I can use mathematical rules and axioms and formally prove that the multiplication is correct. For next token prediction, I can prove that the set of tokens is finite and that the next token is always part of that set.
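To make the "finite set" point concrete, here is a minimal sketch, assuming a made-up 4-token vocabulary and made-up weights (none of these numbers come from a real model). The last step of a model is roughly a matrix multiply followed by a softmax, which assigns a probability to every token in the vocabulary, including for prompts never seen before:

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Turn arbitrary scores into positive probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: a 4-token vocabulary and a 3-dimensional hidden state.
VOCAB = ["the", "cat", "sat", "."]
UNEMBED = [
    [0.2, -0.1, 0.4],
    [1.0, 0.3, -0.5],
    [-0.7, 0.8, 0.1],
    [0.0, 0.0, 0.9],
]
hidden = [0.5, -1.0, 2.0]

logits = matvec(UNEMBED, hidden)   # one score per vocabulary entry
probs = softmax(logits)            # one probability per vocabulary entry
next_token_distribution = dict(zip(VOCAB, probs))
```

Whether those probabilities are "correct" in any deeper sense is exactly what this arithmetic cannot tell you; it only guarantees a well-formed distribution over the vocabulary.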

But things like grammar correctness, or semantic consistency over a few sentences are not hardcoded rules in the model. They’re emergent properties, mostly due to the amount and quality of data available for training. Quantization is mostly about how much we can shed without losing particular emergent properties (like dithering or psychoacoustic audio compression).
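For anyone who hasn't seen quantization written out, a rough sketch of the idea, assuming simple symmetric int8 rounding with one scale for the whole tensor (real schemes are usually more elaborate, and the weights here are made up):

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.003]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The precision that was "shed": at most half a quantization step per weight.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The open question is how much of that rounding error a model can absorb before a given behaviour degrades, which is an empirical matter rather than something this snippet can show.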

perching_aix 3 hours ago|||

This "they just predict the next statistically most likely token" is such a handwavy and willfully misleading explanation, it's unreal, and I'm so fucking tired of seeing it so incessantly repeated. It's beyond asinine.

You know it perfectly damn well that a typical person's idea of statistics is not some insanely high cardinality stateful prediction, but a "well a coin toss is a 50:50, and a lottery win is a 1:100000000". You also know it perfectly damn well that as a result, people will just think that all the sentences chatbots ever produced to them were then just somewhere in the massive training set, letter by letter. This insinuation is often even explicitly appealed to.

And that picture is outright false. It's a statistical process, yes, so saying that it does what it does by "just doing statistics" is gonna be a generally correct description, but that says nothing about how exactly it does it, nor is it the zinger you think it is. If you did the aforementioned, you'd just get milquetoast nonsense, like you can see in the countless Markov-chain primers. And while the models do have a lot of the training set lossily captured, they do also absolutely generalize (that's how they can do that lossy compression), and you can quite literally find representations of those generalizations in them, and also see them activate.

It's like summarizing how any program works by just saying "well it just manipulates ones and zeroes". Not very informative, is it? Or how programs are written by just programmers sitting in a cushy office, rhythmically pressing keys on a keyboard. Not a very fair or insightful description, which you'll know if you've done any amount of programming in your life on your own. Extends to all other white collar jobs too.

It's also not even true in the most literal sense: models can and do absolutely choose a less than maximally likely next token, that's what the various decoding parameters are for. "Maximally likely next token" also conveniently skips over how that likelihood is established in the first place, i.e. the literal point of the question, going in a cute little circle.
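A small sketch of what those decoding parameters do, assuming made-up scores for a 4-token vocabulary and hypothetical helper names (`greedy`, `sample`); temperature and top-k are the two knobs shown:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the single most likely token index."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, top_k=None, rng=random):
    """Sample an index, optionally restricted to the top_k highest-scoring tokens."""
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    probs = softmax([logits[i] for i in indices], temperature)
    return rng.choices(indices, weights=probs, k=1)[0]

logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores
rng = random.Random(0)          # seeded so the run is repeatable
picks = [sample(logits, temperature=1.0, rng=rng) for _ in range(1000)]

# Fraction of draws that were NOT the single most likely token.
share_not_argmax = sum(p != greedy(logits) for p in picks) / len(picks)
```

With these scores the top token has a bit over half the probability mass, so a large share of sampled tokens are not the "most likely" one. Setting `top_k=1` collapses sampling back to greedy decoding.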

I'm so over this "stochastic parrot" bullshit.

firemelt 8 hours ago|||

fucking well said

robwwilliams 3 hours ago||

Great, and won’t we all be just as surprised when human self-attentional control turns out to be just as simple or just as complex! Our minds as a strange fabric built of threads of recursions without the benefit of any explicit clock.

miki123211 3 hours ago||

There's one thing I wish people understood about LLMs, and it doesn't really have anything to do with what's inside the neural network part. It's the fact that LLMs can only write in one direction — forward.

When you are writing an essay and realize midway through a sentence that what you've written doesn't make sense, you go back and edit. An LLM can't do that, the only thing it can do is keep on generating. Because training data typically contains full essays and not half-finished sentences which were then edited, LLMs have a strong preference for "saving face" and producing grammatically correct, internally coherent outputs. They will often do so even if the only way to write themselves out of the corner they wrote themselves into is to lie. To maintain internal coherence, they'll then repeat that lie for the rest of the response.
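The forward-only constraint is easy to see in a bare-bones generation loop. This is only a sketch: `toy_model` is a hypothetical lookup table standing in for a real network, and the token strings are invented for illustration.

```python
def generate(next_token, prompt, max_new_tokens, stop="<eos>"):
    """Autoregressive loop: each step sees everything so far and appends one token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # depends only on the tokens emitted so far
        if token == stop:
            break
        tokens.append(token)        # append-only: earlier tokens are never edited
    return tokens

# Hypothetical stand-in for a model: a fixed lookup keyed on the last token.
TOY_TABLE = {"2+2": "=", "=": "5", "5": "because", "because": "<eos>"}

def toy_model(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")

output = generate(toy_model, ["2+2"], max_new_tokens=10)
```

Once the wrong "5" is in the sequence there is no operation that removes it; every later token is conditioned on it.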

This is also why changing response structure used to affect LLM performance so dramatically. If you asked an LLM to solve a math problem and all-but-forced it to start with the answer, it would have had to calculate that answer before emitting any tokens, something which it very often wasn't able to do. If it was told to follow up the answer with an explanation, it would produce a plausible-sounding explanation to maintain coherence.

If, on the other hand, it was told to start by "thinking step by step", it would often be able to solve the first step, and then the next one given the results of the first, and so on, until it was able to reach the answer. Because the answer came last, it wasn't committing to anything, so had no reason to "save face" and lie.

This part of the problem is basically solved now with reasoning; reasoning is where all the step-by-step stuff happens, even if users aren't always able to see it. In the process of RLVR, models even train themselves into outputting phrases like "let me check my answer once again" in the chain-of-thought; those serve as their "life rafts" which they can use to both save face and change their answer.

chris_money202 1 hour ago|

In terms of our brains though we can only think forward as well (if forward is time). Our brain in the future says something we did in the past was wrong (part of the sentence we wrote) and that informs our body (the agent) to go back and fix it

helloplanets 9 hours ago||

The part about positional encoding is not correct.

> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position

You can't rotate the token's entire vector (or all three vectors, whatever is being implied is unclear). You rotate each token's Query and Key vectors only, so dot product can be used to tell how far apart the tokens are when comparing token 1's Query vector to token 2's Key vector.
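A toy illustration of why rotating only Query and Key is enough, assuming 2-D vectors and a single made-up rotation frequency `theta` (real RoPE applies this to many pairs of dimensions, each with its own frequency):

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

query = (1.0, 2.0)   # made-up Query vector
key = (0.5, -1.5)    # made-up Key vector

# Same offset (3 positions apart) at two different absolute positions.
score_near_start = dot(rotate(query, 5), rotate(key, 2))
score_far_along = dot(rotate(query, 105), rotate(key, 102))

# A different offset (1 position apart) gives a different score.
score_other_offset = dot(rotate(query, 5), rotate(key, 4))
```

The first two scores match up to floating-point error because the dot product of two rotated vectors depends only on the difference of their angles, so the attention score carries relative position. The Value vector never enters this dot product, which is why it is left unrotated.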

Positional embedding should just be explained after explaining the Query, Key and Value vectors. When the article explains those only after that, the reader is building up on a wrong intuition and it gets confusing.

giardini 10 minutes ago|

Could you restate this another way? I don't follow.

oceansky 1 hour ago||

Out of curiosity, I wondered if you could break a tokenizer by introducing weird characters not mapped to an id.

But apparently, they either just emit a [UNK] token or translate the unrecognized character into raw UTF-8 bytes.
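A rough sketch of the byte-fallback variant, assuming a hypothetical character-level vocabulary of three entries and a reserved id range for raw bytes (real tokenizers work on subword pieces, but the fallback idea is the same):

```python
def encode(text, vocab, byte_offset):
    """Map known characters to their ids; fall back to UTF-8 bytes otherwise."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Byte fallback: one id per UTF-8 byte, taken from a reserved range.
            ids.extend(byte_offset + b for b in ch.encode("utf-8"))
    return ids

def decode(ids, vocab, byte_offset):
    """Invert encode(): collect fallback bytes and decode them as UTF-8."""
    inverse = {i: ch for ch, i in vocab.items()}
    out, pending = [], bytearray()
    for i in ids:
        if i >= byte_offset:
            pending.append(i - byte_offset)
            continue
        out.append(pending.decode("utf-8"))  # flush any buffered bytes
        pending.clear()
        out.append(inverse[i])
    out.append(pending.decode("utf-8"))
    return "".join(out)

VOCAB = {"h": 0, "i": 1, " ": 2}   # hypothetical tiny vocabulary
BYTE_OFFSET = 3                    # ids 3..258 are reserved for raw bytes

text = "hi 🦆"
ids = encode(text, VOCAB, BYTE_OFFSET)
round_trip = decode(ids, VOCAB, BYTE_OFFSET)
```

Because every possible byte has an id, no input string is unrepresentable; an unknown character just costs several tokens instead of one.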

10GBps 13 hours ago||

I learned TCP/IP by watching and reading raw packets over packet radio at 1200 baud.

I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.

trollbridge 12 hours ago||

Comment <-> username synergy.

helloplanets 7 hours ago|||

It's basically possible to build an LLM using just routers+packets, and then hook them up to Wireshark to see it compute!

Maledictus 10 hours ago|||

How would I set this up?

barrenko 9 hours ago||

I'd recommend maybe also watching Karpathy's videos and focusing on the early parts where he specifically deals with tokenization / embeddings generation (which gets really overlooked), and he does this in most of his videos.

fragmede 11 hours ago||

https://distill.pub/2019/activation-atlas/

I can only imagine what sort of visualizations are going on today inside of the AI labs.

alecco 6 hours ago||

A better blog on Transformers: https://www.aleksagordic.com/blog/transformer

vocram 10 hours ago||

Saying an article is of inferior quality just because editing was AI-assisted is like saying a book is lower quality just because it was printed rather than written by hand

Ampersander 8 hours ago||

You are exactly right! People do not find the writing obnoxious, they are backwards technophobes getting brought down by their superstitions.

lateral_cloud 9 hours ago|||

AI assisted is a stretch. And that analogy isn't even close to being relevant

bspammer 9 hours ago|||

No? One affects the actual text and the other doesn’t.

possibleworlds 4 hours ago|||

This analogy makes absolutely no sense.

Laurel1234 9 hours ago|||

Rather interesting that clanker slop defenders downplay the clanker aspect and highlight the human by calling it "ai-assisted", which defeats their entire point.

I hope you do some introspection and start consciously recognizing that the human input and the clanker slop is just debasing it.

janalsncm 9 hours ago||

Not just that, I think a lot of people are going to waste their time losing the battle (and make no mistake, they will lose) fighting against AI writing without ever asking themselves what makes writing good in the first place.

There’s good AI writing and bad organic writing. But it’s easier to point out a few LLM-isms than to actually identify the problems with text.

blharr 3 hours ago||

> There's good AI writing

Sure, but the LLM-isms in AI writing are mentally exhausting to see in every way at this point.

The whole point of reading, frankly, is to understand the voice of other people. When you pass that through a distorted filter that makes everyone sound the same... it's bad, lossy, frustrating communication.

It's also dishonest when you publish something that is direct output without your wording. Digital catfishing at best.

The only good AI writing is providing the prompt, because the question is way more interesting, and way more constructive to learning than the answer

zenfoxai 2 hours ago||

Nice article but chain of thought is what makes frontier LLMs smart, not really the token loop

brcmthrowaway 52 minutes ago|

Is chain of thought same as test time compute?

agumonkey 1 hour ago||

Nice intro, gonna help me dig further a lot now. Thanks a ton.

andai 14 hours ago|

I couldn't load the article directly due to an SSL issue, so here's the archive link:

https://archive.ph/aWtFG
