Posted by auraham 12/22/2025

The Illustrated Transformer (jalammar.github.io)
500 points | 88 comments
zkmon 12/22/2025|
I think the internals of transformers will become less relevant, like the internals of compilers: programmers will only care about how to "use" them instead of how to develop them.
rvz 12/22/2025||
Their internals are just as relevant (now even more so) as those of any other technology, since they always need to be pushed toward the SOTA (state of the art), meaning that someone has to understand them.

It also means more jobs for the people who understand them at a deeper level to advance the SOTA of specific widely used technologies such as operating systems, compilers, neural network architectures and hardware such as GPUs or TPU chips.

Someone has to maintain and improve them.

crystal_revenge 12/23/2025|||
Have you written a compiler? I ask because for me writing a compiler was absolutely an inflection point in my journey as a programmer. Being able to look at code and reason about it all the way down to bytecode/IL/asm etc absolutely improved my skill as a programmer and ability to reason about software. For me this was the first time I felt like a real programmer.
zkmon 12/23/2025||
Writing a compiler is not a requirement, or a good use of time, for a programmer. It's like how driving a car shouldn't require you to build the engine: a driver should stick to their role and learn how to drive properly.
crystal_revenge 12/24/2025||
I'm guessing the answer then is "no".

That's a ridiculous metaphor as well because building a compiler is a massive software engineering project that covers a huge range of essential skills. That metaphor would work for building a computer, but not a compiler.

Clearly it shouldn't be a requirement, but it is an excellent use of a programmer's time. I can think of no software project over my career that has improved my skills more than writing a compiler.

esafak 12/22/2025||
Practitioners already do not need to know about it to run, let alone use, LLMs. I bet most don't even know the fundamentals of machine learning. Hands up if you know bias from variance...
edge17 12/23/2025||
Maybe I'm out of touch, but have transformers replaced all traditional deep learning architectures (U-Nets, etc.)?
D-Machine 12/23/2025|
No, not at all. There is a transformer obsession that is quite possibly not supported by the actual facts (CNNs can still do just as well: https://arxiv.org/abs/2310.16764), and CNNs definitely remain preferable for smaller and more specialized tasks (e.g. computer vision on medical data).

If you also get into more robust and/or specialized tasks (e.g. rotation invariant computer vision models, graph neural networks, models working on point-cloud data, etc) then transformers are also not obviously the right choice at all (or even usable in the first place). So plenty of other useful architectures out there.

menaerus 12/23/2025|||
Using transformers does not rule out keeping other tools up your sleeve.

What about DINOv2 and DINOv3, the 1B and 7B vision transformer models? This paper [1] suggests significant improvements over traditional YOLO-based object detection.

[1] https://arxiv.org/html/2509.20787v2

D-Machine 12/23/2025||
Indeed, there are even multiple attempts to use both self-attention and convolutions in novel architectures, and there is evidence this works very well and may have significant advantages over pure vision transformer models [1-2].

IMO there is little reason to think transformers are (even today) the best architecture for any deep learning application. Perhaps if a mega-corp poured all their resources into some convolutional transformer architecture, you'd get something better than just the current vision transformer (ViT) models, but since so much optimization and work on the training of ViTs has been done, and since we clearly still haven't maxed out their capacity, it makes sense to stick with them at scale.

That being said, ViTs are still currently clearly the best if you want something trained on a near-entire-internet of image or video data.

[1] https://arxiv.org/abs/2103.15808

[2] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=convo...

edge17 12/23/2025|||
Is there something I can read to get a better sense of what types of models are most suitable for which problems? All I hear about are transformers nowadays, but what are the types of problems for which transformers are the right architecture choice?
D-Machine 12/23/2025||
Just do some basic searches on e.g. Google Scholar for your task (e.g. "medical image segmentation", "point cloud segmentation", "graph neural networks", "timeseries classification", "forecasting") or task modification (e.g. "'rotation invariant' architecture") or whatever, sort by year, make sure to click on papers that have a large number of citations, and start reading. You will start to get a feel for domains or specific areas where transformers are and are not clearly the best models. Or just ask e.g. ChatGPT Thinking with search enabled about these kinds of things (and then verify the answer by going to the actual papers).

Also check HuggingFace and other model hubs and filter by task to see if any of these models are available in an easy-to-use format. But most research models will only be available on GitHub somewhere, and in general you are just deciding between a vision transformer and the latest convolutional model (usually a ConvNext vX for some X).

In practice, if you need to work with the kind of data that is found online, and don't have a highly specialized type of data or problem, then you do, today, almost always just want some pre-trained transformer.

But if you actually have to (pre)train a model from scratch on specialized data, in many cases you will not have enough data or resources to get the most out of a transformer, and often some kind of older / simpler convolutional model is going to give better performance at less cost. Sometimes in these cases you don't even want a deep-learner at all, and just classic ML or algorithms are far superior. A good example would be timeseries forecasting, where embarrassingly simple linear models blow overly-complicated and hugely expensive transformer models right out of the water (https://arxiv.org/abs/2205.13504).

Oh, right, and unless TabPFNv2 (https://www.nature.com/articles/s41586-024-08328-6) makes sense for your use-case, you are still better off using boosted decision trees (e.g. XGBoost, LightGBM, or CatBoost) for tabular data.
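
To make the "embarrassingly simple linear models" point above concrete, here is a minimal sketch (PyTorch; my own toy code, not taken from the linked paper, with illustrative window sizes) of a forecaster that is just one linear layer mapping the lookback window to the forecast horizon, applied per channel:

    import torch
    import torch.nn as nn

    class LinearForecaster(nn.Module):
        """One linear layer mapping a lookback window straight to the forecast horizon."""
        def __init__(self, lookback: int, horizon: int):
            super().__init__()
            self.proj = nn.Linear(lookback, horizon)  # this is the entire model

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, lookback) -> (batch, channels, horizon)
            return self.proj(x)

    model = LinearForecaster(lookback=96, horizon=24)
    x = torch.randn(8, 7, 96)    # 8 series, 7 channels, 96 past time steps
    print(model(x).shape)        # torch.Size([8, 7, 24])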

bearsortree 12/23/2025||
I found this much more intuitive to follow: https://poloclub.github.io/transformer-explainer/
profsummergig 12/22/2025|
Haven't watched it yet...

...but, if you have favorite resources on understanding Q & K, please drop them in comments below...

(I've watched the Grant Sanderson/3blue1brown videos [including his excellent talk at TNG Big Tech Day '24], but Q & K still escape me).

Thank you in advance.

roadside_picnic 12/22/2025||
It's just a re-invention of kernel smoothing. Cosma Shalizi has an excellent write up on this [0].

Once you recognize this, it's a wonderful re-framing of what a transformer is doing under the hood: you're effectively learning a bunch of sophisticated kernels (through the FF part) and then applying kernel smoothing in different ways through the attention layers. It makes you realize that Transformers are philosophically much closer to things like Gaussian Processes (which are also just a bunch of kernel manipulation).

0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
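
A tiny NumPy sketch (mine, not from Shalizi's notebook) of the equivalence being described: for a single query, softmax attention is exactly a Nadaraya-Watson kernel smoother with kernel exp(q . k / sqrt(d)):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    q = rng.normal(size=d)         # one query vector
    K = rng.normal(size=(5, d))    # five key vectors
    V = rng.normal(size=(5, d))    # five value vectors

    # Attention view: softmax(q K^T / sqrt(d)) V
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()
    attn_out = weights @ V

    # Kernel-smoothing view: a weighted average of the values,
    # with kernel k(q, k_j) = exp(q . k_j / sqrt(d))
    kernel = np.exp(K @ q / np.sqrt(d))
    smooth_out = (kernel[:, None] * V).sum(axis=0) / kernel.sum()

    print(np.allclose(attn_out, smooth_out))  # True: the same computation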

leopd 12/22/2025|||
I think this video does a pretty good job explaining it, starting about 10:30 minutes in: https://www.youtube.com/watch?v=S27pHKBEp30
oofbey 12/22/2025|||
As the first comment says "This aged like fine wine". Six years old, but the fundamentals haven't changed.
andoando 12/22/2025|||
This wasn't any better than other explanations I've seen.
red2awn 12/22/2025|||
Implement transformers yourself (e.g. in NumPy). You'll never truly understand them by just watching videos.
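
A minimal sketch of that exercise, assuming a single attention head with no masking or batching (variable names are mine, not from any particular tutorial):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        # X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v       # project the same input three ways
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
        weights = softmax(scores, axis=-1)        # each row sums to 1
        return weights @ V                        # weighted mixture of value vectors

    rng = np.random.default_rng(0)
    d_model, d_head, seq_len = 16, 8, 5
    X = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape) # (5, 8)
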
D-Machine 12/22/2025|||
Seconding this. The terms "Query" and "Value" are largely arbitrary and meaningless in practice. Look at how to implement this in PyTorch and you'll see they are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x), or self_attention(x, x, y) in some cases, where x and y are outputs from previous layers.

Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations and/or multiplicative interactions among a dimension-reduced representation.
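
As a rough illustration of that calling pattern with PyTorch's built-in nn.MultiheadAttention (whose argument order is query, key, value; the shapes and sizes here are arbitrary):

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

    x = torch.randn(2, 10, 32)   # (batch, seq_len, d_model), output of some previous layer
    y = torch.randn(2, 6, 32)    # a second stream, e.g. decoder-side activations

    self_out, _  = mha(x, x, x)  # self-attention: query, key and value are all x
    cross_out, _ = mha(y, x, x)  # cross-attention: queries from y, keys/values from x

    print(self_out.shape)            # torch.Size([2, 10, 32])
    print(cross_out.shape)           # torch.Size([2, 6, 32])
    print(mha.in_proj_weight.shape)  # torch.Size([96, 32]): just W_q, W_k, W_v stacked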

tayo42 12/23/2025|||
>the terms "Query" and "Value" are largely arbitrary and meaningless in practice

This is the most confusing thing about it imo. Those words all mean something but they're just more matrix multiplications. Nothing was being searched for.

D-Machine 12/23/2025||
Better resources will note that the terms are just historical, not really relevant anymore, and remain only a naming convention for the self-attention formulas. IMO it is harmful to learning and to good pedagogy to claim they are anything more than this, especially as we better understand that what they are really doing is approximating feature-feature correlations / similarity matrices, or, perhaps even more generally, just allowing for multiplicative interactions (https://openreview.net/forum?id=rylnK6VtDH).
profsummergig 12/22/2025|||
Do you think the dimension reduction is necessary? Or is it just practical (due to current hardware scarcity)?
D-Machine 12/23/2025||
Definitely mostly just a practical thing IMO, especially with modern attention variants (sparse attention, FlashAttention, linear attention, merged attention, etc.). I'm not sure it is even hardware scarcity per se (or solely that); it would just be really expensive in terms of both memory and FLOPs (and not clearly increase model capacity) to use larger matrices.

Also, for the specific case where, in code for encoder-decoder transformers, you call attention as a(x, x, y) instead of the usual a(x, x, x) (what Alammar calls "encoder-decoder attention" in his diagram just before the "The Decoder Side" section), you have different matrix sizes, so dimension reduction is also needed to make the matrix multiplications work out nicely.

But in general it is just a compute thing IMO.

roadside_picnic 12/22/2025||||
I personally don't think implementation is as enlightening, as far as really understanding what the model is doing, as this statement implies. I had done it many times, but it wasn't until reading about the relationship to kernel methods that it really clicked for me what is actually happening under the hood.

Don't get me wrong, implementing attention is still great (and necessary), but even with something as simple as linear regression, implementing it doesn't really give you the entire conceptual model. I do think implementation helps to understand the engineering of these models, but it still requires reflection and study to start to understand conceptually why they are working and what they're really doing (I would, of course, argue I'm still learning about linear models in that regard!)

krat0sprakhar 12/22/2025|||
Do you have a tutorial that I can follow?
jwitthuhn 12/22/2025|||
If you have 20 hours to spare I highly recommend this youtube playlist from Andrej Karpathy https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...

It starts with the fundamentals of how backpropagation works, then advances to building a few simple models, and ends with building a GPT-2 clone. It won't teach you everything about AI models, but it gives you a solid foundation for branching out.

roadside_picnic 12/22/2025|||
The most valuable tutorial will be translating from the paper itself. The more hand holding you have in the process, the less you'll be learning conceptually. The pure manipulation of matrices is rather boring and uninformative without some context.

I also think the implementation is more helpful for understanding the engineering work needed to run these models than for getting a deeper mathematical understanding of what the model is doing.

throw310822 12/22/2025|||
Have you tried asking e.g. Claude to explain it to you? None of the usual resources worked for me, until I had a discussion with Claude where I could ask questions about everything that I didn't get.
sakesun 12/22/2025||
Perhaps we have already reached ASI. :)
throw310822 12/23/2025||
In some respects, yes. There is no single human being with a general knowledge as vast as that of a SOTA LLM, or able to speak as many languages. Claude knows about transformers more than enough to explain them to a layperson, elucidating specific points and resolving doubts. As someone who learns more easily by prodding other people's knowledge rather than from static explanations, I find LLMs extremely useful.
machinationu 12/22/2025|||
Q, K and V are a way of filtering the relevant aspects for the task at hand from the token embeddings.

"he was red": maybe a color, maybe angry. The "red" token embedding carries both, but only one aspect is relevant for any particular prompt.

https://ngrok.com/blog/prompt-caching/
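
A toy illustration of that "aspect filtering" idea, with completely made-up numbers and hand-built projection matrices, just to show the mechanics of a value projection:

    import numpy as np

    # Toy embedding: dims 0-1 encode "color-ness", dims 2-3 encode "emotion-ness".
    e_red = np.array([0.9, 0.8, 0.7, 0.6])    # "red" activates both aspects

    # Two hypothetical learned projections, each keeping only one aspect.
    W_color   = np.diag([1.0, 1.0, 0.0, 0.0])
    W_emotion = np.diag([0.0, 0.0, 1.0, 1.0])

    print(W_color @ e_red)    # [0.9 0.8 0.  0. ]  -> "red" as a color
    print(W_emotion @ e_red)  # [0.  0.  0.7 0.6]  -> "red" as in angry

    # In a real transformer the Q/K/V projections are dense learned matrices, and the
    # attention scores (driven by the rest of the prompt) decide which aspect is used.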

oedemis 12/23/2025|||
There is also a very good explanation from Luis Serrano: https://youtu.be/fkO9T027an0
bobbyschmidd 12/22/2025||
tldr: recursively aggregating packing/unpacking 'if else if (functions)/statements' as keyword arguments that (call)/take them themselves as arguments, with their own position shifting according to the number "(weights)" of else if (functions)/statements needed to get all the other arguments into (one of) THE adequate orders. the order changes based on the language, input prompt and context.

if I understand it all correctly.

implemented it in html a while ago and might do it in htmx sometime soon.

transformers are just slutty dictionaries that Papa Roach and kage bunshin no jutsu right away again and again, spawning clones and variations based on requirements, which is why they tend to repeat themselves rather quickly and often. it's got almost nothing to do with languages themselves and requirements and weights amount to playbooks and DEFCON levels