Posted by tambourine_man 20 hours ago

Microgpt (karpathy.github.io)
1622 points | 284 comments
hkbuilds 39 minutes ago|
The "micro" trend in AI is fascinating. We're seeing diminishing returns from just making models bigger, and increasing returns from making them smaller and more focused.

For practical applications, a well-tuned small model that does one thing reliably is worth more than a giant model that does everything approximately. I've been using Gemini Flash for domain-specific analysis tasks and the speed/cost ratio is incredible compared to the frontier models. The latency difference alone changes what kind of products you can build.

grey-area 1 minute ago|
This is micro for pedagogical reasons; it's not something you would actually use.
teleforce 15 hours ago||
Someone has modified microgpt to build a tiny GPT that generates Korean first names, and created a web page that visualizes the entire process [1].

Users can interactively explore the microgpt pipeline end to end, from tokenization until inference.

[1] Korean GPT lab:

https://ko-microgpt.vercel.app/

camkego 1 hour ago|
I have no affiliation with the website, but it's pretty neat if you are learning LLM internals. It explains: Tokenization, Embedding, Attention, Loss & Gradient, Training, Inference, and a comparison to "Real GPT".

Pretty nifty, even if you're not interested in the Korean language.

verma7 16 hours ago||
I wrote a C++ translation of it: https://github.com/verma7/microgpt/blob/main/microgpt.cc

2x the number of lines of code (~400L), 10x the speed

The hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).
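For readers wondering what that Value class looks like in the original: here is a minimal Python sketch of a micrograd-style scalar autograd node (names and structure illustrative, not Karpathy's exact code) — the graph of parent references is what the C++ port ends up modeling with shared_ptrs.

```python
# Minimal sketch of a scalar autograd "Value" node, micrograd-style.
# Illustrative only; not the exact API from microgpt.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # parents in the compute graph
        self._backward = lambda: None    # fills in children's grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then propagate grads in reverse
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a  # c = 2*3 + 2 = 8, dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

The shared-ownership problem the C++ port hit comes from `out` holding references to its operands while the caller also holds them, which Python's GC handles for free.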

WithinReason 12 hours ago|
I made an explicit reverse pass (no autodiff), it was 8x faster in Python
geokon 12 hours ago||
> What’s the deal with “hallucinations”? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data.

Extremely naive question, but could LLM output be tagged with some kind of confidence score? If I'm asking an LLM a question, does it have an internal metric for how confident it is in its output? LLM outputs rarely seem to take the form "I'm not really sure, but maybe this XXX" - but I always felt this was baked into the model somehow

andy12_ 12 hours ago||
The model could report the confidence of its output distribution, but it isn't necessarily calibrated (that is, even if it tells you that it's 70% confident, it doesn't mean that it is right 70% of the time). Famously, pre-trained base models are calibrated, but they stop being calibrated when they are post-trained to be instruction-following chatbots [1].

Edit: There is also some other work pointing out that chat models might not be calibrated at the token level, but might be calibrated at the concept level [2]. That means if you sample many answers and group them by semantic similarity, the grouped distribution is also calibrated. The problem is that generating many answers and grouping them is more costly.

[1] https://arxiv.org/pdf/2303.08774 Figure 8

[2] https://arxiv.org/pdf/2511.04869 Figure 1.
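To make "calibrated" concrete: bucket a model's answers by stated confidence, then compare each bucket's stated confidence with its empirical accuracy. A quick sketch with made-up (confidence, was-it-correct) pairs:

```python
# Illustrative calibration check: for each confidence bucket, a calibrated
# model's stated confidence should match its empirical accuracy.
# The (confidence, was_correct) pairs below are made-up data.
from collections import defaultdict

predictions = [
    (0.72, True), (0.68, True), (0.71, False), (0.69, True),   # ~70% bucket
    (0.91, True), (0.93, True), (0.89, True), (0.92, False),   # ~90% bucket
]

buckets = defaultdict(list)
for conf, correct in predictions:
    buckets[round(conf, 1)].append(correct)

for conf in sorted(buckets):
    hits = buckets[conf]
    acc = sum(hits) / len(hits)
    print(f"stated ~{conf:.0%} confidence -> {acc:.0%} accurate over {len(hits)} samples")
```

Here the ~70% bucket comes out well calibrated while the ~90% bucket is overconfident - the miscalibration pattern [1] reports for post-trained chat models.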

geokon 12 hours ago||
In absolute terms, sure, but the token stream's confidence changes as it's coming out, right? Consumer LLMs typically have a lot of window dressing. My sense is this encourages the model to stay on topic, and it's mostly "high confidence" fluff. As it's streaming tokens back at you, maybe you'd expect a sudden dip in the confidence when it starts hallucinating?

You could color-code the output tokens so you can see abrupt changes

It seems kind of obvious, so I'm guessing people have tried this
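The color-coding idea is easy to sketch. Here the per-token probabilities are hard-coded for illustration; in a real setup you would read them from the model's logprobs during sampling and flag sudden dips:

```python
# Sketch of color-coding tokens by their sampled probability.
# Token/probability pairs are made up; in practice you'd read per-token
# logprobs out of the model while it generates.
tokens = [("The", 0.92), ("capital", 0.85), ("of", 0.97),
          ("France", 0.88), ("is", 0.95), ("Lyon", 0.04)]

RED, DIM, RESET = "\033[31m", "\033[2m", "\033[0m"  # ANSI SGR codes
THRESHOLD = 0.10  # flag tokens the model itself found surprising

for tok, p in tokens:
    color = RED if p < THRESHOLD else DIM
    print(f"{color}{tok}{RESET}", end=" ")
print()
```

As the sibling comments note, a low-probability token isn't the same thing as a false statement, so the highlighting is a heuristic at best.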

throwthrowuknow 8 hours ago||
Look up “dataloom”. People have been playing with this idea for a while. It doesn’t really help with spotting errors because they aren’t due to a single token (unless the answer is exactly one token) and often you need to reason across low probability tokens to eventually reach the right answer.
chongli 59 minutes ago|||
Having a confidence score isn't as useful as it seems unless you (the user) know a lot about the contents of the training set.

Think of traditional statistics. Suppose I said "80% of those sampled preferred apples to oranges, and my 95% confidence interval is within +/- 2% of that" but then I didn't tell you anything about how I collected the sample. Maybe I was talking to people at an apple pie festival? Who knows! Without more information on the sampling method, it's hard to make any kind of useful claim about a population.
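As a back-of-envelope aside, the numbers in that analogy imply a concrete sample size under the usual normal approximation for a proportion (my arithmetic, not the commenter's):

```python
# Sample size implied by "80% preferred apples, 95% CI within +/- 2%",
# using the normal approximation n = z^2 * p * (1 - p) / margin^2.
p = 0.80        # observed proportion preferring apples
z = 1.96        # ~95% two-sided normal quantile
margin = 0.02   # +/- 2%

n = z**2 * p * (1 - p) / margin**2
print(round(n))  # roughly 1537 respondents
```

A perfectly respectable sample size - which is the point: the statistics can look rigorous while the sampling frame (the apple pie festival) silently biases everything.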

This is why I remain so pessimistic about LLMs as a source of knowledge. Imagine you had a person who was raised from birth in a completely isolated lab environment and taught only how to read books, including the dictionary. They would know how all the words in those books relate to each other but know nothing of how that relates to the world. They could read the line "the killer drew his gun and aimed it at the victim" but what would they really know of it if they'd never seen a gun?

radarsat1 36 minutes ago||
I think your last point raises the following question: how would you change your answer if you know they read all about guns and death and how one causes the other? What if they'd seen pictures of guns? And pictures of victims of guns annotated as such? What if they'd seen videos of people being shot by guns?

I mean I sort of understand what you're trying to say but in fact a great deal of knowledge we get about the world we live in, we get second hand.

There are plenty of people who've never held a gun, or had a gun aimed at them, and.. granted, you could argue they probably wouldn't read that line the same way as people who have, but that doesn't mean that the average Joe who's never been around a gun can't enjoy media that features guns.

Same thing about lots of things. For instance it's not hard for me to think of animals I've never seen with my own eyes. A koala for instance. But I've seen pictures. I assume they exist. I can tell you something about their diet. Does that mean I'm no better than an LLM when it comes to koala knowledge? Probably!

chongli 5 minutes ago||
It’s more complicated to think about, but it’s still the same result. Think about the structure of a dictionary: all of the words are defined in terms of other words in the dictionary, but if you’ve never experienced reality as an embodied person then none of those words mean anything to you. They’re as meaningless as some randomly generated graph with a million vertices and a randomly chosen set of edges according to some edge distribution that matches what we might see in an English dictionary.

Bringing pictures into the mix still doesn’t add anything, because the pictures aren’t any more connected to real world experiences. Flooding a bunch of images into the mind of someone who was blind from birth (even if you connect the images to words) isn’t going to make any sense to them, so we shouldn’t expect the LLM to do any better.

Think about the experience of a growing baby, toddler, and child. This person is not having a bunch of training data blasted at them. They’re gradually learning about the world in an interactive, multi-sensory and multi-manipulative manner. The true understanding of words and concepts comes from integrating all of their senses with their own manipulations as well as feedback from their parents.

Children also are not blank slates, as is popularly claimed, but come equipped with built-in brain structures for vision, including facial recognition, voice recognition (the ability to recognize mom’s voice within a day or two of birth), universal grammar, and a program for learning motor coordination through sensory feedback.

danlitt 49 minutes ago|||
Can it generate one? Sure. But it won't mean anything, since you don't know (and nobody knows) the "true" distribution.
jorvi 3 hours ago|||
> I'm not really sure, but maybe this XXX

You never see this in the response but you do in the reasoning.

DavidSJ 12 hours ago|||
Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.

[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
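The logits-to-distribution-to-token step can be sketched in a few lines. The vocabulary and logits below are made up for illustration; the softmax and weighted draw are the standard mechanics:

```python
# Sketch of the sampling step: the model emits logits, softmax turns them
# into a probability distribution, and one token is drawn from it.
# Vocabulary and logits are made up for illustration.
import math
import random

vocab = ["Paris", "Lyon", "London", "banana"]
logits = [4.0, 1.5, 1.0, -2.0]

def softmax(xs, temperature=1.0):
    scaled = [x / temperature for x in xs]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
token = random.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, [round(p, 3) for p in probs])), "-> sampled:", token)
```

Lowering the temperature sharpens the distribution toward the argmax; raising it flattens the distribution and makes unlikely tokens more probable.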

mr_toad 9 hours ago|||
It’s often very difficult (intractable) to come up with a probability distribution of an estimator, even when the probability distribution of the data is known.

Basically, you’d need a lot more computing power to come up with a distribution of the output of an LLM than to come up with a single answer.

podnami 12 hours ago|||
What happens before the probability distribution? I'm assuming alignment or other factors would influence it?
DavidSJ 12 hours ago||
In microgpt, there's no alignment. It's all pretraining (learning to predict the next token). But for production systems, models go through post-training, often with some sort of reinforcement learning which modifies the model so that it produces a different probability distribution over output tokens.

But the model "shape" and computation graph itself doesn't change as a result of post-training. All that changes is the weights in the matrices.

podnami 12 hours ago|||
I would assume this is from case to case, such as:

- How aligned has it been to “know” that something is true (eg ethical constraints)

- Statistical significance and just being able to corroborate one alternative in its training data more strongly than another

- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources

But I’m just a layman and could be totally off here.

Lionga 12 hours ago||
The LLM has an internal "confidence score" but that has NOTHING to do with how correct the answer is, only with how often the same words came together in training data.

E.g. getting two r's in strawberry could very well have a very high "confidence score", while a random but rare correct fact might very well have a very low one.

In short: LLMs have no concept of truth, nor any desire to produce it

sharperguy 11 hours ago|||
Still, it might be interesting information to have access to as someone running the model. Normally we read the output trying to build an intuition for the kinds of patterns it produces when it's hallucinating vs. creating something that happens to align with reality. Adding this in could help with that, even if it isn't always correlated with reality itself.
alexwebb2 12 hours ago||||
Huge leap there in your conclusion. Looks like you’re hand-waving away the entire phenomenon of emergent properties.
amelius 10 hours ago|||
> In short: LLM have no concept, or even desire to produce of truth

They do produce true statements most of the time, though.

jaen 10 hours ago||
That's just because true statements are more likely to occur in their training corpus.
red75prime 57 minutes ago|||
The overwhelming majority of true statements isn't in the training corpus, due to combinatorial explosion. What does it mean, then, that they are more likely to occur there?
amelius 9 hours ago|||
The training set is far too small for that to explain it.

Try to explain why one-shotting works.

jaen 7 hours ago||
Uh, to explain what? You probably read something into what I said while I was being very literal.

If you train an LLM on mostly false statements, it will generate both known and novel falsehoods. Same for truth.

An LLM has no intrinsic concept of true or false; everything is a function of the training set. It just generates statements similar to what it has seen, and higher-dimensional analogies of those.

subset 17 hours ago||
I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssembly, and then I'll put it up on my blog :) Andrej's code is really quite poetic; I love how much it packs into such a concise program
amelius 6 hours ago||
Storing the partial derivatives into the weights structure is quite the hack, to be honest. But everybody seems to do it like that.
hei-lima 10 hours ago||
Great work! Might do it too in some other language...
thomasmg 2 hours ago|||
I did a conversion to Java. It worked (at least I think so...) on the first try.

Next I want to convert it to my own programming language (which transpiles to C). I like these tiny projects very much!

pmarreck 9 hours ago|||
Zig, here.

Anything but Python

O5vYtytb 7 hours ago||
At least python can do this exercise without pulling 3rd party dependencies :)
justinhj 5 hours ago||
What's missing from Zig and its std lib for this?
moderation 2 hours ago||
Zig version [0] doesn't need any external dependencies.

0. https://tangled.org/m17e.co/microgpt

red_hare 17 hours ago||
This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html
tomjakubowski 2 hours ago||
I believe that Backbone's annotated source is generated with Docco, another project from the creator of CoffeeScript.

https://ashkenas.com/docco/

It's really neat. I wish I published more of my code this way.

ashish01 16 hours ago|||
That is a really beautiful literate program. Haven't seen one in a long time. Here is an Opus-generated version of this code - https://ashish01.github.io/microgpt.html
subset 14 hours ago|||
Andrej Karpathy has a walkthrough blog post here: https://karpathy.github.io/2026/02/12/microgpt/
OJFord 10 hours ago||
That is the article being discussed.
subset 19 minutes ago||
Gosh, tired brain moment apologies. I thought it'd linked to the original code gist.
altcognito 17 hours ago||
ask a high end LLM to do it
la_fayette 12 hours ago||
This guy is so amazing! With his video and the code base, I really feel I understand gradient descent, backpropagation, the chain rule, etc. Reading the math alone just confuses me; together with the code it becomes so clear! It feels like a lifetime achievement for me :-)
mentos 11 hours ago|
Curious if you could try to explain it. It’s my goal to sit down with it and attempt to understand it intuitively.

Karpathy says if you want to truly understand something then you also have to attempt to teach it to someone else ha

la_fayette 11 hours ago||
Yes, that’s true! That could be my next step… though I have to admit, writing this in a HN comment feels like a bit of a challenge.
growingswe 15 hours ago||
Great stuff! I wrote an interactive blogpost that walks through the code and visualizes it: https://growingswe.com/blog/microgpt
O4epegb 2 hours ago||
> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.

All 4 are in the dataset, btw

evntdrvn 9 hours ago|||
You should totally submit that to HN as an article, if you haven't already.
dang 2 hours ago||
We've put https://news.ycombinator.com/item?id=47205208 in the second-chance pool (https://news.ycombinator.com/pool, explained at https://news.ycombinator.com/item?id=26998308), so it will get a random placement on HN's front page.
joenot443 8 hours ago|||
This is awesome! Normally I'm pretty critical of LLM-assisted-blogging, but this one's a real winner.
spinningslate 11 hours ago|||
That’s beautifully done, thanks for posting. As helpful again to an ML novice like me as Karpathy’s original.
hei-lima 10 hours ago|||
Great!
evntdrvn 9 hours ago||
really nice, thanks
astroanax 3 hours ago||
I feel it's wrong to call it microgpt, since it's smaller than nanogpt; maybe picogpt would have been a better name? nice project tho
kuberwastaken 14 hours ago|
I'm half shocked this wasn't on HN before? Haha. I built PicoGPT as a minified fork with <35 lines of JS, and another in Python

And it's small enough to run from a QR code :) https://kuber.studio/picogpt/

You can quite literally train a micro LLM from your phone's browser

dang 2 hours ago||
Wow I agree - surprising that it took 2 weeks to make HN's frontpage.

We do generally like HN to be a bit uncorrelated with the rest of the internet, but it feels like a miss to me that neither https://news.ycombinator.com/item?id=47000263 nor https://news.ycombinator.com/item?id=47018557 made the frontpage.

cootsnuck 14 hours ago|||
It was: https://news.ycombinator.com/item?id=47000263
iberator 14 hours ago||
[flagged]
dang 2 hours ago|||
Please don't be a jerk on HN, and especially not when responding to someone's work. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.
lelandfe 9 hours ago||||
https://github.com/Kuberwastaken/picogpt/blob/main/picogpt.j...
kuberwastaken 13 hours ago|||
lol there is source code as a gist