
Posted by quesomaster9000 12/29/2025

Show HN: Z80-μLM, a 'Conversational AI' That Fits in 40KB (github.com)
How small can a language model be while still doing something useful? I wanted to find out, and had some spare time over the holidays.

Z80-μLM is a character-level language model with 2-bit quantized weights ({-2,-1,0,+1}) that runs on a Z80 with 64KB RAM. The entire thing (inference code, weights, and chat UI) fits in a 40KB .COM file that you can run in a CP/M emulator and hopefully even on real hardware!
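To give a feel for the density: four of those weights fit in one byte. Here's a minimal Python sketch of one possible packing scheme (illustrative only; the actual on-disk layout in the repo may differ):

    # Hypothetical packing: map {-2,-1,0,+1} to the 2-bit codes {0,1,2,3},
    # four weights per byte. Not necessarily the format the .COM file uses.
    CODES = {-2: 0, -1: 1, 0: 2, +1: 3}
    VALUES = {code: w for w, code in CODES.items()}

    def pack_weights(weights):
        """Pack weights (each in {-2,-1,0,+1}) into bytes, four per byte."""
        out = bytearray()
        for i in range(0, len(weights), 4):
            byte = 0
            for j, w in enumerate(weights[i:i + 4]):
                byte |= CODES[w] << (2 * j)
            out.append(byte)
        return bytes(out)

    def unpack_weights(data, count):
        """Recover `count` weights from bytes produced by pack_weights."""
        weights = []
        for byte in data:
            for j in range(4):
                if len(weights) == count:
                    return weights
                weights.append(VALUES[(byte >> (2 * j)) & 0b11])
        return weights

    assert unpack_weights(pack_weights([-2, -1, 0, 1, 1]), 5) == [-2, -1, 0, 1, 1]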

It won't write your emails, but it can be trained to play a stripped-down version of 20 Questions, and it is sometimes able to maintain the illusion of holding simple, terse conversations with a distinct personality.

--

The extreme constraints nerd-sniped me and forced interesting trade-offs: trigram hashing (typo-tolerant, but loses word order), 16-bit integer math, and some careful massaging of the training data to keep the examples 'interesting'.
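As a rough illustration of the trigram-hashing idea (the hash function and bucket count below are placeholders, not what the repo actually uses): the input is reduced to a bag of hashed character trigrams, which is what makes it typo-tolerant and also why word order is lost.

    # Illustrative only: bag-of-trigrams hashed into a small fixed-size vector.
    N_BUCKETS = 256

    def trigram_features(text, n_buckets=N_BUCKETS):
        """Hash each character trigram of `text` into a fixed-size count vector."""
        text = " " + text.lower() + " "        # pad so short inputs still yield trigrams
        counts = [0] * n_buckets
        for i in range(len(text) - 2):
            h = 0
            for ch in text[i:i + 3]:           # tiny rolling hash over the trigram
                h = (h * 31 + ord(ch)) & 0xFFFF
            counts[h % n_buckets] += 1
        return counts

    # "is it an animal" and "is it an aminal" share most trigrams, so their
    # vectors stay close despite the typo; a reshuffled sentence maps to nearly
    # the same bag, which is exactly where word order gets lost.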

The key was quantization-aware training that accurately models the inference code limitations. The training loop runs both float and integer-quantized forward passes in parallel, scoring the model on how well its knowledge survives quantization. The weights are progressively pushed toward the 2-bit grid using straight-through estimators, with overflow penalties matching the Z80's 16-bit accumulator limits. By the end of training, the model has already adapted to its constraints, so no post-hoc quantization collapse.
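The straight-through part looks roughly like this (a simplified PyTorch sketch of the general technique, not the repo's actual training loop; the loss and shapes below are just for illustration):

    import torch

    LEVELS = torch.tensor([-2.0, -1.0, 0.0, 1.0])     # the 2-bit weight grid

    class SnapToGrid(torch.autograd.Function):
        """Round each weight to the nearest grid level; pass gradients straight through."""
        @staticmethod
        def forward(ctx, w):
            idx = (w.unsqueeze(-1) - LEVELS).abs().argmin(dim=-1)
            return LEVELS[idx]

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out                            # straight-through estimator

    def overflow_penalty(acc, limit=32767.0):
        """Extra loss for accumulator values that would overflow 16-bit integer math."""
        return torch.relu(acc.abs() - limit).mean()

    # Keep a float master copy of the weights, snap it on every forward pass,
    # and backprop as if the snap were the identity function.
    w = torch.randn(64, 256, requires_grad=True)       # float master weights
    x = torch.randn(8, 256)                            # dummy batch
    acc = x @ SnapToGrid.apply(w).t()                  # forward pass on the 2-bit grid
    loss = acc.pow(2).mean() + overflow_penalty(acc)   # dummy task loss + overflow term
    loss.backward()                                    # gradients land in the float copy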

Eventually I ended up spending a few dollars on the Claude API to generate 20 Questions data (see examples/guess/GUESS.COM). I hope Anthropic won't send me a C&D for distilling their model against the ToS ;P

But anyway, happy code-golf season everybody :)

514 points | 122 comments
boznz 12/29/2025|
Great work. What is your timeline to AGI?
fuzzfactor 12/29/2025||
Can't possibly be further than just around the corner.
RustyRussell 12/29/2025||
I'm thinking early April?
a_t48 12/29/2025||
Nice - that will fit on a Game Boy cartridge, though bank switching might make it super terrible to run. Each bank is only 16KB. You can have a bunch of them, but you can only access one bank at a time (well, technically two - bank 0 is IIRC always accessible).
ColonelPhantom 12/29/2025||
Each layer of the LM is also at most 16 KiB, so if you want to minimize bank switching, I think making sure each layer is in one bank would be enough? Bank switching shouldn't give much overhead anyway unless it complicates an inner loop, which would be avoided if no layers are split across banks.
ant6n 12/29/2025||
You have 32KB of ROM, plus 8KB of RAM, on the original Game Boy. The Game Boy Color has more. Bank switching is super fast as well. Given that the model weights are likely streamed, I doubt bank switching is a problem.

Biggest pain point is likely the text input.

jasonjmcghee 12/29/2025||
For future projects and/or for this project, there are many LLMs available that are more than good enough to generate that kind of synthetic data (20 Qs) under permissive terms of use. (So you don't need to stress about breaking ToS, getting a C&D, etc.)
Zardoz84 12/29/2025||
Meanwhile, ELIZA was ported to BASIC and ran on many home computers in the 80s.
magicalhippo 12/29/2025||
As far as I know, the last layer is very quantization-sensitive, and is typically left unquantized or only lightly quantized.

Have you experimented with leaving it less quantized, and evaluated the quality drop?

Regardless, very cool project.

kouteiheika 12/29/2025|
(Not OP)

It depends on the model, but from my experiments (quantizing one layer of a model to 2-bit and then training the model with that layer in 2-bit to fix the damage), the first layer is the most sensitive, and yes, the last layer is sensitive too. The middle layers take quantization the best.

Different components of a layer also have a different sensitivity; e.g. the MLP downscale block damages the model the most when quantized, while quantizing the Q projection in self attention damages the model the least.
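Concretely the probe is something like this (PyTorch sketch; the quantize/finetune/eval helpers are stand-ins for whatever harness you already have):

    import copy
    import torch

    def probe_layer_sensitivity(model, layer_name, quantize_fn, finetune_fn, eval_fn):
        """Snap one named parameter to the 2-bit grid, fine-tune the rest to repair
        the damage, then report eval loss. The *_fn arguments are placeholders for
        your own quantizer, training loop, and evaluation code."""
        probe = copy.deepcopy(model)
        param = dict(probe.named_parameters())[layer_name]
        with torch.no_grad():
            param.copy_(quantize_fn(param))   # only this layer gets quantized
        param.requires_grad_(False)           # hold it fixed while the rest adapts
        finetune_fn(probe)
        return eval_fn(probe)                 # higher loss => more sensitive layer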

coolius 12/29/2025||
This is impressive; those are some very restrictive requirements. I wonder what we could run on more powerful hardware such as an ESP32 or RP2040. Has anyone tried this?
pdyc 12/29/2025||
Interesting. I'm wondering how far it could go if we removed some of these limitations but tried to solve an extremely specific problem, like generating a regex from user input? I know small models (270M range) can do that, but can it be done in, say, the <10MB range?
Waterluvian 12/29/2025|
Generate an LLM that is designed to solve one extremely specific problem: answering the ultimate question of life, the universe, and everything.

Even with modern supercomputing the computation would be outpaced by the heat death of the universe, so token output must be limited to a single integer.

nrhrjrjrjtntbt 12/29/2025||
00101010
dirkt 12/29/2025||
ELIZA's granddaughter.
lostmsu 12/30/2025||
Did you train the model with quantization awareness? How?
DrNosferatu 12/29/2025|
Awesome! Anyone for a port to the MSX?

A web version would also be cool.
