NanoChat – The best ChatGPT that $100 can buy

Posted by huseyinkeles 1 day ago

NanoChat – The best ChatGPT that $100 can buy (github.com)

https://x.com/karpathy/status/1977755427569111362

1448 points | 299 comments | page 3

mips_avatar 1 day ago

Thanks, Andrej, for putting this up. Your videos gave me the confidence to work full time on LLMs last year after I left Microsoft.

samus 1 day ago

Andrej Karpathy slays again by spreading knowledge about this important subject to the people!

wyldfire 1 day ago

I would love to take an existing open-weight model and fine-tune it with specific training data along these lines. Can I do that with Qwen or GLM? Is there a ~simple recipe for doing that?

desaiguddu 19 hours ago

I am building a product similar to DataGPT (https://datagpt.com/) and Julius.ai. Will this help with that?

simonw 17 hours ago

Not at all. This project is for learning how LLMs work and how to build them from first principles. If you want to solve problems that aren't "how do I build an LLM from scratch" this isn't the right path for you.

jumski 20 hours ago

$100 to train a sort-of conversational model in 4 hours? Wow.

RobGR 1 day ago

This is an LLM trained using a $100 budget to RENT access to graphics cards. It's not about what you could do BUYING hardware for $100.

danielmarkbruce 1 day ago

Nowhere does he suggest he is buying hardware.

HelloMcFly 18 hours ago

Once the LLM is trained you don't need the rented hardware anymore.

Havoc 1 day ago

>If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for --device_batch_size in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.

That sounds like it could run on a 24GB GPU. Batch size of 8 would imply 20GB mem, no?

...presumably just takes forever

JonathanFly 1 day ago

> Batch size of 8 would imply 20GB mem, no?

I'm running it now and I had to go down to 4 instead of 8, and that 4 is using around 22-23GB of GPU memory. Not sure if something is wrong or if batch size only scales part of the memory requirements. (Edit: I restarted, running the training script directly instead of via torchrun. 8 still doesn't fit, but 4 is now using 16-17GB instead.)
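The gap between the naive estimate and the measurement can be sketched with a fixed-plus-linear memory model. This is only a rough illustration: it assumes memory is `fixed + per_sample * batch`, takes 16.5GB as the midpoint of the 16-17GB reported at batch size 4, and assumes (the thread does not state this) that the default batch size of 32 fills exactly 80GB. The function names are hypothetical.

```python
def fit_linear_memory(batch_a, gb_a, batch_b, gb_b):
    """Fit mem(batch) = fixed + per_sample * batch through two measurements."""
    per_sample = (gb_b - gb_a) / (batch_b - batch_a)
    fixed = gb_a - per_sample * batch_a
    return fixed, per_sample


def estimate_gb(batch, fixed, per_sample):
    """Estimated GPU memory in GB for a given device batch size."""
    return fixed + per_sample * batch


# Naive estimate: memory scales purely with batch size (no fixed overhead).
naive_gb = 80 * 8 / 32

# Fixed-plus-linear estimate.
# (4, 16.5) is the midpoint of the reported 16-17GB measurement.
# (32, 80.0) is an ASSUMPTION, not a measured value.
fixed, per_sample = fit_linear_memory(4, 16.5, 32, 80.0)

if __name__ == "__main__":
    print(f"naive estimate at batch 8: {naive_gb:.1f} GB")
    print(f"fitted fixed overhead: {fixed:.1f} GB, per sample: {per_sample:.2f} GB")
    for batch in (1, 2, 4, 8, 16, 32):
        print(f"batch {batch:>2}: ~{estimate_gb(batch, fixed, per_sample):.1f} GB")
```

Under those assumptions the fitted model puts batch size 8 above 24GB, which would be consistent with 8 not fitting on a 4090, but the real split between weights, optimizer state, and activations may look quite different.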

On my 4090 the tok/sec is 523, which is about 1/2000 of the 1,000,000 tok/sec of the 8 80GB H100s. That feels too slow, so maybe something is wrong. The 4090 has about 1/3 of the raw compute of one H100. I'm sure there are other losses from less batching, but even if it were 1/10th as fast, I'd have expected something more like 1,000,000 / 10 / 8, so at least 10,000 tok/sec.
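The arithmetic above can be written out as a quick check. This sketch uses only the numbers quoted in the comment; the variable names are illustrative, and the "1/10th as fast as one H100" factor is the commenter's own pessimistic guess rather than a measured value.

```python
# Figures quoted in the comment.
cluster_tok_per_sec = 1_000_000   # 8x H100 80GB node
num_h100 = 8
observed_4090_tok_per_sec = 523

# Throughput of a single H100, assuming the node scales evenly across GPUs.
per_h100_tok_per_sec = cluster_tok_per_sec / num_h100

# How far the 4090 is behind the whole node.
node_to_4090_ratio = cluster_tok_per_sec / observed_4090_tok_per_sec

# Pessimistic expectation: the 4090 is 1/10 as fast as one H100.
expected_4090_tok_per_sec = per_h100_tok_per_sec / 10

# How far the observed number falls short of even that expectation.
shortfall = expected_4090_tok_per_sec / observed_4090_tok_per_sec

if __name__ == "__main__":
    print(f"per-H100 throughput:  {per_h100_tok_per_sec:,.0f} tok/sec")
    print(f"node / 4090 ratio:    {node_to_4090_ratio:,.0f}x")
    print(f"expected on 4090:     {expected_4090_tok_per_sec:,.0f} tok/sec")
    print(f"observed is short by: {shortfall:.1f}x")
```

So the observed 523 tok/sec is more than 20 times below even the pessimistic expectation, which supports the suspicion that something other than raw compute is the bottleneck.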

Havoc 19 hours ago

Thanks for investigating. Sounds like throwing some dollars at a cloud GPU makes more sense, then.

zipy124 1 day ago

Yes, you can always stream data when training or running inference on models when VRAM is lacking, but the slowdown is extremely noticeable. This is the case for CPU code too, and it is why optimising for bandwidth is so critical in high-performance computing. Your ability to compute is almost always substantially larger than your bandwidth. An AVX-512-capable CPU with enough cores can easily do multiple teraflops of fp64, but it is typically limited by memory bandwidth. GPUs running LLMs have just spread this knowledge to more people.

A fun consequence of CPUs getting faster more quickly than memory is that lookup tables of pre-computed values used to be a common optimisation, but now it is almost always quicker to recompute the values than to fetch them from memory in common use cases.
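The compute-versus-bandwidth point in this comment can be illustrated with a simple roofline-style estimate. This is a minimal sketch: the peak fp64 rate and memory bandwidth below are hypothetical round numbers rather than figures for any particular CPU, and the kernel is a plain fp64 `y = a*x + y` update that streams its data from main memory.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline bound: limited by either peak compute or memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


# HYPOTHETICAL machine, chosen only to make the arithmetic easy to follow.
PEAK_FP64_FLOPS = 2e12      # 2 TFLOP/s of fp64
MEM_BANDWIDTH = 200e9       # 200 GB/s of main-memory bandwidth

# y = a*x + y in fp64: 2 flops per element (multiply + add),
# 24 bytes of traffic per element (read x, read y, write y).
axpy_flops_per_byte = 2 / 24

axpy_rate = attainable_flops(PEAK_FP64_FLOPS, MEM_BANDWIDTH, axpy_flops_per_byte)

# Flops the core could perform in the time it takes to move one byte.
machine_balance = PEAK_FP64_FLOPS / MEM_BANDWIDTH

if __name__ == "__main__":
    print(f"axpy attainable rate: {axpy_rate / 1e9:.1f} GFLOP/s")
    print(f"fraction of peak:     {axpy_rate / PEAK_FP64_FLOPS:.2%}")
    print(f"machine balance:      {machine_balance:.0f} flops per byte")
```

Under these assumed numbers a streaming kernel reaches under 1% of peak, and the core can afford about 10 floating-point operations in the time it takes to move one byte from memory. That is the intuition behind recomputing a value instead of reading it from a lookup table that misses cache.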

markr1 1 day ago

$100 to teach us all how to build an LLM. This is what open education should look like.
