Posted by huseyinkeles 1 day ago
64 hours isn’t too bad at all!
(An RTX 2080 only does about 10 TFLOPS in fp32, so that would take roughly 3x as long again.)
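If anyone wants to sanity-check that scaling, here's a tiny Python sketch: training time is roughly inversely proportional to sustained throughput, so comparing two GPUs only needs their TFLOPS. The 64h baseline and the ~30 TFLOPS I assign to the reference GPU are my assumptions, not numbers from the repo:

    # Back-of-envelope scaling: wall-clock time ~ 1 / sustained FLOP/s,
    # so a known runtime on one GPU can be rescaled to another.
    def scaled_hours(baseline_hours: float,
                     baseline_tflops: float,
                     target_tflops: float) -> float:
        """Scale a known wall-clock time by relative throughput."""
        return baseline_hours * baseline_tflops / target_tflops

    # Assuming the 64h estimate was for a ~30 TFLOPS GPU, a 10 TFLOPS
    # fp32 RTX 2080 lands around 3x longer:
    print(scaled_hours(64, 30, 10))  # -> 192.0 hours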
I guess it’s still a work in progress? Couldn’t find any other information elsewhere.
I was really excited, too, until I looked through the readme files and the code.
I'm clueless here and don't understand this. Where is the $100 being spent? Some sort of API you have to pay to access? Virtual hardware you have to rent?
You need that much hardware because each H100 provides 80GB of GPU-accessible RAM, and training keeps a LOT resident at once: the model weights, gradients, optimizer state, and the activations for each batch. 8 * 80GB = 640GB in total.
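For a feel of where the memory goes, here's a rough per-parameter budget for mixed-precision Adam training. The byte counts are the usual textbook layout and the parameter count is my assumption; at this model size, most of the 640GB is really headroom for activations:

    # Rough GPU memory budget per parameter for mixed-precision Adam.
    # Byte counts are typical, not measured from this repo.
    def bytes_per_param() -> int:
        weights_bf16 = 2   # bf16 copy of the weights
        grads_bf16 = 2     # bf16 gradients
        master_fp32 = 4    # fp32 master weights
        adam_m = 4         # Adam first moment (fp32)
        adam_v = 4         # Adam second moment (fp32)
        return weights_bf16 + grads_bf16 + master_fp32 + adam_m + adam_v

    n_params = 560e6  # hypothetical nanochat-scale parameter count
    print(f"~{n_params * bytes_per_param() / 1e9:.0f} GB of persistent state")
    # Activations for large batches and long sequences come on top;
    # that's where most of the 640GB actually gets used.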
~$24/hour is how much it costs to rent that machine from various providers, and the speedrun takes roughly four hours, so the bill comes out just under $100.
Which is derived from HuggingFaceFW/fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
HuggingFaceTB/smol-smoltalk: https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk
And extra fine-tuning on portions of:
cais/mmlu: https://huggingface.co/datasets/cais/mmlu
openai/gsm8k: https://huggingface.co/datasets/openai/gsm8k
allenai/ai2_arc: https://huggingface.co/datasets/allenai/ai2_arc
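If you want to poke at these yourself, they all load with the Hugging Face datasets library. A minimal sketch; the subset names ("sample-10BT", "all", "main", "ARC-Challenge") are what I believe the dataset cards use, so double-check them before relying on this:

    from datasets import load_dataset

    # Stream fineweb-edu rather than downloading the full corpus.
    fineweb = load_dataset("HuggingFaceFW/fineweb-edu",
                           name="sample-10BT", split="train", streaming=True)
    smoltalk = load_dataset("HuggingFaceTB/smol-smoltalk", split="train")
    mmlu = load_dataset("cais/mmlu", "all", split="test")
    gsm8k = load_dataset("openai/gsm8k", "main", split="train")
    arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="train")

    print(next(iter(fineweb))["text"][:200])  # peek at one pretraining doc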