Posted by nnx 3 days ago
I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?
I don't see any mWh/token figures in that chart.
If you got that into a couple of gigs, what could you stuff into 20 gigs?
That'll be the real game changer.
Unfortunately my mental model doesn't contain anything to even let me guess whether that's possible; my AI days were on the falling flank of the symbolic era. Funny how one-bit models feel a bit like approaching an approximation of symbolic AI again (until you read about the grouped scale factors, and then the illusion is gone).
One thought that suggests rearranging is not involved, a thought that requires no special knowledge at all: if it did involve rearranging, someone would certainly have added order-by-scale-factor tricks with linear interpolation by address offset to lose even less precision.
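For anyone who hasn't met "grouped scale factors": the usual idea is that weights are stored as ternary/binary codes, with one floating-point scale per small group to recover magnitude. A minimal sketch of group-wise ternary quantization in that spirit (group size and scale rule here are illustrative assumptions, not any particular model's exact scheme):

```python
# Hedged sketch: group-wise ternary quantization with per-group scales.
# Not a specific model's algorithm, just the general shape of the idea.

def quantize_groups(weights, group_size=4):
    """Split weights into groups; store one scale plus {-1, 0, +1} codes per group."""
    out = []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        # Per-group scale: mean absolute value (a common, simple choice).
        scale = sum(abs(w) for w in group) / len(group) or 1.0
        # Round each weight to the nearest of {-1, 0, +1} after scaling.
        codes = [max(-1, min(1, round(w / scale))) for w in group]
        out.append((scale, codes))
    return out

def dequantize_groups(groups):
    """Reconstruct approximate weights: code times its group's scale."""
    return [scale * c for scale, codes in groups for c in codes]

weights = [0.8, -0.05, 0.4, -0.9, 0.1, 0.02, -0.3, 0.25]
groups = quantize_groups(weights)
approx = dequantize_groups(groups)
```

Storage-wise each weight costs under 2 bits plus the amortized scale, which is why the "it's symbolic again" illusion breaks: the scales smuggle real-valued magnitude back in.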
https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...
They train directly in the 1-bit domain, without any floating-point weights. Instead of the classical Newton-Leibniz derivative (which operates on approximations of real numbers) for gradient descent / backpropagation, they invented a binary analogue called "Boolean variation".
I don't know why this paper didn't get more attention.
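To get a feel for what training without real-valued gradients can look like, here's a toy of my own (an illustration, not the paper's actual "Boolean variation" rules): weights live in {-1, +1}, and the update signal for a weight is simply whether flipping that bit would increase the loss.

```python
# Toy bit-flip training on {-1, +1} weights. My own illustrative sketch,
# NOT the Boolean-variation algorithm from the linked paper.

def predict(w, x):
    # Sign of the +/-1 dot product (ties go to +1).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def loss(w, data):
    # Number of misclassified points.
    return sum(1 for x, y in data if predict(w, x) != y)

def train(w, data, sweeps=10):
    # Greedy bit-flip descent: keep a flip unless it makes the loss worse.
    for _ in range(sweeps):
        for i in range(len(w)):
            before = loss(w, data)
            w[i] = -w[i]                # tentatively flip bit i
            if loss(w, data) > before:  # the Boolean signal says "bad flip"
                w[i] = -w[i]            # undo it
    return w

data = [([1, -1, 1], 1), ([-1, 1, -1], -1), ([1, 1, 1], 1), ([-1, -1, -1], -1)]
w = train([-1, 1, -1], data)  # start from the worst configuration
```

The point is just that a discrete "would flipping help?" signal can drive learning without ever touching a real-valued gradient; the paper's contribution is making that idea scale to backprop through whole networks.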
Nonetheless, the Prism Bonsai models are impressive for their size. Where they fall apart is knowledge: good prose and logic for a tiny model, and fast even on modest hardware, but they hallucinate a lot. Which makes sense, since you can't fit the world's data into a couple of gigabytes. As a base model to fine-tune for use cases where size matters, though, it's probably a great choice.
>> What are some names like Llewelyn?
> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.
Can it be run on browsers with WASM/WebGPU?
Wow, if this is true, I am extremely impressed and excited!
I wonder how much better the KV cache is as well!