
Posted by redm 10 hours ago

BitNet: 100B Param 1-Bit model for local CPUs (github.com)
267 points | 132 comments
janalsncm 5 hours ago|
They have a demo video in the readme. I think they are trying to convey that BitNet is fast, which it is. But it is worth taking a moment to pause and actually see what the thing is doing so quickly.

It seems to keep repeating that the water cycle is the main source of energy for all living things on the planet and then citing Jenkins 2010. There are also a ton of sentences beginning with “It also…”

I don’t even think it’s correct. The sun is the main source of energy for most living things but there’s also life near hydrothermal vents etc.

I don’t know who Jenkins is, but this model appears to be very fond of them and the particular fact about water.

I suppose fast and inaccurate is better than slow and inaccurate.

simonw 9 hours ago||
Anyone know how hard it would be to create a 1-bit variant of one of the recent Qwen 3.5 models?
regularfry 8 hours ago||
There are q2 and q1 quants, if you want an idea of how much performance you'd drop. Not quite the same implementation-wise, but probably equivalent in terms of smarts.
nikhizzle 9 hours ago|||
Almost trivial using open source tools, the question is how it performs without calibration/fine tuning.
wongarsu 8 hours ago||
The results would probably be underwhelming. The BitNet paper doesn't give great baselines to compare to, but in their tests a 2B network trained natively at 1.58 bits with their architecture beat Llama 3 8B quantized to 1.58 bits. That same 2B network, though, was only about on par with a 1.5B Qwen2.5.

If you have an existing network, making an int4 quant is the better tradeoff. 1.58-bit quants only become interesting when you train the model specifically for them.

On the other hand, maybe it works much better than expected because Llama 3 is just a terrible baseline.
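For reference, the quantization function itself is tiny. A numpy sketch of the absmean ternary scheme described in the b1.58 paper (simplified: one scale per matrix, applied post hoc to a pretrained weight, i.e. exactly the "quantize an existing network" case that tends to underperform):

```python
import numpy as np

def absmean_ternary(W, eps=1e-6):
    """BitNet b1.58-style quantization: scale by the mean absolute
    value, then round and clip to {-1, 0, +1}. Dequantize as Wq * gamma."""
    gamma = np.abs(W).mean() + eps
    Wq = np.clip(np.rint(W / gamma), -1, 1)
    return Wq, gamma

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(4, 4))
Wq, gamma = absmean_ternary(W)
# Every surviving value is one of three levels, hence ~1.58 bits.
assert set(np.unique(Wq)) <= {-1.0, 0.0, 1.0}
```

The information thrown away here is what calibration or fine-tuning would have to recover, which is why training under this constraint from the start is a different proposition.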

knodi123 5 hours ago||
Why would they film a demo video of it spewing out barely-coherent rambling repetitive drivel? If your model sucks at writing essays, maybe just tell us that, and film a demo of it doing something it IS good at?
naasking 7 hours ago||
I think the README [1] for the new CPU feature is of more interest, showing linear speedups with number of threads. Up to 73 tokens/sec with 8 threads (64 toks/s for their recommended Q6 quant):

https://github.com/microsoft/BitNet/blob/main/src/README.md
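Part of why ternary weights scale so cleanly across CPU threads: the inner loop has no weight multiplications at all. A toy numpy illustration (the real kernels pack weights into 2-bit lookup tables, but the principle is the same):

```python
import numpy as np

def ternary_matvec(Wq, x):
    """Matrix-vector product when weights are only {-1, 0, +1}:
    each output is a sum of selected inputs minus another sum,
    so the multiplies disappear entirely."""
    pos = Wq > 0   # inputs to add
    neg = Wq < 0   # inputs to subtract
    return np.array([x[p].sum() - x[n].sum() for p, n in zip(pos, neg)])

Wq = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([2.0, 3.0, 5.0])
out = ternary_matvec(Wq, x)  # same result as Wq @ x
```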

itsthecourier 9 hours ago||
https://github-production-user-asset-6210df.s3.amazonaws.com...

demo shows a huge love for water, this AI knows its home

_fw 9 hours ago|
Also, very influenced by the literature of Jenkins (2010).
logicallee 8 hours ago||
It might interest you to know that one or two months ago, I had Claude port BitNet to WebGPU from the reference implementation, so that it runs right in your browser as a local model. After some debugging, the port seemed to work, but the model didn't function as well as the reference implementation, so I'll have to work on it for a while. You can see a debugging session livestreamed here[1]. The released model file was about a gigabyte; it fits in most people's GPUs. We were also able to successfully fine-tune it right in the browser.

There's a lot that you can do when the model size is that small, yet still powerful.

Our next step is that we want to put up a content distribution network for it where people can also share their diffs for their own fine-tuned model. I'll post the project if we finish all the parts.

[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN

rarisma 8 hours ago||
No 100b model.

My disappointment is immeasurable and my day is ruined.

patchnull 5 hours ago||
[flagged]
perfmode 5 hours ago||
The quality cliff question is the right one to be asking. There's a pattern in systems work where something that scales cleanly in theory hits emergent failure modes at production scale that weren't visible in smaller tests. The loss landscape concern is exactly that kind of thing, and nobody has actually run the experiment.

That said, I think the comparison to improving GGUF quantization isn't quite apples to apples. Post-training quantization is compressing a model that already learned its representations in high precision. Native ternary training is making an architectural bet that the model can learn equally expressive representations under a much tighter constraint from the start. Those are different propositions with different scaling characteristics. The BitNet papers suggest the native approach wins at small scale, but that could easily be because the quantization baselines they compared against (Llama 3 at 1.58 bits) were just bad. A full-precision model wasn't designed to survive that level of compression.

The real tell will be whether anyone with serious compute (not Microsoft, apparently) decides the potential inference cost savings justify a full training run. The framework existing lowers one barrier, but the more important barrier is that a failed 100B training run is extremely expensive, and right now there's not enough evidence to derisk it. Two years of framework polish without a flagship model is a notable absence.

andai 5 hours ago||
>Meanwhile GGUF Q2 and Q3 quantizations on llama.cpp keep getting better

Can you tell me more about this? It's been about a year since I looked into it, but it looked like performance dropped hard below Q4. I'd love to see more about this.

Also, what's a good way to run them? I mostly use Ollama, which only goes down to Q4. I think it supports HF URLs though?

password4321 3 hours ago||
This recent discussion is still open and may provide some helpful info:

How to run Qwen 3.5 locally https://news.ycombinator.com/item?id=47292522

aplomb1026 5 hours ago||
[dead]
ilovesamaltman 4 hours ago|
[flagged]