
Posted by tosh 1 day ago

Qwen3-Next (qwen.ai)
551 points | 223 comments
pveierland 1 day ago|
> "The content loading failed."

It's amazing how far and how short we've come with software architectures.

yekanchi 1 day ago||
How much VRAM does it require?
NitpickLawyer 1 day ago||
A good rule of thumb is that one parameter takes one unit of storage. The "default" unit these days is bf16 (i.e. 16 bits per weight), so for an 80B model that's ~160GB of weights. Then you have quantisation, usually to 8-bit or 4-bit, meaning each weight is stored in 8 or 4 bits. So for an 80B model that's ~80GB in fp8 and ~40GB in fp4/int4.

But in practice you need a bit more than that. You also need some space for the context (the KV cache), activation buffers, potentially a model graph, etc.

So you'll see in practice that you need 20-50% more RAM than this rule of thumb.

For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM. But it also depends on how you run it. With MoE models, you can selectively load some experts (parts of the model) into VRAM while offloading the rest to RAM. Or you could run it fully on CPU+RAM, since the active parameter count is low (3B). This should work pretty well even on older systems (DDR4).
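To put the rule of thumb in code, a rough back-of-the-envelope sketch (the 1.3x overhead factor for KV cache and buffers is an assumption, not a measured number):

    def estimate_memory_gb(params_billion: float, bits_per_weight: int,
                           overhead: float = 1.3) -> float:
        """Weights plus ~30% headroom for KV cache, buffers, graph, etc."""
        weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
        return weight_gb * overhead

    # Qwen3-Next-80B at different precisions
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: ~{estimate_memory_gb(80, bits):.0f} GB")
    # 16-bit: ~208 GB, 8-bit: ~104 GB, 4-bit: ~52 GB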

johntash 17 hours ago|||
Can you explain how context fits into this picture, by any chance? I sort of understand the VRAM requirement for the model itself, but it seems like larger context windows increase the RAM requirement by a lot more?
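(For what it's worth, the context-dependent part is mostly the KV cache, which grows linearly with context length. A generic transformer estimate, with illustrative layer/head counts rather than Qwen3-Next's actual config, which uses hybrid attention partly to shrink exactly this:)

    def kv_cache_gb(context_len: int, n_layers: int = 48, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
        """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
    # ~1.6 GB at 8k, ~6.4 GB at 32k, ~25.8 GB at 128k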
theanonymousone 1 day ago|||
But the RAM+VRAM can never be less than the size of the total (not active) model, right?
NitpickLawyer 1 day ago||
Correct. You want everything loaded, but for each forward pass just some experts get activated so the computation is less than in a dense model.

That being said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8GB of RAM, but it'd be really, really slow.
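The general pattern, as a toy sketch (this is not any particular library's API, just the idea: keep the activations resident and stream one layer's weights from disk at a time):

    import os, tempfile
    import numpy as np

    rng = np.random.default_rng(0)
    tmp = tempfile.mkdtemp()
    n_layers, dim = 4, 8

    # "Model" on disk: one weight matrix per layer, standing in for an SSD-resident checkpoint.
    for i in range(n_layers):
        np.save(os.path.join(tmp, f"layer_{i}.npy"),
                rng.standard_normal((dim, dim)).astype(np.float32))

    x = rng.standard_normal(dim).astype(np.float32)       # activations stay in memory
    for i in range(n_layers):
        w = np.load(os.path.join(tmp, f"layer_{i}.npy"))  # pull one layer from "disk"
        x = np.maximum(w @ x, 0.0)                        # apply it (toy ReLU layer)
        del w                                             # drop it before loading the next

    print(x)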

theanonymousone 1 day ago||
Can you give me a name please? Is that distributed llama or something else?
skirmish 1 day ago||
I have not used it but this is probably it: https://github.com/lyogavin/airllm
DiabloD3 1 day ago||
That's not a meaningful question. Models can be quantized to fit into much less memory, and not all MoE layers (in MoE models) have to be offloaded to VRAM to maintain performance.
yekanchi 1 day ago||
I mean 4-bit quantized. I can roughly calculate VRAM for dense models from model size, but I don't know how to do it for MoE models.
DiabloD3 1 day ago|||
Same calculation, basically. Any given ~30B model is going to be the same size and use the same VRAM (assuming you load it all into VRAM, which MoEs don't need to do).
EnPissant 1 day ago|||
MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.
regularfry 1 day ago||
This isn't quite right: it'll run with the full model loaded into RAM, swapping in the experts as it needs them. It has turned out in the past that expert choice can be stable across more than one token, so you're not swapping as much as you'd think. I don't know if that's been confirmed to still be true on recent MoEs, but I wouldn't be surprised.
mcrutcher 1 day ago|||
Also, though nobody has put the work in yet, the GH200 and GB200 NVIDIA "superchips" support exposing their full LPDDR5X and HBM3 as UVM (unified virtual memory), with much more bandwidth between LPDDR5X and HBM3 than a typical "instance" gets over PCIe. UVM can handle "movement" in the background and would be absolutely killer for these MoE architectures, but none of the popular inference engines allocate memory the right way for this (cudaMallocManaged()), let UVM (CUDA) actually handle data movement for them (automatic page migration and dynamic data movement), or are architected to avoid the pitfalls of this environment (e.g. the implications of CUDA graphs when using UVM).

It's really not that much code, though, and all the actual capabilities are there as of about the middle of this year. I think someone will make this work, and it will be a huge efficiency win for the right model/workflow combinations (effectively, being able to run 1T-parameter MoE models on GB200 NVL4 at "full speed" if your workload has the right characteristics).

EnPissant 1 day ago|||
What you are describing would be uselessly slow and nobody does that.
DiabloD3 1 day ago|||
I don't load all the MoE layers onto my GPU, and I see only about a 15% reduction in token generation speed while running a model 2-3 times larger than my VRAM alone could hold.
EnPissant 23 hours ago||
The slowdown is far more than 15% for token generation. Token generation is mostly bottlenecked by memory bandwidth: dual-channel DDR5-6000 has ~96GB/s, while an RTX 5090 has ~1.8TB/s. See my other comment where I show a 5x slowdown in token generation from moving just the experts to the CPU.
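To make the bandwidth argument concrete, an upper-bound sketch (it ignores compute, KV cache reads, and overlap; the 3B-active / 4-bit numbers are just for illustration):

    def max_decode_tps(active_params_b: float, bits_per_weight: int,
                       bandwidth_gb_s: float) -> float:
        """Ceiling on tokens/s if every active weight is read once per token."""
        bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    for name, bw in [("dual-channel DDR5-6000", 96), ("RTX 5090 GDDR7", 1800)]:
        print(f"{name}: ~{max_decode_tps(3, 4, bw):.0f} tok/s ceiling")
    # ~64 tok/s from system RAM vs ~1200 tok/s from VRAM at 3B active params, 4-bit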
DiabloD3 10 hours ago||
I suggest figuring out what your configuration problem is.

Which llama.cpp flags are you using? Because I am absolutely not having the same bug you are.

EnPissant 10 hours ago||
It's not a bug. It's the reality of token generation. It's bottlenecked by memory bandwidth.

Please publish your own benchmarks proving me wrong.

furyofantares 1 day ago||||
I do it with gpt-oss-120B on 24 GB VRAM.
EnPissant 23 hours ago||
You don't. You run some of the layers on the CPU.
furyofantares 22 hours ago||
You're right that I was confused about that.

LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That does slow it down but I'm not finding it unusable and it seems like it has some advantages - but I doubt I'm going to run it this way.

EnPissant 15 hours ago||
FWIW, that's an 80GB model and you also need KV cache. You'd need 96GB-ish to run it on the GPU.
furyofantares 15 hours ago||
Do you know if it's doing what was described earlier, when I run it with all layers on GPU - paging an expert in every time the expert changes? Each expert is only 5.1B parameters.
furyofantares 13 hours ago|||
^ Er, misspoke: each expert is at most ~0.9B parameters, and there are 128 experts. 5.1B is the number of active parameters (4 experts + some other parameters).
EnPissant 14 hours ago|||
It makes absolutely no sense to do what OP described. The decode stage is bottlenecked on memory bandwidth. Once you pull the weights from system RAM, your work is almost done. To then ship gigabytes of weights PER TOKEN over PCIe to do some trivial computation on the GPU is crazy.

What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster.

I'm guessing LM Studio gracefully falls back to running _something_ on the CPU. Hopefully you are running only the MoE layers on the CPU. I've only ever used llama.cpp.
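To put a number on the PCIe point (a rough sketch; the ~0.9B-per-expert size, 4-bit weights, and ~64 GB/s for PCIe 5.0 x16 are assumptions, and it pessimistically assumes the active experts change every token):

    # If 4 experts (~0.9B params each) had to cross PCIe for every token at 4 bits/weight:
    experts_per_token, params_per_expert, bits = 4, 0.9e9, 4
    bytes_per_token = experts_per_token * params_per_expert * bits / 8   # ~1.8 GB
    pcie_bytes_s = 64e9                                                  # PCIe 5.0 x16, approx.
    print(f"~{bytes_per_token / 1e9:.1f} GB/token -> at most "
          f"~{pcie_bytes_s / bytes_per_token:.0f} tok/s from transfers alone")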

furyofantares 13 hours ago||
I tried a few things and checked CPU usage in Task Manager to see how much work the CPU is doing.

KV Cache in GPU and 36/36 layers in GPU: CPU usage under 3%.

KV Cache in GPU and 35/36 layers in GPU: CPU usage at 35%.

KV Cache moved to CPU and 36/36 layers in GPU: CPU usage at 34%.

I believe you that it doesn't make sense to do it this way (it is slower), but it doesn't appear to be doing much of anything on the CPU.

You say gigabytes of weights PER TOKEN, is that true? I think an expert is about 2 GB, so a new expert is 2 GB, sure - but I might have all the experts for the token already in memory, no?

EnPissant 11 hours ago||
gpt-oss-120b chooses 4 experts per token and combines them.

I don't know how LM Studio works; I only know the fundamentals. There is no way it's sending experts to the GPU per token. Also, the CPU doesn't have much work to do. It's mostly waiting on memory.
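For intuition, a minimal numpy sketch of top-k expert routing (toy sizes; gpt-oss-120b's real router works on this principle but differs in detail):

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, k, dim = 128, 4, 64

    x = rng.standard_normal(dim).astype(np.float32)                 # one token's hidden state
    router_w = rng.standard_normal((dim, n_experts)).astype(np.float32)

    logits = x @ router_w                                           # router score per expert
    topk = np.argsort(logits)[-k:]                                  # pick the 4 best experts
    gates = np.exp(logits[topk]); gates /= gates.sum()              # softmax over the chosen 4

    # Each chosen expert runs its own FFN; outputs are combined with the gate weights.
    expert_ffns = rng.standard_normal((n_experts, dim, dim)).astype(np.float32)
    out = sum(g * (expert_ffns[e] @ x) for g, e in zip(gates, topk))
    print(topk, out.shape)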

furyofantares 10 hours ago||
> There is no way it's sending experts to the GPU per token.

Right, it seems like either experts are stable across sequential tokens fairly often, or there's more than 4 experts in memory and it's stable within the in-memory experts for sequential tokens fairly often, like the poster said.

zettabomb 1 day ago||||
llama.cpp has built-in support for doing this, and it works quite well. Lots of people running LLMs on limited local hardware use it.
EnPissant 23 hours ago||
llama.cpp has support for running some of or all of the layers on the CPU. It does not swap them into the GPU as needed.
regularfry 1 day ago||||
It's neither hypothetical nor rare.
EnPissant 23 hours ago||
You are confusing this with running layers on the CPU.
bigyabai 1 day ago||||
I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.
EnPissant 23 hours ago||
For contrast, I get the following for an RTX 5090 and the 30B Qwen3 Coder quantized to ~4 bits:

- Prompt processing 65k tokens: 4818 tokens/s

- Token generation 8k tokens: 221 tokens/s

If I offload just the experts to run on the CPU I get:

- Prompt processing 65k tokens: 3039 tokens/s

- Token generation 8k tokens: 42.85 tokens/s

As you can see, token generation is over 5x slower. This is only using ~5.5GB VRAM, so the token generation could be sped up a small amount by moving a few of the experts onto the GPU.

littlestymaar 1 day ago|||
AFAIK many people on /r/localLlama do pretty much that.
pzo 1 day ago||
Would be interesting to see how it compares to gpt-oss-120b. The latter also runs very fast, and its pricing is currently much better than qwen3-next on many providers. I would expect that if this model is as fast, pricing should be similar or even lower.
Western0 1 day ago||
Where is the GGUF?
gre 23 hours ago||
https://github.com/ggml-org/llama.cpp/issues/15940
daemonologist 1 day ago||
Patience, it just came out yesterday and has some architectural changes.
keyle 1 day ago||
For a model that can run offline, they've nailed how the website can too.

And it appears like it's thinking about it! /s

croemer 1 day ago||
ERR_NAME_NOT_RESOLVED
siliconc0w 1 day ago|
All these new datacenters are going to be a huge sunk cost. Why would you pay OpenAI when you can host your own hyper-efficient Chinese model for like 90% less cost at 90% of the performance? And that's compared to today's subsidized pricing, which they can't keep up forever.
hadlock 1 day ago||
Eventually Nvidia or a shrewd competitor will release 64/128GB consumer cards; locally hosted GPT-3.5+ quality is right around the corner. We're just waiting for consumer hardware to catch up at this point.
mft_ 7 hours ago||
I think we're still at least an order of magnitude away (in terms of affordable local inference, or model improvements to squeeze more from less, or a combination of the two) from local solutions being seriously competitive for general purpose tasks, sadly.

I recently bought a second-hand 64GB Mac to experiment with. Even with the biggest recent local model it can run (llama3.3:70b just about runs acceptably; I've also tried an array of Qwen3 30b variants), the quality is lacking for coding support. They can sometimes write and iterate on a simple Python script, but sometimes fail, and as general-purpose models they often fail to answer questions accurately (not surprisingly, considering a model is a compression of knowledge and these are comparatively small ones). They are far, far away from the quality and ability of the currently available Claude/Gemini/ChatGPT models. And even with a good eBay deal, the Mac cost the current equivalent of ~6 years of a monthly subscription to one of those.

Based on the current state of play, once we can access relatively affordable systems with 512-1024GB of fast (V)RAM and sufficient FLOPS to match, we might have a meaningfully powerful local solution. Until then, I fear local-only is for enthusiasts/hobbyists and niche non-general tasks.

GaggiX 1 day ago||
> to today's subsidized pricing, which they can't keep up forever.

The APIs are not subsidized; they probably have quite a large margin, actually: https://lmsys.org/blog/2025-05-05-large-scale-ep/

> Why would you pay OpenAI when you can host your own hyper-efficient Chinese model

The 48GB of VRAM or unified memory required to run this model at 4 bits is not free either.

siliconc0w 23 hours ago||
I didn't say it's free, but it is about 90% cheaper. Sonnet is $15 per million output tokens; this just dropped and is available on OpenRouter at $1.40. Even compared to Gemini Flash, which is probably the best price-to-performance API but is generally ranked lower than Qwen's models, at $2.50 it's still 44% cheaper.
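The arithmetic behind those percentages (prices per million output tokens as quoted above):

    prices = {"Claude Sonnet": 15.00, "Gemini Flash": 2.50, "Qwen3-Next (OpenRouter)": 1.40}
    qwen = prices["Qwen3-Next (OpenRouter)"]
    for name, price in prices.items():
        if not name.startswith("Qwen"):
            print(f"vs {name}: {(1 - qwen / price) * 100:.0f}% cheaper")
    # vs Claude Sonnet: 91% cheaper; vs Gemini Flash: 44% cheaper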