Posted by b4rtazz 7 days ago

Qwen3 30B A3B Hits 13 tokens/s on 4x Raspberry Pi 5 (github.com)
347 points | 161 comments
rldjbpin 5 days ago|
How would llm-d [1] compare to distributed-llama? Is the overhead or configuration too much for simple setups?

[1] https://github.com/llm-d/llm-d/

kosolam 7 days ago||
How is this technically done? How does it split the query and aggregate the results?
magicalhippo 7 days ago|
From the readme:

More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.

The maximum number of nodes is equal to the number of KV heads in the model #70.

I found this[1] article nice for an overview of the parallelism modes.

[1]: https://medium.com/@chenhao511132/parallelism-in-llm-inferen...
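
To make the KV-head limit concrete, here's a minimal sketch of tensor parallelism over KV heads (hypothetical shapes and head count, not the project's actual code):

  import numpy as np

  n_kv_heads = 4     # assumed for illustration
  head_dim   = 128   # assumed
  n_nodes    = 4     # must divide n_kv_heads, hence the cap on nodes

  # Full KV projection weight; each node keeps one contiguous slice,
  # i.e. the rows for the heads it owns.
  w_kv = np.random.randn(n_kv_heads * head_dim, 2048).astype(np.float32)
  shards = np.split(w_kv, n_nodes, axis=0)

  def node_forward(node_id, x):
      # Each node computes only its own heads' activations; nodes then
      # exchange results over Ethernet (an all-gather) to reassemble.
      return shards[node_id] @ x

  x = np.random.randn(2048).astype(np.float32)
  parts = [node_forward(i, x) for i in range(n_nodes)]
  assert np.allclose(np.concatenate(parts), w_kv @ x)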

varispeed 7 days ago||
So would 40x RPi 5 get 130 token/s?
SillyUsername 7 days ago||
I imagine it might be limited by the number of layers, and you'll get diminishing returns at some point from network latency as well.
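
A back-of-envelope sketch with invented numbers, just to show the shape of the curve: if per-token compute shrinks with node count while per-token sync cost grows, throughput peaks and then falls long before 40 nodes.

  # Toy Amdahl-style model; both constants are assumptions, not measurements.
  compute_ms_single = 300.0   # per-token compute on one Pi (assumed)
  sync_ms_per_node  = 5.0     # per-token Ethernet sync cost per node (assumed)

  for nodes in (1, 2, 4, 8, 16, 40):
      ms = compute_ms_single / nodes + sync_ms_per_node * nodes
      print(f"{nodes:>2} nodes: ~{1000 / ms:5.1f} tok/s")
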
reilly3000 7 days ago|||
It has to be 2^n nodes, and is limited to one node per attention head that the model has.
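
A quick check of that constraint (the power-of-two rule is as stated here; the head count of 4 is an assumption):

  def valid_node_count(n_nodes, n_heads):
      # Power of two, and at most one node per head.
      is_pow2 = (n_nodes & (n_nodes - 1)) == 0
      return is_pow2 and 0 < n_nodes <= n_heads

  print([n for n in range(1, 9) if valid_node_count(n, 4)])  # [1, 2, 4]
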
VHRanger 7 days ago||
Most likely not, because of NUMA bottlenecks.
ab_testing 6 days ago||
Would it work better on a used GPU?
ineedasername 6 days ago||
This is highly usable in an enterprise setting when the task benefits from near-human-level decision making, when $acceptable_latency < 1s, and when the decision can be expressed in <= 13 tokens of natural language.

Meaning that if you can structure a range of situations and tasks clearly in natural language, with a pseudo-code type of structure, and fit it in the model context, then you can have an LLM perform a huge amount of work with human-in-the-loop oversight and quality control for edge cases.

Think of office jobs, white-collar work, where business process documentation, employee guides, and job aids already fully describe 40% to 80% of the work. These are the tasks most easily structured with scaffolding prompts and more specialized RLHF-enriched data, and an LLM can then perform them more consistently.
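
As a hypothetical illustration (invented task, fields, and thresholds, not a real deployment), a scaffolding prompt built from existing process docs might look like:

  # Hypothetical scaffolding prompt; everything here is invented for illustration.
  TASK_PROMPT = """\
  Role: accounts-payable clerk.
  Procedure (from the employee guide):
    1. Check that the invoice total matches the purchase order.
    2. If the mismatch exceeds $50, escalate to a human reviewer.
    3. Otherwise approve and draft a one-line confirmation email.
  Input: {invoice_json}
  Output: JSON with fields "action" (approve | escalate) and "note".
  """

  print(TASK_PROMPT.format(invoice_json='{"po": 1042, "total": 310.00}'))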

This is what I describe when I'm asked, "But how will they do $X when they can't answer $Y without hallucinating?"

I explain the above capability, then I ask the person to do a brief thought experiment: How often have you heard, or yourself thought, something like "That is mind-numbingly tedious" and/or "a trained monkey could do it"?

In the end, I don't know anyone who is aware of the core capabilities, in the structured natural-language sense above, who doesn't see at a glance just how many jobs could easily go away.

I'm not smart enough to see where all the new jobs will be, or to be certain there will be as many of them; if I were, I'd start or invest in such businesses. But maybe not many new jobs get created, and then so what?

If the net productivity and output (essentially the wealth) of the global workforce remains the same or better with AI assistance, and therefore fewer work hours, that means... what? Less work on average per capita. More wealth per hour worked per capita than before.

Work hours used to be longer; they can shorten again. The problem is getting there: overcoming not just the "sure, but only the CEOs will get wealthy" side of things, but also the "full time means 40 hours a week minimum" attitude held by more than just managers and CEOs.

It will also mean that our concept of the "proper wage" for unskilled labor that can't be automated will have to change too. Wait staff at restaurants, retail workers, countless low-end service workers in food and hospitality? They'll now be providing, and giving up, something much more valuable than outdated white-collar skills: their time. I've heard this described as "embodied work", and while the term is jarring to my ears, I guess it fits. In any case, I've long considered my time to be something I'll trade with far more reluctance than my money, so I demand a lot of money for it when it's required, and use that money to buy back time in the near future, even if it's just by paying for grocery delivery instead of spending the time to shop myself.

Wow, this comment got away from me. But seeing Qwen3 30B-level quality at 13 tokens/s on dirt-cheap hardware struck a deep chord of "heck, the global workforce could be rocked to the core by cheap, quality 13 tokens/s." That isn't the sort of thought you can leave as a standalone drive-by on HN and have it be worth the seconds to write. And I'm probably wrong about a little or a lot of this; seeing some ideas on how I'm wrong will be fun and interesting.

mehdibl 7 days ago||
[flagged]
hidelooktropic 7 days ago|
13 tokens/s is not slow. Q4 is not bad. The models that run on phones are never 30B or anywhere close to that.
lostmsu 7 days ago|||
It is very slow and totally unimpressive. A 5060 Ti ($430 new) would do over 60, even more in batched mode. 4x RPi 5 are $550 new.
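
Taking those quoted figures at face value, the price-performance gap in tokens/s per dollar:

  # Throughput per dollar, using the numbers quoted above.
  setups = {"4x RPi 5": (13.0, 550.0), "5060 Ti": (60.0, 430.0)}
  for name, (tok_s, usd) in setups.items():
      print(f"{name}: {tok_s / usd:.3f} tok/s per $")   # ~0.024 vs ~0.140
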
magicalhippo 7 days ago||
So clearly we need to get this guy hooked up with Jeff Geerling so we can have 4x RPi5s with a 5060 Ti each...

Yes, I'm joking.

misternintendo 7 days ago|
At this speed it is only suitable for time-insensitive applications.
layer8 7 days ago||
I’d argue that chat is a time-sensitive application, and 13 tokens/s is significantly faster than I can read.
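
Rough numbers behind that (both constants are common rules of thumb, not measurements):

  words_per_min   = 250    # assumed typical reading speed
  words_per_token = 0.75   # rough English average
  print(f"reading ~{words_per_min / 60 / words_per_token:.1f} tok/s vs. 13 tok/s generated")
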
daveed 7 days ago||
I mean, it's a Raspberry Pi...