
Posted by AbuAssar 6 days ago

Lemonade by AMD: a fast and open source local LLM server using GPU and NPU(lemonade-server.ai)
570 points | 112 comments
cpburns2009 5 days ago|
Just in case anyone isn't aware: NPUs are low power, slow, and meant for small models.
jcgrillo 5 days ago|
I wonder what was the imagined use case? TBH I was seriously thinking about buying a framework desktop but the NPU put me off.. I don't get why I should have to pay money for a bunch of silicon that doesn't do anything. And now that there's some software support... it still doesn't do anything? Why does it even exist at all then?
ThatPlayer 5 days ago|||
At least part of it is probably Microsoft's 40 TOPS NPU requirement for their Copilot+ badge. Intel also has NPUs in its modern CPUs. Phone CPU manufacturers have been doing it even longer, though Google calls theirs a TPU.

I run an older Google Coral TPU in my home lab; Frigate NVR uses it for object detection on security cameras. It's more efficient, but less flexible, than running detection on the GPU.

Don't know if I need an NPU for my daily driver computer, but I would want one for my next home server.

naasking 5 days ago||||
Small models aren't entirely useless, and from what I've seen the NPU can run LLMs up to around 8B parameters. So one way they could be useful: Qwen3 text-to-speech models are all under 2B parameters, and OpenAI's whisper-small speech-to-text model is under 1B, so you could build an AI agent you can talk to and that talks back. In theory, you could offload all audio-to-text and text-to-audio processing to the low-power NPU and leave the GPU to do all of the LLM processing.
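One way to picture that split, as a minimal sketch: pin any stage small enough for the NPU there and send the rest to the GPU. The device-routing function, the 8B cutoff, and the stage names are illustrative assumptions, not a real API.

```python
# Illustrative sketch of splitting a voice agent across devices:
# small speech models pinned to the NPU, the big LLM to the GPU.
# The routing rule and model sizes here are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    params_b: float   # parameter count, in billions
    device: str

def assign_device(params_b: float, npu_limit_b: float = 8.0) -> str:
    """Pin anything small enough for the NPU there; the rest goes to the GPU."""
    return "npu" if params_b <= npu_limit_b else "gpu"

pipeline = [
    Stage("whisper-small (STT)", 0.24, assign_device(0.24)),
    Stage("chat LLM", 30.0, assign_device(30.0)),
    Stage("TTS (<2B)", 1.8, assign_device(1.8)),
]

for s in pipeline:
    print(f"{s.name}: {s.device}")
```

The point is just that the two speech stages fall well under the NPU limit while the LLM doesn't, so the GPU stays free for token generation.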
jcgrillo 5 days ago|||
That seems like a really niche use case, and probably not worth the surface area? The power savings would have to be truly astonishing to justify it, given what a small fraction of compute time your average device spends processing voice input. I'd wager the 90th percentile siri/ok google/whatever user issues less than 10 voice queries per day. How much power can they use running on normal hardware and how much could it possibly matter?
naasking 5 days ago||
It's just an example where it fits perfectly, and it's exactly what something like an Alexa or Google Home needs for low-power machine learning: when sitting idle, it should consume as little power as possible while waiting for a trigger word.

Any context that needs some limited intelligence while consuming little power would benefit from this.

zozbot234 5 days ago|||
You could always offload some layers to the NPU for lower power use and leave the rest to the GPU. If the latter is power throttled (common for prefill, not for decode) that will be a performance improvement.
naasking 5 days ago||
Routing in a MoE model might fit.
zozbot234 3 days ago||
You want routing to be as quick as possible, because there are dependent loads of expert MoE weights (at least from CPU in most setups, potentially from storage) downstream of it. So that ultimately depends on what the bottleneck on that part of the model is: compute, memory throughput or both? If it's throughput, the NPU might be a bad fit.
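A back-of-the-envelope way to check which resource bounds that step. All numbers below (expert size, bandwidth, FLOP counts) are illustrative assumptions, not measurements of any real model:

```python
# Back-of-the-envelope check: is streaming expert weights compute-bound or
# memory-bound? Every number here is an illustrative assumption.

def transfer_ms(bytes_moved: float, bandwidth_gbps: float) -> float:
    """Time to move expert weights at a given bandwidth, in milliseconds."""
    return bytes_moved / (bandwidth_gbps * 1e9) * 1e3

def compute_ms(flops: float, throughput_tflops: float) -> float:
    """Time to run the expert matmuls at a given throughput, in milliseconds."""
    return flops / (throughput_tflops * 1e12) * 1e3

# Say routing picks 2 experts of ~3 GB each per token, pulled over ~50 GB/s
# system memory, while the matmuls need ~12 GFLOPs on a ~50 TFLOPS accelerator.
t_mem = transfer_ms(2 * 3e9, 50)   # 120 ms: dominated by the weight loads
t_cmp = compute_ms(12e9, 50)       # 0.24 ms: compute is nearly free

print(f"memory: {t_mem:.1f} ms, compute: {t_cmp:.2f} ms")
```

Under those assumptions the step is memory-bound by orders of magnitude, which is the case where an NPU's extra compute buys you nothing.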
cpburns2009 5 days ago|||
The NPU is entirely useless for the Framework Desktop, and really all Strix Halo devices. Where it could be useful is cell phones with the examples mentioned by @naasking (audio-text and text-audio processing), and maybe IoT.
gnarlouse 5 days ago||
Maybe it's a language barrier problem, but "by AMD" makes me think it's a project distributed by AMD. Is that actually the case? I'm not seeing any reason to believe it is.
buildbot 5 days ago||
It’s a community project supported and sponsored by AMD, according to its GitHub: https://github.com/lemonade-sdk/lemonade

AMD employees work on it/have been making blog posts about it for a bit.

AbuAssar 5 days ago|||
Check the copyright notice at the bottom of the front page; it says (c) 2026 AMD.
guipsp 5 days ago|||
It is mostly developed by AMD, and used to be hosted on the AMD GitHub org, iirc.
hombre_fatal 5 days ago||
> You can reach us by filing an issue, emailing lemonade@amd.com

Found this on the github readme.

freedomben 5 days ago||
Neat, they have rpm, deb, and a companion AppImage desktop app[1]! Surprised I wasn't aware of this project before. Definitely going to give it a try.

[1]: https://github.com/lemonade-sdk/lemonade/releases/tag/v10.0....

bravetraveler 5 days ago||
A fun observation: pulling models sends ~200 Mbit of progress updates to your browser.
pantalaimon 5 days ago||
It's pretty annoying that you need vendor specific APIs and a large vendor specific stack to do anything with those NPUs.

This way software adoption will be very limited.

syntaxing 5 days ago||
Wow, this is super interesting. This creates a local “Gemini” front end and all. It's more or less a generative AI aggregator, installing multiple services for different generation modes. I’m excited to try this out on my Strix Halo. The biggest issue I had was image and audio gen, so this seems like a great option.
kouunji 5 days ago||
I’m looking forward to trying this. Currently Strix Halo’s NPU isn’t accessible if you’re running Linux, and previously I don’t think Lemonade was either. If this opens up the NPU, that would be great! Resolute raccoon is adding NPU support as well.
dennemark 5 days ago||
Maybe you have seen NPU support via FLM already: https://lemonade-server.ai/flm_npu_linux.html

"FastFlowLM (FLM) support in Lemonade is in Early Access. FLM is free for non-commercial use, however note that commercial licensing terms apply. "

cpburns2009 5 days ago|||
The NPU works on Linux (Arch at least) on Strix Halo using FastFlowLM [1]. Their NPU kernels are proprietary though (free up to a reasonable amount of commercial revenue). It's neat you can run some models basically for free (using NPU instead of CPU/GPU), but the performance is underwhelming. The target for NPUs is really low power devices, and not useful if you have an APU/GPU like Strix Halo.

[1]: https://github.com/FastFlowLM/FastFlowLM

boomskats 5 days ago||
I thought the NPU has been available since something like 6.12?
ilaksh 5 days ago||
Cool but is there a reason they can't just make PRs for vLLM and llama.cpp? Or have their own forks if they take too long to merge?
RealFloridaMan 5 days ago|
They use the latest llama.cpp under the hood but built for specific AMD GPU hardware.

Lemonade is really just a management plane/proxy. It translates Ollama/Anthropic APIs into OpenAI format for llama.cpp, and it runs different backends for STT/TTS and image generation. Lets you manage it all in one place.
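A rough sketch of the kind of request translation such a proxy performs, mapping an Ollama-style chat request onto the OpenAI chat-completions shape. The field names come from the public Ollama and OpenAI APIs, but this is a simplified illustration, not Lemonade's actual code; real requests carry many more options.

```python
# Sketch: map the common fields of an Ollama /api/chat request onto the
# OpenAI /v1/chat/completions shape that llama.cpp's server understands.
# Simplified for illustration; not Lemonade's actual implementation.

def ollama_to_openai(req: dict) -> dict:
    """Translate an Ollama-style chat request to OpenAI chat format."""
    opts = req.get("options", {})
    return {
        "model": req["model"],
        "messages": req["messages"],             # same role/content shape
        "stream": req.get("stream", True),       # Ollama streams by default
        "temperature": opts.get("temperature", 1.0),
        "max_tokens": opts.get("num_predict", -1),  # Ollama's token cap
    }

example = ollama_to_openai({
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "hello"}],
    "options": {"temperature": 0.2, "num_predict": 128},
})
print(example)
```

The interesting part is that the message list itself passes through unchanged; only the sampling options live in different places across the two APIs.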

metalliqaz 5 days ago|
my most powerful system is Ryzen+Radeon, so if there are tools that do all the hard work of making AI tools work well on my hardware, I'm all for it. I find it very frustrating to get LLMs, diffusion, etc. working fast on AMD. It's way too much work.