Posted by AbuAssar 5 days ago
Nowadays you get TTS, STT, and text & image generation, and image editing should also be possible. It can run via ROCm or Vulkan, on CPU, GPU, and NPU. Quite a lot of options. They keep a good, pragmatic pace of development. I really recommend this for AMD hardware!
Edit: The OpenAI-compatible (and, I think, nowadays also Ollama-compatible) endpoints let me use it in VS Code Copilot as well as in e.g. Open WebUI. More options are shown in their docs.
On the performance side, lemonade comes bundled with ROCm and Vulkan. These are sourced from https://github.com/lemonade-sdk/llamacpp-rocm and https://github.com/ggml-org/llama.cpp/releases respectively.
Lemonade has a Web UI for setting the context size and llama.cpp args. You need to set the context to a proper number, or just to 0 so that it uses the model's default. If it's too low, it won't work with agentic coding.
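As a rough illustration of why a small context breaks agentic coding (this is a back-of-the-envelope sketch, not Lemonade's actual logic; all token counts below are made-up example numbers):

```python
# Sketch: will an agentic coding prompt fit in the configured context window?
# All numbers are illustrative assumptions, not Lemonade defaults.

def context_fits(ctx_size, system_tokens, tool_schema_tokens,
                 file_tokens, reply_headroom=2048):
    """Return True if the prompt plus generation headroom fits in ctx_size."""
    needed = system_tokens + tool_schema_tokens + file_tokens + reply_headroom
    return ctx_size >= needed

# A typical agentic session: big system prompt, tool definitions,
# and a few source files pasted into context.
print(context_fits(4096, 1500, 1200, 3000))   # too small
print(context_fits(32768, 1500, 1200, 3000))  # plenty of room
```

Agentic tools stuff system prompts, tool schemas, and file contents into every request, so a window that is fine for chat can be far too small here.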
I will try some Claw app, but first I need to research the field a bit. For now I am using different models in Open WebUI. GPT 120B is fast, but Qwen3.5 27B is fine too.
27B is supposed to be really good but it's so slow I gave up on it (11-12 tg/s at Q4).
Running Qwen3.5 122B at 35 t/s as a daily driver using Vulkan llama.cpp on kernel 7.0.0rc5 on a Framework Desktop board (Strix Halo, 128 GB).
Also a pair of AMD AI Pro R9700 cards as my workhorses for zimageturbo, Qwen TTS/ASR, and other accessory functions and experiments.
Finally, I have a Radeon 6900 XT running Qwen3.5 32B at 60+ t/s as a fast all-arounder.
If I buy anything nvidia it will be only for compatibility testing. AMD hardware is 100% the best option now for cost, freedom, and security for home users.
Just out of curiosity... how so?
I only ask because I've been running local models (using Ollama) on my RX 7900 XTX for the last year and a half or so, and I haven't had a single ROCm-specific problem that I can think of. Actually, I've barely had any problems at all, other than the card being limited to 24GB of VRAM. :-(
I'm halfway tempted to splurge on a Radeon Pro board to get more VRAM, but ... haven't bitten the bullet yet.
I've had a few of those "model psychosis" incidents where the context gets so big that the model just loses all coherence and starts spewing gibberish though. Those are always fun.
It's probably using the Vulkan backend, which is pretty stable and has good performance.
You can share workloads between a GPU, CPU, and NPU, but it needs to be proportionally parceled out ahead of time; it's not the kind of thing that's easy to automate. Also, the GPU is generally orders of magnitude faster than the CPU or NPU, so the gains would be minimal, or completely nullified by the overhead of moving data around.
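A minimal sketch of the "proportionally parceled out ahead of time" idea: assign a model's layers to devices in proportion to their assumed relative throughputs. The throughput numbers are made up for illustration; a real setup would use llama.cpp's own split options rather than this helper:

```python
def split_layers(n_layers, throughputs):
    """Assign layers to devices proportionally to their relative throughput."""
    total = sum(throughputs.values())
    shares = {dev: round(n_layers * t / total) for dev, t in throughputs.items()}
    # Fix rounding drift so every layer is assigned exactly once.
    drift = n_layers - sum(shares.values())
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += drift
    return shares

# Assumed relative speeds: the GPU far faster than the CPU or NPU.
print(split_layers(32, {"gpu": 20.0, "cpu": 1.0, "npu": 3.0}))
```

Even with a perfect split, the slowest device bounds each forward pass, which is why offloading a few layers to a much slower CPU or NPU rarely helps.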
The largest advantage of splitting workloads is often to take advantage of dedicated RAM, e.g. stable diffusion workloads on a system with low VRAM but plenty of system RAM may move the latent image from VRAM to system RAM and perform VAE there, instead of on the GPU. With unified memory, that isn't needed.
The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.
Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.
It’s portable in the sense that it will install on any supported OS using the CPU or Vulkan backends. But it only ships ROCm builds and AMD NPU support out of the box. There is a way to override which llama.cpp build it uses if you want to run it on CUDA, but that adds more overhead to manage.
If you have an AMD machine and want to run local models with minimal headache…it’s really the easiest method.
This runs on my NAS, handles my home assistant setup.
I have a strix halo and another server running various CUDA cards I manage manually by updating to bleeding edge versions of llama.cpp or vllm.
My three NVIDIA cards are more power efficient than my one AMD card, both at idle and during usage.
Official ROCm is like pulling teeth, with poor support for desktop cards. Debian, a volunteer-led project, has better ROCm CI than AMD and supports more cards.
Look at any benchmarks: NV midrange cards are faster than AMD's and at least a generation ahead. Owning a 7900 XTX is an embarrassing disappointment.
I like AMD and want them to succeed, but they are way behind NV in this area.
I agree with most of your post and fled the AMD ecosystem some time ago because of the machine learning situation, but their problem seemed to be more the firmware bugs and memory management of compute shaders than the higher level libraries.
The obvious solution to this would be not to use ROCm. ROCm has always been a bit of a train wreck for small users, and it doesn't seem to do anything special anyway. The way forward is something more like Vulkan, which the server that today's link points to seems to be using. The existence of a badly managed software package doesn't mean users have to use it; they can use an alternative.
It would be nice if AMD sorted themselves out, though. The NVIDIA driver situation on Linux is painful, and if AMD could reliably run LLMs without the hardware locking up, I'd much rather move back to their products.
However, for prompt processing (pp), Vulkan is still nowhere near ROCm. That matters for long context and/or quick responses; a lot of people really care about time-to-first-token.
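A back-of-the-envelope illustration of why prompt-processing speed dominates time-to-first-token on long contexts. The pp rates below are made-up placeholders, not measured numbers for either backend:

```python
def time_to_first_token(prompt_tokens, pp_rate):
    """Seconds spent processing the prompt before the first token appears."""
    return prompt_tokens / pp_rate

# A 16k-token agentic prompt, with two assumed prompt-processing rates:
slow_backend = time_to_first_token(16000, 200)   # 200 tok/s prompt processing
fast_backend = time_to_first_token(16000, 1000)  # 1000 tok/s prompt processing
print(slow_backend, "s vs", fast_backend, "s before the first token")
```

Token-generation speed only matters after this wait, which is why two backends with similar tg/s can feel completely different on long prompts.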
This is answered from their Project Roadmap over on Github[0]:
Recently Completed: macOS (beta)
Under Development: MLX support
[0] https://github.com/lemonade-sdk/lemonade?tab=readme-ov-file#...
It also has endpoints that are compatible with OpenAI, Ollama, and Anthropic, so you can point any tool that is compatible with those at it and it will just run.
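For example, an OpenAI-style chat-completions request is just JSON over HTTP. This sketch only builds the payload; the localhost URL, port, path, and model name are assumptions (check the server's docs for the real ones), so the actual send is left commented out:

```python
import json

# Hypothetical local endpoint -- the port and path are assumptions.
URL = "http://localhost:8000/api/v1/chat/completions"

def build_chat_request(model, user_message):
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }

payload = build_chat_request("some-local-model", "Hello!")
print(json.dumps(payload, indent=2))

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(URL, json.dumps(payload).encode(),
#                              {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Since so many tools speak this schema, any of them can target a local server just by changing the base URL.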
https://github.com/lemonade-sdk/llamacpp-rocm
But I'm not doing anything with images or audio. I get about 50 tokens a second with GPT OSS 120B. As others have pointed out, the NPU is used for low-powered, small models that are "always on", so it's not a huge win for the standard chatbot use case.
Maybe the assumption is that container-oriented users can build their own if given native packages?
I suppose a Dockerfile could be included but that also seems unconventional.
Under the hood they are both running llama.cpp, but this one has specific builds for different GPUs. Not sure if the 9070 is one of them; I am running it on 370 and 395 APUs.
Model: qwen3.59b
Prompt: "Hey, tell me a story about going to space"
Ollama completed in about 1:44; Lemonade completed in about 1:14.
So Lemonade seems faster in this very limited test.
Thanks for that data point. I should experiment with ROCm.
ROCm should be faster in the end, if they ever fix those issues.