Posted by AbuAssar 5 days ago
Nowadays you get TTS, STT, and text & image generation, and image editing should also be possible. It can run via ROCm or Vulkan, on CPU, GPU, and NPU. Quite a lot of options. They keep a good, pragmatic pace of development. I really recommend this for AMD hardware!
Edit: The OpenAI-compatible (and, I think, nowadays also Ollama-compatible) endpoints let me use it in VS Code Copilot as well as in e.g. Open WebUI. More options are shown in their docs.
On the performance side, lemonade comes bundled with ROCm and Vulkan. These are sourced from https://github.com/lemonade-sdk/llamacpp-rocm and https://github.com/ggml-org/llama.cpp/releases respectively.
Lemonade has a Web UI for setting the context size and llama.cpp args. You need to set the context to a proper number, or just to 0 so that it uses the model's default. If it's too low, it won't work with agentic coding.
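As a rough illustration of why a small context breaks agentic coding (this is a back-of-the-envelope sketch, not Lemonade's actual logic; all token counts below are made-up example numbers):

```python
# Sketch: will an agentic coding prompt fit in the configured context window?
# All numbers are illustrative assumptions, not Lemonade defaults.

def context_fits(ctx_size, system_tokens, tool_schema_tokens,
                 file_tokens, reply_headroom=2048):
    """Return True if the prompt plus generation headroom fits in ctx_size."""
    needed = system_tokens + tool_schema_tokens + file_tokens + reply_headroom
    return ctx_size >= needed

# A typical agentic session: big system prompt, tool definitions,
# and a few source files pasted into context.
print(context_fits(4096, 1500, 1200, 3000))   # too small
print(context_fits(32768, 1500, 1200, 3000))  # plenty of room
```

Agentic tools stuff system prompts, tool schemas, and file contents into every request, so a window that is fine for chat can be far too small here.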
I will try some Claw app, but first I need to research the field a bit. For now I am using different models in Open WebUI. GPT 120B is fast, but Qwen3.5 27B is fine too.
27B is supposed to be really good but it's so slow I gave up on it (11-12 tg/s at Q4).
Running Qwen3.5 122B at 35 t/s as a daily driver using Vulkan llama.cpp on kernel 7.0.0rc5 on a Framework Desktop board (Strix Halo, 128 GB).
Also a pair of AMD AI Pro R9700 cards as my workhorses for zimageturbo, Qwen TTS/ASR, and other accessory functions and experiments.
Finally, I have a Radeon 6900 XT running Qwen3.5 32B at 60+ t/s as a fast all-arounder.
If I buy anything nvidia it will be only for compatibility testing. AMD hardware is 100% the best option now for cost, freedom, and security for home users.
Just out of curiosity... how so?
I only ask because I've been running local models (using Ollama) on my RX 7900 XTX for the last year and a half or so, and I haven't had a single ROCm-specific problem that I can think of. Actually, I've barely had any problems at all, other than the card being limited to 24GB of VRAM. :-(
I'm halfway tempted to splurge on a Radeon Pro board to get more VRAM, but ... haven't bitten the bullet yet.
I've had a few of those "model psychosis" incidents where the context gets so big that the model just loses all coherence and starts spewing gibberish though. Those are always fun.
It's probably using the Vulkan backend, which is pretty stable and has good performance.
You can share workloads between a GPU, CPU, and NPU, but it needs to be proportionally parceled out ahead of time; it's not the kind of thing that's easy to automate. Also, the GPU is generally orders of magnitude faster than the CPU or NPU, so the gains would be minimal, or completely nullified by the overhead of moving data around.
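A minimal sketch of the "proportionally parceled out ahead of time" idea: assign a model's layers to devices in proportion to their assumed relative throughputs. The throughput numbers are made up for illustration; a real setup would use llama.cpp's own split options rather than this helper:

```python
def split_layers(n_layers, throughputs):
    """Assign layers to devices proportionally to their relative throughput."""
    total = sum(throughputs.values())
    shares = {dev: round(n_layers * t / total) for dev, t in throughputs.items()}
    # Fix rounding drift so every layer is assigned exactly once.
    drift = n_layers - sum(shares.values())
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += drift
    return shares

# Assumed relative speeds: the GPU far faster than the CPU or NPU.
print(split_layers(32, {"gpu": 20.0, "cpu": 1.0, "npu": 3.0}))
```

Even with a perfect split, the slowest device bounds each forward pass, which is why offloading a few layers to a much slower CPU or NPU rarely helps.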
The largest advantage of splitting workloads is often to take advantage of dedicated RAM, e.g. stable diffusion workloads on a system with low VRAM but plenty of system RAM may move the latent image from VRAM to system RAM and perform VAE there, instead of on the GPU. With unified memory, that isn't needed.
The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.
Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.
It’s portable in the sense that it will install on any supported OS using the CPU or Vulkan backends. But it only ships ROCm builds and AMD NPU support out of the box. There is a way to override which llama.cpp build it uses if you want to run it on CUDA, but that adds more overhead to manage.
If you have an AMD machine and want to run local models with minimal headache…it’s really the easiest method.
This runs on my NAS, handles my home assistant setup.
I have a strix halo and another server running various CUDA cards I manage manually by updating to bleeding edge versions of llama.cpp or vllm.
My three NVIDIA cards are more power efficient than my one AMD card, both at idle and during usage.
Official ROCm is like pulling teeth, with poor support for desktop cards. Debian, a volunteer-led project, has better ROCm CI than AMD and supports more cards.
Look at any benchmarks: NV midrange cards are faster than AMD's and at least a generation ahead. Owning a 7900 XTX is an embarrassing disappointment.
I like AMD and want them to succeed, but they are way behind NV in this area.
I agree with most of your post and fled the AMD ecosystem some time ago because of the machine learning situation, but their problem seemed to be more the firmware bugs and memory management of compute shaders than the higher level libraries.
The obvious solution to this would be not to use ROCm. ROCm has always been a bit of a train wreck for small users, and it doesn't seem to do anything special anyway. The way forward is something more like Vulkan, which the server that today's link points to seems to be using. The existence of a badly managed software package doesn't mean users have to use it; they can use an alternative.
It would be nice if AMD sorted themselves out, though. The NVIDIA driver situation on Linux is painful, and if AMD could reliably run LLMs without the hardware locking up, I'd much rather move back to their products.
However, for prompt processing (pp), Vulkan is still nowhere near ROCm. That matters for long context and/or quick responses; a lot of people really care about time-to-first-token.
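A back-of-the-envelope illustration of why prompt-processing speed dominates time-to-first-token on long contexts. The pp rates below are made-up placeholders, not measured numbers for either backend:

```python
def time_to_first_token(prompt_tokens, pp_rate):
    """Seconds spent processing the prompt before the first token appears."""
    return prompt_tokens / pp_rate

# A 16k-token agentic prompt, with two assumed prompt-processing rates:
slow_backend = time_to_first_token(16000, 200)   # 200 tok/s prompt processing
fast_backend = time_to_first_token(16000, 1000)  # 1000 tok/s prompt processing
print(slow_backend, "s vs", fast_backend, "s before the first token")
```

Token-generation speed only matters after this wait, which is why two backends with similar tg/s can feel completely different on long prompts.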
This is answered from their Project Roadmap over on Github[0]:
Recently Completed: macOS (beta)
Under Development: MLX support
[0] https://github.com/lemonade-sdk/lemonade?tab=readme-ov-file#...
It also has endpoints that are compatible with OpenAI, Ollama, and Anthropic, so you can point any tool that is compatible with those at it and it will just run.
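For example, an OpenAI-style chat-completions request is just JSON over HTTP. This sketch only builds the payload; the localhost URL, port, path, and model name are assumptions (check the server's docs for the real ones), so the actual send is left commented out:

```python
import json

# Hypothetical local endpoint -- the port and path are assumptions.
URL = "http://localhost:8000/api/v1/chat/completions"

def build_chat_request(model, user_message):
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }

payload = build_chat_request("some-local-model", "Hello!")
print(json.dumps(payload, indent=2))

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(URL, json.dumps(payload).encode(),
#                              {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Since so many tools speak this schema, any of them can target a local server just by changing the base URL.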
https://github.com/lemonade-sdk/llamacpp-rocm
But I'm not doing anything with images or audio. I get about 50 tokens a second with GPT OSS 120B. As others have pointed out, the NPU is used for low-powered, small models that are "always on", so it's not a huge win for the standard chatbot use case.
Maybe the assumption is that container-oriented users can build their own if given native packages?
I suppose a Dockerfile could be included but that also seems unconventional.
Under the hood they are both running llama.cpp, but this one has specific builds for different GPUs. Not sure if the 9070 is one of them; I am running it on 370 and 395 APUs.
Model: qwen3.59b
Prompt: "Hey, tell me a story about going to space"
Ollama completed in about 1:44; Lemonade completed in about 1:14.
So Lemonade seems faster in this very limited test.
Thanks for that data point. I should experiment with ROCm.
ROCm should be faster in the end, if they ever fix those issues.