Posted by tamnd 19 hours ago
I'm assuming this is faster, and/or lets you run a bigger, smarter model than the generic tool chain would, but as far as I can see it doesn't spell out the measured improvement over that baseline, or the improvement it expects to reach.
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low-hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super-optimized model runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,
https://www.tomshardware.com/tech-industry/artificial-intell...
Custom code targeting one specific hardware implementation can improve performance quite a bit.
[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
Optimizing things usually means "think of a way to do the same thing with less effort".
I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint and started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
Even if not perfect, if you publish on GH or HF, some other agent can maybe start there instead of from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE), the biggest LLM I can run for practical use on the other hardware I have for local LLMs (M2 Max). Even if MTP is not working well, it's still an improvement over current llama.cpp, which does not run Ling-2.6-flash at all. This - https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....
I think llama.cpp could have done a much better job supporting PCs. Sure, some of it is due to bad vendor support, but with so many users I am surprised we don't see more optimized inference on standard PCs.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:*
- `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.
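A trivial sketch of that first check, just restating the numbers above (the function name is made up, nothing engine-specific):

```cpp
// Hypothetical helper: how many workgroups per CU does a launch provide?
// grid_size = total threads launched, workgroup_size = threads per workgroup,
// cu_count = 96 for the W7900, 40 for Strix Halo (per the list above).
double workgroups_per_cu(long long grid_size, int workgroup_size, int cu_count) {
    double workgroups = static_cast<double>(grid_size) / workgroup_size;
    return workgroups / cu_count;
    // < 0.3   : massively undersubscribed -- grow the grid before anything else
    // 0.3-1.0 : partially utilized; depends on VGPR/LDS pressure
    // 1.0-4.0 : healthy; micro-optimization can pay off
}
```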
*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.
*Router top-k (within-block fix):*
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
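For anyone curious what that fix looks like, here's a rough sketch of the pattern, not the actual `qwen35_router_select_kernel` code: names are made up, a real top-k would repeat the argmax k times while masking already-selected experts, and on ROCm/HIP `__shfl_down_sync` becomes `__shfl_down`.

```cpp
#include <cfloat>

// Sketch: block-wide parallel argmax over n_experts router logits,
// replacing an `if (threadIdx.x == 0)` serial scan.
__global__ void router_argmax_sketch(const float* logits, int n_experts, int* out_idx) {
    // Each thread scans a strided slice instead of thread 0 scanning everything.
    float best_val = -FLT_MAX;
    int   best_idx = -1;
    for (int i = threadIdx.x; i < n_experts; i += blockDim.x) {
        if (logits[i] > best_val) { best_val = logits[i]; best_idx = i; }
    }

    // Warp-level reduction via shuffles: no shared-memory traffic inside a warp.
    for (int off = warpSize / 2; off > 0; off >>= 1) {
        float v = __shfl_down_sync(0xffffffff, best_val, off);
        int   j = __shfl_down_sync(0xffffffff, best_idx, off);
        if (v > best_val) { best_val = v; best_idx = j; }
    }

    // One partial result per warp lands in a fixed-size shared buffer
    // (so nothing spills to scratch), then warp 0 reduces those.
    __shared__ float warp_vals[32];   // enough for blockDim.x <= 1024
    __shared__ int   warp_idxs[32];
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    if (lane == 0) { warp_vals[warp] = best_val; warp_idxs[warp] = best_idx; }
    __syncthreads();

    if (warp == 0) {
        int n_warps = (blockDim.x + warpSize - 1) / warpSize;
        best_val = (lane < n_warps) ? warp_vals[lane] : -FLT_MAX;
        best_idx = (lane < n_warps) ? warp_idxs[lane] : -1;
        for (int off = warpSize / 2; off > 0; off >>= 1) {
            float v = __shfl_down_sync(0xffffffff, best_val, off);
            int   j = __shfl_down_sync(0xffffffff, best_idx, off);
            if (v > best_val) { best_val = v; best_idx = j; }
        }
        if (lane == 0) out_idx[0] = best_idx;
    }
}
```

The point isn't this exact code; it's that at c=1 decode the grid stays at 1, so all the parallelism has to come from inside the block.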
Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of the cost, or on a delusional future where capable open-source models fit on consumer-grade hardware, is actually cooked.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
Why?
The trend is heading in the opposite direction: fewer options for strong consumer hardware and a push towards cloud-based products. This is a memory issue more than anything. Nvidia is done selling its GDDR7 to gamers and people with AI girlfriends.
It's not out of the realm of possibility, but I just want to make you aware that this would be a very surprising development in computing history.
> in the next few years a "good enough" model will run on entry-level hardware
And that's for laptops with unified memory. In the desktop space, 8GB discrete GPUs are going to be sticking around for a very long time.
But that's not my main argument. My main argument is that it's delusional for OP to think it's reasonable to expect that soon we'll be able to run models on consumer hardware that can build basically most things.
But I do think there will be many compromises made for consumer electronics. I don't think the powers that be are eager to give consumers all the best memory (that should be clear by now). There are 3 DDR5 DRAM manufacturers in the world that have to supply memory to all the world's militaries, governments, and datacenters/corporations. Consumers are last priority.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
An example is https://blog.can.ac/2026/02/12/the-harness-problem/ for just improving edits.
Or, if we could really steer these open-source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift https://jamon.dev/night-shift)?
They said the same thing about open source chess engines.
48 GB is enough for a capable LLM.
Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
- Consumers of LLM inference (developers and hobbyists) will be more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and incentivizing them to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)
- A larger market for on-device (and therefore open-weight) LLMs will probably result in more research concentrated on those inherently more efficient (because compute/memory-constrained) models.
I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but is still much less than what, let's say, a PS5 draws.
Also remember how AWS had the same promise and now we're just deploying stack after stack and need 'FinOps' teams to get us to be more resource-efficient?
They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPUs that need to be constantly replaced, while your phone is full of mostly unused compute capacity and will barely break a sweat if it's only serving queries for a single user at a time.
They don't seem to give much thought to the energy usage per user (or what this will potentially do to your phone battery), or how different phone-sized vs data center-sized models are in terms of capability.
[1] https://finance.yahoo.com/sectors/technology/articles/nvidia...
Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.
A Mac is also the rest of the personal computer!
If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus which feel maybe comparable-ish around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever I'm burning 500W in a datacenter somewhere during that time?
Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.
This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.
Claude Sonnet is probably running on an 8-GPU box that consumes 10 kW, while Opus might use more like 50 kW, but that's shared by a bunch of users thanks to batching.
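To put made-up but plausible numbers on it: if that 10 kW box is serving, say, 50 concurrent requests, the marginal share is roughly 10,000 W / 50 = 200 W per active request, and only for the seconds your tokens are actually being generated, not continuously per user.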
I could write an engine that only uses 10W on your machine, but it wouldn't be meaningful if it was also 10X slower.
More power consumption is usually an indicator that the hardware is being fully utilized, all things being equal (comparing GPU to GPU or CPU to CPU, not apples to oranges).
Edit: Caching story makes a lot more sense for regular usage:
> Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
If the latter test is successful, next would be testing Macs with more limited RAM, first running simple requests (would be quite slow) then larger batches (might be more worthwhile if one can partially amortize the cost of fetching weights from storage, and be bottlenecked by other factors).
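On the mmap question above: for the pure-CPU case the mechanism is just POSIX mmap of the weights file, so the OS pages tensors in from disk on first touch. A minimal sketch, with the file name and the `base + tensor_offset` hand-off purely illustrative (not this engine's actual format or API):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "model-weights.bin";   // hypothetical weights file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only: pages are faulted in from disk only
    // when a tensor is actually touched during a forward pass.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // Expert weights in a MoE are touched sparsely per token, so hint that
    // access will be scattered and eager readahead isn't worth it.
    madvise(base, st.st_size, MADV_RANDOM);

    // ... pass `static_cast<char*>(base) + tensor_offset` to the compute code ...

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```

The catch is exactly the one raised above: once the working set exceeds RAM, every decode step pays disk latency, so it's mainly attractive for sparse MoE access patterns or for batch workloads where the fetch cost can be partially amortized.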