later found that vLLM uses a paged KV cache with a layout that matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that’s fine for a single prompt but breaks L2 access patterns when batching.
reshaped the KV tensors in llama.cpp to interleave; made it [head, seq, dim] instead of [seq, head, dim], closer to how vLLM feeds data into its fused attention kernel. 2x speedup right there for the same ops.
GPU was never the bottleneck. it was memory layout not aligning with SM’s expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that’s the real reason it scales better per batch.
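To make the [seq, head, dim] vs [head, seq, dim] difference concrete, here is a rough sketch of the index arithmetic only. It is not llama.cpp or vLLM code; the sizes and function names are made up for illustration, and both layouts are assumed to be plain row-major. It shows that reading all positions of one head is strided in the first layout and one contiguous run in the second.

```python
# Hypothetical sizes, chosen only to keep the numbers small.
SEQ, HEADS, DIM = 4, 2, 8

def offset_seq_head_dim(s, h, d):
    """Flat index of element (s, h, d) in a row-major [seq, head, dim] tensor."""
    return (s * HEADS + h) * DIM + d

def offset_head_seq_dim(s, h, d):
    """Flat index of the same element in a row-major [head, seq, dim] tensor."""
    return (h * SEQ + s) * DIM + d

def seq_stride(offset_fn, h=0):
    """Distance in elements between consecutive sequence positions of one head."""
    return offset_fn(1, h, 0) - offset_fn(0, h, 0)

def head_block(offset_fn, h):
    """Every flat index touched when reading all (s, d) values of a single head."""
    return [offset_fn(s, h, d) for s in range(SEQ) for d in range(DIM)]

def is_contiguous(indices):
    """True if the indices form one unbroken ascending run."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

stride_old = seq_stride(offset_seq_head_dim)  # HEADS * DIM elements per step
stride_new = seq_stride(offset_head_seq_dim)  # DIM elements per step

print("stride [seq, head, dim]:", stride_old)
print("stride [head, seq, dim]:", stride_new)
print("one head contiguous, old layout:", is_contiguous(head_block(offset_seq_head_dim, 1)))
print("one head contiguous, new layout:", is_contiguous(head_block(offset_head_seq_dim, 1)))
```

With more heads the jump in the first layout grows as HEADS * DIM, while in the second it stays at DIM. Whether that translates into a real speedup depends on the kernel and hardware.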
this took its own time, say 2+ days, and I had to dig under the nice-looking GPU graphs to find the real bottlenecks. it was wildly trial and error tbf.
> anybody got idea on how to do this kinda experiment in hot reload mode without so much hassle??
ah right so the GPU was the bottleneck then
"The computational power of the cores on the GPU was never the issue-- however the code that I wrote resulted in a memory bandwidth bottleneck that starved the GPU cores of data to work on, which is firmly within my responsibilities as a programmer -- to fully understand the bandwidth and latency characteristics of the device(s) i'm running on"
Even if llama.cpp isn't used for batch inference now, this could let people finally run llama.cpp for batching, and on any hardware, since vLLM supports only select hardware. Maybe we can finally stop all this GPU API software fragmentation and the CUDA moat, as llama.cpp benchmarks have shown Vulkan to be as performant as or more performant than CUDA or SYCL.
[1] https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lab...
I believe batching is a concept only useful during the training or fine-tuning process.
For local hosting, a more likely scenario where you could use batching is if you had a lot of different data you wanted to process (lots of documents or whatever). You could batch them in sets of x and have it complete in 1/x the time.
A less likely scenario is having enough users that you can make the first user wait a few seconds while you wait to see if a second user submits a request. If you do get a second request, then you can batch them and the second user will get their result back much faster than if they had had to wait for the first user’s request to complete first.
Most people doing local hosting on consumer hardware won’t have the extra VRAM for the KV cache for multiple simultaneous inferences though.
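For a rough sense of the VRAM point, here is a back-of-the-envelope sketch of KV cache size. The model shape (32 layers, 32 KV heads, head dim 128, fp16, 4096-token context) is an assumption picked as a typical 7B-class example, not something from this thread, and the function name is made up.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3

# Assumed 7B-class shape, fp16 cache, full 4096-token context per sequence.
one_seq = kv_cache_bytes(32, 32, 128, 4096, batch=1)
four_seqs = kv_cache_bytes(32, 32, 128, 4096, batch=4)

print(f"1 sequence:  {one_seq / GIB:.1f} GiB")
print(f"4 sequences: {four_seqs / GIB:.1f} GiB")
```

The cache grows linearly with the number of simultaneous sequences, on top of the weights themselves. Models with fewer KV heads or a quantized cache would need less.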
[i1 i2] ⋅ [w1 w2 ; w3 w4] = [i1⋅w1 + i2⋅w3   i1⋅w2 + i2⋅w4]
Cool. Now what happens if we make the input vector a 2x2 matrix with, for some reason, a second set of two input values:
[i1 i2 ; j1 j2] ⋅ [w1 w2 ; w3 w4] = [i1⋅w1 + i2⋅w3   i1⋅w2 + i2⋅w4 ; j1⋅w1 + j2⋅w3   j1⋅w2 + j2⋅w4]
Look at that! The input has 2 rows, each row holding one set of input values for the network, and the output matrix has 2 rows, each containing the outputs for the respective inputs. So you can "just" apply your neural network to any number of inputs by putting one in each row. You could do 2, or 1000 this way ... and a number of values would only need to be calculated once.
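A tiny sketch of the same thing in plain Python, with made-up numbers standing in for i1, i2, j1, j2 and w1..w4. The weight matrix has rows [w1, w2] and [w3, w4], so the first output is i1⋅w1 + i2⋅w3. The `matmul` helper is only for illustration.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][c] for k in range(inner)) for c in range(cols)]
        for row in a
    ]

# Made-up weights: rows are [w1, w2] and [w3, w4].
W = [[1.0, 2.0],
     [3.0, 4.0]]

i_row = [5.0, 6.0]  # stands in for [i1, i2]
j_row = [7.0, 8.0]  # stands in for [j1, j2]

single_i = matmul([i_row], W)       # one request on its own
single_j = matmul([j_row], W)       # another request on its own
batched = matmul([i_row, j_row], W)  # both requests as two rows of one matrix

print(single_i, single_j)
print(batched)
```

Each row of the batched result matches what you get from running that input alone, which is the row-by-row independence the equations show for a single matrix multiply.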
The premise of self-attention is exactly that it isn't context-free, so it is also incorrect to say that batched requests do not mathematically affect each other. They do, and that's by design.
The GPU can’t do anything with weights while they are in VRAM. They have to be moved into the GPU itself first.
So it is about memory round-trips, but not between RAM and VRAM. It’s the round trips between the VRAM and the registers in the GPU die. When batch processing, the calculations for all batched requests can be done while the model parameters are in the GPU registers. If they were done sequentially instead, you would multiply the number of trips between the VRAM and the GPU by the number of individual inferences.
Also, batched prompts and outputs are indeed mathematically independent from each other.
Moving data to and from VRAM is ~100ns of latency. Moving data from RAM to VRAM through PCIe 5.0 is 1-10us of latency. So, ~1 to ~2 orders of magnitude of difference.
And this is the reason why batching is used - you don't want to pay the price of that latency for each and every CPU-to-GPU request but you want to push as much data as you can through a single round-trip.
Every weight has to be touched for every forward pass, meaning you have to wait for 16G to transfer from VRAM -> SRAM -> registers. That's not even close to 100ns: on a 4090 with ~1TB/s memory bandwidth that's 16 milliseconds. PCIe latency to launch kernels or move 20 integers or whatever is functionally irrelevant on this scale.
The real reason for batching is it lets you re-use that gigantic VRAM->SRAM transfer across the batch & sequence dimensions. Instead of paying a 16ms memory tax for each token, you pay it once for the whole batched forward pass.
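The arithmetic above, as a quick sketch. It uses the 16 GB of weights and ~1 TB/s bandwidth figures from the comment; the function names are made up, and the "ceiling" assumes the forward pass is purely bound by reading the weights once, ignoring compute and KV cache traffic.

```python
def weight_read_ms(model_bytes, bandwidth_bytes_per_s):
    """Time to stream all weights from VRAM once, in milliseconds."""
    return model_bytes / bandwidth_bytes_per_s * 1e3

def tokens_per_second_ceiling(model_bytes, bandwidth_bytes_per_s, batch):
    """Upper bound on tokens/s if each forward pass reads the weights once
    and produces one token per sequence in the batch (memory-bound assumption)."""
    passes_per_second = bandwidth_bytes_per_s / model_bytes
    return passes_per_second * batch

MODEL_BYTES = 16e9   # 16 GB of weights, as in the comment
BANDWIDTH = 1e12     # ~1 TB/s, as in the comment

per_pass_ms = weight_read_ms(MODEL_BYTES, BANDWIDTH)
ceiling_b1 = tokens_per_second_ceiling(MODEL_BYTES, BANDWIDTH, batch=1)
ceiling_b8 = tokens_per_second_ceiling(MODEL_BYTES, BANDWIDTH, batch=8)

print(f"weight read per forward pass: {per_pass_ms:.1f} ms")
print(f"ceiling at batch 1: {ceiling_b1:.1f} tokens/s")
print(f"ceiling at batch 8: {ceiling_b8:.1f} tokens/s")
```

The per-pass cost stays the same while the number of tokens produced per pass scales with the batch, until compute or KV cache reads become the limit instead.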
That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.
I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.