
H100 PCIe – 1.86 TB/s memcpy roofline and 8× uplift

I ran A/B benchmarks on an H100 PCIe 80GB node. Contiguous memcpy sustained ~1.86 TB/s in both the baseline and optimized runs, so the optimization adds no overhead on already-coalesced access. For strided and misaligned access, the baseline reached ~230 GB/s while the optimized version sustained ~1.86 TB/s, roughly an 8× improvement. Large 8–24 GB payloads also held ~1.86 TB/s. Canonical memory-bound CUDA kernels (memcpy, strided access, KV-cache, LayerNorm) improved from ~220–330 GB/s to ~1.8–1.86 TB/s, around 7–8× faster, with very low run-to-run jitter.
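
To make the strided vs. contiguous comparison concrete, here is a minimal sketch of this kind of microbenchmark (the kernel names, 1 GiB buffer size, and 32-float stride are illustrative placeholders, not the exact harness behind the numbers above):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Fully coalesced copy: adjacent threads touch adjacent floats.
    __global__ void copy_contiguous(const float* __restrict__ src,
                                    float* __restrict__ dst, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided read: a stride of 32 floats (128 B) puts every thread in a warp
    // on a different cache line, so achieved read bandwidth collapses.
    __global__ void copy_strided(const float* __restrict__ src,
                                 float* __restrict__ dst, size_t n, size_t stride) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[(i * stride) % n];
    }

    int main() {
        const size_t n = 1ull << 28;            // 256M floats = 1 GiB per buffer
        const size_t bytes = n * sizeof(float);
        float *src, *dst;
        cudaMalloc(&src, bytes);
        cudaMalloc(&dst, bytes);
        cudaMemset(src, 1, bytes);              // nonzero fill, value irrelevant

        const dim3 block(256), grid((unsigned)((n + 255) / 256));
        cudaEvent_t beg, end;
        cudaEventCreate(&beg);
        cudaEventCreate(&end);

        // Contiguous pass: warm-up launch, then one timed launch.
        copy_contiguous<<<grid, block>>>(src, dst, n);
        cudaEventRecord(beg);
        copy_contiguous<<<grid, block>>>(src, dst, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, beg, end);
        printf("contiguous: %.0f GB/s\n", 2.0 * bytes / (ms * 1e6));

        // Strided pass: same requested bytes, uncoalesced read pattern.
        copy_strided<<<grid, block>>>(src, dst, n, 32);
        cudaEventRecord(beg);
        copy_strided<<<grid, block>>>(src, dst, n, 32);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&ms, beg, end);
        printf("strided:    %.0f GB/s\n", 2.0 * bytes / (ms * 1e6));

        cudaFree(src);
        cudaFree(dst);
        return 0;
    }

On an H100 PCIe the contiguous pass should land near the ~1.86 TB/s roofline, while a strided pass like this falls to a few hundred GB/s; that is the gap the optimized runs close.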

Using a simple LLM decode cost model (BPT, bytes moved per generated token, ≈ 1.13 MB/token), modeled decode throughput improved from ~161.9k tok/s to ~225.1k tok/s (≈1.39×). This suggests memory-bound operations like KV-cache reads and strided loads can be lifted close to roofline bandwidth, with a direct impact on decode throughput.
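
As a back-of-envelope check on why an ~8× kernel-level uplift shows up as only ≈1.39× end to end: an Amdahl-style split (my simplification, not the full cost model), where a fraction p of per-token decode time sits in the lifted memory-bound kernels and those kernels speed up by k, gives

    S = \frac{1}{(1 - p) + p/k}
    \quad\Rightarrow\quad
    p = \frac{1 - 1/S}{1 - 1/k}
      \approx \frac{1 - 1/1.39}{1 - 1/8}
      \approx 0.32

i.e. under this reading roughly a third of per-token decode time is attributable to the strided / KV-cache path.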

I’m interested in feedback on how such memory-bound optimizations might affect LLM training versus inference, and what public long-context (8k–32k) benchmarks would be good to test next?
