Posted by tosh 7 hours ago
First teach what the network does and why, writing it as a loopy, inference-only Python function. Explain training only in an abstract way, e.g. with the "take a random weight, twist it a little and see if the loss improves" algorithm. This lets you focus on the architecture and on why it is what it is.
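For illustration, here is a minimal sketch of that "twist a random weight" idea on a single linear neuron. The function names, the toy data and the step size are all assumptions made for the example, not something from the article:

```python
import random

def predict(weights, bias, inputs):
    """Inference for one linear neuron, written as a plain loop."""
    total = bias
    for weight, value in zip(weights, inputs):
        total += weight * value
    return total

def mean_squared_error(weights, bias, examples):
    """Average squared difference between predictions and targets."""
    error_sum = 0.0
    for inputs, target in examples:
        difference = predict(weights, bias, inputs) - target
        error_sum += difference * difference
    return error_sum / len(examples)

def train_by_twisting(examples, steps=10000, nudge=0.05, seed=0):
    """Nudge one random parameter at a time; keep the nudge only if the loss improves."""
    rng = random.Random(seed)
    parameters = [0.0] * (len(examples[0][0]) + 1)  # weights, then the bias last
    best_loss = mean_squared_error(parameters[:-1], parameters[-1], examples)
    for _ in range(steps):
        index = rng.randrange(len(parameters))
        old_value = parameters[index]
        parameters[index] = old_value + rng.uniform(-nudge, nudge)
        loss = mean_squared_error(parameters[:-1], parameters[-1], examples)
        if loss < best_loss:
            best_loss = loss               # the twist helped, keep it
        else:
            parameters[index] = old_value  # the twist hurt, undo it
    return parameters[:-1], parameters[-1], best_loss

# Toy data (an assumption for this sketch): targets follow y = 2*a - 3*b + 1.
examples = [((a, b), 2 * a - 3 * b + 1) for a in range(-2, 3) for b in range(-2, 3)]
weights, bias, final_loss = train_by_twisting(examples)
```

It is hopelessly inefficient for anything real, but it shows that training is "just" a search for weights that lower the loss.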
Then, teach the intuitions behind derivatives and gradient descent. You don't need the entirety of calculus; there's no benefit to knowing how a sequence or limit works if you only want to understand neural networks. With autograd, you won't be manually doing derivatives of weird functions either, so intuitive understanding is a lot more important than doing dozens of traditional calculus exercises on paper like it's the 1800s. You could probably explain the little bit of calculus you need in an hour or two, even to somebody with a 12-year-old's understanding of math and a good bit of programming knowledge.
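A rough sketch of how little calculus that intuition needs: the slope is "how much does the loss change if I wiggle the weight a tiny bit", and gradient descent walks against the slope. The toy loss, the names and the learning rate are assumptions for illustration, and the finite-difference slope here is a stand-in for what autograd would compute, not how autograd works internally:

```python
def loss(weight):
    """A toy loss whose minimum sits at weight = 3."""
    return (weight - 3.0) ** 2

def slope(function, point, step=1e-6):
    """Estimate the derivative by wiggling the input a little in both directions."""
    return (function(point + step) - function(point - step)) / (2 * step)

def gradient_descent(function, start, learning_rate=0.1, iterations=100):
    """Repeatedly step downhill, i.e. against the slope."""
    point = start
    for _ in range(iterations):
        point -= learning_rate * slope(function, point)
    return point

best_weight = gradient_descent(loss, start=-5.0)
```

Compared with twisting random weights, the slope tells you which direction to move in and roughly how far, which is the whole reason derivatives are worth the hour or two.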
Only when people understand the training and inference, implemented with loops and descriptive variable names, teach tensors, explain how a modern CPU and GPU work (because many programmers still think a modern computer is just a much faster 6502), and then teach the tricks we use to make it fast.
wild
"The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not)
You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.
When the author says it's millions of FLOPs faster on a GPU than in an interpreted programming language, they're not comparing the two directly but the algorithms that run on them; the tools used to implement and run the algorithms stand in for the algorithms themselves.
It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of flops slower than on the GPU". There is no category error there.
but I find it illuminating to compare what a certain hardware can do in principle (what is possible) vs what I can "reach" as programmer within a certain system/setup
in this case NVIDIA A100 vs "Python" that does not reach an A100 (without the help of CUDA and PyTorch)
another analogy:
I find it useful to be able to compare the fastest known way to move a container from A to B using a certain vehicle (e.g. a truck) with how fast a person who cannot drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)
I'm also interested in how much energy is needed, how much the hw costs and so on
Often there are many ways to do things, comparing is a great starting point for learning more
that said: Python can get to more FLOPs by changing the representation: https://docs.python.org/3/library/array.html
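A small sketch of what "changing the representation" means with the standard-library array module. The variable names are made up for the example. One caveat I'm fairly sure of: the array stores raw C doubles, but indexing it from Python still hands back a boxed float object, so the dense layout mostly pays off when compiled code (e.g. a C extension reading the buffer) does the arithmetic:

```python
import sys
from array import array

values = [float(i) for i in range(1000)]

# 'd' stores each element as a raw C double in one contiguous buffer.
dense = array("d", values)

bytes_per_dense_item = dense.itemsize      # size of one stored double
bytes_per_boxed_float = sys.getsizeof(1.0) # size of one Python float object
total = sum(dense)                         # each element is boxed again on access
```

On CPython a boxed float is several times larger than the 8-byte double, and a list adds a pointer per element on top of that.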
Okay, but surely you know what they actually mean, right? Or are you being willfully obtuse? They are comparing CPython (the main Python implementation) running on the CPU with a kernel running on the GPU.
> Overhead is when your code is spending time doing anything that's not transferring tensors or computing things. For example, time spent in the Python interpreter? Overhead. Time spent in the PyTorch framework? Overhead. Time spent launching CUDA kernels (but not executing them)? Also... overhead.
> The primary reason overhead is such a pernicious problem is that modern GPUs are really fast. An A100 can perform 312 trillion floating point operations per second (312 TeraFLOPS). In comparison, Python is really slooooowwww. Benchmarking locally, Python can perform 32 million additions in one second.
> That means that in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS.
> Even worse, the Python interpreter isn't even the only source of overhead - frameworks like PyTorch also have many layers of dispatch before you get to your actual kernel. If you perform the same experiment with PyTorch, we can only get 280 thousand operations per second. Of course, tiny tensors aren't what PyTorch is built for, but... if you are using tiny tensors (such as in scientific computing), you might find PyTorch incredibly slow compared to C++.
Emphasis mine.
It’s all a bit jumbled up. I get that he was going for an informal tone and this isn’t exactly a benchmark, but I’m still not sure what was measured. Based on the second emphasized part, I think the “bad” measurements are coming from Python+PyTorch with too-small workloads, maybe dispatching to the CPU? The first one, though, looks like naive Python loops.
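If the first number really is a naive Python loop, it is easy to reproduce the flavor of it. This is only a sketch: the loop body, the function names and the number of additions are my assumptions, not the article's actual benchmark, and the measured rate will differ per machine. The 312e12 and 32e6 figures are the ones quoted above:

```python
import timeit

A100_PEAK_FLOPS = 312e12  # peak figure quoted from the article

def python_additions_per_second(additions=1_000_000):
    """Time a naive interpreted loop of float additions and return additions/second."""
    def loop():
        total = 0.0
        for _ in range(additions):
            total += 1.0
        return total
    seconds = min(timeit.repeat(loop, number=1, repeat=3))
    return additions / seconds

def a100_flops_per_python_op(python_ops_per_second):
    """How many FLOPs the A100 could do at peak in the time Python does one operation."""
    return A100_PEAK_FLOPS / python_ops_per_second

quoted_ratio = a100_flops_per_python_op(32e6)   # the article's 9.75 million
measured_rate = python_additions_per_second()   # machine dependent
measured_ratio = a100_flops_per_python_op(measured_rate)
```

The ratio only says something about interpreter overhead per scalar operation, which is the point the quoted passage is making.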
yes of course this is apples to oranges but that's kind of the point
it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU
the interesting thing is why that is so
CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
A100: 1935GBps of HBM2e
Most of those FLOPS are constrained by memory bandwidth.
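The arithmetic behind these numbers, as a back-of-the-envelope sketch. The function names are made up, and the A100 peak of 312 TFLOPS is the figure quoted from the article elsewhere in the thread. The elementwise FP32 add (two 4-byte reads and one 4-byte write per FLOP) is my own example of a memory-bound operation:

```python
def peak_flops(cores, flops_per_cycle_per_core, clock_hz):
    """Theoretical peak: every core retires its maximum FLOPs every cycle."""
    return cores * flops_per_cycle_per_core * clock_hz

def ridge_point(peak_flops_per_second, bytes_per_second):
    """FLOPs you must do per byte moved before compute, not memory, is the limit."""
    return peak_flops_per_second / bytes_per_second

def attainable_flops(peak_flops_per_second, bytes_per_second, flops_per_byte):
    """Simple roofline: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_flops_per_second, bytes_per_second * flops_per_byte)

epyc_peak = peak_flops(cores=192, flops_per_cycle_per_core=64, clock_hz=3.35e9)
a100_ridge = ridge_point(312e12, 1935e9)

# Elementwise FP32 add: 1 FLOP per 12 bytes of memory traffic.
a100_vector_add = attainable_flops(312e12, 1935e9, flops_per_byte=1 / 12)
```

Under these assumptions the A100 needs roughly 161 FLOPs per byte to be compute-bound, and a plain vector add tops out around 161 GFLOP/s, a tiny fraction of the headline number.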
but it is very impressive how far modern CPUs get as well (also in smart phones!)
I found the comparison interesting
on Intel Xeon 690P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:
how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?
no?
But this discussion is even more bizarre than comparing a screwdriver to a hammer, it’s like comparing a screwdriver to a nail.
Python is 9.75 million times faster than Python.
Tool calling, searches, cache movement if used, and even debug steps all stall the GPU waiting for the CPU.
There was a test of turning one of the under-1B Qwen3+ models into a single kernel that ran as one GPU pass without stalling on the CPU, and it saw quite a bit of perf lift over vLLM, I believe, showing this is still an issue.
It's been a month, so I don't remember more details than this.
The rest will be from "python float" (i.e. not from numpy) to C, which already gives you 2 to 3 orders of magnitude difference, and then another 2 to 3 from plain C to optimized SIMD.
See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
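To get a feel for the "python float" end of that range, here is a sketch of a naive pure-Python GEMM with a FLOP/s estimate. The function names and the matrix size are assumptions for the example, it is not taken from the linked repo, and the number you get depends entirely on your machine:

```python
import time

def matmul_naive(a, b):
    """Triple-loop matrix multiply on lists of Python floats."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for k in range(inner):
                total += a[i][k] * b[k][j]
            result[i][j] = total
    return result

def measure_flops(size=64):
    """Estimate FLOP/s: a size x size GEMM does 2 * size**3 floating point operations."""
    a = [[float(i + j) for j in range(size)] for i in range(size)]
    b = [[float(i - j) for j in range(size)] for i in range(size)]
    start = time.perf_counter()
    matmul_naive(a, b)
    elapsed = time.perf_counter() - start
    return 2 * size ** 3 / elapsed

python_flops = measure_flops()
```

Comparing that figure with the same loop nest compiled from C, and then with a tuned SIMD kernel, is a direct way to see the two jumps described above.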
That ONNX model run using onnxruntime with the CUDA EP is a different model than the one run with the TRT EP.
And even within the same runtime, depending on the target hardware and the memory available during tuning, the model behaves differently. It is a humongous mess.
[1] https://colab.research.google.com/drive/13a4Y-ko6QLMPAhBz64c...
Of course the model was dumber than GPT2 but still it was a great learning experience.