Posted by swatson741 11/2/2025

Backpropagation is a leaky abstraction (2016) (karpathy.medium.com)
353 points | 160 comments
alyxya 11/2/2025|
More generally, it's often worth learning and understanding things one step deeper. A more fundamental understanding explains more of the "why": why some things are the way they are, or why we do them a certain way. There's probably a cutoff point for how much you actually need to know, though. You could potentially take things a step further by writing the backward pass without using matrix multiplication, or by spending some time understanding what the numerical value of a gradient means.
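
As a concrete version of that "one step deeper" idea, here's a minimal sketch (mine, not from the comment; the function name is made up) of the backward pass of a single linear layer y = Wx written with explicit loops instead of a matrix multiply, so each gradient entry is visible as a plain sum of products:

  import numpy as np

  def linear_backward_loops(W, x, grad_y):
      """Given upstream dL/dy, return dL/dW and dL/dx entry by entry."""
      out_dim, in_dim = W.shape
      grad_W = np.zeros_like(W)
      grad_x = np.zeros_like(x)
      for i in range(out_dim):
          for j in range(in_dim):
              # dL/dW[i,j] = dL/dy[i] * dy[i]/dW[i,j] = grad_y[i] * x[j]
              grad_W[i, j] = grad_y[i] * x[j]
              # dL/dx[j] accumulates a contribution from every output i
              grad_x[j] += grad_y[i] * W[i, j]
      return grad_W, grad_x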
WithinReason 11/2/2025||
Karpathy suggests the following error function:

  def clipped_error(x):
    return tf.select(tf.abs(x) < 1.0,    # tf.select is the pre-TF-1.0 name for tf.where
                     0.5 * tf.square(x),
                     tf.abs(x) - 0.5)    # condition, true, false
Following the same principles that he outlines in this post, the "- 0.5" part is unnecessary: the gradient of a constant is 0, so subtracting 0.5 doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x² + 1).
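
A quick check of both claims (my sketch, using PyTorch autograd instead of the old tf.select API; the function names are mine): the version without the offset backpropagates exactly the same gradient, and √(x² + 1) gives a smooth loss whose slope is likewise bounded by 1.

  import torch

  def clipped_error(x):
      return torch.where(torch.abs(x) < 1.0, 0.5 * x**2, torch.abs(x) - 0.5)

  def clipped_error_no_offset(x):
      return torch.where(torch.abs(x) < 1.0, 0.5 * x**2, torch.abs(x))

  def smooth_error(x):
      return torch.sqrt(x**2 + 1.0)

  x = torch.tensor([0.3, 2.0, -5.0], requires_grad=True)
  for f in (clipped_error, clipped_error_no_offset, smooth_error):
      (g,) = torch.autograd.grad(f(x).sum(), x)
      print(f.__name__, g)
  # the first two print identical gradients; smooth_error's gradient is
  # x / sqrt(x^2 + 1), which stays strictly inside (-1, 1)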
macleginn 11/2/2025||
If we don't subtract 0.5 in the second branch, there will be a discontinuity at |x| = 1, so the derivative will not be well-defined there. The value of the loss will also jump at that point, which will make it harder to inspect the errors, for one thing.
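
For the loss value (not the gradient), a tiny numeric check of that point, assuming the Huber-style loss above:

  quadratic_branch_at_1 = 0.5 * 1.0**2   # 0.5
  linear_with_offset    = 1.0 - 0.5      # 0.5 -> the branches meet at |x| = 1
  linear_without_offset = 1.0            # 1.0 -> the loss value jumps by 0.5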
WithinReason 11/2/2025||
No, that's not how backprop works. There will be no discontinuity in a backpropagated gradient.
macleginn 11/2/2025||
I did not say there will be a discontinuity in the gradient; I said that the modified loss function will not have a mathematically well-defined derivative because of the discontinuity in the function.
WithinReason 11/4/2025||
Which is completely irrelevant to the point I was making
kingstnap 11/2/2025|||
You do that to make things smoother when plotted. You could in theory add some crazy stairstep that adds a hundred to the middle part. It would make your loss curves spike and even increase as training approaches convergence, but those spikes would just be visual artifacts from doing weird discontinuous nonsense with your loss.
slashdave 11/2/2025||
square roots are expensive
WithinReason 11/2/2025||
They're negligible, especially back when the post was written and ops weren't fused. The extra memory you need to store the intermediate tensors in the original version is more expensive.
Huxley1 11/3/2025||
When I first started learning deep learning, I only had a vague idea of how backprop worked. It wasn't until I forced myself to implement it from scratch that I realized it was not magic after all. The process was painful, but it gave me much more confidence when debugging models or trying to figure out where the loss was getting stuck. I would really recommend everyone in deep learning try writing it out by hand at least once.
vrighter 11/5/2025|
Implementing one also finally gave me a way to intuitively grasp (and remember) the chain rule from calculus.
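
For anyone who wants the same intuition without a framework, here's a minimal sketch (mine) of the chain rule the way backprop applies it: run the composition forward, then multiply the local derivatives together going backward.

  import math

  x = 0.7
  # forward pass: y = x^2, z = sin(y)
  y = x * x
  z = math.sin(y)

  # backward pass: start with dz/dz = 1, then chain the local derivatives
  dz_dz = 1.0
  dz_dy = dz_dz * math.cos(y)   # d(sin y)/dy = cos(y)
  dz_dx = dz_dy * 2.0 * x       # d(x^2)/dx  = 2x

  # matches the closed form d/dx sin(x^2) = 2x * cos(x^2)
  assert abs(dz_dx - 2 * x * math.cos(x * x)) < 1e-12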
emil-lp 11/2/2025||
... (2016)

9 years ago, 365 points, 101 comments

https://news.ycombinator.com/item?id=13215590

away74etcie 11/2/2025||
Karpathy's work on large datasets for deep neural flow is conceiving of the "backward pass" as the preparation for initializing the mechanics for weight ranges, either as derivatives in -10/+10 statistic deviations.
mirawelner 11/2/2025||
I feel like my learning curve for AI is:

1) Learn backprop, etc, basic math

2) Learn more advanced things: CNNs, LLMs, NMF, PCA, etc.

3) Publish a paper or poster

4) Forget basics

5) Relearn that backprop is a thing

repeat.

Some day I need to get my education together.

brcmthrowaway 11/2/2025||
Do LLMs still use backprop?
samsartor 11/2/2025||
Yes. Pretraining and fine-tuning use standard Adam optimizers (usually with weight decay). Reinforcement learning has historically been the odd one out, but these days almost all RL algorithms also use backprop and gradient descent.
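
For anyone wondering what that looks like in code, here's a minimal sketch of a single training step with Adam plus weight decay (my example; the tiny model and random batch are placeholders, not how a real LLM is set up):

  import torch
  import torch.nn as nn

  model = nn.Linear(16, 4)   # stand-in for a real model
  opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
  loss_fn = nn.CrossEntropyLoss()

  x = torch.randn(8, 16)                 # fake batch
  target = torch.randint(0, 4, (8,))

  opt.zero_grad()
  loss = loss_fn(model(x), target)
  loss.backward()    # backprop fills .grad for every parameter
  opt.step()         # AdamW update, with decoupled weight decay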
ForceBru 11/2/2025|||
Are LLMs still trained by (variants of) stochastic GRADIENT descent? AFAIK what used to be called "backprop" is nowadays known as "automatic differentiation". It's widely used in PyTorch, JAX etc
imtringued 11/2/2025||
Gradient descent doesn't matter here. Second-order and higher methods still use lower-order derivatives.

Backpropagation is reverse-mode automatic differentiation. They are the same thing.

And for those who don't know what backpropagation is: it's just an efficient method for calculating the gradient with respect to all parameters.
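
A small illustration of the "efficient" part (my sketch): one reverse pass gives the gradient with respect to every parameter at once, whereas finite differences would need a separate forward pass per parameter.

  import torch

  params = torch.randn(1_000, requires_grad=True)
  x = torch.randn(1_000)

  loss = torch.tanh(params @ x)   # scalar output
  loss.backward()                 # a single reverse pass
  print(params.grad.shape)        # torch.Size([1000]): all 1000 gradients at once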

raindear 11/2/2025||
Are dead ReLUs still a problem today? Why not?
intelkishan 11/13/2025|
There are alternative activation functions, which are also widely used.
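
A quick sketch (mine) of why the alternatives help: for a negative pre-activation, ReLU's gradient is exactly zero, so the unit can "die", while LeakyReLU and GELU keep a small nonzero gradient flowing.

  import torch

  x = torch.tensor([-2.0], requires_grad=True)
  for act in (torch.nn.ReLU(), torch.nn.LeakyReLU(0.01), torch.nn.GELU()):
      (g,) = torch.autograd.grad(act(x).sum(), x)
      print(type(act).__name__, g.item())
  # ReLU -> 0.0, LeakyReLU -> 0.01, GELU -> small but nonzero (negative here)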
joaquincabezas 11/2/2025|
Off-topic, but does anybody know what's going on with Eureka Labs? It's been a while since the announcement.
meken 11/2/2025||
He gives an update in the Dwarkesh interview:

https://youtu.be/lXUZvyajciY?si=vbqKDOOY7l-491Ka&t=7028

Not too many details on timeline - just that he's working on it.

leobg 11/2/2025||
He does have a history of abandoning projects. OpenAI. Tesla. OpenAI again...

Then again, it might have been the corporate stuff that burned him out rather than the engineering.

leobg 11/3/2025||
Also, he published nanochat 3 weeks ago [0], and he says in the readme:

> nanochat will become the capstone project of the course LLM101n being developed by Eureka Labs.

[0] https://github.com/karpathy/nanochat
