
Posted by swatson741 3 days ago

Backpropagation is a leaky abstraction (2016)(karpathy.medium.com)
351 points | 159 comments | page 2
Huxley1 2 days ago|
When I first started learning deep learning, I only had a vague idea of how backprop worked. It wasn't until I forced myself to implement it from scratch that I realized it was not magic after all. The process was painful, but it gave me much more confidence when debugging models or trying to figure out where the loss was getting stuck. I would really recommend everyone in deep learning try writing it out by hand at least once.
vrighter 4 hours ago|
Implementing one also finally gave me a way to intuitively grasp (and remember) the chain-rule from calculus.
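For anyone curious what "from scratch" looks like at the smallest possible scale, here is a minimal sketch (plain Python, purely illustrative, names are made up) of the chain rule applied by hand and checked against a finite-difference estimate:

  import math

  def sigmoid(z):
      return 1.0 / (1.0 + math.exp(-z))

  def forward(w, x):
      z = w * x            # linear step
      a = sigmoid(z)       # nonlinearity
      return z, a

  def backward(w, x):
      # Chain rule: da/dw = da/dz * dz/dw = sigmoid'(z) * x
      z, a = forward(w, x)
      da_dz = a * (1.0 - a)   # derivative of sigmoid at z
      dz_dw = x
      return da_dz * dz_dw

  w, x, eps = 0.7, 1.5, 1e-6
  analytic = backward(w, x)
  numeric = (forward(w + eps, x)[1] - forward(w - eps, x)[1]) / (2 * eps)
  print(analytic, numeric)   # the two should agree to several decimal places

Scaling this up to layers of matrices is essentially bookkeeping around the same chain-rule step.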
alyxya 3 days ago||
More generally, it's often worth learning and understanding things one step deeper. A more fundamental understanding explains more of the "why": why some things are the way they are, or why we do them a certain way. There's probably a cutoff point for how much you actually need to know, though. You could take things a step further by writing the backward pass without using matrix multiplication, or by spending some time understanding what the numerical value of a gradient means.
WithinReason 3 days ago||
Karpathy suggests the following clipped error function:

  def clipped_error(x):
      return tf.select(tf.abs(x) < 1.0,
                       0.5 * tf.square(x),
                       tf.abs(x) - 0.5)  # condition, true, false
Following the same principles he outlines in this post, the "- 0.5" part is unnecessary: the gradient of a constant is 0, so subtracting 0.5 doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x² + 1).
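For concreteness, a small sketch (plain NumPy, purely illustrative) of the clipped error next to the smooth √(x² + 1) alternative. Near zero both behave like ½x², and for large |x| both have gradients that saturate at ±1; the smooth version just carries a constant offset, which backprop never sees:

  import numpy as np

  def clipped_error(x):
      # Huber-style loss from the post: quadratic near zero, linear beyond |x| = 1.
      return np.where(np.abs(x) < 1.0, 0.5 * np.square(x), np.abs(x) - 0.5)

  def smooth_error(x):
      # sqrt(x^2 + 1): gradient is x / sqrt(x^2 + 1), bounded in (-1, 1),
      # so it clips the backpropagated signal the same way, without a branch.
      return np.sqrt(np.square(x) + 1.0)

  x = np.linspace(-3, 3, 7)
  print(clipped_error(x))
  print(smooth_error(x) - 1.0)   # subtract the constant offset to compare shapes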
macleginn 3 days ago||
If we don't subtract from the second branch, there will be a discontinuity at |x| = 1, so the derivative will not be well-defined there. The value of the loss will also jump at that point, which will make it harder to inspect the errors, for one thing.
WithinReason 3 days ago||
No, that's not how backprop works. There will be no discontinuity in a backpropagated gradient.
macleginn 3 days ago||
I did not say there will be a discontinuity in the gradient; I said that the modified loss function will not have a mathematically well-defined derivative because of the discontinuity in the function.
WithinReason 1 day ago||
Which is completely irrelevant to the point I was making
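One way to settle this empirically (a throwaway sketch, assuming PyTorch, purely illustrative) is to ask autograd directly: with or without the constant, the gradient that flows back is identical; only the loss value jumps at |x| = 1.

  import torch

  def with_offset(x):
      return torch.where(torch.abs(x) < 1.0, 0.5 * x**2, torch.abs(x) - 0.5)

  def without_offset(x):
      return torch.where(torch.abs(x) < 1.0, 0.5 * x**2, torch.abs(x))

  for val in (0.5, 2.0, -3.0):
      x = torch.tensor(val, requires_grad=True)
      g1, = torch.autograd.grad(with_offset(x), x)
      x = torch.tensor(val, requires_grad=True)
      g2, = torch.autograd.grad(without_offset(x), x)
      print(val, g1.item(), g2.item())   # gradients match; only the loss values differ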
kingstnap 3 days ago|||
You do that to make things smoother when plotted. You could in theory add some crazy stairstep that adds a hundred to the middle part. It would make your loss curves spike and increase towards convergence, but those spikes are just visual artifacts from doing weird discontinuous nonsense with your loss.
slashdave 3 days ago||
square roots are expensive
WithinReason 3 days ago||
They're negligible, especially since ops weren't fused when the post was written. The extra memory needed to store the intermediate tensors in the original version is more expensive.
away74etcie 3 days ago||
Karpathy's work on large datasets for deep neural flow is conceiving of the "backward pass" as the preparation for initializing the mechanics for weight ranges, either as derivatives in -10/+10 statistic deviations.
emil-lp 3 days ago||
... (2016)

9 years ago, 365 points, 101 comments

https://news.ycombinator.com/item?id=13215590

mirawelner 2 days ago||
I feel like my learning curve for AI is:

1) Learn backprop, etc, basic math

2) Learn more advanced things: CNNs, LLMs, NMF, PCA, etc.

3) Publish a paper or poster

4) Forget basics

5) Relearn that backprop is a thing

repeat.

Some day I need to get my education together.

brcmthrowaway 3 days ago||
Do LLMs still use backprop?
samsartor 3 days ago||
Yes. Pretraining and fine-tuning use standard Adam optimizers (usually with weight decay). Reinforcement learning has historically been the odd one out, but these days almost all RL algorithms also use backprop and gradient descent.
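For reference, a bare-bones sketch of what that looks like in practice (PyTorch, with a stand-in model and batch, purely illustrative): the backward pass computes gradients for every parameter, and AdamW applies the update with decoupled weight decay.

  import torch
  import torch.nn as nn

  # Stand-in model and data; a real LLM would be a transformer and a token batch.
  model = nn.Linear(128, 128)
  optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

  inputs = torch.randn(8, 128)
  targets = torch.randn(8, 128)

  optimizer.zero_grad()
  loss = nn.functional.mse_loss(model(inputs), targets)
  loss.backward()          # backprop: gradients for all parameters
  optimizer.step()         # Adam update with decoupled weight decay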
ForceBru 3 days ago|||
Are LLMs still trained by (variants of) stochastic GRADIENT descent? AFAIK what used to be called "backprop" is nowadays known as "automatic differentiation". It's widely used in PyTorch, JAX etc
imtringued 3 days ago||
Whether it's gradient descent doesn't matter here; second-order and higher methods still use the lower-order derivatives.

Backpropagation is reverse-mode automatic differentiation. They are the same thing.

And for those who don't know what backpropagation is: it's just an efficient method for calculating the gradient with respect to all parameters.
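A compressed sketch of that idea (micrograd-style, purely illustrative, not any particular library's API): record the graph during the forward pass, then one reverse sweep applies the chain rule and fills in the gradient of every input at once.

  class Value:
      def __init__(self, data, parents=()):
          self.data = data
          self.grad = 0.0
          self._parents = parents
          self._backward = lambda: None

      def __add__(self, other):
          out = Value(self.data + other.data, (self, other))
          def backward_fn():
              self.grad += out.grad           # d(a+b)/da = 1
              other.grad += out.grad          # d(a+b)/db = 1
          out._backward = backward_fn
          return out

      def __mul__(self, other):
          out = Value(self.data * other.data, (self, other))
          def backward_fn():
              self.grad += other.data * out.grad   # d(a*b)/da = b
              other.grad += self.data * out.grad   # d(a*b)/db = a
          out._backward = backward_fn
          return out

      def backward(self):
          # Topological order, then apply the chain rule from the output backwards.
          order, seen = [], set()
          def visit(v):
              if v not in seen:
                  seen.add(v)
                  for p in v._parents:
                      visit(p)
                  order.append(v)
          visit(self)
          self.grad = 1.0
          for v in reversed(order):
              v._backward()

  # d(w*x + b)/dw = x, d/dx = w, d/db = 1
  w, x, b = Value(2.0), Value(3.0), Value(1.0)
  y = w * x + b
  y.backward()
  print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0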

raindear 2 days ago||
Are dead ReLUs still a problem today? Why or why not?
joaquincabezas 3 days ago|
Off-topic: does anybody know what's going on with Eureka Labs? It's been a while since the announcement.
meken 3 days ago||
He gives an update in the Dwarkesh interview:

https://youtu.be/lXUZvyajciY?si=vbqKDOOY7l-491Ka&t=7028

Not too many details on timeline - just that he's working on it.

leobg 3 days ago||
He does have a history of abandoning projects. OpenAI. Tesla. OpenAI again...

Then again, it might have been the corporate stuff that burned him out rather than the engineering.

leobg 2 days ago||
Also, he published nanochat 3 weeks ago [0], and he says in the readme:

> nanochat will become the capstone project of the course LLM101n being developed by Eureka Labs.

[0] https://github.com/karpathy/nanochat
