Plus his GitHub. The recently released nanochat https://github.com/karpathy/nanochat is fantastic. Having minimal, understandable and complete examples like that is invaluable for anyone who really wants to understand this stuff.
Later I understood that they don’t need to understand LLMs, and they don’t care how they work. Rather they need to believe and buy into them.
They’re more interested in science fiction discussions — how would we organize a society where all work is done by intelligent machines? — than in what kinds of tasks LLMs are good at today, and why.
And the issue you mention in the last paragraph is very relevant: the scenario is plausible, so it is something we should definitely be discussing.
The question here is whether the details are important for the major issues, or whether they can be abstracted away with a vague understanding. To what extent abstracting away is okay depends greatly on the individual case. Abstractions can work over a large area or for a long time, but then suddenly collapse and fail.
A calculator that has always delivered sufficiently accurate results can produce nonsense once you approach the limits of its numerical representation or combine numbers of very different magnitudes. You can see this, for example, by rearranging operations that should be commutative and associative; because of rounding, the calculator suddenly delivers completely different results.
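A minimal sketch of that failure mode with double-precision floats (my own example, not from the comment above):

    # In IEEE-754 floating point, addition is not associative, so rearranging
    # terms of very different magnitude can change the result completely.
    a, b, c = 1e16, -1e16, 1.0

    print((a + b) + c)  # 1.0 -- the two large terms cancel first
    print(a + (b + c))  # 0.0 -- the 1.0 is absorbed into -1e16 and lost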
The 2008 financial crisis was based, among other things, on models that treated certain market risks as independent of one another. Risk could then be spread by splitting and recombining portfolios. However, this only worked as long as the interdependence of the different portfolios was actually quite small. An entire industry, with the exception of a few astute individuals, had abstracted away this interdependence, acted on this basis, and ultimately failed.
As individuals, however, we are completely dependent on these abstractions. Our entire lives are permeated by things whose functioning we simply have to rely on without truly understanding them. Ultimately, it is the nature of modern, specialized societies that this process continues and becomes even more differentiated.
But somewhere there should be people who work at the limits of detailed abstractions and are concerned with researching and evaluating the real complexity hidden behind them, and thus correcting the abstraction if necessary, sending this new knowledge upstream.
The role of an expert is to operate with less abstraction and more detail in his or her field of expertise than a non-expert -- and the more so, the better an expert he or she is.
Imagine if you were using single-layer perceptrons without understanding separability and going "just a few more tweaks and it will approximate XOR!"
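To make that concrete, a quick sketch (my own, not from the comment above): the classic perceptron learning rule never converges on XOR, because the classes are not linearly separable.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])  # XOR labels

    w, b = np.zeros(2), 0.0
    for epoch in range(1000):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            if pred != yi:              # standard perceptron update rule
                w += (yi - pred) * xi
                b += (yi - pred)
                errors += 1
        if errors == 0:
            break

    # Still misclassifying after 1000 epochs; no amount of "tweaking" helps,
    # because a single linear layer cannot separate the XOR classes.
    print(epoch + 1, errors)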
Knowledge of backprop, no matter how precise, and any convoluted 'theories' built on it, will not make you utilize LLMs any better. If anything, you'll be worse off.
We don't even have a complete explanation of how we go from backprop to the emerging abilities we use and love, so who cares (for that purpose) how backprop works? It's not like we're actually using it to explain anything.
As I say in another comment, I often give talks to laypeople about LLMs and the mental model I present is something like supercharged Markov chain + massive training data + continuous vocabulary space + instruction tuning/RLHF. I think that provides the right abstraction level to reason about what LLMs can do and what their limitations are. It's irrelevant how the supercharged Markov chain works, in fact it's plausible that in the future one could replace backprop with some other learning algorithm and LLMs could still work in essentially the same way.
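To make the "supercharged Markov chain" part of that mental model concrete, here is a toy word-level Markov chain text generator (my own illustration, not from those talks). An LLM replaces the lookup table with a learned, context-wide distribution over a continuous embedding space, but the sampling loop is the same idea.

    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat and the cat ate the rat".split()

    # next-word table: word -> list of observed successors
    table = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        table[prev].append(nxt)

    word, out = "the", ["the"]
    for _ in range(8):
        successors = table[word]
        word = random.choice(successors) if successors else random.choice(corpus)
        out.append(word)

    print(" ".join(out))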
In the line of your first paragraph: probably many teens who had a lot of time on their hands when Bing Chat was released, and enough critical spirit not to get misled by the VS, have better intuition about what an LLM can do than many ML experts.
And in fact this is true of any tool: you don’t have to know exactly how to build one, but any craftsman has a good understanding of how the tool works internally. LLMs are not a screw or a pen; they are more akin to an engine, and you have to know their subtleties if you are building a car. Even screws have to be understood structurally in advanced usage. Not understanding the tool is perhaps acceptable only for hobbyists.
There are things that you just can’t expect from current LLMs that people routinely expect from them.
They start out projects with those expectations. And that’s fine. But they don’t always learn from the outcomes of those projects.
It sounds to me very much like end users, not people who are training LLMs.
Feynman was right that "If you can't build it, you don't understand it", but of course not everyone needs or wants to fully understand how an LLM works. However, regarding an LLM as a magic black box seems a bit extreme if you are a technologist and hope to understand where the technology is heading.
I guess we are in an era of vibe-coded, disposable "fast tech" (cf. fast fashion), so maybe it only matters what it can do today, if playing with it or applying it toward this end is all you care about, but this seems a rather blinkered view.
That or they are flat out lying. My money's on the latter.
(Thinking about it, would that necessarily be a bad thing...)
If you live in a world of horse carriages, you can be thinking about what the world of cars is going to be like, even if you don't fully understand what fuel mix is the most efficient or what material one should use for a piston in a four-stroke.
And what gives you that confidence? A few AI nerds already claimed that in the 80s.
We're currently exploring what LLMs can do. There is no indication that any further fundamental breakthrough is around the corner. Everybody is currently squeezing the same stone.
For example, things like "AI" image and video generation are amazing, as are things like AlphaGo and AlphaFold, but none of these have anything to do with LLMs, and the only technology they share with LLMs is machine learning and neural nets. If you lump these together with LLMs, calling them all "AI", then you'll come to the wrong conclusion that all of these non-LLM advances indicate that "AI" is rapidly advancing and therefore LLMs (also being "AI") will do too ...
Even if you leave aside things like AlphaGo, and just focus on LLMs and other future technology that may take all our jobs, using terms like "AI" and "AGI" is still confusing and misleading. It's easy to fall into the mindset that "AGI" is just better "AI", and that since LLMs are "AI", AGI is just better LLMs, and is around the corner because "AI" is advancing rapidly ...
In reality LLMs are, like AlphaFold, something highly specific - they are auto-regressive next-word predictor language models (just as a statement of fact, and how they are trained, not a put-down), based on the Transformer architecture.
The technology that could replace humans for most jobs in the future isn't going to be a better language model - a better auto-regressive next-word predictor - but will need to be something much more brain-like. The architecture itself doesn't have to be brain-like, but in order to deliver brain-like functionality it will probably need to include another half-dozen "Transformer-level" architectural/algorithmic breakthroughs, including things like continual learning, which will likely turn the whole current LLM training and deployment paradigm on its head.
Again, just focusing on LLMs and LLM-based agents: if you regard them as a black-box technology, it's easy to be misled into thinking that capability is broadly advancing and will lift all boats, when in reality progress is much narrower. Headlines about LLM achievements in math and competitive programming, touted as evidence of reasoning, do NOT imply that LLM reasoning is broadly advancing; you need to get under the hood and understand the RL training goals to realize why that is not necessarily the case. The correctness of most business and real-world reasoning is not as easy to check as marking a math problem correct or not, yet that checkability is what RL training depends on.
I could go on .. LLM-based agents are also blurring the lines of what "AI" can do, and again if treated as a black box will also misinform as to what is actually progressing and what is not. Thousands of bright people are indeed working on improving LLM-adjacent low-hanging fruit like this, but it'd be illogical to conclude that this is somehow helping to create next-generation brain-like architectures that will take away our jobs.
That's because you, as you admit in the next sentence, have almost no understanding of how they work.
Your reasoning is on the same level as someone in the 1950s thinking ubiquitous flying cars are just a few years away. Or fusion power, for that matter.
In your defense, that seems to be about the average level of engagement with this technology, even on this website.
Since nobody has yet figured out how to build an artificial brain, having that as a proof it's possible doesn't much help. It will be decades or more before we figure out how the brain works and are able to copy that, although no doubt people will attempt to build animal intelligence before fully knowing how nature did it.
Saying that AGI "just needs some different code" than an LLM is like saying that building an interstellar spaceship "just needs some different parts than a wheelbarrow". Both are true, and both are useless statements offering zero insight into the timeline involved.
Neither did the people expecting fusion power and flying cars to come quickly.
We have just as much evidence that fusion power is possible as we do that human level intelligence is possible. Same with small vehicle flight for that matter.
None of that makes any of these things feasible.
That's like saying: well, given how much faster bicycles make us, so much closer to horse speed, I wonder if we can tweak this a little and move faster than any animal can run. But cars needed more technological breakthroughs, even though some aspects of them used insights gained from tweaking bicycles.
The Math Olympiad results are impressive, but at the end of the day it is still the same next-word prediction, in this case fine-tuned by additional LLM training on solutions to math problems, teaching the LLM which next-word predictions (i.e. outputs) add up to solution steps that lead to correct answers on the training problems. Due to the logical nature of math, the reasoning/solution steps that worked on the training problems will often work on the new problems it is then tested on (Math Olympiad). But most reasoning outside of logical domains like math and programming isn't so clear cut, so this approach of training on reasoning examples isn't necessarily going to help LLMs get better at reasoning on more useful real-world problems.
> Yesterday I was browsing for a Deep Q Learning implementation in TensorFlow (to see how others deal with computing the numpy equivalent of Q[:, a], where a is an integer vector — turns out this trivial operation is not supported in TF). Anyway, I searched “dqn tensorflow”, clicked the first link, and found the core code. Here is an excerpt:
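For context, if I read the excerpt right, the operation in question is picking the Q-value of the taken action for each batch row, i.e. Q[i, a[i]]. A hedged sketch of the numpy version, plus the one-hot workaround that (as far as I remember) TF1-era DQN code tended to use:

    import numpy as np

    Q = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # (batch, num_actions) Q-values
    a = np.array([2, 0])              # action taken for each batch element

    # numpy: fancy indexing does it directly
    q_taken = Q[np.arange(len(a)), a]            # -> [3.0, 4.0]

    # common TF1-era workaround: one-hot mask + sum over the action axis
    one_hot = np.eye(Q.shape[1])[a]
    q_taken_via_mask = (Q * one_hot).sum(axis=1)

    print(q_taken, q_taken_via_mask)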
Notice how it's "browse" and "search", not just "I asked ChatGPT". Notice how it made him notice a bug.
Secondly, the article is from 2016; ChatGPT didn't exist back then.
He's just test driving LLMs, nothing more.
Nobody's asking this core question in podcasts. "How much and how exactly are you using LLMs in your daily flow?"
I'm guessing it's like actors not wanting to watch their own movies.
He's doing a capability check in this video (for the general audience, which is good of course), not attacking a hard problem in the ML domain.
Despite this tweet: https://x.com/karpathy/status/1964020416139448359 , I've never seen him cite a case where an LLM helped him out in ML work.
If he did not believe in the capability of these models, he would be doing something else with his time.
Truth be told, a whole lot of things are more important than copyright law.
A pretty terrible way, but... certainly one way.
It would be crazy to think the protections of IP laws and the ability to claim original work as your own and have a degree of control over it as an author fostered creativity in science and arts.
The human race has produced an extremely rich body of work long before US copyright law and the DMCA existed. Instead of creating new financial models which embrace freedoms while still ensuring incentives to create new art, we have contorted outdated financial models, various modes of rent-seeking and gatekeeping, to remain viable via artificial and arbitrary restriction of freedom.
Furthermore, claiming “X is not natural” is never a valid argument. Humans are part of nature, whatever we do is as well by extension. The line between natural and unnatural inevitably ends up being the line between what you like and what you don’t like.
The need to eat is as much a natural law as higher human needs—unless you believe we should abandon all progress and revert to pre-civilization times.
IP laws ensure that you have a say in the future of the product of your work, can possibly monetise it, etc., which means a creative person 1) can fulfil their need to eat (individual benefit), and 2) has an incentive to create in the first place (societal benefit).
In the last few hundred years, intellectual property, not physical property, has increasingly been the product of our work and creative activities. The belief that the physical artifacts we create deserve protection against theft while the intellectual property we create does not is one that needs a lot of explaining.
So you're able to use them commercially as you see fit, but you can't use them freely in the most absolute sense; then again, this is a thread about restricting the freedoms of organizations in the name of a 25-year-old law that has been a disgrace from the start.
> contributing to degradation of the Web for humans
I'll be the first to say that Meta did this with Facebook and Instagram, along with other companies such as Reddit.
However, we don't yet know what the web is going to look like post-AI, and it's silly to blame any one company for what clearly is an inevitable evolution in technology. The post-AI web was always coming, what's important is how we plan to steward these technologies.
> The post-AI web was always coming
“The third world war was always coming.”
These things are not a force of nature, they are products of human effort, which can be ill-intentioned. Referring to them as “always coming” is 1) objectively false and 2) defeatist.
> I think congrats again to OpenAI for cooking with GPT-5 Pro. This is the third time I've struggled on something complex/gnarly for an hour on and off with CC, then 5 Pro goes off for 10 minutes and comes back with code that works out of the box. I had CC read the 5 Pro version and it wrote up 2 paragraphs admiring it (very wholesome). If you're not giving it your hardest problems you're probably missing out.
Eureka Labs runs LLM101n, which is teaching software aimed at pedagogic symbiosis.
Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backwards pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)
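A quick numeric illustration of that point (my own sketch): the derivative of the sigmoid is at most 0.25, so stacking sigmoids shrinks the end-to-end gradient exponentially, no matter which algorithm computes it.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x, grad = 0.5, 1.0
    for depth in range(1, 11):
        s = sigmoid(x)
        grad *= s * (1.0 - s)   # chain-rule factor contributed by this layer
        x = s
        print(f"depth {depth:2d}: end-to-end gradient ~ {grad:.2e}")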
[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with using tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization (plus modeling) is the actually hard part, not the way gradients are calculated.
I respect Karpathy’s contributions to the field, but often I find his writing and speaking to be more than imprecise — it is sloppy in the sense that it overreaches and butchers key distinctions. This may sound harsh, but at his level, one is held to a higher standard.
I think that's more because he's trying to write to an audience who isn't hardcore deep into ML already, so he simplifies a lot, sometimes to the detriment of accuracy.
At this point I see him more as a "ML educator" than "ML practitioner" or "ML researcher", and as far as I know, he's moving in that direction on purpose, and I have no qualms with it overall, he seems good at educating.
But I think shifting your mindset about what the purpose of his writing is may help explain why it sometimes feels imprecise.
And his _examples_ are about gradients, but nowhere does he distinguish between backpropagation, which is (part of) an algorithm for automatic differentiation, and the gradients themselves. None of the issues are due to BP returning incorrect gradients (it totally could, for example, lose too much precision, but it doesn't).
> In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.
Then follows this with multiple clear examples of exactly what he is talking about.
The target audience was people building and training neural networks (such as his CS231n students), so I think it's safe to assume they knew what backprop and gradients are, especially since he made them code gradients by hand, which is what they were complaining about!
Sure, instead of "the problem with backpropagation is that it's a leaky abstraction" he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading for an introductory-level article for an undergraduate audience, and also unnecessary because he already said that in the introduction.
> ... he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading ...
My concern isn't about the heading he chooses. My concern is deeper: he commits a category error [3]. The following things are true, but Karpathy's article gets them wrong: (1) leaky abstractions only occur with interfaces; (2) backpropagation is an algorithm; (3) algorithms can never be leaky abstractions.
Karpathy could have communicated his point clearly and correctly by saying e.g.: "treating backprop learning as a magical optimization oracle is risky". There is zero need for introducing the concept of leaky abstractions at all.
---
Ok, with the above out of the way, we can get to some interesting technical questions that are indeed about leaky abstractions which can inform the community about pros/cons of the design space: To what degree is the interface provided by [Library] a leaky abstraction? (where [Library] might be PyTorch or TensorFlow) Getting into these details is interesting. (See [4] for example.) There is room for more writing on this.
[1]: We can all gain because accepting criticism is hard. Once we see that even Karpathy messes up, we probably shouldn't be defensive when we mess up.
[2]: No one is being robbed here. Criticism is a gift; offering constructive criticism is a sign of respect. It also respects the community by saying, in effect, "I want to make it easier for people to get the useful, clear ideas into their heads rather than muddled ones."
[3]: https://en.wikipedia.org/wiki/Category_mistake
[4]: https://elanapearl.github.io/blog/2025/the-bug-that-taught-m...
Couldn’t agree more about the technical points (category error etc.), and I appreciated the unexpected switch to the value of receiving constructive criticism as a gift rather than an attack.
Myself, I’m definitely conditioned to receive it as an attack. I’m trying to break this habit. This morning I gave some extensive feedback to some friends who have a startup. The whole time I was writing it, I was stressing out that they’d feel attacked, because that’s how I might take similar criticism.
How was it actually received? A mix I think. Some people explicitly received it as a gift, and others I’m not so sure.
The point is that you can't abstract away the details of back propagation (which involve computing gradients) under some circumstances, for example when we are using gradient descent. Maybe in other circumstances (a global optimization algorithm) it wouldn't be an issue, but the leaky abstraction idea isn't that the abstraction is always an issue.
(Right now, back propagation is virtually the only way to calculate gradients in deep learning)
This is like complaining about long division not behaving nicely when dividing by 0. The algorithm isn't the problem, and blaming the wrong part does not help understanding.
It distracts from what is actually helping which is using different functions with nicer behaving gradients, e.g., the Huber loss instead of quadratic.
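A small sketch of that point (my own): the quadratic loss has a gradient proportional to the error, so huge errors produce huge gradients, while the Huber loss caps the gradient at +/- delta. Same optimizer, nicer-behaved gradients.

    import numpy as np

    def grad_quadratic(err):
        return err                          # d/d_err of 0.5 * err**2

    def grad_huber(err, delta=1.0):
        return np.clip(err, -delta, delta)  # linear tails => bounded gradient

    for err in [0.5, 5.0, 50.0]:
        print(f"error {err:5.1f}: quadratic grad {grad_quadratic(err):6.1f}, "
              f"huber grad {grad_huber(err):4.1f}")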
Fully agree. It's not the "fault" of backprop. It does what you tell it to do: find the direction in which your loss is reduced the most. If the first layers get no signal because the gradient vanishes, then the reason is your network layout: very small modifications in the initial layers would lead to very large modifications in the final layers (essentially an unstable computation), so gradient descent simply cannot move that fast.
Instead, it's a vital signal for debugging your network. Inspecting things like gradient magnitudes per layer shows whether you have vanishing or exploding gradients. And that has led to great inventions for dealing with them, such as residual networks and a whole class of normalization methods (such as batch normalization).
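For concreteness, a hedged PyTorch sketch of that debugging signal (toy model and data, my own example): after backward(), print the gradient norm per layer; vanishing or exploding gradients show up as the norms collapsing or blowing up with depth.

    import torch
    import torch.nn as nn

    # deliberately deep stack of sigmoids, so the early-layer norms collapse
    model = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.Sigmoid())
                            for _ in range(10)])
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    loss.backward()

    for name, p in model.named_parameters():
        if "weight" in name and p.grad is not None:
            print(f"{name:12s} grad norm: {p.grad.norm().item():.3e}")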
It is definitely a useful thing for people who are learning this topic to understand from day 1.
I told everyone this was the best single exercise of the whole year for me. It aligns with the kind of activity that I benefit from immensely but won't do by myself, so this push was just perfect.
If you are teaching, please consider this kind of assignment.
P.S. Just checked now and it's still in the syllabus :)
I made a UI that showed how the weights and biases changed throughout the training iterations.
"Computers are good at maths" is normally a pretty obvious statement... but many things we take for granted from analytical mathematics, is quite difficult to actually implement in a computer. So there is a mountain of clever algorithms hiding behind some of the seemingly most obvious library operations.
One of the best courses I've ever had.
I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.
But with more advanced optimizers the gradient is not really used directly. It gets per weight normalization, fudged with momentum, clipped, etc.
So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
> how important is computing the exact gradient using calculus
Normally the gradient is computed with a small "minibatch" of examples, meaning that on average over many steps the true gradient is followed, but each step individually never moves exactly along the true gradient. This noisy walk is actually quite beneficial for the final performance of the network https://arxiv.org/abs/2006.15081 , https://arxiv.org/abs/1609.04836 , so much so that people started wondering what the best way is to "corrupt" this approximate gradient even more to improve performance https://arxiv.org/abs/2202.02831 (and many other works relating to SGD noise).
> vs just knowing the general direction to step
I can't find relevant papers now, but I seem to recall that the Hessian eigenvalues of the loss function decay rather quickly, which means that taking a step in most directions will not change the loss very much. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have been trying to visualize the loss and trajectory taken during optimization https://arxiv.org/pdf/1712.09913 , https://losslandscape.com/
Yes, absolutely -- a lot of ideas inspired by this have been explored in the field of optimization, and also in machine learning. The very idea of "stochastic" gradient descent using mini-batches is basically a cheap (hardware-compatible) approximation to the gradient at each step.
For a relatively extreme example of how we might circumvent the computational effort of backprop, see Direct Feedback Alignment: https://towardsdatascience.com/feedback-alignment-methods-7e...
Ben Recht has an interesting survey of how various learning algorithms used in reinforcement learning relate with techniques in optimization (and how they each play with the gradient in different ways): https://people.eecs.berkeley.edu/~brecht/l2c-icml2018/ (there's nothing special about RL... as far as optimization is concerned, the concepts work the same even when all the data is given up front rather than generated on-the-fly based on interactions with the environment)
Non-stochastic gradient descent has to optimize over the full dataset. This doesn't matter for non-machine learning applications, because often there is no such thing as a dataset in the first place and the objective has a small fixed size. The gradient here is exact.
With stochastic gradient descent you're turning gradient descent into an online algorithm, where you process a finite subset of the dataset at a time. Obviously the gradient is no longer exact, you still have to calculate it though.
Seems like "exactness" is not that useful of a property for optimization. Also, I can't stress it enough, but calculating first order derivatives is so cheap there is no need to bother. It's roughly 2x the cost of evaluating the function in the first place.
It's second order derivatives that you want to approximate using first order derivatives. That's how BFGS and Gauss-Newton work.
First of all, gradient computation with back-prop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge-cases that are not relevant here) so it's not about the way of computing the gradient.
What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function. And in this design, what matters for learning is not so much the value of the loss function but its slopes and curvature (peaks and valleys).
The problematic case is a flat valley surrounded by sheer cliffs (picture the Grand Canyon).
Advanced optimizers in deep learning like "Adam" are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient the optimizer has an estimate of the scale sensitivity of each parameter independently. So the cheap thing it can reasonably do is modulate the gradient with this scale.
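A minimal sketch of that per-parameter scaling (Adam-style, simplified and without bias correction, my own example), just to make "diagonal approximation of the curvature" concrete:

    import numpy as np

    def adam_like_step(w, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad         # running mean of gradients (momentum)
        v = b2 * v + (1 - b2) * grad**2      # running mean of squared gradients
        w = w - lr * m / (np.sqrt(v) + eps)  # each parameter gets its own scale
        return w, m, v

    w = np.array([1.0, 1.0])
    m, v = np.zeros_like(w), np.zeros_like(w)
    grad = np.array([0.001, 10.0])           # wildly different gradient scales
    for _ in range(3):
        w, m, v = adam_like_step(w, grad, m, v)
    print(w)  # both parameters move at a comparable rate despite the 10000x gap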
Because the length of the gradient vector is often problematic, what classical optimizers would usually do is something called "line search": determining the optimal step size along the chosen direction. But the cost of doing that is usually between 10 and 100 evaluations of the cost function, which is often not worth the effort in the noisy stochastic context, compared to just taking a smaller step multiple times.
Higher-order optimizers require the loss function to be twice differentiable, so non-linearities like ReLU, which are cheap to calculate, can't be used.
Lower-order global optimizers don't even require the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).
Why would these things be "fudging"? Vanishing gradients (see the initial batch norm paper) are a real thing, and ensuring that the relative magnitudes are in some sense "smooth" between layers allows for an easier optimization problem.
> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Very. In high dimensional space, small steps can move you extremely far from a proper solution. See adversarial examples.
How do you propose calculating the "general direction" ?
And, an example "advanced optimizer" - AdamW - absolutely uses gradients. It just does more, but not less.
But remember, that is for taking the derivative at a single data point. What's hard is the average derivative over the entire set of points, and that's where sampling and approximations (SGD etc.) come in.
If you write down an explicit expression for the partial derivatives, it will contain sums. The sign (which is what defines the "general direction") is affected by what's in those sums, and you can't avoid calculating them.
That said, I know DeepSeek uses fp32 for its gradient updates even though it uses fp8 for inference. And a recent paper shows that RL+LLM training is shakier at bf16 than fp16. Both of these imply that numerical precision in gradients still matters.
For example, Alex Graves's (great! with attention) 2013 paper "Generating Sequences with Recurrent Neural Networks" has this line:
One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
with this footnote:
In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.
Perhaps, but maybe because there was more experimentation with different neural net architectures and nodes/layers back then?
Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.
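For concreteness, a hedged sketch of what "clipping is supported by the frameworks" looks like in PyTorch today (toy model and data, my own example):

    import torch
    import torch.nn as nn

    model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(4, 10, 8)
    out, _ = model(x)
    loss = out.pow(2).mean()

    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # the one-liner
    opt.step()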
The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.
For simpler nets, like ResNet, it may just be that modern initialization and training recipes avoid most gradient issues, even though they are otherwise potentially still there.
> “Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”
worries me because it is structured around the same reasoning as "why do we have to demonstrate we understand addition if in the real world we have calculators"
Is this _more_ useful than the other things, which I could be learning but won't because I spent the time and effort to learn this instead?
We have a finite capacity for learning, if for no other reason then at least because we have a finite amount of time in this life, and infinite topics to learn (and there are plenty of other constraints besides time). The reason given for learning this topic, is that it has hidden failure modes which you will not be on the lookout for if you didn't know how it worked "under the hood".
Is this a good enough reason to spend time learning this rather than, say, how to model the physics of the system you're training the neural network to deal with? Tough question; maybe, maybe not. If you have time to learn both, do that, but if not, then you will have to choose which is most important. And in our education system, we do things like teach calculus but not intermediate statistics, and it would have been better to do the opposite for something like 90% of the people taking calculus.
That said, I've implemented backpropagation multiple times, it's a good way to evaluate a new language (just complex enough to reveal problems, not so complex that it takes forever).
For example: The recently posted-about pass-by-value overhead. And it seems that some AMD processors have a 4k aliasing issue. Knowing a tiny bit about how CPU caches actually work internally immediately made me go "oh.... I can see how that issue could arise due to how associative caches are typically implemented".
You don't necessarily use that stuff directly, but you will use them a lot more than you might think, even subconsciously.
You can ignore every detail of a transistor, but knowing that it can act like a valve is enough to understand that it can consistently model a logic gate on a stream of electrons, and that this can be used to compute.
If it weren't for that, the laws that govern your whole computer would be magic to your brain.
I fear the consequences of minds that value their laziness more than understanding, say five levels of "why" deep, how things in reality work the way they do. The alternative is that they are hallucinating all the time on narratives that may be psychotic or real, and they are not equipped to discern which is which.
> As a developer, you just pick the best one and find good hparams for it
It would be more correct to say: "As a developer, (not researcher), whose main goal is to get a good model working — just pick a proven architecture, hyperparameters, and training loop for it."
Because just picking the best optimizer isn't enough. Some of the issues in the article come from the model design, e.g. sigmoids, relu, RNNs. And some of the issues need to be addressed in the training loop, e.g. gradient clipping isn't enabled by default in most DL frameworks.
And it should be noted that the article is addressing people on the academic / research side, who would benefit from a deeper understanding.
Just because the framework you are using provides things like ReLU doesn't mean you can assume someone else has done all the work and you can just use these and expect them to work all the time. When things go wrong training a neural net you need to know where to look, and what to look for - things like exploding and vanishing gradients.
You've also missed the point of the article, if you're building novel model architectures you can't magic away the leakiness. You need to understand the back prop behaviours of the building blocks you use to achieve a good training run. Ignore these and what could be a good model architecture with some tweaks will either entirely fail to train or produce disappointing results.
Perhaps you're working at the level of bolting pre-built models together or training existing architectures on new datasets, but this course operates below that level to teach you how things actually work.
Diving through the abstraction reveals some of those.
You might even be able to do an ugly version of this (akin to dropout) where you swap activation functions (with adjusted scaling factors so they mostly yield similar output shapes to ReLU for most inputs) randomly during training. The point is we mostly know what a ReLU-like activation function is supposed to do, so why should we care about the edge cases of the analytical limits of any specific one.
The advantage would be that you'd probably get useful gradients out of one of them (for training), and could swap to the computationally cheapest one during inference.
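A rough sketch of that idea (entirely hypothetical, not an established technique): randomly swap between ReLU-like activations during training, dropout-style, and pin the cheapest one at inference time.

    import random
    import torch
    import torch.nn as nn

    class RandomActivation(nn.Module):
        def __init__(self):
            super().__init__()
            # roughly similar output shapes for most inputs
            self.choices = nn.ModuleList([nn.ReLU(), nn.GELU(), nn.SiLU()])

        def forward(self, x):
            if self.training:
                return random.choice(self.choices)(x)  # random pick per forward pass
            return self.choices[0](x)                  # cheapest one at inference

    layer = nn.Sequential(nn.Linear(16, 16), RandomActivation())
    out = layer(torch.randn(4, 16))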
I agree that understanding them is useful, but they are not abstractions, much less leaky abstractions.