
Posted by elonlit 4 days ago

A Theory of Deep Learning(elonlit.com)
240 points | 59 comments
Macuyiko 2 days ago|
Reminded me strongly of the paper "Deep Learning is Not So Mysterious or Different" from a year ago: https://arxiv.org/abs/2503.02113
jeffrallen 2 days ago||
This looks like excellent work; it's reminding me of things I learned from Welch Labs videos. Given the amount of time I budget for keeping up with this stuff (regrettably too low), I'll wait until Welch Labs presents something on this.
menno-sh 2 days ago||
Unrelated to the contents, but WOW your blog styling is gorgeous. Incredible
airza 3 days ago||
A very fascinating read.

As a fellow Tufte CSS enjoyer: why is user-select turned off on the sidenotes? I would very much like to be able to copy-paste them.

piskov 3 days ago|
Layout is fine but font is atrocious.

Uppercase letters have a different stroke width than lowercase ones; it's like they are *B*old *L*ike this.

Not only that: tracking and kerning are basically non-existent.

Please don’t use that open-source font

You need real Bembo, not that piece of shit

airza 2 days ago||
Are you on Windows? I patched the original Tufte fonts to fix these crazy kerning issues, but I have indeed noticed them.
xiaodai 2 days ago||
What a beautifully written article. It's extremely rare that I favourite an article, but this is one.
gravity13 2 days ago||
Very extremely. Quite a lovely presentation. I'm definitely having a Patrick Bateman-esque appreciation for that delicate cream background.
renticulous 2 days ago||
The Hidden Physics of LLMs: Retrieval as Thermodynamics

https://www.youtube.com/watch?v=ppCZfjLdSY8

I found this video illustrative as well: simple, and anyone can understand it.

hashmap 2 days ago||
This landed precisely on like three weird bugs I've been hitting and solving in different stupid ways, for dealing with things like SGD collapsing too many good answers into one bad answer, and it gave me a real direction for fixing the missing link in my own ML stuff. What timing. I've tried analytic solutions too, and they're useful for things like mapping prompts into memory geometry, but from there I've still ended up having to use SGD. Because I think what happens is: SGD teaches the neural net both the geometry and how to navigate it. If you just teleport to the answer, it doesn't learn how to walk.
auggierose 2 days ago||
Looks like a typical machine learning paper to me. It cannot be understood unless you already kind of understand it. That is OK for communication with peers, but eventually I expect a "theory of" to be readable by anyone with a math degree.
vessenes 2 days ago|
So, this is either the paper of the year, or ... definitely not the paper of the year.

https://arxiv.org/pdf/2605.01172 is the current version. The money graphs are on page 8 and onward, where they show (some weirdly thick) line charts with losses reached in roughly 1/5 the number of steps that Adam takes, just as the blog post mentions.

They also claim that holding back test data is not needed, again with more graphs.

I'm not an ML scientist, and I did not attempt to seriously parse the math. It reads to me as sitting precisely in that liminal space some math papers occupy, where there's enough new terminology that actually parsing through it all would take real, concerted effort, possibly with mild brain damage as a risk.

Their 3D graphs of "kernel eigenstructure" also do double duty for me as totally impenetrable and possibly part of an April Fools' ML paper that's hilarious to insiders. Or maybe they show something really amazing; they definitely seem to converge into a shape... What does that shape mean??? Why??? What is an eigenstructure? Is it just 3D eigenvectors of some matrices? Is it natural to have a 3D shape representing these large matrices? If not, how and why were they projected down? And why are they different colors in the paper?? You get a feel for my level of understanding.

I think it would frankly just be easier to validate this claim than parse the whole paper. If only I could understand

  > Each one-step kernel increment ηK^{SS}_{M_t} integrates into W^S_M, so a sequence of one-step rate-maximizers is the greedy policy whose integral is the signal-channel content of the trajectory through G, exactly as plain SGD is the greedy step whose integral is empirical-risk descent through D. The diagonal cutoff μ_k² > σ_k²/(b−1) is the optimal first-order preconditioner for population risk on any diagonal base, and a streaming variance EMA ŝ_t of squared gradient deviations realizes it as a one-line change to AdamW: one extra parameter-sized state vector and a per-parameter gate that multiplies the standard moment update.
Well enough to implement the one-line update to AdamW in Python. I have not asked Codex or Claude to assist yet.
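
In the meantime, here is my rough guess at the gate in plain NumPy, reading the quoted passage literally. Every name below is mine, not the paper's, and the exact wiring into AdamW's moments is almost certainly wrong in the details:

  import numpy as np

  def gated_moment_update(grad, state, b=32, beta=0.99, beta2=0.999):
      # Streaming EMAs of the mean gradient and of squared deviations from it
      # (the "one extra parameter-sized state vector" from the quote).
      state["mu"] = beta * state["mu"] + (1 - beta) * grad
      state["s"] = beta * state["s"] + (1 - beta) * (grad - state["mu"]) ** 2

      # The quoted diagonal cutoff: keep a coordinate only when
      # mu_k^2 > sigma_k^2 / (b - 1), with b the batch size.
      gate = (state["mu"] ** 2 > state["s"] / (b - 1)).astype(grad.dtype)

      # "A per-parameter gate that multiplies the standard moment update";
      # here I apply it to AdamW's second moment, which may be the wrong one.
      state["v"] = beta2 * state["v"] + (1 - beta2) * gate * grad ** 2
      return state

State would start as state = {k: np.zeros_like(grad) for k in ("mu", "s", "v")}. Whether the gate belongs on the first moment, the second, or both is exactly the kind of detail I can't extract from that paragraph.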

Also of note to me: they talk about grokking, which I found SUUUPER fascinating when it was first reported and have never heard about since. So I was really glad to read about it here, and to read that there has been a little academic work on the phenomenon.

Finally, of the three models they report results on, two are extremely tiny and the last is a DPO round on Qwen 0.5B -- if the code for that is published, I imagine it would be easy to adapt and evaluate in other regimes.

yorwba 2 days ago|
You don't need to understand that part of the derivation to implement it. You just need Algorithm 1 on page 33 of the paper. Or look at the author's implementation: https://github.com/elonlit/PopRiskMinimization/blob/main/pop...
vessenes 2 days ago||
Thanks for the link - I did not see a GitHub.

So, your thoughts on the paper?

yorwba 2 days ago||
I think it's a solid theoretical contribution, but it might nonetheless fail to have practical relevance if some of their assumptions and approximations turn out to be too unrealistic. One way this could happen, for example, would be if typical training batches get gradients with a high-enough signal-to-noise ratio that their optimizer tweak ends up not tweaking much. Their somewhat unusual selection of experiments makes me suspect that this might be the case.
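
As a toy check of what I mean (my numbers, not theirs), take the diagonal cutoff at face value with batch size 32: even a modest per-coordinate signal-to-noise ratio keeps every coordinate, which would reduce the method to vanilla AdamW.

  import numpy as np

  b = 32
  mu = np.full(1000, 0.5)             # mean gradient per coordinate (SNR 0.5)
  sigma2 = np.ones(1000)              # gradient variance per coordinate
  gate = mu ** 2 > sigma2 / (b - 1)   # their diagonal cutoff
  print(gate.mean())                  # 1.0: the gate passes everything

Only coordinates with mu/sigma below about 1/sqrt(b−1) ≈ 0.18 would get zeroed, so everything hinges on how noisy real batch gradients actually are.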

I read the paper earlier when it showed up on https://news.ycombinator.com/from?site=arxiv.org, and the writing style of the blog post turned me off, so I didn't bother to check how much it overhypes the results compared to the paper. But certainly a lot of people seem to have gotten the idea that this must be big if true, whereas I think it's better classified as neat, but not revolutionary.

vessenes 2 days ago||
Thanks. I found the DPO test where they intentionally twiddled the dataset to create 'noise' very interesting. That would, I think, back your concern. But on the other hand, a method that lets noisier data in, or does better with datasets that are low-signal across bands, would be nice to have.