"Cumulative Dissipation Gramian" Ws = Observability Gramian (from Control Theory). For example the spectral cutoff is exactly the Hankel singular value truncation from model reduction.
"Signal Channel" / "Reservoir" is Controllable/Observable vs. Uncontrollable/Unobservable Subspaces. Using Adamjan-Arov-Krein (AAK) theory gives the optimal nonlinear reduced model answering the optimal compression question.
"Drift–Diffusion Separation" is Freidlin-Wentzell Large Deviation Theory. They can predict "grokking" time from the FW action.
"Population-Risk Gate" is Quantum Weak Value / Postselection (Aharonov)
So for the follow-up problems
Control theory gives the truncation error bounds for model compression. Large deviation theory gives the grokking time predictions. Quantum measurement theory gives the imaginary preconditioners. Information geometry gives the optimal continuous relaxation of the gate.
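To make the first mapping concrete, here is a minimal sketch (mine, not from the paper) of the classical Hankel singular value truncation for a stable linear system (A, B, C); the function name and the tolerance are illustrative, and the paper's Ws of course lives in a nonlinear training setting rather than this textbook LTI case.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    def hankel_truncation_rank(A, B, C, tol=1e-6):
        """How many states survive a Hankel singular value cutoff at `tol`."""
        # Controllability Gramian: A Wc + Wc A^T + B B^T = 0
        Wc = solve_continuous_lyapunov(A, -B @ B.T)
        # Observability Gramian: A^T Wo + Wo A + C^T C = 0
        Wo = solve_continuous_lyapunov(A.T, -C.T @ C)
        # Hankel singular values: sqrt of eigenvalues of Wc @ Wo, sorted descending
        hsv = np.sort(np.sqrt(np.abs(np.linalg.eigvals(Wc @ Wo))))[::-1]
        return int((hsv > tol).sum()), hsv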
Some nice implications of new ways of doing things, which are good to see formalized here:
Old: Pick an architecture and hope it generalizes. New: Design the architecture to maximize observability Gramian rank. (Honestly, we pull a lot from control theory here.)
Old: Use a validation set to detect overfitting. New: Monitor the λ(Ws) spectrum during training; no validation needed.
Old: Prune post-hoc based on magnitude. New: Prune during training based on ker(Ws) membership (see the sketch after this list).
Old: Fixed learning rate. New: Spectral learning rate.
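A toy sketch of what the "monitor λ(Ws), prune ker(Ws)" items could look like in code. One assumption on my part: Ws is approximated as the running sum of per-step gradient outer products, since the thread doesn't spell out the estimator; this only makes sense for toy-sized models where the full matrix fits in memory. Function names are illustrative.

    import numpy as np

    def update_gramian(Ws, grad):
        # Accumulate one training step's contribution to the cumulative Gramian.
        g = grad.ravel()
        return Ws + np.outer(g, g)

    def spectrum_and_kernel(Ws, eps=1e-8):
        # Eigen-spectrum (to watch during training instead of a validation set)
        # and the numerical kernel: directions the updates never "observed",
        # i.e. candidates for pruning during training rather than post-hoc.
        evals, evecs = np.linalg.eigh(Ws)
        kernel_dirs = evecs[:, evals < eps * evals.max()]
        return evals[::-1], kernel_dirs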
What is a non-elephant animal (to paraphrase Stan Ulam)?
We're given a signal channel and a reservoir. Signal lives in the channel, noise lives in the reservoir, and the reservoir supposedly doesn’t show up at test time.
Okay, but then the question is: why would SGD put the right things in the right bucket?
If the answer is “because the reservoir is defined as the stuff that doesn’t transfer to test,” then this is close to circular.
The Borges/Lavoisier stuff is a tell. "We have unified the field" rhetoric should come after nontrivial predictions and results. Claiming to solve benign overfitting, double descent, grokking, implicit bias, estimating population risk from training alone, how to avoid a validation set, and, last but not least, skipping training by analytically jumping to the end is 6 theory papers, 3 NeurIPS winners, and a $10B startup. Let's get some results before we tell everyone we unified the field. :) I hope you're right.
Think of it as a best fit curve and exceptions to that curve. The noise is essentially this set of exceptions that move points away from where they would otherwise fall on the curve.
Gradient descent wants to be able to make the smallest change that moves the most data points towards the curve. To do this it learns an arrangement where it can change, say, one parameter and have a bunch of points move at once. What does this correspond to? The big common patterns shared by many data points.
Most of the capacity gets soaked up modelling these sorts of common patterns, and after they have been learned the model starts adding exceptions that allow individual points to deviate from the curve.
Because they’re exceptions, they must not impact neighbouring points, or at least only ones within a very short distance from them. Otherwise they’re now driving the error higher by impacting more points than they should. So you end up with very narrow ranges of features that are able to trigger different sorts of noise.
How narrow they are is shaped by the training data: they're exactly as narrow as needed not to raise the error, so, assuming the total population has the same distribution, they don't get hit. Much.
At least, that’s what I take away from it.
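Here is a small numerical illustration of that intuition (my own sketch, not from the essay): plain gradient descent on an overparameterized model with one narrow RBF per training point fits the shared sine curve within a few hundred steps, and only much later carves narrow bumps around a handful of planted "exception" points.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)
    y = np.sin(x)
    outliers = rng.choice(len(x), size=5, replace=False)
    y[outliers] += 2.0                                # the "exceptions"

    # One narrow RBF per training point: plenty of capacity to memorize.
    phi = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.05 ** 2))
    w = np.zeros(len(x))
    bulk = np.setdiff1d(np.arange(len(x)), outliers)

    for step in range(1, 30001):
        resid = phi @ w - y
        w -= 2.0 * phi.T @ resid / len(x)             # plain gradient descent
        if step in (100, 2000, 30000):
            print(step,
                  "bulk MSE", round(float(np.mean(resid[bulk] ** 2)), 4),
                  "outlier MSE", round(float(np.mean(resid[outliers] ** 2)), 4))

The bulk error collapses early while the outlier error lags behind: the common pattern soaks up the fast directions of the dynamics, and the exceptions are fit later through narrow, localized corrections.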
I suspect there is going to be a lot of handwaving to actually go from eNTK to that new update rule.
I also doubt it helps in the non-grokking regime, given the focus of the theory, and that regime is where all the practical applications I have ever heard of live.
Don't get me wrong, I did enjoy reading this essay. It's well written and reasonably argued without going into details.
Nah, the softer stuff seems like valuable outreach / good science communication for people who aren't up for the math. Including, probably, lots of software engineers who are sick of dumb debates in forums and are starting to dip into the real literature and listen to better authorities. More people should do this really, since it's the only way to see past the marketing and hype from fully entrenched AI boosters or detractors. Neither of those groups is big on critical thinking, and they dominate most of the conversation.
Time/effort coming from experts who want to make things accessible is a gift! The paper is linked elsewhere in the thread if you want no-frills.
Lastly, we derive an exact population-risk objective from a single training run with no validation data, for any architecture, loss, or optimizer, and prove that it measures precisely the noise in the signal channel. This objective reduces in practice to an SNR preconditioner on top of Adam, adding one state vector at no extra cost; it accelerates grokking by 5x, suppresses memorization in PINNs and implicit neural representations, and improves DPO fine-tuning under noisy preferences while staying 3x closer to the reference policy. [1]
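For concreteness only, here is a guess at what an "SNR preconditioner on top of Adam" could look like. This is not the paper's derivation (the quote says it adds one extra state vector, which this stand-in does not even need); it is just the generic idea of gating each coordinate's step by the signal-to-noise ratio of its gradient, built from Adam's existing moments.

    import numpy as np

    def adam_snr_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Standard Adam moment updates and bias correction.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        # Per-parameter SNR: near 1 when gradients are consistent across steps,
        # near 0 when they mostly cancel (noise-dominated coordinates).
        snr = np.abs(m_hat) / (np.sqrt(v_hat) + eps)
        gate = np.clip(snr, 0.0, 1.0)         # assumed gating rule, not the paper's
        w = w - lr * gate * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v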
Note that I said "predict" not "describe". It feels like we're still in the era of Kepler, not Newton.
[1] https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP28...
The brain probably primarily uses something like TD for task learning, which is also not expressible as a gradient of any objective function. And, though the paper mentions Hebbian learning, it's only for very particular network architectures (e.g. a single neuron, or symmetric connections) that you can treat its updates as the gradient of some energy function; these architectures aren't anything close to what we see in the brain.
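A minimal TD(0) value update, to make that point concrete (illustrative sketch, not from the paper): the bootstrapped target r + gamma * V[s_next] itself depends on V, so the update is a "semi-gradient" and is not the gradient of any fixed objective.

    import numpy as np

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        delta = r + gamma * V[s_next] - V[s]   # TD error against a moving target
        V[s] += alpha * delta                   # follows delta, not a true gradient
        return V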
Their claim isn't that the brain uses gradient descent, but that the direction of updates has (on average) positive inner product with the gradient. I expect this would also be true for (say) simulated annealing, yet we don't say that simulated annealing is gradient descent.
Also missing is a discussion of loss functions and how they relate to the update; as far as I know, there's still no great notion of how the brain picks a global loss function, and no mechanism for backprop. In this paper, looking at a specific learning task, you can define a loss function extrinsically, which lets us talk about the gradient, but how that relates to things happening in the brain is a big, big mystery.
1. Older ML models encoded, in their architecture and lack of expressivity, a bias toward simplicity, which aided interpolation.
2. Overparameterized models instead use regularization to nudge parameters toward simpler and more robust representations, while still memorizing the noise. In this manner, we still achieve generalization performance OOD. Moreover, the softer nudging and the fundamental architectural expressivity allow for "data-specific" generalizations and representations that may be impossible to represent in small models.
3. At the critical point between the two regimes, the model is expressive enough to memorize, but not expressive enough to simultaneously do that and encode general patterns.
I wonder how this understanding translates to these researchers' models of deep learning.
But at what computational cost?
Does anyone understand the formula they expressed above this sentence? Is this just the classic "skip updating parameters with high gradient/loss variance across multiple batches/samples"?
https://arxiv.org/abs/2411.16085 - set updates to 0 where there's disagreement in the sign of the parameter update - got accepted!
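A rough sketch of the idea as described in that comment (not necessarily the linked paper's exact formulation): zero out coordinates whose sign disagrees across micro-batch gradients.

    import numpy as np

    def sign_agreement_update(grads):
        # grads: (num_microbatches, num_params) per-micro-batch gradients.
        signs = np.sign(grads)
        agree = np.abs(signs.sum(axis=0)) == grads.shape[0]   # unanimous sign only
        return grads.mean(axis=0) * agree                     # masked mean update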
https://arxiv.org/pdf/2412.18052 - discard gradient updates from batches/minibatches that disagree, where disagreement is measured against a cosine distance threshold (they solved for 0.97 or something being optimal)
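And a sketch of that one as described (again not the paper's code; the running reference gradient and the direction of the threshold are my assumptions): drop a minibatch gradient whose cosine similarity to the reference falls below the quoted ~0.97.

    import numpy as np

    def filtered_gradient(g_batch, g_ref, threshold=0.97):
        # Cosine similarity between this minibatch's gradient and a reference.
        cos = g_batch @ g_ref / (np.linalg.norm(g_batch) * np.linalg.norm(g_ref) + 1e-12)
        return g_batch if cos >= threshold else np.zeros_like(g_batch)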