I don't really like one of their premises and conclusions:
> that does not learn hierarchical representations
There's an implicit bias here that (a) traditional networks do learn hierarchical representations, (b) that's bad, and (c) this training method does not learn those. However, (a) is situational, and it's easy to construct datasets where a standard gradient-descent neural net will learn a different way, even with a reverse hierarchy. (b) is unproven and also doesn't make a lot of intuitive sense to me. (c), even in this paper where they make that claim, has no evidence and also doesn't seem likely to be true.
I'm still not quite sure how to think of this. Maybe as being like unrolling a diffusion model, the equivalent of BPTT for RNNs?
It's also a bit interesting as an experimental result, since the core idea doesn't require backprop. Since the per-layer optimizer is an implementation detail, you could theoretically swap in other layer types or solvers.
> For CIFAR-100 with one-hot embeddings, NoProp-FM fails to learn effectively, resulting in very slow accuracy improvement
In general any actual analysis is made impossible because of the lack of signal in the results. Fig 5 tells me nothing when the span is 99.58 to 99.46 percent accuracy.
Check the pseudocode of their algorithms:
"Update using gradient based optimizations"
Maybe you have a way of seeing it differently so that this looks like a gradient? "Gradient" keys my brain into a desired outcome expressed as an expectation function.
Gradient descent is only one way of searching for a minimum, so in that sense it is not necessary; for example, one can sometimes analytically solve for the extrema of the loss. As an alternative, one could do Monte Carlo search instead of gradient descent. For a convex loss that would be less efficient, of course.
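To make that contrast concrete, here's a minimal sketch (toy quadratic loss, arbitrary step sizes and iteration counts; none of this is from the paper) comparing gradient descent with naive Monte Carlo search on the same convex objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy convex loss: f(w) = ||w - w_star||^2, minimized at w_star.
w_star = np.array([2.0, -1.0])
def loss(w):
    return np.sum((w - w_star) ** 2)

# Gradient descent: follow the analytic gradient 2 * (w - w_star).
w = np.zeros(2)
for _ in range(100):
    w -= 0.1 * 2 * (w - w_star)
print("gradient descent:  ", w, loss(w))

# Monte Carlo search: propose random perturbations and keep a proposal
# only if it lowers the loss. No gradient needed, but on a convex
# problem most evaluations are wasted.
w = np.zeros(2)
best = loss(w)
for _ in range(5000):
    cand = w + rng.normal(scale=0.1, size=2)
    if loss(cand) < best:
        w, best = cand, loss(cand)
print("monte carlo search:", w, best)
```

Gradient descent nails the minimum in ~100 steps here, while the random search needs thousands of loss evaluations to get close, which is the efficiency gap being described.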
The one that is not used, because it's inherently unstable?
Learning using locally accessible information is an interesting approach, but it needs to be more complex than "fire together, wire together". And then you might have propagation of information that allows gradients to be approximated locally.
Is there anyone in particular whose work focuses on this that you know of?
It’s Hebbian and solves all stability problems.
I can't recall exactly what the Hebbian update is, but something tells me it minimises the "reconstruction loss", and effectively learns the PCA matrix.
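If it helps, I think the rule being half-remembered here is Oja's rule: plain Hebbian growth plus a decay term, whose fixed point (under the standard analysis) is the leading principal component, i.e. the direction minimizing linear reconstruction loss. A toy sketch, with synthetic 2D data and an arbitrary learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean toy data with one dominant direction of variance.
X = rng.normal(size=(10000, 2)) * np.array([3.0, 0.5])

# Oja's rule: w += eta * y * (x - y * w), with y = w . x.
# The -y^2 * w decay term tames the unbounded growth of plain
# Hebbian learning; the fixed point is the top eigenvector of
# the data covariance (up to sign).
w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)

# Compare against the top principal component from an eigendecomposition.
evals, evecs = np.linalg.eigh(np.cov(X.T))
print("Oja weight (normalized):", w / np.linalg.norm(w))
print("top principal component:", evecs[:, np.argmax(evals)])
```

The decay term is also why this addresses the stability complaint upthread: without it, pure "fire together, wire together" weights just blow up.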
There is no prediction or desired output, at least not explicitly. I was playing with those things in my work to try and understand how our brains cause the emergence of intelligence, rather than to solve some classification or related problem. What I managed to replicate was the learning of XOR by some nodes, and further that multidimensional XORs, up to the number of inputs, could be learned.
Perhaps you can say that PCA-ish behaviour is the implicit objective/result, but I still reject that there is any conceptual notion of what a node "should" output, even if iteratively applying the learning rule leads us there.
GP is essentially isomorphic to beam search, where the population is the beam. It is a fancy search algorithm. It is not "training" anything.
>"We believe this work takes a first step TOWARDS introducing a new family of GRADIENT-FREE learning methods"
I.e. for the time being, authors can't convince themselves not to take advantage of efficient hw for taking gradients
(*Checks that Oxford University is not under sanctions*)
It's certifiably insane that it works at all. And not even vaguely backprop, though if you really wanted to stretch the definition I guess you could say that the feedforward layers align to take advantage of a synthetic gradient in a way that approximates backprop.
If I had to guess it's just local gradients, not an end-to-end gradient.
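Mechanically, "local gradients" would look something like this: each block gets its own head, loss, and optimizer, and detaching activations between blocks means no gradient ever flows end to end. A generic layer-wise sketch with placeholder data and a plain classification loss (not the paper's actual denoising objective):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Three blocks, each with its own local classifier head and optimizer.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)])
heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])
opts = [torch.optim.Adam(list(b.parameters()) + list(h.parameters()), lr=1e-3)
        for b, h in zip(blocks, heads)]

x = torch.randn(64, 32)            # dummy batch
y = torch.randint(0, 10, (64,))    # dummy labels

h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h.detach())          # detach(): the graph is cut here, so
    loss = F.cross_entropy(head(h), y)  # this loss is local to the block
    opt.zero_grad()
    loss.backward()                # backprop runs within one block only
    opt.step()
```

Each backward() call is an ordinary gradient, but there is no path through which the first block ever sees an error signal from the last.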
I don't think it is anywhere near feasible to emulate anything resembling this in a computational neural network with fixed input and output neurons.
In particular in mammals, we have no idea how actively the mother's body helps shape the child. Of course, there's no direct neuron to neuron contact, but that doesn't mean that the mother's body can't contribute to aspects of even the fetal brain development in other ways.
Consider the comparison with LLM training. A state-of-the-art LLM that is, say, only an order of magnitude better than an average 4-year-old human child in language use is trained on ~all of the human text ever produced, consuming many megawatt-hours of energy in the process. And it's helped with plenty of pre-processing of this text information, and receives virtually no noise.
In contrast, a human child that is not deaf acquires language from a noisy environment with plenty of auditory stimuli, from which they first have to even work out that they are picking up language. To be able to communicate, and thus receive significant feedback on the learning, they also have to learn how to control a very complex set of organs (tongue, lips, larynx, chest muscles), all with many degrees of freedom and precise timing needed to produce any sound whatsoever.
And yet virtually all human children learn all of this in a matter of 12-24 months, and then spend another 2-3 years learning more language without struggling as much with the basics of word recognition and pronunciation. And they do all this while consuming a total of some 5 kWh, which includes many bodily processes that are not directly related to language acquisition, and a lot of direct physical activity too.
So, either we are missing something extremely fundamental, or the initial state of the brain is very, very far from random and much of this was actually trained over tens or hundreds of thousands of years of evolution of the hominids.
There sure is some "inductive bias" in the anatomy of the brain to develop things like language but it could be closer to how transformer architectures differ from pure MLPs.
The argument was for decades that no generic system can learn language from input alone. That turned out flat wrong.
W_ij = f(W_ij, x_i, x_j)
The weight of the connection between nodes i and j is updated as a function of its current value and the activations (or inputs) of nodes i and j.
There are many variants of back propagation too.
Regardless, yes it would be used within a network model such as a Hopfield network.
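As a concrete instance of that template: the classic Hebbian storage rule for a Hopfield network accumulates the outer product of each stored pattern, so each W_ij is updated purely from x_i and x_j. A minimal sketch with binary +/-1 patterns and arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hebbian storage: W_ij accumulates x_i * x_j for each stored pattern,
# a local rule of the form W_ij = f(W_ij, x_i, x_j).
n = 64
patterns = rng.choice([-1, 1], size=(3, n))
W = np.zeros((n, n))
for p in patterns:
    W += np.outer(p, p)
np.fill_diagonal(W, 0)  # no self-connections

# Recall: start from a corrupted pattern and iterate threshold updates.
probe = patterns[0].copy()
probe[:10] *= -1  # flip 10 of the 64 bits
for _ in range(10):
    probe = np.where(W @ probe >= 0, 1, -1)
print("recovered stored pattern:", np.array_equal(probe, patterns[0]))
```

Well below capacity (a handful of patterns in 64 units), the corrupted probe should settle back onto the stored pattern within a few updates.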
If you go for toy experiments you can brute-force the optimization. Is it efficient? Hell no.