Posted by jamie-simon 7 hours ago
The skepticism I'm seeing in the comments really highlights how little of this work is trickling down to the public, which is very sad to see. While it can offer few mathematical mechanisms to infer optimal network design yet (mostly because just trying stuff empirically is often faster than going through the theory, so it is more common to retroactively infer things), the question "why do neural networks work better than other models?" is getting pretty close to a solid answer. Problem is, that was never the question people seem to have ever really been interested in, so the field now has to figure out what questions we ask next.
After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.
The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.
The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beats better models that don’t scale.
Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs hundreds of millions of neurons to do what a mouse does.
Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity. Probably this has to do with the need for a lot of parameters to deal with the intrinsic complexity of a complex learning environment. (Mice and men both exist in the same physical reality.)
On the other hand, we know many simple techniques with low parameter counts that work well (or are even proved to be optimal) on simple or stylized problems. “Learning” and “intelligence”, in the way we use the words, tends to imply a complex environment, and complexity by its nature requires a large number of parameters to model.
The brain likely has more in common with Reservoir Computing (sans the actual learning algorithm) than Deep Learning.
Deep Learning relies on end to end loss optimization, something which is much more powerful than anything the brain can be doing. But the end-to-end limitation is restricting, credit assignment is a big problem.
Consider how crazy the generative diffusion models are, we generate the output in its entirety with a fixed number of steps - the complexity of the output is irrelevant. If only we could train a model to just use Photoshop directly, but we can't.
Interestingly, there are some attempts at a middle ground where a variable number of continuous variables describe an image: <https://visual-gen.github.io/semanticist/>
For a bit more context: Before 2012 most approaches were based on hand crafted features + SVMs that achieved state of the art performance on academic competitions such as Pascal VOC and neural nets were not competitive on the surface. Around 2010 Fei Fei Li of Stanford University collected a comparatively large dataset and launched the ImageNet competition. AlexNet cut the error rate by half in 2012 leading to major labs to switch to deeper neural nets. The success seems to be a combination of large enough dataset + GPUs to make training time reasonable. The architecture is a scaled version of ConvNets of Yan Lecun tying to the bitter lesson that scaling is more important than complexity.
Real intelligence deals with information over a ludicrous number of size scales. Simple models effectively blur over these scales and fail to pull them apart. However, extra compute is not enough to do this effectively, as nonparametric models have demonstrated.
The key is injecting a sensible inductive bias into the model. Nonparametric models require this to be done explicitly, but this is almost impossible unless you're God. A better way is to express the bias as a "post-hoc query" in terms of the trained model and its interaction with the data. The only way to train such a model is iteratively, as it needs to update its bias retroactively. This can only be accomplished by a nonlinear (in parameters) parametric model that is dense in function space and possesses parameter counts proportional to the data size. Every model we know of that does this is called "a neural network".
I feel like you are downplaying the importance of architecture. I never read the bitter lesson, but I have always heard more as a comment on embedding knowledge into models instead of making them to just scale with data. We know algorithmic improvement is very important to scale NNs (see https://www.semanticscholar.org/paper/Measuring-the-Algorith...). You can't scale an architecture that has catastrophic forgetting embedded in it. It is not really a matter of tradeoffs, some are really worse in all aspects. What I agree is just that architectures that scale better with data and compute do better. And sure, you can say that smaller architectures are better for smaller problems, but then the framing with the bitter lesson makes less sense.
Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?
I'd thought it was some issue with training where older math didn't play nice with having too many layers.
My understanding of the development is that persistent layer-wise pretraining with RBM or autoencoder created an initiation state where the optimization could cope even for more layers, and then when it was proven that it could work, analysis of why led to some changes such as new initiation heuristics, rectified linear activation, eventually normalizations ... so that the pretraining was usually not needed any more.
One finding was that the supervised training with the old arrangement often does work on its own, if you let it run much longer than people reasonably could afford to wait around for just on speculation contrary to observations in CPU computations in the 80s--00s. It has to work its way to a reasonably optimizable state using a chain of poorly scaled gradients first though.
Even PID loops have a training phase separate from recitation phase.
I also think you might be discounting exactly how much compute is used to train these monsters. A single 1ghz processor would take about 100,000,000 years to train something in this class. Even with on the order of 25k GPUs training GPT3 size models takes a couple months. The anemic RAM on GPUs a decade ago (I think we had k80 GPUs with 12GB vs 100’s of GBs on H100/H200 today) and it was actually completely impossible to train a large transformer model prior to the early 2020s.
I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.
I agree with your larger point but dismissed is rather too strong. They were considered fiddly to train, prone to local minima, long training time, no clear guidelines about what the number of hidden layers and number of nodes ought to be. But for homework (toy) exercises they were still ok.
In comparison, kernel methods gave a better experience over all for large but not super large data sets. Most models had easily obtainable global minimum. Fewer moving parts and very good performance.
It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple -- (i) they cannot take advantage of more data after a point and start twiddling the 10th place of decimal of some parameters and (ii) are expensive to train for very large data sets. So bit of a double whammy. Well, there was a third, no hardware acceleration that can compare with GPUs.
Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user friendly way to increase their modeling capacity. We had a few ways of doing just that but they weren't great. We need a breakthrough to scale them to GPT sized data sets.
In a way DNNs are "design your own kernels using data" whereas kernels came in any color you liked provided it was black (yes there were many types, but it was still a fairly limited catalogue. The killer was that there was no good way of composing them to increase modeling capacity that yielded efficiently trainable kernel machines)
But they don't give the same results at those smaller scales. People imagined, but no one could have put into practice because the hardware wasn't there yet. Simplified, LLMs is basically Transformers with the additional idea of "and a shitton of data to learn from", and for making training feasible with that amount of data, you do need some capable hardware.
In olden days, the correct way to solve a linear system of equations was to use theory of minors. With advent of computers, you suddenly had a huge theory of gaussian elimination, or Krylov spaces and what not.
The incentive to design something new - which became the Transformer - came from language model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPPT), and wanted to come up with a new seq-2-seq/language model that could take advantage of the parallel hardware that now existed and (since AlexNet) was now being used to good effect for other types of model.
As I understand it, the inspiration for the concept of what would become the Transformer came from Attention paper co-author Jakob Uzkoreit who realized that language, while superficially appearing sequential (hence a good match for RNNs) was in fact really parallel + hierarchical as can be seen by linguist's sentence parse trees where different branches of the tree reflect parallel analysis of different parts of the sentence, which are then combined at higher levels of the hierarchical parse tree. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing, with the parallel processing being the whole point since this could be accelerated by GPUs. While the concept was Uzkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture - the Transformer.
Without the fast parallel hardware already pre-existing, there would not have been any incentive to design a new type of language model to take advantage of it!
The other point is that while the Transformer is a very powerful general purpose and scalable type of model, it only really comes into it's own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power to scale it up to massive size existed it, then it would likely not have appeared so promising/interesting.
The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".
The time was right for the Transformer to appear when it did, designed to take advantage of recent GPU advances, building on top of this new attention architecture, and now with the compute power and dataset size available that it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
Or perhaps a world where it happened later. I think a big part of what enabled the AI boom was the concentration of money and compute around the crypto boom.
It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.
What result are you referring to?
I'm sure it's an oversimplification to blame the entire 1970s AI winter on Minsky, considering they couldn't have gotten much further than the proof-of-concept stage due to lack of hardware. But his voice was a loud, widely-respected one in academia, and it did have a negative effect on the field.
It might lead to understanding how to measure when a deep learning system is making stuff up or hallucinating. That would have a huge payoff. Until we get that, deep learning systems are limited to tasks where the consequences of outputting bullshit are low.
That's a great problem to solve! (Maybe biased, because this is my primary research direction). One popular approach is OOD detection, but this always seemed ill-posed to me. My colleagues and I have been approaching this from a more fundamental direction using measures of model misspecification, but this is admittedly niche because it is very computationally expensive. Could still be a while before a breakthrough comes from any direction.
My fear is that this is as hopeless right now as explaining why humans or other animals can learn certain things from their huge amount of input data. We'll gain better empirical understanding, but it won't ever be fundamental computer science again, because the giga-datasets are the fundamental complexity not the architecture.
That would be amazing, but personally I’m skeptical.
There is so much to digest here but it's fascinating seeing it all put together!