Emergent abilities in large language models make a good case for further scaling. Is emergence the norm?
When large language models scale, AI researchers observe unpredictable and sudden jumps in performance on certain tasks. Such emergent abilities can be defined as capabilities that appear in models above a certain size but are absent in smaller models.
The cause of emergence has not yet been conclusively determined - if there is a single cause at all. Recent research by Google on emergent abilities sheds light on the phenomenon also known as a phase transition: A dramatic change in overall behavior that could not have been predicted when the system was studied at a smaller scale.
Are emergent abilities rare?
According to Google, the emergence phenomenon raises the question of whether further scaling will give large language models new abilities. Good candidates for such abilities could be tasks in the BIG-Bench benchmark for NLP that no large language model has yet mastered. The prospect of such abilities thus serves as a rationale for further scaling of AI models.
But it also raises the question of which undiscovered abilities remain in large language models or other AI systems. While individual cases might be dismissed as curiosities, an accumulation of phase transitions suggests a possible qualitative leap beyond a certain level of scaling.
Are such leaps the exception? Or are we just scratching the surface of the depths of large language models?
Google researcher counts emergent abilities in language models
A closer look at benchmarks provides some answers - because that's where emergent abilities are most likely to appear. Jason Wei, AI researcher at Google Brain and lead author of the Google paper, has taken the trouble to count all the emergent abilities discovered so far.
Wei comes up with 137 emergent abilities in the scaled language models GPT-3, Chinchilla, and PaLM. The largest sources of such empirical discoveries, he says, are the NLP benchmarks BIG-Bench and MMLU (Massive Multitask Language Understanding), with 67 and 51 cases, respectively.
In BIG-Bench, emergent abilities show up for GPT-3 (175 billion parameters) for "analytic entailment, codenames, phrase relatedness, question answer creation, self-evaluation tutoring, common morpheme, fact-checker or figure of speech detection" and many other tasks.
For other models, abilities show up in benchmarks covering questions on microeconomics, conceptual physics, and medicine, or the ability to formulate gender-appropriate sentences in German. Further examples can be found in the respective papers, such as the simple mathematical skills of GPT-3.
In addition to the individual emergent abilities that enable large language models to perform complex language tasks, scaling also enables the use of emergent prompting strategies. For example, chain-of-thought prompting works only when a certain number of parameters is reached.
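To make the prompting strategy concrete, here is a minimal, illustrative sketch in Python. The arithmetic exemplar is invented for demonstration, and the call to an actual language model API is omitted because it depends on the provider; only the structure of a standard prompt versus a chain-of-thought prompt is shown.

```python
# Illustrative sketch of chain-of-thought prompting (not taken from the Google paper).
# The exemplar question and reasoning are made up for demonstration purposes;
# sending the prompt to a model is left out, as it depends on the API used.

# A standard few-shot prompt: the exemplar contains only the final answer.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have now?\n"
    "A:"
)

# A chain-of-thought prompt: the exemplar spells out intermediate reasoning
# steps, nudging the model to reason step by step before answering.
# According to the research cited above, this only pays off above a certain model size.
chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls are 6 more balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have now?\n"
    "A:"
)

print(chain_of_thought_prompt)
```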
Scaling and the role of benchmarks
Wei sees his count as an indication that further scaling is reasonable:
Overall, the existence of emergent abilities implies that scaling further would unlock even more emergent abilities. This idea is super exciting to me.
Jason Wei
Indeed, various studies on scaling laws show that further increases in performance can be expected. Which emergent abilities will result, if any, is naturally difficult to predict.
At the same time, current models may already hold emergent abilities that have not been discovered yet, simply because no benchmark exists to reveal them. As is so often the case in AI research, benchmarks therefore play a central role.
Beyond scaling, other techniques can improve existing models, such as reinforcement learning from human feedback, most recently applied to GPT-3. Improvements to the architecture and the quality of training data, better prompts, or links to external modules will lead to further performance gains.
Where the road for large language models is headed will likely become clearer in 2023, when OpenAI's GPT-4 is expected to show whether the company has made progress on perhaps the biggest scaling goal - logical reasoning.
Wei's full list is in his blog post, "137 emergent abilities of large language models."