Meta's Llama 3 was trained on a record amount of data, an approach that could prompt a rethink across the AI industry and lead to better models.
With Llama 3, Meta has introduced a new language model that significantly outperforms other models in some areas. According to Meta, the key to this performance boost is a much larger pre-training dataset and fine-tuning on 10 million high-quality examples.
While it was already clear that high-quality data can improve the performance of even smaller language models - which Microsoft recently confirmed again with its Phi-3 models - the amount of data used for pre-training is surprising. Even the 8-billion-parameter model was trained on around 15 trillion tokens. This not only far exceeds the amount of data used for Llama 2, but also the amount considered optimal according to the Chinchilla scaling laws.
Language models could be significantly undertrained
These laws, developed by DeepMind, state that for an 8-billion-parameter model, around 200 billion training tokens are optimal to use computing power most efficiently. Llama 3 was trained on 75 times that amount of data.
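For anyone who wants to check the arithmetic, here is a minimal Python sketch of that comparison. The 200-billion-token figure is the compute-optimal point cited above for an 8-billion-parameter model, and 15 trillion tokens is Meta's reported training budget; the script is purely illustrative.

```python
# Purely illustrative sketch (not Meta's or DeepMind's code): the
# back-of-the-envelope arithmetic behind the "75 times" figure above.

CHINCHILLA_OPTIMAL_TOKENS = 200e9  # ~200 billion tokens, the compute-optimal point cited for an 8B model
LLAMA3_TOKENS = 15e12              # ~15 trillion training tokens reported by Meta

ratio = LLAMA3_TOKENS / CHINCHILLA_OPTIMAL_TOKENS
print(f"Llama 3 8B was trained on about {ratio:.0f}x the Chinchilla-optimal token count.")
# prints: Llama 3 8B was trained on about 75x the Chinchilla-optimal token count.
```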
AI researcher Andrej Karpathy explains on X (formerly Twitter) that the Chinchilla law "tells you the point of compute optimality" but says nothing about how far a model can be trained until it reaches its maximum performance. Karpathy is a founding member of OpenAI and was formerly head of AI at Tesla.
Congrats to @AIatMeta on Llama 3 release!! 🎉 https://t.co/fSw615zE8S

Notes: Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @lmsysorg :))

400B is still training, but already encroaching…

Andrej Karpathy (@karpathy), April 18, 2024
Despite the enormous amount of training data, Meta found that the "8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens," according to a blog post by the company.
Karpathy says this could suggest that most language models currently in use "are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence." He hopes other AI companies will follow Meta's example and release more long-trained, smaller models.
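To make the phrase "improve log-linearly" concrete: it means the evaluation loss falls by a roughly constant amount every time the token count doubles, with no clear plateau yet. The sketch below fits that kind of curve to invented loss values; the numbers are hypothetical and are not Meta's results.

```python
# Illustrative sketch only: what "log-linear improvement" looks like.
# The loss values are invented for demonstration and are NOT Meta's data.
import numpy as np

tokens = np.array([0.5e12, 1e12, 2e12, 4e12, 8e12, 15e12])  # training tokens
loss = np.array([2.10, 1.98, 1.86, 1.74, 1.62, 1.51])       # hypothetical evaluation loss

# Log-linear means loss ~ a + b * log(tokens): each doubling of the token
# count buys a roughly constant drop in loss.
b, a = np.polyfit(np.log(tokens), loss, 1)
print(f"Fitted slope per natural-log unit of tokens: {b:.3f}")
print(f"Approximate loss drop per doubling of tokens: {abs(b) * np.log(2):.3f}")
```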
It is still unclear how much longer a language model can usefully be trained before the additional gains become too small to matter. Meta has shown, however, that the limits of what is possible have not yet been reached.