AI research

Current LLMs "undertrained by a factor of maybe 100-1000X or more," says OpenAI co-founder

Maximilian Schreiner


Meta's Llama 3 was trained on a record amount of data, which could lead to a rethinking of the entire AI industry and produce better models.

With Llama 3, Meta has introduced a new language model that significantly outperforms other models in some areas. According to Meta, the key to this performance boost is the greatly increased amount of training data and fine-tuning on 10 million high-quality examples.

While it was already clear that high-quality data can improve the performance of even smaller language models - which Microsoft recently confirmed again with its Phi-3 models - the amount of data used for pre-training is surprising. Even the 8-billion-parameter model was trained on around 15 trillion tokens. Notably, this far exceeds not only the amount of data used for Llama 2, but also the amount considered optimal according to the Chinchilla scaling laws.

Language models could be significantly undertrained

These laws, developed by DeepMind, state that around 200 billion training tokens are optimal for an 8-billion-parameter model to use computing power most efficiently. Llama 3 was trained on 75 times that amount of data.
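For illustration, that figure can be reproduced with the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter. The sketch below uses that heuristic and the token counts mentioned in this article; it is a back-of-the-envelope illustration, not Meta's or DeepMind's own calculation.

```python
# Back-of-the-envelope Chinchilla arithmetic (illustrative only).
# Assumes the widely quoted heuristic of ~20 training tokens per parameter;
# the exact compute-optimal ratio depends on the assumptions in the paper.

params = 8e9              # Llama 3 8B parameter count
tokens_per_param = 20     # common Chinchilla rule of thumb
llama3_tokens = 15e12     # pre-training tokens reported by Meta

optimal_tokens = params * tokens_per_param   # ~1.6e11, on the order of the ~200B cited above
ratio = llama3_tokens / 200e9                # ~75x the ~200B figure

print(f"Compute-optimal tokens (heuristic): ~{optimal_tokens:.2e}")
print(f"Llama 3 used {llama3_tokens:.2e} tokens, about {ratio:.0f}x the ~200B estimate")
```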

AI researcher Andrej Karpathy explains on X (formerly Twitter) that the Chinchilla law "tells you the point of compute optimality" but that it says nothing about how far a model can be trained before it reaches its maximum performance. Karpathy is a founding member of OpenAI and was formerly head of AI at Tesla.

Despite the enormous amount of training data, Meta found that the "8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens," according to a blog post by the company.
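To make the "log-linear" phrasing concrete: it means a benchmark score keeps rising roughly in proportion to the logarithm of the number of training tokens, so each additional order of magnitude of data buys a similar-sized improvement. The toy numbers below are purely illustrative and are not taken from Meta's results.

```python
import math

# Toy illustration of log-linear improvement (made-up numbers, not Meta's data):
# score = a + b * log10(tokens), so every 10x increase in tokens adds the same amount.
a, b = 20.0, 5.0   # hypothetical intercept and slope

for tokens in (1e12, 3e12, 15e12):
    score = a + b * math.log10(tokens)
    print(f"{tokens:.0e} tokens -> hypothetical score {score:.1f}")
```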

Karpathy says this could suggest that most language models currently in use "are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence." He hopes other AI companies will follow Meta's example and release more long-trained, smaller models.

It is still unclear how far a language model's performance can be pushed through ever longer training before the gains become negligible. However, Meta has shown that the limits of what is possible have not yet been reached.
