
Meta's Llama 3 was trained on a record amount of data, which could prompt a rethink across the AI industry and lead to better models.

With Llama 3, Meta has introduced a new language model that significantly outperforms other models in some areas. According to Meta, the key to this performance boost is the greatly expanded training data and fine-tuning on 10 million high-quality examples.

While it was already clear that high-quality data can improve the performance of even smaller language models - which Microsoft recently confirmed again with its Phi-3 models - the amount of data used for pre-training is surprising. Even the 8-billion-parameter model was trained on around 15 trillion tokens. Notably, this not only far exceeds the amount of data used for Llama 2, but also the amount considered optimal according to the Chinchilla scaling laws.

Language models could be significantly undertrained

These laws, developed by DeepMind, state that for an 8-billion-parameter model, around 200 billion training tokens are considered optimal for the most efficient use of computing power. Llama 3 was trained on 75 times that amount of data.
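As a rough back-of-the-envelope check of the "75 times" figure, the Python sketch below simply divides the two token counts cited in this article; the numbers are the article's approximations, not Meta's or DeepMind's own calculations.

```python
# Back-of-the-envelope check of the "75 times" figure, using the
# approximate token counts cited in the article.

chinchilla_optimal_tokens = 200e9  # ~200 billion tokens: the Chinchilla-optimal
                                   # estimate cited for an 8-billion-parameter model
llama3_training_tokens = 15e12     # ~15 trillion tokens used to train Llama 3 8B

ratio = llama3_training_tokens / chinchilla_optimal_tokens
print(f"Llama 3 8B was trained on ~{ratio:.0f}x the Chinchilla-optimal token count")
# -> Llama 3 8B was trained on ~75x the Chinchilla-optimal token count
```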

AI researcher Andrej Karpathy explains on X (formerly Twitter) that the Chinchilla law "tells you the point of compute optimality" but says nothing about how far a model can be trained until it reaches its maximum performance. Karpathy is a founding member of OpenAI and was formerly head of AI at Tesla.

Despite the enormous amount of training data, Meta found that the "8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens," according to a blog post by the company.
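To make "improve log-linearly" concrete: the loss keeps falling roughly in proportion to the logarithm of the number of training tokens, so each multiplicative increase in data yields a similar absolute improvement. The sketch below is purely illustrative; the coefficients are hypothetical placeholders, not values reported by Meta.

```python
# Illustrative sketch of log-linear improvement: loss that falls roughly
# linearly in log(training tokens). The coefficients a and b are hypothetical
# placeholders for illustration only, NOT values reported by Meta.
import math

a, b = 3.0, 0.15  # hypothetical fit parameters

def toy_loss(tokens: float) -> float:
    """Toy log-linear curve: loss = a - b * log10(tokens)."""
    return a - b * math.log10(tokens)

for tokens in (1e12, 5e12, 15e12):
    print(f"{tokens / 1e12:>4.0f}T tokens -> toy loss {toy_loss(tokens):.3f}")
# Each multiplicative increase in data buys a similar absolute drop in loss,
# which is why gains continue but shrink relative to the extra compute spent.
```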

Karpathy says this could suggest that most language models currently in use "are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence." He hopes other AI companies will follow Meta's example and release more long-trained, smaller models.

It is still unclear how much longer training can keep improving a language model before the gains become too small to matter. However, Meta has shown that the limits of what is possible have not yet been reached.

Summary
  • Meta has introduced Llama 3, a new language model that has been trained on a record amount of data and outperforms other models.
  • Even the 8-billion-parameter model was trained with about 15 trillion tokens, which exceeds the amount of data considered optimal according to DeepMind's Chinchilla scaling laws by a factor of 75.
  • According to AI researcher Andrej Karpathy, this could indicate that most current language models are undertrained by a factor of 100 to 1000 or more and have not yet reached their full potential.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.