Metas "LLaMA" language model shows that parameters are not everything

Matthias Bastian
A small brown llama in a huge library with thousands of books. | Image: Midjourney prompted by THE DECODER

Meta introduces the LLaMA language models, which, despite having relatively few parameters, can outperform much larger language models such as GPT-3.

Meta's AI research department releases four foundation models ranging from 7 to 65 billion parameters. The 13-billion-parameter LLaMA (Large Language Model Meta AI) is said to outperform Meta's open-source OPT model and OpenAI's 175-billion-parameter GPT-3 on "most" language tasks.

More data for better results

The largest LLaMA model, with 65 billion parameters, is said to compete with Google's massive PaLM model with 540 billion parameters and to be on par with DeepMind's Chinchilla, according to the researchers.

Benchmarking results for natural language reasoning tasks. | Image: Meta

The LLaMA model requires a similar number of GPU training hours, and thus causes similar CO₂ emissions, as the 175-billion-parameter models OPT and BLOOM. However, its inference cost is lower (see below).

CO₂ emissions of the smaller models trained on more data, compared to larger language models. | Image: Meta

The comparison with Chinchilla is interesting because DeepMind took the same approach with that model as Meta now does with LLaMA: training on a much larger number of tokens than usual. LLaMA is Meta's Chinchilla, so to speak, and the researchers explicitly cite DeepMind's model as an inspiration.

LLaMA shows that training on more data pays off in performance. Training becomes more expensive and time-consuming, but the resulting model is cheaper to run afterward.

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference.

From the paper
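To make that trade-off concrete, here is a rough back-of-envelope sketch in Python using the common approximations that training compute scales with about 6 × parameters × training tokens and inference compute with about 2 × parameters per generated token. The model sizes, token counts, and serving volume are illustrative assumptions for the comparison, not figures from Meta.

```python
# Rough illustration of the train-vs-inference trade-off described above.
# Common approximations: training compute ~ 6 * params * tokens,
# inference compute ~ 2 * params per generated token.
# All concrete numbers below are illustrative assumptions, not Meta's figures.

def training_flops(params: float, train_tokens: float) -> float:
    return 6 * params * train_tokens

def inference_flops(params: float, served_tokens: float) -> float:
    return 2 * params * served_tokens

scenarios = {
    "175B params, 300B training tokens (GPT-3-like)": (175e9, 300e9),
    "13B params, 1T training tokens (LLaMA-13B-like)": (13e9, 1.0e12),
}

served_tokens = 1e12  # tokens generated over the model's deployed lifetime (assumption)

for name, (params, train_tokens) in scenarios.items():
    train = training_flops(params, train_tokens)
    serve = inference_flops(params, served_tokens)
    print(f"{name}:")
    print(f"  training ~{train:.2e} FLOPs, serving ~{serve:.2e} FLOPs, total ~{train + serve:.2e}")
```

In this toy calculation, every token the large model generates costs more than ten times as much compute as a token from the small one, so the longer the model is served, the more the smaller, longer-trained model pulls ahead on total cost.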

LLaMA's 13-billion-parameter model, which operates at the GPT-3 level, runs on a single Nvidia Tesla V100 graphics card, according to the Meta research team. It could help democratize access to and research on large language models.
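A quick estimate shows why a model of this size can fit on one card: at half precision, 13 billion parameters take roughly 26 GB for the weights alone, which fits in the 32 GB variant of the V100, while the 16 GB variant would require quantization. The sketch below is a rough calculation under those assumptions, not Meta's deployment setup.

```python
# Back-of-envelope memory estimate for hosting a 13B-parameter model on one GPU.
# Counts weights only; activations and the KV cache add further overhead.
# Illustrative assumptions, not Meta's deployment figures.

PARAMS = 13e9
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}
GPU_MEMORY_GB = 32  # Tesla V100 32 GB variant (a 16 GB variant also exists)

for dtype, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    verdict = "fits" if weights_gb < GPU_MEMORY_GB else "does not fit"
    print(f"{dtype:>9}: ~{weights_gb:.0f} GB of weights -> {verdict} in {GPU_MEMORY_GB} GB")
```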

The LLaMA language models also show that larger models still have significant performance headroom if a company is willing to pay for training them on even more data. Meta's research team plans to do exactly that in the future, and to fine-tune the models with instructions.

We plan to release larger models trained on larger pretraining corpora in the future, since we have seen a constant improvement in performance as we were scaling.

From the paper

Public data for AI training

LLaMA differs from DeepMind's Chinchilla and other large language models in its training data, according to the Meta research team: LLaMA uses only publicly available data, while other models also rely on undocumented or non-public datasets for training.

Most of LLaMA's data (67 percent) comes from a cleaned version of the widely used English Common Crawl dataset. Other sources include public GitHub repositories and Wikipedia. The LLaMA models are therefore "compatible with open-sourcing," the team writes.

This is questionable, at least insofar as common open-source licenses do not yet provide for the use of licensed material in AI training, and the models typically do not cite sources in their output. And even if large companies currently treat publicly available data as fair game, it is unlikely that effective consent to its use for AI training can be inferred from its mere availability on the Internet. Going forward, the courts may help clarify this.

Meta releases the language models to selected partners in academia, government, and industry under a non-commercial license; the accompanying code is published under GPL v3. Interested parties can apply here. The model card and instructions for use are available on GitHub.