Content
summary Summary

A new dataset called FineWeb-Edu highlights the importance of high-quality learning content for the performance of large language models.

FineWeb-Edu is a new high-quality Hugging Face dataset for training large language models (LLMs). It's based on FineWeb, an already filtered web dataset with 15 trillion tokens from 96 CommonCrawl snapshots.

Hugging Face researchers used a classifier trained on the results of a Llama-3-70B-Instruct model's evaluation of FineWeb articles to filter FineWeb for educational content, creating FineWeb-Edu.

Only text data that scored at least 3 out of 5 on an educational scale were included in FineWeb-Edu. This filtered dataset has 1.3 trillion tokens, less than 10% of the original.

Ad
Ad

The researchers trained a set of 1.82 billion parameter LLMs on 350 billion tokens each from FineWeb-Edu and other datasets. They then compared the performance of the models on various benchmarks.

The result: FineWeb-Edu massively outperforms the unfiltered FineWeb dataset and all other public web datasets, especially on tasks requiring knowledge and logical reasoning.

A small test model pre-trained on FineWeb-Edu clearly beats models trained on other datasets. | Image: Hugging Face

To reach the performance of FineWeb-Edu, other datasets like C4 or Dolma need up to 10 times more training data. This again shows the effectiveness of focusing on high quality educational data, something Microsoft has already shown with its "Textbooks is all you need" research and tiny Phi models. But Microsoft hasn't made its classifier and dataset publicly available.

Quality beats quantity, but to scale AI, you need both

AI expert Andrej Karpathy shares the Hugging Face team's assessment. The average website on the Internet is so random and terrible that it is not even clear how previous LLMs were able to learn anything from it, Karpathy says.

"You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is pick them out," he writes.

Recommendation

Along with the 1.3 trillion token dataset (very high educational content), the researchers are also releasing a less heavily filtered 5.4 trillion token version (high educational content) on Hugging Face. Both datasets are freely available, and the researchers also detail their process for compiling the dataset.

The researchers hope to apply the FineWeb-Edu findings to other languages in the future, to make high-quality Web data available for different languages.

Going forward, the research shows that data quality and diversity could take precedence over sheer size in AI training. In addition, synthetically generated data with human quality control could be used to fill specific gaps in data sets or to achieve the scale still needed for new flagship models.

This also explains why OpenAI and other LLM developers are so interested in deals with established publishers. They want access to high-quality data sources such as textbooks, news articles, or scientific papers that can improve their model training. Some of this material has already been used in GPT-4 and others, but without permission.

Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Hugging Face researchers created FineWeb-Edu, a high-quality dataset for training large language models, by filtering the FineWeb dataset for educational content using a classifier. FineWeb-Edu contains 1.3 trillion tokens, less than 10% of the original dataset.
  • Language models trained on FineWeb-Edu significantly outperform models trained on unfiltered datasets, especially on tasks requiring knowledge and logical reasoning. To achieve the same performance as FineWeb-Edu, other datasets like C4 or Dolma need up to 10 times more training data.
  • The research demonstrates the importance of data quality and diversity in AI training, and suggests that synthetically generated data with human quality control could be used to fill specific gaps in datasets or achieve the scale needed for new flagship models. This also explains the interest of OpenAI and other AI developers in partnering with established publishers to access high-quality data sources.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.