A new dataset called FineWeb-Edu highlights the importance of high-quality learning content for the performance of large language models.
FineWeb-Edu is a new high-quality Hugging Face dataset for training large language models (LLMs). It is based on FineWeb, an already filtered web dataset containing 15 trillion tokens drawn from 96 Common Crawl snapshots.
Hugging Face researchers had a Llama-3-70B-Instruct model rate the educational quality of FineWeb samples, trained a classifier on those ratings, and then used that classifier to filter FineWeb for educational content, creating FineWeb-Edu.
Only text that scored at least 3 out of 5 on the educational scale was included in FineWeb-Edu. The filtered dataset contains 1.3 trillion tokens, less than 10% of the original.
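The same kind of filtering can be reproduced with the released classifier. The snippet below is a minimal sketch that assumes the classifier is published on the Hub as HuggingFaceFW/fineweb-edu-classifier and returns a single 0–5 regression score; the model card has the exact identifier and usage.

```python
# Minimal sketch: score a web page with the educational-quality classifier
# and keep it only if the score reaches the FineWeb-Edu threshold of 3.
# The repository name "HuggingFaceFW/fineweb-edu-classifier" and the single
# regression-score output are assumptions; verify against the model card.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def educational_score(text: str) -> float:
    """Return the classifier's 0-5 educational-quality score for a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

page = "Photosynthesis converts light energy into chemical energy in plants."
score = educational_score(page)
keep = score >= 3  # FineWeb-Edu keeps only documents scoring at least 3 of 5
print(f"score={score:.2f}, keep={keep}")
```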
The researchers trained a series of 1.82-billion-parameter LLMs, each on 350 billion tokens from FineWeb-Edu or one of several other datasets, and then compared the models' performance on various benchmarks.
The result: FineWeb-Edu massively outperforms the unfiltered FineWeb dataset and all other public web datasets, especially on tasks requiring knowledge and logical reasoning.
To reach the performance of FineWeb-Edu, other datasets such as C4 or Dolma need up to 10 times more training data. This again shows the effectiveness of focusing on high-quality educational data, something Microsoft has already demonstrated with its "Textbooks Are All You Need" research and its small Phi models. But Microsoft has not made its classifier and dataset publicly available.
Quality beats quantity, but to scale AI, you need both
AI expert Andrej Karpathy shares the Hugging Face team's assessment. The average website on the Internet is so random and terrible that it is not even clear how previous LLMs were able to learn anything from it, Karpathy says.
"You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is pick them out," he writes.
Along with the 1.3-trillion-token dataset (very high educational content), the researchers are also releasing a less heavily filtered 5.4-trillion-token version (high educational content) on Hugging Face. Both datasets are freely available, and the researchers also document the process they used to build them. A short loading example follows below.
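For anyone who wants to inspect the data, a small streamed sample can be pulled directly from the Hub. This is a sketch that assumes the repository name HuggingFaceFW/fineweb-edu, a sample configuration such as sample-10BT, and text/score columns; the dataset card lists the exact configurations.

```python
# Sketch: stream a few FineWeb-Edu documents without downloading the full corpus.
# The repository name "HuggingFaceFW/fineweb-edu", the "sample-10BT" config and
# the "text"/"score" column names are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,  # iterate lazily instead of materializing terabytes on disk
)

for i, row in enumerate(ds):
    print(row["text"][:120].replace("\n", " "), "| score:", row.get("score"))
    if i == 4:  # just peek at the first five documents
        break
```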
The researchers hope to apply the FineWeb-Edu approach to other languages in the future, making high-quality web data available beyond English.
Going forward, the research suggests that data quality and diversity could take precedence over sheer size in AI training. In addition, synthetically generated data with human quality control could be used to fill specific gaps in datasets or to reach the scale still needed for new flagship models.
This also explains why OpenAI and other LLM developers are so interested in deals with established publishers. They want access to high-quality data sources such as textbooks, news articles, or scientific papers that can improve their model training. Some of this material has already been used in GPT-4 and others, but without permission.