
A new study by MIT researchers underscores a notion that has been gaining traction in recent years: Less data can lead to better language models.


The team developed a technique in which small AI models select only the most useful parts of training data sets. They then used this selected data to train much larger models, which both performed better on benchmarks and required fewer training steps.

The approach, called "perplexity-based data pruning," has the smaller model assign a perplexity value to each example in the training data set. Perplexity is a measure of how "surprised" the model is by a given example. The idea is that examples with higher perplexity contain the most information and are therefore potentially the most useful for training the model.
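The basic idea can be illustrated with a short sketch. The code below is not the researchers' implementation: it assumes a small Hugging Face causal language model (here "gpt2" stands in for the 125-million-parameter reference model), a list of raw text examples, and an illustrative keep_fraction; it simply scores each example and keeps the ones the small model finds most (or least) surprising.

```python
# Minimal sketch of perplexity-based data pruning (illustrative, not the authors' code).
# Assumptions: "gpt2" as the small reference model, plain-text examples,
# and an arbitrary keep_fraction not taken from the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def example_perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood) of the example's tokens."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def prune_by_perplexity(texts, keep_fraction=0.5, keep="high"):
    """Score every example with the small reference model, then keep the
    highest- (or lowest-) perplexity fraction for training the large model."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    scored = [(example_perplexity(model, tokenizer, t), t) for t in texts]
    scored.sort(reverse=(keep == "high"))  # most surprising examples first if keep="high"
    n_keep = int(len(scored) * keep_fraction)
    return [text for _, text in scored[:n_keep]]

# Usage: pruned = prune_by_perplexity(raw_texts, keep_fraction=0.3, keep="high")
```

The pruned subset, rather than the full corpus, is then fed to the much larger model during training.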

Different approaches for different kinds of training data

In experiments, the researchers used a comparatively small model with 125 million parameters to reduce the training data for models more than 30 times larger.


The large models trained with this reduced data significantly outperformed baseline models trained on the full data sets. In one test, pruning increased the accuracy of a model with three billion parameters by more than two percentage points.

Image: Ankner et al.

Interestingly, they found that different datasets benefit from different pruning approaches, depending on the composition of the data. As a result, they recommend tailoring the choice of method to the particular data set.
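In practice, that recommendation amounts to a per-dataset choice of selection criterion and pruning rate on top of the sketch above. The dataset names and pairings below are purely illustrative assumptions, not results from the study.

```python
# Hypothetical per-dataset pruning configuration, building on the
# prune_by_perplexity() sketch above. Which criterion works best for a
# given corpus is an empirical question; these pairings are placeholders.
PRUNING_CONFIG = {
    "web_crawl_corpus":   {"keep": "high", "keep_fraction": 0.5},
    "curated_domain_mix": {"keep": "low",  "keep_fraction": 0.5},
}

def prune_dataset(name, texts):
    cfg = PRUNING_CONFIG.get(name, {"keep": "high", "keep_fraction": 0.5})
    return prune_by_perplexity(texts, **cfg)
```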

The MIT researchers see their work as an important step toward making data reduction a standard part of AI training, and say it reinforces earlier findings that more data does not necessarily lead to better language models.

Summary
  • MIT researchers have developed a technique called "perplexity-based data pruning," in which small AI models select only the most useful parts of training data sets, which are then used to train much larger models.
  • The approach has the smaller model assign a perplexity value to each example in the training data set; higher-perplexity examples contain the most information and are potentially the most useful for training the model.
  • Experiments showed that large models trained with this reduced data outperformed base models trained with full data sets, and the researchers recommend tailoring the choice of pruning method to the particular data set, as different datasets benefit from different approaches.