
Researchers introduce a new method called "Selective Language Modeling" that trains language models more efficiently by focusing on the most relevant tokens.

The method leads to significant performance improvements on mathematical tasks, according to a new paper from researchers at Microsoft, Xiamen University, and Tsinghua University. Instead of weighting all tokens in a text corpus equally during training, as is standard practice, Selective Language Modeling (SLM) focuses specifically on the most relevant tokens.

The researchers first analyzed the training dynamics at the token level. They found that the loss of individual tokens evolves very differently over the course of training: some tokens are learned quickly, others hardly at all.
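
As a rough illustration of this kind of token-level analysis (a minimal sketch, not the paper's code; the checkpoint paths and Hugging Face helpers are assumptions), one could evaluate a few saved checkpoints on the same text and record the loss of every token separately. Each column of the resulting matrix is then the loss trajectory of one token over training.

```python
# Minimal sketch: per-token loss curves across training checkpoints.
# Checkpoint paths are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoints = ["ckpt-1000", "ckpt-5000", "ckpt-20000"]  # hypothetical paths
tokenizer = AutoTokenizer.from_pretrained("ckpt-1000")
text = "The derivative of sin(x) is cos(x)."
ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, seq_len)

per_token_curves = []  # one loss vector per checkpoint
for path in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(path).eval()
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab)
    # Next-token prediction: the logits at position t predict token t+1.
    losses = F.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none"
    )  # (seq_len - 1,): one loss value per predicted token
    per_token_curves.append(losses)

# (num_checkpoints, seq_len - 1): each column is one token's loss over training.
trajectories = torch.stack(per_token_curves)
```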

Based on these findings, the researchers developed a three-step process:
1. First, a reference model is trained on a high-quality, manually filtered dataset, for example a math corpus.
2. Using the reference model, the loss is then calculated for each token in the entire training corpus, which also contains many irrelevant tokens.
3. The actual language model is then trained selectively on the tokens that show a large difference between the loss of the current model and the loss of the reference model (a minimal sketch follows below).
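
A minimal sketch of steps 2 and 3, assuming a PyTorch-style training setup (this is our own illustration, not the authors' implementation; the function names, the keep_ratio value, and the top-k selection details are assumptions):

```python
import torch
import torch.nn.functional as F

def per_token_loss(model, input_ids):
    """Cross-entropy of each predicted token, shape (batch, seq_len - 1)."""
    logits = model(input_ids).logits
    return F.cross_entropy(
        logits[:, :-1].transpose(1, 2),  # (batch, vocab, seq_len - 1)
        input_ids[:, 1:],                # targets shifted by one position
        reduction="none",
    )

def selective_lm_loss(model, ref_model, input_ids, keep_ratio=0.6):
    """Train only on the tokens with the largest excess loss
    (current model loss minus reference model loss)."""
    with torch.no_grad():
        ref_loss = per_token_loss(ref_model, input_ids)   # step 2: reference scores
    cur_loss = per_token_loss(model, input_ids)
    excess = cur_loss - ref_loss                          # score every token
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = excess.flatten().topk(k).values.min()
    mask = (excess >= threshold).float()                  # step 3: select tokens
    return (cur_loss * mask).sum() / mask.sum()
```

In practice, such a selection would typically be applied per batch, with the kept fraction of tokens treated as a tunable hyperparameter.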

Image: Microsoft

In the mathematical example, tokens in sentences like "2 + 2 = 4" or "The derivative of sin(x) is cos(x)" are assigned a low perplexity because they fit well with the learned knowledge of the reference model. Tokens in sentences like "Click here for great insurance" are assigned a high perplexity because they have nothing to do with math.

While such cases can still be removed from the training data fairly reliably with classical filtering methods, it becomes harder with sentences like "The farm has 35 hens <Apr12 1:24> and 12 pigs. ##davidjl123 says totaling 47 animals." This sentence mixes useful information (the number of animals on the farm) with irrelevant or noisy fragments (the timestamp, the username, and the garbled phrase "says totaling"). Because SLM works at the token level, it can still prioritize the training-relevant tokens within such a sentence.
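
To make this concrete, the sketch below (again our own illustration; "ref-math-model" is a placeholder checkpoint name) scores the mixed example sentence token by token under a reference model. A classical filter only sees one perplexity value for the whole sentence, while token-level scores can separate the arithmetic from the timestamp and username fragments.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name standing in for a math-trained reference checkpoint.
ref_model = AutoModelForCausalLM.from_pretrained("ref-math-model").eval()
tokenizer = AutoTokenizer.from_pretrained("ref-math-model")

text = ("The farm has 35 hens <Apr12 1:24> and 12 pigs. "
        "##davidjl123 says totaling 47 animals.")
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = ref_model(ids).logits
# One loss value per predicted token (next-token prediction, shifted by one).
losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

# Classical filtering sees only this single number per sentence ...
print("sentence perplexity:", losses.mean().exp().item())

# ... while token-level scores distinguish the parts: the math-related tokens
# should get low loss under the reference model, the timestamp and username
# fragments high loss.
for tok_id, loss in zip(ids[0, 1:], losses):
    print(f"{tokenizer.convert_ids_to_tokens(tok_id.item()):>12}  {loss.item():.2f}")
```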

In this way, the system specifically learns the tokens that are most relevant for the target task.

Selective Language Modeling trains faster and increases accuracy

In mathematics, SLM increased accuracy by more than 16% on benchmarks such as GSM8K and MATH for the team's RHO-1 model with 1 billion parameters. In addition, the accuracy of the baseline was reached up to 10 times faster.

The 7 billion parameter variant of RHO-1 matched the performance of a DeepSeekMath model trained on 500 billion tokens while using only 15 billion training tokens. After fine-tuning, the SLM models achieved state-of-the-art results on the MATH dataset.


Even outside of mathematics, SLM improved the performance of the Tinyllama-1B model by an average of 6.8% across 15 benchmarks after training on 80 billion tokens. The gains were particularly pronounced for code and math tasks, with improvements of more than 10%.

The researchers attribute the success of SLM to the method's ability to identify tokens that are relevant to the desired distribution. They hope the approach can help develop tailored AI models faster and more cost-effectively. The method could also further improve open-source models like Meta's Llama 3 through SLM-based fine-tuning.

More information, the code, and the RHO-1 model are available on GitHub.

Summary
  • Researchers have developed a method called Selective Language Modeling (SLM), which trains language models more efficiently by focusing on the most relevant tokens. First, a reference model is trained, which is used to calculate the relevance of each token in the entire training corpus.
  • The actual language model is then trained specifically on the tokens that show a high difference between the loss of the reference model and the current model. In this way, the system learns the most relevant tokens for the target task.
  • With only 15 billion training tokens, RHO-1 trained with SLM achieved performance comparable to a DeepSeekMath model trained with 500 billion tokens. The method could help develop AI models more quickly and cost-effectively.