
A new study suggests that much of the improvement in the performance of large language models in recent years may be due to task contamination.

In a new paper, researchers at the University of California, Santa Cruz, show the possible effects of task contamination on the performance of large language models such as GPT-3 in zero-shot and few-shot tasks.

Task contamination refers to a phenomenon in which an AI model is exposed to examples or data during training that are later used as part of test or evaluation tasks. This can skew the results of zero-shot or few-shot evaluations because the model is not truly "blind" to the tasks - it has already seen similar or identical tasks during training.

In practice, the model may then perform better on certain tasks not because it can learn from few or no examples (as true zero-shot or few-shot learning would require), but because it has already been exposed to similar examples during training. Task contamination thus calls into question the model's ability to handle new, unfamiliar tasks and may lead to an overestimation of its performance.


Study reveals task contamination in language models

The team looked at different variants of the GPT-3 model series, including GPT-3.5-Turbo, as well as several open language models such as Meta's Llama, Bloom, Alpaca, and Vicuna.

The researchers found that performance on datasets published before the training data was collected was significantly better than on more recent datasets. This strongly suggests task contamination.
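To illustrate the idea behind this chronological comparison, here is a minimal Python sketch (not the authors' code); the cutoff date, dataset names, and scores are hypothetical placeholders.

```python
# Minimal sketch of the chronological analysis: compare a model's average
# score on datasets released before vs. after its training data cutoff.
# All dates and accuracies below are illustrative placeholders.
from datetime import date
from statistics import mean

TRAINING_CUTOFF = date(2021, 9, 1)  # assumed training data cutoff

# (dataset name, release date, model accuracy) - hypothetical values
results = [
    ("dataset_a", date(2019, 5, 1), 0.81),
    ("dataset_b", date(2020, 11, 1), 0.78),
    ("dataset_c", date(2022, 3, 1), 0.55),
    ("dataset_d", date(2023, 1, 1), 0.52),
]

before = [acc for _, released, acc in results if released < TRAINING_CUTOFF]
after = [acc for _, released, acc in results if released >= TRAINING_CUTOFF]

print(f"mean accuracy, pre-cutoff datasets:  {mean(before):.2f}")
print(f"mean accuracy, post-cutoff datasets: {mean(after):.2f}")
# A large gap in favor of pre-cutoff datasets is the signal the study
# interprets as evidence of task contamination.
```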

The study also included an analysis of the training data of open models and a membership inference attack. By examining the training data and extracting task examples from the models, the researchers found further evidence of task contamination: the methods showed that certain task examples were present in the training data, which can distort the evaluation of the models' zero- and few-shot abilities.
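The training data inspection can be pictured roughly like this: search an open model's training corpus for evaluation examples that appear verbatim. The following simplified Python sketch uses placeholder data and is a stand-in for, not a reproduction of, the paper's method.

```python
# Hedged sketch: flag evaluation examples that occur verbatim as long
# substrings of training corpus chunks. Corpus and examples are toy data.
def contaminated_examples(eval_examples: list[str], corpus_chunks: list[str],
                          min_len: int = 50) -> list[str]:
    """Return evaluation examples found verbatim in the training corpus."""
    hits = []
    for example in eval_examples:
        needle = example.strip()
        if len(needle) < min_len:
            continue  # very short strings match too easily
        if any(needle in chunk for chunk in corpus_chunks):
            hits.append(example)
    return hits

# Illustrative usage with toy data:
corpus = ["web text ... premise: A man plays guitar. hypothesis: A person "
          "makes music. label: entailment ... more text"]
evals = ["premise: A man plays guitar. hypothesis: A person "
         "makes music. label: entailment"]
print(contaminated_examples(evals, corpus, min_len=20))
```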

Using the membership inference attack, the team also checked whether content generated by the models exactly matched examples from the evaluation datasets. A high degree of agreement indicates contamination, and here too the team found evidence of task contamination.
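The exact-match check can be sketched as follows: prompt the model with the beginning of a benchmark example and test whether it completes the rest verbatim. In the sketch below, `generate` is an assumed placeholder for whatever model API is being probed, not a real library call.

```python
# Rough sketch of a membership-inference-style exact-match check.
from statistics import mean

def looks_memorized(example: str, generate, prefix_ratio: float = 0.5) -> bool:
    """True if the model completes a truncated example verbatim."""
    cut = int(len(example) * prefix_ratio)
    prefix, expected_rest = example[:cut], example[cut:]
    completion = generate(prefix)  # placeholder model call
    return completion.strip().startswith(expected_rest.strip())

def memorization_rate(examples: list[str], generate) -> float:
    """Fraction of examples the model appears to reproduce exactly."""
    return mean(looks_memorized(e, generate) for e in examples)
# A high rate of exact matches is the kind of agreement the study reads
# as evidence of task contamination.
```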

The team has not yet investigated GPT-4, but points out that the problem of task contamination is likely to be even greater in reinforcement learning from human feedback (RLHF).


Task contamination as a driving force for better language model performance?

Closed models such as GPT-3.5-Turbo may overperform in zero- and few-shot tasks due to task contamination, the researchers say. In experiments with classification tasks where no task contamination was detectable, the language models rarely showed significant improvements over simple baselines in either zero- or few-shot scenarios.

The observed increase in the performance of GPT-3 models over time, from Davinci to GPT-3.5-Turbo, on similar tasks is therefore likely explained by task contamination, according to the researchers. However, checking training data for such contamination remains a challenge, especially for closed-source models, as it is often unclear what data was used and the model does not necessarily reveal evidence of contamination.

The team therefore recommends the publication of training datasets to improve the diagnosis and understanding of contamination problems. Transparent publication of training data would not only facilitate the identification of contamination but also contribute to the development of more robust and reliable language models.

Summary
  • A study from the University of California, Santa Cruz, shows that task contamination, where AI models are exposed to examples during training that are later used in test or evaluation tasks, affects the performance of language models such as GPT-3.
  • The researchers found that performance was significantly better on datasets published before the date of training data collection than on more recent datasets, indicating task contamination.
  • The team recommends the publication of training datasets to improve the diagnosis and understanding of contamination issues and to contribute to the development of more robust and reliable language models.