OpenAI has created a new benchmark called MLE-bench to evaluate how well AI agents can develop machine learning solutions. The benchmark includes 75 Kaggle competitions and aims to measure the progress of autonomous AI systems in ML engineering.

MLE-bench focuses on two key areas: selecting challenging tasks that represent current ML development and comparing AI results with human performance.

The 75 competitions cover various fields, including natural language processing, computer vision, and signal processing. Many tasks have real-world applications, such as predicting COVID-19 mRNA vaccine degradation or decoding ancient scrolls.

Initial tests show promise and limitations

OpenAI tested several AI models and agent frameworks on MLE-bench. The o1-preview model with the AIDE framework performed best, achieving at least a bronze medal in 16.9% of competitions. This result surpassed Anthropic's Claude 3.5 Sonnet.
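The headline metric is simple: the share of competitions in which an agent's submission reaches at least the bronze medal threshold on the corresponding Kaggle leaderboard. Below is a minimal sketch of how such a medal rate could be computed; the data structure and sample values are illustrative assumptions, not MLE-bench's actual evaluation code.

```python
# Illustrative sketch of a medal-rate calculation. The data model and thresholds
# are assumptions for this example, not MLE-bench's actual code.
from dataclasses import dataclass

@dataclass
class CompetitionResult:
    name: str
    agent_score: float        # agent's score on the competition metric (higher is better here)
    bronze_threshold: float   # score needed for at least a bronze medal

def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions where the agent reaches at least bronze."""
    medals = sum(r.agent_score >= r.bronze_threshold for r in results)
    return medals / len(results)

sample = [
    CompetitionResult("nlp-task", 0.91, 0.89),
    CompetitionResult("vision-task", 0.72, 0.80),
    CompetitionResult("signal-task", 0.64, 0.60),
]
print(f"medal rate: {medal_rate(sample):.1%}")  # 66.7% in this toy example
```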

The researchers also examined how different scaling methods affected agent performance. Giving agents more attempts per competition significantly improved success rates: with eight attempts, o1-preview's medal rate doubled to 34.1%. Longer runtimes also helped, with GPT-4o's medal rate rising from 8.7% to 11.8% when its time budget was extended from 24 to 100 hours. Additional GPU power, however, had little impact on performance.
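The effect of extra attempts is easy to see with a back-of-the-envelope model: if each attempt on a competition succeeds independently with probability p, the chance that at least one of n attempts medals grows as 1 - (1 - p)^n. The sketch below illustrates this; the per-competition probabilities are invented for the example and are not OpenAI's measurements.

```python
# Toy model of the "more attempts" effect: a competition counts as medaled if any of
# n independent attempts succeeds. The per-competition probabilities are made up.

def expected_medal_rate(p_per_comp: list[float], n_attempts: int) -> float:
    """Expected fraction of competitions medaled with best-of-n independent attempts."""
    return sum(1 - (1 - p) ** n_attempts for p in p_per_comp) / len(p_per_comp)

probs = [0.05, 0.10, 0.30, 0.02, 0.40]  # hypothetical per-competition success rates
print(f"1 attempt : {expected_medal_rate(probs, 1):.1%}")   # 17.4%
print(f"8 attempts: {expected_medal_rate(probs, 8):.1%}")   # ~59.6%
```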

MLE-bench to be continued

While creating MLE-bench, OpenAI faced challenges such as potential contamination from publicly available Kaggle competitions. To address this, the company used a plagiarism detector to compare agent submissions with top Kaggle solutions and conducted experiments to check for contamination effects.
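The contamination check boils down to comparing each agent submission against publicly available top solutions and flagging near-duplicates. A rough sketch of that idea follows, using Python's difflib as a stand-in similarity measure rather than the plagiarism detector OpenAI actually used.

```python
# Rough sketch of a contamination check: flag agent submissions that closely match
# published top Kaggle solutions. difflib is a stand-in here, not OpenAI's actual tool.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two code files (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()

def is_suspicious(submission: str, public_solutions: list[str], threshold: float = 0.8) -> bool:
    """True if the submission closely matches any known public solution."""
    return any(similarity(submission, sol) >= threshold for sol in public_solutions)
```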

OpenAI acknowledges that MLE-bench doesn't cover all aspects of AI research and development. The benchmark focuses on tasks with clear problems, clean datasets, and straightforward evaluation metrics. Real-world challenges are often less well-defined.

Despite these limitations, OpenAI sees MLE-bench as a valuable tool for assessing core ML engineering skills. These include preparing large multimodal datasets, managing long-term training procedures, and debugging underperforming models.

The MLE-bench benchmark is available on GitHub.

Summary
  • OpenAI has launched MLE-bench, a new benchmark to measure the capabilities of AI agents in the development of machine learning solutions. The test includes 75 Kaggle competitions from different domains such as natural language processing and computer vision.
  • In initial experiments, the o1-preview model with the AIDE framework performed best, earning at least a bronze medal in 16.9% of the competitions. More attempts per competition and longer processing times improved results, while additional GPU power had no significant impact.
  • OpenAI sees MLE-bench as an important tool for evaluating core ML engineering skills, but acknowledges that the benchmark does not cover all aspects of AI research. To guard against possible contamination, the company implemented measures such as a plagiarism detector.