OpenAI's own AI engineering benchmark gives o1-preview top marks
Key Points
- OpenAI has launched MLE-bench, a new benchmark to measure the capabilities of AI agents in the development of machine learning solutions. The test includes 75 Kaggle competitions from different domains such as natural language processing and computer vision.
- In initial experiments, the o1-preview model paired with the AIDE agent framework performed best, earning at least a bronze medal in 16.9% of the competitions. More attempts per competition and longer processing times improved results, while additional GPU power had no significant impact.
- OpenAI sees MLE-bench as an important tool for evaluating core competencies in ML engineering, but acknowledges that the benchmark does not cover all aspects of AI research. To guard against possible contamination from publicly available Kaggle solutions, the company implemented measures such as a plagiarism detector.
OpenAI has created a new benchmark called MLE-bench to evaluate how well AI agents can develop machine learning solutions. The benchmark includes 75 Kaggle competitions and aims to measure the progress of autonomous AI systems in ML engineering.
MLE-bench is built around two goals: selecting challenging tasks that reflect current ML engineering work, and comparing agent results against human performance on the original Kaggle leaderboards.
The 75 competitions cover various fields, including natural language processing, computer vision, and signal processing. Many tasks have real-world applications, such as predicting COVID-19 mRNA vaccine degradation or decoding ancient scrolls.
Initial tests show promise and limitations
OpenAI tested several AI models and agent frameworks on MLE-bench. The o1-preview model with the AIDE framework performed best, achieving at least a bronze medal in 16.9% of competitions. This result surpassed Anthropic's Claude 3.5 Sonnet.
The researchers also examined how different forms of scaling affect agent performance. More attempts per competition significantly improved success rates: with 8 attempts, o1-preview's medal rate roughly doubled to 34.1%. Longer processing times also helped, with GPT-4o raising its medal rate from 8.7% to 11.8% when the time limit was extended from 24 to 100 hours. Additional GPU power, however, had little impact on performance.
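To make the "medal rate at k attempts" idea concrete, here is a minimal Python sketch of best-of-k aggregation. The per-competition outcomes and the function name are hypothetical illustrations, not OpenAI's evaluation code: a competition counts as a success if any of the first k attempts earns a medal.

```python
# Minimal sketch of best-of-k medal-rate aggregation.
# The results below are made up for illustration; this is not MLE-bench code.

def medal_rate_at_k(results_per_competition: list[list[bool]], k: int) -> float:
    """Fraction of competitions where any of the first k attempts earned a medal."""
    successes = sum(any(attempts[:k]) for attempts in results_per_competition)
    return successes / len(results_per_competition)

# Each inner list holds medal outcomes for repeated attempts at one competition.
results = [
    [False, True, False],   # medal on the second attempt
    [False, False, False],  # no medal in any attempt
    [True, True, False],    # medal on the first attempt
]

print(medal_rate_at_k(results, k=1))  # 0.33... (single attempt)
print(medal_rate_at_k(results, k=3))  # 0.66... (best of three attempts)
```

Under this kind of scoring, giving an agent more attempts can only raise the measured rate, which is consistent with the jump reported for o1-preview.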
MLE-bench has limits, but OpenAI plans to build on it
While creating MLE-bench, OpenAI faced challenges such as potential contamination from publicly available Kaggle competitions. To address this, the company used a plagiarism detector to compare agent submissions with top Kaggle solutions and conducted experiments to check for contamination effects.
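For intuition, the sketch below shows a naive way to flag submissions that closely match known solutions. It is not the detector OpenAI used; production plagiarism checkers typically work on token- or AST-level fingerprints rather than raw string similarity, and the threshold and sample strings here are assumptions.

```python
# Naive illustration of flagging code that resembles an existing top solution.
# Not OpenAI's plagiarism detector; purely a conceptual sketch.
from difflib import SequenceMatcher

def similarity(agent_code: str, reference_code: str) -> float:
    """Return a rough 0-1 similarity score between two source files."""
    return SequenceMatcher(None, agent_code, reference_code).ratio()

def flag_contamination(agent_code: str, top_solutions: list[str],
                       threshold: float = 0.8) -> bool:
    """Flag a submission if it closely matches any archived top Kaggle solution."""
    return any(similarity(agent_code, ref) >= threshold for ref in top_solutions)

# Hypothetical usage: compare an agent's script against archived top solutions.
top_solutions = ["import pandas as pd\n# ...reference solution code...\n"]
agent_submission = "import pandas as pd\n# ...agent-generated code...\n"
print(flag_contamination(agent_submission, top_solutions))
```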
OpenAI acknowledges that MLE-bench doesn't cover all aspects of AI research and development. The benchmark focuses on tasks with clear problems, clean datasets, and straightforward evaluation metrics. Real-world challenges are often less well-defined.
Despite these limitations, OpenAI sees MLE-bench as a valuable tool for assessing core ML engineering skills. These include preparing large multimodal datasets, managing long-term training procedures, and debugging underperforming models.
The MLE-bench benchmark is available on GitHub.