OpenAI has created a new benchmark called MLE-bench to evaluate how well AI agents can develop machine learning solutions. The benchmark includes 75 Kaggle competitions and aims to measure the progress of autonomous AI systems in ML engineering.
MLE-bench was built around two goals: selecting challenging tasks that reflect how ML systems are actually developed today, and making AI results directly comparable with human performance on the same competitions.
The 75 competitions cover various fields, including natural language processing, computer vision, and signal processing. Many tasks have real-world applications, such as predicting COVID-19 mRNA vaccine degradation or decoding ancient scrolls.
Initial tests show promise and limitations
OpenAI tested several AI models and agent frameworks on MLE-bench. The o1-preview model with the AIDE framework performed best, achieving at least a bronze medal in 16.9% of competitions. This result surpassed Anthropic's Claude 3.5 Sonnet.
The researchers also examined how different ways of scaling up effort affected agent performance. Giving agents more attempts per competition significantly improved success rates: with eight attempts, o1-preview's medal rate roughly doubled to 34.1%. Longer time budgets also helped; GPT-4o's medal rate rose from 8.7% to 11.8% when its processing time was extended from 24 to 100 hours. Additional GPU power, however, had little impact on performance.
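To make the "more attempts" result concrete, the sketch below shows the standard unbiased pass@k-style estimator often used to report success rates over multiple sampled attempts. This is an illustration only: the function name medal_rate_at_k is hypothetical and the paper's exact evaluation code may differ.

```python
from math import comb

def medal_rate_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k sampled
    attempts earns a medal, given n total attempts of which c won a medal.
    (Standard pass@k estimator; illustrative, not OpenAI's actual code.)
    """
    if n - c < k:
        # Every possible subset of k attempts contains a medal-winning run.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: if 2 of 8 runs on a competition earned medals, the estimated
# medal rate is 0.25 with a single attempt and 1.0 when all 8 are allowed.
print(medal_rate_at_k(n=8, c=2, k=1))  # 0.25
print(medal_rate_at_k(n=8, c=2, k=8))  # 1.0
```

A headline number like "34.1% with 8 attempts" would then correspond to averaging such per-competition estimates over all 75 competitions.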
MLE-bench to be continued
While creating MLE-bench, OpenAI faced challenges such as potential data contamination: because the Kaggle competitions and their winning solutions are publicly available, they may have appeared in the models' training data. To address this, the company used a plagiarism detector to compare agent submissions with top Kaggle solutions and ran experiments to check whether contamination affected the results.
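As an illustration of the underlying idea, here is a minimal sketch that flags agent submissions which closely match known top Kaggle solutions using a simple textual similarity score from Python's difflib. OpenAI's actual detector and thresholds are not described here; the function names and file paths are hypothetical.

```python
from difflib import SequenceMatcher
from pathlib import Path

def similarity(a: str, b: str) -> float:
    """Return a heuristic similarity ratio in [0, 1] between two code files."""
    return SequenceMatcher(None, a, b).ratio()

def flag_suspicious(submission_path: str, solutions_dir: str, threshold: float = 0.8):
    """Flag an agent submission that closely resembles any top public solution.

    Illustrative sketch only: the real plagiarism check is more sophisticated.
    """
    submission = Path(submission_path).read_text()
    flags = []
    for solution_file in Path(solutions_dir).glob("*.py"):
        score = similarity(submission, solution_file.read_text())
        if score >= threshold:
            flags.append((solution_file.name, score))
    return flags

# Usage (hypothetical paths):
# print(flag_suspicious("agent_submission.py", "top_kaggle_solutions/"))
```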
OpenAI acknowledges that MLE-bench doesn't cover all aspects of AI research and development. The benchmark focuses on tasks with clear problems, clean datasets, and straightforward evaluation metrics. Real-world challenges are often less well-defined.
Despite these limitations, OpenAI sees MLE-bench as a valuable tool for assessing core ML engineering skills. These include preparing large multimodal datasets, managing long-running training procedures, and debugging underperforming models.
The MLE-bench benchmark is available on GitHub.