OpenAI has created a new benchmark called MLE-bench to evaluate how well AI agents can develop machine learning solutions. The benchmark includes 75 Kaggle competitions and aims to measure the progress of autonomous AI systems in ML engineering.
MLE-bench was built around two goals: selecting challenging tasks that reflect how ML systems are actually developed today, and making AI results directly comparable with human performance on the same competitions.
The 75 competitions cover various fields, including natural language processing, computer vision, and signal processing. Many tasks have real-world applications, such as predicting COVID-19 mRNA vaccine degradation or decoding ancient scrolls.
Initial tests show promise and limitations
OpenAI tested several AI models and agent frameworks on MLE-bench. The o1-preview model with the AIDE framework performed best, achieving at least a bronze medal in 16.9% of competitions. This result surpassed Anthropic's Claude 3.5 Sonnet.
The researchers also examined how different ways of scaling up effort affected agent performance. Giving agents more attempts per competition significantly improved success rates: with eight attempts, o1-preview's medal rate roughly doubled to 34.1%. Longer time budgets also helped; GPT-4o's medal rate rose from 8.7% to 11.8% when its processing time was extended from 24 to 100 hours. Additional GPU power, however, had little impact on performance.
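To make the "more attempts" result concrete, the sketch below shows the standard unbiased pass@k-style estimator often used to report success rates over multiple sampled attempts. This is an illustration only: the function name medal_rate_at_k is hypothetical and the paper's exact evaluation code may differ.

```python
from math import comb

def medal_rate_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k sampled
    attempts earns a medal, given n total attempts of which c won a medal.
    (Standard pass@k estimator; illustrative, not OpenAI's actual code.)
    """
    if n - c < k:
        # Every possible subset of k attempts contains a medal-winning run.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: if 2 of 8 runs on a competition earned medals, the estimated
# medal rate is 0.25 with a single attempt and 1.0 when all 8 are allowed.
print(medal_rate_at_k(n=8, c=2, k=1))  # 0.25
print(medal_rate_at_k(n=8, c=2, k=8))  # 1.0
```

A headline number like "34.1% with 8 attempts" would then correspond to averaging such per-competition estimates over all 75 competitions.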
MLE-bench to be continued
While creating MLE-bench, OpenAI faced challenges such as potential data contamination: because the Kaggle competitions and their winning solutions are publicly available, they may have appeared in the models' training data. To address this, the company used a plagiarism detector to compare agent submissions with top Kaggle solutions and ran experiments to check whether contamination affected the results.
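As an illustration of the underlying idea, here is a minimal sketch that flags agent submissions which closely match known top Kaggle solutions using a simple textual similarity score from Python's difflib. OpenAI's actual detector and thresholds are not described here; the function names and file paths are hypothetical.

```python
from difflib import SequenceMatcher
from pathlib import Path

def similarity(a: str, b: str) -> float:
    """Return a heuristic similarity ratio in [0, 1] between two code files."""
    return SequenceMatcher(None, a, b).ratio()

def flag_suspicious(submission_path: str, solutions_dir: str, threshold: float = 0.8):
    """Flag an agent submission that closely resembles any top public solution.

    Illustrative sketch only: the real plagiarism check is more sophisticated.
    """
    submission = Path(submission_path).read_text()
    flags = []
    for solution_file in Path(solutions_dir).glob("*.py"):
        score = similarity(submission, solution_file.read_text())
        if score >= threshold:
            flags.append((solution_file.name, score))
    return flags

# Usage (hypothetical paths):
# print(flag_suspicious("agent_submission.py", "top_kaggle_solutions/"))
```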
OpenAI acknowledges that MLE-bench doesn't cover all aspects of AI research and development. The benchmark focuses on tasks with clear problems, clean datasets, and straightforward evaluation metrics. Real-world challenges are often less well-defined.
Despite these limitations, OpenAI sees MLE-bench as a valuable tool for assessing core ML engineering skills. These include preparing large multimodal datasets, managing long-running training procedures, and debugging underperforming models.
The MLE-bench benchmark is available on GitHub.