OpenAI's new PaperBench benchmark reveals the current limitations of AI's ability to independently replicate scientific research, with human researchers still maintaining an edge.


The benchmark puts AI systems through a demanding test: recreate 20 research papers presented at ICML 2024, one of machine learning's most prestigious conferences. The papers cover a broad spectrum of machine learning research, from deep reinforcement learning to probabilistic methods and robustness testing.

To measure performance accurately, the team worked directly with the original authors to create an extensive evaluation framework. This resulted in more than 8,300 specific checkpoints that determine whether a system successfully replicated the research.
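The paper describes grading against a hierarchical rubric whose leaf-level checkpoints are assessed individually and rolled up into a single replication score. A minimal sketch of that kind of weighted roll-up, with entirely hypothetical node names and weights (this is not OpenAI's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One requirement in a hierarchical grading rubric."""
    name: str
    weight: float = 1.0
    passed: bool = False          # only meaningful for leaf checkpoints
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        """Leaf: 1.0 if the checkpoint passed, else 0.0. Internal node:
        weighted average of child scores, so partial credit propagates up."""
        if not self.children:
            return 1.0 if self.passed else 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total

# Hypothetical rubric fragment for one paper
rubric = RubricNode("paper", children=[
    RubricNode("code-matches-method", weight=2, passed=True),
    RubricNode("experiments", weight=3, children=[
        RubricNode("table-1-reproduced", passed=True),
        RubricNode("figure-2-reproduced", passed=False),
    ]),
])
print(round(rubric.score(), 2))
```

With thousands of such leaf checkpoints per paper, a tree like this lets a single percentage summarize how much of the original work was reproduced.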

While AI systems can search the internet for general information, they can't peek at the original authors' code. Instead, they must develop their own complete codebase and create a "reproduce.sh" script that runs all experiments automatically. Each system gets twelve hours to complete the task under standard testing conditions.
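The "reproduce.sh" requirement and the twelve-hour limit come from the article; a harness that enforces them could look roughly like the following sketch (the function name and return convention are illustrative assumptions, not OpenAI's code):

```python
import subprocess

def run_submission(submission_dir: str, limit_hours: float = 12.0) -> bool:
    """Runs the submission's reproduce.sh and reports whether it finished
    within the time limit with a zero exit code."""
    try:
        result = subprocess.run(
            ["bash", "reproduce.sh"],
            cwd=submission_dir,
            timeout=limit_hours * 3600,  # 12-hour budget by default
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False  # ran out of the time budget
    return result.returncode == 0
```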

Diagram: Overview of PaperBench AI research replication benchmark with inputs, agent submission, reproduction execution and AI evaluation.
AI agents receive research papers and detailed evaluation criteria, then must write code to replicate the original results. An AI evaluator assesses whether the submission meets all requirements. | Image: OpenAI

Reviewing each submission presents its own challenge. Human experts typically spend dozens of hours evaluating a single paper, making large-scale testing impractical. To solve this problem, OpenAI developed an AI-based evaluation system that dramatically reduces both time and cost.

The team's o3-mini model matches human judgment with 83 percent accuracy while cutting costs from thousands of dollars per paper to just $66. Their more powerful o1 model achieves slightly better accuracy at 84 percent, but comes with a higher price tag of $830 per paper.

Humans are slower but more thorough in reproducing research papers

Even the best-performing models struggle to replicate published research effectively. Anthropic's Claude 3.5 Sonnet leads with a 21 percent success rate in reproducing the papers' results. Other systems performed notably worse: GPT-4o managed only 4.1 percent, DeepSeek-R1 reached 6 percent, and Google's Gemini 2.0 Flash achieved just 3.2 percent.

Table with model names in the left column and numerical replication scores in the right column.
Basic reasoning capabilities don't improve paper replication success. Claude 3.5 Sonnet leads without specialized reasoning features. The newer 3.7 version wasn't included in testing. | Image: OpenAI

OpenAI developed an enhanced version of their agent framework called IterativeAgent to maximize performance. This framework forces AI models to use their full time allocation and tackle tasks incrementally. The results showed significant improvements: o1's success rate jumped from 13.2 to 24.4 percent, while o3-mini improved from 2.6 to 8.5 percent.
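The article says IterativeAgent forces models to use their full time allocation and work in small increments rather than declaring themselves done early. A rough sketch of that control loop, with a stubbed-out model call (illustrative only, not OpenAI's scaffold):

```python
import time

def iterative_agent(step_fn, budget_seconds: float) -> list[str]:
    """Repeatedly asks the model for its next incremental action until the
    time budget is exhausted, ignoring any premature 'finished' claims."""
    deadline = time.monotonic() + budget_seconds
    log: list[str] = []
    step = 0
    while time.monotonic() < deadline:
        action = step_fn(step, log)  # stand-in for a real model call
        log.append(action)           # the agent must always produce a next step
        step += 1
    return log

# Stub standing in for a model that would otherwise stop after one step
actions = iterative_agent(lambda i, log: f"step-{i}", budget_seconds=0.01)
```

Forcing continuation like this explains why some models improved (they stopped giving up early) while others, such as Claude 3.5 Sonnet, may have lost ground when their own stopping judgment was overridden.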

However, not every model benefited from this approach. Claude 3.5 Sonnet's performance actually dropped from 21 to 16.1 percent with the new framework. When researchers extended the time limit from 12 to 36 hours, o1 reached its best performance at 26 percent. These results highlight how sensitive AI models are to different prompting strategies and time limits, though longer processing times significantly increase computing costs.

Table with model names in the left column and numerical replication scores in the right column.
OpenAI's reasoning models outperformed Claude 3.5 Sonnet when using the specialized IterativeAgent approach, showing how system configuration affects results. | Image: OpenAI

Different approaches yield different results

To establish a human baseline, OpenAI recruited eight computer science PhD students from top universities including Berkeley, Cambridge, and Cornell. After 48 hours of work, these researchers achieved a 41.4 percent success rate - significantly outperforming every AI system tested.

The study revealed fundamental differences in how humans and AI systems tackle complex research tasks. While AI systems rapidly generate code in the first hour, they quickly reach a plateau and struggle with strategic planning and improvements. Human researchers take more time to understand the papers initially but show consistent progress throughout their work.

The research also exposed a critical weakness in current AI systems: most ended their work prematurely, either mistakenly believing they had finished or concluding they had hit unsolvable problems. Among all systems tested, only Claude 3.5 Sonnet consistently used its full time allocation.

The benchmark is now available on GitHub. OpenAI designed PaperBench to track AI systems' growing capacity for independent research, highlighting its importance for monitoring AI safety as these capabilities advance.


To make the benchmark more accessible, OpenAI also offers PaperBench Code-Dev, a simplified version that focuses solely on code development without execution. This streamlined variant reduces evaluation costs by 85 percent while still providing meaningful insights into AI capabilities.

Summary
  • OpenAI has introduced PaperBench, a new benchmark for evaluating the scientific reasoning capabilities of AI systems by challenging them to independently reproduce the key findings of research papers.
  • To assess the AI-generated reproductions, OpenAI developed an AI-based evaluation system that achieves 83% agreement with human expert judgments while significantly reducing costs from thousands of dollars per paper to just $66.
  • Initial tests reveal the limitations of current AI systems, with the top-performing model, Claude 3.5 Sonnet, successfully replicating only 21% of the papers, compared to a 41.4% replication rate achieved by human doctoral students working for 48 hours.
Jonathan writes for THE DECODER about how AI tools can make our work and creative lives better.