2025-11-23 · AI research
Gemini 3 Pro and GPT-5 still fail at complex physics tasks designed for real scientific research

Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.
A new physics benchmark called "CritPt" puts leading AI models to the test at the level of early-stage PhD research. The results show that even top systems like Gemini 3 Pro and GPT-5 still fall far short of acting as autonomous scientists.

More than 50 physicists from over 30 institutions built the "CritPt" benchmark to see whether AI can truly help researchers push the boundaries of modern physics. Their goal goes far beyond checking textbook recall. The benchmark asks models to solve original, unpublished research problems that resemble the work of a capable graduate student starting an independent project.

The early results set a sobering baseline. In an independent evaluation by Artificial Analysis, Google's "Gemini 3 Pro Preview" reached just 9.1 percent accuracy while using 10 percent fewer tokens than OpenAI's "GPT-5.1 (high)," which placed second at 4.9 percent. Even at the top of the leaderboard, the systems miss the vast majority of tasks.

CritPt Benchmark Leaderboard: Results
Independent evaluation by Artificial Analysis: Even the strongest models solve less than 10 percent of tasks. | Image: Artificial Analysis

Doctoral-level reasoning remains a major hurdle

CritPt includes 71 full research challenges from eleven physics fields, such as quantum physics, astrophysics, high-energy physics, and biophysics. To prevent guessing or retrieval, all problems are based on unpublished material. The team also broke the challenges down into a total of 190 smaller "checkpoints" to measure partial progress.

The findings offer a reality check: current large language models lack the rigor, creativity, and precision needed to solve open-ended physics problems on their own. Still, the models show measurable improvement on simpler, well-defined subtasks, which suggests that targeted support roles may be more realistic.

Models perform much better on simple checkpoints than on full research challenges. | Image: Zhu et al.

The team also tested consistency using a stricter metric called the "consistently solved rate," which requires a model to give the correct answer in at least four out of five attempts. Under this requirement, performance collapses across the board, showing how fragile model reasoning remains even on tasks the models sometimes solve.

The "Consistently Solved Rate" highlights how easily accuracy falls apart when models must be correct four times out of five. Gemini 3 Pro was not included in this study. | Image: Zhu et al.
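To make the stricter metric concrete, here is a minimal Python sketch of how a "consistently solved rate" could be computed next to plain average accuracy. The function names, the sample data, and the binomial estimate are illustrative assumptions and are not taken from the CritPt paper or its evaluation code. The binomial helper further assumes that attempts are independent, which real model runs need not be.

```python
from math import comb


def average_accuracy(runs_per_task):
    """Share of all individual attempts that were correct."""
    total = sum(len(runs) for runs in runs_per_task)
    correct = sum(sum(runs) for runs in runs_per_task)
    return correct / total if total else 0.0


def consistently_solved_rate(runs_per_task, required=4):
    """Share of tasks with at least `required` correct attempts."""
    if not runs_per_task:
        return 0.0
    solved = sum(1 for runs in runs_per_task if sum(runs) >= required)
    return solved / len(runs_per_task)


def prob_at_least(p, attempts=5, required=4):
    """Chance of at least `required` successes in `attempts` tries,
    assuming independent attempts with success probability p."""
    return sum(
        comb(attempts, k) * p**k * (1 - p) ** (attempts - k)
        for k in range(required, attempts + 1)
    )


# Hypothetical results: four tasks, five attempts each (True = correct).
runs = [
    [True, True, True, True, False],     # 4 of 5: counts as consistently solved
    [True, False, True, False, False],   # 2 of 5: does not count
    [False, False, False, False, False], # 0 of 5: does not count
    [True, True, True, False, False],    # 3 of 5: does not count
]

print(average_accuracy(runs))          # share of correct attempts
print(consistently_solved_rate(runs))  # share of consistently solved tasks
print(prob_at_least(0.5))              # binomial estimate for p = 0.5
```

In this made-up sample, 45 percent of attempts are correct, yet only one task in four clears the four-out-of-five bar. Under the independence assumption, a model that solves a task half the time would pass the stricter test in fewer than one in five cases, which is one way to see why scores drop so sharply under this metric.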

This lack of robustness creates a serious challenge for research workflows. The models often produce answers that look convincing but contain subtle errors that are difficult to catch, which can easily mislead researchers and require time-consuming expert review.

The researchers argue that, for the foreseeable future, the more realistic goal is not an "AI scientist" replacing human experts, but a "research assistant" automating specific workflow steps. This matches current industry plans: OpenAI aims to ship a research intern system by September 2026 and a fully autonomous researcher by March 2028. The company claims that GPT-5 is already saving researchers time.

Summary
  • More than 50 physicists around the world have created a new benchmark called "CritPt" to test whether AI models can handle complex, unpublished physics research problems at the level of doctoral students.
  • The early results fall far short of that goal. In the current tests, Google's Gemini 3 Pro Preview performs the best, but it still reaches only 9.1 percent accuracy.
  • The models struggle with full research projects and often fail to reproduce their own solutions reliably. Even so, the researchers see room for these systems to act as research assistants, especially on well-defined subtasks where their limitations are easier to manage.
Sources
Paper | Minyang Tian via X | Artificial Analysis via X
