New benchmark shows LLMs still can't do real scientific research

Image: Nano Banana Pro, prompted by THE DECODER

Getting top marks on exams doesn't automatically make you a good researcher. A new study shows this academic truism applies to large language models too.

AI as a research accelerator represents one of the AI industry's biggest hopes. If generative AI models can speed up science or even drive new discoveries, the benefits for humanity—and the potential profits—would be enormous. OpenAI aims to ship an autonomous research assistant capable of exactly that by 2028.

A new study by an international team of more than 30 researchers from Cornell, MIT, Stanford, Cambridge, and other institutions shows there's still a long way to go. The first authors include researchers from Deep Principle, a Chinese company focused on applying AI to science.

The impressive results on established scientific benchmarks like GPQA or MMMU don't translate well to scenario-based research tasks. While GPT-5 hits 0.86 accuracy on the GPQA-Diamond benchmark, it only manages between 0.60 and 0.75 on the new Scientific Discovery Evaluation (SDE) benchmark, depending on the field.

The researchers trace this performance gap to a fundamental disconnect between decontextualized quiz questions and real scientific discovery. Actual research requires problem-based contextual understanding, iterative hypothesis generation, and interpreting incomplete evidence, skills that standard benchmarks don't measure.

Current benchmarks test the wrong skills

The problem, according to the researchers, lies in how existing science benchmarks like GPQA, MMMU, or ScienceQA are designed. They test isolated factual knowledge loosely connected to specific research areas. But scientific discovery works differently. It requires iterative thinking, formulating and refining hypotheses, and interpreting incomplete observations.

To address this gap, the team developed the SDE benchmark with 1,125 questions across 43 research scenarios in four domains: biology, chemistry, materials science, and physics. The key difference from existing tests is that each question ties to a specific research scenario drawn from actual research projects. Expert teams first defined realistic research scenarios from their own work, then developed questions that colleagues reviewed.

Three-part infographic: (a) Network diagram shows performance of GPT-3.5 to GPT-5 on benchmarks such as GPQA, AIME, SWE-bench, and Arena-hard, with question marks for the new SDE benchmark. (b) Traditional scientific evaluation with loose connections between domains and questions, examples of incorrect answers and irrelevant questions marked. (c) SDE framework as a four-level hierarchy: domain, project, scenario, question, with close links. Illustrated with researcher, molecular structure of artemisinin, laboratory equipment, and computer screen. Evaluation takes place at the project and question levels.
From knowledge tests to research simulations: conventional benchmarks only loosely link questions to subject areas and sometimes include incorrect or irrelevant tasks. The SDE framework ties each question closely to specific research scenarios and projects. Evaluation happens at both the question and project level, with LLMs working through the entire discovery cycle. | Image: Song et al.
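
To make the four-level structure concrete, below is a minimal Python sketch of how a single benchmark entry could be organized, domain → project → scenario → question. All class names, field names, and example values are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the SDE hierarchy (domain > project > scenario > question).
# Class and field names are assumptions, not the benchmark's real schema.

@dataclass
class Question:
    prompt: str
    reference_answer: str   # used for question-level scoring

@dataclass
class Scenario:
    name: str
    questions: list[Question] = field(default_factory=list)

@dataclass
class Project:
    name: str
    scenarios: list[Scenario] = field(default_factory=list)

@dataclass
class Domain:
    name: str
    projects: list[Project] = field(default_factory=list)

# A single made-up entry: chemistry -> one research project -> one scenario -> one question.
chemistry = Domain("chemistry", [
    Project("natural product synthesis", [
        Scenario("retrosynthesis planning", [
            Question("Propose a retrosynthetic route for the target molecule.", "route A"),
        ]),
    ]),
])
```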

The scenarios range from predicting chemical reactions and structure elucidation using NMR spectra to identifying causal genes in genome-wide association studies. This range aims to reflect what scientists actually need in their research.

Performance varies wildly across scenarios

The results show a general drop in performance compared to conventional benchmarks, plus extreme variation between different research scenarios. GPT-5 scores 0.85 in retrosynthesis planning but just 0.23 in NMR-based structure elucidation. This variance holds across all tested models.

For the researchers, this means benchmarks that only categorize questions by subject area aren't enough. Scientific discovery often fails at the weakest link in the chain. The SDE benchmark is designed to highlight language model strengths and weaknesses in specific research scenarios.

Scaling and reasoning hit diminishing returns

The study also tested whether the usual performance-boosting strategies—larger models and more compute time for reasoning—help with scientific discovery. The answer is mixed.

Reasoning does improve performance overall: DeepSeek-R1 outperforms DeepSeek-V3.1 in most scenarios, even though both share the same base model. When assessing Lipinski's rule of five, a rule of thumb for predicting oral bioavailability of drugs, reasoning boosts accuracy from 0.65 to 1.00.
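
For context, Lipinski's rule of five flags a compound as likely orally bioavailable when it stays within a handful of thresholds: molecular weight at most 500 Da, logP at most 5, no more than 5 hydrogen-bond donors, and no more than 10 hydrogen-bond acceptors, with one violation commonly tolerated. The sketch below just encodes those textbook cutoffs; the function and descriptor names are illustrative, not the benchmark's evaluation code.

```python
def passes_lipinski(mol_weight: float, log_p: float,
                    h_bond_donors: int, h_bond_acceptors: int,
                    max_violations: int = 1) -> bool:
    """Check Lipinski's rule of five from precomputed molecular descriptors.

    Textbook cutoffs: MW <= 500 Da, logP <= 5, <= 5 H-bond donors,
    <= 10 H-bond acceptors. One violation is commonly tolerated.
    (Illustrative helper, not the SDE benchmark's scoring code.)
    """
    violations = sum([
        mol_weight > 500,
        log_p > 5,
        h_bond_donors > 5,
        h_bond_acceptors > 10,
    ])
    return violations <= max_violations

# Example: aspirin (MW ~180 Da, logP ~1.2, 1 donor, 4 acceptors) passes easily.
print(passes_lipinski(180.2, 1.2, 1, 4))  # True
```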

But the researchers also found diminishing returns. For GPT-5, cranking reasoning effort from "medium" to "high" barely helps. The jump from o3 to GPT-5 shows only marginal progress too, with GPT-5 actually performing worse in eight scenarios.

Diminishing returns in scaling and reasoning: (a) More reasoning effort initially improves accuracy significantly but stagnates between "medium" and "high." (b) Larger models perform better, but the jump from o3 to GPT-5 is small. (c) Direct comparison shows GPT-5 only slightly outperforms o3; in some scenarios it actually does worse. | Image: Song et al.

The takeaway: the current strategy of increasing model size and test-time compute, which recently drove major progress in coding and math, is hitting its limits in scientific discovery.

Top models fail in the same ways

Another finding: the best models from different providers (GPT-5, Grok-4, DeepSeek-R1, and Claude-Sonnet-4.5) show highly correlated error profiles. In chemistry and physics, correlation coefficients between all model pairs exceed 0.8. The models often converge on the same wrong answers, especially for the hardest questions.

The researchers see this as evidence of similar training data and optimization targets rather than architectural differences. In practice, ensemble strategies like majority voting between different models probably won't help much on the toughest questions.
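
One way to read "correlated error profiles": if each model's results are stored as a per-question vector of right/wrong, the correlation between two models is simply the correlation of those binary vectors, and majority voting only helps where at least one model breaks from the pack. A minimal sketch with made-up data (the paper's exact metric may differ):

```python
import numpy as np
from collections import Counter

# Per-question correctness (1 = correct, 0 = wrong) for four models on the same
# questions. Values are invented for illustration, not taken from the paper.
correct = {
    "gpt-5":             np.array([1, 0, 1, 0, 0, 1, 0, 0]),
    "grok-4":            np.array([1, 0, 1, 0, 0, 1, 0, 1]),
    "deepseek-r1":       np.array([1, 0, 0, 0, 0, 1, 0, 0]),
    "claude-sonnet-4.5": np.array([1, 0, 1, 0, 1, 1, 0, 0]),
}

# Pairwise Pearson correlation of the correctness vectors: values near 1 mean
# the models tend to fail on the same questions.
names = list(correct)
profiles = np.array([correct[n] for n in names])
corr = np.corrcoef(profiles)
print(dict(zip(names, corr[0].round(2))))  # correlations of GPT-5 with the rest

# Majority voting over the models' answers changes nothing on questions where
# all models converge on the same wrong answer.
def majority_vote(answers: list[str]) -> str:
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["B", "B", "C", "B"]))  # "B" - the shared answer wins
```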

Two-part illustration of error correlation between four large language models in two scientific scenarios. The top section shows grid plots in which rows represent the models (Claude-Sonnet-4.5, DeepSeek-R1, Grok-4, GPT-5) and columns represent individual questions. Red dots indicate incorrect answers, green dots indicate correct answers. On the left, in the “TMC property prediction” scenario, most points are red, and many columns are completely red, meaning that all four models answer the same question incorrectly. On the right, in the “MOF synthesis” scenario, most points are green, but there are some columns in which all points are red, indicating common errors. Below, two donut charts show the distribution of how many models were correct per question: For TMC property prediction, 40% of the questions were answered correctly by no model, 30% by one, 20% by two, and 10% by three models; no question was answered correctly by all four. For MOF synthesis, 72.7% of the questions were answered correctly by all four models, 18.2% by none, and 4.55% by two or three models, respectively.
Common errors across four top LLMs in two chemistry scenarios. Top: Each column represents a question, each row a model; green = correct, red = incorrect. On the left (TMC property prediction) models are often wrong together; on the right (MOF synthesis) they're mostly right together but still share some failures. The donut charts below show how many questions 0-4 models answered correctly. | Image: Song et al.

To isolate these weaknesses, the team created a subset called SDE-hard with 86 particularly difficult questions. All standard models score below 0.12 accuracy there. Only GPT-5-pro, which costs twelve times more, reaches 0.224 and correctly answers nine questions where all others fail.

Project-level testing reveals more gaps

Beyond individual questions, the SDE framework also evaluates performance at the project level. Here, models work through a real scientific discovery cycle: formulating hypotheses, running experiments, and interpreting results to refine their hypotheses.
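
Conceptually, the project-level evaluation puts the model inside a loop: propose a hypothesis, test it, fold the feedback into the next proposal. The toy sketch below illustrates that loop with a stand-in objective (local search toward an optimum); every function and the scoring rule are hypothetical placeholders, not the paper's actual harness or tasks.

```python
import random

def run_experiment(candidate: int) -> float:
    """Stand-in for a real experiment or simulation returning a score."""
    return -(candidate - 42) ** 2 + random.gauss(0, 1)

def propose_candidate(history: list[tuple[int, float]]) -> int:
    """Stand-in for the model's next hypothesis: perturb the best candidate so far."""
    if not history:
        return random.randint(0, 100)
    best, _ = max(history, key=lambda pair: pair[1])
    return max(0, min(100, best + random.randint(-5, 5)))

def discovery_loop(rounds: int = 50) -> tuple[int, float]:
    history: list[tuple[int, float]] = []
    for _ in range(rounds):
        candidate = propose_candidate(history)      # formulate a hypothesis
        score = run_experiment(candidate)           # run the experiment
        history.append((candidate, score))          # keep the evidence for the next round
    return max(history, key=lambda pair: pair[1])   # best hypothesis found

print(discovery_loop())  # e.g. (42, -0.3): the loop converges near the optimum
```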

The eight projects examined range from protein design and gene editing to retrosynthesis, molecular optimization, and symbolic regression. The key finding: no single model dominates all projects. Leadership shifts depending on the task.

Bar chart shows average accuracy of eleven language models across all domains. GPT-5 achieves 0.658, Claude Sonnet 4.5 follows with 0.637, o3 with 0.627, Grok 4 with 0.619, Claude Opus 4.1 with 0.610, DeepSeek R1 with 0.600, Grok 3 with 0.515, DeepSeek V3.1 with 0.506, GPT-5-chat-latest with 0.491, GPT-4o with 0.489, and Claude-Sonnet-4 with 0.447.
Average accuracy across all 43 research scenarios: GPT-5 leads at 0.658, followed by Claude-Sonnet-4.5 and o3. Differences between top models are small; older models like GPT-4o and Claude-Sonnet-4 fall significantly behind. | Image: Song et al.

Surprisingly, strong question performance doesn't automatically translate to strong project performance. When optimizing transition metal complexes, GPT-5, DeepSeek-R1, and Claude-Sonnet-4.5 find optimal solutions from millions of possibilities despite performing poorly on related knowledge questions. Conversely, models fail at retrosynthesis planning despite good question scores because their proposed synthesis paths don't actually work.

The researchers' interpretation: what matters isn't just precise specialist knowledge, but the ability to systematically explore large solution spaces and find promising approaches, even ones that weren't targeted from the start.

LLMs are far from scientific superintelligence but still useful

The study's conclusion is clear: no current language model comes close to scientific "superintelligence." But that doesn't mean they're useless. LLMs already perform well on specific projects, especially when paired with specialized tools and human guidance. They can plan and run experiments, sift through massive search spaces, and surface promising candidates that researchers might never have considered.

To close the gap, the researchers recommend shifting focus from pure scaling to targeted training for problem formulation and hypothesis generation. They also call for diversifying pre-training data to reduce shared error profiles, integrating tool use into training, and developing reinforcement learning strategies that specifically target scientific reasoning. Current optimizations for coding and math don't seem to transfer automatically to scientific discovery.

The framework and benchmark data will serve as a resource for pushing language model development toward scientific discovery. The study currently covers only four domains; fields like geosciences, social sciences, and engineering aren't included yet, but the framework's modular architecture allows for future expansion. The team has made the question-level code and evaluation scripts as well as project-level datasets publicly available.

A few days ago, OpenAI released its own benchmark, FrontierScience, designed to measure AI performance in science beyond simple question-and-answer tests. The result was similar: quiz knowledge doesn't equal research expertise.

Source: Paper
