
A new study in JAMA Network Open raises fresh doubts about whether large language models (LLMs) can actually reason through medical cases or are simply matching patterns they've seen before. The researchers conclude that these models aren't ready for clinical work.


Researchers led by Suhana Bedi started with 100 questions from the MedQA benchmark, a standard multiple-choice test for medical knowledge. For each question, they swapped the correct answer for "None of the other answers" (NOTA).

A clinical expert reviewed every modified question to confirm that NOTA was the only correct answer. In the end, 68 questions met this standard. To get these right, LLMs had to recognize that none of the usual options applied and pick NOTA instead. This set up a direct test: can LLMs actually reason, or are they just following familiar answer patterns from training?
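
To make the setup concrete, here is a minimal sketch of that substitution step in Python. The field names and the exact NOTA wording are assumptions for illustration; the study's actual pipeline and the expert-review step are not reproduced here.

```python
# Minimal sketch of the NOTA substitution described above (not the authors' code).
# The dictionary fields ("question", "options", "answer_idx") are assumed for
# illustration; real MedQA items may be structured differently.

NOTA = "None of the other answers"

def substitute_nota(item: dict) -> dict:
    """Overwrite the originally correct option with NOTA, so that NOTA
    becomes the only correct answer to the modified question."""
    options = list(item["options"])
    options[item["answer_idx"]] = NOTA
    return {
        "question": item["question"],
        "options": options,
        "answer_idx": item["answer_idx"],  # NOTA now occupies this slot
    }

if __name__ == "__main__":
    example = {
        "question": "A 54-year-old presents with chest pain. What is the best next step?",
        "options": ["Option A", "Option B", "Option C", "Option D"],
        "answer_idx": 2,  # originally correct choice
    }
    print(substitute_nota(example))
```

In the study, each modified item was additionally checked by a clinical expert, which is why only 68 of the 100 questions made it into the final set.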

Making "None of the other answers" the right choice can trip up language models. | Image: Bedi et al.

Small changes, big drop in accuracy

Every model saw its accuracy drop when faced with the revised questions, but some struggled much more than others. Standard LLMs like Claude 3.5 (-26.5 percentage points), Gemini 2.0 (-33.8), GPT-4o (-36.8), and LLaMA 3.3 (-38.2) all took a major hit.


Reasoning-focused models like DeepSeek-R1 (-8.8) and o3-mini (-16.2) held up better but still lost ground. The researchers also tried "chain-of-thought" prompts, asking models to lay out their reasoning step by step, but even this didn't help the models reliably reach the correct medical answer.
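
For readers unfamiliar with the technique, a chain-of-thought prompt simply instructs the model to reason before committing to an answer. The sketch below is a generic illustration of such a prompt for a multiple-choice item; it is not the wording used in the study.

```python
# Generic chain-of-thought prompt builder for a multiple-choice question.
# Illustrative only; the study's exact prompt wording is not shown here.

def build_cot_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with an instruction to reason step by step."""
    lettered = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n\n{lettered}\n\n"
        "Work through the clinical findings step by step, explicitly rule out "
        "each option that does not fit, and finish with the single best answer "
        "as a letter."
    )

if __name__ == "__main__":
    print(build_cot_prompt(
        "Which option best explains the patient's presentation?",
        ["Option A", "Option B", "Option C", "None of the other answers"],
    ))
```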

When answer options are changed, all language models lose accuracy. General-purpose models drop much more than reasoning-optimized systems like DeepSeek and o3-mini. | Image: Bedi et al.

According to the authors, these results highlight a core problem: today's models mostly rely on statistical pattern matching, not genuine reasoning. Some dropped from 80 to 42 percent accuracy with only minor changes to the questions. That makes them risky for medical practice, where unusual or complex cases are common.

Doctors frequently encounter rare conditions or unexpected symptoms that don't fit textbook patterns. If LLMs are just matching familiar answers instead of reasoning through each case, they're likely to miss or misinterpret these outliers. The findings call into question whether current LLMs are robust or reliable enough for clinical use, since medicine demands systems that can handle ambiguity and adapt to new situations.

Language models are easily thrown off

It's well known that LLMs can give completely different answers if the prompt changes slightly or includes irrelevant info. Even reasoning-focused models aren't immune to this problem.

But it's still not clear if these systems truly lack logical reasoning skills or just can't apply them reliably. Right now, debates about LLM "reasoning" are bogged down by vague definitions and fuzzy benchmarks, which makes it tough to judge what these models can actually do.


The study also didn't include the very latest reasoning models like GPT-5-Thinking or Gemini 2.5 Pro, which might do better. DeepSeek-R1 and o3-mini are current for their class but may still lag behind the most advanced systems. Still, their stronger performance here suggests there's a path toward more robust, reasoning-capable LLMs.

Summary
  • A study shows that large language models lose significant accuracy on minimally changed but logically identical medical questions, raising doubts about their suitability for clinical practice.
  • While standard models such as GPT-4o and LLaMA 3.3 fell particularly sharply, reasoning-optimized models such as DeepSeek-R1 and o3-mini proved more robust, though they also lost performance.
  • According to the researchers, the results indicate that current language models rely primarily on pattern recognition rather than medical reasoning.