
Large language models excel at medical exams but fall short with real patients, Oxford study finds.


A new study from the University of Oxford suggests that while large language models (LLMs) perform on par with physicians in medical exams, their effectiveness drops sharply when interacting with real patients. The randomized controlled trial, conducted by the Oxford Internet Institute, involved 1,298 participants from the United Kingdom.

Participants were asked to work through ten realistic medical scenarios. In each case, they had to identify a possible cause of their symptoms and select an appropriate course of action, ranging from self-care to visiting a primary care doctor or seeking emergency care. Some participants received assistance from a large language model, while others were free to look up information on their own, for example with a Google search.

The models evaluated included GPT-4o from OpenAI, Llama 3 from Meta, and Command R+ from Cohere.


Models deliver strong answers—until humans enter the loop

When prompted directly in the test scenarios, the language models returned convincing answers. GPT-4o identified at least one relevant condition in 94.7% of cases and recommended the right action 64.7% of the time. Llama 3 achieved 99.2% accuracy in identifying conditions, while Command R+ reached 90.8%. Their recommendations for next steps were correct 48.8% and 55.5% of the time, respectively.

However, once real users began interacting with the models, results deteriorated sharply. With LLM assistance, participants identified a relevant condition in at most 34.5% of cases, fewer than in the control group, which researched on its own (47%). Participants assisted by the models also failed to outperform the control group in choosing the correct course of action.

Human factors limit the benefits of LLMs

The study attributes this shortfall to how humans and machines interact, not to the models themselves. Users frequently provided incomplete information or misunderstood the language models' responses. For example, Command R+ missed a diagnosis of gallstones when a participant left out details about pain location. In another instance, GPT-4o categorized nearly identical symptoms as an emergency in one case and as a minor headache in another.

The models suggested an average of 2.2 possible diagnoses, but users adopted only about 1.33 of these suggestions, and only about a third of those choices were correct. According to the researchers, this pattern points to a transfer problem: the language models provide medical knowledge, but it does not reach the users.

Call for real-world user testing

The study concludes that LLMs are not yet suitable as a first point of contact for patients. Simulation-based benchmarks are not a substitute for trials involving real users. The models will need to learn to proactively seek necessary information and communicate recommendations more clearly.


"For a public-facing medical LLM to exist, we expect that it would need to be proactive in managing and requesting information rather than relying on the user to have the expertise to guide the interaction," the paper states. The study also notes that hopes for LLMs to supplement overstretched healthcare systems may need to be reconsidered in light of these findings.

Limitations of lab-based testing

The study also raises questions about standard evaluation methods. The LLMs tested performed well on the MedQA benchmark, which draws on medical licensing exam questions, but this did not translate into better performance during user interactions. Simulated users—LLMs posing as patients—also outperformed real humans, but did not accurately reflect actual user behavior.

Still, the Oxford study itself was conducted under controlled conditions. Participants simulated illnesses according to a set script, which may be closer to reality than vignette-style tests but still differs from how genuinely sick people might communicate.

Additionally, the models were evaluated through API interfaces rather than interactive chat platforms like ChatGPT, which are designed for more supportive, dialog-based guidance. Prompts lacked contextual system instructions, which can significantly affect how a model interacts with users. The study also did not examine reasoning models or techniques such as chain-of-thought prompting. A recent paper showed that OpenAI's o1-preview outperformed humans in diagnosing difficult medical cases, albeit in a vignette-style comparison.
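To illustrate what a "contextual system instruction" changes in practice, here is a minimal sketch using the OpenAI Python SDK. The model name, prompts, and triage wording are illustrative assumptions, not the study's actual setup.

```python
# Minimal sketch (not the study's actual prompts): querying a model over the API
# with and without a contextual system instruction. Model name, prompt text, and
# triage wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_prompt = "I have a severe headache and feel nauseous. What should I do?"

# Bare prompt, roughly comparable to a plain API query without system context.
bare = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_prompt}],
)

# Same question, but with a system instruction telling the model to ask
# follow-up questions before recommending a course of action.
guided = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a cautious medical triage assistant. Ask clarifying "
                "questions about symptom location, duration, and severity "
                "before suggesting self-care, a GP visit, or emergency care."
            ),
        },
        {"role": "user", "content": user_prompt},
    ],
)

print(bare.choices[0].message.content)
print(guided.choices[0].message.content)
```

In chat products like ChatGPT, such system-level guidance is typically applied behind the scenes, which is one reason raw API evaluations may not reflect what consumers actually experience.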


Nevertheless, the study clearly shows that the use of AI in medicine remains difficult, and that the interaction between humans and large language models continues to be a central obstacle.

Summary
  • A University of Oxford study found that large language models like GPT-4o, Llama 3, and Command R+ performed well on medical exam-style questions, but their effectiveness dropped sharply when used by real people facing simulated medical scenarios.
  • When participants relied on these models to help diagnose symptoms or choose what to do next, their accuracy was lower than that of those who did their own research, mainly because users provided incomplete information or misunderstood the models’ answers.
  • The researchers conclude that LLMs are not yet ready to serve as a first contact for patients, and that lab-based benchmarks don’t reflect real-world use—future models must be better at proactively gathering information and giving clear, actionable advice.