Large language models excel at medical exams but fall short with real patients, Oxford study finds.
A new study from the University of Oxford suggests that while large language models (LLMs) perform on par with physicians in medical exams, their effectiveness drops sharply when interacting with real patients. The randomized controlled trial, conducted by the Oxford Internet Institute, involved 1,298 participants from the United Kingdom.
Participants were asked to work through ten realistic medical scenarios. In each case, they had to identify a possible cause for their symptoms and select an appropriate course of action, ranging from self-care to visiting a primary care doctor or seeking emergency care. Some participants received assistance from a large language model, while others were free to look up information on their own, for example using Google search.
The models evaluated included GPT-4o from OpenAI, Llama 3 from Meta, and Command R+ from Cohere.
Models deliver strong answers—until humans enter the loop
When prompted directly in the test scenarios, the language models returned convincing answers. GPT-4o identified at least one relevant condition in 94.7% of cases and recommended the right action 64.7% of the time. Llama 3 achieved 99.2% accuracy in identifying conditions, while Command R+ reached 90.8%. Their recommendations for next steps were correct 48.8% and 55.5% of the time, respectively.
However, once users began interacting with the models, results deteriorated significantly. With LLM assistance, participants named a relevant condition in at most 34.5% of cases, depending on the model, compared with 47% in the control group, who conducted their own research. Users assisted by the models also failed to outperform the control group in choosing the correct course of action.
Human factors limit the benefits of LLMs
The study attributes this shortfall to how humans and machines interact, not to the models themselves. Users frequently provided incomplete information or misunderstood the language models' responses. For example, Command R+ missed a diagnosis of gallstones when a participant left out details about pain location. In another instance, GPT-4o categorized nearly identical symptoms as an emergency in one case and as a minor headache in another.
On average, the models suggested 2.2 possible diagnoses, but users adopted only about 1.33 of these suggestions, and only about a third of those choices were correct. According to the researchers, this pattern points to a transfer problem: the language models provide medical knowledge, but it does not reach the users.
Call for real-world user testing
The study concludes that LLMs are not yet suitable as a first point of contact for patients. Simulation-based benchmarks are not a substitute for trials involving real users. The models will need to learn to proactively seek necessary information and communicate recommendations more clearly.
"For a public-facing medical LLM to exist, we expect that it would need to be proactive in managing and requesting information rather than relying on the user to have the expertise to guide the interaction," the paper states. The study also notes that hopes for LLMs to supplement overstretched healthcare systems may need to be reconsidered in light of these findings.
Limitations of lab-based testing
The study also raises questions about standard evaluation methods. The LLMs tested performed well on the MedQA benchmark, which draws on medical licensing exam questions, but this did not translate into better performance during user interactions. Simulated users—LLMs posing as patients—also outperformed real humans, but did not accurately reflect actual user behavior.
Still, the Oxford study itself was conducted under controlled conditions. Participants simulated illnesses according to a set script, which may be closer to reality than vignette-style tests but still differs from how genuinely sick people might communicate.
Additionally, the models were evaluated through API interfaces, not via interactive chat platforms like ChatGPT, which are designed for more supportive, dialog-based guidance. Prompts lacked contextual system instructions, which can significantly affect how a model interacts with users. The study also did not examine reasoning models or techniques such as "Chain-of-Thought" prompting. A recent paper showed that OpenAI's o1-preview outperformed humans in diagnosing difficult medical cases, albeit in a vignette-style comparison rather than with real users.
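To make the point about system instructions concrete, here is a minimal sketch using OpenAI's Python client that contrasts a bare API call, roughly how a model might be queried in an evaluation, with the same request wrapped in a contextual system prompt. The model name, user message, and system instruction are illustrative assumptions, not the study's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "I have a splitting headache and feel sick. What should I do?"

# Bare API call: the model only sees the participant's message,
# with no guidance on how to conduct the conversation.
bare = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Same request with a contextual system instruction, closer to how
# consumer chat products steer a model's behavior.
guided = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a cautious medical triage assistant. Ask follow-up "
                "questions about symptom location, duration, and severity "
                "before suggesting a course of action, and state clearly "
                "when emergency care is needed."
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(bare.choices[0].message.content)
print(guided.choices[0].message.content)
```

Chat products typically layer instructions like the second variant on top of the raw model, which is one reason results from bare API calls may not transfer directly to tools like ChatGPT.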
Nevertheless, the study clearly shows that using AI in medicine remains difficult, and that the interaction between humans and large language models continues to be a central obstacle.