Large language models excel at medical exams but fall short with real patients, Oxford study finds.
A new study from the University of Oxford suggests that while large language models (LLMs) perform on par with physicians in medical exams, their effectiveness drops sharply when interacting with real patients. The randomized controlled trial, conducted by the Oxford Internet Institute, involved 1,298 participants from the United Kingdom.
Participants were asked to work through ten realistic medical scenarios. In each case, they had to identify a possible cause for their symptoms and select an appropriate course of action, ranging from self-care to visiting a primary care doctor or seeking emergency care. Some participants received assistance from a large language model, while others were free to look up information on their own, for example using Google search.
The models evaluated included GPT-4o from OpenAI, Llama 3 from Meta, and Command R+ from Cohere.
Models deliver strong answers—until humans enter the loop
When prompted directly in the test scenarios, the language models returned convincing answers. GPT-4o identified at least one relevant condition in 94.7% of cases and recommended the right action 64.7% of the time. Llama 3 achieved 99.2% accuracy in identifying conditions, while Command R+ reached 90.8%. Their recommendations for next steps were correct 48.8% and 55.5% of the time, respectively.
However, once users began interacting with the models, results deteriorated significantly. With LLM assistance, participants named a relevant condition in at most 34.5% of cases, depending on the model, compared with 47% in the control group, who conducted their own research. Users assisted by the models also failed to outperform the control group in choosing the correct course of action.
Human factors limit the benefits of LLMs
The study attributes this shortfall to how humans and machines interact, not to the models themselves. Users frequently provided incomplete information or misunderstood the language models' responses. For example, Command R+ missed a diagnosis of gallstones when a participant left out details about pain location. In another instance, GPT-4o categorized nearly identical symptoms as an emergency in one case and as a minor headache in another.
On average, the models suggested 2.2 possible diagnoses, but users adopted only about 1.33 of these suggestions, and only about a third of those choices were correct. According to the researchers, this pattern points to a transfer problem: the language models provide medical knowledge, but it does not reach the users.
Call for real-world user testing
The study concludes that LLMs are not yet suitable as a first point of contact for patients. Simulation-based benchmarks are not a substitute for trials involving real users. The models will need to learn to proactively seek necessary information and communicate recommendations more clearly.
"For a public-facing medical LLM to exist, we expect that it would need to be proactive in managing and requesting information rather than relying on the user to have the expertise to guide the interaction," the paper states. The study also notes that hopes for LLMs to supplement overstretched healthcare systems may need to be reconsidered in light of these findings.
Limitations of lab-based testing
The study also raises questions about standard evaluation methods. The LLMs tested performed well on the MedQA benchmark, which draws on medical licensing exam questions, but this did not translate into better performance during user interactions. Simulated users—LLMs posing as patients—also outperformed real humans, but did not accurately reflect actual user behavior.
Still, the Oxford study itself was conducted under controlled conditions. Participants simulated illnesses according to a set script, which may be closer to reality than vignette-style tests but still differs from how genuinely sick people might communicate.
Additionally, the models were evaluated through API interfaces, not via interactive chat platforms like ChatGPT, which are designed for more supportive, dialog-based guidance. Prompts lacked contextual system instructions, which can significantly affect how a model interacts with users. The study also did not examine reasoning models or techniques such as "Chain-of-Thought" prompting. A recent paper showed that OpenAI's o1-preview outperformed humans in diagnosing difficult medical cases, albeit in a vignette-style comparison rather than with real users.
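To make the point about system instructions concrete, here is a minimal sketch using OpenAI's Python client that contrasts a bare API call, roughly how a model might be queried in an evaluation, with the same request wrapped in a contextual system prompt. The model name, user message, and system instruction are illustrative assumptions, not the study's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "I have a splitting headache and feel sick. What should I do?"

# Bare API call: the model only sees the participant's message,
# with no guidance on how to conduct the conversation.
bare = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Same request with a contextual system instruction, closer to how
# consumer chat products steer a model's behavior.
guided = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a cautious medical triage assistant. Ask follow-up "
                "questions about symptom location, duration, and severity "
                "before suggesting a course of action, and state clearly "
                "when emergency care is needed."
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(bare.choices[0].message.content)
print(guided.choices[0].message.content)
```

Chat products typically layer instructions like the second variant on top of the raw model, which is one reason results from bare API calls may not transfer directly to tools like ChatGPT.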
Nevertheless, the study clearly shows that using AI in medicine remains difficult, and that the interaction between humans and large language models continues to be a central obstacle.