OpenAI has released a new benchmark for testing AI systems in healthcare. Called HealthBench, it's designed to evaluate how well language models handle realistic medical conversations. According to OpenAI, its latest models outperform doctors on the test.
The company says earlier benchmarks fell short—they didn't reflect real-world doctor-patient interactions, lacked input from medical experts, or weren't detailed enough to measure progress in newer models. To fix that, OpenAI collaborated with 262 doctors from 60 countries. Together, they created 5,000 realistic medical scenarios covering 26 specialties and 49 languages.
HealthBench spans seven medical domains, from emergency medicine to global health. Each AI response is rated along five axes: accuracy, completeness, communication quality, contextual understanding, and instruction-following. Altogether, the benchmark draws on more than 48,000 physician-written rubric criteria.
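To make the rubric idea concrete, here is a minimal sketch of how criterion-based scoring of this kind can work: each criterion carries a point value (positive for desirable behavior, negative for harmful behavior), a grader judges whether the response meets it, and the score is the earned points divided by the maximum achievable positive points, clipped to [0, 1]. The class and function names are illustrative, not OpenAI's code.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "Recommends urgent evaluation for red-flag symptoms"
    points: int        # positive for desirable behavior, negative for harmful behavior
    met: bool          # whether the grader judged the criterion as satisfied

def score_response(criteria: list[RubricCriterion]) -> float:
    """Points earned divided by maximum positive points, clipped to [0, 1]."""
    earned = sum(c.points for c in criteria if c.met)
    max_positive = sum(c.points for c in criteria if c.points > 0)
    if max_positive == 0:
        return 0.0
    return min(max(earned / max_positive, 0.0), 1.0)

# Hypothetical example: two positive criteria met, one negative criterion triggered.
criteria = [
    RubricCriterion("Advises urgent evaluation for red-flag symptoms", 10, True),
    RubricCriterion("Asks a clarifying question about symptom duration", 5, True),
    RubricCriterion("States a specific diagnosis with unwarranted certainty", -5, True),
]
print(score_response(criteria))  # ~0.67
```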
The scoring is handled by GPT-4.1. To check how reliable that is, OpenAI compared the model's evaluations with those of human doctors. The results showed that GPT-4.1's judgments matched human assessments at about the same level of agreement you'd see between different doctors.
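One simple way to picture that comparison is pairwise agreement on the same criterion-level met/not-met judgments: how often two graders reach the same verdict. The sketch below uses plain percent agreement, which is an assumed stand-in for whatever statistic OpenAI actually reports, and the data is made up.

```python
def pairwise_agreement(judgments_a: list[bool], judgments_b: list[bool]) -> float:
    """Fraction of rubric criteria where two graders reach the same met/not-met verdict."""
    assert len(judgments_a) == len(judgments_b)
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return matches / len(judgments_a)

# Hypothetical criterion-level judgments from the model grader and two physicians.
model_grader = [True, True, False, True, False]
physician_1  = [True, True, False, False, False]
physician_2  = [True, False, False, True, False]

print(pairwise_agreement(model_grader, physician_1))  # 0.8
print(pairwise_agreement(model_grader, physician_2))  # 0.8
print(pairwise_agreement(physician_1, physician_2))   # 0.6
```

If the model-physician numbers land in the same range as the physician-physician numbers, the automated grader is behaving about as consistently as another doctor would.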
GPT-4.1 and o3 outscore doctors—at least on this test
OpenAI says its latest models—GPT-4.1 and o3—outperformed physician responses on the HealthBench benchmark. In early tests from September 2024, doctors could improve older model outputs by editing them, while unaided doctor responses scored the lowest. But by April 2025, the results had shifted: GPT-4.1 and o3 outperformed physicians even without any additional input or refinement.

OpenAI notes that the comparison has important limitations. Doctors don't typically write chat-style responses to medical questions, so the benchmark doesn't reflect how clinical care actually works. Instead, it tests how well language models handle a very specific kind of communication—one that may play to AI's strengths more than a physician's.
In terms of raw scores, the o3 model reached 0.60 on the benchmark, nearly twice the 0.32 scored by GPT-4o in August 2024. Only a few competing models came close: xAI's Grok 3 scored 0.54, and Google's Gemini 2.5 hit 0.52.

Measuring worst-case reliability
In healthcare, a single wrong answer can outweigh dozens of correct ones. HealthBench therefore includes a stress test for worst-case performance: if a model answers the same question several times, how bad is its weakest response? OpenAI says its latest models show major improvements here too, but acknowledges there is still work to be done.
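A plausible way to compute such a worst-of-k metric is sketched below: sample several responses per question, repeatedly draw k of them, keep the lowest score in each draw, and average the result across questions. This is an illustration of the general idea under assumed details, not OpenAI's exact procedure, and the numbers are invented.

```python
import random

def worst_of_k(scores_per_question: list[list[float]], k: int, trials: int = 1000) -> float:
    """For each question, repeatedly draw k of its sampled response scores and keep
    the lowest one, then average those worst-case scores across questions."""
    per_question = []
    for scores in scores_per_question:
        worst_draws = [min(random.sample(scores, k)) for _ in range(trials)]
        per_question.append(sum(worst_draws) / trials)
    return sum(per_question) / len(per_question)

# Hypothetical per-question scores from several sampled responses.
scores = [
    [0.9, 0.8, 0.4],   # one weak response drags the worst-case score down
    [0.7, 0.7, 0.6],
]
print(worst_of_k(scores, k=2))
```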

Efficiency is another focus. OpenAI says its compact GPT-4.1 nano model is 25 times more cost-effective than the earlier GPT-4o from August 2024, while also delivering better results. That could make it more accessible in low-resource settings.
To support further testing, OpenAI has released two additional datasets: HealthBench Consensus and HealthBench Hard. The "Consensus" set includes only highly validated criteria, while the "Hard" set features 1,000 especially difficult cases where most models still fail.
All test data and evaluation methods are available on GitHub. OpenAI has also published a detailed paper and is encouraging researchers to build on the benchmark.