
Google Deepmind's "AI co-clinician" beats GPT-5.4 in blind doctor tests but still trails experienced physicians

Image: Nano Banana Pro prompted by THE DECODER

Key Points

  • Google Deepmind is building an "AI co-clinician" designed to assist doctors in diagnosing and treating patients during everyday medical care.
  • In blind evaluations using realistic general practitioner scenarios, physicians rated the system's responses higher than those from other AI tools, including GPT-5.4-thinking-with-search.
  • Despite these promising results, a simulation study showed that experienced doctors still outperformed the AI, particularly when it came to identifying critical warning signs and guiding physical examinations.

Google Deepmind is building an "AI co-clinician" to help doctors care for patients. The system shows promising results in simulation studies but still trails experienced physicians. The research also shows why ChatGPT's voice mode isn't ready for serious tasks, let alone medical consultations.

The "AI co-clinician" is built around what the researchers call "triadic care": AI agents help patients through their treatment while doctors keep clinical authority and oversight. The idea is to have an AI system that works as a member of the medical team, supporting patients under a clinician's supervision.

To evaluate the system from a clinician's perspective, the team worked with academic physicians to adapt the NOHARM framework, checking for two types of mistakes: errors of commission and errors of omission.

In a blind comparison using 98 realistic primary care queries, doctors consistently picked the AI co-clinician's answers over leading evidence synthesis tools. It won 67 to 26 against an existing clinical AI system and 63 to 30 against GPT-5.4-thinking-with-search. In the objective analysis, the system logged a critical error in one of the 98 cases.


Bar chart showing physician preferences in a blind comparison: AI co-clinician was preferred in 67 of 98 cases over an existing clinical AI agent and in 63 cases over GPT-5.4-thinking-with-search, with 5 neutral ratings each.
In a blind comparison of 98 realistic primary care queries, doctors preferred the AI co-clinician's answers over an existing clinical AI agent (67 to 26) and GPT-5.4-thinking-with-search (63 to 30). | Image: Google Deepmind
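The head-to-head counts above can be turned into shares with a few lines of arithmetic. The following is a minimal sketch in Python, using only the figures reported in the chart (wins, losses, and 5 neutral ratings out of 98 queries). The function name `preference_summary` and the "decided" share, which leaves out neutral ratings, are illustrative choices and not metrics from the study.

```python
# Counts as reported in the article: (co-clinician preferred, other preferred, neutral)
TOTAL_QUERIES = 98
comparisons = {
    "existing clinical AI agent": (67, 26, 5),
    "GPT-5.4-thinking-with-search": (63, 30, 5),
}


def preference_summary(wins, losses, neutral, total=TOTAL_QUERIES):
    """Return preference shares; hypothetical helper, not part of the study."""
    if wins + losses + neutral != total:
        raise ValueError("counts do not add up to the number of queries")
    return {
        "win_share": wins / total,                     # share of all queries
        "neutral_share": neutral / total,
        "win_share_decided": wins / (wins + losses),   # neutral ratings excluded
    }


for name, counts in comparisons.items():
    s = preference_summary(*counts)
    print(
        f"vs {name}: preferred in {s['win_share']:.1%} of all queries, "
        f"{s['win_share_decided']:.1%} of decided ones"
    )
```

Under these assumptions, the co-clinician was preferred in roughly 68 percent of all queries against the existing clinical AI agent and roughly 64 percent against GPT-5.4-thinking-with-search.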

The system also led on medication questions. The RxQA benchmark covers 600 questions on active ingredients, interactions, and dosages, drawn from national drug directories in two countries and vetted by licensed pharmacists. These questions are tough for primary care doctors: with reference books, they got 61.3 percent right, and just 48.3 percent without.

The AI co-clinician scored 73.3 percent, just ahead of GPT-5.4-thinking-with-search at 72.7 percent. The gap widened when questions were asked open-ended rather than as multiple choice, the way doctors actually look things up on the job. Here the AI co-clinician hit a quality score of 95.0 percent, compared to 90.9 percent for OpenAI's model.

Multimodal telemedicine puts AI in the exam room

Beyond text-based support, Google Deepmind is testing how the AI co-clinician handles real-time audio and video for telemedicine. Working with physicians at Harvard and Stanford, the team ran a randomized simulation study with 20 synthetic clinical scenarios, 10 doctors acting as patients, and 120 hypothetical telemedicine visits in total.

The AI co-clinician showed capabilities that go beyond what text-only systems can do. It corrected a patient's inhaler technique and walked patients through shoulder exams to spot a rotator cuff injury.

For patient-facing conversations, the AI co-clinician runs on a dual-agent setup: a "Planner" module watches the conversation to make sure the "Talker" agent stays within safe clinical limits. When doctors use the system, it prioritizes solid clinical evidence and runs verification and citation checks during lookups.
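To make the dual-agent idea concrete, here is a minimal sketch in Python of one agent drafting a reply and a second agent reviewing it before it reaches the patient. Everything in it is a hypothetical stand-in: the names `talker`, `planner`, and `respond`, the canned replies, the blocked-phrase list, and the fallback message are assumptions for illustration. The article does not describe how Google Deepmind's Planner and Talker are implemented, and a real system would use models rather than keyword checks.

```python
from dataclasses import dataclass

# Hypothetical phrases the reviewing agent treats as outside safe clinical limits.
BLOCKED_PHRASES = ("stop taking your medication", "no need to see a doctor")
# Hypothetical safe reply used when a draft is rejected.
FALLBACK = "I'd like your clinician to review this before I advise further."


@dataclass
class Review:
    approved: bool
    reason: str


def talker(patient_message: str) -> str:
    """Stand-in for the conversational agent: returns a canned draft reply."""
    if "chest pain" in patient_message.lower():
        return "It is probably nothing, so there is no need to see a doctor."
    return "Thanks for sharing that. Can you tell me when the symptoms began?"


def planner(draft: str) -> Review:
    """Stand-in for the supervising agent: rejects drafts with blocked phrases."""
    lowered = draft.lower()
    for phrase in BLOCKED_PHRASES:
        if phrase in lowered:
            return Review(False, f"blocked phrase: {phrase!r}")
    return Review(True, "ok")


def respond(patient_message: str) -> str:
    """Only pass the talker's draft to the patient if the planner approves it."""
    draft = talker(patient_message)
    review = planner(draft)
    return draft if review.approved else FALLBACK


print(respond("My knee hurts when I climb stairs."))
print(respond("I have chest pain but I feel fine otherwise."))
```

The design point of the sketch is the separation of roles: the agent that speaks never decides on its own what is safe to say, and a rejected draft is swapped for a deferral to the clinician.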

Experienced doctors still come out on top

The study scored more than 140 aspects of consultation quality across seven areas: triage, history taking, clinical reasoning, communication and counseling, treatment steps, spotting warning signs, and physical exams. The takeaway is sobering for anyone hoping AI can replace a doctor: experienced physicians beat the AI overall, especially when it came to catching "red flags" and guiding critical physical exams.

Still, the AI co-clinician matched or beat primary care physicians on 68 of the 140 aspects evaluated. OpenAI's GPT-realtime trailed both in all seven areas. The researchers conclude that systems like this work best as support tools for doctors, not as a replacement for clinical judgment.

Bar chart comparing the performance of primary care physicians, the AI co-clinician, and GPT-realtime across seven areas of simulated telemedicine consultations. Physicians lead throughout, the AI co-clinician follows at a distance, and GPT-realtime performs worst.
In simulated telemedicine visits, primary care physicians (orange) outperformed Google's AI co-clinician (blue) across all seven areas evaluated. The gap was widest in catching warning signs (red flags) and physical exams. OpenAI's GPT-realtime (gray) trailed well behind both in every category. | Image: Google Deepmind

It's still unclear whether the research project will turn into an actual product. The results show progress in AI-driven evidence synthesis and telemedicine consultations, but they also make clear there's still a gap to close with experienced doctors, especially on safety-critical tasks like catching warning signs. "While it's early days, the promise is clear," says Deepmind researcher Alan Karthikesalingam.

Source: Google Deepmind