Microsoft has introduced a new AI system that could dramatically improve how complex medical cases are diagnosed, offering four times the diagnostic accuracy of experienced physicians while reducing costs. The technology was evaluated using a new benchmark designed to closely simulate the real, step-by-step diagnostic process.

Researchers at Microsoft AI unveiled the system in their paper "Sequential Diagnosis with Language Models", claiming the model significantly outperforms human doctors in both accuracy and cost-effectiveness for challenging cases.

To produce more realistic results, the team created the Sequential Diagnosis Benchmark (SDBench). Traditional medical AI tests, the authors argue, often overstate model performance by presenting all information at once, rather than mimicking the sequential nature of clinical decision-making.

SDBench draws on 304 complex case reports from the New England Journal of Medicine (NEJM). At the start, a human or AI diagnostician receives only a brief case summary and must actively request more information by asking targeted questions or ordering tests. A "gatekeeper" model reveals only the requested details and, according to the paper, can even generate realistic synthetic test results for procedures not described in the original case, which prevents accidental hints.
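
The request-and-reveal loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper's gatekeeper is a language model, whereas this stand-in is a plain dictionary lookup, and the class name `Gatekeeper`, the function `run_case`, and the toy case data are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Toy stand-in for the gatekeeper: a lookup table, not a language model."""
    summary: str
    findings: dict[str, str]  # request -> finding recorded in the case (invented data)

    def opening(self) -> str:
        # The diagnostician starts with the brief case summary only.
        return self.summary

    def answer(self, request: str) -> str:
        # Reveal only what was asked for. For anything the case does not
        # describe, return a placeholder where the real system would
        # generate a realistic synthetic result.
        key = request.strip().lower()
        if key in self.findings:
            return self.findings[key]
        return f"[synthetic result for '{request}']"


def run_case(gatekeeper: Gatekeeper, requests: list[str]) -> list[tuple[str, str]]:
    """Replay a sequence of requests and collect what was revealed at each step."""
    transcript = [("summary", gatekeeper.opening())]
    for request in requests:
        transcript.append((request, gatekeeper.answer(request)))
    return transcript


# Invented toy case, for illustration only.
toy_case = Gatekeeper(
    summary="Adult patient with two weeks of fever and fatigue.",
    findings={"travel history": "No recent travel."},
)

for step, revealed in run_case(toy_case, ["Travel history", "Chest X-ray"]):
    print(f"{step}: {revealed}")
```

The point of the sketch is the information flow: nothing beyond the summary is visible until it is requested, and an undocumented test still returns an answer rather than a telltale "not available".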

Costs are calculated as the sum of all tests and doctor visits. Each round of patient questioning is priced at $300 for a doctor consultation. Specific test costs are determined by mapping requests to standardized CPT codes and matching them against a 2023 price list from a major US health system.
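
As a rough sketch of this accounting, the following Python snippet adds a flat $300 per questioning round to the prices of the ordered tests. The $300 fee is taken from the description above; the price list, its placeholder codes (`CPT-A`, `CPT-B`), and the `CaseLedger` class are assumptions made up for the example and do not come from the paper or from any real price list.

```python
from dataclasses import dataclass, field

CONSULTATION_FEE = 300  # USD per round of patient questioning, as reported

# Hypothetical price list: placeholder codes and invented prices in USD.
PRICE_LIST = {
    "CPT-A": 40,
    "CPT-B": 120,
}


@dataclass
class CaseLedger:
    """Tracks what one diagnostician spends on a single case."""
    question_rounds: int = 0
    ordered_tests: list[str] = field(default_factory=list)

    def ask_questions(self) -> None:
        # Each round of questioning counts as one consultation.
        self.question_rounds += 1

    def order_test(self, code: str) -> None:
        # Reject codes that cannot be priced instead of guessing a price.
        if code not in PRICE_LIST:
            raise KeyError(f"no price for test code {code!r}")
        self.ordered_tests.append(code)

    def total_cost(self) -> int:
        consultations = self.question_rounds * CONSULTATION_FEE
        tests = sum(PRICE_LIST[code] for code in self.ordered_tests)
        return consultations + tests


ledger = CaseLedger()
ledger.ask_questions()
ledger.ask_questions()
ledger.order_test("CPT-A")
ledger.order_test("CPT-B")
print(ledger.total_cost())  # 2 * 300 + 40 + 120 = 760
```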

In testing, a group of 21 experienced physicians from the US and UK achieved a diagnostic accuracy of 19.9 percent at an average cost of $2,963 per case. By comparison, Microsoft's "MAI Diagnostic Orchestrator" (MAI-DxO), combined with OpenAI's o3 model, reached 79.9 percent accuracy at a lower average cost of $2,397.

The major advance with MAI-DxO is the reduction in cost. On its own, the o3 model achieved the highest accuracy among standard models at 78.6 percent, but at an average cost of $7,850 per case. With MAI-DxO orchestrating the process, accuracy rose slightly while costs dropped by nearly 70 percent.
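
The headline ratios can be checked directly from the reported figures. The short Python calculation below uses only the numbers quoted above; the variable and function names are chosen for this example.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Figures as reported in the article.
PHYSICIAN_ACCURACY = 19.9   # percent
O3_ACCURACY = 78.6          # percent, o3 on its own
MAI_DXO_ACCURACY = 79.9     # percent, MAI-DxO with o3
PHYSICIAN_COST = 2963       # USD per case
O3_COST = 7850              # USD per case, o3 on its own
MAI_DXO_COST = 2397         # USD per case, MAI-DxO with o3

accuracy_ratio = MAI_DXO_ACCURACY / PHYSICIAN_ACCURACY
accuracy_gain_points = MAI_DXO_ACCURACY - O3_ACCURACY
cost_drop_vs_o3 = percent_reduction(O3_COST, MAI_DXO_COST)
cost_drop_vs_physicians = percent_reduction(PHYSICIAN_COST, MAI_DXO_COST)

print(f"Accuracy vs. physicians: {accuracy_ratio:.1f}x")
print(f"Accuracy gain over o3 alone: {accuracy_gain_points:.1f} points")
print(f"Cost reduction vs. o3 alone: {cost_drop_vs_o3:.1f}%")
print(f"Cost reduction vs. physicians: {cost_drop_vs_physicians:.1f}%")
```

The accuracy ratio comes out at about 4.0 and the cost reduction relative to o3 alone at about 69.5 percent, in line with the "four times" and "nearly 70 percent" figures. Compared with the physicians, the cost reduction is about 19 percent.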

Virtual medical team boosts performance

According to the paper, MAI-DxO succeeds by simulating a virtual panel of doctors, all roles played by a single language model. "Dr. Hypothesis" maintains a list of likely diagnoses, "Dr. Test-Chooser" picks the most informative tests, and "Dr. Challenger" acts as a devil's advocate to prevent cognitive bias. "Dr. Stewardship" monitors costs, while "Dr. Checklist" ensures quality control.
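
The idea of one model playing every panel member can be sketched as a loop that calls the same model once per role. This is a minimal sketch, not the paper's implementation: the role names come from the description above, but the instruction texts are paraphrases, and `run_panel` and the stand-in `echo_model` are hypothetical names introduced for the example.

```python
from typing import Callable

# Role names as reported; instruction wording is a paraphrase, not the paper's prompts.
PANEL_ROLES = {
    "Dr. Hypothesis": "Maintain a list of likely diagnoses.",
    "Dr. Test-Chooser": "Pick the most informative tests.",
    "Dr. Challenger": "Act as devil's advocate to counter cognitive bias.",
    "Dr. Stewardship": "Monitor costs.",
    "Dr. Checklist": "Perform quality control.",
}


def run_panel(model: Callable[[str], str], case_notes: str) -> dict[str, str]:
    """Query one shared model once per role and collect each contribution."""
    contributions = {}
    for role, instruction in PANEL_ROLES.items():
        prompt = f"You are {role}. {instruction}\n\nCase so far:\n{case_notes}"
        contributions[role] = model(prompt)
    return contributions


def echo_model(prompt: str) -> str:
    """Stand-in for a language model: returns the first line of the prompt."""
    return prompt.splitlines()[0]


for role, reply in run_panel(echo_model, "Brief case summary goes here.").items():
    print(f"{role}: {reply}")
```

In a real system, `echo_model` would be replaced by a call to an actual language model, and the contributions would feed into a decision about the next question, test, or final diagnosis.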

The structured approach is designed to prevent the system from anchoring on an early hypothesis. In one example, a standard language model incorrectly suspected antibiotic toxicity, ordered $3,431 in tests, and made a wrong diagnosis. MAI-DxO, by asking targeted questions about toxin exposure, correctly identified hand sanitizer ingestion as the cause for just $795.

Limitations and open questions

The authors acknowledge several limitations. SDBench is based exclusively on complex, teaching-oriented NEJM cases, so it does not reflect the distribution of diseases seen in everyday practice and excludes healthy patients and benign conditions. It remains unclear whether the system's performance gains would translate to common, routine illnesses.

The cost calculations are only rough estimates, based on US prices and not accounting for real-world factors like geography, insurance, test invasiveness, wait times, or equipment availability.

The comparison with human physicians is also limited. The participating doctors were general practitioners, who would typically refer such complex cases to specialists, and they were not allowed to use external resources like search engines or medical literature. The authors note that the system's "superhuman" performance is partly due to its ability to combine the broad knowledge of a generalist with the deep expertise of multiple specialists, a combination that would be unrealistic for any single human.

Summary
  • Microsoft has introduced an AI system, MAI Diagnostic Orchestrator (MAI-DxO), that achieved nearly 80 percent accuracy in diagnosing complex medical cases, roughly four times the accuracy of experienced physicians, while also reducing the average cost per case compared to both doctors and other AI models.
  • The system was tested using a new benchmark (SDBench) that mimics the real, step-by-step diagnostic process, requiring both humans and AI to request information sequentially and tallying up costs based on consultations and tests.
  • While the results are promising, the research is limited to rare and complex cases from medical journals, does not reflect everyday illnesses, and the cost figures are only rough estimates; the AI’s advantage also comes from combining multiple specialist roles in a way that no individual doctor could.
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.