
Friday, May 1, 2026

Study Suggests AI Is Good Enough at Diagnosing Complex Medical Cases To Warrant Clinical Testing

LLM outperformed physicians on clinical tasks spanning published cases and real-world emergency room data

Scientific Frontline Extended "At a Glance" Summary: Large Language Models in Clinical Diagnostics

The Core Concept: A large language model (LLM) demonstrated the ability to review complex patient charts and outperform physicians across various clinical reasoning tasks, including identifying likely diagnoses and determining emergency management steps.

Key Distinction/Mechanism: Unlike previous studies that pre-processed or "smoothed out" patient data, this research tested the AI against raw, unstructured electronic health records from actual emergency department cases, evaluating its reasoning capabilities early in the patient's course, when clinical data are notably sparse.

Major Frameworks/Components

  • Evaluation across multiple stages of emergency care, ranging from initial triage to hospital admission decisions.
  • Utilization of unmodified, real-world electronic health records (EHR) to test algorithmic reasoning under standard clinical ambiguity.
  • Comparison against hundreds of human clinicians using diagnostic challenges and reasoning exercises.
  • A shift away from traditional multiple-choice AI benchmarks, which modern models have essentially mastered, toward real-world application testing.

Branch of Science: Biomedical Informatics, Artificial Intelligence, and Clinical Medicine.

Future Application: AI systems acting as advanced diagnostic and decision-making aids for human medical practitioners in emergency rooms and clinics, pending successful evaluation through rigorous, prospective clinical trials.

Why It Matters: The findings mark a critical turning point in healthcare technology, suggesting medical AI is now capable enough to warrant the same rigorous clinical testing as new medical interventions. However, human oversight remains an essential baseline to ensure patient safety and prevent unnecessary medical testing.

Researchers have completed one of the largest studies yet comparing artificial intelligence and physicians across a wide range of clinical reasoning tasks, evaluating whether an AI system could do what physicians do every day: review a messy patient chart and decide what to do next.

A large language model (LLM) outperformed physicians across many of these tasks, including making emergency room decisions based on the available information, identifying likely diagnoses, and choosing the next steps in management, a team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center reported April 30 in Science.

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said co-senior author Arjun (Raj) Manrai, an assistant professor of biomedical informatics in the Blavatnik Institute at HMS and founding deputy editor of NEJM AI.

The results make the case that medical AI is ready to be studied the same way as all new medical interventions: through carefully controlled, rigorous, and prospective clinical trials in real care settings.

Manrai noted that these trials are necessary to evaluate whether, how, and where such tools should be deployed in clinical care as aids to human practitioners.

The model’s performance also suggests that long-standing ways of testing medical AI may no longer capture the abilities of current systems—pointing to a possible turning point for the field.

“Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can’t track progress anymore because we’re already at the ceiling,” said co-first author Peter Brodeur, an HMS clinical fellow in medicine at Beth Israel Deaconess.

Testing Medical AI in the Real World

Incorporating standards first created in the 1950s to train and evaluate doctors, the researchers compared the performance of an AI system against that of hundreds of clinicians. The comparisons included case-study diagnostic challenges, reasoning exercises, and real emergency department cases.

In one experiment, the team tasked the LLM with evaluating patients at various points in a standard emergency department setting, ranging from early triage to later admission decisions. At each stage, the model was given only the information available at that point—drawn directly from actual electronic health records—and asked to generate likely diagnoses and recommend what should happen next.
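To make the staged setup concrete, here is a minimal sketch, in Python, of the information-gating idea the article describes: the model sees only the chart content available at each decision point, never later information. The snapshot structure, the `ask_model` helper, and the prompt wording are all hypothetical illustrations, not the study's actual prompts or data pipeline.

```python
# Hypothetical sketch of a staged emergency-department evaluation.
# At each stage, the model receives only the raw chart text that
# existed at that point in the patient's course.

STAGES = ["triage", "initial_evaluation", "admission_decision"]

def evaluate_case(ed_snapshots, ask_model):
    """ed_snapshots maps each stage name to the unprocessed EHR text
    available at that moment; ask_model sends a prompt to the LLM."""
    results = {}
    for stage in STAGES:
        chart_so_far = ed_snapshots[stage]  # no cleanup or summarization
        prompt = (
            "You are assisting in an emergency department.\n"
            f"Chart available at {stage}:\n{chart_so_far}\n"
            "List the most likely diagnoses and recommend the next step."
        )
        results[stage] = ask_model(prompt)
    return results
```

In the study itself, these snapshots came directly from unmodified electronic health records; the sketch only illustrates how later-stage information is kept out of earlier prompts.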

“To better understand real-world performance, we needed to test performance early in the patient course, when clinical data are sparse,” said co-first author Thomas Buckley, a Harvard Kenneth C. Griffin Graduate School of Arts and Sciences doctoral student, a Dunleavy Fellow in HMS’s AI in Medicine PhD track, and a member of the Manrai Lab.

Unlike in prior studies, the team did not smooth out the messiness of real-world care before testing the model; the emergency department cases were presented exactly as they appeared in the electronic health record.

“We didn’t preprocess the data at all,” said co-senior author Adam Rodman, an HMS assistant professor of medicine at Beth Israel Deaconess, director of AI programs for the Carl J. Shapiro Center for Education and Research, and associate editor of NEJM AI.

At the early decision points in the real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy—a result that surprised even the researchers.

“I thought it was going to be a fun experiment but that it wouldn’t work that well. That was not at all what happened,” Rodman said.

The researchers emphasized that their results do not suggest that AI systems are ready to practice medicine autonomously or that physicians can be removed from the diagnostic process.

“A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” Brodeur said. “Humans should be the ultimate baseline when it comes to evaluating performance and safety.”

Additional information: Rodman is a visiting researcher at Google DeepMind. Goh is employed by Microsoft. Chen is a cofounder of Reaction Explorer LLC, serves as a paid medical expert witness for Elite Experts, and has received one-time honoraria or travel expenses for invited presentations by Insitro, General Reinsurance Corporation, AASCIF, and other industry conferences, academic institutions, and health systems. Kanjee discloses royalties from Oakstone Publishing and Wolters Kluwer. Olson discloses the employment of his spouse by Exact Sciences. Abdulnour is employed by the Massachusetts Medical Society and has consulted for Lumeris.

Funding: The research was supported by the National Institutes of Health (grants R01ES032470, 1R01AI17812101, UM1TR004921, and U01NS134358), the Harvard Medical School Dean’s Innovation Award for Artificial Intelligence, the Macy Foundation (awards B25-15 and P25-04), the Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program, and a Stanford RAISE Health Seed Grant 2024.

Published in journal: Science

Title: Performance of a large language model on the reasoning tasks of a physician

Authors: Peter G. Brodeur, Thomas A. Buckley, Zahir Kanjee, Ethan Goh, Evelyn Bin Ling, Priyank Jain, Stephanie Cabral, Raja-Elie Abdulnour, Adrian D. Haimovich, Jason A. Freed, Andrew Olson, Daniel J. Morgan, Jason Hom, Robert Gallo, Liam G. McCoy, Haadi Mombini, Christopher Lucas, Misha Fotoohi, Matthew Gwiazdon, Daniele Restifo, Daniel Restrepo, Eric Horvitz, Jonathan Chen, Arjun K. Manrai, and Adam Rodman

Source/Credit: Harvard Medical School | Beth Israel Deaconess Medical Center

Reference Number: ai050126_01
