Benchmarking Large Language Models in Clinical Reasoning

Neurology resident Dr. Liam McCoy and his team developed a new benchmark that highlights gaps in AI clinical reasoning in neurology and other specialties.

29 September 2025

Journal: New England Journal of Medicine AI (September 25, 2025)


Background

Large language models (LLMs, the technology behind tools like ChatGPT) are increasingly used in clinical decision support, yet common evaluations—such as medical licensing exams—rarely assess whether decisions are revised appropriately when new or uncertain information appears.

To address this gap, we used Script Concordance Testing (SCT), a long-standing method in medical education that quantifies how new data should shift diagnostic or therapeutic judgments under uncertainty, and we evaluated leading LLMs using purpose-built questions. These items do not fully mirror real-world care, but they target a core element of clinical reasoning: knowing when evidence should change your mind and by how much.
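For a sense of the format, a typical SCT item reads something like this (a hypothetical illustration, not an item from the benchmark): "A 68-year-old presents with acute vertigo. If you were considering posterior circulation stroke, and you then learn that the patient has new-onset diplopia, this diagnosis becomes..." with answers given on a five-point scale from -2 (much less likely) to +2 (much more likely).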

Methods

We created a public benchmark of 750 SCT questions drawn from 10 international datasets across multiple specialties, nine of which are newly released. Each question presents a clinical scenario and asks how additional information affects the likelihood of a diagnosis or management option, scored against expert-panel responses. We compared the performance of 10 state-of-the-art LLMs with more than 1,500 human participants, including medical students, residents, and attending physicians.
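To make the scoring concrete, below is a minimal Python sketch of the aggregate scoring rule commonly used in SCT, in which an answer earns credit in proportion to how many panelists chose it, and the modal panel answer earns full credit. The function name and the example panel are ours for illustration; the benchmark's exact scoring pipeline may differ in detail.

    from collections import Counter

    def sct_item_score(response, panel_responses):
        # Aggregate SCT scoring: an answer earns credit proportional to the
        # number of panelists who chose it, normalized so that the modal
        # panel answer earns full credit (1.0).
        counts = Counter(panel_responses)      # votes per Likert option (-2..+2)
        modal_votes = max(counts.values())     # size of the largest expert bloc
        return counts.get(response, 0) / modal_votes

    # Hypothetical panel of 10 experts rating how a new finding shifts a diagnosis
    panel = [+1, +1, +1, +1, +1, 0, 0, +2, +2, -1]
    print(sct_item_score(+1, panel))   # 1.0 -- the modal answer gets full credit
    print(sct_item_score(0, panel))    # 0.4 -- minority answers get partial credit
    print(sct_item_score(-2, panel))   # 0.0 -- answers no panelist chose get none

A respondent's total is then typically the sum of item scores, rescaled for comparison, which is what allows models and human cohorts to be placed on the same scale.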

Results

The top models scored comparably to medical students and junior resident physicians but lagged behind senior residents and attending physicians. Concerningly, the models were markedly overconfident in their responses, and recent advances in the technology have actually made this problem worse.

Even when the extra information was completely irrelevant, the more advanced models convinced themselves that it mattered: they shifted their judgments too far in response to weak or equivocal information and were less likely to recognize that the new information was irrelevant to the question at hand.

Conclusions

This work demonstrates that, while modern LLM systems show considerable promise in medicine, important gaps remain in their performance. As we move toward using LLMs in real clinical contexts, we can and should look to the tools we have spent decades developing to assess human clinicians. These proven tests give us a practical way to measure, compare, and improve medical LLMs before they enter real-world care.


Cite: McCoy LG, Sagar N, Bacchi S, Fong JMN, Tan NCK, Rodman A. NEJM AI. 2025;2(10). 2025 Sep 25. DOI: 10.1056/AIdbp2500120

Dr. Liam McCoy is a neurology resident physician (PGY-4) with the Department of Medicine.
