(Image: Robot or AI doctor, © BiancoBlue | Dreamstime.com)

In a nutshell

  • When AI doctors had to diagnose through conversation rather than multiple-choice tests, accuracy dropped dramatically, from 82% to as low as 26% in some cases
  • Current AI systems struggle with basic clinical skills like asking appropriate follow-up questions and synthesizing information from multiple exchanges
  • The findings suggest AI tools should supplement rather than replace human doctors, as they’re not yet ready for independent patient interaction

BOSTON — Artificial intelligence has shown remarkable promise in healthcare, from reading X-rays to suggesting treatment plans. But when it comes to actually talking to patients and making accurate diagnoses through conversation — a cornerstone of medical practice — AI still has significant limitations, according to new research from Harvard Medical School and Stanford University.

Published in Nature Medicine, the study introduces an innovative testing framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) to evaluate how well large language models (LLMs) perform in simulated doctor-patient interactions. As patients increasingly turn to AI tools like ChatGPT to interpret symptoms and medical test results, understanding these systems’ real-world capabilities becomes crucial.

“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” explains study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations – the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms – poses unique challenges that go far beyond answering multiple choice questions.”

The research team, led by senior authors Rajpurkar and Roxana Daneshjou of Stanford University, evaluated four prominent AI models across 2,000 medical cases spanning 12 specialties. Current evaluation methods typically rely on multiple-choice medical exam questions, which present information in a structured format. However, study co-first author Shreya Johri notes that “in the real world this process is far messier.”

Testing conducted through CRAFT-MD revealed stark performance differences between traditional evaluations and more realistic scenarios. On four-choice multiple-choice questions (MCQs), GPT-4’s diagnostic accuracy dropped from 82% when it read prepared case summaries to 63% when it had to gather the same information through dialogue. The decline was even more pronounced in open-ended scenarios with no answer options to choose from: accuracy fell to 49% with written summaries and to 26% during simulated patient interviews.

The AI models demonstrated particular difficulty synthesizing information from multiple conversation exchanges. Common problems included missing critical details during patient history-taking, failing to ask appropriate follow-up questions, and struggling to integrate various types of information, such as combining visual data from medical images with patient-reported symptoms.

Efficiency is another advantage of the framework: CRAFT-MD can process 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. A comparable human-based evaluation would require extensive recruitment, roughly 500 hours of patient simulations, and another 650 hours of expert assessment.

“As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” says Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”

Based on these findings, the researchers provided comprehensive recommendations for AI development and regulation. These include creating models capable of handling unstructured conversations, better integration of various data types (text, images, and clinical measurements), and the ability to interpret non-verbal communication cues. They also emphasize the importance of combining AI-based evaluation with human expert assessment to ensure thorough testing while avoiding premature exposure of real patients to unverified systems.

The study demonstrates that while AI shows promise in healthcare, current systems require significant advancement before they can reliably engage in the complex, dynamic nature of real doctor-patient interactions. For now, these tools may best serve as supplements to, rather than replacements for, human medical expertise.

Paper Summary

Methodology

The researchers created a sophisticated testing system where one AI acted as a patient (providing information based on real medical cases) while another AI played the role of the doctor (asking questions and making diagnoses). Medical experts reviewed these interactions to ensure quality and accuracy. The study included 2,000 cases across different medical specialties and tested multiple formats: traditional written case summaries, back-and-forth conversations, single-question diagnoses, and summarized conversations. They also tested scenarios with and without multiple-choice options for diagnoses.
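
To make this setup concrete, below is a minimal sketch of a two-agent evaluation loop of the kind the methodology describes. It is not the authors' code: the `Case` structure, the `call_llm` helper, and the prompts are hypothetical placeholders, and the model call is stubbed so the script runs end to end.

```python
from dataclasses import dataclass

@dataclass
class Case:
    vignette: str        # full case description the patient-agent draws on
    true_diagnosis: str  # gold label used for grading

def call_llm(system_prompt: str, transcript: list[str]) -> str:
    """Hypothetical stand-in for any chat-model API, stubbed so the sketch
    runs. A real version would send the system prompt and the transcript
    to a model and return its reply."""
    return "FINAL DIAGNOSIS: example condition"

def run_consultation(case: Case, max_turns: int = 10) -> str:
    """One simulated visit: the doctor-agent interviews the patient-agent
    until it commits to a diagnosis or the turn limit is reached."""
    doctor_sys = ("You are a physician. Ask one focused question per turn. "
                  "When confident, reply 'FINAL DIAGNOSIS: <condition>'.")
    patient_sys = ("You are a patient. Reveal only what is asked, "
                   f"drawing on these facts: {case.vignette}")
    transcript: list[str] = []
    for _ in range(max_turns):
        doctor_msg = call_llm(doctor_sys, transcript)
        transcript.append(f"Doctor: {doctor_msg}")
        if "FINAL DIAGNOSIS:" in doctor_msg:
            return doctor_msg.split("FINAL DIAGNOSIS:")[1].strip()
        patient_msg = call_llm(patient_sys, transcript)
        transcript.append(f"Patient: {patient_msg}")
    return ""  # no diagnosis committed within the turn limit

def diagnostic_accuracy(cases: list[Case]) -> float:
    """Share of cases where the dialogue-derived diagnosis matches the
    gold label. The study used expert review to grade matches, not the
    simple string comparison used here."""
    hits = sum(run_consultation(c).lower() == c.true_diagnosis.lower()
               for c in cases)
    return hits / len(cases)
```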

Results

The key finding was that AI performance dropped significantly when moving from written summaries to conversational diagnosis. With multiple-choice options, accuracy fell from 82% to 63% for GPT-4. Without multiple choice, accuracy dropped even more dramatically – to 26% for conversational diagnosis. The AI also struggled with synthesizing information from multiple exchanges and knowing when to stop gathering information.

Limitations

The study relied on AI-simulated patient interactions rather than real patients, which may not fully capture the complexity of actual clinical encounters or the way real patients communicate. The research also focused mainly on diagnostic accuracy rather than other important aspects of medical care, such as bedside manner and emotional support.

Discussion and Takeaways

The research suggests that current AI models, while impressive in certain structured tasks, are not yet ready for independent patient interaction. The findings indicate AI might be more effectively used as a supporting tool for human doctors rather than a replacement. The study also highlights the importance of developing AI systems that can better handle dynamic conversations and information synthesis.

Funding and Disclosures

The research received support from the HMS Dean’s Innovation Award and Microsoft’s Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. Additional funding came through the IIE Quad Fellowship. Several researchers disclosed industry relationships, including Daneshjou’s consulting roles with DWA, Pfizer, L’Oreal, and VisualDx, along with stock options in medical technology firms. Other disclosures include patents pending and various advisory and equity positions held by team members in healthcare companies.

Publication Information

This research was published in Nature Medicine (DOI: 10.1038/s41591-024-03328-5) as "An evaluation framework for clinical use of large language models in patient interaction tasks" by researchers from Harvard Medical School, Stanford University, and other leading medical institutions.
