(Image: Robot or AI doctor, © BiancoBlue | Dreamstime.com)

In a nutshell

  • When AI doctors had to diagnose through conversation rather than multiple-choice tests, accuracy dropped dramatically, from 82% to as low as 26% in some cases
  • Current AI systems struggle with basic clinical skills like asking appropriate follow-up questions and synthesizing information from multiple exchanges
  • The findings suggest AI tools should supplement rather than replace human doctors, as they’re not yet ready for independent patient interaction

BOSTON — Artificial intelligence has shown remarkable promise in healthcare, from reading X-rays to suggesting treatment plans. But when it comes to actually talking to patients and making accurate diagnoses through conversation — a cornerstone of medical practice — AI still has significant limitations, according to new research from Harvard Medical School and Stanford University.

Published in Nature Medicine, the study introduces an innovative testing framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) to evaluate how well large language models (LLMs) perform in simulated doctor-patient interactions. As patients increasingly turn to AI tools like ChatGPT to interpret symptoms and medical test results, understanding these systems’ real-world capabilities becomes crucial.

“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” explains study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations – the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms – poses unique challenges that go far beyond answering multiple choice questions.”

The research team, led by senior authors Rajpurkar and Roxana Daneshjou of Stanford University, evaluated four prominent AI models across 2,000 medical cases spanning 12 specialties. Current evaluation methods typically rely on multiple-choice medical exam questions, which present information in a structured format. However, study co-first author Shreya Johri notes that “in the real world this process is far messier.”

Testing conducted through CRAFT-MD revealed stark performance differences between traditional evaluations and more realistic scenarios. On four-choice multiple-choice questions (MCQs), GPT-4’s diagnostic accuracy dropped from 82% when it read prepared case summaries to 63% when it had to gather the same information through dialogue. The decline was even more pronounced in open-ended scenarios with no answer options to choose from: accuracy fell to 49% with written summaries and to 26% during simulated patient interviews.

The AI models demonstrated particular difficulty synthesizing information from multiple conversation exchanges. Common problems included missing critical details during patient history-taking, failing to ask appropriate follow-up questions, and struggling to integrate various types of information, such as combining visual data from medical images with patient-reported symptoms.

Efficiency is another advantage of the framework: CRAFT-MD can process 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. A comparable human-based evaluation would require extensive recruitment, roughly 500 hours of patient simulations, and another 650 hours of expert assessment.

“As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” says Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”

Based on these findings, the researchers provided comprehensive recommendations for AI development and regulation. These include creating models capable of handling unstructured conversations, better integration of various data types (text, images, and clinical measurements), and the ability to interpret non-verbal communication cues. They also emphasize the importance of combining AI-based evaluation with human expert assessment to ensure thorough testing while avoiding premature exposure of real patients to unverified systems.

The study demonstrates that while AI shows promise in healthcare, current systems require significant advancement before they can reliably engage in the complex, dynamic nature of real doctor-patient interactions. For now, these tools may best serve as supplements to, rather than replacements for, human medical expertise.

Paper Summary

Methodology

The researchers created a sophisticated testing system where one AI acted as a patient (providing information based on real medical cases) while another AI played the role of the doctor (asking questions and making diagnoses). Medical experts reviewed these interactions to ensure quality and accuracy. The study included 2,000 cases across different medical specialties and tested multiple formats: traditional written case summaries, back-and-forth conversations, single-question diagnoses, and summarized conversations. They also tested scenarios with and without multiple-choice options for diagnoses.
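
To make this setup concrete, below is a minimal sketch of a two-agent evaluation loop of the kind the methodology describes. It is not the authors' code: the `Case` structure, the `call_llm` helper, and the prompts are hypothetical placeholders, and the model call is stubbed so the script runs end to end.

```python
from dataclasses import dataclass

@dataclass
class Case:
    vignette: str        # full case description the patient-agent draws on
    true_diagnosis: str  # gold label used for grading

def call_llm(system_prompt: str, transcript: list[str]) -> str:
    """Hypothetical stand-in for any chat-model API, stubbed so the sketch
    runs. A real version would send the system prompt and the transcript
    to a model and return its reply."""
    return "FINAL DIAGNOSIS: example condition"

def run_consultation(case: Case, max_turns: int = 10) -> str:
    """One simulated visit: the doctor-agent interviews the patient-agent
    until it commits to a diagnosis or the turn limit is reached."""
    doctor_sys = ("You are a physician. Ask one focused question per turn. "
                  "When confident, reply 'FINAL DIAGNOSIS: <condition>'.")
    patient_sys = ("You are a patient. Reveal only what is asked, "
                   f"drawing on these facts: {case.vignette}")
    transcript: list[str] = []
    for _ in range(max_turns):
        doctor_msg = call_llm(doctor_sys, transcript)
        transcript.append(f"Doctor: {doctor_msg}")
        if "FINAL DIAGNOSIS:" in doctor_msg:
            return doctor_msg.split("FINAL DIAGNOSIS:")[1].strip()
        patient_msg = call_llm(patient_sys, transcript)
        transcript.append(f"Patient: {patient_msg}")
    return ""  # no diagnosis committed within the turn limit

def diagnostic_accuracy(cases: list[Case]) -> float:
    """Share of cases where the dialogue-derived diagnosis matches the
    gold label. The study used expert review to grade matches, not the
    simple string comparison used here."""
    hits = sum(run_consultation(c).lower() == c.true_diagnosis.lower()
               for c in cases)
    return hits / len(cases)
```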

Results

The key finding was that AI performance dropped significantly when moving from written summaries to conversational diagnosis. With multiple-choice options, accuracy fell from 82% to 63% for GPT-4. Without multiple choice, accuracy dropped even more dramatically – to 26% for conversational diagnosis. The AI also struggled with synthesizing information from multiple exchanges and knowing when to stop gathering information.

Limitations

The study relied on AI-simulated patient interactions rather than real patients, which may not fully capture the complexity of actual clinical encounters or the way real patients communicate. The research also focused mainly on diagnostic accuracy rather than other important aspects of medical care, such as bedside manner and emotional support.

Discussion and Takeaways

The research suggests that current AI models, while impressive in certain structured tasks, are not yet ready for independent patient interaction. The findings indicate AI might be more effectively used as a supporting tool for human doctors rather than a replacement. The study also highlights the importance of developing AI systems that can better handle dynamic conversations and information synthesis.

Funding and Disclosures

The research received support from the HMS Dean’s Innovation Award and Microsoft’s Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. Additional funding came through the IIE Quad Fellowship. Several researchers disclosed industry relationships, including Daneshjou’s consulting roles with DWA, Pfizer, L’Oreal, and VisualDx, along with stock options in medical technology firms. Other disclosures include patents pending and various advisory and equity positions held by team members in healthcare companies.

Publication Information

This research was published in Nature Medicine (DOI: 10.1038/s41591-024-03328-5) as "An evaluation framework for clinical use of large language models in patient interaction tasks" by researchers from Harvard Medical School, Stanford University, and other leading medical institutions.
