Dr. ChatGPT? AI system close to passing the U.S. Medical Licensing Exam

MOUNTAIN VIEW, Calif. — A new artificial intelligence (AI) system is close to achieving something that even the smartest humans struggle with every year: earning a medical license. ChatGPT scores at or around the 60 percent passing grade on the United States Medical Licensing Exam (USMLE). According to researchers, the AI system can produce its own unique answers that not only make sense but also avoid the complicated medical jargon patients typically can’t stand.

A team from the virtual pulmonary rehab treatment center AnsibleHealth tested the advanced AI program against 350 of the 376 public questions available from the June 2022 USMLE release. ChatGPT is a large language model (LLM), meaning it generates human-like writing by predicting upcoming sequences of words.
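To make "predicting upcoming sequences of words" concrete, here is a toy sketch in Python. It is not how ChatGPT works internally (ChatGPT relies on a very large neural network); it only counts which word most often follows another in a tiny made-up corpus. The function names and the sample sentence are hypothetical, chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical miniature corpus, invented for this example.
corpus = "the patient has a fever the patient has a cough the doctor has a plan"
model = train_bigrams(corpus)
print(predict_next(model, "patient"))
```

A real language model replaces these raw counts with learned probabilities over a huge vocabulary and much longer context, but the basic idea of picking a likely next word is the same.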

ChatGPT differs from other chatbot systems in that it does not search the internet to find an answer. Instead, it generates new text from word relationships learned by its own internal networks. Previous studies have shown that this new generation of chatbot can actually write term papers for college-level students. The software could even find its way into the courtroom, providing legal defenses for suspects, or all the way to Hollywood, writing new scripts for television shows and movies.

How well does ChatGPT do on medical exams?

The USMLE is a highly standardized and regulated series of three tests (Steps 1, 2CK, and 3). Medical students and physicians-in-training take the exams in order to get their official medical license. The exams test each person’s knowledge of most medical fields, including biochemistry, diagnostic reasoning, and bioethics.

The AnsibleHealth team removed image-based questions from the test, focusing on those that ChatGPT could answer from text alone. After the researchers excluded responses they classified as “indeterminate,” ChatGPT scored between 52.4 and 75 percent across all three exams. Each year, the passing grade typically hovers around 60 percent.
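The arithmetic behind such a score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' actual scoring code: the field names (`image_based`, `grade`), the grade labels, and the sample data are all hypothetical, and the 60 percent threshold is the approximate figure cited above.

```python
PASSING_THRESHOLD = 0.60  # approximate passing grade cited in the article


def accuracy_excluding_indeterminate(graded):
    """Share of correct answers after dropping image-based questions
    and responses graded as indeterminate. Returns None if nothing is left."""
    kept = [
        g for g in graded
        if not g["image_based"] and g["grade"] != "indeterminate"
    ]
    if not kept:
        return None
    correct = sum(1 for g in kept if g["grade"] == "correct")
    return correct / len(kept)


# Hypothetical graded responses, invented for this example.
sample = [
    {"image_based": False, "grade": "correct"},
    {"image_based": False, "grade": "correct"},
    {"image_based": False, "grade": "incorrect"},
    {"image_based": False, "grade": "indeterminate"},
    {"image_based": True, "grade": "correct"},
    {"image_based": False, "grade": "correct"},
]

score = accuracy_excluding_indeterminate(sample)
verdict = "at or above threshold" if score >= PASSING_THRESHOLD else "below threshold"
print(f"{score:.1%} ({verdict})")
```

Note that excluding indeterminate responses shrinks the denominator, so the same set of answers can yield a higher percentage than it would if those responses were counted as wrong.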

Along with this near-passing score, study authors report that ChatGPT also provided a consistent answer 94.6 percent of the time. It also created at least one new, unique, and clinically valid answer 88.9 percent of the time.

For comparison, PubMedGPT, a chatbot trained specifically on biomedical literature, scored only 50.8 percent on a previous edition of the U.S. Medical Licensing Exam. The AnsibleHealth team hopes the new program has a future in medical education. ChatGPT is already rewriting jargon-heavy reports that can be difficult for everyday patients to understand.

“Reaching the passing score for this notoriously difficult expert exam, and doing so without any human reinforcement, marks a notable milestone in clinical AI maturation,” the authors say in a media release.

“ChatGPT contributed substantially to the writing of [our] manuscript… We interacted with ChatGPT much like a colleague, asking it to synthesize, simplify, and offer counterpoints to drafts in progress… All of the co-authors valued ChatGPT’s input,” adds study author Dr. Tiffany Kung.

The results are published in the journal PLOS Digital Health.