
Credit: jackpress on Shutterstock

In A Nutshell

  • A Deakin University study of mental health literature reviews found that ChatGPT (GPT-4o) fabricated roughly one in five academic citations, with more than half of all citations (56%) being either fake or containing errors.
  • The AI’s accuracy varied dramatically by topic: depression citations were 94% real, while binge eating disorder and body dysmorphic disorder saw fabrication rates near 30%, suggesting less-studied subjects face higher risks.
  • Among fabricated citations that included DOIs, 64% linked to real but completely unrelated papers, making the errors harder to spot without careful verification.
  • Mental health researchers using AI tools need to verify every citation manually, and journals should strengthen safeguards to prevent fabricated references from entering published work.

Mental health researchers relying on ChatGPT to speed up their work should take note of an unsettling finding from Australian researchers. The AI chatbot gets citations wrong or invents them outright more than half the time.

When scientists at Deakin University tasked GPT-4o with writing six literature reviews on mental health topics, they discovered that 35 of the 176 citations the AI generated (19.9%) were completely fabricated. Among the 141 real citations, 45.4% contained errors such as wrong publication dates, incorrect page numbers, or invalid digital object identifiers.

Overall, only 77 of 176 citations (43.8%) were both real and accurate. That means 56.2% were either fabricated or contained errors. For researchers under pressure to publish and increasingly turning to AI tools for assistance, the study, published in JMIR Mental Health, reveals a troubling pattern in when and why these errors occur.
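To see how those figures fit together, a short back-of-the-envelope calculation using only the counts reported in the study (176 citations generated, 35 fabricated, 77 fully accurate) reproduces each headline percentage:

```python
# Counts reported in the study.
total = 176
fabricated = 35
real = total - fabricated            # 141 real citations
accurate = 77                        # real citations with no errors
real_with_errors = real - accurate   # 64 real citations containing errors

print(f"fabricated:        {fabricated / total:.1%}")                       # ~19.9%
print(f"real with errors:  {real_with_errors / real:.1%}")                  # ~45.4%
print(f"fully accurate:    {accurate / total:.1%}")                         # ~43.8%
print(f"fake or erroneous: {(fabricated + real_with_errors) / total:.1%}")  # ~56.2%
```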

The Phantom Paper Problem: When ChatGPT Fabricates Sources

The fabricated citations weren’t obviously fake. Of the 35 fabricated sources, 33 came with a supposed DOI (digital object identifier, a unique code that resolves to a web link), and 64% of those DOIs led to real published papers on completely unrelated topics. Someone clicking the link would land on a genuine article, making the fabrication harder to spot without careful verification.

The remaining 36% of fake DOIs were invalid or nonfunctional. Either way, the citations couldn’t support the claims the AI made in its generated text.

Lead author Jake Linardon and colleagues at Deakin University tested whether the AI performed better or worse depending on how well-known a topic was and how specific the request was. They chose three psychiatric conditions for their experiment: major depressive disorder, binge eating disorder, and body dysmorphic disorder. These conditions differ substantially in public recognition and research volume.

Depression research is extensive, with more than 100 clinical trials evaluating digital interventions alone. Body dysmorphic disorder has far fewer published studies on digital treatments.

Lesser-Known Topics Trigger More AI Hallucinations

GPT-4o’s citation accuracy varied dramatically depending on which disorder it wrote about. For major depressive disorder, only 6% of citations were fabricated. But for binge eating disorder and body dysmorphic disorder, fabrication rates jumped to 28% and 29%, respectively.

LLMs like ChatGPT still can’t help but hallucinate false information and citations. (Credit: Iljanaresvara Studio on Shutterstock)

Among real citations, major depressive disorder achieved 64% accuracy, binge eating disorder 60%, and body dysmorphic disorder only 29%. The pattern suggests ChatGPT may perform better on well-established topics with abundant training data, though the study notes this relationship wasn’t directly tested.

The study also examined whether asking for general overviews versus specialized reviews affected accuracy. Researchers compared broad summaries of each disorder, covering symptoms and treatments, with highly specific reviews focused on digital interventions for each condition, and fabrication rates differed between the two prompt types.

For binge eating disorder specifically, specialized reviews saw fabrication rates jump to 46% compared to 17% for general overviews. However, this pattern didn’t hold consistently across all three disorders.

Rising AI Adoption in Research Raises the Stakes

These results emerge as AI adoption accelerates in research settings. A recent survey found that nearly 70% of mental health scientists report using ChatGPT for research tasks including writing, data analysis, and literature reviews. Most users say the tools improve efficiency, but many express concern about inaccuracies and misleading content.

Researchers face growing pressure to publish frequently while juggling teaching, supervision, and administrative duties. Tools that promise to streamline literature reviews and speed up writing offer appealing solutions to productivity demands. But accepting AI output without verification creates serious risks.

Fabricated references mislead readers, distort scientific understanding, and erode the foundation of scholarly communication. Citations guide readers to source evidence and build cumulative knowledge. When those citations point nowhere or to the wrong papers, the entire system breaks down.

Fabricated citations with DOIs were particularly deceptive: 64% linked to real but unrelated papers. A quick glance might suggest these citations were legitimate, but careful checking would reveal the mismatch between what GPT-4o claimed a source said and what it actually contained.

Errors among the real citations also clustered in predictable places. DOIs had the highest error rate at 36.2%, while author lists had the lowest at 14.9%. Publication years, journal names, volume numbers, and page ranges all showed error rates between these extremes.

What Researchers and Institutions Must Do Now

Linardon’s team emphasizes that all AI-generated content requires rigorous human verification. Every citation must be checked against original sources. Claims need to be validated. References must be confirmed to exist and actually support the statements attributed to them.

The authors also call for journals to implement stronger safeguards. One suggestion involves using plagiarism detection software in reverse. For example, citations that don’t trigger matches in existing databases may signal fabricated sources worth investigating more closely.
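As an illustration of that kind of database lookup, the minimal Python sketch below checks a cited DOI against the public Crossref API and flags citations whose registered title doesn’t resemble the title the citation claims. It is not the study authors’ workflow; the function name, similarity threshold, and commented example are illustrative assumptions.

```python
# Minimal sketch (not the study authors' workflow): look up a cited DOI in
# the public Crossref database and flag it when the registered title does
# not resemble the title the citation claims to have.
from difflib import SequenceMatcher

import requests  # third-party; pip install requests


def check_citation(doi: str, claimed_title: str, threshold: float = 0.6) -> str:
    """Return a rough verdict: 'ok', 'title mismatch', or 'doi not found'."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return "doi not found"  # invalid or unregistered DOI
    titles = resp.json()["message"].get("title", [])
    registered_title = titles[0] if titles else ""
    similarity = SequenceMatcher(
        None, claimed_title.lower(), registered_title.lower()
    ).ratio()
    return "ok" if similarity >= threshold else "title mismatch"


# Hypothetical usage: a fabricated citation whose DOI resolves to an
# unrelated paper would come back as "title mismatch" rather than "ok".
# print(check_citation("10.2196/80371", "Influence of Topic Familiarity and "
#                      "Prompt Specificity on Citation Fabrication ..."))
```

A check like this only catches metadata mismatches; confirming that a real paper actually supports the claim attributed to it still requires reading the source.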

Academic institutions should develop clear policies around AI use in scholarly writing, including training on how to identify hallucinated citations and properly disclose when generative AI contributed to a manuscript.

The study found no clear evidence that newer AI versions have solved the hallucination problem, though direct comparisons with earlier models are limited by differences in how studies are designed. Despite expectations that GPT-4o would show improvements over earlier iterations, citation fabrication remained common across all test conditions.

Researchers can reduce risks by using AI preferentially for well-established subjects while implementing verification protocols for specialized areas where training data may be sparse. Topic characteristics matter: citation reliability isn’t random but depends on public familiarity, research maturity, and prompt specificity.

For now, ChatGPT’s output works best as a starting point that demands extensive human oversight rather than as a reliable shortcut researchers can fully trust. The tool can help generate initial drafts or organize ideas, but the verification burden remains squarely on human shoulders.

The findings also raise questions about how AI systems should be designed and marketed for academic use. If citation fabrication is predictable based on topic characteristics, developers might build in stronger warnings or verification prompts when users request information on specialized subjects.

Journals and funding bodies increasingly require authors to disclose AI use in research. This study provides evidence for why such transparency matters and why editorial review processes must adapt to catch AI-generated errors that traditional peer review might miss.

The scope of the problem extends beyond individual researchers. When fabricated citations enter the published literature, they can propagate through citation networks, mislead future researchers, and waste resources as scientists chase phantom sources or build on false premises. Institutional and systemic responses are needed, not just individual vigilance.


Disclaimer: This article summarizes scientific research for general information purposes and is not intended as professional advice for academic researchers or institutions. The study examined mental health research specifically, and findings may not generalize to other academic disciplines. Always consult appropriate guidelines and policies regarding AI use in your specific academic context.


Paper Notes

Study Limitations

The research examined only three psychiatric disorders and may not apply to other mental health topics or academic disciplines beyond psychiatry. Results are specific to GPT-4o and may differ for other AI models. The study used straightforward prompts rather than advanced prompt engineering techniques that might improve accuracy. Each prompt generated a single output without testing whether repeated attempts would produce consistent fabrication rates. The classification of disorders by public familiarity was based on the research team’s assessment of publication volume and clinical trial prevalence but wasn’t validated through separate empirical surveys.

Funding and Disclosures

Lead author Jake Linardon is supported by a National Health and Medical Research Council investigator grant (APP1196948). The authors confirmed that no content was generated by large language models except for minor proofreading purposes. All authors declared no conflicts of interest. The study was exempt from ethical review as no human participants were involved.

Publication Details

Authors: Jake Linardon, Hannah K Jarman, Zoe McClure, Cleo Anderson, Claudia Liu, Mariel Messer (all affiliated with Deakin University’s School of Psychology, Geelong, Australia)

Title: Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study

Published: November 12, 2025, in JMIR Mental Health, Volume 12, Article e80371. DOI: 10.2196/80371 | PMID: 41223407


