
A recent study has found that popular chatbots like ChatGPT and DeepSeek often exaggerate scientific findings when summarizing research articles.
The study, conducted by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University in Canada and the University of Cambridge in the UK, analyzed nearly 5,000 chatbot-generated summaries of scientific studies.
Their findings, published in Royal Society Open Science, reveal that up to 73% of these summaries contained inaccuracies that made the research appear more conclusive or impactful than it actually was.
The researchers evaluated ten leading large language models (LLMs), including ChatGPT, DeepSeek, Claude, and LLaMA, over a year-long period.
They compared summaries of abstracts and full-length articles from top scientific journals like Nature, Science, and The Lancet.
What they found was surprising: six out of the ten models routinely exaggerated claims from the original research.
For example, chatbot-generated summaries often changed a careful statement such as “The treatment was effective in this study” into a much broader claim like “The treatment is effective,” misleading readers into believing the results were more universally applicable than they were.
What was even more surprising is that prompting the chatbots to be more accurate actually made the problem worse.
According to the study, when the chatbots were specifically asked to avoid inaccuracies, they nearly doubled the rate of overgeneralized conclusions.
This is concerning, said Peters, because many people, including students, researchers, and policymakers, might assume that adding an accuracy request would produce more reliable summaries.
The study showed the opposite, highlighting a critical gap in how LLMs handle scientific information.
The researchers also compared these chatbot-generated summaries with human-written ones. They found that the chatbots were nearly five times more likely to produce exaggerated claims than humans summarizing the same research.
More alarmingly, newer versions of the AI models, such as ChatGPT-4o and DeepSeek, performed worse than older versions, suggesting that improvements in AI capabilities have not necessarily led to better accuracy in science communication.
To address these issues, Peters and Chin-Yee recommend using chatbots like Claude, which showed the highest accuracy in avoiding exaggeration.
They also suggest setting chatbots to a lower “temperature,” which reduces their creativity and likelihood of overgeneralizing. Additionally, they advise using prompts that encourage more cautious, past-tense language when summarizing scientific findings.
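For readers who want to experiment with these recommendations, the sketch below shows what a lower-temperature, cautiously worded summarization request could look like. It is a minimal sketch assuming the OpenAI Python client; the model name, temperature value, and prompt wording are illustrative choices, not settings drawn from the study itself.

```python
# Minimal sketch of a low-temperature, cautiously worded summarization request.
# Assumes the OpenAI Python client; the model name, temperature value, and
# prompt wording are illustrative, not the exact settings tested in the study.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

abstract = "..."  # the abstract text to be summarized goes here

response = client.chat.completions.create(
    model="gpt-4o",      # illustrative model choice
    temperature=0.2,     # lower temperature: less creative rewording
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the following abstract in plain language. "
                "Report findings in the past tense, limited to the studied "
                "sample and conditions, and do not generalize beyond them."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)

print(response.choices[0].message.content)
```

Whether such settings reliably reduce overgeneralization in practice is exactly the kind of question the researchers argue needs systematic testing.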
The researchers stress that if AI is to be a reliable tool for science communication, there needs to be more testing and stricter guidelines.
According to Peters, the goal should be to support science literacy, not undermine it. Their study serves as a reminder that while chatbots are powerful tools, they still need careful oversight when it comes to representing scientific facts.