Researchers at Georgia Tech have conducted a study that casts doubt on the reliability of chatbots such as ChatGPT for providing healthcare advice, especially to non-English speakers.
This team from the College of Computing has developed a new tool called XLingEval to assess how well large language models (LLMs) handle health-related queries in various languages.
The study, led by Ph.D. students Mohit Chandra and Yiqiao (Ahren) Jin, found significant disparities in the accuracy of responses depending on the language used.
Their findings, shared in the paper titled “Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries,” suggest that chatbots are less effective in languages other than English, which poses a problem for global users relying on these technologies for health advice.
The research points out that while LLMs like ChatGPT can be useful tools, they are not yet reliable enough to replace professional medical advice, especially in languages other than English. This is concerning because incorrect health advice can have serious consequences.
The XLingEval framework introduced by the researchers aims to pinpoint these issues by evaluating the performance of chatbots across different languages.
Their testing revealed that responses in Spanish, Chinese, and Hindi were significantly worse than in English: correctness fell by 18% and consistency by 29% across these languages. Their evaluation also showed that responses in the non-English languages were 13% less verifiable. These gaps point to a need for LLMs to be trained on more diverse language data if they are to be useful globally.
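To make these axes concrete, here is a minimal sketch, in Python, of how a cross-lingual comparison of this kind could be wired up. It is not the authors' code: the ask_model client, the token-overlap scorer, and all names are illustrative assumptions, and the third axis, verifiability, is omitted because it requires checking answers against authoritative sources.

```python
# Minimal sketch of a cross-lingual evaluation loop in the spirit of
# XLingEval's correctness and consistency axes. All names are illustrative.
from itertools import combinations

LANGUAGES = ["en", "es", "zh", "hi"]

def ask_model(question: str, lang: str, n_samples: int = 3) -> list[str]:
    """Stand-in for a real chat-completion client; returns canned answers
    here so the sketch runs end to end."""
    return [f"dummy answer {i} to: {question} [{lang}]" for i in range(n_samples)]

def token_overlap(a: str, b: str) -> float:
    """Crude similarity proxy (Jaccard over whitespace tokens); a real
    evaluation would use a semantic similarity model instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def evaluate(question: str, reference: str) -> dict[str, dict[str, float]]:
    """Score one question per language on correctness and consistency."""
    scores = {}
    for lang in LANGUAGES:
        answers = ask_model(question, lang)
        # Correctness: average agreement of each answer with the reference.
        correctness = sum(token_overlap(a, reference) for a in answers) / len(answers)
        # Consistency: average pairwise agreement among repeated answers.
        pairs = list(combinations(answers, 2))
        consistency = (sum(token_overlap(a, b) for a, b in pairs) / len(pairs)) if pairs else 1.0
        scores[lang] = {"correctness": correctness, "consistency": consistency}
    return scores

if __name__ == "__main__":
    scores = evaluate(
        "Is it safe to take ibuprofen with high blood pressure?",
        "NSAIDs such as ibuprofen can raise blood pressure; ask a clinician.",
    )
    for lang, s in scores.items():
        print(lang, s)
```

Relative drops like the 18% and 29% figures above would then be read off by comparing each language's scores against the English baseline.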
The researchers have proposed the XLingHealth benchmark, which includes multilingual health-related data to help enhance the models.
The study drew on several datasets to challenge the capabilities of chatbots, including HealthQA, which is built from the consumer health site Patient.info, and LiveQA, which draws on questions submitted to the U.S. National Institutes of Health.
They also used MedicationQA, a dataset of consumer drug questions with answers drawn from MedlinePlus, to test responses to medication-related queries.
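As a rough illustration, one entry in a multilingual health benchmark of this kind might look like the following. The field names and the example question are assumptions for illustration, not the published XLingHealth schema.

```python
# Hypothetical shape of one multilingual benchmark entry; field names are
# illustrative, not the published XLingHealth schema.
entry = {
    "source": "MedicationQA",
    "question": {
        "en": "Can I take ibuprofen with blood pressure medication?",
        "es": "¿Puedo tomar ibuprofeno con medicamentos para la presión arterial?",
        "zh": "我可以在服用降压药的同时服用布洛芬吗？",
        "hi": "क्या मैं रक्तचाप की दवा के साथ इबुप्रोफेन ले सकता हूँ?",
    },
    # A single English reference answer lets the same query be scored
    # consistently across all four languages.
    "reference_answer": (
        "NSAIDs such as ibuprofen can raise blood pressure and may interact "
        "with antihypertensive drugs; consult a clinician or pharmacist first."
    ),
}
```

Pairing each question with parallel translations is what allows a single reference answer to anchor the correctness scores in every language.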
In their comparative analysis involving over 2,000 medical questions, the team found that both ChatGPT-3.5 and MedAlpaca, a specialized healthcare chatbot, struggled with non-English queries. However, ChatGPT performed slightly better, likely because of its training on some multilingual data.
The findings emphasize the importance of enhancing the language capabilities of LLMs, particularly for critical uses like healthcare.
The Georgia Tech team, which also includes Ph.D. student Gaurav Verma and postdoctoral researcher Yibo Hu, presented their work at The Web Conference in Singapore, underlining the global relevance of their research in a city where English and Chinese are predominant.
This study serves as a reminder of the limitations of current AI technologies in handling health-related inquiries across different languages and stresses the need for improvements before they can be reliably used by non-English speakers.
The research findings can be found on arXiv.