AI chatbots often give unsafe health advice, warns a global study.
A new international study has found that popular artificial intelligence chatbots frequently provide inaccurate or unsafe advice when asked how urgently a patient should seek medical care. The findings have sparked concern among health experts who warn that relying on such technology could have serious consequences.
The research, published this week in the Annals of Internal Medicine, evaluated four widely used AI chatbots -- two versions of ChatGPT (GPT-3.5 and GPT-4), Google Bard and Anthropic's Claude -- against 32 realistic patient scenarios. The cases ranged from minor health complaints to potentially life-threatening conditions.
What the study says
'Overall, the customised LLM chatbots generated 88 (88 per cent) health disinformation responses from 100 submitted health questions. The GPT-4o, Gemini 1.5-Pro, Llama 3.2-90B Vision and Grok Beta chatbots provided disinformation responses to 100 per cent (20 out of 20 for each chatbot) of the tested health questions.
'The Claude 3.5 Sonnet chatbot demonstrated some safeguards, with only 40 per cent (eight out of 20) of the tested questions resulting in health disinformation generation.
'The remaining 60 per cent (12 out of 20) of answers generated by the Claude 3.5 Sonnet chatbot, in general, indicated that the model would not respond as it did not want to provide or promote any false or misleading health information.'
The AI models were tasked with determining whether patients needed emergency treatment or a prompt doctor's appointment, or whether they could safely manage their symptoms at home.
Alarmingly, the study found that the chatbots offered incorrect triage guidance in about 35 per cent of the cases.
'Our analysis shows these AI tools often struggle to judge how urgent a situation is,' said Dr Natansh D Modi, the study's lead author from University Hospitals Birmingham NHS Foundation Trust. 'That's particularly concerning when underestimating a serious condition might delay critical care.'
When AI downplays red flags
Researchers noted that the most worrisome errors occurred when the chatbots advised patients to wait or manage symptoms on their own even though the situation required immediate medical attention.
Such under-triage mistakes could have dangerous outcomes, especially for conditions like heart attacks or strokes where every minute counts.
The study also observed other inconsistencies: the same chatbot sometimes gave different recommendations for identical scenarios when prompted again, underlining the unpredictable nature of these tools.
Possible impact on vulnerable groups
Experts caution that people in remote or under-resourced areas -- who might turn to free online tools for quick health guidance -- could be disproportionately affected by misleading advice.
The authors of the study stressed that until there is stronger evidence and oversight, AI chatbots should not be seen as substitutes for clinical judgment.
Promise in non-critical tasks
While the results raise red flags about using AI for self-triage, researchers pointed out that these technologies still hold promise in other parts of healthcare, such as generating patient education materials or summarising medical literature for professionals.
'AI language models have a lot to offer but we need to be clear about where they're safe to use,' Dr Modi added. 'For now, decisions about how urgently someone needs care are best left to trained medical staff.'
The study concludes, 'Our findings reveal substantial vulnerabilities in the safeguards of foundational LLM APIs and the OpenAI GPT store against being system-instructed into health disinformation chatbots.
'While some models demonstrated partial resilience, the inconsistent application of effective protections highlights the urgent need to address these weaknesses.
'These vulnerabilities pose a clear risk of bad actors maliciously constructing chatbots that may outwardly appear helpful but, in reality, disseminate false health information on a large scale.
'Given the rapid spread and global impact that health disinformation can have, AI developers must prioritise the implementation of robust safeguards supported by comprehensive AI vigilance frameworks such as health-specific auditing, continuous monitoring and proactive patching.'