Liu, Y., & Wang, J. (2023). AI-Driven Health Advice: Evaluating the Potential of Large Language Models as Health Assistants. Journal of Computational Methods in Engineering Applications, 3(1), 1–7. https://doi.org/10.62836/jcmea.v3i1.030106

AI-Driven Health Advice: Evaluating the Potential of Large Language Models as Health Assistants

This study evaluates whether a GPT model can serve as a health assistant by addressing health concerns in three respects: providing preliminary guidance, clarifying information, and offering accessible recommendations. A total of 31 questions were collected from multiple online health platforms, covering diverse health concerns across different age ranges and genders. A tailored system prompt was built to guide the GPT-3.5-turbo model in generating responses. Each response was scored from 0 to 5 on three metrics: “Preliminary Guidance”, “Clarifying Information”, and “Accessibility and Convenience”. Finally, the generated responses were rated against these metrics by a medical doctor with over 20 years of experience in general and preventive care. The results indicate that the LLM showed moderate performance on both the ‘preliminary guidance’ and ‘clarifying information’ metrics. Specifically, the mean score for ‘preliminary guidance’ was 3.65, implying that LLMs can offer valuable insight when symptoms indicate the need for urgent or emergency care, as well as reassurance for minor symptoms. Similarly, the mean score for ‘clarifying information’ was 3.87, indicating that LLMs effectively provide supplementary information that helps patients make informed decisions. However, the mean score for ‘accessibility and convenience’ was notably lower at 2.65, highlighting a deficiency in LLMs’ ability to tailor advice to the specific needs of individual patients.
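The 0-to-5, three-metric scoring scheme described above can be sketched in a few lines of Python. The per-question scores and variable names below are hypothetical placeholders for illustration only; in the study, the actual ratings were assigned by an experienced physician.

```python
from statistics import mean

MAX_SCORE = 5  # each response is rated on a 0-5 scale per metric

# Hypothetical example ratings (the study collected 31 per metric)
example_scores = {
    "Preliminary Guidance": [4, 3, 5, 4, 3],
    "Clarifying Information": [4, 4, 5, 3, 4],
    "Accessibility and Convenience": [3, 2, 3, 2, 3],
}

def metric_means(scores):
    """Return the mean score per metric, validating the 0-5 range."""
    result = {}
    for metric, values in scores.items():
        if not all(0 <= v <= MAX_SCORE for v in values):
            raise ValueError(f"score out of range for {metric}")
        result[metric] = mean(values)
    return result

for metric, avg in metric_means(example_scores).items():
    print(f"{metric}: {avg:.2f}")
```

With the study's real per-question ratings in place of the placeholders, this aggregation yields the reported means of 3.65, 3.87, and 2.65.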

large language models; LLMs; GPT models; GPT-3.5-turbo; artificial intelligence; healthcare; general health; health assistants; digital health
