ChatGPT shows promise in addressing heart failure queries with accuracy and precision

In a recent study posted to the medRxiv* preprint server, researchers evaluate the accuracy and reproducibility of responses from ChatGPT versions 3.5 and 4 in answering heart failure-related questions.

*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.

Background

By 2030, researchers estimate that healthcare costs associated with heart failure will reach approximately $70 billion annually in the United States. About 70% of these costs are due to hospitalizations, which constitute 1-2% of all hospital admissions in the United States. Studies have shown that patients who possess more knowledge about managing their heart condition tend to have fewer and shorter hospital stays. 

With the increasing use of online resources for health information, nearly one billion healthcare-related questions are searched on Google every day. One notable artificial intelligence (AI) model known as Chat Generative Pre-Trained Transformer (ChatGPT) has recently gained popularity.

ChatGPT is a large language model (LLM) that has been trained on a diverse dataset, including medical topics, and can provide conversational responses to user queries. The medical community is actively investigating the utility of ChatGPT and similar models in the field of medicine by evaluating their knowledge and reasoning capabilities. 

About the study

In the current study, researchers collected a list of 125 commonly asked questions about heart failure from reputable medical organizations and Facebook support groups. After careful evaluation, 18 questions that contained duplicate content, were vaguely phrased, or did not address the patient’s perspective were eliminated.

The remaining 107 questions were then entered twice into both versions of ChatGPT using the “new chat” feature, thereby generating two responses to every question from each model. 

To assess the accuracy of the responses, two board-certified cardiologists independently graded them on a four-category scale: ‘comprehensive,’ ‘correct but inadequate,’ ‘some correct and some incorrect,’ and ‘completely incorrect.’ This evaluation process was performed for both ChatGPT-3.5 and ChatGPT-4 responses. The reproducibility of the responses was also evaluated by comparing the accuracy grades assigned to the two responses to each question from each model. 

Any discrepancies in grading between the reviewers were resolved by a third reviewer who is a board-certified specialist in advanced heart failure with over 20 years of clinical experience.

Study results 

The evaluation of responses from both ChatGPT models revealed that most responses were considered ‘comprehensive’ or ‘correct but inadequate.’ ChatGPT-4 exhibited a greater depth of comprehensive knowledge in the categories of ‘management’ and ‘basic knowledge’ as compared to ChatGPT-3.5.

The performance of ChatGPT-3.5 was better in the ‘other’ category, which encompassed topics like support, prognosis, and procedures. For example, ChatGPT-3.5 provided a general answer about the cardiac benefits of sodium-glucose cotransporter-2 (SGLT2) inhibitors, whereas ChatGPT-4 offered a more detailed yet concise response regarding the impact of these agents on diuresis and blood pressure.

About 2% of responses from ChatGPT-3.5 were graded as ‘some correct and some incorrect,’ while no responses from ChatGPT-4 fell into this category or the ‘completely incorrect’ category. When examining reproducibility, both models provided consistent responses for most questions, with ChatGPT-3.5 scoring more than 94% in all categories and ChatGPT-4 achieving 100% reproducibility for all answers. 

Conclusions 

The present study reported that ChatGPT-4 demonstrated superior performance as compared to ChatGPT-3.5 by providing more comprehensive responses to heart failure-related questions without any incorrect answers. Both models exhibited high reproducibility for most questions. These findings highlight the impressive capabilities and rapid advancement of LLMs in providing reliable and comprehensive information to patients.

ChatGPT has the potential to serve as a valuable resource for people with heart conditions by empowering them with knowledge under the guidance of healthcare providers. The user-friendly interface and human-like conversational responses make ChatGPT an appealing tool for patients seeking health-related information. The superior performance of ChatGPT-4 can be attributed to enhanced training, which focuses on better understanding user intent and handling complex scenarios.

While ChatGPT performed well in this study, there are important limitations to consider. Occasionally, the model may provide inaccurate but believable responses and, at times, nonsensical answers.

The accuracy of the model relies on its training dataset, which has not been disclosed, and its recommendations may vary across regions. Additional limitations include the inability to blind the reviewers to the versions of ChatGPT and the potential for bias introduced through subjective review, despite the use of a panel of multiple reviewers. 

Further research and exploration of ChatGPT’s capabilities and limitations are recommended to maximize its potential impact on improving patient outcomes. 

Journal reference:
  • Preliminary scientific report. King, R., Samaan, J. S., Yeo, Y. H., et al. (2023). Appropriateness of ChatGPT in answering heart failure related questions. medRxiv. doi:10.1101/2023.07.07.23292385. https://www.medrxiv.org/content/10.1101/2023.07.07.23292385v2 

Written by

Susha Cheriyedath

Susha has a Bachelor of Science (B.Sc.) degree in Chemistry and a Master of Science (M.Sc.) degree in Biochemistry from the University of Calicut, India. She has always had a keen interest in medical and health science. As part of her master’s degree, she specialized in Biochemistry, with an emphasis on Microbiology, Physiology, Biotechnology, and Nutrition. In her spare time, she loves to cook up a storm in the kitchen with her super-messy baking experiments.