Figure. Distribution of Average Quality and Empathy Ratings for Chatbot and Physician Responses to Patient Questions

Kernel density plots are shown for the average across 3 independent licensed health care professional evaluators using principles of crowd evaluation. A, The overall quality metric is shown. B, The overall empathy metric is shown.

Table. Example Questions With Physician and Chatbot Response
10 Comments for this article
Chatbots can simulate empathy through pre-programmed responses, but they cannot truly understand emotions.
Ediriweera Desapriya, PhD | Department of Pediatrics, faculty of medicine, UBC-BC Children's Hospital
Empathy, which is the ability to understand and share the feelings of others, is a complex emotional and cognitive process that involves more than just providing information. It involves active listening, genuine concern, and the ability to understand and respond to the emotional needs of patients. While chatbots can simulate empathy through pre-programmed responses, they cannot truly understand the emotions and needs of human users in the same way that a human healthcare professional can. While chatbots may not be able to fully replicate the human element of empathy, they can still be useful tools for training healthcare professionals and improving patient communication and engagement.

The study results suggest that longer responses from healthcare professionals are more popular and, therefore, there might be a correlation between the length of chatbot responses and their ratings. While it is true that longer responses may provide more information and be perceived as more informative, it is not necessarily true that longer responses are always better or more empathetic.

Furthermore, I have a question and a concern about whether longer chatbot responses are simply a result of the machine having more time to respond rather than of its providing more empathetic or informative answers. Because machines have plenty of time compared with busy clinicians, it is important to ensure that chatbot responses are not longer simply for the sake of being longer but rather provide relevant and useful information to researchers.
CONFLICT OF INTEREST: None Reported
Authors' Conflicts of Interest Disclosure
Fiore Mastroianni, MD |

Several of the people who rated the responses are authors with financial conflicts of interest related to artificial intelligence or chat bot technology. They may be more likely to be able to recognize the kinds of responses produced by chat bots due to their work in the field. Moreover, they may stand to gain financially if the computer responses were found to be better.
CONFLICT OF INTEREST: None Reported
Intriguing investigation with significant limitations
Hong Sun, PhD | Principal Data Scientist, clinalytix department, Dedalus Healthcare
Thanks for reporting this interesting comparison! As a data scientist working with medical artificial intelligence and a big fan of ChatGPT, I am not surprised to see its encouraging performance in this report. Nevertheless, I also find this article cited on some social media as evidence that the chatbot is surpassing human physicians; therefore, I would like to raise some limitations of this study:

Firstly, the answers from a Q&A forum are not representative of real clinical practice. In addition, the answer providers in the Q&A forum are giving short answers outside their clinical practice time, so their performance should not be considered the normal level of physicians.

Secondly, the answers from the chatbot are consistently long and detailed, giving detailed explanations and guidance compared with those from human physicians. Even in the sensitivity test restricted to physician responses longer than the 75th percentile (≥62 words), the word count is still much lower than the chatbot's 211 [168-245] words. Given such a great gap in word counts, the evaluation of empathy is very biased.

The chatbot shows its potential to improve the communication between physicians and patients. I am wondering whether it would be an interesting experiment to take both the questions and the physicians' responses as inputs and ask the chatbot to generate a reply to the patients. This would allow an assessment of whether there is still added value from physicians in this Q&A forum setting.

CONFLICT OF INTEREST: None Reported
ChatGPT Training Sets
Catherine Mac Lean, MD, PhD | Hospital for Special Surgery, New York, NY
First off, kudos to the authors for an interesting, informative and timely article.

Can the authors comment on whether data from Reddit's r/AskDocs might have been included in ChatGPT's training set and, if so, how this should inform interpretation of the study results? I directed this question to ChatGPT and got this response:

"As an AI language model, I don't have direct access to any particular subreddit, including r/AskDocs. However, it is possible that some of the text from that subreddit was included in the diverse range of sources used to train me, along
with text from many other websites and sources."

The authors may have better information.
CONFLICT OF INTEREST: None Reported
Chatbot versus physician performance.
Basil Fadipe, MBBS | Justin Fadipe Centre, Dominica
At one level the tentative results from this study are encouraging if not inspiring.
At another level, however, it may be unwise to jump to any definitive conclusions too soon. Much of the interaction between the physician and patient in real life has as much to do with the physician taking cues not only from what the patient sitting in the room says but also from the unspoken words that may not infrequently carry even more clues than the spoken. The experienced clinician deploys all five senses (perhaps even a sixth) to mine the clinical dilemma. Can a chatbot ‘reach beneath the surface’ to similar effect?
CONFLICT OF INTEREST: None Reported
Actual Physicians Contributing?
Vishali Ramsaroop, MD | Baystate Health
While the authors appear well-intentioned in attempting to describe a potential role for ChatBot responses in modern healthcare, these conclusions are based on an inherently flawed design to some extent.

The authors do attempt to highlight the major limitation of the study as the use of the online forum questions and answers, but it would appear the extent of this limitation is not explored in its entirety. The subreddit responses were sourced from r/AskDocs, an online forum that simply requires photographic evidence of one's hospital identification and minimal other steps for verification before designation as a vetted physician able to respond to community questions. The relatively poor authentication process calls into question the validity of the identities of those offering answers as physicians on the online forum.

Additionally, even if questions are in fact answered by genuine physicians, anonymized responses on an online forum with no personal interaction, relationship, or evaluation of the patient create the perfect environment for the Hawthorne effect to blossom. Were "physicians" aware that their responses would be subject to review similar to that expected in a professional healthcare environment? Not to mention that "physicians" of many different specialties are allowed to respond to broad-based questions, many of which lack important historical and clinical details. For example, an OB/GYN can comment on neuroprognostication after cardiac arrest; is this a representation of the current standard of care in real practice?

As an academic Intensivist, I have curiously posed complex diagnostic questions to ChatGPT to investigate the responses. Frequently, verbose responses inundated with redundancies ultimately recommend "consulting a Critical Care Specialist" and other specialists to determine next management steps. Inherently, the chatbot's responses cannot be compared to actual specialist responses when those very AI-generated responses recommend real-life consultation.

Certainly, there will be emerging roles for AI in the evolving healthcare landscape in the coming years - the onus is upon us to ensure that role is genuinely beneficial to both clinician and patient.
CONFLICT OF INTEREST: None Reported
AI Empathy vs That of Current Patient Physician Relationships
Richard Lopchinsky, MD | Clinical Professor of Surgery (emeritus), University of Arizona, Phoenix
Having recently undergone major treatment at a well-known university hospital, I can say that once AI has been trained, it has to be better than the current system. The doctor I was referred to saw me pretreatment twice. After hospitalization I never saw her again, nor did she contact me. Every week there was a new doctor leading the team and after discharge every visit was by a different APN and once by a different doctor. The current system is broken and there is no patient-physician relationship. Anything we can do to improve the situation is worth a try.
CONFLICT OF INTEREST: None Reported
J.R.B. Hutchinson, M.D. Oto-HNS, Prior General Medical & OB-GYN Practice and USAF assignment now retired.
Jos Hutchinson, MD, FACS, | Retired Otolaryngologist
Despite the value of our newer technologies, they can't replace the human interaction and experience gained from the physician actually applying them. Body language and voice tone are not replaced by our technology advances although certainly enhanced.

We are in a time of rapid change, and that always results in disruption. Adopting and adjusting are now where we are, and that takes time. Let's be patient as we travel different roads and gradually learn which new ways work, which don't, and which will need to be modified or removed depending on the results.
CONFLICT OF INTEREST: None Reported
AI in augmenting healthcare communication
Eric Shane |
This study explores the intriguing dynamic between traditional physician responses and those provided by artificial intelligence chatbots to patient queries on public social media platforms. By examining the effectiveness and accuracy of both approaches, the research sheds light on the potential role of AI in augmenting healthcare communication and information dissemination in digital spaces.

Best regards
CEO: https://technoguest.com/
CONFLICT OF INTEREST: None Reported
The Implications of Using Reddit
Raza Ali | Las Positas College
It is valuable to note that the social media site used in this study was Reddit. The forum r/AskDocs expects significantly less professionalism and empathy than a doctor's formal communication or live visit. To control for this, perhaps a contextual modification could be made to the LLM prompt.
CONFLICT OF INTEREST: None Reported
Original Investigation
April 28, 2023

Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum

Author Affiliations
  • 1Qualcomm Institute, University of California San Diego, La Jolla
  • 2Division of Infectious Diseases and Global Public Health, Department of Medicine, University of California San Diego, La Jolla
  • 3Department of Computer Science, Bryn Mawr College, Bryn Mawr, Pennsylvania
  • 4Department of Computer Science, Johns Hopkins University, Baltimore, Maryland
  • 5Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla
  • 6Human Longevity, La Jolla, California
  • 7Naval Health Research Center, Navy, San Diego, California
  • 8Division of Blood and Marrow Transplantation, Department of Medicine, University of California San Diego, La Jolla
  • 9Moores Cancer Center, University of California San Diego, La Jolla
  • 10Department of Biomedical Informatics, University of California San Diego, La Jolla
  • 11Altman Clinical Translational Research Institute, University of California San Diego, La Jolla
JAMA Intern Med. 2023;183(6):589-596. doi:10.1001/jamainternmed.2023.1838
Key Points

Question  Can an artificial intelligence chatbot assistant provide responses to patient questions that are of comparable quality and empathy to those written by physicians?

Findings  In this cross-sectional study of 195 randomly drawn patient questions from a social media forum, a team of licensed health care professionals compared physicians' and chatbot responses to patients' questions asked on a public social media forum. The chatbot responses were preferred over physician responses and rated significantly higher for both quality and empathy.

Meaning  These results suggest that artificial intelligence assistants may be able to aid in drafting responses to patient questions.

Abstract

Importance  The rapid expansion of virtual health care has caused a surge in patient messages concomitant with more work and burnout among health care professionals. Artificial intelligence (AI) assistants could potentially aid in creating answers to patient questions by drafting responses that could be reviewed by clinicians.

Objective  To evaluate the ability of an AI chatbot assistant (ChatGPT), released in November 2022, to provide quality and empathetic responses to patient questions.

Design, Setting, and Participants  In this cross-sectional study, a public and nonidentifiable database of questions from a public social media forum (Reddit’s r/AskDocs) was used to randomly draw 195 exchanges from October 2022 where a verified physician responded to a public question. Chatbot responses were generated by entering the original question into a fresh session (without prior questions having been asked in the session) on December 22 and 23, 2022. The original question along with anonymized and randomly ordered physician and chatbot responses were evaluated in triplicate by a team of licensed health care professionals. Evaluators chose “which response was better” and judged both “the quality of information provided” (very poor, poor, acceptable, good, or very good) and “the empathy or bedside manner provided” (not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic). Mean outcomes were ordered on a 1 to 5 scale and compared between chatbot and physicians.

Results  Of the 195 questions and responses, evaluators preferred chatbot responses to physician responses in 78.6% (95% CI, 75.0%-81.8%) of the 585 evaluations. Mean (IQR) physician responses were significantly shorter than chatbot responses (52 [17-62] words vs 211 [168-245] words; t = 25.4; P < .001). Chatbot responses were rated as significantly higher in quality than physician responses (t = 13.3; P < .001). The proportion of responses rated as good or very good quality (≥4), for instance, was higher for chatbot than physicians (chatbot: 78.5%, 95% CI, 72.3%-84.1%; physicians: 22.1%, 95% CI, 16.4%-28.2%). This amounted to a 3.6 times higher prevalence of good or very good quality responses for the chatbot. Chatbot responses were also rated significantly more empathetic than physician responses (t = 18.9; P < .001). The proportion of responses rated empathetic or very empathetic (≥4) was higher for chatbot than for physicians (chatbot: 45.1%, 95% CI, 38.5%-51.8%; physicians: 4.6%, 95% CI, 2.1%-7.7%). This amounted to a 9.8 times higher prevalence of empathetic or very empathetic responses for the chatbot.

Conclusions  In this cross-sectional study, a chatbot generated quality and empathetic responses to patient questions posed in an online forum. Further exploration of this technology is warranted in clinical settings, such as using a chatbot to draft responses that physicians could then edit. Randomized trials could further assess whether using AI assistants might improve responses, lower clinician burnout, and improve patient outcomes.

Introduction

The COVID-19 pandemic hastened the adoption of virtual health care,1 concomitant with a 1.6-fold increase in electronic patient messages, with each message adding 2.3 minutes of work in the electronic health record and more after-hours work.2 Additional messaging volume predicts increased burnout for clinicians,3 with 62% of physicians, a record high, reporting at least 1 burnout symptom.4 More messages also make it more likely that patients' messages will go unanswered or receive unhelpful responses.

Some patient messages are unsolicited questions seeking medical advice, which also take more skill and time to answer than generic messages (eg, scheduling an appointment, accessing test results). Current approaches to decreasing these message burdens include limiting notifications, billing for responses, or delegating responses to less trained support staff.5 Unfortunately, these strategies can limit access to high-quality health care. For instance, when patients were told they might be billed for messaging, they sent fewer messages and had shorter back-and-forth exchanges with clinicians.6 Artificial intelligence (AI) assistants are an unexplored resource for addressing the burden of messages. While some proprietary AI assistants show promise,7 some public tools have failed to recognize even basic health concepts.8,9

ChatGPT10 represents a new generation of AI technologies driven by advances in large language models.11 ChatGPT reached 100 million users within 64 days of its November 30, 2022, release and is widely recognized for its ability to write near-human-quality text on a wide range of topics.12 The system was not developed to provide health care, and its ability to help address patient questions is unexplored.13 We tested ChatGPT's ability to respond with high-quality and empathetic answers to patients' health care questions by comparing the chatbot responses with physicians' responses to questions posted on a public social media forum.

Methods

Studying patient questions from health care systems using a chatbot was not possible in this cross-sectional study because, at the time, the AI was not compliant with the Health Insurance Portability and Accountability Act of 1996 (HIPAA) regulations. Deidentifying patient messages by removing unique information to make them HIPAA compliant could change the content enough to alter patient questions and affect the chatbot responses. Additionally, open science requires public data to enable research to build on and critique prior research.14 Lastly, media reports suggest that physicians are already integrating chatbots into their practices without evidence. For reasons of need and practicality, and to empower the development of a rapidly available and shareable database of patient questions, we collected public patient questions and physician responses posted to an online social media forum, Reddit's r/AskDocs.15

The online forum, r/AskDocs, is a subreddit with approximately 474 000 members where users can post medical questions and verified health care professional volunteers submit answers.15 While anyone can respond to a question, subreddit moderators verify health care professionals' credentials; responses display the respondent's level of credential (eg, physician), and questions are flagged when they have already been answered. Background and use cases for data in this online forum are described by Nobles et al.16

All analyses adhered to Reddit's terms and conditions17 and were determined by the University of California, San Diego, human research protections program to be exempt. Informed consent was not required because the data were public and did not contain identifiable information (45 CFR §46). Direct quotes from posts were summarized to protect patients' identities.18 Actual quotes were used to obtain the chatbot responses.

Our study’s target sample was 200, assuming 80% power to detect a 10 percentage point difference between physician and chatbot responses (45% vs 55%). The analytical sample ultimately contained 195 randomly drawn exchanges, ie, a unique member’s question and unique physician’s answer, during October 2022. The original question, including the title and text, was retained for analysis, and the physician response was retained as a benchmark response. Only physician responses were studied because we expected that physicians’ responses are generally superior to those of other health care professionals or laypersons. When a physician replied more than once, we only considered the first response, although the results were nearly identical regardless of our decision to exclude or include follow-up physician responses (see eTable 1 in Supplement 1). On December 22 and 23, 2022, the original full text of the question was put into a fresh chatbot session, in which the session was free of prior questions asked that could bias the results (version GPT-3.5, OpenAI), and the chatbot response was saved.
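
The power assumption stated above can be illustrated with a generic two-sample proportion calculation in base R, the software the authors report using for analysis. This is only a sketch under the stated parameters, not the authors' actual calculation; the study's paired, triplicate evaluation design is not captured by this unpaired formula, so the output will not match the reported target of 200 exchanges.

```r
# Illustrative only: generic sample size for detecting a 10-percentage-point
# difference (45% vs 55%) with 80% power at a two-sided alpha of .05.
# This does not reflect the study's paired, triplicate evaluation design.
power.prop.test(p1 = 0.45, p2 = 0.55, power = 0.80, sig.level = 0.05)
```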

The original question, physician response, and chatbot response were reviewed by 3 members of a team of licensed health care professionals working in pediatrics, geriatrics, internal medicine, oncology, infectious disease, and preventive medicine (J.B.K., D.J.F., A.M.G., M.H., D.M.S.). The evaluators were shown the patient's entire question, the physician's response, and the chatbot response. Responses were randomly ordered, stripped of revealing information (eg, statements such as "I'm an artificial intelligence"), and labeled response 1 or response 2 to blind evaluators to the identity of the responders. The evaluators were instructed to read the entire patient question and both responses before answering questions about the interaction. First, evaluators were asked "which response [was] better" (ie, response 1 or response 2). Then, using Likert scales, evaluators judged both "the quality of information provided" (very poor, poor, acceptable, good, or very good) and "the empathy or bedside manner provided" (not empathetic, slightly empathetic, moderately empathetic, empathetic, or very empathetic) of responses. Response options were translated into a 1 to 5 scale, where higher values indicated greater quality or empathy.
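
A minimal sketch of the blinding and scoring steps described above, written in R with hypothetical object names (the authors do not publish their evaluation code): responses are randomly ordered for display, and Likert labels are mapped onto the 1 to 5 scale.

```r
# Hypothetical sketch of the blinding and Likert scoring described above.
set.seed(1)

quality_levels <- c("very poor", "poor", "acceptable", "good", "very good")
empathy_levels <- c("not empathetic", "slightly empathetic",
                    "moderately empathetic", "empathetic", "very empathetic")

# Randomly decide, for each of the 195 exchanges, whether the physician
# response is displayed as "response 1" (TRUE) or "response 2" (FALSE).
physician_shown_first <- sample(c(TRUE, FALSE), size = 195, replace = TRUE)

# Map an evaluator's Likert label to its 1-5 score.
likert_to_score <- function(label, levels) match(label, levels)
likert_to_score("good", quality_levels)                 # returns 4
likert_to_score("slightly empathetic", empathy_levels)  # returns 2
```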

We relied on a crowd (or ensemble) scoring strategy,19 in which scores were averaged across evaluators for each exchange studied. This method is used when there is no ground truth in the outcome being studied and the evaluated outcomes themselves are inherently subjective (eg, judging figure skating, National Institutes of Health grants, concept discovery). As a result, the mean score reflects evaluator consensus, and disagreements (or inherent ambiguity or uncertainty) between evaluators are reflected in the score variance (eg, the CIs will, in part, be conditional on evaluator agreement).20
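
As a concrete example of this crowd (ensemble) scoring, the sketch below averages three evaluators' scores per exchange in base R; the data frame and its values are hypothetical, not the study data.

```r
# Hypothetical long-format ratings: one row per evaluation
# (exchange, responder, evaluator, quality 1-5, empathy 1-5).
ratings <- data.frame(
  exchange_id = rep(1:2, each = 6),
  responder   = rep(rep(c("physician", "chatbot"), each = 3), times = 2),
  evaluator   = rep(1:3, times = 4),
  quality     = c(3, 4, 3, 4, 5, 4, 2, 3, 3, 4, 4, 5),
  empathy     = c(2, 2, 3, 4, 4, 3, 1, 2, 2, 3, 4, 4)
)

# Crowd score: the mean across the 3 evaluators for each exchange and
# responder; disagreement between evaluators shows up as score variance.
crowd_scores <- aggregate(cbind(quality, empathy) ~ exchange_id + responder,
                          data = ratings, FUN = mean)
crowd_scores
```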

We compared the number of words in physician and chatbot responses and reported the percentage of responses for which the chatbot was preferred. Using 2-tailed t tests, we compared mean quality and empathy scores of physician responses with chatbot responses. Furthermore, we compared rates of responses above or below important thresholds, such as less than acceptable quality, and computed prevalence ratios comparing the chatbot to physician responses. The significance threshold used was P < .05. All statistical analyses were performed in R statistical software, version 4.0.2 (R Project for Statistical Computing).
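
The comparisons in this paragraph can be sketched in base R as follows, using simulated crowd-averaged scores rather than the study data; the t test is shown unpaired for simplicity because the text does not specify pairing.

```r
# Simulated crowd-averaged scores (1-5 scale), one value per exchange.
set.seed(2)
physician_quality <- pmin(pmax(rnorm(195, mean = 3.3, sd = 0.8), 1), 5)
chatbot_quality   <- pmin(pmax(rnorm(195, mean = 4.1, sd = 0.5), 1), 5)

# Two-tailed t test comparing mean quality scores between responders.
t.test(chatbot_quality, physician_quality)

# Prevalence ratio for responses above a threshold of interest,
# eg, rated good or very good (score >= 4); the study reports, for example,
# 78.5% / 22.1% = 3.6 for this contrast.
p_chatbot   <- mean(chatbot_quality >= 4)
p_physician <- mean(physician_quality >= 4)
p_chatbot / p_physician
```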

We also reported the Pearson correlation between quality and empathy scores. Assuming that in-clinic patient questions may be longer than those posted on the online forum, we also assessed the extent to which subsetting the data into longer replies authored by physicians (including those above the median or 75th percentile length) changed evaluator preferences and the quality or empathy ratings relative to the chatbot responses.
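
A similar sketch for the correlation and length-based sensitivity checks described here, again on simulated data with hypothetical variable names:

```r
# Simulated crowd-averaged scores and physician word counts (not study data).
set.seed(3)
n <- 195
physician_quality <- pmin(pmax(rnorm(n, mean = 3.3, sd = 0.8), 1), 5)
physician_empathy <- pmin(pmax(rnorm(n, mean = 2.2, sd = 0.7), 1), 5)
chatbot_quality   <- pmin(pmax(rnorm(n, mean = 4.1, sd = 0.5), 1), 5)
physician_words   <- rpois(n, lambda = 52)

# Pearson correlation between quality and empathy scores.
cor(physician_quality, physician_empathy, method = "pearson")

# Sensitivity analysis: keep only longer physician replies (above the median,
# then above the 75th percentile, of word count) and recompare with the chatbot.
above_median <- physician_words > median(physician_words)
above_p75    <- physician_words > quantile(physician_words, 0.75)
t.test(chatbot_quality[above_median], physician_quality[above_median])
t.test(chatbot_quality[above_p75], physician_quality[above_p75])
```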

Results

The sample contained 195 randomly drawn exchanges with a unique member-patient's question and unique physician's answer. The mean (IQR) length of patient questions was 180 (94-223) words. Mean (IQR) physician responses were significantly shorter than the chatbot responses (52 [17-62] words vs 211 [168-245] words; t = 25.4; P < .001). A total of 182 (94%) of these exchanges consisted of a single message and only a single response from a physician. The remaining 13 (6%) exchanges consisted of a single message but with 2 separate physician responses. Second responses appeared incidental (eg, an additional response was given when a post had already been answered) (eTable 1 in Supplement 1).

The evaluators preferred the chatbot response to the physician response in 78.6% (95% CI, 75.0%-81.8%) of the 585 evaluations. Summaries of example questions and the corresponding physician and chatbot responses are shown in the Table.

Evaluators also rated chatbot responses as significantly higher in quality than physician responses (t = 13.3; P < .001). The mean rating for chatbot responses was better than good (4.13; 95% CI, 4.05-4.20), while on average, physicians' responses were rated 21% lower, corresponding to an acceptable response (3.26; 95% CI, 3.15-3.37) (Figure). The proportion of responses rated less than acceptable quality (<3) was higher for physician responses than for the chatbot (physicians: 27.2%; 95% CI, 21.0%-33.3%; chatbot: 2.6%; 95% CI, 0.5%-5.1%). This amounted to a 10.6 times higher prevalence of less than acceptable quality responses for physicians. Conversely, the proportion of responses rated good or very good quality was higher for the chatbot than for physicians (physicians: 22.1%; 95% CI, 16.4%-28.2%; chatbot: 78.5%; 95% CI, 72.3%-84.1%). This amounted to a 3.6 times higher prevalence of good or very good quality responses for the chatbot.

Chatbot responses (3.65; 95% CI, 3.55-3.75) were rated significantly more empathetic (t = 18.9; P < .001) than physician responses (2.15; 95% CI, 2.03-2.27). Specifically, physician responses were 41% less empathetic than chatbot responses, which generally equated to physician responses being slightly empathetic and chatbot responses being empathetic. Further, the proportion of responses rated less than slightly empathetic (<3) was higher for physicians than for the chatbot (physicians: 80.5%; 95% CI, 74.4%-85.6%; chatbot: 14.9%; 95% CI, 9.7%-20.0%). This amounted to a 5.4 times higher prevalence of less than slightly empathetic responses for physicians. The proportion of responses rated empathetic or very empathetic was higher for the chatbot than for physicians (physicians: 4.6%; 95% CI, 2.1%-7.7%; chatbot: 45.1%; 95% CI, 38.5%-51.8%). This amounted to a 9.8 times higher prevalence of empathetic or very empathetic responses for the chatbot.

The Pearson correlation coefficient between quality and empathy scores of responses authored by physicians was r = 0.59. The correlation coefficient between quality and empathy scores of responses generated by the chatbot was r = 0.32. A sensitivity analysis showed that longer physician responses were preferred at higher rates and scored higher for empathy and quality but remained significantly below chatbot scores (eFigure in Supplement 1). For instance, among the subset of physician responses longer than the median length, evaluators preferred the chatbot response to the physician response in 71.4% (95% CI, 66.3%-76.9%) of evaluations, and among physician responses in the top 75th percentile of length, they preferred the chatbot response in 62.0% (95% CI, 54.0%-69.3%) of evaluations.

Discussion

In this cross-sectional study within the context of patient questions in a public online forum, chatbot responses were longer than physician responses, and the study’s health care professional evaluators preferred chatbot-generated responses over physician responses 4 to 1. Additionally, chatbot responses were rated significantly higher for both quality and empathy, even when compared with the longest physician-authored responses.

We do not know how chatbots will perform responding to patient questions in a clinical setting, yet the present study should motivate research into the adoption of AI assistants for messaging, a resource previously overlooked.5 For instance, as tested, chatbots could assist clinicians when messaging with patients by drafting a message based on a patient's query for physicians or support staff to edit. This approach fits into current message response strategies, where teams of clinicians often rely on canned responses or have support staff draft replies. Such an AI-assisted approach could unlock untapped productivity so that clinical staff can use the time savings for more complex tasks, resulting in more consistent responses and helping staff improve their overall communication skills by reviewing and modifying AI-written drafts.

In addition to improving workflow, investments into AI assistant messaging could affect patient outcomes. If more patients' questions are answered quickly, with empathy, and to a high standard, it might reduce unnecessary clinical visits, freeing up resources for those who need them.21 Moreover, messaging is a critical resource for fostering patient equity, because individuals who have mobility limitations, work irregular hours, or fear medical bills are potentially more likely to turn to messaging.22 High-quality responses might also improve patient outcomes.23 For some patients, responsive messaging may collaterally affect health behaviors, including medication adherence, compliance (eg, diet), and fewer missed appointments. Evaluating AI assistant technologies in the context of randomized clinical trials will be essential to their implementation, including studying outcomes for clinical staff, such as physician burnout, job satisfaction, and engagement.

Limitations

The main study limitation was the use of the online forum question and answer exchanges. Such messages may not reflect typical patient-physician questions. For instance, we only studied responding to questions in isolation, whereas actual physicians may form answers based on established patient-physician relationships. We do not know to what extent clinician responses incorporate this level of personalization, nor have we evaluated the chatbot's ability to provide similar details extracted from the electronic health record. Furthermore, while we demonstrate the overall quality of chatbot responses, we have not evaluated how an AI assistant will enhance clinicians responding to patient questions. The value added will vary in many ways across hospitals, specialties, and clinicians, as it augments, rather than replaces, existing processes for message-based care delivery. Another limitation is that general clinical questions are just one reason patients message their clinicians. Other common messages are requests for sooner appointments, medication refills, and questions about their specific test results, their personal treatment plans, and their prognosis. Additional limitations of this study include the following: the summary measures of quality and empathy were not pilot tested or validated; this study's evaluators, despite being blinded to the source of a response and to any initial results, were also coauthors, which could have biased their assessments; the additional length of the chatbot responses could have been erroneously associated with greater empathy; and evaluators did not independently and specifically assess the physician or chatbot responses for accuracy or fabricated information, though this was considered as a subcomponent of each quality evaluation and overall response preference.

The use of a public database ensures that the present study can be replicated, expanded, and validated, especially as new AI products become available. For example, we considered only unidimensional metrics of response quality and empathy, but further research can clarify subdimensions of quality (eg, responsiveness or accuracy) and empathy (eg, communicating that the patient is understood or expressing remorse for patient outcomes). Additionally, we did not evaluate patient assessments, whose judgments of empathy may differ from those of our health care professional evaluators and who may have adverse reactions to AI assistant–generated responses. Last, using AI assistants in health care poses a range of ethical concerns24 that need to be addressed prior to implementation of these technologies, including the need for human review of AI-generated content for accuracy and potential false or fabricated information.

Conclusions

While this cross-sectional study has demonstrated promising results in the use of AI assistants for patient questions, it is crucial to note that further research is necessary before any definitive conclusions can be made regarding their potential effect in clinical settings. Despite the limitations of this study and the frequent overhyping of new technologies,25,26 studying the addition of AI assistants to patient messaging workflows holds promise with the potential to improve both clinician and patient outcomes.

Article Information

Accepted for Publication: February 28, 2023.

Published Online: April 28, 2023. doi:10.1001/jamainternmed.2023.1838

Correction: This article was corrected on May 8, 2023, to clarify in 2 instances that chatbots cannot author responses or be considered authors, rather they are generating responses and are considered responders, and to clarify that though accuracy of responses was not specifically and independently evaluated in the study, this was considered as a subcomponent of the quality evaluations and overall preferences of the evaluators.

Corresponding Author: John W. Ayers, PhD, MA, Qualcomm Institute, University of California San Diego, La Jolla, CA (ayers.john.w@gmail.com).

Author Contributions: Dr Ayers had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Ayers, Poliak, Dredze, Leas, Faix, Longhurst, Smith.

Acquisition, analysis, or interpretation of data: Ayers, Poliak, Leas, Zhu, Kelley, Faix, Goodman, Longhurst, Hogarth, Smith.

Drafting of the manuscript: Ayers, Poliak, Dredze, Leas, Zhu, Kelley, Longhurst, Smith.

Critical revision of the manuscript for important intellectual content: Ayers, Poliak, Dredze, Leas, Zhu, Faix, Goodman, Longhurst, Hogarth, Smith.

Statistical analysis: Leas, Zhu, Faix.

Obtained funding: Smith.

Administrative, technical, or material support: Poliak, Dredze, Leas, Kelley, Longhurst, Smith.

Supervision: Dredze, Smith.

Conflict of Interest Disclosures: Dr Ayers reported owning equity in companies focused on data analytics, Good Analytics, of which he was CEO until June 2018, and Health Watcher. Dr Dredze reported personal fees from Bloomberg LP and Sickweather outside the submitted work and owning an equity position in Good Analytics. Dr Leas reported personal fees from Good Analytics during the conduct of the study. Dr Goodman reported personal fees from Seattle Genetics outside the submitted work. Dr Hogarth reported being an adviser for LifeLink, a health care chatbot company. Dr Longhurst reported being an adviser and equity holder at Doximity. Dr Smith reported stock options from Linear Therapies, personal fees from Arena Pharmaceuticals, Model Medicines, Pharma Holdings, Bayer Pharmaceuticals, Evidera, Signant Health, Fluxergy, Lucira, and Kiadis outside the submitted work. No other disclosures were reported.

Funding/Support: This work was supported by the Burroughs Wellcome Fund, University of California San Diego PREPARE Institute, and National Institutes of Health. Dr Leas acknowledges salary support from grant K01DA054303 from the National Institute on Drug Abuse.

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Data Sharing Statement: See Supplement 2.

References
1. Zulman DM, Verghese A. Virtual care, telemedicine visits, and real connection in the era of COVID-19: unforeseen opportunity in the face of adversity. JAMA. 2021;325(5):437-438. doi:10.1001/jama.2020.27304
2. Holmgren AJ, Downing NL, Tang M, Sharp C, Longhurst C, Huckman RS. Assessing the impact of the COVID-19 pandemic on clinician ambulatory electronic health record use. J Am Med Inform Assoc. 2022;29(3):453-460. doi:10.1093/jamia/ocab268
3. Tai-Seale M, Dillon EC, Yang Y, et al. Physicians' well-being linked to in-basket messages generated by algorithms in electronic health records. Health Aff (Millwood). 2019;38(7):1073-1078. doi:10.1377/hlthaff.2018.05509
4. Shanafelt TD, West CP, Dyrbye LN, et al. Changes in burnout and satisfaction with work-life integration in physicians during the first 2 years of the COVID-19 pandemic. Mayo Clin Proc. 2022;97(12):2248-2258. doi:10.1016/j.mayocp.2022.09.002
5. Sinsky CA, Shanafelt TD, Ripp JA. The electronic health record inbox: recommendations for relief. J Gen Intern Med. 2022;37(15):4002-4003. doi:10.1007/s11606-022-07766-0
6. Holmgren AJ, Byron ME, Grouse CK, Adler-Milstein J. Association between billing patient portal messages as e-visits and patient messaging volume. JAMA. 2023;329(4):339-342. doi:10.1001/jama.2022.24710
7. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv:2212.13138v1.
8. Nobles AL, Leas EC, Caputi TL, Zhu SH, Strathdee SA, Ayers JW. Responses to addiction help-seeking from Alexa, Siri, Google Assistant, Cortana, and Bixby intelligent virtual assistants. NPJ Digit Med. 2020;3(1):11. doi:10.1038/s41746-019-0215-9
9. Miner AS, Milstein A, Hancock JT. Talking to machines about personal mental health problems. JAMA. 2017;318(13):1217-1218. doi:10.1001/jama.2017.14151
10. Chat GPT. Accessed December 22, 2023. https://openai.com/blog/chatgpt
11. Patel AS. Docs get clever with ChatGPT. Medscape. February 3, 2023. Accessed April 11, 2023. https://www.medscape.com/viewarticle/987526
12. Hu K. ChatGPT sets record for fastest-growing user base - analyst note. Reuters. February 2023. Accessed April 14, 2023. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
13. Devlin J, Chang M, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v2.
14. Ross JS, Krumholz HM. Ushering in a new era of open science through data sharing: the wall must come down. JAMA. 2013;309(13):1355-1356. doi:10.1001/jama.2013.1299
15. Ask Docs. Reddit. Accessed October 2022. https://reddit.com/r/AskDocs/
16. Nobles AL, Leas EC, Dredze M, Ayers JW. Examining peer-to-peer and patient-provider interactions on a social media community facilitating ask the doctor services. Proc Int AAAI Conf Weblogs Soc Media. 2020;14:464-475. doi:10.1609/icwsm.v14i1.7315
17. Pushshift Reddit API v4.0 Documentation. 2018. Accessed April 14, 2023. https://reddit-api.readthedocs.io/en/latest/
18. Ayers JW, Caputi TC, Nebeker C, Dredze M. Don't quote me: reverse identification of research participants in social media studies. Nature Digital Medicine. 2018. Accessed April 11, 2023. https://www.nature.com/articles/s41746-018-0036-2
19. Chang N, Lee-Goldman R, Tseng M. Linguistic wisdom from the crowd. Proceedings of the Third AAAI Conference on Human Computation and Crowdsourcing. 2016. Accessed April 11, 2023. https://ojs.aaai.org/index.php/HCOMP/article/view/13266/13114
20. Aroyo L, Dumitrache A, Paritosh P, Quinn A, Welty C. Subjectivity, ambiguity and disagreement in crowdsourcing workshop (SAD2018). HCOMP 2018. Accessed April 11, 2023. https://www.aconf.org/conf_160152.html
21. Rasu RS, Bawa WA, Suminski R, Snella K, Warady B. Health literacy impact on national healthcare utilization and expenditure. Int J Health Policy Manag. 2015;4(11):747-755. doi:10.15171/ijhpm.2015.151
22. Herzer KR, Pronovost PJ. Ensuring quality in the era of virtual care. JAMA. 2021;325(5):429-430. doi:10.1001/jama.2020.24955
23. Rotenstein LS, Holmgren AJ, Healey MJ, et al. Association between electronic health record time and quality of care metrics in primary care. JAMA Netw Open. 2022;5(10):e2237086. doi:10.1001/jamanetworkopen.2022.37086
24. McGreevey JD III, Hanson CW III, Koppel R. Clinical, legal, and ethical aspects of artificial intelligence-assisted conversational agents in health care. JAMA. 2020;324(6):552-553. doi:10.1001/jama.2020.2724
25. Santillana M, Zhang DW, Althouse BM, Ayers JW. What can digital disease detection learn from (an external revision to) Google Flu Trends? Am J Prev Med. 2014;47(3):341-347. doi:10.1016/j.amepre.2014.05.020
26. Lazer D, Kennedy R, King G, Vespignani A. Big data—the parable of Google Flu: traps in big data analysis. Science. 2014;343(6176):1203-1205. doi:10.1126/science.1248506