Use of Artificial Intelligence Chatbots for Cancer Treatment Information

This survey study examines the performance of a large language model chatbot in providing cancer treatment recommendations that are concordant with National Comprehensive Cancer Network guidelines.


Large language models (LLMs) underlying chatbots1 can mimic human language and quickly return detailed and coherent-seeming responses. These properties can obscure the fact that chatbots may provide inaccurate information. Because patients often use the internet for self-education,2 some will undoubtedly use LLM chatbots to find cancer-related medical information, which could generate and amplify misinformation. We evaluated an LLM chatbot's performance in providing breast, prostate, and lung cancer treatment recommendations concordant with National Comprehensive Cancer Network (NCCN) guidelines.3
Methods | We developed 4 zero-shot prompt templates to query treatment recommendations (eMethods and eFigure in Supplement 1). These templates do not provide the model with examples of correct responses. Templates were used to create 4 prompt variations for each of 26 diagnosis descriptions (cancer types with or without relevant extent-of-disease modifiers), for a total of 104 prompts. Prompts were input to the GPT-3.5-turbo-0301 model via the ChatGPT (OpenAI) interface. In accordance with the Common Rule, institutional review board approval was not needed because human participants were not involved.
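The template-expansion step (4 zero-shot templates crossed with 26 diagnosis descriptions yielding 104 prompts) can be sketched as below; the template wordings and diagnosis strings here are illustrative stand-ins, not the actual ones from Supplement 1.

```python
from itertools import product

# Hypothetical stand-ins; the actual templates and diagnosis list are in Supplement 1.
TEMPLATES = [
    "What is the treatment for {dx}?",
    "What treatment is recommended for {dx}?",
    "How should {dx} be treated?",
    "What therapy should a patient with {dx} receive?",
]
DIAGNOSES = [
    "localized prostate cancer",
    "metastatic non-small cell lung cancer",
    # ... the study used 26 diagnosis descriptions in total
]

def build_prompts(templates, diagnoses):
    """Expand every zero-shot template over every diagnosis description."""
    return [t.format(dx=dx) for t, dx in product(templates, diagnoses)]

prompts = build_prompts(TEMPLATES, DIAGNOSES)
# With all 26 diagnoses, len(prompts) would be 4 * 26 == 104
```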

We benchmarked the chatbot's recommendations against 2021 NCCN guidelines because the chatbot's knowledge cutoff was September 2021. Five scoring criteria were developed to assess guideline concordance (Table 1). The output did not have to recommend all possible regimens to be considered concordant; the recommended treatment approach needed only to be an NCCN option. Concordance of the chatbot output with NCCN guidelines was assessed by 3 of 4 board-certified oncologists, and the majority score was taken as the final score. In cases of complete disagreement, the oncologist who had not previously seen the output adjudicated. Data were analyzed between March 2 and March 14, 2023, using Excel, version 16.74 (Microsoft Corp).
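The majority-rule scoring with adjudication described above reduces to a small decision procedure; this is a minimal sketch with illustrative function names, not the authors' actual analysis code.

```python
from collections import Counter

def final_score(annotator_scores, adjudicator_score=None):
    """Majority rule over 3 annotators' categorical scores.

    If at least 2 of 3 annotators agree, their score is final.
    If all 3 disagree completely, the score from a fourth oncologist
    (who had not previously seen the output) is used instead.
    """
    label, count = Counter(annotator_scores).most_common(1)[0]
    if count >= 2:
        return label
    return adjudicator_score  # complete disagreement -> adjudication

# Usage: two of three annotators agreeing settles the score.
assert final_score(["concordant", "concordant", "nonconcordant"]) == "concordant"
```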

Results | Outputs of 104 unique prompts were scored on 5 criteria, for a total of 520 scores. All 3 annotators agreed on 322 of 520 (61.9%) scores. Disagreements tended to arise when the output was unclear (eg, not specifying which of multiple treatments to combine). Table 1 shows agreement among the 4 prompts and the distribution of scores across cancer type and extent of disease. For 9 of 26 (34.6%) diagnosis descriptions, all 4 prompts yielded the same scores on each of the 5 scoring criteria. Table 2 shows scores across the prompts. The chatbot provided at least 1 recommendation for 102 of 104 (98%) prompts. All outputs with a recommendation included at least 1 NCCN-concordant treatment, but 35 of 102 (34.3%) of these outputs also recommended 1 or more nonconcordant treatments.
Responses were hallucinated (ie, were not part of any recommended treatment) in 13 of 104 (12.5%) outputs. Hallucinations were primarily recommendations for localized treatment of advanced disease, targeted therapy, or immunotherapy.
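The reported proportions follow directly from the counts in the Results; as a quick arithmetic check (counts taken from the text above):

```python
# Sanity-check the reported proportions from the Results section.
total_scores = 104 * 5            # 104 prompts x 5 scoring criteria
assert total_scores == 520

agree = 322                       # scores with full 3-annotator agreement
print(f"{agree / total_scores:.1%}")    # -> 61.9%

with_recommendation, nonconcordant = 102, 35
print(f"{nonconcordant / with_recommendation:.1%}")  # -> 34.3%

hallucinated = 13
print(f"{hallucinated / 104:.1%}")      # -> 12.5%
```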
c Lung cancer was queried separately from non-small cell lung cancer and small cell lung cancer.
d Slight misalignment of categorical scores from questions 2 and 3 resulted from majority rules. For example, for question 3, NA = 70 instead of 69 (66 + 3) because of majority voting.
Discussion | Clinicians should advise patients that LLM chatbots are not a reliable source of treatment information. Large language models can pass the US Medical Licensing Examination,4 encode clinical knowledge,5 and provide diagnoses better than laypeople.6 However, the chatbot did not perform well at providing accurate cancer treatment recommendations. The chatbot was most likely to mix incorrect recommendations in among correct ones, an error that is difficult even for experts to detect.
A study limitation is that we evaluated 1 model at a snapshot in time. Nonetheless, the findings provide insight into areas of concern and future research needs. The chatbot did not purport to be a medical device and need not be held to such standards. However, patients will likely use such technologies in their self-education, which may affect shared decision-making and the patient-clinician relationship.2 Developers should bear some responsibility for distributing technologies that do not cause harm, and patients and clinicians need to be aware of these technologies' limitations.

Validity of Meta-Analytical Data on Cutaneous Adverse Events With Phosphoinositide 3-Kinase Inhibitors
To the Editor We read with great interest the recent meta-analysis by Jfri and colleagues1 assessing the association of rash with phosphoinositide 3-kinase (PI3K) inhibitors, and we urge caution regarding the validity of their meta-analytical data.
First, most of their meta-analytical data1 were from unapproved PI3K inhibitors rather than US Food and Drug Administration (FDA)-approved PI3K drugs, implying there could be serious study-selection bias. For example, Jfri and colleagues1 did not include the CHRONOS-3 phase 3 trial2 comparing copanlisib plus rituximab vs placebo plus rituximab in patients with relapsed indolent non-Hodgkin lymphoma, missing the opportunity to investigate rash events with FDA-approved PI3K medications more systematically and limiting the clinical utility of their meta-analysis.
Second, Jfri and colleagues1 misidentified a number of studies reporting severe (grade ≥3) rash. Among their included studies, Bowles et al 2016,3 Zelenetz et al 2017,4 and Vuylsteke et al 20165 reported not only data on all-grade rash but also data on severe rash; however, the severe rash data from these studies were not included in the meta-analysis of severe rash (shown in Figure 3). In Bowles et al 2016,3 1 severe rash occurred in the sonolisib plus cetuximab arm, whereas no severe rash was observed in the cetuximab (control) arm. In Zelenetz et al 2017,4 6 patients had grade 3 rash after treatment with idelalisib in combination with bendamustine plus rituximab, whereas no patient had severe rash after treatment with bendamustine plus rituximab alone. In Vuylsteke et al 2016,5 6 cases of grade 3 rash were observed in the pictilisib plus paclitaxel arm, whereas no severe rash occurred in the placebo plus paclitaxel arm.
Third, many numeric errors were observed in their meta-analytical data,1 rendering their results questionable. In Bowles et al 2016, the events of severe rash should be 10 and 1, respectively, in the treatment vs control groups, rather than the 23 and 26 used in Figure 3 of Jfri et al.1 In Jimeno et al 2015 (sonolisib plus docetaxel) and Levy et al 2014, zero events of severe rash were reported in both the treatment and control arms, rather than the 1 event in each arm used in Figure 3 of Jfri et al.1
Fourth, the definition of rash across the included studies was heterogeneous. Namely, Loibi et al 20171 reported data on maculopapular rash only, whereas the other included studies reported data on all subtypes of rash. These data reproducibility and validity issues necessitate careful clarification.
Dan-Na Wu, MSc
Guo Yu, PhD
Guo-Fu Li, PhD

In Reply We thank Wu et al for their interest in our study1 and appreciate their comments. Although our search was conducted through August 2021, it was limited to searching electronic databases. There are 2 potential explanations for why the CHRONOS-3 phase 3 trial2 was not included: either our search strategy was not sensitive enough, or the study had not yet been indexed in the included databases by August 2021. The latter seems most likely because the article was first published in May 2021, and indexing in databases takes time.

Table. Summary of Survival in the 5-Year Final Analysis
a Talimogene laherparepvec plus surgery. b Surgery alone. c Group 1 − group 2. d Group 1 / group 2.
JAMA Oncology October 2023 Volume 9, Number 10

Table 1. Oncologist Scoring of LLM Chatbot Treatment Recommendations
a Data reported as No. (%) using majority rule of annotators' scores.
b Slight misalignment of categorical scores from questions 2 and 3 resulted from majority rules.

Funding/Support: This work was supported by the Woods Foundation. The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.

6. Levine DM, Tuwani R, Kompa B, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv. Preprint posted online February 1, 2023. Accessed February 20, 2023. doi:10.1101/2023.01.30.23285067

Author Affiliations: Department of Pharmacy, Hainan General Hospital (Hainan Affiliated Hospital of Hainan Medical University), Haikou, Hainan, China (Wu); School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, Jiangsu, China (Yu, Li).
Corresponding Author: Guo-Fu Li, PhD, Guo Yu's Laboratory, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, 24 Tongjia Lane, Nanjing 211198, China (guofu.g.li@gmail.com).
Published Online: August 31, 2023. doi:10.1001/jamaoncol.2023.3384
Conflict of Interest Disclosures: Dr Yu reported grants from the Jiangsu Provincial Natural Science Fund for Distinguished Young Scholars (BK20200005) during the conduct of the study. No other disclosures were reported.