Leveraging Large Language Models for Decision Support in Personalized Oncology

Key Points

Question: Can current conversational large language models (LLMs) be used as a tool for personalized decision-making in precision oncology?

Findings: In this diagnostic study, treatment option identification from 4 LLMs for 10 fictional patients deviated substantially from expert recommendations. Nevertheless, LLMs correctly identified several important treatment strategies and partly provided reasonable suggestions that were not easily found by experts.

Meaning: These results suggest that LLMs are not yet applicable as a routine tool for aiding personalized clinical decision-making in oncology but may already complement the screening of large biomedical data sets.


Introduction
Precision medicine describes the concept of personalized clinical decision-making by accounting for individual variation.1 This concept requires an evidence-based interpretation of variations as biomarkers.3,4 However, the identification of uncommon and complex molecular alterations or defined biomarkers falling outside currently established guidelines and recommendations creates challenges for clinical decision-making. These findings are frequently discussed in specialized and interdisciplinary molecular tumor boards (MTB).5 Especially in these settings, the clinical interpretation of molecular alterations remains manual work based on search engines and specialized curated databases.6 Yet, these databases contain mostly nonoverlapping information,7 which indicates their incompleteness. Accordingly, the selection and interpretation of evidence for less well-characterized molecular alterations create inter-interpreter heterogeneity.8 The development of new artificial intelligence (AI) systems, such as large language models (LLMs),9,10 has considerably improved the quality of automated analysis of large and complex data sets. LLMs have already been assessed in various biomedical contexts, such as clinical language understanding11 and optimization of general clinical decision support.12 Their potential role in supporting personalized oncology, however, remains undefined. Here, we present results from an explorative analysis of LLM-generated treatment recommendations to assist an MTB.

Methods
This diagnostic study of the development of LLM-based treatment generation followed the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline. An overview of the workflow and additional context is provided in eMethods and eFigure 3 in Supplement 1. Review by the Universitätsmedizin Berlin ethics review board was not required because no patient data were used.

Development of Fictional Case Vignettes
We created molecular profiles for 10 fictional patients based on realistic clinical scenarios, similar to a previous study.8 Cases covered 7 different tumor entities and included 59 distinct molecular alterations largely falling outside current guidelines. An overview of all cases is available in the Table, and a detailed description is in eTable 1 in Supplement 1. Cases were designed to represent tumor types and alterations typically encountered in molecular tumor boards, including an overrepresentation of lung adenocarcinoma cases, where multigene sequencing is standard of care.13

Clinical Interpretation of Molecular Data
Each case vignette was assigned to 1 expert physician of the Charité MTB for manual clinical interpretation of molecular findings, following previously described workflows.5 Additionally, 4 different LLMs were tasked to generate treatment options: BioMed LM (MosaicML; Stanford University) (LLM 1),14 Perplexity.ai (Perplexity AI) (LLM 2),15 ChatGPT (OpenAI) (LLM 3),16 and Galactica (Meta) (LLM 4).17 These 4 were selected to compare across 4 different criteria: type of usage (local installation vs online, important regarding data privacy requirements), model size (in terms of computational resources required), search integration (whether an integrated retrieval engine is used, with impact on up-to-dateness), and pretraining domain (general or medical, with impact on result quality) (eTable 2 in Supplement 1).

Assessment of Results in an Interdisciplinary MTB
Treatment options from the 4 LLMs were condensed into 2 types of summaries for evaluation in the MTB: (1) combined treatment options, which contained options identified by at least 2 different LLMs; and (2) clinical treatment options, which contained options with at least 1 associated NCT or PubMed identifier. These 2 lists and a third, manually annotated list of treatment options were masked and presented to the MTB at Charité's Comprehensive Cancer Center.
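Mechanically, the two summary lists can be derived from the per-model outputs by counting agreement and checking for attached identifiers. The following is a minimal sketch; the data layout, option names, and reference strings are hypothetical illustrations, not the study's actual condensation pipeline (described in eMethods):

```python
from collections import Counter
import re

# Hypothetical per-LLM outputs: option name -> list of cited reference strings.
outputs = {
    "LLM1": {"sotorasib": []},
    "LLM2": {"sotorasib": ["NCT03600883"], "everolimus": ["PMID:12345678"]},
    "LLM3": {"sotorasib": ["NCT03600883"], "olaparib": []},
    "LLM4": {"erlotinib": []},
}

# (1) Combined treatment options: suggested by at least 2 different LLMs.
counts = Counter(opt for options in outputs.values() for opt in options)
combined = sorted(opt for opt, n in counts.items() if n >= 2)

# (2) Clinical treatment options: carry at least 1 NCT or PubMed identifier.
has_id = re.compile(r"^(NCT\d{8}|PMID:\d+)$")
clinical = sorted({
    opt
    for options in outputs.values()
    for opt, refs in options.items()
    if any(has_id.match(ref) for ref in refs)
})

print(combined)  # ['sotorasib']
print(clinical)  # ['everolimus', 'sotorasib']
```

In this toy example, only sotorasib is proposed by 2 or more models, while sotorasib and everolimus carry a trial or publication identifier and therefore qualify as clinical treatment options.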
We created an online survey for MTB members (eTable 4 in Supplement 1). Participants rated the likelihood of a treatment option coming from an AI (on a scale from 0 to 10, with 10 signifying options most likely coming from an AI). Furthermore, the MTB members selected which option they would most likely pursue further and rated the general usefulness of recommendations.

Statistical Analysis
The concordance of LLM-generated treatment options (4 individual and 2 combined option lists) with the manually generated treatment options as criterion standard was measured using precision, recall, and F1 score. Precision, which denotes the fraction of relevant treatment options among the suggested options, was defined as precision = true positives / (true positives + false positives).
Recall, or the fraction of all treatment options in the criterion standard found by LLMs, was defined as recall = true positives / (true positives + false negatives). The F1 score is the harmonic mean of precision and recall and thus penalizes unbalanced precision and recall scores (ie, it is higher when both have similar values): F1 score = (2 × precision × recall) / (precision + recall).
The higher any of the 3 scores, the better the LLM has performed compared with the human recommendations, with 1 being the maximum value for each score.
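As a concrete illustration, these metrics can be computed by treating the expert and LLM treatment lists as sets. This is a simplified sketch: the option strings and exact-name matching are assumptions for illustration, whereas in the study the matching of options was performed manually:

```python
def concordance(expert, llm):
    """Precision, recall, and F1 of LLM options vs the expert criterion standard."""
    expert, llm = set(expert), set(llm)
    tp = len(expert & llm)  # true positives: options shared by both lists
    precision = tp / len(llm) if llm else 0.0
    recall = tp / len(expert) if expert else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1

# Hypothetical case: the expert suggested 4 options, the LLM 3, with 2 overlapping.
p, r, f1 = concordance(
    ["sotorasib", "pembrolizumab", "olaparib", "trastuzumab"],
    ["sotorasib", "pembrolizumab", "everolimus"],
)
# precision = 2/3, recall = 2/4 = 0.5, F1 = 4/7 ≈ 0.57
```

Note how the harmonic mean pulls the F1 score toward the lower of the two component scores, which is why balanced precision and recall yield a higher F1 than an unbalanced pair with the same average.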

Quantitative Evaluation of Treatment Options
Ten fictional patients with cancer (4 with lung cancer, 6 with other cancer types) were included. When manually identified treatment options were set as the criterion standard, the LLMs reached F1 scores of 0.04 (LLM 1), 0.14 (LLM 2), 0.17 (LLM 3), and 0.19 (LLM 4) (Figure 2).

Qualitative Analysis
The set of all obtained treatment options was masked and presented to an interdisciplinary MTB.
MTB members were asked to rate the likelihood of treatment options coming from an LLM (on a 10-point scale, with 0 being extremely unlikely and 10 extremely likely) and their clinical usefulness.
In the 43 overall answers, MTB participants further indicated which 1 of the 3 treatment options they would most likely consider for clinical decision-making. In 37 cases, they preferred to pursue the human annotation, and in 6 cases, they indicated a preference for an LLM-generated treatment option (Figure 3). At least 1 LLM-generated treatment option per patient was considered helpful by MTB members. Inaccurate references were frequently cited as a reason why LLM-generated treatment options were disregarded. LLMs 1 and 4 were not able to provide any useful references in our preliminary studies, so prompting for references was eventually stopped for both. LLM 3 provided 85 unique NCT identifiers across 74 of its 86 treatment options. LLM 2 was able to provide references for 131 of its 142 treatment options, 34 of them being unique NCT identifiers and the rest being PubMed and PubMed Central identifiers or other web resources. We assessed how many of the suggested references, specifically the NCT identifiers, linked to an existing study. Of the 85 NCT identifiers provided by LLM 3, 27 did not exist. In contrast, none of the 34 NCT identifiers provided by LLM 2 were hallucinated (eFigure 4 in Supplement 1).
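A first step of such a reference check can be automated: NCT identifiers follow a fixed format, so malformed ones can be rejected offline before querying a registry. This sketch only checks well-formedness; whether a well-formed identifier actually resolves to a registered study still requires a lookup against ClinicalTrials.gov, and the cited identifiers below are hypothetical examples:

```python
import re

NCT_PATTERN = re.compile(r"^NCT\d{8}$")  # ClinicalTrials.gov IDs: 'NCT' + 8 digits

def well_formed_nct(identifier: str) -> bool:
    """Cheap offline format precheck; does not prove the study exists."""
    return bool(NCT_PATTERN.match(identifier.strip()))

# Hypothetical LLM-cited identifiers: one valid format, two malformed.
cited = ["NCT03600883", "NCT1234", "NTC03600883"]
print([i for i in cited if well_formed_nct(i)])  # ['NCT03600883']
```

A format check like this catches only the crudest hallucinations; identifiers that are syntactically valid but point to no existing trial, as observed for LLM 3, can be detected only by resolving them against the registry itself.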

Unique Treatment Recommendations
Although the treatment options presented by LLMs did not match all recommendations from expert human annotators, 2 treatment options (including 1 unique treatment strategy) that were identified only by the LLMs and not by the human expert were pointed out as clinically useful by MTB members. The unique treatment strategy was antiandrogen therapy in a patient with salivary duct carcinoma with HRAS and PIK3CA variation. HRAS- and PIK3CA-comutated salivary duct carcinomas usually stain positive for the androgen receptor in immunohistochemistry.19 Antiandrogen therapy was not suggested by the human expert because no immunohistochemistry results were provided.

Retrospective Analysis of an Updated LLM
To evaluate how newer models of AI assistance may affect results, we retrospectively compared results from ChatGPT 3 with those of its most recent version, which was not available at the time of the primary analysis. The newer version generated 74 treatment options for the 10 fictional patients, compared with 85 treatment options from ChatGPT 3. Only 26 treatment options overlapped between both versions, showing the high variability introduced by model updates. In comparison with the human expert, the updated LLM reached an F1 score of 0.26, surpassing all 4 LLMs we tested in the study (eFigure 5 in Supplement 1). In contrast to ChatGPT 3, its newer version reduced the number of hallucinated references: only 1 of 17 unique NCT identifiers provided did not exist vs 27 of 85 with ChatGPT 3.

Discussion
Artificial intelligence systems are increasingly used for health care applications.20 Previous reports have shown good performance for well-defined tasks in radiology, dermatology, or pathology.21-24 Integrating multidimensional data beyond established guidelines is an additional challenge typically faced in precision oncology and molecular tumor boards, making this a compelling use case for LLMs. This study reports results from an analysis of LLM-supported decision-making to facilitate personalized oncology. Despite the small sample size of 10 fictional patients, we were able to generate first results for model performance that were overall consistent across LLMs.
The F1 scores reached by LLMs compared with expert recommendations were generally low (below 0.3). The best-performing LLM generated a recall value of 0.34. This result suggests that applying LLMs to prefilter treatment options for human experts is not yet efficient, as important recommendations were not reported. However, these results came close to the performance of established precision oncology knowledge databases (eFigure 5 in Supplement 1). Additionally, such an interpretation considers single-expert annotation as the criterion standard, despite considerable inter-interpreter heterogeneity.6 Furthermore, at least 1 LLM-generated treatment option per patient was considered practically relevant, and 2 treatment options were identified only by an LLM, suggesting their potential usefulness as a complementary search tool.
A comparison of the 4 examined LLMs shows that the smaller model BioMed LM, trained only on PubMed, did not reach the performance of the 3 larger general-purpose LLMs trained on further corpora. This is consistent with previous results suggesting that an increase in model size is one of the key factors for improving performance.25,26 The F1 scores for extracted treatment options were similar across the 3 larger models. However, for MTB members, the quality of the provided study references was decisive for their assessment that most LLM-generated treatment options were not actionable. Analyses of LLMs for other complex medical tasks observed similar challenges.27,28 Future developments thus should focus on identifying adequate references for supporting recommendations.
In a retrospective analysis, a newer version of ChatGPT reached a higher F1 score than the 4 LLMs included in the primary analysis and reduced the number of hallucinated references in comparison with its predecessor. This comparison highlights that the performance of LLMs is highly influenced by versioning, and rapid improvements are expected in the future.
The integration of complex clinical and molecular data by LLMs, as shown here for precision oncology, also holds important implications for other fields in oncology and medicine. For example, an automated and comprehensive review of existing data could help design clinical and preclinical research.31 This approach could be especially useful in precision oncology, where a highly individual combination of biomarkers limits traditional trial design.32

Limitations
This study had several limitations. The limited number of fictional patients, as well as the rapid development of new LLM models and versions, limits conclusions from the study results. Nevertheless, the design of highly dimensional fictional patients and an analysis plan including 4 different LLMs with different technological backgrounds allowed for a first validation of LLMs for precision oncology applications.

Conclusions
In this diagnostic study of LLM-based decision support for personalized oncology, LLMs were not yet suitable to automate the annotation process of an MTB. However, rapid developments can be expected in the near future, and LLMs could already be used to complement the screening of large biomedical data sets. Addressing the accountability of clinical evidence, data privacy, and quality control remains a key challenge.
Because of the limited individual performance, LLM-generated treatment options were summarized into combined treatment options and clinical treatment options for further analyses. Combined treatment options considered only treatment options identified by more than 1 LLM, and clinical treatment options were restricted to treatment options associated with a concrete (although possibly wrong) reference to clinical evidence. Combined treatment options reached an F1 score of 0.29, thus outperforming the best individual performance of an LLM. The clinical treatment options reached the highest recall, 0.34, of all the LLM-based approaches.

Figure 2. Quantitative Analysis of Model Performance

Figure 3. Treatment Evaluations of 10 Fictional Patients by Molecular Tumor Board (MTB) Experts

JAMA Network Open. 2023;6(11):e2343689. doi:10.1001/jamanetworkopen.2023.43689

Specific requirements exist for health care applications of LLMs. Online-only models like ChatGPT and Perplexity.ai allow for a low-maintenance integration into existing workflows and provide continuous updates but require disclosing patient data to commercial services. Uncontrolled model updates furthermore make the quality of results unpredictable and destroy the reproducibility of recommendations. On the other hand, stand-alone applications like BioMed LM or Galactica require local installation and maintenance but have the advantage of full data privacy and reproducibility of results. Updates can be performed in a controlled manner and follow an internal versioning control for ensuring accountability of recommendations. Selecting the most suitable tool for specific requirements therefore needs careful prior evaluation. Selection of LLMs is additionally complicated by the rapid development of the field, with new LLMs being published on an almost weekly basis.29,30 From a conceptual point of view, they rely on the same computational model as ChatGPT but use different training corpora, inference architectures, and training procedures. Being up-to-date thus requires continuous repetition of assessments with new models.

25. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. In: Advances in Neural Information Processing Systems. Published 2020. Accessed October 11, 2022. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
26. OpenAI. GPT-4 technical report. arXiv. Preprint posted online March 27, 2023. doi:10.48550/arXiv.2303.08774
27. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78-80. doi:10.1001/jama.2023.8288
28. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. 2023;329(10):842-844. doi:10.1001/jama.2023.1044
29. Meta. Introducing Llama. Accessed October 23, 2023. https://ai.meta.com/llama/
30. Google. Bard homepage. Accessed October 23, 2023. https://bard.google.com/chat
31. Li T, Shetty S, Kamath A, et al. CancerGPT: few-shot drug pair synergy prediction using large pre-trained language models. arXiv. Preprint posted online April 17, 2023. doi:10.48550/arXiv.2304.10946
32. Petak I, Kamal M, Dirner A, et al. A computational method for prioritizing targeted therapies in precision oncology: performance analysis in the SHIVA01 trial. NPJ Precis Oncol. 2021;5(1):59. doi:10.1038/s41698-021-00191-2

eTable 1. Detailed Descriptions of the 10 Mock Patients
eTable 2. Comparison of the LLMs Used in This Study
eTable 3. Prompt Templates for All LLMs in the Given Study
eTable 4. Questions for the Survey
eFigure 1. Number of Treatment Options per Prompt Type
eFigure 2. Workflow of LLM Prompting
eFigure 3. General Workflow of the Analysis
eFigure 4. Number of Unique Clinical Trials Suggested by LLMs and the Oncological Experts
eFigure 5. Precision, Recall, and F1 Scores for the Structured Databases and LLMs Compared With the Human Expert