eTable 1. ICD-10-CM and ICD-9-CM Codes for Suicide Attempt (SA) and Suicidal Ideation (SI)
eTable 2. C-SSRS Logistic Regression Odds Ratios
eTable 3. Discrimination Metrics by Risk Threshold Across All Time Periods
Wilimitis D, Turer RW, Ripperger M, et al. Integration of Face-to-Face Screening With Real-time Machine Learning to Predict Risk of Suicide Among Adults. JAMA Netw Open. 2022;5(5):e2212095. doi:10.1001/jamanetworkopen.2022.12095
Does prediction of suicide risk improve when combining face-to-face screening with electronic health record–based machine learning models?
In this cohort study of 120 398 adult patient encounters, an ensemble learning approach combined suicide risk predictions from the Columbia Suicide Severity Rating Scale and a real-time machine learning model. Combined models outperformed either model alone for risks of suicide attempt and suicidal ideation across a variety of time periods.
These findings suggest that health care systems should attempt to leverage the independent, complementary strengths of traditional clinician assessment and automated machine learning to improve suicide risk detection.
Understanding the differences and potential synergies between traditional clinician assessment and automated machine learning might enable more accurate and useful suicide risk detection.
To evaluate the respective and combined abilities of a real-time machine learning model and the Columbia Suicide Severity Rating Scale (C-SSRS) to predict suicide attempt (SA) and suicidal ideation (SI).
Design, Setting, and Participants
This cohort study included encounters with adult patients (aged ≥18 years) at a major academic medical center. The C-SSRS was administered during routine care, and a Vanderbilt Suicide Attempt and Ideation Likelihood (VSAIL) prediction was generated in the electronic health record. Encounters took place in the inpatient, ambulatory surgical, and emergency department settings. Data were collected from June 2019 to September 2020.
Main Outcomes and Measures
Primary outcomes were the incidence of SA and SI, encoded as International Classification of Diseases codes, occurring within various time periods after an index visit. We evaluated the retrospective validity of the C-SSRS, VSAIL, and ensemble models combining both. Discrimination metrics included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
The cohort included 120 398 unique index visits for 83 394 patients (mean [SD] age, 51.2 [20.6] years; 38 107 [46%] men; 45 273 [54%] women; 13 644 [16%] Black; 63 869 [77%] White). Within 30 days of an index visit, the combined models had higher AUROC (SA: 0.874-0.887; SI: 0.869-0.879) than both the VSAIL (SA: 0.729; SI: 0.773) and C-SSRS (SA: 0.823; SI: 0.777) models. In the highest risk decile, ensemble methods had PPV of 1.3% to 1.4% for SA and 8.3% to 8.7% for SI and sensitivity of 77.6% to 79.5% for SA and 67.4% to 70.1% for SI, outperforming VSAIL (PPV for SA: 0.4%; PPV for SI: 3.9%; sensitivity for SA: 28.8%; sensitivity for SI: 35.1%) and C-SSRS (PPV for SA: 0.5%; PPV for SI: 3.5%; sensitivity for SA: 76.6%; sensitivity for SI: 68.8%).
Conclusions and Relevance
In this study, suicide risk prediction was optimal when leveraging both in-person screening (which captures acute, patient-reported suicidality) and historical EHR data (which quantifies underlying clinical factors that reflect a patient's passive risk level). To improve suicide risk classification, prediction systems could combine pretrained machine learning with structured clinician assessment without needing to retrain the original model.
Nearly 800 000 people die from suicide annually worldwide, with anticipated worsening during the COVID-19 pandemic.1,2 Rates of suicidal behavior are increasing in the United States, although preventive measures including universal screening with discharge follow-up, lethal means counseling, and safety planning have been shown to reduce risk.3-6 In the month prior to their death, approximately 45% of individuals had seen a primary care clinician and 19% had seen a mental health specialist.7 Mental health diagnoses are often absent in records of those who subsequently die from suicide, and many patients will not proactively disclose suicidal thoughts and behaviors because of stigma.8,9 Better risk identification and prognostication might improve outreach and prevention.
Universal screening in emergency departments (EDs) has demonstrated feasibility and led to a near 2-fold increase in suicide risk detection in adults.4 Despite the challenges of implementing universal suicide risk screening (eg, additional training requirements, competing medical priorities, and limited availability of mental health resources), it is often recommended for health care settings, particularly primary care, medical specialty clinics, and EDs.10,11 Endorsed by the US Centers for Disease Control and Prevention in 2011 and the US Food and Drug Administration in 2012, the Columbia Suicide Severity Rating Scale (C-SSRS) is a standardized assessment of suicidal ideation and behavior.12 In a multisite analysis of existing suicide risk scales, Posner et al13 found the C-SSRS to have comparatively high sensitivity and predictive validity in the classification of both historical and future suicidal ideation and behavior.13 Others have criticized the score as overly simplified with the potential to miss a subset of patients with suicidal ideation.14
Predictive models might augment face-to-face screening and leverage electronic health record (EHR) data toward the automated, early detection of individuals at risk of suicide. Kessler et al15 used administrative data to predict deaths from suicide following outpatient mental health visits in an active-duty military population. In a civilian population, a model incorporating longitudinal EHR data16 predicted future suicidal behavior in both inpatient and outpatient visits with good sensitivity (33%-45%) and specificity (90%-95%). The Mental Health Research Network used data across 7 health systems, including the 9-item Patient Health Questionnaire (PHQ-9),17 to predict suicide attempt and death within 90 days, with the top 5% of risk scores accounting for 43% to 48% of suicide attempts. In a pediatric ED, a model combining EHR data and brief suicide screening (Ask Suicide-Screening Questions [ASQ])36 outperformed screening alone in the prediction of subsequent suicide-related visits.
Integrating clinical and statistical risk predictions might provide more effective suicide risk detection.18 However, it remains to be established whether these alternative methods are “competing, complementary, or merely duplicative.”18 Vanderbilt University Medical Center (VUMC) has implemented universal screening in the ED with the C-SSRS since 2018. In parallel, our team has deployed the Vanderbilt Suicide Attempt and Ideation Likelihood (VSAIL) model, which has generated real-time suicide risk predictions across the enterprise since June 2019. In this study, we evaluated the ability of the C-SSRS alone to predict suicide attempt (SA) and suicidal ideation (SI) and compared this performance to our real-time suicide risk prediction model. Finally, we analyzed the performance of our risk model combined with the C-SSRS to evaluate whether synergistic effects improve performance.
We studied an observational cohort of adult patients (aged ≥18 years) at VUMC from June 2019 to September 2020, extracted from the Vanderbilt Research Derivative, a clinical research repository.20 We extracted C-SSRS response data for all screened patients from the Clarity database of the Epic Systems EHR. At each index visit in which the C-SSRS was administered, we also extracted the corresponding VSAIL risk scores generated at the start of each encounter. We included patient encounters in the inpatient (general medical and psychiatric), ambulatory surgical, and ED settings. The institutional review board at VUMC approved this study. Because of the impracticality of obtaining patient consent in this EHR-based study, analyses were approved with waiver of consent. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines.35
The primary outcomes in this study were SA and SI, defined as separate events by coded self-injurious thoughts and behaviors (SITBs) occurring within 7, 30, 60, 90, and 180 days after the discharge date of each documented visit during the time period. SITBs were extracted from encounter diagnosis documentation encoded as International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) and International Statistical Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) stored in our EHR. To avoid circularity when using the C-SSRS to predict future SI and SA, we defined these outcomes exclusively through ICD codes without any reference to clinical screening. In eTable 1 in the Supplement, we provide a reference list of codes used for SI and SA from published sources.21
VUMC implemented universal screening using the C-SSRS in June 2018 in line with Joint Commission mandates.22 The timing of screening during clinical workflows varied considerably between sites and according to competing clinical priorities. Screening was performed during triage for ED encounters and during every shift for patients admitted to the psychiatric hospital. In outpatient clinical settings, screening was performed sporadically as dictated by clinical need (eg, patient-reported symptoms prompting suicide risk screening). Registered nurses typically administered the C-SSRS to patients face to face, eg, at triage in the ED.
The brief version of the C-SSRS consists of 6 cascading questions related to suicidal thoughts and behaviors. The 6 components of the brief C-SSRS implemented at VUMC were: question (Q) 1 (wish to be dead), Q 2 (suicidal thoughts), Q 3 (suicidal thoughts with method without specific plan or intent), Q 4 (suicidal intent without specific plan), Q 5 (suicidal intent with specific plan), and Q 6 (suicidal behavior). All screened patients were asked Qs 1, 2, and 6. Qs 3 to 5 were only asked if patients answered yes to Q 2.
Each question was asked with respect to occurrences within the past month or since the last assessment. Since there are many permutations of answers, C-SSRS scores were aggregated into an ordered series of risk groups determined by the VUMC safety board: green, indicating no reported ideation or behavior; yellow, wish to be dead, suicidal thoughts without method or plan, or suicidal behavior more than 1 year ago; orange, suicidal thoughts with method but without plan or suicidal behavior between 3 months and 1 year ago; and red, suicidal intent with or without plan or suicidal behavior within 3 months. If patients scored yellow, orange, or red, interventions were recommended by automated clinician-focused advisories in the patient medical record. These C-SSRS risk classifications used at VUMC do not correspond directly to the scoring methods validated by Posner et al.13
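As a minimal sketch, the color assignment described above can be expressed as a decision function. This is a simplified, hypothetical rendering: the function signature and the encoding of behavior recency as months are our assumptions for illustration, not the deployed VUMC logic.

```python
def cssrs_tier(wish_dead, ideation, method_no_plan, intent_no_plan,
               intent_with_plan, behavior_months_ago=None):
    """Map brief C-SSRS responses to the VUMC color tiers (simplified sketch).

    Qs 1-5 are booleans; behavior_months_ago is the number of months since
    the most recent suicidal behavior (None if never reported).
    """
    # Red: suicidal intent (with or without plan) or behavior within 3 months
    if intent_no_plan or intent_with_plan or (
            behavior_months_ago is not None and behavior_months_ago <= 3):
        return "red"
    # Orange: thoughts with method but no plan, or behavior 3-12 months ago
    if method_no_plan or (
            behavior_months_ago is not None and behavior_months_ago <= 12):
        return "orange"
    # Yellow: wish to be dead, ideation without method/plan, or behavior >1 year ago
    if wish_dead or ideation or behavior_months_ago is not None:
        return "yellow"
    # Green: no reported ideation or behavior
    return "green"
```

Because the questions cascade (Qs 3-5 are asked only after a yes to Q 2), the branches are checked from most to least severe, so the most acute reported risk determines the tier.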
The VSAIL model was initially trained using the random forest algorithm on a heterogeneous mix of adult VUMC patients that included 3250 manually validated SA cases and a set of 12 695 controls with no history of SA.23 The model uses historical EHR data, including the following features: demographic data (age, sex, race), diagnostic codes, medication data, past health care utilization, and area deprivation index by patient zip code. Full model training and validation details have been published elsewhere.19,23
Since its deployment in 2019 to our production EHR environment (Epic Systems Corporation), the VSAIL model has silently calculated predictions of suicide risk at the start of routine clinical visits. Clinical decision support prompted by VSAIL has been designed in parallel but was not active during the period of this research.24 Although the VSAIL model was initially trained to predict SA and SI at 30 days, it has been validated for SA and SI across a variety of time periods (eg, 7, 30, 60, 90, and 180 days).23,25 Therefore, for each index visit, we used the associated VSAIL score as the predicted risk of SA and SI occurring within each corresponding time interval. A comprehensive evaluation is provided for 30-day outcomes based on previous selection of this prediction target.
The data were transformed to the encounter level, with each row representing a unique patient visit. Only the responses associated with the index visit were considered, and a patient's historical screening responses were ignored. Patient responses to the 6 C-SSRS components were coded as binary features (1 for yes, 0 otherwise). An unstructured text field containing screener comments was also transformed to a binary feature in which nonnull values were represented as 1 and null values as 0. Using these covariate features, we fit separate logistic regression models for SA and SI within each time period. We calculated 95% CIs for the odds ratio (OR) of each covariate and tested the statistical significance of the observed effect sizes.
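The study estimated ORs jointly from multivariable logistic regressions; as a minimal univariate illustration of how an OR and its 95% CI arise for a single binary covariate, a Wald approximation on a 2 × 2 table can be sketched as follows (illustrative only, not the study's estimates):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a = exposed cases, b = exposed noncases,
    c = unexposed cases, d = unexposed noncases."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

A covariate is statistically significant at the .05 level when the resulting interval excludes 1, which is the criterion reflected in the CIs reported in the Results.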
To assess the predictive validity of the C-SSRS triage system used at VUMC to estimate risk of SA and SI, we developed a rule-based model that classified patients scoring red as high-risk (predicted class 1) and all others as low-risk (predicted class 0). To optimally compare and combine face-to-face screening assessments with VSAIL predictions, we also trained a logistic regression model using the C-SSRS component features and predicted suicide risk at each index visit in the cohort. We generated out-of-sample predictions for each observation by using 5-fold cross-validation and combining the holdout-set predictions from each fold. Based on an initial performance comparison and the continuous output of the VSAIL model, we used the similarly formatted C-SSRS regression model predictions, rather than the binary output of the C-SSRS rule-based model, to develop ensemble models.
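The stitching of out-of-sample predictions from cross-validation folds can be sketched schematically (hypothetical fit/predict callables and simple interleaved folds; the study's implementation is not specified at this level of detail):

```python
def out_of_fold_predictions(X, y, fit, predict, k=5):
    """Return one out-of-sample prediction per observation by training
    on k-1 folds and predicting the held-out fold."""
    n = len(y)
    preds = [None] * n
    folds = [list(range(i, n, k)) for i in range(k)]  # simple interleaved folds
    for hold in folds:
        hold_set = set(hold)
        train = [i for i in range(n) if i not in hold_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in hold:
            # Each observation is predicted only by a model that never saw it
            preds[i] = predict(model, X[i])
    return preds
```

Because every prediction comes from a model fit without that observation, the combined vector behaves like out-of-sample output and avoids the optimism of in-sample scoring.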
Since the VSAIL model was already implemented in production and calculated in real-time, we avoided retraining in favor of ensemble methods that combined the C-SSRS regression and VSAIL predictions. This scenario is an example of late fusion, which combines the results of models trained separately into a single model.34 We used the mean and the maximum of these 2 predictions to test simple aggregations. We also calculated a weighted average equal to the C-SSRS regression for medium- to high-risk categories (ie, yellow, orange, or red), and the VSAIL prediction for low-risk categories (ie, green). To accommodate safety concerns with real-time deployment, the weighted average favored the C-SSRS for patients with active suicidal thoughts and behaviors, while favoring historical EHR data otherwise. Finally, we trained a lasso regression model with 5-fold cross-validation that used both the C-SSRS regression and VSAIL predictions as input features and produced a final risk prediction. To evaluate the lasso model, we combined holdout-set predictions from each cross-validation fold.
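The three simple aggregations can be sketched per encounter as follows (a minimal illustration of the late-fusion rules described above; the lasso-based ensemble, which requires fitting, is omitted):

```python
def late_fusion(cssrs_pred, vsail_pred, tier, method="mean"):
    """Combine two risk predictions for one encounter (late fusion).

    cssrs_pred, vsail_pred: predicted probabilities in [0, 1]
    tier: VUMC C-SSRS color ('green', 'yellow', 'orange', or 'red')
    """
    if method == "mean":
        return (cssrs_pred + vsail_pred) / 2
    if method == "max":
        return max(cssrs_pred, vsail_pred)
    if method == "weighted":
        # Favor the face-to-face screen when any risk was reported,
        # and the EHR-based model otherwise.
        return cssrs_pred if tier != "green" else vsail_pred
    raise ValueError(f"unknown method: {method}")
```

The "weighted" rule is a hard switch rather than a numeric average, which is why it defers entirely to the screen for yellow, orange, and red tiers.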
We assessed standard discrimination performance metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Since AUROC can be problematic in classification problems with high case imbalance, it was used only as a preliminary comparison between models.26 Instead, we focused our primary evaluation on PPV and sensitivity at various prediction cutoffs to provide more informative metrics for clinical decision-making and risk detection with rare outcomes. To assess the added value of ensemble models, we calculated the integrated discrimination improvement (IDI), a measure of the improvement in integrated sensitivity and specificity.43
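A minimal sketch of the threshold metrics and the IDI, assuming probability-scale scores and binary outcome labels (illustrative definitions, not the study's evaluation code):

```python
def threshold_metrics(scores, labels, threshold):
    """Sensitivity, specificity, PPV, and NPV at a score cutoff
    (predictions >= threshold count as positive)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }

def idi(new_scores, old_scores, labels):
    """Integrated discrimination improvement: the gain in mean predicted
    risk among events minus the gain among nonevents."""
    def mean(xs):
        return sum(xs) / len(xs)
    events = [i for i, y in enumerate(labels) if y == 1]
    nonevents = [i for i, y in enumerate(labels) if y == 0]
    gain_events = (mean([new_scores[i] for i in events])
                   - mean([old_scores[i] for i in events]))
    gain_nonevents = (mean([new_scores[i] for i in nonevents])
                      - mean([old_scores[i] for i in nonevents]))
    return gain_events - gain_nonevents
```

A positive IDI indicates that the new model raises predicted risk for cases and/or lowers it for noncases relative to the old model, which is how the ensemble-vs-component comparisons in the Results should be read.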
Statistical analyses were conducted using Python version 3.8.2 (Python Software Foundation). Hypothesis tests were 2-sided and specified at a significance level of .05.
Our cohort included 120 398 unique index visits for 83 394 patients (mean [SD] age, 51.2 [20.6] years; 38 107 [46%] men; 45 273 [54%] women; 13 644 [16%] Black; 63 869 [77%] White). SA was documented in 84 (0.07%), 205 (0.17%), 272 (0.23%), 356 (0.30%), and 514 (0.43%) cases at 7, 30, 60, 90, and 180 days, respectively. SI was documented in 614 (0.51%), 1486 (1.23%), 2036 (1.69%), 2433 (2.02%), and 3126 (2.60%) cases at 7, 30, 60, 90, and 180 days, respectively (Table 1).
For SA and SI across all time intervals, positive responses to Q 1, Q 2, and Q 6 had the largest effect sizes and were most frequently statistically significant (Figure 1). For SA at 30 days, statistically significant C-SSRS components were Q 1 (OR, 8.80; 95% CI, 3.96-19.58), Q 5 (OR, 2.35; 95% CI, 1.45-3.82), and Q 6 (OR, 2.46; 95% CI, 1.51-4.03). For SI at 30 days, significant C-SSRS components were Q 1 (OR, 10.59; 95% CI, 7.98-14.04), Q 2 (OR, 2.04; 95% CI, 1.51-2.76), and Q 6 (OR, 1.56; 95% CI, 1.31-1.87). The full logistic regression results are included in eTable 2 in the Supplement.
Across all time intervals, the logistic regression model trained with C-SSRS question-response features outperformed the C-SSRS triage system in terms of AUROC and AUPR. The combination of the C-SSRS regression and VSAIL models outperformed either alone in terms of AUROC and AUPR for both SA and SI at all time intervals. For example, within 30 days of an index visit, the combined models had an AUROC for SA of 0.874 to 0.887 and for SI of 0.869 to 0.879, while the VSAIL model had an AUROC for SA of 0.729 and for SI of 0.773 and the C-SSRS regression model had an AUROC for SA of 0.823 and for SI of 0.777. The weighted average had the highest AUROC among the ensemble models, while the unweighted average generally performed the best in terms of AUPR. The maximum prediction was equivalent to the C-SSRS regression and therefore disregarded in further evaluation. Both C-SSRS–based predictions outperformed the VSAIL model, except in terms of AUROC for longer-term SI. The VSAIL model showed equivalent or higher AUROC as time increased, while the AUROC for C-SSRS predictions decreased over time (Table 2 and Figure 2).
At lower-risk cutoffs for outcomes occurring within 30 days, the VSAIL model had higher sensitivity and specificity than the C-SSRS regression model. At higher-risk thresholds, the C-SSRS regression model offered substantially higher PPV than the VSAIL model and continued to identify the majority of cases up until the 95th percentile. The ensemble models had higher sensitivity, specificity, and PPV than both the C-SSRS regression and VSAIL models across all risk thresholds, except for the top 1%, where the best ensemble model was equivalent to the C-SSRS regression. In the top risk decile, ensemble models had higher sensitivity (SA: 77.6%-79.5%; SI: 67.4%-70.1%) and PPV (SA: 1.3%-1.4%; SI: 8.3%-8.7%) than the C-SSRS regression (sensitivity for SA: 76.6%; sensitivity for SI: 68.8%; PPV for SA: 0.5%; PPV for SI: 3.5%) and VSAIL models (sensitivity for SA: 28.8%; sensitivity for SI: 35.1%; PPV for SA: 0.4%; PPV for SI: 3.9%). The red C-SSRS triage tier identified most SA cases (53.7%) with a PPV (3.8%) that rivaled the C-SSRS regression and ensemble models at the 99th risk percentile (Table 3). Full discrimination metrics by risk threshold are shown for all time periods in eTable 3 in the Supplement.
For SA occurring within 30 days, VSAIL had a higher AUPR (0.235) than the C-SSRS regression model (0.029) among psychiatric ED encounters. In the ED and inpatient settings, the C-SSRS regression had AUPR scores of 0.019 and 0.031, respectively, while the VSAIL model had AUPR values of 0.007 and 0.011, respectively. The VSAIL model performed better among White patients than patients with other racial or ethnic identities (AUPR: 0.107 vs 0.024), while the disparity was smaller for the C-SSRS regression (AUPR: 0.036 vs 0.024).
For SA at 30 days, the IDI was 0.053 between the lasso and C-SSRS regression models (P < .001) and 0.510 between the lasso and VSAIL models (P < .001). For SI at 30 days, the IDI was 0.024 between the lasso and C-SSRS regression models (P < .001) and 0.462 between the lasso and VSAIL models (P < .001).
Of 514 SAs occurring within 180 days, the C-SSRS regression uniquely identified 237 cases within the highest-risk decile. The VSAIL model correctly stratified 73 cases that were not identified by the C-SSRS regression within the highest risk decile. Of the cases predicted by the C-SSRS alone, 220 (92.8%) had at least 1 prior visit with a diagnostic code for SA or SI, compared with 55 (75.3%) predicted by VSAIL alone. Cases identified by VSAIL alone, compared with the C-SSRS alone, were disproportionately male (61 [83.6%] vs 144 [60.8%]) and had a higher median (IQR) number of total visits (126 [78-302] vs 77 [21-137]) and years of EHR data (17.14 [3.97-26.85] vs 12.00 [2.05-19.54]).
We analyzed the predictive validity of the C-SSRS and evaluated the utility of combining face-to-face screening with an automated suicide risk prediction model. The primary finding was that the combination of the C-SSRS and VSAIL models outperformed either alone in the prediction of SA and SI at all time intervals. By leveraging the complementary strengths of historical EHR data and face-to-face screening, ensemble learning improved discrimination at various risk thresholds. In the highest risk decile for SA at 30 days, only the ensemble models surpassed thresholds (for PPV and sensitivity) required for suicide prediction models to deliver health economic benefit.30 We found this improvement (especially in PPV) to be clinically significant, although the costs and benefits of our ensemble approach will vary greatly between health care sites.
While the sensitivity of the C-SSRS regression decreased over time and that of the VSAIL model increased, ensemble models showed consistent performance across time. There was a distinct difference in the predictive time scale of the C-SSRS, which records dynamic, short-term indicators of a patient's suicidal thoughts and behaviors, and the EHR-based model, which learns a patient's underlying risk level by incorporating longitudinal clinical features. At the 50th risk percentile cutoff, the VSAIL model had higher sensitivity and PPV than the C-SSRS regression model for both SA and SI. At the 95th and 99th percentile cutoffs, the C-SSRS regression performed notably better. Ensemble models might have benefited from the relative strengths of the VSAIL and C-SSRS regression models at lower and higher risk thresholds, respectively. The C-SSRS predictions might have been limited by the commonality of patients denying SI despite being at high risk of SA and death.27 Performance of the VSAIL model may have suffered because some observations in the analysis did not have extensive historical clinical data. Ensemble methods seemed to mitigate the respective weaknesses of both data sources by exploiting their diversity while also synthesizing their independent, complementary strengths.
Simon et al17 found suicide risk screening (patient self-report via the PHQ-9) to be an important predictive feature when included alongside other clinical variables in initial EHR model training. Our work builds on this by demonstrating that structured clinician assessment (ie, the C-SSRS) can be combined with existing risk prediction models for improved performance, even when screening data are not available at the time of EHR model training. Through our novel ensemble approach, we isolated and quantified the additive benefit of face-to-face screening when incorporated with EHR-based predictions. By also analyzing the discordance and potential synergies between these 2 existing risk prediction methods, we make novel contributions toward the reconciliation of statistical and clinical risk prediction, as outlined by Simon et al.18
A meta-analysis by Franklin et al33 showed poor predictive validity for screening instruments such as the C-SSRS. However, these instruments continue to be used clinically, and measuring their local predictive validity remains worthwhile. We found that the C-SSRS performs well locally, but its combination with automated scalable risk modeling performs better than either alone. Rapid assessments, such as the brief version of the C-SSRS used here or the ASQ, are likely preferable to longer assessments like the full C-SSRS (when used alone or in combination with EHR-based models).37
Clinical screening and EHR-based models have strengths and weaknesses beyond their predictive performance. In-person screening requires time, mental health resources (which are often limited), training on standardized assessments (eg, the C-SSRS), support from health care administrators, and workflow modifications.37,40,41 An important benefit of clinical screening is that it creates an opportunity for patient-physician dialogue that can lead to individualized treatment interventions. Although EHR-based machine learning can be automated at scale, developing, validating, and implementing predictive models requires a substantial initial resource investment. Ethical and legal issues around privacy, data usage, and accountability also hinder the adoption of machine learning in health care.42
Our findings highlight specific ways that face-to-face screening and EHR-based models might be used in tandem to overcome the challenges of predicting rare phenomena like SA and death. The low PPV currently offered by suicide risk models limits their potential utility in clinical practice, as falsely classifying patients as high-risk for suicide might worsen stigma and prompt unnecessary interventions.28,29 Although this has been proposed elsewhere, our results empirically support using an EHR-based model as an initial detection mechanism that prompts further in-person screening.16,19 At the 50th risk percentile cutoff, the VSAIL model would have identified 18% to 35% more individuals with SA (25%-42% for ensemble models) that were not detected by C-SSRS triage. However, the higher PPV offered by the C-SSRS will likely be necessary to confidently recommend preventive care for individuals with high risk. Using EHR-based machine learning and face-to-face screening in this hierarchical series seems to leverage their complementary nature in a way that augments, rather than replaces, clinician-centered care.
Alternatively, it might be more natural in some settings (eg, those with universal screening) to use a series implementation that applies statistical prediction to patients screening negative. The C-SSRS outperformed the VSAIL model (especially in the short term), and SA cases identified by VSAIL alone were more likely to have no history of suicidal behavior (24.7% vs 7.2%). In settings where screening is widely administered, statistical prediction might be used secondarily to identify cases without prior suicidal behavior or with low screening risk due to nondisclosure. In-person screening and EHR-based models could also be (and often would be) implemented in parallel and combined with an ensemble model that outputs a final risk prediction and triggers clinical action. This would provide a marginally higher PPV than our C-SSRS triage system but would introduce many significant obstacles associated with using machine learning alone to dictate clinical interventions.31 Our results suggest that EHR-based models should incorporate available in-person screening data to improve sensitivity and PPV (especially at higher risk thresholds). For the majority of health care systems implementing face-to-face screening alone, incorporating EHR-based models can improve sensitivity at lower risk thresholds, provide continuous output for more specific decision cutoffs, and identify cases typically overlooked by clinician assessment (eg, instances of patient nondisclosure).
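The EHR-first series workflow discussed above might be sketched as a simple two-stage function. All names and cutoffs here are hypothetical: administer_cssrs stands in for the clinical screening encounter, and the tier-to-action mapping is our assumption for illustration.

```python
def series_triage(vsail_score, administer_cssrs, vsail_cutoff=0.5):
    """Hypothetical two-stage workflow: an EHR-based score flags
    encounters for face-to-face screening; the screen then drives
    the final recommendation.

    administer_cssrs: callable returning a C-SSRS color tier when
    screening is performed (a stand-in for the clinical encounter).
    """
    # Stage 1: automated EHR-based prediction as the initial filter
    if vsail_score < vsail_cutoff:
        return {"screened": False, "action": "routine care"}
    # Stage 2: face-to-face screening for flagged encounters
    tier = administer_cssrs()
    action = "recommend intervention" if tier in ("orange", "red") else "routine care"
    return {"screened": True, "tier": tier, "action": action}
```

Reversing the stages (screen first, model second) or running both in parallel and fusing the outputs, as discussed in this section, would only change which prediction gates the other.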
A primary strength of this study is the collection of a large sample of EHR data with significant overlapping implementation periods for an automated prediction model and universal screening. Our cohort included a relatively large number of SA cases for an observational EHR-based study, which enabled us to develop and evaluate predictive risk models. This analysis greatly benefited from an existing prediction model developed on-site and supported by multiple validation studies.19,23 The C-SSRS is a validated screening tool that predicts suicidal behavior, which likely enabled us to demonstrate notable performance improvements when integrating face-to-face screening and an EHR-based model.13 SA and SI are more prevalent than suicide death, and targeting these outcomes might lead to more clinically useful risk models with acceptable PPV.28
Immediate extensions of this work include ongoing research using the VSAIL model to prompt suicide risk screening in a pragmatic clinical trial and concurrent efforts to incorporate the C-SSRS response data within our real-time modeling pipeline. Rigorous statistical analysis and replication are needed to further evaluate the discordance between statistical and clinical risk prediction across various patient cohorts. Our work leaves ample opportunity to explore alternate ways of integrating statistical and clinical risk predictions. Although lasso was chosen as a simple ensemble method, more complex algorithms, such as nonnegative least squares (NNLS), might improve performance.38 Additional attempts could allow clinicians to heuristically combine EHR-based predictions with in-person screening, in contrast to our ensemble learning approach. To validate the use of automated risk prediction and face-to-face screening in series, one must evaluate how screening methods would perform on the cohort initially identified by statistical risk prediction alone.
Limitations of this work include its confinement to a single medical center, with reliance on universal screening (potentially infeasible for some health systems) and a pretrained model, which may limit generalizability. Although we estimated that screening was administered for 98% of adult ED visits, our cohort excluded a larger set of patient visits (particularly those outside the ED) that did not include the screening. This limits assessment of the broader applicability of the VSAIL model and may have inflated PPV given the higher case prevalence in this cohort. Since patients with more than 1 visit were included multiple times, discrimination metrics might have been affected by the repeated-observations problem.39 By defining the outcomes of interest (SA or SI) based on ICD codes associated with follow-up visits at VUMC, we excluded cases occurring outside our health care system. ICD codes may have introduced diagnostic imprecision or excluded incompletely coded cases of suicide (eg, patients presenting in cardiac arrest or in overdose without obvious suicidal association).32 The ICD-10-CM codes used to define SA include language of "intentional self-harm" (eg, T36) or "suicide attempt" (T14). A caveat remains that intent to die cannot be gleaned from an ICD-10-CM code for "intentional self-harm" alone. Holistic ascertainment of suicide death was unavailable during the study period and therefore not considered as an outcome. The C-SSRS was used to recommend preventive interventions, and this likely confounded the predictive performance of the C-SSRS model, as patients screening positive might have received treatments that prevented future suicidal events.
In this study, ensemble models combining the C-SSRS and an EHR-based machine learning model outperformed either alone in the prediction of SA and SI. Differences across various risk thresholds, time periods, and characteristics of identified cases seem to underlie a synergy between clinical and statistical risk prediction. The improvement (especially in PPV) from combining in-person screening and historical EHR data was clinically significant, although the costs and benefits of our ensemble approach will vary greatly between health care sites. Further research is needed to compare alternate ways of combining clinical and statistical risk prediction and to analyze the practical implications of implementing them in clinical systems.
Accepted for Publication: March 28, 2022.
Published: May 13, 2022. doi:10.1001/jamanetworkopen.2022.12095
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2022 Wilimitis D et al. JAMA Network Open.
Corresponding Author: Colin G. Walsh, MD, MA, Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Ste 1475, Nashville, TN 37203 (email@example.com).
Author Contributions: Mr Wilimitis and Dr Walsh had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Wilimitis, Turer, Fielstein, Kurz, Walsh.
Acquisition, analysis, or interpretation of data: Wilimitis, Turer, Ripperger, McCoy, Sperry, Fielstein, Walsh.
Drafting of the manuscript: Wilimitis, Walsh.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Wilimitis, Turer, Sperry, Walsh.
Obtained funding: Walsh.
Administrative, technical, or material support: Ripperger, Fielstein, Kurz, Walsh.
Conflict of Interest Disclosures: Dr Walsh reported receiving grants from the National Institutes of Health, the US Food and Drug Administration, the Military Suicide Research Consortium, Wellcome Leap, the Selby Stead Fund, Vanderbilt University Medical Center, and the Tennessee Department of Health; receiving personal fees from Southeastern Home Office Underwriters Association and Hannover Re; and holding equity in Sage AI outside the submitted work. No other disclosures were reported.
Funding/Support: The study was funded by Evelyn Selby Stead Fund for Innovation, Vanderbilt University Medical Center (grant R01 MH121455: Distinguishing clinical and genetic risk of suicidal ideation from attempts to inform prevention and grant R01 MH116269: Leveraging Electronic Health Records for Pharmacogenomics of Psychiatric Disorders) and grant W81XWH-10-2-0181 (Optimized Suicide Risk Detection and Management in Military Primary Care) from the Military Suicide Research Consortium. Funding for the Research Derivative and BioVU Synthetic Derivative is through UL1 RR024975/RR/NCRR from the National Center for Research Resources.
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.