Prospective Validation of an Electronic Health Record–Based, Real-Time Suicide Risk Model

This cohort study evaluates the performance of a suicide attempt risk prediction model implemented in a vendor-supplied electronic health record to predict subsequent suicidal ideation and suicide attempt.


Introduction
Suicide prevention begins with risk identification and prognostication. The standard of care remains face-to-face screening and routine clinical interaction. Yet rates of suicidal ideation, attempts, and deaths continue to rise internationally despite increased monitoring and intervention efforts. 1 The coronavirus disease 2019 (COVID-19) pandemic exacerbated contributing factors for suicide and will continue to do so in the post-COVID-19 era. 2-4 Numerous prognostic models of suicide risk have been published, 5 but few have been implemented in real-world clinical systems outside of integrated managed care settings. 5-7 In some settings, universal screening might reduce risk of downstream suicidality. 8 But in-person screening takes time and attention and can be conducted with variable quality. 9 Concealed distress also subverts risk identification in face-to-face screening. 10 Furthermore, those at risk might not be identified despite health care encounters as recently as the day they die from suicide. 11-13
Linking scalable, automated risk prognostication with real-world clinical processes might improve suicide prevention. 14 The most prominent example of an operational suicide risk prediction model with implemented prevention is REACH VET (Recovery Engagement and Coordination for Health-Veterans Enhanced Treatment) from the Veterans Health Administration. 6 Similarly, Army STARRS (Study to Assess Risk and Resilience in Servicemembers) demonstrated algorithmic potential in active duty service members. 15,16 A number of groups, including ours, have published modeling studies for civilians both nationally (eg, the Mental Health Research Network) 5,17-20 and internationally. 21 A recent brief report 7 estimated the increased workload that a suicide risk prediction model would generate through alerts in an integrated managed care setting, Kaiser Permanente.
In Europe, linking mobile health and predictive modeling for suicide prevention has been described, 22 as have predictive modeling studies developed for national and single-payer cohorts. 21,23 While some risk models rely on face-to-face screening data (eg, the Patient Health Questionnaire-9) to calculate risk, 17 generating these important predictors relies on existing or changed clinical workflows, a difficult task. In some hospitals, universal screening occurs in the emergency department alone. A model reliant solely on routine, passively collected clinical data, such as medication and diagnostic data, might scale to any clinical setting regardless of screening practices. Few real-world data exist on the successes and pitfalls of translating such models into operational clinical systems in the presence or absence of universal screening. 7 Like any prognostic test, such as radiographic imaging and laboratory studies, electronic health record (EHR)-based risk models serve as an additional data point for clinical decision-making. When linked to guideline-informed, evidence-based education along with actionable, user-centered decision support, they might improve the provision of suicide prevention. Such systems might prompt care outside of routine health encounters, eg, a prioritized telephone call to a high-risk patient who missed an appointment or guidance on assessing means to a primary care clinician who does not do so regularly. Ideally, these systems would improve quality of care while reducing the burden on clinicians to respond appropriately at the right times.
Part of a larger technology-enabled suicide prevention program, our work applied the multiphase framework for action-informed artificial intelligence 24 to suicide attempt prognostication. We completed phases 1 and 2 in initial model development 20 followed by phase 3, replicative studies. 18,25 The fourth phase includes design, usability, and feasibility testing for the operational platform before effectiveness testing and practice improvement in the final phase.
We evaluate prospectively the real-time EHR risk prediction platform here (fourth phase) to answer the question, "How well do EHR-based suicide risk models perform in the clinical setting, and is performance generalizable?" Models that fail to validate at this phase, or those not studied in this fashion prior to implementation, might covertly hinder clinical decision-making. Predictive models might be evaluated similarly to any novel prognostic data point (eg, a laboratory or imaging result). 26 This validation should account for clinical context, setting, and the presence of universal screening. 8

Methods
We studied an observational, prospective cohort of clinical inpatient, emergency department, and ambulatory surgery encounters at a major academic medical center in the mid-South, Vanderbilt University Medical Center (VUMC), from June 2019 to April 2020. Predictions were prompted by the start of routine clinical visits in the EHR. Because model validity was untested outside of research systems, 27 model predictions did not trigger EHR alerts or decision support.
The VUMC Institutional Review Board approved 2 protocols with waiver of consent given the infeasibility of consenting these EHR-driven analyses across a health system. Only clinical production-grade systems were used to protect privacy and demonstrate feasibility. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines. 28

Study Outcomes
In this study, the predictive model trained on suicide attempt risk was used to predict both suicide attempt (primary) and suicidal ideation (secondary) within 30 days of discharge. 25 Our previously published approach was internally valid at multiple time points (eg, 30 days vs 90 days). 20 Thirty-day outcomes were selected as the prediction target with input from behavioral health experts involved in local suicide prevention.

Inclusion/Exclusion Criteria
We included all adult patients seen in inpatient, emergency department, and ambulatory surgery settings. Individuals with death dates in the Social Security Death Index were right-censored if deaths occurred within 30 days of discharge. Cause-of-death data were not available across the enterprise, so deaths from suicide were not included as prediction targets.

Implementation
Full modeling details have been published 20 and are in the eMethods in the Supplement. Briefly, we trained random forests, a nonparametric ensemble machine learning algorithm, on a heterogeneous, retrospective group of adult cases and controls prior to 2017 stored in a deidentified research repository, the Synthetic Derivative. 31 These models were validated with a variant of bootstrapping with optimism adjustment, in which each bootstrap iteration was tested against a true holdout set to lessen overfitting. 32 Model performance in training to predict suicide attempt within 30 days showed an area under the receiver operating characteristic curve (AUROC) of 0.9 (95% CI, 0.9-0.91) on a deidentified data set of 3250 cases of manually validated suicide attempts and 12 695 adults with no history of suicide attempt. 20 Predictors included the following:
• Past health care utilization (counts of inpatient, emergency department, and ambulatory surgery visits over the preceding 5 years)
• Area Deprivation Indices 33 by patient zip code
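For readers unfamiliar with optimism-adjusted bootstrap validation, the following is a minimal sketch (not the authors' pipeline; the study used a variant tested against a true holdout set). It trains a random forest on synthetic, EHR-style utilization features and estimates an optimism-corrected AUROC in the classic Harrell style. All feature names, coefficients, and data are illustrative assumptions.

```python
# Minimal sketch (NOT the published pipeline): random forest with a
# Harrell-style bootstrap optimism adjustment of the apparent AUROC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical predictors: 5-year visit counts and an area deprivation index.
X = np.column_stack([
    rng.poisson(2, n),      # inpatient visits
    rng.poisson(5, n),      # emergency department visits
    rng.poisson(3, n),      # ambulatory surgery visits
    rng.normal(50, 20, n),  # area deprivation index (illustrative scale)
])
# Synthetic outcome loosely tied to utilization, for illustration only.
logit = -4 + 0.3 * X[:, 1] + 0.01 * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Apparent AUROC on training data is optimistic (forests memorize).
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Optimism = mean drop from bootstrap-sample AUROC to original-sample AUROC.
optimism = []
for _ in range(10):  # small B for brevity; real analyses use many more
    idx = rng.integers(0, n, n)
    m = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

adjusted_auc = apparent_auc - float(np.mean(optimism))
print(round(apparent_auc, 3), round(adjusted_auc, 3))
```

The adjusted estimate is lower than the apparent one, which is the point of the correction: it penalizes the in-sample overfitting a flexible learner exhibits.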

Real-Time Prediction
At registration for inpatient, emergency department, or ambulatory surgery visits, the modeling pipeline used 5 years of historical data to build a vector of predictors. Preliminary analyses showed the 5-year lookback window performed similarly to models using all historical data. The predictive model then generated a probability of suicide attempt within the subsequent 30 days. Here, we validate that probability against encounters for suicide attempt or suicidal ideation in the subsequent 30 days.
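The lookback step above can be sketched in a few lines. This is an illustrative assumption, not the production code: `feature_vector`, the setting labels, and the visit tuples are all hypothetical, and the real pipeline draws many more predictors from the EHR.

```python
# Illustrative sketch (not the production pipeline): assemble visit-count
# predictors from a 5-year lookback window at registration time.
from collections import Counter
from datetime import date, timedelta

def feature_vector(visits, registration_date, lookback_years=5):
    """visits: list of (visit_date, setting) tuples; setting labels here
    ('inpatient', 'ed', 'ambulatory_surgery') are hypothetical."""
    cutoff = registration_date - timedelta(days=365 * lookback_years)
    recent = [setting for d, setting in visits
              if cutoff <= d <= registration_date]
    counts = Counter(recent)
    return [counts.get("inpatient", 0),
            counts.get("ed", 0),
            counts.get("ambulatory_surgery", 0)]

visits = [
    (date(2015, 1, 10), "ed"),        # falls outside the 5-year window
    (date(2018, 6, 2), "inpatient"),
    (date(2019, 3, 15), "ed"),
    (date(2019, 3, 20), "ed"),
]
print(feature_vector(visits, date(2020, 4, 1)))  # → [1, 2, 0]
```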

Recalibration
Calibration measures how well predicted probabilities reflect real-world outcomes (eg, a 1% predicted risk of suicide attempt means 1 of 100 similar individuals from that population should have the outcome). Miscalibrated models hamper clinical decision-making. 34 To enrich signal, the research-grade model 20
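The definition of calibration above can be made concrete with a simple binned reliability check: within each band of predicted risk, the observed event rate should match the mean predicted probability. The sketch below uses synthetic data that is perfectly calibrated by construction; the risk range and bin count are arbitrary assumptions for illustration.

```python
# Sketch of a binned calibration (reliability) check on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
p_pred = rng.uniform(0, 0.05, 10_000)  # predicted 30-day risks (synthetic)
y_obs = rng.binomial(1, p_pred)        # outcomes drawn at the predicted rate,
                                       # so calibration holds by construction

bins = np.linspace(0, 0.05, 6)         # five equal-width risk bins
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (p_pred >= lo) & (p_pred < hi)
    print(f"{lo:.2f}-{hi:.2f}: predicted {p_pred[mask].mean():.3f}, "
          f"observed {y_obs[mask].mean():.3f}")
```

In a miscalibrated model, the predicted and observed columns diverge systematically, which is the signal a recalibration step (eg, fitting a mapping from raw scores to observed rates) is meant to remove.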

Results
Outcome codes in ICD-10-CM were validated by medical record review with an interrater agreement κ of 1, notably higher than the positive predictive value (PPV) of 0.58 for ICD-9 codes in a prior validation of 5543 medical records at VUMC. 20

Model Performance
Cohort criteria affect model performance, as we and others have shown. 38 Analyses considered duration of EHR history per patient, clinical setting (eg, inpatient vs emergency department), and universal screening. Demographic characteristics of sex (not gender, which is not reliably recorded in most EHRs) and race were also considered. Performance by length of EHR history is shown in aggregate for each outcome (Table 2).

Risk Concentration and NNS
Risk concentration plots for all encounters are shown (Figure 1) with the number needed to screen (NNS), the reciprocal of PPV, per quantile. The highest risk quantiles had an NNS of 23 for suicidal ideation and 271 for suicide attempt.
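The NNS arithmetic is simple enough to verify directly. The counts below are hypothetical, chosen only so the resulting PPVs reproduce the NNS figures reported above; they are not the study's actual cell counts.

```python
# NNS as the reciprocal of PPV. The counts are hypothetical and chosen
# only to reproduce the reported NNS values of 23 and 271.
def number_needed_to_screen(true_positives, flagged):
    ppv = true_positives / flagged
    return 1 / ppv

print(round(number_needed_to_screen(20, 460)))  # PPV ≈ 4.3% → NNS 23
print(round(number_needed_to_screen(2, 542)))   # PPV ≈ 0.37% → NNS 271
```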

Evaluation by Status of Universal Screening
Metrics by predicted risk quantile are shown for suicide attempt risk (Table 3). In settings with universal screening, the lowest risk quantile (n = 6795), with predicted risk near 0, had a PPV of 0.1% for suicidal ideation and approximately 0 for suicide attempt. The highest risk quantile (n = 5457), above a threshold of 3.2%, had a PPV of 3% for suicidal ideation and 0.3% for suicide attempt.
In settings without universal screening, the highest risk quantile (n = 4220), above a threshold of 3.2%, had a PPV of 4.3% for suicidal ideation and 0.4% for suicide attempt. The lowest risk quantile (n = 23 589), with predicted risk near 0, had a PPV of 0.1% for suicidal ideation and 0 for suicide attempt.

Risk Prediction Performance by Demographic Subgroup
The NNS for suicide attempt in the highest risk quantiles for men and women in the medical center-wide cohort were 256 and 323, respectively. By race as coded in the EHR, the NNS was 373 for White patients, 176 for Black patients, and 407 for patients who were neither White nor Black.

Discussion
This study validated performance of a published suicide attempt risk model 20 using real-time clinical prediction in the background of a vendor-supplied EHR. Primary findings include accuracy at scale regardless of face-to-face screening in nonpsychiatric settings. We note feasible NNS in the highest predicted risk quantiles with potential for reduced screening workload for those at lowest risk.
Overall performance was not sensitive to the temporal length of EHRs. The minimum length of EHR history required to display an alert or prediction for an individual patient, however, will be the subject of future decision support testing. This work has potential implications for screening practices, clinical decision-making, and care coordination.
Regarding screening, both false negatives and false positives have been considered weaknesses of suicide-focused risk models in systematic review. 5 Here, we note very low false-negative rates in the lowest risk tiers both with (0.02%) and without (0.008%) universal screening (Table 3). Assuming that face-to-face screening takes, on average, 1 minute to conduct, automated screening for the lowest quantile alone would release 50 hours of clinician time per month. Regarding false positives, the NNS of 271 was feasible in the highest risk group. Suicidal ideation, being more common, had a better NNS of 23. For context, NNS was 418 for screening for dyslipidemia to prevent cardiac death when it was introduced. 39 The present study provides further evidence that current models might be best suited to directing prevention at suicidal ideation and attempts, which are more common yet still in the causal pathway for death from suicide. 5 A representative screening protocol is shown in Figure 2.
Such models might be well suited to counterfactual prediction in the future. 44 Future work should include development and validation of site-specific predictive models, or models that will be "site aware" in deployment.
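The clinician-time estimate above is a back-of-the-envelope calculation that can be reproduced under stated assumptions: 1 minute per face-to-face screen, the lowest-quantile encounter counts reported in Table 3, and an approximately 11-month study window (June 2019 to April 2020). These assumptions are ours for illustration; the result lands in the neighborhood of the 50 hours per month cited.

```python
# Back-of-the-envelope for clinician time released by skipping manual
# screening in the lowest risk quantile. Assumptions (ours): 1 minute
# per screen, Table 3 lowest-quantile counts, ~11-month study window.
lowest_quantile_encounters = 6795 + 23589  # with + without universal screening
months = 11                                # June 2019 - April 2020 (approx.)
minutes_per_screen = 1

minutes_per_month = lowest_quantile_encounters / months * minutes_per_screen
hours_per_month = minutes_per_month / 60
print(round(hours_per_month))  # on the order of the ~50 hours cited
```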
Without attention to these differences and the analyses conducted here, interventions might be linked to misspecified models. Moreover, it becomes difficult to assess pure model performance once an intervention is prompted by it. Future iterations of these models (1) might be updated on site-specific cohorts to improve performance and (2) should incorporate the interventions available to prevent suicide into the models themselves to prevent model drift even when the care delivered is accomplishing its intended purpose. 27
Strengths of this study include a large, real-world, vendor-supplied EHR setting. It incorporates prospective validation on natural cohorts of individuals receiving routine care over the study period.
The study included visits across the breadth of a major academic medical center, which improves generalizability. These results complete only part of the fourth phase of action-informed artificial intelligence 24 to help prevent suicide. We have designed our models with usability and feasibility in mind, but these have not yet been tested. Our modeling requires no additional screening (eg, the Patient Health Questionnaire-9 or Columbia Suicide Severity Rating Scale), although future versions might incorporate them to improve risk prognostication. Yet, impact will not be achieved without careful attention to the people and clinical processes to leverage these predictions to prevent suicide.

Limitations
Limitations of this work include a single-center design with low outcome prevalence, particularly for suicide attempt. Predictors included in this model were chosen to optimize scalability and potential generalizability. They rely on structured data ubiquitous in EHRs (diagnostic codes, medications, past utilization, demographic characteristics). However, they also limit the model's ability to predict suicide attempt risk by failing to capture important predictors recorded, for example, in unstructured free-text notes or outside the EHR. Implemented risk models were initially trained on a noncomprehensive subsample of the medical center population. Ascertainment was limited to care at a single medical center, so events occurring at external health systems were not captured. This bias is conservative for model performance analyses: suicide attempts occurring outside the study site become false negatives, which, given the low number of cases, are far more likely to worsen apparent performance metrics than to improve them. Deaths from suicide were not ascertained in this study. Future work to improve ascertainment and to continuously evaluate these models in production is paramount.
Multiple opportunities to expand this work remain. Better understanding of misclassification of risk will improve model performance and potential impact. Novel means of ascertaining suicidality both in and out of individual health systems through health information exchange (such as that available in the Veterans Health Administration; in large health systems, such as HCA Healthcare, Tenet Healthcare, or Kaiser Permanente; or in states, such as Connecticut 45 ) might lead to improved model evaluation and improved performance. Through partnerships such as the Tennessee Department of Health-VUMC Experience, 46 we are beginning to devise a system that would bridge the current gap preventing ascertainment of death from suicide.
Suicide prevention will not be achieved through a predictive model alone, regardless of its analytic performance. Pragmatic trials studying the real-world effectiveness of these predictive models, in concert with thoughtful, user-centered clinical decision support, remain the path to achieving clinical impact in suicide prevention.

Conclusions
In this study, implementation of validated predictive models of suicide attempt risk showed reasonable performance at scale and feasible NNS for subsequent suicidal ideation or suicide attempt.