Validation of a Proprietary Deterioration Index Model and Performance in Hospitalized Adults

Key Points Question How does the Deterioration Index (DTI) perform in predicting patient clinical deterioration across diverse hospital settings and demographic groups? Findings In this prognostic study involving more than 5 million DTI predictions for 13 737 patients, the DTI had acceptable discrimination at the observation level but poor discrimination at the encounter level. Performance also varied by demographic subgroup, with worse performance for patients identifying as American Indian or Alaska Native and those who chose not to disclose their ethnicity. Meaning These findings highlight the need to consider patient demographic characteristics and variations in care practices when implementing predictive models such as the DTI.


Introduction
Failure to recognize or act upon clinical deterioration is associated with an estimated 15% of avoidable deaths in the hospital. 1 While nurses and physicians often do not recognize deterioration, their performance can improve when presented with information from an electronic early warning system. [2][3][4] However, traditional scoring systems such as the National Early Warning Score are neither sensitive nor specific and consequently produce alert fatigue. 3,5 In response, model developers have turned to machine learning to gain better predictive performance. 6 The Deterioration Index (DTI; Epic Systems Corporation; hereinafter, Epic) is a proprietary machine learning model that has been adopted across hundreds of hospitals since its release in 2017. 7 Given widespread use of the DTI to assist in care delivery, it is important to measure its performance across as many health care settings as possible. 8 Notably, the optimal DTI score at which to trigger clinical interventions remains unclear, and the vendor does not report performance across subgroups of race, ethnicity, age, or sex. 9 This study aimed to (1) describe the overall and lead time performance of the DTI among a large cohort of patients across 8 heterogeneous Midwestern US hospitals, (2) evaluate performance measures at suggested thresholds to support clinical decision-making, and (3) assess bias in predictions among demographic subgroups.

Methods
This prognostic study was approved by the institutional review board of the University of Minnesota, which waived the requirement for consent because there was no more than minimal risk to participants, no way to practically conduct the research without the waiver, and no adverse effect on the rights and welfare of the participants without the waiver. Only patients who had not chosen to opt out of research during their initial interaction with our health care system, such as a first clinic appointment or emergency department registration, were included in the study. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline for prediction model validation.

The Deterioration Index
The DTI is a proprietary ordinal logistic regression model that was trained to predict patient clinical deterioration, defined as escalation of care (intensive care unit [ICU] transfer, rapid response team [RRT] activation, or code team activation) within 12 hours or death (defined as discharge time) within 38 hours. 10 The model outputs a probabilistic score between 0 and 100 every 15 minutes, with a higher score indicating higher risk of deterioration. Detailed demographic information for the training data set is not publicly available. A list of variables used by the DTI is provided in eMethods in Supplement 1.

Setting and Study Population
In 2020, our organization silently turned on the DTI model, which ran in the background of our electronic health record (EHR). Consequently, clinicians did not see or act on the scores generated by the DTI. For this study, we included all patients 18 years or older who were hospitalized at MHealth Fairview (the health system of the University of Minnesota) between January 1 and May 31, 2021, and who had at least 1 DTI score recorded. The 8 hospitals in our study were a mix of academic (n = 2) and community (n = 6) hospitals, including 3 rural hospitals (as defined by the Health Resources and Services Administration rural-urban commuting area code criteria). 11 We excluded predictions made in preoperative, intraoperative, ICU, or labor and delivery locations because surgical procedures were likely to involve planned mechanical intubation, ICU patients had already deteriorated, and labor and delivery patients were excluded from the DTI training data.

Definition of Outcomes
We defined our primary outcome, deterioration, as mechanical ventilation, ICU transfer, or death (defined as time of discharge). [12][13][14][15][16] Deterioration events occurring after comfort care orders were excluded from analyses. Although the DTI was trained to also detect RRT or code team activations, our EHR did not discretely capture these events; consequently, they could not be incorporated into our definition of deterioration. Nevertheless, to gain insights into the occurrence of such events in patients experiencing deterioration across our hospitals, we conducted a manual records review of 50 cases (eMethods and eTable 2 in Supplement 1). This approach, although exploratory, allowed us to better understand the underlying factors associated with deterioration, the preventability of deterioration, and the frequency of RRT and code team activations among patients experiencing deterioration.

Model Validation
We used area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) to describe the overall performance of the DTI among the patient population. The AUROC is the probability that the model correctly ranks deterioration risk among 2 randomly selected samples from the outcome and nonoutcome groups. The AUPRC is the weighted mean of precisions (ie, positive predictive values [PPVs]) achieved at each threshold, in which the weight is the increase in recall (ie, sensitivity) from the previous threshold.
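The rank-based definition of the AUROC and the stepwise definition of the AUPRC given above can be computed directly from a set of predictions. The following is a minimal illustrative Python sketch, not the study's analysis code; function names are our own:

```python
from itertools import product

def auroc(scores, labels):
    """Probability that a randomly chosen positive case outranks a
    randomly chosen negative case (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def auprc(scores, labels):
    """Weighted mean of precision over recall increments, stepping
    through predictions from the highest score to the lowest."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        # weight each precision by the increase in recall
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area
```

Because the AUPRC weights precision by recall gained, it is more informative than the AUROC when deterioration events are rare, as in this cohort.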
We used both observation- and encounter-level evaluation methods for our primary analysis, consistent with existing evaluation studies of deterioration prediction models [16][17][18][19][20] (eMethods in Supplement 1). At the observation level, we included all DTI scores across each hospitalization and asked whether the outcome occurred within 12 hours of each prediction. At the encounter level, we included only the highest DTI score from each hospitalization and asked whether the outcome occurred any time thereafter. In all cases, scores derived after deterioration, including scores in the ICU, were excluded. We tested model performance at the observation level across several lead times (3 hours, 6 hours, 12 hours, 24 hours, 38 hours, and 72 hours) 16,17,21,22 for the composite definition of deterioration and each of its component events. Calibration was visualized by plotting calibration curves with 20 bins each. 23
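The two labeling schemes can be sketched as follows, assuming each prediction is a (timestamp, score) pair and `event_time` is the first deterioration event for the hospitalization (or None if none occurred); the function names are illustrative, not from the study:

```python
from datetime import datetime, timedelta

def observation_labels(predictions, event_time, horizon_hours=12):
    """Observation level: label each prediction positive if the
    deterioration event falls within `horizon_hours` of it.
    Scores made after the event are excluded, per the study design."""
    labeled = []
    for t, score in predictions:
        if event_time is not None and t >= event_time:
            continue  # post-deterioration scores are dropped
        positive = (event_time is not None
                    and event_time - t <= timedelta(hours=horizon_hours))
        labeled.append((score, int(positive)))
    return labeled

def encounter_label(predictions, event_time):
    """Encounter level: keep only the highest pre-event score for the
    hospitalization; the label is whether the event ever occurred."""
    pre = [(t, s) for t, s in predictions
           if event_time is None or t < event_time]
    if not pre:
        return None
    best = max(pre, key=lambda ts: ts[1])
    return best[1], int(event_time is not None)
```

Under the encounter-level scheme, a single transient high score dominates the entire hospitalization, which helps explain why the two methods can yield very different discrimination estimates.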

Threshold Analysis
As an ordinal model, the DTI was designed to predict 2 distinct events: care escalation and death. To accommodate these different outcomes, the vendor allows hospitals to set both a medium- and a high-risk score threshold. Accordingly, we evaluated DTI performance at both a medium-risk threshold targeting 50% sensitivity and a high-risk threshold targeting 10% PPV. Hospitals may also prefer a single threshold to trigger an intervention, 24 in which striking a balance between sensitivity and PPV is essential. We used the maximum F1 score (the harmonic mean of sensitivity and PPV) to identify an optimal single score threshold. We report the number needed to evaluate (NNE) at each threshold, calculated as the inverse of PPV. The NNE describes the number of predictions with a score higher than the threshold that must be evaluated to identify 1 case of deterioration. 25
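The threshold analysis described above can be sketched as a sweep over candidate thresholds, computing sensitivity, PPV, F1, and NNE at each. This is a simplified illustration under assumed toy data, not the study's code:

```python
def threshold_metrics(scores, labels, threshold):
    """Sensitivity, PPV, F1, and NNE (the inverse of PPV) when
    flagging every prediction at or above `threshold`."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
    nne = 1 / ppv if ppv else float("inf")
    return {"sensitivity": sens, "ppv": ppv, "f1": f1, "nne": nne}

def best_f1_threshold(scores, labels):
    """Sweep every observed score as a candidate threshold and
    return the one that maximizes the F1 score."""
    return max(set(scores),
               key=lambda t: threshold_metrics(scores, labels, t)["f1"])
```

Sweeping the same grid while targeting 50% sensitivity or 10% PPV instead of maximum F1 yields the medium- and high-risk thresholds, respectively.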

Bias Assessment
We evaluated the ability of the DTI to make unbiased predictions by calculating bias measures across subgroups of race, ethnicity, biological sex, and age. Race and ethnicity were self-reported in the EHR. Race and ethnicity data were relevant to the assessment of bias because disparities in deterioration among certain racial and ethnic groups may be perpetuated or worsened by biased algorithmic predictions. 26,27 Ethnicity was reduced from more than 100 options (eTable 1 in Supplement 1) into 2 categories, Hispanic or Latino (hereinafter, Hispanic) or other ethnicity, with the latter category including all those who did not report Hispanic ethnicity. Age was dichotomized to younger than 60 years or 60 years or older.
To measure potential bias, we calculated 3 bias measures 28 for each subgroup. Within each subgroup, we defined the protected group as the subset that did not identify as White race, other ethnicity, male sex, or age 60 years or older. First, we calculated AUROC parity, defined as the ratio of the AUROC of the protected group to the AUROC of the reference group. Second, we calculated sensitivity parity (also known as equal opportunity), defined as the ratio of the model's sensitivity in the protected group to its sensitivity in the reference group. Third, we calculated PPV parity (also known as false discovery rate parity), defined as the ratio of the model's PPV in the protected group to its PPV in the reference group. We studied these 3 measures because a deterioration model must discriminate well (AUROC parity) while identifying the greatest number of patients experiencing deterioration (sensitivity parity) and avoiding false-positive cases (PPV parity). We measured sensitivity and PPV parity at the threshold that maximized F1 score.
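Each of the 3 parity measures above is a ratio of the same performance metric computed in the protected group vs the reference group, with 1.0 indicating perfect parity. A minimal Python sketch (all names are illustrative, not from the study):

```python
def sensitivity_at(threshold):
    """Return a metric function: sensitivity at a fixed score threshold."""
    def metric(scores, labels):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        return tp / (tp + fn)
    return metric

def parity(metric, scores, labels, group, protected, reference):
    """Ratio of a performance metric in the protected group to the
    same metric in the reference group."""
    def subset(name):
        idx = [i for i, g in enumerate(group) if g == name]
        return [scores[i] for i in idx], [labels[i] for i in idx]
    p_scores, p_labels = subset(protected)
    r_scores, r_labels = subset(reference)
    return metric(p_scores, p_labels) / metric(r_scores, r_labels)
```

AUROC parity and PPV parity follow the same pattern with the metric function swapped; in the study, sensitivity and PPV parity were computed at the F1-maximizing threshold.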

Statistical Analysis
We abstained from conducting formal statistical tests on model performance measures because our focus was solely on evaluating a single model. We calculated 95% CIs for proportional point estimates using the Wilson score. 29 Differences in model performance across lead times and subgroups should be considered exploratory. All analyses were performed using Stata software, version 16.
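The Wilson score interval used for the proportional point estimates has a closed form. A minimal Python sketch (the function name is our own; the study used Stata):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion.
    Unlike the normal approximation, it never extends past 0 or 1
    and remains sensible for small n or extreme proportions."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)
```

For example, with 0 events in 10 observations the interval is roughly (0, 0.28) rather than the degenerate (0, 0) that the normal approximation would give.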

Results
A total of 13 737 patients from 8 hospitals, representing 14 834 encounters and 5 143 513 DTI predictions, met the inclusion criteria (Figure 1). After we removed predictions made after deterioration and encounters in which comfort care orders preceded deterioration, 13 918 encounters remained for analysis at the encounter level.

Threshold Analysis
Observation-level performance measures were evaluated at clinically informed score thresholds. We deliberately selected the suggested high-risk score threshold (68.3) to achieve a PPV of 10%.

Bias Assessment
Bias measures for protected subgroups are shown in Figure 2.

Discussion
The purpose of this prognostic study was to comprehensively evaluate the performance of the DTI among a large cohort of patients across 8 heterogeneous Midwestern US hospitals. We discovered 3 key findings: (1) the DTI had acceptable discrimination in predicting deterioration at the observation level and poor discrimination at the encounter level, with discrimination improving closer to deterioration; (2) achieving clinically appropriate DTI score thresholds is challenging due to large tradeoffs between sensitivity and PPV; and (3) DTI performance varied across certain demographic subgroups.
Overall DTI performance fluctuated widely between encounter- and observation-level evaluation methods, as it has in other settings. For example, Epic initially used an encounter-level evaluation method to validate the DTI and reported its mean AUROC as 0.801. 10 In late 2022, Epic switched to an observation-level evaluation method because it "more accurately represents the experience clinicians have when using the model." 10
Our strategic selection of suggested score thresholds revealed large tradeoffs between sensitivity and PPV. At the medium-risk score threshold, we prioritized sensitivity to ensure that a substantial proportion of patients at risk was identified. However, the high NNE at this threshold could lead to resource strain due to a high number of false-positive cases. By optimizing the model's maximal F1 score, we balanced sensitivity and PPV to create a single decision threshold. A single-threshold approach is best suited for a well-calibrated model; however, our calibration curves revealed that the DTI does not exhibit a linear association between predicted risk and deterioration. Given the ordinal nature of the model, poor calibration may be expected, which supports the use of 2 distinct score thresholds. Accordingly, we chose our high-risk threshold to prioritize PPV at the expense of identifying a substantially smaller proportion of patients at risk.
Bias in predictive models manifests as a systematic favoritism in the model's predictions, often because of insufficient data collection, selection, or model training. 30 While bias is suggested if a parity measure deviates from the reference group by 0.2 or more (ie, a ratio below 0.8), the interpretation of such inequality in parity must be contextually aligned with the specific application and potential implications of the observed disparity. 31

Limitations
This study has several limitations. We did not stratify our analysis by COVID-19 positivity. Because the DTI was developed before the emergence of COVID-19, researchers did not know how it would perform in patients with the disease. However, both an internal validation by the developer 10 and a subsequent external validation 16 on patients with COVID-19 revealed comparable if not superior DTI performance among those with COVID-19. For this study, we analyzed the DTI based on the manner in which it was likely to be implemented (ie, without accounting for COVID-19 status) to provide an assessment of its performance within the clinical practice setting.
In our bias assessment, we did not control for multiple protected features simultaneously, which limits our ability to understand how the DTI performs across specific intersectional groups (eg, patients who are young, of Black race, and female). 32 Although we attempted such an analysis, the exponential increase in the number of subgroups when combining protected features led to many subgroups having no or few observations, making the results unreliable. New techniques to robustly assess intersectional bias are needed. 28

Conclusions
This prognostic study found that a proprietary deterioration model had poor to acceptable discrimination when applied to patient data from a large Midwestern US health care system, and its performance generally decreased over longer lead times. Large tradeoffs were found between sensitivity and PPV as well as variable performance among certain demographic subgroups. These findings highlight the need for health care organizations to thoroughly evaluate the performance and equity of externally developed predictive models.