CAM-ICU indicates Confusion Assessment Method for the Intensive Care Unit; ICD-9, International Classification of Diseases, Ninth Revision; ICU, intensive care unit; Nu-DESC, Nursing Delirium Screening Scale.
Model performance was evaluated on a prospective test set (receiver operating characteristic curves shown are determined using the subset of the test set with AWOL [age, inability to spell world backward, orientation, illness severity] measurements). ANN indicates artificial neural network; GBM, gradient boosting machine; LR, penalized logistic regression; RF, random forest; and SVM, support vector machine.
eFigure 1. Prevalence of Comorbidities by Elixhauser Comorbidities Index in Train and Test Sets
eFigure 2. Number of Included Hospital Stays (CSNs) by Month of Discharge
eFigure 3. Area Under the Receiver Operating Characteristic Curve (AUC) for Machine Learning Models and AWOL Stratified by Age
eFigure 4. Receiver Operating Characteristic (ROC) Curves for Machine Learning Models and AWOL Stratified by Age
eFigure 5. Model Performance Using More Sensitive Delirium Outcome (Nu-DESC≥1)
eTable 1. Continuous Predictor Characteristics
eTable 2. Categorical Predictor Characteristics
eTable 3. Confusion Matrix Metrics
eTable 4. Confusion Matrix for Gradient Boosting Machine Using 90% Specificity Threshold
eTable 5. Confusion Matrix for Gradient Boosting Machine Using 90% Sensitivity Threshold
eTable 6. Confusion Matrix for Penalized Logistic Regression Using 90% Specificity Threshold
eTable 7. Confusion Matrix for Penalized Logistic Regression Using 90% Sensitivity Threshold
eTable 8. Confusion Matrix for Random Forest Using 90% Specificity Threshold
eTable 9. Confusion Matrix for Random Forest Using 90% Sensitivity Threshold
eTable 10. Confusion Matrix for AWOL Using AWOL≥2 Threshold
Wong A, Young AT, Liang AS, Gonzales R, Douglas VC, Hadley D. Development and Validation of an Electronic Health Record–Based Machine Learning Model to Estimate Delirium Risk in Newly Hospitalized Patients Without Known Cognitive Impairment. JAMA Netw Open. 2018;1(4):e181018. doi:10.1001/jamanetworkopen.2018.1018
Question
Can machine learning be used to predict incident delirium in newly hospitalized patients using only data available in the electronic health record shortly after admission?
Findings
In this cohort study, classification models were trained using 5 different machine learning algorithms on 14 227 hospital stays and validated on a prospective test set of 3996 hospital stays. The gradient boosting machine model performed best, with an area under the receiver operating characteristic curve of 0.855.
Meaning
Machine learning can estimate delirium risk from electronic health record data available on admission and, in this cohort, outperformed AWOL, the nurse-administered prediction rule in routine use at the study institution.
Importance
Current methods for identifying hospitalized patients at increased risk of delirium require nurse-administered questionnaires with moderate accuracy.
Objective
To develop and validate a machine learning model that predicts incident delirium risk based on electronic health data available on admission.
Design, Setting, and Participants
Retrospective cohort study evaluating 5 machine learning algorithms to predict delirium using 796 clinical variables identified by an expert panel as relevant to delirium prediction and consistently available in electronic health records within 24 hours of admission. The training set comprised 14 227 adult patients with non–intensive care unit hospital stays and no delirium on admission who were discharged between January 1, 2016, and August 31, 2017, from UCSF Health, a large academic health institution. The test set comprised 3996 patients with hospital stays who were discharged between August 1, 2017, and November 30, 2017.
Exposures
Patient demographic characteristics, diagnoses, nursing records, laboratory results, and medications available in electronic health records during hospitalization.
Main Outcomes and Measures
Delirium was defined as a positive Nursing Delirium Screening Scale or Confusion Assessment Method for the Intensive Care Unit score. Models were assessed using the area under the receiver operating characteristic curve (AUC) and compared against the 4-point scoring system AWOL (age >79 years, failure to spell world backward, disorientation to place, and higher nurse-rated illness severity), a validated delirium risk–assessment tool routinely administered in this cohort.
Results
The training set included 14 227 patients (5113 [35.9%] aged >64 years; 7335 [51.6%] female; 687 [4.8%] with delirium), and the test set included 3996 patients (1491 [37.3%] aged >64 years; 1966 [49.2%] female; 191 [4.8%] with delirium). In total, the analysis included 18 223 hospital admissions (6604 [36.2%] aged >64 years; 9301 [51.0%] female; 878 [4.8%] with delirium). The AWOL system achieved a baseline AUC of 0.678. The gradient boosting machine model performed best, with an AUC of 0.855. Setting specificity at 90%, the model had a 59.7% (95% CI, 52.4%-66.7%) sensitivity, 23.1% (95% CI, 20.5%-25.9%) positive predictive value, 97.8% (95% CI, 97.4%-98.1%) negative predictive value, and a number needed to screen of 4.8. Penalized logistic regression and random forest models also performed well, with AUCs of 0.854 and 0.848, respectively.
Conclusions and Relevance
Machine learning can be used to estimate hospital-acquired delirium risk using electronic health record data available within 24 hours of hospital admission. Such a model may allow more precise targeting of delirium prevention resources without increasing the burden on health care professionals.
Delirium is common in hospitalized patients, with a prevalence of 18% to 35% and incidence of 11% to 14% in general medical wards, and is independently associated with poor health outcomes.1 It contributes between $38 billion and $152 billion per year to US health care costs.2 Current data suggest hospital-acquired incident delirium can be prevented in up to 53% of patients.3 Prevention strategies, however, are nonpharmacologic and therefore resource and personnel intensive.4 Accurate prediction of delirium risk could allow more precise targeting of high-risk patients and thereby greater resource stewardship and, potentially, improved patient outcomes.
Existing clinical delirium risk prediction tools have achieved areas under the receiver operating characteristic curve (AUCs) of 0.69 to 0.81.5-13 For example, UCSF Health (the University of California, San Francisco, Medical Center system) uses the AWOL screening tool to calculate delirium risk for newly admitted patients.12 This tool assigns 1 point for each of the following criteria: age greater than 79 years; inability to spell world backward; disorientation to city, state, county, hospital name, or floor; and nurse-rated moderate or severe illness severity. A score of 2 points or greater indicates high risk and helps direct hospital resources for delirium prevention (eg, rehabilitation services, patient care assistants, volunteers). A recent prospective cohort study at our institution found AWOL achieved an AUC of 0.73 on hospitalized patients aged 50 years or older.13
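The AWOL scoring rule described above can be sketched in a few lines. This is an illustrative Python sketch only; the function and argument names are hypothetical and do not come from the study's code, which was written in R.

```python
def awol_score(age_years, can_spell_world_backward, oriented_to_place,
               illness_moderate_or_severe):
    """Return the AWOL score (0 to 4): 1 point for each criterion met."""
    points = 0
    points += age_years > 79                # A: age greater than 79 years
    points += not can_spell_world_backward  # W: unable to spell "world" backward
    points += not oriented_to_place         # O: disoriented to city, state, county, hospital name, or floor
    points += illness_moderate_or_severe    # L: nurse-rated moderate or severe illness severity
    return int(points)


def awol_high_risk(score, threshold=2):
    """A score of 2 points or greater indicates high risk."""
    return score >= threshold


# Hypothetical patient: 84 years old, spells "world" backward, oriented, moderate illness
score = awol_score(84, True, True, True)
print(score, awol_high_risk(score))  # prints: 2 True
```

Three of the 4 inputs require a bedside assessment by a nurse, which is the workflow burden the following paragraphs discuss.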
However, AWOL and other score-based delirium risk prediction tools often rely on questionnaires administered by health care professionals (eg, Mini-Mental State Examination), nonroutine clinical data (nursing subjective illness severity assessment), or additional calculations (eg, Acute Physiology and Chronic Health Evaluation score), making their integration into routine clinical workflow impractical. An external validation study of 4 such risk stratification tools describes the need to adapt and simplify prediction rules to allow use with routine clinical assessment data.8 Additionally, these tool development studies contain several limitations, including small sample size (N < 500), limitation of potential predictors to only those known a priori to be associated with delirium, and substantially lower performance on prospective validation compared with the retrospective cohort.
Furthermore, existing tools recapitulate well-studied delirium risk factors, such as cognitive impairment at baseline, delirium on admission, and severe illness.5-13 For this subpopulation of patients with unambiguous risk of developing hospital-acquired delirium, UCSF Health routinely provides delirium prevention precautions. However, it remains of crucial importance to identify and intervene on behalf of patients with elevated risk of incident delirium who lack these apparent risk factors on admission.
We developed and validated a machine learning model to predict hospital-acquired incident delirium in patients without baseline cognitive impairment, based only on data available in the electronic health record (EHR) within 24 hours of admission. To our knowledge, our data set of 18 223 hospitalization records represents the largest used to train and validate any delirium prediction model. Such an approach allows for (1) analysis of hundreds of clinical variables, (2) automated prediction without additional screening steps, thus reducing the burden on health care professionals, and (3) an application that may be readily integrated into the EHR for clinical decision support.
The institutional review board at UCSF reviewed the protocol for this study and approved it as a quality improvement investigation. A waiver of written informed consent was granted by the UCSF institutional review board for this study. All data used in the study were deidentified prior to use.
Study data were collected retrospectively from UCSF Health’s EHRs. Unique hospitalizations, defined by contact serial numbers (CSNs), were included for adult patients discharged from UCSF Health between January 1, 2016, and November 30, 2017, and who had at least 1 Nursing Delirium Screening Scale (Nu-DESC) or Confusion Assessment Method for the Intensive Care Unit (CAM-ICU) screen performed within 30 days of admission. Inclusion and exclusion criteria are summarized in Figure 1. We excluded CSNs if patients were admitted with delirium, altered mental status, or illness severity requiring ICU admission, defined by 1 or more of the following: (1) a Nu-DESC score of 2 or greater within the first 24 hours; (2) an admission diagnosis or problem list including delirium, psychosis, or other alteration of consciousness (International Classification of Diseases, Ninth Revision [ICD-9] code 290.3, 290.11, 290.41, 291.0, 291.1, 292.81, 293.x, 295.x, 296.x, 297.x, 298.x, 300.11, 308, 780.09, or 780.39); (3) a Glasgow Coma Scale best verbal response score less than 4 on admission; (4) patient not alert and oriented to person, time, and place on admission; (5) patient admitted to the ICU, or transferred to the ICU within 24 hours after admission; and (6) patient spent time in the ICU and was unable to be assessed by CAM-ICU at any point. The first 5 criteria were chosen to exclude patients who were delirious on admission, those with obvious cognitive impairment, and patients receiving delirium interventions as part of routine care because of their presentation; the last criterion was chosen to avoid false-negatives in ICU patients.
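The 6 exclusion criteria above can be expressed as a single filter. The following Python sketch is an assumption-laden illustration: the dictionary keys are hypothetical field names, codes written as "293.x" are treated as matching any subcode in that category, and the bare code "308" is assumed to cover its subcodes as well.

```python
# ICD-9 codes listed in criterion 2, split into exact codes and whole categories.
EXACT_CODES = {"290.3", "290.11", "290.41", "291.0", "291.1", "292.81",
               "300.11", "780.09", "780.39"}
CATEGORY_PREFIXES = {"293", "295", "296", "297", "298", "308"}  # "308" handling is an assumption


def is_exclusion_code(icd9_code):
    """True if the code matches an exact code or falls in a listed category."""
    code = icd9_code.strip()
    if code in EXACT_CODES:
        return True
    return code.split(".", 1)[0] in CATEGORY_PREFIXES


def should_exclude(stay):
    """Apply criteria 1 to 6 to one hospital stay (hypothetical field names)."""
    nudesc = stay["max_nudesc_first_24h"]
    gcs_verbal = stay["gcs_verbal_on_admission"]
    return any([
        nudesc is not None and nudesc >= 2,                               # (1)
        any(is_exclusion_code(c) for c in stay["admission_icd9_codes"]),  # (2)
        gcs_verbal is not None and gcs_verbal < 4,                        # (3)
        not stay["alert_and_oriented_x3"],                                # (4)
        stay["icu_within_24h"],                                           # (5)
        stay["icu_stay_without_cam_icu"],                                 # (6)
    ])


example_stay = {
    "max_nudesc_first_24h": 0,
    "admission_icd9_codes": ["401.9", "293.0"],
    "gcs_verbal_on_admission": 5,
    "alert_and_oriented_x3": True,
    "icu_within_24h": False,
    "icu_stay_without_cam_icu": False,
}
print(should_exclude(example_stay))  # excluded because of code 293.0
```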
The training set encompassed CSNs from discharges between January 1, 2016, and August 31, 2017; the test set comprised discharges between August 1, 2017, and November 30, 2017.
Race and ethnicity information was collected from the EHR patient demographics. Patients are asked to self-report their race and ethnicity at the time of hospital registration.
Nurses at UCSF Health collect Nu-DESC14 and CAM-ICU scores every 12 hours in medical-surgical units and the ICU, respectively, to screen for incident delirium.15 Incident delirium was defined as a Nu-DESC score of 2 or greater or a positive CAM-ICU result between 24 hours and 30 days after admission. We also performed a sensitivity analysis defining delirium as a Nu-DESC score of 1 or greater, which has a higher sensitivity for detecting delirium with a mild decrease in specificity.16
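The outcome definition above can be sketched as a labeling function. This Python sketch assumes a hypothetical record layout of (hours since admission, tool, result) and treats the window boundaries as inclusive; neither detail is specified in the text.

```python
def incident_delirium(screens, nudesc_threshold=2):
    """Label one hospital stay as incident delirium or not.

    screens: iterable of (hours_since_admission, tool, result), where tool is
    "nudesc" (result is an integer score) or "camicu" (result is True if positive).
    """
    window_start, window_end = 24, 30 * 24  # 24 hours to 30 days after admission
    for hours, tool, result in screens:
        if not (window_start <= hours <= window_end):
            continue  # screens outside the window do not count toward the outcome
        if tool == "nudesc" and result >= nudesc_threshold:
            return True
        if tool == "camicu" and result is True:
            return True
    return False


# Hypothetical stay: screens at 12, 36, and 60 hours after admission
screens = [(12, "nudesc", 1), (36, "nudesc", 1), (60, "camicu", False)]
print(incident_delirium(screens))                      # primary definition (score of 2 or greater)
print(incident_delirium(screens, nudesc_threshold=1))  # more sensitive definition (score of 1 or greater)
```

Lowering `nudesc_threshold` to 1 reproduces the more sensitive outcome used in the sensitivity analysis.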
We compiled 796 clinical variables identified by an expert panel of health care professionals as relevant to delirium prediction and available in the EHR within 24 hours of admission, including admission diagnoses, medications, laboratory values, vital signs, and demographic and nursing data obtained during the admission assessment (eg, mobility, visual and hearing function, Glasgow Coma Scale, lines and tubes); microbiology, radiology, pathology, and procedures were not included (eTables 1 and 2 in the Supplement).
Apart from age, no AWOL criteria were included within our variable list. Only variables available within the first 24 hours of admission were considered to simulate timely prediction in the clinical setting. Admission diagnoses and problem lists were retrieved from the EHR in ICD-9 format and were discretized into Boolean values for each of the 30 Elixhauser Comorbidity Index17 indicators using the R icd package (R Project for Statistical Computing). Home and admission medications were separately processed into Boolean values corresponding to 1 of 47 discrete categories based on the AHFS Pharmacologic-Therapeutic Classification,18 with the possibility of each medication being assigned multiple categories. For categorical variables, missing values were assigned to their own null category. For continuous variables, missing values were set to 0 and an indicator variable was added. The first value in alphabetical order for each categorical variable was chosen as the reference category, and the lowest value was chosen as the reference category for continuous variables.
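The missing-value handling and reference-category rules described above can be illustrated as follows. This is a minimal Python sketch, not the study's R preprocessing; the function names and the "MISSING" label are hypothetical.

```python
def encode_continuous(values):
    """Set missing (None) values to 0 and return a parallel missingness indicator."""
    filled = [0.0 if v is None else float(v) for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


def encode_categorical(values, null_label="MISSING"):
    """Assign missing (None) values to their own null category."""
    return [null_label if v is None else v for v in values]


def one_hot(values):
    """One-hot encode, dropping the alphabetically first level as the reference."""
    levels = sorted(set(values))
    non_reference = levels[1:]
    rows = [[1 if v == level else 0 for level in non_reference] for v in values]
    return rows, non_reference


# Hypothetical laboratory value and mobility field for 3 stays
sodium, sodium_missing = encode_continuous([138, None, 141])
mobility = encode_categorical(["independent", None, "assisted"])
rows, columns = one_hot(mobility)
print(sodium, sodium_missing)
print(columns, rows)
```

The indicator column lets a model distinguish a true 0 from a value that was never recorded.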
We tested performance of 5 machine learning models in comparison to AWOL. Algorithms (R package implementation) comprised penalized logistic regression (glmnet), gradient boosting machine (gbm), artificial neural network with a single hidden layer (nnet), linear support vector machine (e1071), and random forest (randomForest). Using the R caret package,19 hyperparameters for each model were optimized with 3 repeats of 5-fold cross-validation, then fit to the entire training set. We then assessed each model by computing the AUC on the complete test set and the subset of hospitalizations in which an AWOL was performed. Model reporting complies with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline.20 Code and models have been made available at https://github.com/ayoung01/delirium.
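The resampling scheme above (3 repeats of 5-fold cross-validation, followed by a final fit on the full training set) can be sketched without any modeling library. The Python code below is an illustration only: it does not reproduce the caret implementation, the folds are not stratified by outcome, and `score_fn` is a hypothetical callback standing in for "fit on the training folds, return the validation AUC".

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, training_indices, validation_indices)."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for k in range(n_folds):
            training = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, training, folds[k]


def select_hyperparameter(candidates, n_samples, score_fn):
    """Return the candidate with the highest mean validation score over all splits."""
    best, best_mean = None, float("-inf")
    for candidate in candidates:
        scores = [score_fn(candidate, train, val)
                  for _, _, train, val in repeated_kfold(n_samples)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_mean:
            best, best_mean = candidate, mean_score
    return best, best_mean


# Toy scoring function that peaks at a penalty of 0.1 (placeholder for a real model fit)
best, mean_score = select_hyperparameter(
    [0.01, 0.1, 1.0], n_samples=20,
    score_fn=lambda penalty, train, val: -abs(penalty - 0.1))
print(best)
```

After selection, the chosen setting would be refit on all training rows before the test set is scored once.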
We compared AUCs using a DeLong test for 2 correlated receiver operating characteristic (ROC) curves.21 A 2-sided level of significance of .05 was applied to general comparisons. All analyses were performed using R statistical software version 3.4.1 (R Project for Statistical Computing).
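For readers unfamiliar with the DeLong test, the following standard-library Python sketch shows the calculation for 2 models scored on the same patients. It is a textbook formulation written for illustration, not the R routine used in the study, and the toy scores are invented.

```python
import math


def _psi(x, y):
    """Kernel: 1 if the case outranks the control, 0.5 for ties, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(scores, labels):
    """Return the AUC and the per-case (V10) and per-control (V01) components."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return sum(v10) / len(v10), v10, v01


def _cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Two-sided DeLong test for 2 correlated ROC curves; returns (auc_a, auc_b, z, p)."""
    auc_a, v10_a, v01_a = _components(scores_a, labels)
    auc_b, v10_b, v01_b = _components(scores_b, labels)
    m, n = len(v10_a), len(v01_a)
    variance = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
                + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n)
    z = (auc_a - auc_b) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return auc_a, auc_b, z, p


# Toy data: 3 cases and 4 controls scored by 2 hypothetical models
labels = [1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
model_b = [0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.8]
print(delong_test(model_a, model_b, labels))
```

The pairwise loops are quadratic in sample size, which is acceptable for a demonstration but slow for thousands of stays.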
From 29 359 CSNs, we excluded 11 136 CSNs for delirium on admission or admission to the ICU (Figure 1). The rate of delirium in the cohort prior to application of the exclusion criteria was 13.5%. Of those excluded, 1205 CSNs (10.8%) had a Nu-DESC score of 2 or greater in the first 24 hours after admission. Among the remaining 9931 excluded CSNs (89.2%), the rate of incident delirium was 2909 of 9931 (29.3%) at a median (interquartile range [IQR]) of 2.3 (1.1-5.0) days after admission. Among included CSNs, the rate of incident delirium was 878 of 18 223 (4.8%) at a median (IQR) of 3.0 (1.8-5.7) days after admission, and the mean (SD) age was 57.1 (17.2) years. Of these 18 223 patients, 6604 (36.2%) were older than 64 years and 9301 (51.0%) were female. The training set comprised 14 227 adult patients with non-ICU hospital stays and no delirium on admission who were discharged between January 1, 2016, and August 31, 2017, from UCSF Health (5113 [35.9%] aged >64 years; 7335 [51.6%] female; 687 [4.8%] with delirium). The test set comprised 3996 patients with hospital stays who were discharged between August 1, 2017, and November 30, 2017 (1491 [37.3%] aged >64 years; 1966 [49.2%] female; 191 [4.8%] with delirium). Demographic characteristics did not differ meaningfully between the training and test sets (Table 1). The frequency of comorbidities was also similar between the 2 groups (eFigure 1 in the Supplement). eFigure 2 in the Supplement reports the number of included CSNs discharged each month by delirium outcome.
Figure 2 summarizes the performance of each model. The AWOL system achieved an AUC of 0.678 with a sensitivity of 32.8% and a specificity of 90.5% at AWOL of 2 or greater. AWOL thresholds of 3 or greater and 4 achieved sensitivities of 14.4% and 2.4% and specificities of 97.9% and 99.8%, respectively. Gradient boosting machine (GBM), penalized logistic regression (LR), and random forest (RF) models performed best, with AUCs of 0.855, 0.854, and 0.848, respectively, on the complete test set, with no statistically significant difference between AUCs. The GBM, LR, and RF models achieved AUCs of 0.848, 0.845, and 0.843, respectively (P < .001 vs AWOL for each model), on the subset of the test set with an AWOL score within 24 hours of admission (n = 3356). eFigures 3 and 4 in the Supplement summarize the performance of these models stratified by age 18 to 64 years vs age greater than 64 years; our GBM model achieves an AUC of 0.856 and an AUC of 0.804 on these subgroups, respectively.
At the 90% specificity threshold, GBM achieved 59.7% (95% CI, 52.4%-66.7%) sensitivity, 90.0% (95% CI, 89.0%-90.9%) specificity, 23.1% (95% CI, 20.5%-25.9%) positive predictive value, 97.8% (95% CI, 97.4%-98.1%) negative predictive value, and a number needed to screen (NNS) of 4.8. At this threshold, the model correctly identified 114 of 191 cases of incident delirium and missed 77 (40.3%). Of the 114 true positives, 46 (40.4%) were in patients younger than 65 years. At the 90% sensitivity threshold, GBM achieved 90.0% (95% CI, 84.9%-93.9%) sensitivity, 56.6% (95% CI, 55.0%-58.2%) specificity, 9.4% (95% CI, 8.9%-10.0%) positive predictive value, 99.1% (95% CI, 98.7%-99.4%) negative predictive value, and an NNS of 12. The confusion matrix metrics describing the performance of GBM, LR, and RF and AWOL of 2 or greater are reported in eTable 3 in the Supplement, and the corresponding confusion matrices are reported in eTables 4 to 10 in the Supplement.
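The text does not state how the NNS was calculated. The reported values are consistent with taking the reciprocal of the difference in delirium rate between screen-positive and screen-negative patients, that is, 1 / (PPV - (1 - NPV)). The Python sketch below applies that formula, which is an inference from the reported numbers rather than a definition given by the authors.

```python
def number_needed_to_screen(ppv, npv):
    """NNS under the assumed definition 1 / (PPV - (1 - NPV))."""
    return 1.0 / (ppv + npv - 1.0)


# Reported GBM values at the 90% specificity threshold (PPV 23.1%, NPV 97.8%)
print(round(number_needed_to_screen(0.231, 0.978), 1))  # prints: 4.8

# Reported GBM values at the 90% sensitivity threshold (PPV 9.4%, NPV 99.1%)
print(round(number_needed_to_screen(0.094, 0.991)))     # prints: 12
```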
From 796 initial variables, GBM selected 345 variables, LR selected 114, and RF selected 588. The 40 most predictive variables occurring in at least 10 samples from GBM are summarized in Table 2 and Table 3. In addition, we report whether these predictors were selected among the top 50 variables by LR and RF.
Using a more sensitive definition of delirium (replacing Nu-DESC score ≥2 with Nu-DESC score ≥1), AWOL achieved a baseline AUC of 0.666, and the GBM, LR, RF, artificial neural network (ANN), and support vector machine models achieved AUCs of 0.822, 0.820, 0.811, 0.736, and 0.759, respectively, on the complete test set (eFigure 5 in the Supplement). The P values for a DeLong test comparing ROC curves calculated using the definitions of Nu-DESC score greater than or equal to 1 and Nu-DESC score greater than or equal to 2 were .19, .19, .12, .44, and .046 for the GBM, LR, RF, ANN, and support vector machine models, respectively.
We conducted a sensitivity analysis to test for bias introduced by patients with multiple hospitalizations by removing the 702 medical record numbers (19.9%) in the test set that overlapped with those of the training set, but performance of the GBM model was unaffected (AUC, 0.857).
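The overlap check described above amounts to a set difference on patient identifiers. A minimal Python sketch follows, assuming each stay is a dictionary with a hypothetical "mrn" key; the identifiers shown are invented.

```python
def drop_overlapping_patients(train_stays, test_stays):
    """Remove test-set stays whose medical record number also appears in training."""
    train_mrns = {stay["mrn"] for stay in train_stays}
    overlapping = {stay["mrn"] for stay in test_stays} & train_mrns
    kept = [stay for stay in test_stays if stay["mrn"] not in train_mrns]
    return kept, overlapping


# Hypothetical stays: patient "B" was hospitalized in both periods
train = [{"csn": 1, "mrn": "A"}, {"csn": 2, "mrn": "B"}]
test = [{"csn": 3, "mrn": "B"}, {"csn": 4, "mrn": "C"}]
kept, overlapping = drop_overlapping_patients(train, test)
print(kept, overlapping)
```

The model would then be rescored on `kept` only, without refitting.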
This study demonstrates that machine learning models outperform current clinical tools used to assess delirium risk. In comparison with AWOL, which was found to have an NNS of 11.1 at the threshold of AWOL greater than or equal to 2, our GBM model achieves an NNS of 4.8 while maintaining a higher sensitivity than AWOL, suggesting that fewer than half as many patients would need to be treated for 1 to benefit from delirium prevention interventions. Machine learning models have the additional advantage of not requiring a health care professional to perform a bedside delirium risk assessment.
As with any diagnostic test, the choice of threshold for specificity or sensitivity depends on the interventions triggered by a positive screen. A high specificity threshold may be preferred for delirium prevention interventions that are resource intensive; this would correspond to a higher positive predictive value and require fewer interventions to be performed. However, a higher specificity threshold comes at a cost in sensitivity: 77 of 191 cases of incident delirium (40.3%) were missed by the model with 90% specificity. Conversely, high sensitivity may be preferred for low-cost, low-risk interventions in which the goal is to capture all potential delirium cases, while acknowledging a higher NNS and the intervention being administered unnecessarily to more patients.
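The threshold trade-off can be made concrete with a small Python sketch that picks the lowest score cutoff meeting a target specificity and reports the resulting confusion matrix. The scores below are invented for illustration, and the function names are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Return (TP, FN, FP, TN) when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp, fn, fp, tn


def threshold_for_specificity(scores, labels, target=0.90):
    """Lowest observed score whose specificity meets the target.

    Specificity never decreases as the threshold rises, so the first
    qualifying cutoff also gives the highest attainable sensitivity.
    """
    for threshold in sorted(set(scores)):
        tp, fn, fp, tn = confusion_counts(scores, labels, threshold)
        if tn / (tn + fp) >= target:
            return threshold
    return float("inf")  # no observed score qualifies; classify everyone negative


# Toy predictions: 10 patients without delirium, then 3 with delirium
scores = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.7, 0.6, 0.8, 0.3]
labels = [0] * 10 + [1] * 3
cutoff = threshold_for_specificity(scores, labels)
tp, fn, fp, tn = confusion_counts(scores, labels, cutoff)
print(cutoff, tp, fn, fp, tn)
```

Swapping the stopping rule to a target sensitivity gives the complementary operating point discussed in the same paragraph.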
Our GBM model recovers many known delirium risk factors including advanced age, illness severity, functional or mobility impairment, alcohol misuse, and psychoactive or sedative drugs, and results were largely consistent between top-performing models.1 We excluded patients with delirium on presentation and obvious baseline cognitive dysfunction (ie, not oriented to person, time, or place) because these patients would receive delirium prevention measures without the need for a risk-assessment tool; therefore, a clear marker of dementia was not expected to be recovered in our model. Nevertheless, it is likely that some of the recovered variables are surrogates for baseline cognitive dysfunction, such as dependence for activities of daily living. The large sample size also allowed identification of variables less commonly associated with delirium, including nursing data fields (eg, urinary incontinence), vital signs, medications (eg, antimanic agents including lithium and valproic acid), and select comorbidities (eg, peripheral vascular disease).
Although delirium is usually considered to disproportionately affect the elderly, it also occurs in younger patients, with a prevalence of 4.7%22 and an incidence as high as 14% in high-risk groups.23 Unlike previous studies that focus only on older populations, our study does not exclude patients based on age. At the 90% specificity threshold, our GBM model predicted delirium correctly in patients as young as 22 years, with 46 of 114 true positives (40.4%) in patients younger than 65 years, suggesting that our model is accurately predicting delirium, even in populations younger than those traditionally studied.
The incidence of delirium reported in our data set (4.8%) is lower than the national incidence (11%-14%). This discrepancy is likely due to the younger age (mean [SD] age, 57.1 [17.2] years) of our study population as well as the strict exclusion criteria of the study. Indeed, the rate of incident delirium in the overall cohort prior to application of exclusion criteria was 13.5%. The goal of this study was to develop a model to predict incident delirium within the hospital to implement preventive measures prior to delirium onset. Thus, our exclusion criteria were specifically chosen to eliminate any patients who were delirious on admission or known to have high risk of developing incident delirium. In practice, nonpharmacologic delirium prevention measures are already applied to both these subsets of patients. The high prevalence of delirium among excluded patients, which translates to an NNS of 2.7, suggests the exclusion criteria correctly identified the group of patients known to have an elevated risk.
It is possible that some cases of delirium were missed using the Nu-DESC because it was not performed, performed incorrectly, or performed correctly but with false-negative results. In addition, some cases were missed because several general medical units that have the highest rates of delirium only began routine delirium screening in January 2017.
Although they represent important risk factors for delirium, microbiology, radiology, pathology, and procedures were not included as potential predictors because of their high dimensionality or unavailability within the first 24 hours of admission. However, some of these risk factors may be inferred from other variables in our data set: for example, fever, leukocytosis, and treatment with anti-infective agents would suggest infection otherwise captured on blood cultures. Deliriogenic interventions such as feeding tubes, Foley catheters, and physical restraints are captured by our data set.
We recognize that newer predictive models such as ANNs have been shown to outperform older models such as GBM, RF, and LR in prediction accuracy.24,25 However, such models require more computational power and larger training data sets and are far more technically challenging to integrate into clinical workflow. With the goal of creating a usable clinical tool in mind, the use of simpler models is more appropriate for many institutions at this time. However, the use of more advanced models for delirium prediction remains promising and should be explored in the future. Ensemble learning techniques have been shown to boost performance in models trained using fewer predictors,26,27 but were not pursued because of computational constraints.
Incomplete EHR data, another limitation, was mitigated by explicitly modeling missing data through indicator variables, a method that was chosen for its simplicity and computational efficiency and has been shown to be effective for recurrent neural networks.28 Like recurrent neural networks, GBM, ANN, and RF can model interactions between missingness indicators and other observation inputs. However, linear models can only learn hard substitution rules with indicator variables and may provide biased results and lead to overfitting29; future experiments using alternative missing data methods such as imputation30 may yield better performance.
Our test set includes only hospital stays discharged between August 1, 2017, and November 30, 2017, and is derived from the same institution as our training set. Higher incidence of delirium has been reported during the winter, which may limit generalizability to other times of year.31 Notably, the incidence of delirium in our training and test sets is identical across the calendar year, and there is no evidence of seasonality of delirium in our cohort (eFigure 2 in the Supplement). Finally, we recognize that an external validation would provide valuable insight into how our model performs in other health systems. However, variation in delirium screening, data availability, and EHR capabilities limits the ability to immediately generalize our model to other health systems. Collecting a larger data set across multiple sites may help overcome overfitting and improve generalization of our model in the future.
Our study demonstrates the feasibility of accurate incident delirium risk prediction from routine hospitalization data available in the EHR within 24 hours of admission and provides a list of putative delirium-related variables other institutions can use to develop their own models. Such a model may allow more precise targeting of delirium prevention resources to patients likely to benefit most.
Accepted for Publication: May 9, 2018.
Published: August 3, 2018. doi:10.1001/jamanetworkopen.2018.1018
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 Wong A et al. JAMA Network Open.
Corresponding Author: Albert T. Young, BA, School of Medicine, University of California, San Francisco, 505 Parnassus Ave, San Francisco, CA 94143 (email@example.com).
Author Contributions: Messrs Wong and Young had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Messrs Wong and Young are co–first authors and contributed equally to this study.
Concept and design: Wong, Young, Gonzales, Douglas, Hadley.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Wong, Young, Liang, Gonzales, Hadley.
Critical revision of the manuscript for important intellectual content: Wong, Young, Gonzales, Douglas, Hadley.
Statistical analysis: Wong, Young, Hadley.
Obtained funding: Wong, Hadley.
Administrative, technical, or material support: Wong, Liang, Gonzales, Douglas, Hadley.
Supervision: Wong, Gonzales, Douglas, Hadley.
Conflict of Interest Disclosures: None reported.
Funding/Support: This study was supported in part by the Resource Allocation Program for Trainees, UCSF Medical Education, and funding from the Strategic Improvement Office at UCSF Health.
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Additional Contributions: We gratefully acknowledge the assistance of the Delirium Reduction Committee and every individual contributing to the UCSF Delirium Reduction Campaign. None of these individuals received compensation for their contributions.