Development of Electronic Health Record–Based Prediction Models for 30-Day Readmission Risk Among Patients Hospitalized for Acute Myocardial Infarction

Key Points
Question: Can machine learning deployed in electronic health records be used to improve readmission risk estimation for patients following acute myocardial infarction?
Findings: In this cohort study of externally validated machine learning risk models for 30-day readmission among 10 187 patients hospitalized for acute myocardial infarction, good discrimination was observed at the development site, but the best discrimination did not yield the best calibration. External validation produced significant declines in both discrimination and calibration.
Meaning: The findings of this study highlight that robust calibration assessments are a necessary complement to discrimination when machine learning models are used to predict post–acute myocardial infarction readmission; challenges with data availability across sites, even in the presence of a common data model, limit external validation performance.


Introduction
Coronary heart disease accounts for approximately 14% of deaths in the US; an acute myocardial infarction (AMI) occurs every 34 seconds and an AMI-related death occurs every 84 seconds. 1 Approximately 635 000 individuals in the US have their first AMI each year, and almost half experience a recurrent AMI in the same year. 1 One in 5 patients with AMI is rehospitalized within 30 days of discharge. [1][2][3] The 2010 Patient Protection and Affordable Care Act identified hospital readmission as a target for revising fee-for-service hospital reimbursement and reducing health care spending. 4 Unplanned readmissions account for 17% of Medicare hospital reimbursement, costing approximately $17.4 billion annually. 5 Approximately 20% of Medicare beneficiaries who experience an AMI are rehospitalized within 30 days after discharge. 6 Although quality improvement initiatives and financial incentives have led to a decline in 30-day readmissions, emergency department visits and observation stays after index admission have increased. [7][8][9] The success of readmission reduction programs is therefore uncertain, highlighting the need to identify patients who may benefit from additional health care resources to reduce the risk of readmission.
Risk prediction models are important to promote efficient resource allocation for high-risk patients and readmission prevention. 4,5,10 There is increasing opportunity to embed predictive models into electronic health records (EHRs) to support automated readmission prediction for use by clinical teams, yet challenges persist. [11][12][13] Portability of predictive models between different EHRs is often limited [10][11][12] owing to systematic differences in patient case-mix between institutions, differences in EHR data storage methods, 14 and changes in clinical and data practices within a health care system over time. 15 Strategies to address portability include a systematic approach to data management (eg, common data models and exchange formats) and model maintenance. 16,17 Current models for 30-day readmission after AMI focus on risk factors derived from claims data or restricted patient populations, which limits performance and deployment. 18 Many studies have had limited discrimination in predicting 30-day readmissions using either Medicare claims (C statistic, 0.63) or state registries with only structured data (C statistic, 0.64-0.67). [18][19][20][21][22] Machine learning models can address some of these limitations by accounting for variable interactions and nonlinearity, although their outputs can lack transparency. 23 Nevertheless, the ability to incorporate such models into EHRs, offload manual calculation burdens, and carefully select methods that are understandable has increased clinical usability. 24,25 In this study, we sought to improve 30-day readmission risk prediction among hospitalized patients with AMI by analyzing a broad set of EHR data standardized within a common data model and comparing a robust set of machine learning methods.

Methods
We conducted this study at Vanderbilt University Medical Center (VUMC), a large, tertiary care academic hospital system in Nashville, Tennessee, with a catchment area that spans 9 states; Dartmouth-Hitchcock Medical Center (DHMC) served as the external validation site. To supplement ascertainment of 30-day readmissions following hospitalization, the cohorts were linked to Medicare Provider Analysis and Review inpatient claims. 26 All data were collected within the VUMC EHR, StarPanel, 27,28 and the DHMC local EHR aggregated to a clinical data warehouse. Source EHR data were transformed at each institution into the Observational Medical Outcomes Partnership (OMOP) Common Data Model. At VUMC, this transformation was performed by the Vanderbilt Institute for Clinical and Translational Research (VICTR) and supported by institutional and National Institutes of Health Clinical and Translational Science Award funding; at DHMC, it was performed by the research team. The OMOP Common Data Model supports normalization of data variables encoded using different terminologies and data structures and is well established for clinical data. 17,29 All structured data fields in this study were extracted from each site's OMOP instance.
We followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline for cohort studies. 30 This study was approved by the VUMC and Dartmouth College institutional review boards under expedited review with a waiver of informed consent. Informed consent was waived because the work could not feasibly be done with direct informed consent, and the study represented minimal risk to the participants as determined by the institutional review boards.

Candidate Predictors
Main effect predictors and definitions are listed in eTable 1 in the Supplement; they included 4 demographic variables, 9 medication orders, 86 administrative variables, 9 composite score variables, and 33 laboratory tests. Variables were defined using ICD-9-CM, ICD-10-CM, Current Procedural Terminology, and Healthcare Common Procedure Coding System codes and translated to SNOMED-CT using the Unified Medical Language System crosswalk, where crosswalks existed, to query the OMOP tables. To more fairly compare the parametric methods, which cannot automatically evaluate interaction terms and nonlinear variable representations the way nonparametric methods can, we evaluated first-order interaction terms using forward and backward stepwise logistic regression, with α = .10 as the threshold for retention of an interaction term. A full list of candidate predictor variables is available in eTable 8 in the Supplement. All candidate predictors generated at VUMC were replicated at DHMC.
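The interaction-screening step can be illustrated with base R. This is a minimal sketch, not the study code: the data frame train_df, the outcome readmit_30d, and the three predictors are hypothetical placeholders, and setting the step() penalty k to the χ² quantile for P = .10 approximates the α = .10 retention rule.

```r
# Hypothetical data frame `train_df` with binary outcome `readmit_30d`;
# predictors are illustrative stand-ins for the study's candidate variables.
base_model <- glm(readmit_30d ~ age + hemoglobin + troponin,
                  data = train_df, family = binomial)

# Upper scope permits all first-order (pairwise) interactions.
full_scope <- readmit_30d ~ (age + hemoglobin + troponin)^2

# Bidirectional stepwise search; k = qchisq(0.90, 1) ~ 2.71 corresponds to
# retaining a 1-df term at alpha = .10 (the default k = 2 is AIC).
selected <- step(base_model,
                 scope = list(lower = formula(base_model), upper = full_scope),
                 direction = "both",
                 k = qchisq(0.90, df = 1))
summary(selected)
```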

Outcome
The main outcome of interest was 30-day hospital readmission. Using the Centers for Medicare & Medicaid Services definition, readmission was defined as a subsequent observation stay or acute inpatient stay within 30 days of the index AMI discharge, excluding rehabilitation admissions, nursing home admissions, and scheduled admissions for surgeries or procedures. The dates and causes of readmission were derived from each hospital's administrative databases, including the admitting hospital's state and surrounding-state inpatient data sets, and from Medicare claims, ensuring complete ascertainment of 30-day readmissions. Outcome derivation was the same at VUMC and DHMC.

Missing Values
The final analytic file contained 37 variables with missing values at VUMC and 30 at DHMC. We addressed missingness through a combination of assumptions and imputation techniques. When data are missing at random and confined to predictor variables, imputation is appropriate, and multiple imputation provides robust results. 31 Except for laboratory test data, clinical information was assumed to be negative or not present when null in the EHR data. For laboratory variables, SAS, version 9.4 (SAS Institute Inc) was used to create 20 imputed data sets using Markov chain Monte Carlo methods, assuming all imputed variables follow a multivariate normal distribution. 32 Missing values were drawn from a conditional multivariate normal distribution. With sufficiently large samples, this method often yields reliable estimates even if the normality assumption is not fully met. 32 The final files at VUMC and DHMC contained a total of 241 variables (141 main effects and 100 interaction terms), with 80 480 observations at DHMC. Using 20 imputed data sets incorporates sufficient uncertainty about the missing values to support confident conclusions about each variable's influence on the outcome. The final analytic data files were imported into R, version 3.6.0 (R Foundation) for machine learning development and execution.
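As a hedged R analogue of the SAS Markov chain Monte Carlo step (the study performed imputation in SAS, not R), the Amelia package also assumes a joint multivariate normal model for the incomplete variables; lab_df is a hypothetical data frame of the laboratory variables.

```r
# Hedged analogue of the SAS MCMC imputation: Amelia assumes a joint
# multivariate normal model, mirroring the study's distributional assumption.
# `lab_df` is an illustrative data frame of laboratory variables with NAs.
library(Amelia)

imp <- amelia(lab_df, m = 20)      # 20 completed data sets, as in the study
imputed_sets <- imp$imputations    # list of 20 completed data frames
```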

Model Development
Five machine learning models were developed, including both parametric models (elastic net [EN], least absolute shrinkage and selection operator [LASSO], and ridge regression [RR]) and nonparametric models (random forest [RF] and gradient boosting [GB]). Hyperparameters for the parametric models included α and λ. The α hyperparameter is, by definition, fixed at 1 for LASSO and 0 for RR; following a grid search using the caret package, α was set to 0.55 for EN. Each parametric model was assessed using both the λ at the minimum cross-validated error and the λ 1 SE from the minimum. Hyperparameters for RF were the number of candidate variables drawn at each split (11.87), the sample size of observations (n), whether observations are drawn with replacement (true), node size (1), number of trees (500), and splitting rule (Gini impurity, P value, or random). 33 A grid search resulted in changing the number of trees to 1000 and the number of candidate variables drawn at each split to 7; default values were retained for the remaining hyperparameters.
Hyperparameters for GB were the number of trees (100), interaction depth/maximum nodes per tree (4), minimum number of samples in tree terminal nodes (10), the fraction of training observations randomly selected for the subsequent tree (0.5), the train fraction (1), and the learning rate (0.01). The grid search optimized the number of trees (7), the interaction depth/maximum nodes per tree (1), and the shrinkage parameter (0.537). 34 Before deploying each machine learning model, the final analytic data file was randomly split into a two-thirds training set and a one-third testing set in each of the 20 imputed data sets. Parametric models were developed in R using the glmnet and caret packages; 35 nonparametric models used the randomForest and gbm packages. 34,36
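A minimal sketch of this training setup follows, under stated assumptions: analytic_df and the factor outcome readmit_30d (levels "no"/"yes") stand in for one imputed data set, and the tuning grids shown are illustrative rather than the study's exact search space.

```r
# Minimal sketch of one imputed data set's train/test split and caret-based
# grid search; object and column names are illustrative, not the study code.
library(caret)
library(glmnet)

set.seed(42)
idx      <- createDataPartition(analytic_df$readmit_30d, p = 2/3, list = FALSE)
train_df <- analytic_df[idx, ]
test_df  <- analytic_df[-idx, ]

# 10-fold cross-validation with 5 repeats, optimizing AUROC.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Elastic net: alpha = 1 recovers LASSO, alpha = 0 recovers ridge.
en_fit <- train(readmit_30d ~ ., data = train_df,
                method = "glmnet", metric = "ROC", trControl = ctrl,
                tuneGrid = expand.grid(alpha  = seq(0, 1, by = 0.05),
                                       lambda = 10^seq(-4, -1, length.out = 20)))

# Held-out predicted probabilities for the positive class.
test_prob <- predict(en_fit, newdata = test_df, type = "prob")[, "yes"]
```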

Model Assessment and Scoring
Each model was trained using 10-fold cross-validation with 5 repeats on the full training set, and model performance was determined using the full holdout test set. The area under the receiver operating characteristic curve (AUROC), 95% CIs, and SEs were calculated from the test set of each of the 20 imputed data sets for each model, then pooled across all imputed files using Rubin's rules to generate a single metric for each machine learning model. Various thresholds for the predicted probabilities were investigated. Calibration was assessed using calibration curve belts and the percentage of calibration (the proportion of predictions for which the calibration belt crossed the observed/expected 1 line) for each model on the training and testing data sets, 37 and we report these calibration results for model comparisons. The pooled Brier score, a global metric that combines discrimination and calibration performance, was also assessed. 38 The best-performing model was defined as the model with the most observations falling within the calibration band. 39 After development, each model was scored on the full DHMC data set using both default and optimized hyperparameters.
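The pooling step can be sketched with Rubin's rules: the pooled estimate is the mean across imputations, and the total variance combines within- and between-imputation variance. This is a generic illustration, not the study's code; the vectors aucs and ses (per-imputation AUROCs and their SEs, eg, computed with the pROC package) are assumed inputs.

```r
# Generic Rubin's rules pooling for an estimate and its per-imputation SEs;
# `est` and `se` are vectors of length m (here m = 20 imputations).
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)               # pooled point estimate
  u_bar <- mean(se^2)              # within-imputation variance
  b     <- var(est)                # between-imputation variance
  t_var <- u_bar + (1 + 1/m) * b   # total variance
  list(estimate = q_bar,
       se       = sqrt(t_var),
       ci95     = q_bar + c(-1, 1) * 1.96 * sqrt(t_var))
}

pooled_auc <- pool_rubin(aucs, ses)   # `aucs`, `ses`: assumed inputs
```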

Results
For the parametric models (EN, LASSO, and RR), the testing set AUROC ranged from 0.686 to 0.695; for the nonparametric models, it ranged from 0.686 for RF to 0.704 for GB (Table 2). The best-performing EN, LASSO, RR, and GB models used default hyperparameters, whereas the best-performing RF model used optimized hyperparameters. In the external validation cohort, the best-performing parametric and nonparametric models all used optimized hyperparameters; the AUROC ranged from 0.558 to 0.655 for the parametric models and from 0.606 to 0.608 for the nonparametric models (Table 2).
For calibration assessments on the VUMC testing data, the best-performing model was LASSO, which had the highest percentage of calibrated observations (31.64%), followed closely by EN (30.24%), both with default hyperparameters (Figure 1). The model with the highest percentage of calibration in the external validation cohort was LASSO (17.0%) with optimized hyperparameters. Figure 2 illustrates the calibration curves for the best-performing LASSO models in both cohorts; calibration curves for the other 4 models appear in eFigure 1 and eFigure 2 in the Supplement. For VUMC, multiple thresholds were tested for sensitivity, specificity, positive predictive value, negative predictive value, and the F1 score of the best-performing LASSO model (Table 3). Predictors that were repeatedly among the strongest in the models included discharge location, age, HOSPITAL score (comprising hemoglobin level at discharge, discharge from an oncology service, sodium level at discharge, procedure during the index admission, index type of admission, number of admissions during the past 12 months, and length of stay; possible score range, 0-13), hemoglobin level, and troponin level.
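As an illustration of the threshold analysis, the following sketch computes these metrics at a few example cutoffs; test_prob and y (observed 0/1 readmission) are hypothetical inputs, and the cutoffs shown are not those reported in Table 3.

```r
# Confusion-matrix metrics at a given probability cutoff; inputs are
# illustrative: `prob` = predicted probabilities, `y` = observed 0/1 outcome.
threshold_metrics <- function(prob, y, cutoff) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & y == 1); fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1); tn <- sum(pred == 0 & y == 0)
  sens <- tp / (tp + fn); spec <- tn / (tn + fp)
  ppv  <- tp / (tp + fp); npv  <- tn / (tn + fn)
  c(sensitivity = sens, specificity = spec, PPV = ppv, NPV = npv,
    F1 = 2 * ppv * sens / (ppv + sens))
}

# Example: evaluate several candidate cutoffs (values are illustrative).
sapply(c(0.10, 0.20, 0.30), function(k) threshold_metrics(test_prob, y, k))
```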

Discussion
In this study, we developed EHR-derived machine learning risk prediction models that performed better than previously published models at the derivation site while retaining good calibration. 22,40-42 Moreover, we externally validated the machine learning risk models at an independent site, highlighting the challenges of retaining adequate calibration at nonderivation sites. We chose LASSO as the model with the best fit for several reasons: the AUROCs of the candidate models were not statistically significantly different, and we primarily targeted the calibration plots' proximity to the diagonal fit line. Our selection of LASSO as the optimal model may seem counterintuitive because it did not offer the highest AUROC, but it represents a balanced prioritization of discrimination and calibration performance. Our hope is that robust calibration assessment becomes the norm for risk model development. 39 Discrimination metrics in the external validation cohort were poor for the LASSO and RR models, although the EN, RF, and GB models retained modest discrimination. Previous studies have highlighted the need for ongoing surveillance and updating of these models during their use because of systematic data collection differences between sites and clinical practice drift over time. 15,43 Although external validation is an important test for risk models, this study highlights that even when a model is developed using a common data model to align data definitions and validated in the same time period as the derivation, portability to other sites can be compromised. It is likely that almost all models will require adaptation to local environments and continued updating over time to achieve clinical utility and safety. Finally, we compared a robust set of machine learning methods, both parametric and nonparametric; this type of robust comparison has seen recent use in predictive modeling but is not yet widely practiced. 15

Limitations
There are several limitations to this study. There were data quality limitations at the external validation site, such that some candidate predictor variables available at VUMC could not be populated at DHMC. This factor limited the number of candidate predictors available for the VUMC models, which affected their performance, and the limited quality of available variables at DHMC also affected model scoring performance. Despite the use of a common data model, standardized variable definitions, and code sharing, nuances in local EHR mappings limited the availability of data at the external validation site. Focus on methods to establish enhanced data interoperability across sites may be warranted in future studies.
In addition, there are limitations to deploying this model in clinical practice. In the absence of significant data quality differences across sites, rigorously EHR-mapped variables in the OMOP Common Data Model can support incorporation of structured EHR variables into an automated risk prediction toolkit. Multiple imputation may be unavailable for handling missing variables in a real-time production environment; thus, an alternative strategy of simple imputation might be needed, as sketched below. In addition, implementation of any model in clinical practice would require surveillance and potential recalibration to the local environment.
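One simple-imputation fallback, sketched under stated assumptions (the column list lab_cols, the training frame train_df, and the incoming frame incoming_df are hypothetical placeholders), is to fill missing laboratory values at scoring time with medians learned from the development data.

```r
# Median-fill fallback for real-time scoring; `train_df` and `lab_cols`
# (names of laboratory columns) are illustrative placeholders.
lab_medians <- vapply(train_df[lab_cols], median, numeric(1), na.rm = TRUE)

impute_simple <- function(new_df, medians) {
  for (v in names(medians)) {
    miss <- is.na(new_df[[v]])
    new_df[[v]][miss] <- medians[[v]]
  }
  new_df
}

scored_input <- impute_simple(incoming_df, lab_medians)  # `incoming_df`: new patients
```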

Conclusions
In this study, we developed and externally validated an EHR-derived readmission risk prediction model for use among patients hospitalized for AMI. We developed the models within a framework for comparison of candidate modeling methods and selected the method that maintained a balance between calibration and discrimination among the candidates. This model development framework can assist in selecting a model for deployment within an EHR environment to support prioritization of limited resources for reducing the likelihood of readmission among these patients.