Internal and External Validation of a Machine Learning Risk Score for Acute Kidney Injury

Key Points

Question: What is the accuracy of a single-center machine learning algorithm for predicting acute kidney injury (AKI) when internally and externally tested?

Findings: In this multicenter diagnostic study of approximately 500 000 admissions from 6 hospitals in 3 health systems, the machine learning algorithm had similarly high discrimination in both internal and external validation cohorts. Alert thresholds fired nearly a day and a half before the event.

Meaning: These findings demonstrate that the AKI algorithm is generalizable to patients in the center in which it was derived and to patients from other hospitals, suggesting that implementation could prompt early identification and therapy aimed at decreasing preventable AKI.


Introduction
Acute kidney injury (AKI) is a common clinical syndrome in hospitalized patients and is associated with increased morbidity, mortality, and cost of care.1-3 Consensus criteria define AKI by either an increase in serum creatinine (SCr) concentration or a decrease in urine output.4 Biomarkers that detect AKI prior to these changes have been investigated for several years. However, to date, there has been limited large-scale validation and implementation of these tools. Detection of AKI prior to the changes in SCr concentration may provide a crucial window of opportunity to prevent further injury and allow clinicians to intervene in the hopes of improving patient outcomes.
While work on urinary and serum biomarkers of early AKI continues,5-7 several groups have reported on the accuracy of electronic health record-based risk scores that can identify AKI before changes in SCr concentration.8-14 The scope of these published algorithms has varied, with some focusing on only ward or intensive care unit (ICU) patients and others on postoperative AKI.8-14 Additionally, these algorithms range from rule-based, more parsimonious scores to complex, machine learning-based scores.8,10,13 However, regardless of the individual score, there has been limited external validation of these risk assessment tools. Our group has previously published a gradient boosted machine learning AKI prediction model for all hospitalized patients (ie, patients in the emergency department, ward, and ICU) using single-center data at the University of Chicago (UC).10 We subsequently simplified this risk score and clinically implemented the streamlined version to prompt early nephrology consultation as part of a single-center randomized controlled trial.15 In this study, we aim to both internally (at UC) and externally validate the simplified version of our AKI score using retrospective cohorts from independent health systems (Loyola University Medical Center [LUMC] and NorthShore University HealthSystem [NUS]).

Study Population
We included 3 distinct adult (≥18 years) patient cohorts in this retrospective cohort study of prospectively collected data. All admitted adult patients at UC (an urban tertiary referral hospital) who were part of the validation cohort (2008 to 2016) in our previously published AKI algorithm development study were included.10 The study protocol was approved by the UC, LUMC, and NUS institutional review boards with a waiver of informed consent based on minimal harm and impracticability. We followed

AKI Definitions
We defined AKI by the SCr-based criteria from the KDIGO consensus definition.4 Baseline SCr concentration was defined as the admission SCr value and was updated on a rolling basis for the 48-hour and 7-day criteria, as per the KDIGO guidelines.4,9,10

Statistical Analysis
Patient characteristics, laboratory values, and outcomes were compared among the 3 cohorts (NUS, LUMC, and UC). These same factors were compared within the individual cohorts between patients who developed AKI and those who did not. We used t tests, Wilcoxon rank sum tests, analyses of variance, Kruskal-Wallis tests, and χ2 tests for these comparisons, as appropriate, based on the distributions of the variables.
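The rolling-baseline SCr criteria described in the AKI Definitions section can be sketched in Python. This is an illustrative simplification, not the study's code: the function name is invented, and the staging logic omits urine output and kidney replacement therapy criteria.

```python
from bisect import bisect_left

def kdigo_scr_stage(times_h, scr, i):
    """Simplified KDIGO SCr-based AKI stage at observation i.

    times_h: sorted measurement times in hours; scr: SCr values (mg/dL).
    Rolling baseline: lowest SCr in the prior 7 days; the 0.3 mg/dL
    absolute-rise criterion is checked over the prior 48 hours.
    """
    t, value = times_h[i], scr[i]
    baseline = min(scr[bisect_left(times_h, t - 7 * 24):i + 1])
    min_48h = min(scr[bisect_left(times_h, t - 48):i + 1])
    ratio = value / baseline
    if ratio >= 3.0 or value >= 4.0:            # stage 3 (simplified)
        return 3
    if ratio >= 2.0:                            # stage 2: 2.0-2.9x baseline
        return 2
    if ratio >= 1.5 or value - min_48h >= 0.3:  # stage 1
        return 1
    return 0
```

For example, an SCr series of 1.0, 1.0, 1.4, 2.1 mg/dL over 3 days reaches stage 1 at the 0.4 mg/dL rise and stage 2 once the value doubles the rolling baseline.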
Next, the simplified version of our previously developed gradient boosted machine model, which was derived using only UC data, was applied to the UC internal validation cohort and the LUMC and NUS external validation cohorts. As previously described, the originally published gradient boosted machine model was developed using discrete time survival analysis, included 97 variables, and was developed and validated solely using UC data.10 This model was simplified to 59 variables, with model development performed as described in the prior publication10 using the same derivation cohort, with 10-fold cross-validation in the derivation data used to tune the model hyperparameters.
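Discrete time survival analysis with a boosted classifier works by expanding each admission into person-period rows, each labeled with whether the event occurs within the prediction horizon. A minimal, stdlib-only sketch of that expansion follows; the data layout, field names, and block width are assumptions for illustration, not the study's implementation.

```python
def expand_person_periods(admissions, block_h=12, horizon_h=48):
    """Expand admissions into discrete-time survival (person-period) rows.

    admissions: list of dicts with 'features' (one vector per time block)
    and 'event_h' (hours from admission to the event, or None if censored).
    Each row is labeled 1 if the event occurs within horizon_h of the
    block start; blocks after the event contribute nothing.
    """
    rows = []
    for adm in admissions:
        for i, feats in enumerate(adm["features"]):
            start = i * block_h
            if adm["event_h"] is not None and adm["event_h"] < start:
                break  # stop contributing rows once the event has occurred
            label = int(adm["event_h"] is not None
                        and start <= adm["event_h"] < start + horizon_h)
            rows.append((feats, label))
    return rows
```

A gradient boosted classifier is then fit to these rows, and its per-block event probability serves as the risk score; hyperparameters can be tuned by 10-fold cross-validation over the derivation admissions, as described above.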
Predictors in the simplified model include demographic characteristics, vital signs, routine chemistry and hematology laboratory values, trends of vital sign and laboratory values (eg, highest heart rate in previous 24 hours), and nursing documentation (eg, Braden score) (eTable 1 in the Supplement).
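Trend predictors such as the highest heart rate in the previous 24 hours are trailing-window summaries of the raw time series. A small illustrative sketch (the function name and data layout are assumed, not taken from the study):

```python
from bisect import bisect_left

def rolling_max(times_h, values, window_h=24):
    """For each observation, the maximum value over the trailing window.

    times_h must be sorted ascending; the window includes the current
    observation and everything within the prior window_h hours.
    """
    out = []
    for i, t in enumerate(times_h):
        lo = bisect_left(times_h, t - window_h)
        out.append(max(values[lo:i + 1]))
    return out
```

Analogous windowed minima, deltas, and slopes can supply the other trend features without storing the full history in the model input.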
Missing data were handled as previously described, with the median (for continuous data) or mode (for categorical data) by location being imputed for missing predictor values that remained after carry-forward imputation.10

Results
The model was well calibrated at all sites, except for the highest risk decile at UC and the top 2 risk deciles at LUMC and NUS. Table 4 demonstrates the sensitivity, specificity, and positive and negative predictive values (PPV and NPV) for each probability cutoff using the maximum score for each admission to predict stage 2 AKI during the admission. Several probability cutoffs provided high sensitivity and specificity; a cutoff of at least 0.057 provided a sensitivity of 87.1%, an NPV of 99.5%, and a PPV of 27.0% in the UC cohort. Similar or slightly lower accuracy was seen in the LUMC and NUS cohorts across different thresholds (Table 4). eTable 4 in the Supplement demonstrates the same performance metrics using every observation in the test data sets for whether the outcome occurred within 48 hours at each individual probability cutoff across all 3 cohorts.
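The per-admission metrics in Table 4 follow from standard confusion-matrix arithmetic on each admission's maximum score and observed outcome. A sketch with invented data (the function name is ours; the values are not the study's numbers):

```python
def threshold_metrics(max_scores, outcomes, cutoff):
    """Sensitivity, specificity, PPV, and NPV when flagging admissions
    whose maximum risk score reaches the cutoff.

    max_scores: one maximum predicted probability per admission;
    outcomes: 1 if stage 2 AKI developed during the admission, else 0.
    """
    tp = sum(s >= cutoff and o for s, o in zip(max_scores, outcomes))
    fp = sum(s >= cutoff and not o for s, o in zip(max_scores, outcomes))
    fn = sum(s < cutoff and o for s, o in zip(max_scores, outcomes))
    tn = sum(s < cutoff and not o for s, o in zip(max_scores, outcomes))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

Sweeping the cutoff over the observed score range yields the full trade-off table; the per-observation metrics in eTable 4 use the same arithmetic with one row per observation rather than per admission.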
The utility of the model as a decision support tool, with an illustration of the percentage of observations that crossed each alert threshold by the sensitivity of that threshold for predicting the development of stage 2 AKI within 48 hours, is shown in the Figure. As shown, relatively fewer alerts would fire at UC and NUS compared with LUMC if a high (≥60%) sensitivity was desired. In a time-to-event analysis, a cutoff of at least 0.057 predicted the later onset of stage 2 AKI a median (IQR) of 27 (6.5-93) hours before the eventual doubling in SCr concentration in the UC cohort, 34.5 (19-85) hours in the NUS cohort, and 39 (19-108) hours in the LUMC cohort. Table 4 provides time-to-event analysis for all cutoffs across all cohorts.
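The lead-time medians above come from a first-crossing analysis: for each admission that developed stage 2 AKI, the gap between the first score at or above the cutoff and the event time. A sketch with invented data (the data structure and function name are assumptions):

```python
import statistics

def lead_times(admissions, cutoff):
    """Hours from the first score >= cutoff to stage 2 AKI onset.

    admissions: iterable of (score_times_h, scores, event_h) triples for
    admissions that developed the event; admissions whose score never
    reaches the cutoff are skipped.
    """
    gaps = []
    for times, scores, event_h in admissions:
        first = next((t for t, s in zip(times, scores) if s >= cutoff), None)
        if first is not None and first <= event_h:
            gaps.append(event_h - first)
    return gaps

adms = [([0, 12, 24], [0.01, 0.06, 0.09], 51),  # crosses at 12 h, AKI at 51 h
        ([0, 12], [0.02, 0.03], 40)]            # never crosses the cutoff
median_lead = statistics.median(lead_times(adms, 0.057))  # 39 hours
```

Taking the median and IQR of these gaps per cohort reproduces the kind of summary reported for the 0.057 cutoff.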

Discussion
In this large, multicenter study across 6 hospitals, 3 health systems, and nearly 500 000 patient admissions, we performed an internal and external validation of a machine learning risk algorithm that predicts the development of AKI across all hospitalized patients. Our findings demonstrate consistent, high discrimination across all sites, hospital locations, and baseline SCr values, as well as higher discrimination for the more severe forms of AKI (ie, stage 3 AKI and the need for kidney replacement therapy [KRT]). Importantly, the model identified patients at risk of AKI nearly a day and a half earlier than the current criterion standard, ie, SCr concentration. This advance notice could allow for preemptive interventions for patients at high risk of AKI, which could improve outcomes. Our model, which has now been validated in 2 external health systems, uses clinical data that are readily available in the electronic health record and can be implemented for real-time use.15 Although model accuracy often decreases during external validation, we found similar results for predicting severe AKI in the internal and external validation cohorts. This may be because AKI is defined using SCr concentrations and our model included mostly generalizable physiologic variables.
As expected, discrimination was slightly higher in the UC internal validation cohort in some comparisons. As such, it is reasonable to expect that the implementation of an early electronic AKI risk score may similarly improve AKI outcomes. However, these novel tools should be implemented and then thoroughly investigated to determine their utility.
Using a tool like ours would involve implementing an intervention the first time a patient reaches a given risk threshold to augment our ability to prevent AKI. Table 4 demonstrates the test performance for the first time patients meet each probability cutoff. As shown by these results, there are several thresholds with adequate PPV and sensitivity values that could be used in clinical practice. Future work, which will require interventional trials, is needed to determine the optimal threshold for clinical action that balances detection rates and false alarms.

Limitations
Our study has limitations. We defined AKI only through changes in SCr concentration because of the inability to obtain accurate hourly urine output measurements in all hospitalized patients to comply with KDIGO definitions.4 However, this is in line with several other previously published AKI risk scores.8,13 Additionally, given the limitations of all 3 data sets (eg, only having access to inpatient data), we defined baseline SCr concentration using the admission values as opposed to outpatient values.

Conflict of Interest Disclosures: Reported disclosures include being a minority shareholder of AgileMD, which develops clinical decision support tools for hospitals; being the chair for the American Heart Association Get With the Guidelines adult research task force; and having an ownership interest in Quant HC, which is developing products for risk stratification of hospitalized patients. No other disclosures were reported.

JAMA Network Open | Nephrology
Funding/Support: Drs Churpek, Edelson, and Koyner were supported by grant R21DK113420 from the National Institute of Diabetes and Digestive and Kidney Diseases. Drs Churpek, Edelson, Winslow, Shah, and Afshar and Mr Carey were supported by grant R01 GM123193 from the National Institute of General Medical Sciences.
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.