Validation of Prediction Models for Critical Care Outcomes Using Natural Language Processing of Electronic Health Record Data

Key Points Question Can a prediction model for mortality in the intensive care unit be improved by using more laboratory values, vital signs, and clinical text in electronic health records? Findings In this cohort study of 101 196 patients in the intensive care unit, a machine learning–based model using all available measurements of vital signs and laboratory values, plus clinical text, exhibited good calibration and discrimination in predicting in-hospital mortality, yielding an area under the receiver operating characteristic curve of 0.922. Meaning Applying methods from machine learning and natural language processing to information already routinely collected in electronic health records, including laboratory test results, vital signs, and clinical free-text notes, significantly improves a prediction model for mortality in the intensive care unit compared with approaches that use only the most abnormal vital sign and laboratory values.

Panel a) depicts the time course of the Glasgow Coma Scale (GCS) for three patients who share the same single worst value (SWV) of GCS; for instance, one patient deteriorates to a GCS of 3, but never improves. In contrast, using the slopes of the linear trends associated with each of these three series of GCS could stratify the mortality risk.
Similarly, Panel b) depicts the time course of heart rate (HR) for two patients admitted to the ICU following myocardial infarction. Patient A experiences atrial fibrillation with rapid ventricular response, which is managed with β-blockers to achieve a target rate of 110 bpm. Patient B, meanwhile, experiences intermittent ventricular tachycardia, marked by periods of elevated HR >160 bpm before returning to a relatively stable baseline. Both patients are recorded as having a SWV of HR roughly in the same range, resulting in similar risk estimates. However, the persistent variability of Patient A's HR (even though rate control is achieved), as indicated by the standard deviation of the series, may indicate a poorer prognosis than that of Patient B.
Panel c) depicts serum creatinine (SCr) values over time for two patients. Two measurements of SCr are obtained for each patient. Patient A is initially admitted to the ICU with comorbid chronic kidney disease following surgery. His SCr on admission is 1.9 mg/dL, and a repeat measurement 24 hours later is unchanged. However, Patient B experiences a rapid elevation of SCr from 0.7 to 1.9 mg/dL in the 24 hours following admission, indicating the onset of acute kidney injury caused by, e.g., renal hypoperfusion associated with septic shock. Both patients attain the same SWV of SCr, even though Patient B has a poorer prognosis, as indicated by using the difference between first and last measurements within this 24-hour window.
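The three panels contrast the single worst value (SWV) with richer summaries of each time series: the slope of a linear trend (panel a), the standard deviation (panel b), and the first-to-last change (panel c). A minimal sketch of these summaries, using NumPy and a hypothetical helper name not taken from the study's code:

```python
import numpy as np

def series_features(values, worst="min"):
    """Summarize a clinical time series with the features discussed above.

    values: measurements ordered in time (e.g., GCS, HR, or SCr).
    worst:  "min" if lower values are worse (GCS), "max" otherwise (HR, SCr).
    """
    v = np.asarray(values, dtype=float)
    t = np.arange(len(v))  # assume evenly spaced measurements for simplicity
    return {
        "swv": v.min() if worst == "min" else v.max(),  # single worst value
        "slope": np.polyfit(t, v, 1)[0],                # linear-trend slope (panel a)
        "sd": v.std(ddof=1),                            # variability (panel b)
        "delta": v[-1] - v[0],                          # first-to-last change (panel c)
    }

# Panel c illustration: both patients share a SWV of SCr = 1.9 mg/dL,
# but only Patient B's first-to-last change flags acute kidney injury.
patient_a = series_features([1.9, 1.9], worst="max")  # stable chronic kidney disease
patient_b = series_features([0.7, 1.9], worst="max")  # rapidly rising SCr
```

As in the figure, `swv` is identical for the two patients, while `delta` separates them.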

eFigure 2. Calibration of the 2 Models Validated on Data From All 3 Institutions Using 10-fold Cross Validation
Includes models 2 and 3 in Methods. The error bars depict the bootstrap 95% confidence intervals. Note that the model with notes tends to underpredict mortality for the lower-risk deciles as compared to the model without notes, and to overpredict it for patients in the middle deciles of risk, but these differences do not appear significant.
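A calibration curve of this kind is built by grouping patients into deciles of predicted risk and comparing mean predicted probability with observed mortality in each decile. A hypothetical sketch (helper name and simulated data are assumptions, not the study's code or data):

```python
import numpy as np

def decile_calibration(pred, died):
    """Mean predicted risk vs. observed mortality within deciles of predicted risk."""
    pred = np.asarray(pred, dtype=float)
    died = np.asarray(died, dtype=int)
    order = np.argsort(pred)
    bins = np.array_split(order, 10)  # ten (nearly) equal-sized risk groups
    return [(pred[b].mean(), died[b].mean()) for b in bins]

# Simulated cohort that is perfectly calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 1000)                      # predicted mortality risks
y = (rng.uniform(0, 1, 1000) < p).astype(int)    # simulated outcomes
for mean_pred, obs_rate in decile_calibration(p, y):
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}")
```

For a well-calibrated model the two columns track each other across deciles; in practice, bootstrap resampling of each decile would supply the confidence intervals shown in the figure.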

Results of the Sensitivity Analysis Comparing Validation Using All Patients With Using Only Those Alive at 24 Hours
To assess whether the models were unduly influenced by data from patients who died within the 24-hour period for which we collected data, we conducted a sensitivity analysis by developing models using only patients alive at 24 hours following ICU admission. We re-validated the models in this separate cohort and compared our results to those obtained in the original cohort. For each model, we compared the difference in AUC increases observed for each cohort relative to that of the model validated in the previous step, i.e., the AUC increase for model #2 over model #1 (the baseline single-worst-value model) and, similarly, for model #3 (model #2 + NLP) over model #2. The incremental gains in AUC as additional data were made available to the predictive algorithms were very similar whether patients who died in the first 24 hours were included or excluded.
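The comparison above reduces to computing, in each cohort, the AUC gain of model #2 over model #1 and of model #3 over model #2, and then contrasting those gains across cohorts. A minimal sketch using a rank-based AUC (function names and toy scores are illustrative assumptions, not the study's data):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()   # pairwise comparisons
    ties = (pos[:, None] == neg[None, :]).mean()     # ties count half
    return greater + 0.5 * ties

def incremental_gain(scores_prev, scores_next, labels):
    """AUC increase of the richer model over the previous one."""
    return auc(scores_next, labels) - auc(scores_prev, labels)

# Toy illustration: the richer model (#3) separates outcomes better than #2.
labels = np.array([0, 0, 0, 1, 1, 1])
m2 = np.array([0.2, 0.4, 0.5, 0.3, 0.6, 0.7])  # model #2 predicted risks
m3 = np.array([0.1, 0.3, 0.4, 0.5, 0.8, 0.9])  # model #3 predicted risks
```

The sensitivity analysis then amounts to checking that `incremental_gain` computed in the full cohort is close to the same quantity computed in the cohort restricted to patients alive at 24 hours.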