The 3 algorithms are logistic regression of structured data (algorithm 3), machine learning of unstructured data (algorithm 4), and machine learning of a combination of structured and unstructured data (algorithm 5); also included are points for 2 algorithms that represent binary classification: heart failure on problem list (algorithm 1) and presence of 1 of 3 clinical characteristics (algorithm 2).
Quality metrics were assessment of ejection fraction (EF) with echocardiography, discharge medication of an angiotensin-converting enzyme (ACE) inhibitor or angiotensin receptor blocker (ARB) for patients with a documented EF of 40% or less, and discharge medication of a heart failure–specific β-blocker for patients with documented EF of 40% or less. The figure displays true-positive results and does not account for false-positive results; for instance, false-positive results for EF measurement were 6, 318, 41, 56, and 71, with a corresponding positive predictive value of 0.92, 0.30, 0.71, 0.71, and 0.67 for algorithms 1 through 5, respectively. Algorithms were heart failure on problem list (algorithm 1), presence of 1 of 3 clinical characteristics (algorithm 2), logistic regression of structured data (algorithm 3), machine learning of unstructured data (algorithm 4), and machine learning of a combination of structured and unstructured data (algorithm 5).
eMethods. Machine Learning Model Development
eTable 1. Classifiers of Heart Failure, Using Logistic Regression of Structured Data (Algorithm 3)
eTable 2. Top 25 Features for Heart Failure Classification, Using a Machine-Learning Algorithm on Unstructured Data (Algorithm 4)
eTable 3. Top 25 Features for Heart Failure Classification, Using a Machine-Learning Algorithm on Both Structured and Unstructured Data (Algorithm 5)
Blecker S, Katz SD, Horwitz LI, et al. Comparison of Approaches for Heart Failure Case Identification From Electronic Health Record Data. JAMA Cardiol. 2016;1(9):1014–1020. doi:10.1001/jamacardio.2016.3236
What is the best way to identify hospitalized patients with heart failure in real time?
In this study of 47 119 hospitalizations, inclusion of heart failure on the problem list had a sensitivity of 0.40 and a positive predictive value (PPV) of 0.96. A logistic regression model with clinical data was associated with a sensitivity of 0.68 and PPV of 0.90, whereas a machine-learning algorithm that used free text had a sensitivity of 0.83 and a PPV of 0.90.
Machine learning with clinical notes has the best predictive accuracy for identification of hospitalized patients with heart failure in real time.
Importance
Accurate, real-time case identification is needed to target interventions to improve quality and outcomes for hospitalized patients with heart failure. Problem lists may be useful for case identification but are often inaccurate or incomplete. Machine-learning approaches may improve accuracy of identification but can be limited by complexity of implementation.
Objective
To develop algorithms that use readily available clinical data to identify patients with heart failure while in the hospital.
Design, Setting, and Participants
We performed a retrospective study of hospitalizations at an academic medical center. Hospitalizations for patients 18 years or older who were admitted after January 1, 2013, and discharged before February 28, 2015, were included. From a random 75% sample of hospitalizations, we developed 5 algorithms for heart failure identification using electronic health record data: (1) heart failure on problem list; (2) presence of at least 1 of 3 characteristics: heart failure on problem list, inpatient loop diuretic, or brain natriuretic peptide level of 500 pg/mL or higher; (3) logistic regression of 30 clinically relevant structured data elements; (4) machine-learning approach using unstructured notes; and (5) machine-learning approach using structured and unstructured data.
Main Outcomes and Measures
Heart failure diagnosis based on discharge diagnosis and physician review of sampled medical records.
Results
A total of 47 119 hospitalizations were included in this study (mean [SD] age, 60.9 [18.15] years; 23 952 female [50.8%], 5258 black/African American [11.2%], and 3667 Hispanic/Latino [7.8%] patients). Of these hospitalizations, 6549 (13.9%) had a discharge diagnosis of heart failure. Inclusion of heart failure on the problem list (algorithm 1) had a sensitivity of 0.40 and a positive predictive value (PPV) of 0.96 for heart failure identification. Algorithm 2 improved sensitivity to 0.77 at the expense of a PPV of 0.64. Algorithms 3, 4, and 5 had areas under the receiver operating characteristic curves of 0.953, 0.969, and 0.974, respectively. With a PPV of 0.9, these algorithms had associated sensitivities of 0.68, 0.77, and 0.83, respectively.
Conclusions and Relevance
The problem list is insufficient for real-time identification of hospitalized patients with heart failure. The high predictive accuracy of machine learning using free text demonstrates that support of such analytics in future electronic health record systems can improve cohort identification.
Accurate, real-time identification of the diseases or conditions of a hospitalized patient is important for direct patient care, quality improvement, in-hospital registries, and electronic health record (EHR) interventions, such as clinical decision support. Problem lists ostensibly offer the ability to readily identify patients with a condition such as heart failure; as a result, problem list documentation has been associated with improved quality of care.1,2 Unfortunately, problem lists are often incomplete3,4 and fail to capture a significant number of individuals with a given disease.5
A number of algorithms and interventions have been developed to improve disease cohort identification.6-9 Most studies10,11 examining real-time identification of hospitalized patients with heart failure have relied on a small number of clinical factors and have had mixed success. More recently, approaches have incorporated machine learning and natural language processing of unstructured text from clinical documentation,9 with some suggestion of improvement in predictive accuracy.12 However, such approaches can be challenging to implement within EHR systems, which may pose some limitation to their clinical utility in real time. Head-to-head comparisons of the accuracy of different approaches to cohort identification are lacking but are critical to help health care systems determine whether any increased accuracy of a machine-learning approach is worth the increased complexity of design and implementation.
The purpose of this study was to develop algorithms to identify hospitalized patients with heart failure. We compared algorithms of increasing complexity to determine the relative benefit associated with using more advanced approaches. In comparing algorithms, we took the perspective of a health care system needing to efficiently identify patients with heart failure for real-time clinical quality improvement. Therefore, we prioritized high positive predictive value (PPV) to minimize false-positive results, which impede staff efficiency in medical record reviews and inhibit uptake of clinical decision support. We focused on patients with acute and chronic heart failure because all hospitalized patients with heart failure are at high risk of insufficient quality of care and poor postdischarge outcomes.13
We performed a retrospective study of hospitalizations at New York University Langone Medical Center using data obtained from the EHR (Epic Systems). We included hospitalizations for patients 18 years or older admitted on or after January 1, 2013, and discharged by February 28, 2015. We excluded hospitalizations that lasted less than 24 hours or were on the obstetrics service. We also excluded hospitalizations of patients who died during hospitalization or were discharged to hospice care because these patients are typically excluded from quality improvement metrics.14 The study was approved by the New York University School of Medicine Institutional Review Board, which approved a waiver of consent. Data were not deidentified.
From these 47 119 hospitalizations, we sampled 315 for physician medical record review; these records were used for initial validation of hospital classifiers, including discharge diagnoses.15,16 Of the remaining hospitalizations, we randomly selected 75% for model development and 25% for model validation.
We developed classification algorithms for identification of hospitalized patients who have heart failure, including cases in which heart failure was the primary reason for hospitalization and cases in which heart failure was a secondary condition. For model development, heart failure was defined using standard International Classification of Diseases, Ninth Revision (ICD-9), discharge diagnosis codes14,17 in any position; on the basis of physician review of 315 records, we determined that this definition had a sensitivity of 71.4% and a specificity of 98.7% and performed similarly to other simple phenotype approaches using discharge characteristics.16 For model development, we used variables present up to the second midnight of hospitalization.18
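As an illustration of the second-midnight restriction, the following minimal Python sketch computes the cutoff time for a hospitalization and keeps only data recorded by that time. The function names, the tuple layout of the observations, and the handling of an admission exactly at midnight are assumptions made for illustration; the study does not describe its implementation.

```python
from datetime import datetime, time, timedelta


def second_midnight(admit_time: datetime) -> datetime:
    """Return the second midnight after admission (assumed cutoff definition)."""
    # First midnight: start of the calendar day after the admission date.
    first_midnight = datetime.combine(admit_time.date() + timedelta(days=1), time.min)
    # Second midnight: exactly one day later.
    return first_midnight + timedelta(days=1)


def available_by_cutoff(admit_time, observations):
    """Keep (timestamp, name) observations recorded at or before the cutoff.

    Data from before admission (for example, a prior discharge diagnosis)
    are kept, because they are already present in the record.
    """
    cutoff = second_midnight(admit_time)
    return [(ts, name) for ts, name in observations if ts <= cutoff]


# Hypothetical hospitalization admitted on the afternoon of January 1, 2013.
admit = datetime(2013, 1, 1, 15, 30)
observations = [
    (datetime(2012, 11, 5, 9, 0), "prior heart failure discharge diagnosis"),
    (datetime(2013, 1, 2, 8, 0), "BNP result"),
    (datetime(2013, 1, 4, 10, 0), "echocardiogram report"),
]
usable = available_by_cutoff(admit, observations)
```

In this hypothetical example, the echocardiogram report from the fourth hospital day falls after the cutoff and would not be available to the classifier.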
Potential structured data elements used for heart failure classification were demographics, laboratory results, vital signs, problem list diagnoses, and medications used in the treatment of heart failure. For laboratory results and vital signs, we included an indicator of the presence or absence of results and the value. We included an indicator of the presence of an echocardiogram test; however, we did not include ejection fraction (EF) in the model because this data element was not structured in our database. Problem list diagnoses included heart failure, acute myocardial infarction, and atherosclerosis. We also included variables of a prior discharge diagnosis of heart failure, with separate variables for a principal diagnosis of heart failure (ie, an admission specifically for heart failure) and any prior discharge diagnosis. Medications included loop diuretics, angiotensin-converting enzyme (ACE) inhibitors or angiotensin receptor blockers (ARBs), β-blockers, and evidence-based heart failure–specific β-blockers used as an inpatient or outpatient. Unstructured data elements included admission notes, physician progress notes, echocardiogram reports, chest imaging reports, and consultation notes.
We developed 5 algorithms for identification of patients with heart failure at the second midnight of hospitalization based on increasing complexity of data and analysis. The first algorithm was based exclusively on the presence of heart failure on the problem list. The second was the presence of at least 1 of the following characteristics: heart failure on the problem list, inpatient oral or intravenous loop diuretic use, or brain natriuretic peptide (BNP) (N-terminal fragment of the prohormone BNP assay; Roche Diagnostics) level of 500 pg/mL or greater (to convert to nanograms per liter, multiply by 1). The third algorithm used logistic regression with clinically relevant structured variables. The fourth algorithm used a machine-learning approach with unstructured data. The fifth algorithm used a machine-learning approach with structured and unstructured data.
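The first 2 algorithms are simple rules and can be sketched in a few lines of Python. The record structure and field names below are hypothetical; only the 3 characteristics and the 500-pg/mL threshold come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

BNP_THRESHOLD_PG_ML = 500.0  # threshold stated for algorithm 2


@dataclass
class Hospitalization:
    """Hypothetical record of data available at the second midnight."""
    hf_on_problem_list: bool
    inpatient_loop_diuretic: bool  # oral or intravenous
    bnp_pg_ml: Optional[float] = None  # None when no BNP was measured


def algorithm_1(h: Hospitalization) -> bool:
    """Heart failure on the problem list."""
    return h.hf_on_problem_list


def algorithm_2(h: Hospitalization) -> bool:
    """At least 1 of 3 characteristics is present."""
    high_bnp = h.bnp_pg_ml is not None and h.bnp_pg_ml >= BNP_THRESHOLD_PG_ML
    return h.hf_on_problem_list or h.inpatient_loop_diuretic or high_bnp


# A patient with an elevated BNP but no problem list entry and no diuretic.
example = Hospitalization(False, False, bnp_pg_ml=850.0)
flags = (algorithm_1(example), algorithm_2(example))
```

The example patient is missed by algorithm 1 and flagged by algorithm 2, which is the kind of case that raises sensitivity at the cost of PPV.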
As part of model validation, we performed a physician medical record review to enhance our criterion standard. Classification of hospitalizations was based on guidelines developed and validated in the Atherosclerosis Risk in Communities study; cases were those adjudicated as acute decompensated or chronic stable heart failure.19 Two physicians (A.G. and Matthew Durstenfeld, MD) independently reviewed hospitalizations with an overlap of 50 medical records that demonstrated a κ of 0.86. Differences for these medical records were adjudicated by a third reviewer (S.B.).
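For readers unfamiliar with the κ statistic, the sketch below computes Cohen κ for 2 reviewers making a binary heart failure determination. The counts in the example are hypothetical and are not the study data; they only show how a 50-record overlap would be summarized.

```python
def cohen_kappa(both_yes: int, first_only: int, second_only: int, both_no: int) -> float:
    """Cohen kappa for 2 raters and a binary label.

    Raises ZeroDivisionError if chance agreement is exactly 1
    (both raters always give the same single label).
    """
    n = both_yes + first_only + second_only + both_no
    observed = (both_yes + both_no) / n
    # Marginal proportion of "heart failure" calls for each reviewer.
    p1_yes = (both_yes + first_only) / n
    p2_yes = (both_yes + second_only) / n
    # Agreement expected by chance from the marginals.
    expected = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical 50-record overlap: the reviewers disagree on 3 records.
kappa = cohen_kappa(both_yes=20, first_only=2, second_only=1, both_no=27)
```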
To estimate the potential benefit of each algorithm in clinical practice, we calculated the number of patients with heart failure who were not adherent with 3 care measures. We then projected the number of these patients who would be correctly identified by each algorithm. The care measures were an echocardiogram to measure EF before or during hospitalization, an ACE inhibitor or ARB at discharge for patients with a documented EF of 40% or less and no contraindication, and an evidence-based β-blocker at discharge for patients with an EF of 40% or less and no allergy.14,20 Contraindications to an ACE inhibitor or ARB included an allergy to either medication class, creatinine level greater than 2.5 mg/dL (to convert to micromoles per liter, multiply by 88.4) at discharge, potassium level greater than 5.0 mEq/L (to convert to millimoles per liter, multiply by 1) at discharge, or systolic blood pressure less than 90 mm Hg at discharge.21 For this analysis, heart failure was based on a discharge diagnosis, and EF was obtained using text extraction for most echocardiograms that used a standard format.
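The eligibility logic for the ACE inhibitor or ARB measure can be written as a small rule set. The following is a minimal sketch that uses the thresholds listed above; the data structure and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DischargeStatus:
    """Hypothetical discharge values needed for the contraindication check."""
    allergy_ace_or_arb: bool
    creatinine_mg_dl: float
    potassium_meq_l: float
    systolic_bp_mm_hg: float


def has_ace_arb_contraindication(d: DischargeStatus) -> bool:
    """True if any contraindication listed in the text is present at discharge."""
    return (
        d.allergy_ace_or_arb
        or d.creatinine_mg_dl > 2.5
        or d.potassium_meq_l > 5.0
        or d.systolic_bp_mm_hg < 90
    )


def eligible_for_ace_arb_measure(ef_percent: Optional[float], d: DischargeStatus) -> bool:
    """Documented EF of 40% or less and no contraindication."""
    return ef_percent is not None and ef_percent <= 40 and not has_ace_arb_contraindication(d)


stable = DischargeStatus(False, creatinine_mg_dl=1.1, potassium_meq_l=4.2, systolic_bp_mm_hg=112)
renal = DischargeStatus(False, creatinine_mg_dl=3.0, potassium_meq_l=4.2, systolic_bp_mm_hg=112)
```

A hospitalization counts as nonadherent for this measure only if it is eligible and no ACE inhibitor or ARB was prescribed at discharge.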
Each of the 5 classification algorithms was developed using the development set. For the third algorithm, we developed a logistic regression model using a heart failure discharge diagnosis as the dependent variable and structured data elements as the independent variables. We developed the fourth algorithm using an L1-regularized logistic regression model that searched all free-text words in a computationally efficient variable selection (eMethods in the Supplement).22 Our fifth algorithm was developed using L1-regularized logistic regression on the structured data used in algorithm 3 and the unstructured data used in algorithm 4.
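To show how L1 regularization selects a small set of words from free text, the sketch below fits a bag-of-words logistic regression with proximal gradient descent (soft thresholding) in plain Python. It is not the study implementation, which is described in the eMethods; the toy notes, the penalty strength, the step size, and the omission of an intercept are all assumptions made to keep the example short.

```python
import math


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of the L1 penalty: shrink toward 0, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(docs, labels, lam=0.01, lr=0.5, n_iter=300):
    """Fit word weights on binary word-presence features (no intercept, for brevity)."""
    rows = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*rows))
    weights = {word: 0.0 for word in vocab}
    n = len(docs)
    for _ in range(n_iter):
        grad = {word: 0.0 for word in vocab}
        for words, y in zip(rows, labels):
            p = sigmoid(sum(weights[w] for w in words))
            for w in words:
                grad[w] += (p - y) / n  # gradient of the mean log loss
        for w in vocab:
            # Gradient step followed by the L1 shrinkage step.
            weights[w] = soft_threshold(weights[w] - lr * grad[w], lr * lam)
    return weights


def predict_proba(weights, doc: str) -> float:
    """Predicted probability of heart failure; unseen words contribute nothing."""
    return sigmoid(sum(weights.get(w, 0.0) for w in set(doc.lower().split())))


# Toy notes (hypothetical); 1 = heart failure discharge diagnosis.
notes = [
    "chf exacerbation lasix",
    "congestive failure lasix",
    "pneumonia cough fever",
    "fracture pain",
]
labels = [1, 1, 0, 0]
weights = fit_l1_logistic(notes, labels)
```

In practice the penalty strength would be tuned on held-out data, and a library implementation would replace this loop for a vocabulary of realistic size.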
We evaluated performance characteristics of the 5 algorithms compared with discharge diagnoses in the development and validation sets. Sensitivity and PPV were calculated for each algorithm based on 2 criterion standards: (1) discharge diagnosis and (2) physician medical record review. For algorithms 3 through 5, which provide a continuous-valued prediction, we calculated the cutoff values in the development set for each algorithm to fix the PPV at 0.8 and determined the corresponding sensitivity; we then calculated the PPV and sensitivity in the validation set for these cutoff values. For these algorithms, we also measured the area under the receiver operating characteristic curve (AUC) using the discharge diagnosis criterion standard to determine discrimination.
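The cutoff selection described above can be sketched as follows: scan candidate cutoffs in the development set, keep the lowest one whose PPV meets the target, and then apply that fixed cutoff to the validation set. The function names and the toy scores are hypothetical, and choosing the lowest qualifying cutoff (to maximize sensitivity) is an assumption about how ties among cutoffs would be resolved.

```python
def ppv_and_sensitivity(scores, labels, cutoff):
    """PPV and sensitivity when hospitalizations with score >= cutoff are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    return tp / (tp + fp), tp / (tp + fn)


def cutoff_for_target_ppv(scores, labels, target_ppv):
    """Lowest observed score whose PPV is at least target_ppv, or None if none qualifies."""
    best = None
    for cutoff in sorted(set(scores), reverse=True):
        ppv, _ = ppv_and_sensitivity(scores, labels, cutoff)
        if ppv >= target_ppv:
            best = cutoff  # keep lowering the cutoff while the target is met
    return best


# Hypothetical development-set predictions and discharge diagnosis labels.
dev_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
dev_labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
cutoff = cutoff_for_target_ppv(dev_scores, dev_labels, target_ppv=0.8)
dev_ppv, dev_sensitivity = ppv_and_sensitivity(dev_scores, dev_labels, cutoff)
```

The same `ppv_and_sensitivity` call, applied to validation-set scores with the cutoff held fixed, gives the validation estimates.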
We then calculated the sensitivity and PPV for the determined cutoff value using medical record review in the validation set based on the method used by Wright and colleagues.6 In this approach, we divided the validation set into 3 groups. The first group consisted of hospitalizations that almost certainly had a true heart failure diagnosis; these hospitalizations were defined as having all 3 of the following characteristics: a discharge diagnosis of heart failure, a loop diuretic, and an echocardiogram. On the basis of medical record review, this group had a PPV for hospitalizations of patients with heart failure of 94%.16 The second group contained patients who almost certainly did not have heart failure, defined as the absence of these 3 characteristics. On the basis of medical record review, this group had a negative predictive value for hospitalizations with heart failure of 99%.16 The third group represented the remaining patients, for whom heart failure status was not obvious at the time of discharge. From these hospitalizations, we randomly selected 100 algorithm-positive and 100 algorithm-negative hospitalizations for each of the 5 algorithms. These 200 records were reviewed by a physician and adjudicated as heart failure or not heart failure based on Atherosclerosis Risk in Communities criteria; we then determined the number of true- and false-positive results and true- and false-negative results for these records. We weighted these results based on the sampling scheme and combined them with groups 1 and 2 to determine the sensitivity and PPV for each algorithm.6
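The weighting step can be illustrated with a short calculation. The sketch below assumes that group 1 hospitalizations are counted as heart failure and group 2 hospitalizations as not heart failure, and that the reviewed samples from group 3 are scaled up to the size of the algorithm-positive and algorithm-negative strata; these details, the function name, and all counts are assumptions for illustration rather than the study's exact procedure.

```python
def weighted_sensitivity_ppv(
    g1_flagged, g1_not_flagged,      # group 1: treated as heart failure
    g2_flagged,                      # group 2: treated as not heart failure
    g3_flagged_total, g3_not_flagged_total,  # group 3 stratum sizes
    sample_flagged_hf, sample_flagged_n,     # reviewed algorithm-positive sample
    sample_not_flagged_hf, sample_not_flagged_n,  # reviewed algorithm-negative sample
):
    """Combine groups 1 and 2 with the weighted group 3 sample."""
    # Scale sample results up to the full group 3 strata.
    g3_tp = g3_flagged_total * sample_flagged_hf / sample_flagged_n
    g3_fp = g3_flagged_total - g3_tp
    g3_fn = g3_not_flagged_total * sample_not_flagged_hf / sample_not_flagged_n

    tp = g1_flagged + g3_tp
    fp = g2_flagged + g3_fp
    fn = g1_not_flagged + g3_fn
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical counts for one algorithm.
sensitivity, ppv = weighted_sensitivity_ppv(
    g1_flagged=900, g1_not_flagged=100,
    g2_flagged=50,
    g3_flagged_total=400, g3_not_flagged_total=2000,
    sample_flagged_hf=80, sample_flagged_n=100,
    sample_not_flagged_hf=5, sample_not_flagged_n=100,
)
```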
To estimate the number of patients for whom there was an opportunity for care improvement, we took all hospitalizations in the validation set that had a discharge diagnosis of heart failure. We calculated the number of these hospitalizations that were not adherent with each of the 3 care measures. We then counted the number of these nonadherent hospitalizations that were identified by each of the algorithms. We also determined the PPV for the quality metric of echocardiogram. Analyses were performed using STATA software, version 13 (StataCorp), SAS software, version 9.3 (SAS Institute Inc), and Python software, version 2.7 (Python Software Foundation).
A total of 47 119 hospitalizations were included in this study (mean [SD] age, 60.9 [18.15] years; 23 952 female [50.8%], 5258 black/African American [11.2%], and 3667 Hispanic/Latino [7.8%] patients). After 315 records were excluded for initial record review, there were 35 114 hospitalizations in the development set and 11 690 hospitalizations in the validation set. Of these hospitalizations, 6549 (13.9%) carried a diagnosis of heart failure in any position and 1214 (2.6%) carried a principal diagnosis of heart failure (Table 1).
The inclusion of heart failure on the problem list (algorithm 1) was associated with a sensitivity of 0.52 and a PPV of 0.96 for identification of heart failure based on the discharge diagnosis code criterion standard in the validation set (Table 2). Heart failure on the problem list had a sensitivity of 0.40 and a PPV of 0.96 in the validation set using the criterion standard of sampling with physician medical record review. Algorithm 2, defined as the presence of heart failure on the problem list, an inpatient loop diuretic, or a BNP level of 500 pg/mL or higher, was associated with sensitivities of 0.84 and 0.77 and PPVs of 0.58 and 0.64 compared with discharge diagnosis and physician review criterion standards in the validation set, respectively.
The third algorithm, in which heart failure was classified using logistic regression, included 30 structured data elements in the model. Variables that had an association with heart failure included heart failure on the problem list, any prior diagnosis of heart failure, inpatient diuretics, outpatient heart failure β-blocker use, and high BNP level (eTable 1 in the Supplement). This algorithm had an AUC of 0.953 in validation, a sensitivity of 0.76, and a PPV of 0.8 (Table 2 and Figure 1). In validation using the physician review criterion standard, the algorithm had a sensitivity of 0.68 with a PPV of 0.90 (Table 2).
The fourth algorithm, which used a machine-learning approach on free text, included 1118 elements in the final model. The top prognostic factors in the algorithm were all clinically relevant and included the terms chf, hf, nyha, failure, congestive, and Lasix (eTable 2 in the Supplement). This model had an AUC of 0.969 in validation and a sensitivity of 0.84 with a PPV of 0.80 in the validation set using the discharge diagnosis criterion standard.
The fifth algorithm used a machine-learning approach to identify 947 unstructured and structured data elements in the final model. The top prognostic factor for this model was heart failure in the problem list, followed by mention of chf and hf in free text (eTable 3 in the Supplement). This algorithm had an AUC of 0.974. The algorithm had a sensitivity of 0.86 with a PPV of 0.80 using the discharge diagnosis and a sensitivity of 0.83 with a PPV of 0.90 using the physician review.
Of 1631 hospitalizations for a principal or secondary diagnosis of heart failure in the validation set, 195 (12.0%) did not have a prior echocardiogram. Of these hospitalizations, 66 (33.8%) had heart failure listed on the problem list (algorithm 1). Algorithm 3 increased the number of these patients identified as having heart failure by 34, whereas algorithms 2, 4, and 5 increased the number of patients identified by between 69 and 74 over algorithm 1 (Figure 2). The PPV for identification of heart failure among patients without an echocardiogram was 0.92, 0.30, 0.71, 0.71, and 0.67 for algorithms 1 through 5, respectively. Among 430 hospitalizations for a diagnosis of heart failure and a known EF of 40% or less, patients in 109 hospitalizations (25.3%) were not discharged with an ACE inhibitor or ARB, whereas 91 (21.2%) were not discharged with an evidence-based β-blocker. With the use of the problem list alone, heart failure was classified in 76 heart failure hospitalizations (69.7%) with no ACE inhibitor or ARB and 44 (48.3%) with no β-blocker (Figure 2). The second algorithm classified heart failure in 100 hospitalizations (91.7%) with no ACE inhibitor or ARB and 76 (83.5%) with no β-blocker. Algorithm 3, which was developed to have a lower sensitivity but improved PPV, correctly identified heart failure in 97 (88.9%) and 69 (75.8%) of these hospitalizations, respectively. Algorithms 4 and 5 correctly classified 106 (97.2%) and 105 (96.3%) hospitalizations with no ACE inhibitor or ARB, respectively; both algorithms correctly classified 82 (90.1%) of those with no β-blocker (Figure 2).
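The PPV values for the echocardiogram measure follow directly from the counts reported here and in the Figure 2 legend. As a worked check in Python, algorithm 1 identified 66 nonadherent heart failure hospitalizations with 6 false-positive results, and algorithm 3 identified 34 more (100 in total) with 41 false-positive results. The function name is illustrative; the counts are taken from the text.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Algorithm 1: 66 true-positive and 6 false-positive results.
ppv_algorithm_1 = round(positive_predictive_value(66, 6), 2)

# Algorithm 3: 66 + 34 = 100 true-positive and 41 false-positive results.
ppv_algorithm_3 = round(positive_predictive_value(66 + 34, 41), 2)

# Share of the 195 hospitalizations without an echocardiogram
# that algorithm 1 identified, as a percentage.
share_algorithm_1 = round(100 * 66 / 195, 1)
```

These reproduce the reported PPVs of 0.92 and 0.71 and the 33.8% identified by the problem list. The exact true-positive counts for algorithms 2, 4, and 5 are given only as a range in the text, so they are not recomputed here.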
Patients with heart failure are hospitalized approximately 4 million times annually and are at high risk of postdischarge readmission and mortality.13,17,20 Given this risk, there is great opportunity to improve their care and outcomes. To implement interventions to improve outcomes, we need rapid identification of patients with heart failure early in their hospitalization.
Although accurate disease cohort identification is generally complex, identification of patients with heart failure in real time can be particularly challenging.23 Unlike other chronic diseases, such as diabetes, heart failure is a clinical diagnosis with no biometric criterion standard and no medications that are specific to this disease. The challenge in making even a clinical diagnosis of heart failure is evidenced by our criterion standard of medical record review, which had a high but imperfect level of concordance between 2 physicians (κ = 0.86), a finding that is consistent with a previous study.24 Defining heart failure through automated analysis of EHR data adds complexity to the phenotyping task. As a result, increasingly sophisticated approaches to cohort identification are being used.12,25
However, the implementation of complex algorithms into an EHR for real-time identification of heart failure may require special expertise and resources. As a result, there may be a tradeoff between the cost of implementation and the benefit of improved cohort identification with sophisticated approaches. We compared the performance of 5 algorithms based on increasing complexity to identify a heart failure phenotype among hospitalized patients. Our results suggest that the best approach may depend on clinical and operational needs. We demonstrated that the problem list had a high PPV for hospitalizations of patients with heart failure, which may be useful for initial cohort development. However, this approach was limited in that only approximately half of all hospitalizations for heart failure had the diagnosis listed on the problem list; similarly, prior studies3-5 have found limitations in the use of problem lists. An insufficient problem list can significantly limit the benefits of an EHR, which typically relies on problem lists for important functionalities, such as clinical decision support. A second algorithm, defined by the presence of at least 1 clinical characteristic commonly observed in a heart failure hospitalization,26 might be useful for an initial screening tool given its high sensitivity. However, the low PPV for this algorithm necessitates a confirmatory test to be useful in clinical practice.
The last 3 algorithms had improved accuracy for identification of heart failure. The third algorithm of clinically relevant structured data performed extremely well in terms of AUC, a measurement of global classification. This third algorithm is relatively easy to implement because its resulting risk score is a linear combination of structured data elements. The fourth and fifth algorithms are likely more difficult to implement because they rely on processing of unstructured data. Nonetheless, this cost may be worth the improved performance, depending on clinical needs. Notably, we estimated that using machine-learning algorithms with unstructured data doubled the number of early-identified heart failure hospitalizations with no echocardiography or no evidence-based β-blocker at discharge when compared with the problem list alone. Furthermore, compared with simpler algorithms, the machine-learning algorithms improved identification while maintaining high PPV, an important consideration because too many false-positive results could adversely affect quality improvement initiatives, such as decision support.27 Given these benefits, at our institution, we plan to implement a machine-learning approach to facilitate interventions that target hospitalized patients with heart failure. To deploy these algorithms, we will export EHR data to a secure server on which the algorithms will run; the resulting identifier will be put back into the EHR to be used for care delivery. As a result, deployment is independent of EHR vendor, and our algorithms can be replicated at other institutions once validated with local data.
Study results should be interpreted in the context of limitations. First, there were limitations in both validation criterion standards. We first used discharge diagnosis codes, which are subject to misclassification, including false-positive results related to upcoding.28 For our second criterion standard, physician medical record review, we observed an imperfect reliability among physicians, and our approach may have been subject to sampling bias. Nonetheless, our sampling approach was based on an established method.6 Second, our study took place at a single institution, so findings may not be generalizable to other hospitals. Third, we may have missed potential contraindications to quality metrics; therefore, some patients may have been appropriately nonadherent to these metrics. Fourth, the algorithms were developed to identify hospitalized patients with heart failure and were not validated on outpatients, although a similar approach could be tailored to outpatients with heart failure.
As the focus of hospitals has shifted from acute care to the acute and postacute period, early identification of disease during hospitalization has become paramount to initiate transitional care. Our findings suggest that the problem list, which identified only half of hospitalized patients with heart failure, is insufficient for real-time identification of this population. Relying on analysis of free-text notes and procedure reports appears to have the best predictive accuracy. As a result, there are opportunities for improvement in real-time identification of disease cohorts, particularly as EHR vendors begin to natively support more sophisticated algorithms.
Corresponding Author: Saul Blecker, MD, MHS, Department of Population Health, New York University School of Medicine, 227 E 30th St, Room 648, New York, NY 10016 (firstname.lastname@example.org).
Accepted for Publication: July 22, 2016.
Published Online: October 5, 2016. doi:10.1001/jamacardio.2016.3236
Author Contributions: Dr Blecker had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Blecker, Katz, Sontag.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Blecker, Katz, Sontag.
Critical revision of the manuscript for important intellectual content: Katz, Horwitz, Kuperman, Park, Gold, Sontag.
Statistical analysis: Blecker, Park, Sontag.
Obtaining funding: Blecker, Katz.
Administrative, technical, or material support: Kuperman.
Study supervision: Katz.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Drs Sontag and Blecker reported having a patent pending for a machine-learning algorithm to predict diabetes. No other disclosures were reported.
Funding/Support: This work was supported by grant K08HS23683 from the Agency for Healthcare Research and Quality (Dr Blecker).
Role of the Funder/Sponsor: The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and the decision to submit the manuscript for publication.
Additional Contributions: Matthew Durstenfeld, MD, New York University School of Medicine, performed medical record review for adjudication of heart failure and received a small compensation.