Comparison of Machine Learning Methods With Traditional Models for Use of Administrative Claims With Electronic Medical Records to Predict Heart Failure Outcomes | Cardiology | JAMA Network Open
Figure 1.  Observed Risk for Each Outcome of Interest in the Testing Data Within Deciles of Gradient-Boosted Model Predicted Risk Strata

These plots compare observed probability (outcome event rates per person-year) in 10 risk groups based on gradient-boosted model predictions using claims-only and claims + electronic medical record (EMR) predictors for mortality (A), heart failure (HF) hospitalization (B), high cost (C), and home time loss (D). Greater values of observed probability in higher-risk strata indicate a higher-yield model.

Figure 2.  Most Influential Predictors From the Gradient-Boosted Models for Each Outcome of Interest

Relative influence values range from 0 to 100 and indicate the proportional contribution of a variable in predicting the outcome of interest. Relative influence from gradient-boosted models using claims-only and claims + electronic medical record (EMR) predictors are plotted for the top 10 predictors of mortality (A), heart failure (HF) hospitalization (B), high cost (C), and home time loss (D). BNP indicates B-type natriuretic peptide; BUN, blood urea nitrogen; LVEF, left-ventricular ejection fraction; and SES, socioeconomic status.

Table 1.  Baseline Characteristics of Medicare-Enrolled Patients With HF Included in the Study, 2007-2014
Table 2.  Comparison of Models in Predicting Outcomes in Patients With Heart Failure in the Testing Data Set
Table 3.  Subgroup-Specific ROC of the Gradient-Boosted Models in the Testing Data Set
1. Mozaffarian D, Benjamin EJ, Go AS, et al; Writing Group Members; American Heart Association Statistics Committee; Stroke Statistics Subcommittee. Executive summary: heart disease and stroke statistics—2016 update: a report from the American Heart Association. Circulation. 2016;133(4):447-454. doi:10.1161/CIR.0000000000000366
2. Benjamin EJ, Virani SS, Callaway CW, et al; American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee. Heart disease and stroke statistics—2018 update: a report from the American Heart Association. Circulation. 2018;137(12):e67-e492. doi:10.1161/CIR.0000000000000558
3. Blecker S, Paul M, Taksler G, Ogedegbe G, Katz S. Heart failure–associated hospitalizations in the United States. J Am Coll Cardiol. 2013;61(12):1259-1267. doi:10.1016/j.jacc.2012.12.038
4. Rahimi K, Bennett D, Conrad N, et al. Risk prediction in patients with heart failure: a systematic review and analysis. JACC Heart Fail. 2014;2(5):440-446. doi:10.1016/j.jchf.2014.04.008
5. Frizzell JD, Liang L, Schulte PJ, et al. Prediction of 30-day all-cause readmissions in patients hospitalized for heart failure: comparison of machine learning and other statistical approaches. JAMA Cardiol. 2017;2(2):204-209. doi:10.1001/jamacardio.2016.3956
6. Greiner MA, Hammill BG, Fonarow GC, et al. Predicting costs among Medicare beneficiaries with heart failure. Am J Cardiol. 2012;109(5):705-711. doi:10.1016/j.amjcard.2011.10.031
7. Lee H, Shi SM, Kim DH. Home time as a patient-centered outcome in administrative claims data. J Am Geriatr Soc. 2019;67(2):347-351. doi:10.1111/jgs.15705
8. Greene SJ, O’Brien EC, Mentz RJ, et al. Home-time after discharge among patients hospitalized with heart failure. J Am Coll Cardiol. 2018;71(23):2643-2652. doi:10.1016/j.jacc.2018.03.517
9. Hennessy S. Use of health care databases in pharmacoepidemiology. Basic Clin Pharmacol Toxicol. 2006;98(3):311-313. doi:10.1111/j.1742-7843.2006.pto_368.x
10. McCormick N, Lacaille D, Bhole V, Avina-Zubieta JA. Validity of heart failure diagnoses in administrative databases: a systematic review and meta-analysis. PLoS One. 2014;9(8):e104519. doi:10.1371/journal.pone.0104519
11. Ouwerkerk W, Voors AA, Zwinderman AH. Factors influencing the predictive power of models for predicting mortality and/or heart failure hospitalization in patients with heart failure. JACC Heart Fail. 2014;2(5):429-436. doi:10.1016/j.jchf.2014.04.006
12. Kim DH, Schneeweiss S, Glynn RJ, Lipsitz LA, Rockwood K, Avorn J. Measuring frailty in Medicare data: development and validation of a claims-based frailty index. J Gerontol A Biol Sci Med Sci. 2018;73(7):980-987. doi:10.1093/gerona/glx229
13. Bonito A, Bann C, Eicheldinger C, Carpenter L. Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries: Final Report, Sub-Task 2. Rockville, MD: Agency for Healthcare Research and Quality; January 2008. AHRQ publication 08-0029-EF.
14. Gopalakrishnan C, Gagne JJ, Sarpatwari A, et al. Evaluation of socioeconomic status indicators for confounding adjustment in observational studies of medication use. Clin Pharmacol Ther. 2019;105(6):1513-1521. doi:10.1002/cpt.1348
15. Steyerberg EW, van Veen M. Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol. 2007;60(9):979. doi:10.1016/j.jclinepi.2007.03.003
16. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393. doi:10.1136/bmj.b2393
17. Austin PC. Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Commun Stat Simul Comput. 2009;38(6):1228-1234. doi:10.1080/03610910902859574
18. Chand S. On tuning parameter selection of LASSO-type methods—a Monte Carlo study. Paper presented at: 9th International Bhurban Conference on Applied Sciences and Technology (IBCAST); January 9-12, 2012; Islamabad, Pakistan. https://ieeexplore.ieee.org/document/6177542. Accessed January 31, 2018.
19. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. J Am Stat Assoc. 2010;105(489):312-323. doi:10.1198/jasa.2009.tm08013
20. Oyeyemi GM, Ogunjobi EO, Folorunsho AI. On performance of shrinkage methods—a Monte Carlo study. Int J Stat Appl. 2015;5(2):72-76. doi:10.5923/j.statistics.20150502.04
21. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674. doi:10.1198/106186006X133933
22. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32. doi:10.1023/A:1010933404324
23. Hastie T, Tibshirani R, Friedman J. Boosting and additive trees. In: The Elements of Statistical Learning. New York, NY: Springer; 2009:337-387. doi:10.1007/978-0-387-84858-7_10
24. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. doi:10.1097/EDE.0b013e3181c30fb2
25. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. doi:10.2307/2531595
26. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432
27. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565-574. doi:10.1177/0272989X06295361
28. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22. doi:10.1016/j.jclinepi.2019.02.004
29. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):18. doi:10.1038/s41746-018-0029-1
    Original Investigation
    Cardiology
    January 10, 2020

    Comparison of Machine Learning Methods With Traditional Models for Use of Administrative Claims With Electronic Medical Records to Predict Heart Failure Outcomes

    Author Affiliations
    • 1Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
    • 2Heart and Vascular Center, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
    • 3Market Access, Bayer AG, Wuppertal, Germany
    JAMA Netw Open. 2020;3(1):e1918962. doi:10.1001/jamanetworkopen.2019.18962
    Key Points

    Question  Can prediction of patient outcomes in heart failure based on routinely collected claims data be improved with machine learning methods and by incorporating linked electronic medical records?

    Findings  In this prognostic study including records on 9502 patients, machine learning methods offered only limited improvement over logistic regression in predicting key outcomes in heart failure based on administrative claims. Inclusion of additional predictors from electronic medical records improved prediction for mortality, heart failure hospitalization, and loss in home days but not for high cost.

    Meaning  Models based on claims-only predictors may achieve modest discrimination and accuracy in prediction of key patient outcomes in heart failure, and machine learning approaches and incorporation of additional predictors from electronic medical records may offer some improvement in risk prediction of select outcomes.

    Abstract

    Importance  Accurate risk stratification of patients with heart failure (HF) is critical to deploy targeted interventions aimed at improving patients’ quality of life and outcomes.

    Objectives  To compare machine learning approaches with traditional logistic regression in predicting key outcomes in patients with HF and evaluate the added value of augmenting claims-based predictive models with electronic medical record (EMR)–derived information.

    Design, Setting, and Participants  A prognostic study with a 1-year follow-up period was conducted including 9502 Medicare-enrolled patients with HF from 2 health care provider networks in Boston, Massachusetts (“providers” includes physicians, clinicians, other health care professionals, and their institutions that comprise the networks). The study was performed from January 1, 2007, to December 31, 2014; data were analyzed from January 1 to December 31, 2018.

    Main Outcomes and Measures  All-cause mortality, HF hospitalization, top cost decile, and home days loss greater than 25% were modeled using logistic regression, least absolute shrinkage and selection operator (LASSO) regression, classification and regression trees, random forests, and gradient-boosted modeling (GBM). All models were trained using data from network 1 and tested in network 2. After selecting the most efficient modeling approach based on discrimination, Brier score, and calibration, areas under the precision-recall curves (AUPRCs) and net benefit estimates from decision curves were calculated to focus on the differences when using claims-only vs claims + EMR predictors.

    Results  A total of 9502 patients with HF with a mean (SD) age of 78 (8) years were included: 6113 from network 1 (training set) and 3389 from network 2 (testing set). Gradient-boosted modeling consistently provided the highest discrimination, lowest Brier scores, and good calibration across all 4 outcomes; however, logistic regression had generally similar performance (C statistics for logistic regression based on claims-only predictors: mortality, 0.724; 95% CI, 0.705-0.744; HF hospitalization, 0.707; 95% CI, 0.676-0.737; high cost, 0.734; 95% CI, 0.703-0.764; and home days loss, 0.781; 95% CI, 0.764-0.798; C statistics for GBM: mortality, 0.727; 95% CI, 0.708-0.747; HF hospitalization, 0.745; 95% CI, 0.718-0.772; high cost, 0.733; 95% CI, 0.703-0.763; and home days loss, 0.790; 95% CI, 0.773-0.807). Higher AUPRCs were obtained for claims + EMR vs claims-only GBMs predicting mortality (0.484 vs 0.423), HF hospitalization (0.413 vs 0.403), and home time loss (0.575 vs 0.521) but not cost (0.249 vs 0.252). The net benefit for claims + EMR vs claims-only GBMs was higher at various threshold probabilities for mortality and home time loss outcomes but similar for the other 2 outcomes.

    Conclusions and Relevance  Machine learning methods offered only limited improvement over traditional logistic regression in predicting key HF outcomes. Inclusion of additional predictors from EMRs to claims-based models appeared to improve prediction for some, but not all, outcomes.

    Introduction

    With aging of the global population, heart failure (HF) is being recognized as an increasing clinical and public health problem associated with significant mortality, morbidity, and health care expenditures, particularly among patients aged 65 years and older.1 Heart failure is estimated to contribute to 1 in every 8 deaths in the United States.2 Despite progress in reducing HF-related mortality through therapeutic development, hospitalizations for HF remain frequent.3 Total costs of care related to the treatment and management of HF in the United States were estimated to be $31 billion in 2012, with more than two-thirds attributable to direct medical costs.2 A need for optimizing treatment and improving outcomes has led to a large field of predictive modeling in HF. In a systematic review, Rahimi et al4 identified a total of 64 different models predicting either mortality or hospitalizations in patients with HF. Although these models differ substantially in terms of the target populations (eg, inpatients or outpatients, reduced or preserved ejection fraction [EF], or younger or older ages) and prediction risk window (eg, 30-day mortality risk, 1-year mortality risk), they share the ultimate objective of facilitating risk stratification of patients with HF and are noted to have variable success rates with discrimination indices in the range of 0.60 to 0.89.4

    There are several shortcomings of the currently available risk prediction models for HF. First, most previous models were developed using traditional statistical approaches, such as regression modeling, and newer alternatives, such as machine learning–based prediction models, have remained underused.5 Second, most models were developed to contain only a small number of important predictors that clinicians can easily access or order to compute a risk score at the bedside and determine the appropriate treatment course for a particular patient. As a result, these models have limited utility to inform population-level interventions because policy makers (eg, a large insurer) may not have the ability to obtain additional information on top of health care data routinely collected through insurance claims or electronic medical records (EMRs) for enrolled patients. Third, predictive models for outcomes that are important from the payers’ perspective (high cost)6 and from the patients’ perspective (loss in home time)7,8 have not received as much attention as mortality and hospitalization. To address these limitations of previously proposed models, we undertook this investigation with the primary objective of comparing several machine learning approaches with traditional logistic regression for development of predictive models for all-cause mortality, HF hospitalization, high cost, and loss in home time in patients with HF. Medicare claims data linked to EMRs from 2 large academic health care provider networks (“providers” includes physicians, clinicians, other health care professionals, and their institutions that compose the networks) in Boston, Massachusetts, were used to evaluate the added value of augmenting claims-only predictive models with EMR-derived information.

    Methods
    Data Source

    In this prognostic study, we used 2007-2014 Medicare claims data from Parts A (inpatient coverage), B (outpatient coverage), and D (prescription benefits) that were linked deterministically by beneficiary numbers, date of birth, and sex (linkage success rate, 99.2%) with EMRs for 2 large health care provider networks in the Boston metropolitan area. We identified patients between 2007 and 2013 and used data from January 1, 2007, to December 31, 2014, for outcome assessment. Data from the network with a larger sample size were used for model development (training set), and data from the second network were used for model validation (testing set). The Medicare claims data contain information on demographic characteristics (age, sex, and race/ethnicity), enrollment start and end dates, dispensed medications and performed procedures, and medical diagnosis codes.9 Data not recorded in claims were extracted from the EMR, including laboratory test results and free-text information from patient medical records. A signed data use agreement with the Centers for Medicare & Medicaid Services was available, and the Brigham and Women’s Hospital’s Institutional Review Board approved this study with waiver of individual patient consent based on secondary analysis of existing data that did not require patient recontact or intervention. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline for prediction model development and validation.

    Study Design

    We identified a cohort of patients aged 65 years or older with HF from Medicare fee-for-service claims using International Classification of Diseases, Ninth Revision, codes (listed in eTable 1 in the Supplement)10 after at least 180 days of continuous enrollment in fee-for-service Medicare Parts A, B, and D between 2007 and 2013 and at least 1 recorded EF value in EMRs within 30 days on either side of the claims-based HF diagnosis date. This claims-based HF diagnosis date was defined as the cohort entry date, and a previous 180-day period was defined as the baseline period. After the cohort entry date, patients were followed up for outcomes of interest for 365 days with censoring on Medicare disenrollment or mortality. eFigure 1 in the Supplement summarizes the study design.

    Outcomes

    We focused on 4 key outcomes of interest in HF. First, all-cause mortality within 365 days of the cohort entry date was identified based on information recorded in Medicare claims. Second, HF hospitalization was identified using Medicare inpatient claims (Part A) based on a primary discharge diagnosis of HF within 365 days of the cohort entry date. Third, total costs for all causes were identified from Medicare Parts A, B, and D, including hospitalization, outpatient, and medication costs, within 365 days of the cohort entry date. To account for variable follow-up owing to early mortality in some patients, monthly costs were estimated by dividing total costs by total months of follow-up. Based on the resulting distribution of average monthly cost per patient, membership in the highest cost decile was identified. Fourth, we created a binary variable indicating a loss of 25% or more of home time vs less than 25%, quantified by subtracting days spent in hospitals and nursing homes from the total follow-up time (ie, between the cohort entry date and the last date of follow-up) to determine the number of days patients spent at home. This measure has been shown to correlate well with patients’ functional status. In a prior study, patients who lost 15 days or more at home had a 3- to 5-fold higher incidence of patient-centered outcomes, including poor self-rated health and mobility impairment.7 In another study, reduced home time over 1 year after HF hospitalization was closely correlated with traditional time-to-event mortality and hospitalization outcomes.8
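    The derived outcomes above (average monthly cost, top cost decile, and the 25% home time loss threshold) can be sketched in Python. This is an illustrative reconstruction, not the authors' code (which was written in R); all column names (follow_up_days, hospital_days, nursing_home_days, total_cost, follow_up_months) are hypothetical.

```python
import pandas as pd

def derive_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative derivation of the high-cost and home time loss outcomes."""
    out = pd.DataFrame(index=df.index)
    # Home time: follow-up days minus days spent in hospitals and nursing homes
    home_days = df["follow_up_days"] - df["hospital_days"] - df["nursing_home_days"]
    # Binary outcome: 25% or more of follow-up days lost from home
    out["home_time_loss"] = (1 - home_days / df["follow_up_days"]) >= 0.25
    # Average monthly cost, accounting for variable follow-up from early mortality
    monthly_cost = df["total_cost"] / df["follow_up_months"]
    # Membership in the highest decile of the average monthly cost distribution
    out["high_cost"] = monthly_cost >= monthly_cost.quantile(0.9)
    return out
```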

    Predictors

    Based on a review of existing literature to identify factors associated with prognosis of HF,4,11 we selected a total of 54 variables from Medicare claims, including demographic characteristics (age, sex, and race/ethnicity), HF-related variables (specific International Classification of Diseases, Ninth Revision, codes indicating systolic, diastolic, left, rheumatic, hypertensive, or unspecified HF, number of HF hospitalizations, site of recorded HF diagnosis at study entry [inpatient or outpatient], history of implantable cardioverter-defibrillator, cardiac resynchronization therapy, or left-ventricular assist device), HF-related medication use, comorbid conditions, and 2 composite scores (claims-based frailty index12 and claims-based socioeconomic status index13,14). eTable 2 in the Supplement contains the full list of variables and operational definitions.

    Eight additional variables were extracted from EMRs based on the most proximal recorded value to the cohort entry date during the baseline period, including serum sodium, serum potassium, serum urea nitrogen, serum creatinine, and B-type natriuretic peptide levels; left-ventricular EF value; EF classification (<40% considered reduced; 40%-49%, moderately reduced; or ≥50%, preserved); and body mass index class (<18 considered underweight; 18-25, healthy; 26-29, overweight; ≥30, obese; or missing [calculated as weight in kilograms divided by height in meters squared]). For laboratory results (serum sodium, serum potassium, serum urea nitrogen, and serum creatinine levels), missing values were observed in the range of 5% to 25%. Therefore, we used a multiple imputation procedure with an expectation-maximization algorithm for maximum likelihood parameter estimation based on all other predictors and outcomes, separately within the training and testing data sets. Imputation is widely considered to be beneficial for handling missing data in predictive models, and inclusion of outcomes for imputation is recommended.15,16 For B-type natriuretic peptide level and body mass index, in which the proportion of missing data was substantially higher (54% and 69%, respectively), we did not consider multiple imputation to be feasible and instead opted for missing indicator categories. Because we required the recording of an EF as a cohort entry criterion, there were no missing values for this variable. Distributions of all predictor variables were reported for training and testing data separately. Standardized differences17 for all variables between training and testing data were reported, in which absolute values greater than 10 may be suggestive of an important difference in distribution of a particular variable between these populations.
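    The standardized differences reported here (on the percentage scale, where values above 10 flag potentially important imbalance) follow Austin's formulation.17 A minimal sketch, not the authors' implementation:

```python
import numpy as np

def std_diff_binary(p1: float, p2: float) -> float:
    """Standardized difference (x100, percentage scale) for a binary
    variable, given its prevalence p1 and p2 in the two groups."""
    pooled_sd = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return 100 * (p1 - p2) / pooled_sd

def std_diff_continuous(x1, x2) -> float:
    """Standardized difference (x100) for a continuous variable."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    pooled_sd = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    return 100 * (x1.mean() - x2.mean()) / pooled_sd
```

    For example, the male proportions reported in the Results (45.5% in training vs 43.8% in testing) yield a standardized difference of about 3.4, matching the value quoted there.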

    Machine Learning Modeling Approaches

    In addition to the traditional multivariable logistic regression model, we constructed predictive models for each outcome using the following machine learning approaches in the training data. All models were constructed in 2 phases: first using only claims-based predictors, and second adding EMR-based variables to the claims-based predictors.

    Least Absolute Shrinkage and Selection Operator

    The least absolute shrinkage and selection operator (LASSO) is a regularized regression approach that incorporates a penalty to the log-likelihood function with the goal of shrinking imprecise coefficients toward 0. We used 10-fold cross-validation to select the value of the penalty parameter in a way that minimized the model deviance. The LASSO model offers several key advantages, including consistency in identifying the true underlying model18,19 and effective handling of multicollinearity.20
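    The article's models were fit in R; as a rough scikit-learn analogue of the approach above, an L1-penalized logistic regression with 10-fold cross-validation over the penalty grid can be sketched as follows (synthetic data stands in for the claims predictors; all settings are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the 54 claims-based predictors
X, y = make_classification(n_samples=500, n_features=54, n_informative=10,
                           random_state=0)
lasso = LogisticRegressionCV(
    Cs=10,                   # grid of inverse penalty strengths
    cv=10,                   # 10-fold cross-validation, as in the text
    penalty="l1",
    solver="saga",           # solver that supports the L1 penalty
    scoring="neg_log_loss",  # selects the penalty minimizing model deviance
    max_iter=5000,
    random_state=0,
).fit(X, y)
# Imprecise coefficients are shrunk exactly to 0, performing variable selection
n_selected = int(np.sum(lasso.coef_ != 0))
```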

    Classification and Regression Tree

    Classification and regression tree (CART) analysis is a nonparametric approach that uses a decision tree framework to progressively segregate values of predictors in binary splits. Every value of the predictor variable is evaluated as a potential split, and the optimal split is determined by the gain in information (decrease in entropy). We implemented CART in a conditional inference framework, whereby stopping criteria were applied based on multiple testing procedures to obviate the need for subjective pruning and control overfitting.21
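    The conditional inference framework cited above corresponds to R's ctree implementation;21 scikit-learn's tree uses impurity-based splitting without significance-based stopping, so this hedged sketch approximates the stopping rules with explicit depth and leaf-size limits (all settings illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cart = DecisionTreeClassifier(
    criterion="entropy",   # split on information gain (decrease in entropy)
    max_depth=4,           # explicit limits stand in for the multiple-testing
    min_samples_leaf=50,   # stopping rules of conditional inference trees
    random_state=0,
).fit(X, y)
```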

    Random Forests

    Random forest is a supervised ensemble learning method that builds many decision trees to predict the outcome of interest. We constructed a forest consisting of 500 individual trees. A key advantage of random forests is that, as long as a reasonably large number of trees is constructed, the forest does not require extensive tuning. Random forest error rates are largely insensitive to the number of features selected to split each node.22 Therefore, we used the default of a random sample of √n predictors at each node, where n is the total number of predictors under consideration. The predicted probability was derived based on average prediction across all of the trees.
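    A scikit-learn sketch of the forest described above, with 500 trees and the default random sample of √n predictors at each node (synthetic data; not the authors' R implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=54, random_state=0)
rf = RandomForestClassifier(
    n_estimators=500,     # 500 individual trees, as in the text
    max_features="sqrt",  # random sample of sqrt(n) predictors at each node
    random_state=0,
).fit(X, y)
# Predicted probability is the average prediction across all trees
proba = rf.predict_proba(X)[:, 1]
```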

    Gradient-Boosted Model

    Gradient-boosted model (GBM) is another tree-based ensemble learning method in which a series of weak classifiers is sequentially constructed and combined, each time aiming to correct errors made in the prediction by the previous classifier, to form a strong learner. We selected a low learning rate (0.01) and interaction depth of 4, as these parameters are noted to have robust performance across a variety of scenarios.23 We evaluated a maximum of 10 000 iterations; to tune the optimal number of iterations, we used 10-fold cross-validation.
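    A scikit-learn sketch of the tuning strategy described above: learning rate (0.01) and interaction depth (4) held fixed while the number of boosting iterations is chosen by 10-fold cross-validation. The article evaluated up to 10 000 iterations; a short illustrative grid on synthetic data is used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
best_n, best_score = None, -np.inf
for n_trees in (50, 100, 200):  # illustrative grid of iteration counts
    gbm = GradientBoostingClassifier(learning_rate=0.01, max_depth=4,
                                     n_estimators=n_trees, random_state=0)
    # 10-fold cross-validated deviance (negated log loss; higher is better)
    score = cross_val_score(gbm, X, y, cv=10, scoring="neg_log_loss").mean()
    if score > best_score:
        best_n, best_score = n_trees, score
# Refit on all training data at the tuned number of iterations
final_gbm = GradientBoostingClassifier(learning_rate=0.01, max_depth=4,
                                       n_estimators=best_n, random_state=0).fit(X, y)
```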

    Statistical Analysis
    Performance Evaluation of Candidate Approaches

    Performance of all modeling approaches was evaluated using the following parameters in the testing data: (1) Brier score,24 a quadratic scoring rule in which the squared differences between actual binary outcomes and predicted probabilities are calculated, with lower values indicating higher overall accuracy; (2) area under the receiver operating characteristic curve; and (3) calibration plots, characterized by visual inspection and by reporting the intercept and slope.24 Departure of the intercept from 0 indicates the extent to which predictions systematically underpredict or overpredict the probability of the event of interest. Departure of the slope from 1 indicates that predicted and observed probabilities deviate from the perfect prediction line of 45°. We compared the areas under the receiver operating characteristic curves for machine learning models with the logistic regression model using the 2-sided DeLong test at a significance level of .05.25
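    The three metrics above can be sketched in Python. This is an illustrative analogue, not the authors' R code; the calibration intercept and slope are estimated here by refitting the outcome on the logit of the predictions, a common simplification (calibration-in-the-large is conventionally estimated with the slope fixed at 1).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

def evaluate(y_true, p_hat):
    """Brier score, AUC, and calibration intercept/slope for predicted
    probabilities p_hat (values strictly between 0 and 1)."""
    logit = np.log(p_hat / (1 - p_hat)).reshape(-1, 1)
    # Logistic recalibration: a very large C makes sklearn's regularized
    # fit approximate an unpenalized logistic regression.
    recal = LogisticRegression(C=1e10).fit(logit, y_true)
    return {
        "brier": brier_score_loss(y_true, p_hat),  # lower = higher accuracy
        "auc": roc_auc_score(y_true, p_hat),
        "calibration_intercept": float(recal.intercept_[0]),  # ideal: 0
        "calibration_slope": float(recal.coef_[0, 0]),        # ideal: 1
    }
```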

    Performance Evaluation of the Selected Approach

    After selecting the most efficient statistical modeling approach based on the metrics outlined above for each outcome, we provided additional performance characteristics for the selected approach to focus on the differences between claims-only and claims + EMR versions of those models. First, we constructed precision-recall curves,26 which provide insight into what proportion of true cases the algorithm can identify (sensitivity) and with what level of accuracy (positive predictive value) at different probability cutoffs. Next, we constructed decision curves27 to summarize the comparative utility of the claims-only and claims + EMR versions of those models in selecting a patient population for intervention in terms of net benefit, defined as the net increase in the number of true-positive cases identified without an increase in the number of false-positive results at various threshold probability values. In addition, we reported observed probability of events by predicted risk deciles and the 10 most influential predictors for all 4 outcomes selected by these models. Finally, we characterized performance across a variety of subgroups, including HF type (reduced EF, midrange EF, and preserved EF), sex, age (65-74 and ≥75 years), and source of HF diagnosis at study cohort entry (inpatient or outpatient), based on the area under the receiver operating characteristic curve.
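    The net benefit quantity used in the decision curves follows the standard decision-curve-analysis definition;27 a minimal sketch:

```python
import numpy as np

def net_benefit(y_true, p_hat, threshold: float) -> float:
    """Net benefit at a given threshold probability pt, as defined in
    decision curve analysis: (TP - FP * pt / (1 - pt)) / N."""
    y_true = np.asarray(y_true)
    treat = np.asarray(p_hat) >= threshold  # flagged for intervention
    tp = np.sum(treat & (y_true == 1))      # true-positive count
    fp = np.sum(treat & (y_true == 0))      # false-positive count
    return (tp - fp * threshold / (1 - threshold)) / len(y_true)
```

    Evaluating this function for the claims-only and claims + EMR probability vectors over a grid of thresholds traces the two decision curves being compared.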

    Data analysis was conducted from January 1 to December 31, 2018. All models were developed in R software, version 3.4.3 (R Project for Statistical Computing). Codes for implementation of these models are publicly available (http://www.drugepi.org/dope-downloads/).

    Results
    Study Cohort

    We included a total of 9502 patients aged 65 years or older in this study with at least 1 HF diagnosis and a recorded measurement of EF within 1 month of the HF diagnosis date; 6113 of these patients were included in the training set and 3389 were used as the testing set. Table 1 summarizes baseline characteristics of these patients. The mean (SD) age was 78 (8) years, 2779 were men (45.5%), and 5571 were white (91.1%) in the training data set; the mean (SD) age was 77 (8) years, 1486 were men (43.8%), and 2853 were white (84.2%) in the testing data set (standardized differences between training and testing data sets of 12.5 for age, 3.4 for male sex, and 21.1 for white race). Distribution of left-ventricular EF–based HF class was similar between training and testing data sets, with 73.8% and 73.7% of patients having preserved EF in the training and testing data sets, respectively (standardized difference, 0.2). Mortality incidence was 20.6% (n = 1259) in the training set and 22.6% (n = 766) in the testing set. Congestive HF hospitalization was observed in 11.3% (n = 693) and 11.4% (n = 387), whereas home time loss of 25% of days or higher during follow-up was observed in 24.0% (n = 1467) and 23.9% (n = 810) in the training and testing sets, respectively.

    Comparison of Modeling Approaches

    Of the 5 candidate modeling approaches, GBM consistently provided the highest discrimination and lowest Brier scores across all 4 outcomes, closely followed by random forests and LASSO (Table 2). Absolute differences in area under the receiver operating characteristic curves between logistic regression and other models were small when using claims-only predictors for all outcomes except HF hospitalization (Table 2; eFigure 2 in the Supplement). C statistics for logistic regression using claims-only predictors were as follows: mortality, 0.724 (95% CI, 0.705-0.744); HF hospitalization, 0.707 (95% CI, 0.676-0.737); high cost, 0.734 (95% CI, 0.703-0.764); and home days loss, 0.781 (95% CI, 0.764-0.798). The C statistics for GBM using claims-only predictors were as follows: mortality, 0.727 (95% CI, 0.708-0.747); HF hospitalization, 0.745 (95% CI, 0.718-0.772); high cost, 0.733 (95% CI, 0.703-0.763); and home days loss, 0.790 (95% CI, 0.773-0.807). The CART model was consistently outperformed by all other approaches. Improvements were noted in accuracy and discrimination for all models when EMR-based predictors were added to claims-only predictors for mortality, HF hospitalization, and home time loss outcomes but not for the cost outcome.

    Visual inspection of the calibration plots (eFigures 3-10 in the Supplement) indicated that GBM was generally well calibrated, with slopes closer to 1 and intercepts closer to 0 across all outcomes. Calibration with GBM was better at the highest-risk strata for the high cost outcome, for which logistic regression had poor calibration (eFigure 7 and eFigure 8 in the Supplement). Based on these observations, we selected GBM as the most consistent modeling approach of the 5 approaches evaluated.
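
    The calibration slope and intercept referred to above can be estimated by logistically regressing the observed outcome on the logit of the predicted probability. A sketch on simulated, perfectly calibrated predictions (an assumption for illustration; this is not the authors' implementation) shows the expected slope near 1 and intercept near 0:

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def calibration_slope_intercept(y, p):
        """Logistic recalibration: regress outcome on logit(p). A slope near 1
        and an intercept near 0 indicate good calibration."""
        logit = np.log(p / (1 - p)).reshape(-1, 1)
        lr = LogisticRegression(C=1e12, max_iter=1000).fit(logit, y)  # ~unpenalized
        return lr.coef_[0, 0], lr.intercept_[0]

    rng = np.random.default_rng(0)
    p = rng.uniform(0.05, 0.95, 20_000)  # predicted risks
    y = rng.binomial(1, p)               # outcomes drawn at exactly those risks
    slope, intercept = calibration_slope_intercept(y, p)
    print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
    ```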

    Further Evaluation of GBM

    Higher areas under the precision-recall curve were obtained for claims + EMR vs claims-only GBMs predicting mortality (0.484 vs 0.423), HF hospitalization (0.413 vs 0.403), and home time loss (0.575 vs 0.521) but not cost (0.249 vs 0.252) (eFigure 11 in the Supplement). For the mortality and home time loss outcomes, the observed probability was higher in the highest-risk strata when using claims + EMR predictors (Figure 1). In line with this observation, the decision curve analysis also suggested that the net benefit of using claims + EMR predictors was higher than that of using the claims-only set at various threshold probability values for the mortality and home time loss outcomes but similar for the other 2 outcomes (eFigure 12 in the Supplement).
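
    Unlike the C statistic, the area under the precision-recall curve (average precision) is anchored to event prevalence, which is why it is informative for the imbalanced outcomes studied here (reference 26). A small illustration with a roughly 20% event rate (synthetic scores, not study data) shows that an uninformative score hovers near the prevalence while an informative one rises above it:

    ```python
    import numpy as np
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(1)
    y = rng.binomial(1, 0.2, 5000)  # ~20% event rate, similar to mortality here

    # A score that separates events from non-events vs a pure-noise score
    informative = np.where(y == 1, rng.normal(1.0, 1.0, y.size),
                           rng.normal(0.0, 1.0, y.size))
    noise = rng.normal(size=y.size)

    ap_informative = average_precision_score(y, informative)
    ap_noise = average_precision_score(y, noise)
    # The noise score's average precision stays close to the prevalence (~0.2)
    print(f"informative: {ap_informative:.3f}, noise: {ap_noise:.3f}")
    ```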

    Most Influential Predictors

    Figure 2 contains the 10 most influential predictors selected by GBM from claims-only and claims + EMR sets for all outcomes. Age and frailty score were selected in all models with relative influence (RI) in the range of 2.9 to 12.2 for age and 3.5 to 31.5 for frailty score across various models. EMR-based predictors that were selected in all models included serum urea nitrogen (RI range, 3.4-12.2), serum creatinine (RI range, 3.2-5.9), and serum potassium (RI range, 2.5-4.7) levels. For HF hospitalization and high cost outcomes, history of HF hospitalizations (RI, 31.2 in claims + EMR model; 38.5 in claims-only model) and prior cost decile (RI, 45.8 in claims + EMR model; 50.6 in claims-only model), respectively, were the most influential predictors.
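
    Relative influence rankings of this kind can be sketched with scikit-learn's gradient boosting, whose impurity-based feature importances, scaled to sum to 100, are analogous to the RI scale in Figure 2. The features below are synthetic placeholders, not the study's claims or EMR predictors:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=12, n_informative=4,
                               random_state=0)
    gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

    ri = 100 * gbm.feature_importances_  # scaled to sum to 100, like the RI scale
    for i in np.argsort(ri)[::-1][:10]:  # 10 most influential predictors
        print(f"x{i}: RI = {ri[i]:.1f}")
    ```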

    Model Performance in Subgroups

    Model performance was generally equivalent across subgroups, including HF type (reduced EF and preserved EF) and sex. However, discrimination was lower for patients with midrange EF, inpatients, and patients aged 75 years or older for all outcomes (Table 3). For instance, the claims-only model for mortality had discrimination of 0.702 for patients aged 75 years or older and 0.761 for patients younger than 75 years.

    Discussion

    In this study, we constructed predictive models for 4 important outcomes in HF using routinely collected health care data from insurance claims and EMRs of 1 health care provider network followed by an independent validation using data from a second network. We observed that machine learning methods, including tree-based ensemble approaches and penalized regression, offered only limited improvement over the widely used logistic regression. Although augmenting claims data with detailed EMR-derived predictors resulted in notable improvement in model performance for certain outcomes, including mortality and home days loss, such improvement was not seen for prediction of high future costs.

    Our study adds to a growing body of literature indicating limited performance improvement with machine learning approaches over logistic regression for clinical risk prediction problems and additionally offers several insights. In a large meta-analysis of 71 studies, Christodoulou et al28 found no evidence supporting the hypothesis that clinical prediction models based on machine learning have improved discrimination. For HF specifically, Frizzell et al5 concluded that use of a number of machine learning approaches did not improve prediction of 30-day readmissions compared with logistic regression. In our study, we observed that when using only claims-based predictors, many of which are binary variables indicating presence or absence of medical conditions or use of specific medications, the performance improvement with machine learning approaches was minimal for prediction of most outcomes. However, when the predictor set was expanded to include EMR-based information, which included numerous laboratory test results as continuous variables, we noted that machine learning approaches generally fared better than logistic regression. This observation follows the intuition that, because tree-based machine learning approaches, such as GBM or random forests, are nonparametric and do not assume linearity for a predictor-outcome association, they are usually more adept at generating predictions based on continuous variables. Furthermore, we observed that meaningful improvement in prediction of certain health care use type outcomes, such as high cost, may be more difficult to achieve even with the addition of more granular EMR-based predictors.

    In addition to these methodologic insights, the models constructed and validated in this study may be important from an applications standpoint. Their primary intended use is risk stratification with respect to key patient-level outcomes using routinely collected health care data, in order to identify high-risk target populations for effectively deploying population-based interventions. For instance, an insurer interested in deploying interventions, such as home nurse visits, to ensure optimal HF management and downstream cost savings could use models from this study to identify, from administrative data, a population with a high 1-year risk of HF hospitalization, helping ensure the most efficient use of finite resources.

    Strengths and Limitations

    This study has several key strengths. First, we reported discrimination as well as calibration of the models from an independent validation sample. Furthermore, our models included 2 key predictors that were not used in previous models—a frailty score12 and a composite score as a proxy for socioeconomic status13—both of which appeared to improve prediction meaningfully independent of other variables. In addition, we studied 4 different outcomes and were able to generate insights based on model performance in predicting each of these outcomes.

    There are some limitations of the study. First, our data source contained patients from only 1 geographic region of the United States, which limits generalizability and requires validation in other populations. Second, our model validation was conducted only with concurrent patients in different health care provider networks without additional prospective validation within the same provider networks. Third, we used only structured and curated predictor variables in our machine learning approaches; future research is required to test the improvement in prediction of HF-specific outcomes offered by machine learning approaches that are able to mine unstructured information, such as clinicians’ free-text notes.29 Fourth, we used only a subset of machine learning approaches and, therefore, cannot comment on performance of approaches that were not evaluated herein, such as neural networks and support vector machines. In addition, we focused on administrative claims–based prediction and augmented claims data with select EMR-based variables. We did not evaluate model performance based on EMR data alone, which is an important limitation because such a model could be useful for clinicians as they weigh various care options for patients during medical visits.

    Conclusions

    Machine learning methods offered limited improvement over logistic regression in predicting key outcomes in HF based on administrative claims. Inclusion of additional clinical parameters from EMRs improved prediction for some, but not all, outcomes. Models constructed in this report may be helpful in identifying a high-risk target population for deploying population-based interventions.

    Article Information

    Accepted for Publication: November 14, 2019.

    Published: January 10, 2020. doi:10.1001/jamanetworkopen.2019.18962

    Open Access: This is an open access article distributed under the terms of the CC-BY-NC-ND License. © 2020 Desai RJ et al. JAMA Network Open.

    Corresponding Author: Rishi J. Desai, MS, PhD, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 1620 Tremont St, Ste 3030-R, Boston, MA 02120 (rdesai@bwh.harvard.edu).

    Author Contributions: Dr Desai had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

    Concept and design: Desai, Evers, Schneeweiss.

    Acquisition, analysis, or interpretation of data: All authors.

    Drafting of the manuscript: Desai.

    Critical revision of the manuscript for important intellectual content: All authors.

    Statistical analysis: Desai.

    Obtained funding: Desai, Evers, Schneeweiss.

    Supervision: Wang, Evers, Schneeweiss.

    Conflict of Interest Disclosures: Dr Wang reported receiving grants from Bayer during the conduct of the study, and receiving grants from Novartis, Johnson & Johnson, and Boehringer Ingelheim outside the submitted work. Dr Vaduganathan reported receiving grants from the KL2/Catalyst Medical Research Investigator Training award from Harvard Catalyst and serving on paid advisory boards for Amgen, AstraZeneca, Baxter Healthcare, Bayer AG, Boehringer Ingelheim, and Relypsa outside the submitted work. Dr Evers reported receiving personal fees from Bayer AG during the conduct of the study. Dr Schneeweiss reported receiving grants from Bayer, Boehringer Ingelheim, and Genentech during the conduct of the study; and receiving personal fees from WHISCON LLC and Aetion Co outside the submitted work. No other disclosures were reported.

    Funding/Support: This study was supported by an investigator-initiated research grant from Bayer AG.

    Role of the Funder/Sponsor: The study was conducted by the authors independent of the sponsor. Dr Evers, who is employed by Bayer, participated in preparation and review of the manuscript but had no role in the decision to submit the manuscript for publication. Funders had no role in design and conduct of the study or in collection, management, analysis, and interpretation of the data.

    References
    1.
    Mozaffarian D, Benjamin EJ, Go AS, et al; Writing Group Members; American Heart Association Statistics Committee; Stroke Statistics Subcommittee. Executive summary: heart disease and stroke statistics—2016 update: a report from the American Heart Association. Circulation. 2016;133(4):447-454. doi:10.1161/CIR.0000000000000366
    2.
    Benjamin EJ, Virani SS, Callaway CW, et al; American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee. Heart disease and stroke statistics—2018 update: a report from the American Heart Association. Circulation. 2018;137(12):e67-e492. doi:10.1161/CIR.0000000000000558
    3.
    Blecker S, Paul M, Taksler G, Ogedegbe G, Katz S. Heart failure–associated hospitalizations in the United States. J Am Coll Cardiol. 2013;61(12):1259-1267. doi:10.1016/j.jacc.2012.12.038
    4.
    Rahimi K, Bennett D, Conrad N, et al. Risk prediction in patients with heart failure: a systematic review and analysis. JACC Heart Fail. 2014;2(5):440-446. doi:10.1016/j.jchf.2014.04.008
    5.
    Frizzell JD, Liang L, Schulte PJ, et al. Prediction of 30-day all-cause readmissions in patients hospitalized for heart failure: comparison of machine learning and other statistical approaches. JAMA Cardiol. 2017;2(2):204-209. doi:10.1001/jamacardio.2016.3956
    6.
    Greiner MA, Hammill BG, Fonarow GC, et al. Predicting costs among Medicare beneficiaries with heart failure. Am J Cardiol. 2012;109(5):705-711. doi:10.1016/j.amjcard.2011.10.031
    7.
    Lee H, Shi SM, Kim DH. Home time as a patient-centered outcome in administrative claims data. J Am Geriatr Soc. 2019;67(2):347-351. doi:10.1111/jgs.15705
    8.
    Greene SJ, O’Brien EC, Mentz RJ, et al. Home-time after discharge among patients hospitalized with heart failure. J Am Coll Cardiol. 2018;71(23):2643-2652. doi:10.1016/j.jacc.2018.03.517
    9.
    Hennessy S. Use of health care databases in pharmacoepidemiology. Basic Clin Pharmacol Toxicol. 2006;98(3):311-313. doi:10.1111/j.1742-7843.2006.pto_368.x
    10.
    McCormick N, Lacaille D, Bhole V, Avina-Zubieta JA. Validity of heart failure diagnoses in administrative databases: a systematic review and meta-analysis. PLoS One. 2014;9(8):e104519. doi:10.1371/journal.pone.0104519
    11.
    Ouwerkerk W, Voors AA, Zwinderman AH. Factors influencing the predictive power of models for predicting mortality and/or heart failure hospitalization in patients with heart failure. JACC Heart Fail. 2014;2(5):429-436. doi:10.1016/j.jchf.2014.04.006
    12.
    Kim DH, Schneeweiss S, Glynn RJ, Lipsitz LA, Rockwood K, Avorn J. Measuring frailty in Medicare data: development and validation of a claims-based frailty index. J Gerontol A Biol Sci Med Sci. 2018;73(7):980-987. doi:10.1093/gerona/glx229
    13.
    Bonito A, Bann C, Eicheldinger C, Carpenter L. Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries: Final Report, Sub-Task 2. Rockville, MD: Agency for Healthcare Research and Quality; January 2008. AHRQ publication 08-0029-EF.
    14.
    Gopalakrishnan C, Gagne JJ, Sarpatwari A, et al. Evaluation of socioeconomic status indicators for confounding adjustment in observational studies of medication use. Clin Pharmacol Ther. 2019;105(6):1513-1521. doi:10.1002/cpt.1348
    15.
    Steyerberg EW, van Veen M. Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol. 2007;60(9):979. doi:10.1016/j.jclinepi.2007.03.003
    16.
    Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393. doi:10.1136/bmj.b2393
    17.
    Austin PC. Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Commun Stat Simul Comput. 2009;38(6):1228-1234. doi:10.1080/03610910902859574
    18.
    Chand S. On tuning parameter selection of LASSO-type methods—a Monte Carlo study. Paper presented at: Applied Sciences and Technology (IBCAST) 2012 9th International Bhurban Conference; January 9-12, 2012; Islamabad, Pakistan. https://ieeexplore.ieee.org/document/6177542. Accessed January 31, 2018.
    19.
    Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. J Am Stat Assoc. 2010;105(489):312-323. doi:10.1198/jasa.2009.tm08013
    20.
    Oyeyemi GM, Ogunjobi EO, Folorunsho AI. On performance of shrinkage methods—a Monte Carlo study. Int J Stat Appl. 2015;5(2):72-76. doi:10.5923/j.statistics.20150502.04
    21.
    Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674. doi:10.1198/106186006X133933
    22.
    Breiman L. Random forests. Mach Learn. 2001;45(1):5-32. doi:10.1023/A:1010933404324
    23.
    Hastie T, Tibshirani R, Friedman J. Boosting and additive trees. In: The Elements of Statistical Learning. New York, NY: Springer; 2009:337-387. doi:10.1007/978-0-387-84858-7_10
    24.
    Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. doi:10.1097/EDE.0b013e3181c30fb2
    25.
    DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. doi:10.2307/2531595
    26.
    Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432
    27.
    Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565-574. doi:10.1177/0272989X06295361
    28.
    Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22. doi:10.1016/j.jclinepi.2019.02.004
    29.
    Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):18. doi:10.1038/s41746-018-0029-1