Development, Validation, and Evaluation of a Simple Machine Learning Model to Predict Cirrhosis Mortality

This cohort study compares different machine learning methods in predicting overall mortality in cirrhosis and uses machine learning to select easily scored clinical variables for a novel prognostic model in patients with cirrhosis.


eMethods 2. Imputation Approach
To impute the missing data, we used a non-parametric machine learning based imputation strategy, MissForest, which may have better accuracy than other imputation strategies such as multiple imputation (e.g., multiple imputation through chained equations). 1,2 MissForest predicts missing values through a series of non-parametric random forest ensembles. Briefly, the algorithm first makes an initial prediction of each missing value with a random forest fit (the first imputation iteration). Then, for each predictor (y) containing missing values, a random forest is trained on the completely observed portion of the data matrix, with the observed values of y as the response. The missing values of y are then predicted from the just-trained random forest, and this prediction is compared with the previous iteration's prediction. This process of training a new random forest on the observed data plus the current predictions for the missing data, and then re-predicting the truly missing values, is repeated until convergence. Convergence is assumed when the normalized root-mean-square error (NRMSE) of the predictions is minimized; when a new random forest prediction begins to increase the NRMSE, the algorithm stops and the predictions from the last random forest model are used to impute the missing values.
One advantage of the MissForest approach is that it yields a single completed dataset, with one prediction for each missing value. This makes MissForest more flexible for downstream machine learning methods like gradient boosting. In contrast, multiple imputation methods produce several imputed datasets, so analyses must be run on each dataset and the between-imputation variance must be accounted for. To the authors' knowledge, how to account for between-imputation variance in non-parametric machine learning models like gradient boosting remains unclear.
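The single-dataset workflow described above can be sketched in scikit-learn, whose IterativeImputer with a random-forest estimator is a commonly used approximation of MissForest (this is an illustrative sketch on synthetic data, not the authors' code; variable names and sizes are arbitrary):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with ~10% values missing completely at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan

# IterativeImputer with a random-forest estimator approximates MissForest:
# each variable with missing values is regressed on the others, and the
# loop repeats until the imputations stabilize (or max_iter is reached).
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
assert not np.isnan(X_imputed).any()  # one completed dataset, no NaN left
```

The result is a single completed matrix that can be passed directly to a downstream learner such as a gradient boosting model.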

eMethods 3. Machine Learning Models
Gradient boosting creates a series of "boosted" decision trees, each a weak predictor, that combine to form stronger final predictions. The final model was allowed to train up to 1,000 trees; however, optimization occurred at 127 trees. Additionally, a learning rate of 0.1 and a maximum tree depth of 7 (i.e., up to 7-way interactions) were identified as optimal during training.
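A configuration like the one described can be sketched with scikit-learn's GradientBoostingClassifier; this is an illustration on synthetic stand-in data (the cohort data are not public), using early stopping so training can halt near an optimum rather than fitting all 1,000 trees:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the cohort; sizes and features are arbitrary.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hyperparameters reported in the text: learning rate 0.1, max depth 7
# (up to 7-way interactions), and up to 1,000 trees. Early stopping on a
# held-out validation fraction lets fitting stop near the optimum.
model = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=7,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
)
model.fit(X, y)
print(model.n_estimators_)  # number of trees actually fitted
```

On the authors' data this optimization settled at 127 trees; on other data the stopping point will differ.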
The LASSO performs variable selection by first evaluating the magnitude of each predictor's coefficient in a model that includes the full set of predictors, and then adding a penalty (λ) to the prediction equation equal to the absolute value of the smallest coefficient among all coefficients in the model at that stage of evaluation. 3,4 It then removes that variable and re-estimates the remaining coefficients for the next iteration of penalty evaluation. This iterative loop continues until removal of predictors starts to increase prediction error. The process selects the predictors that have the strongest influence on the outcome while removing predictors that contribute little to the prediction. This yields a parsimonious prediction model that is more likely to be unbiased when predicting future data. Evaluating each predictor among all available predictors is customarily called the FULL pathway evaluation of predictors, as we refer to it here. We can also further constrain the prediction model to evaluate only a maximum number of the most influential predictors (e.g., 10 or fewer) by starting the penalty process at the first iteration with a penalty (λ) equal to the absolute value of the coefficient of the 11th most strongly predictive variable, and then iterating through the remaining predictors after imposing that λ penalty. This results in a PARTIAL pathway, as the first LASSO iteration does not evaluate any predictors beyond the 10 most strongly influential.

ICD-9, CPT, and Drug Class Codes Used to Define the Advanced Liver Disease Cohort and Candidate Predictor Variables
Antibiotic: AM000, AM110-AM120, AM150, AM200, AM250, AM300, AM350, AM400, AM550, AM600, AM650, AM700
Anti-depressive: CN600, CN601, CN602, CN609
Betablocker: CV100
Diuretic: CV700-CV704, CV709

CirCom score uses a specific set of ICD-10 codes.
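The FULL-versus-PARTIAL penalty pathway described above can be illustrated with scikit-learn's lasso_path, which traces the coefficients over a decreasing sequence of penalties; the data here are synthetic and the 10-predictor cap is applied after the fact, as a sketch rather than the authors' implementation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Illustrative data only; the cohort's predictors are not public.
X, y = make_regression(n_samples=300, n_features=25, n_informative=8,
                       noise=5.0, random_state=0)

# lasso_path evaluates a decreasing sequence of penalties (alphas);
# as the penalty shrinks, more predictors enter with nonzero coefficients.
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
n_selected = (coefs != 0).sum(axis=0)  # predictors retained at each penalty

# A PARTIAL-style constraint: keep only penalty values at which at most
# 10 predictors are retained, mirroring the 10-variable cap above.
partial_alphas = alphas[n_selected <= 10]
print(len(partial_alphas))
```

Scanning the full path corresponds to the FULL pathway; restricting attention to penalties that retain at most 10 predictors corresponds to the PARTIAL pathway.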
We mapped these ICD-10 codes to ICD-9 codes to define the conditions included in CirCom (as shown in Supplementary Table 1). *CirCom: nonmetastatic cancer, metastatic cancer, hematologic cancer, substance abuse other than alcoholism, epilepsy, acute myocardial infarction, heart failure, peripheral arterial disease, chronic obstructive pulmonary disease, and chronic kidney disease were pulled using the most recent inpatient or outpatient diagnoses given in the 5 years before the index date. The CirCom score was calculated by the algorithm developed and validated by Jepsen et al. 20

©2020
We used the Agency for Healthcare Research and Quality Clinical Classifications Software (CCS) to define the conditions that were not included in the CirCom score (such as diabetes, depression, anxiety, and alcohol use).
Abbreviations: CirCom, cirrhosis-specific comorbidity score; HBV, hepatitis B virus; HCV, hepatitis C virus.
SI conversion factors: To convert albumin to g/L, multiply by 10.0; bilirubin to μmol/L, multiply by 17.104; creatinine to μmol/L, multiply by 88.4; hemoglobin to g/L, multiply by 10.0; platelet count to ×10^9/L, multiply by 1.0; sodium to mmol/L, multiply by 1.0.
Unless otherwise indicated, data are expressed as number (percentage) of patients. Owing to missing data, percentages may not total 100.

We removed the intercept term when fitting discrete time models, so each person-period has a coefficient associated with the hazard of death in that time period. Using the first year as an example, the estimated person-period coefficient for a patient would be a(j=1) = -0.02, and the resulting predicted risk is 1/(1 + exp(-a(j=1))) = 1/(1 + exp(0.02)) = 0.49. This probability of mortality in 1 year from cohort entry may seem quite high (~50/50 chance); yet, we note that many of the predictors are "protective" in the sense that they reduce risk to a value