Development and Validation of an Explainable Machine Learning Model for Major Complications After Cytoreductive Surgery

Key Points Question Can machine learning provide superior risk prediction compared with the current statistical methods for patients undergoing cytoreductive surgery? Findings In this prognostic study, an optimized machine learning model demonstrated superior capability of predicting individual-level risk of major complications after cytoreductive surgery than traditional methods. Cohort-level risk prediction allowed unbiased categorization of patients into 6 distinct surgical risk groups. Meaning These results suggest that explainable machine learning methods cannot only provide accurate risk prediction but can also allow identification of potentially modifiable sources of risk on patient and cohort levels.


eMethods.
Predictors: There were a total of 147 predictors, and the complete list of predictors were shown in Table S1. The predictors were from the following groups: Demographic (e.g., Gender, BMI), comorbidity (e.g., Prior cardiac event, diabetes), pre-operative laboratory tests (e.g., Platelet count, WBC count), diagnostic and staging workup (e.g., Extraperitoneal metastases), extent of surgery and operative predictors (e.g., Right diaphragm peritonectomy, extubated in OR), and pre-operative treatment (e.g., type and duration of chemotherapy). To pre-process these predictors, we first grouped them into four data formats -ordinal, binary, categorical, and continuous. We then regrouped the ordinal predictors, such as number of previous CRS, into three categories: None, once; and more than once. In the next step, we pre-processed the missing values. For each of the categorical predictors, a missing category was created. For each of the continuous predictors, the missing values were replaced by its mean or median based on the skewness of its distributions. For example, if the distribution was right or left skewed, the median imputation was applied. If the continuous predictor missed more than 50%, its missing values were replaced by an extreme value (e.g., 999) and a corresponding missing indicator was created. Lastly, we transformed the categorical variables using the one-hot encoder, explicitly creating indicator predictor for each level of the categorical variable. 1

Predictive modeling
The primary outcome of interest was grade 3 or higher (grade3+) complications based on the Clavien-Dindo classification system. 2 To predict grade 3+ surgical complications, we used an ensemble-based ML model -gradient boosting model (hereafter referred to as GBM), as implemented in the lightGBM package in Python. 3 GBM is a machine learning method that combines a series of simple tree-based models, such as decision trees. 4 This method first builds a simple tree-based model, and the added more models one at a time to the existing model to minimize the loss function, a function of the difference (aka "the loss) between the prediction of the model and the given outcome value. This process is known as gradient descent. The method is also able reduce overfitting and improve performance by tuning several parameters, such as the number of trees, the fraction subsample, or columns to build each of the tree-based models, and the depth of the trees. The optimal parameters of the final model are selected based on the loss function and the specified evaluation metrics, such as area under the precision/recall curve.
We first trained the model with two different sets of training set. When building a model, data are usually divided into training and test set to avoid overfitting. Based on empirical studies, the best results can be obtained if we use 70-80% of the data for training, and 20-30% data for testing. 5 The first set contained 80% of the patients, which was the entire training set, and the second was a subset of this training set, which only consists of the patients with grade 3+ complications and no complications. We then trained this model on the extremes of outcomes (no complication vs. grade 3+ complication) to improve model performance. We reasoned that if the dichotomy is magnified, it will allow a greater accuracy of optimized GBM model in identifying features of patients with grade 3+ complications during training. Both prediction models were trained using a 5-fold cross validation to find the optimal hyperparameters, and their tuning ranges are listed in Table S2. During the cross validation, we randomly assigned the patients into five different subsets and ensured that the ratio between the two outcome groups (e.g., grade 3+ versus grade 0) were the same in each of these subsets. We trained the model on four of these groups and tested it on the fifth group. This process was repeated five times so that each patient was used in both training and validation groups. Each time, the areas under the receiver operating characteristic curve (AUROC) and precision-recall curve (AUPRC) were generated, and the optimal values of the hyperparameters that maximized the average AUROC (i.e. overall ROC) was selected. The models with the optimal values of both models were then validated by the independent holdout test set, which contained the remaining 20% of the patients.
Separately, we developed two multivariate logistic regression models to predict surgical complication of grade 3+ vs. no complication. The first model (hereafter referred to as MLR model 1), which included all significant predictors from univariate logistic regression models and excluded the predictors that were highly correlated with each other. The second model -hereafter referred to as MLR model 2 -used the predictors specified from a previously published paper that predicted surgical complication. These predictors included (CCI, excluding the index malignancy), symptoms, and prior resection and operative status. 6 We then compared predictive performances of these two regression models to that of the GBM.

Local and Global Interpretation of ML models
Although ensembled-based GBMs may provide good prediction accuracy, they cannot be applied in the clinic because the generated output cannot be interpreted. To facilitate interpretation of the ensemble-based GBM method we used an artificial intelligence SHAP (SHapley Additive exPlanations) method. 7,8 This SHAP method calculates a total SHAP value for each individual (i.e., individual-level total SHAP value) -a patient with a higher SHAP value corresponds to a higher likelihood of the target outcome i.e. major complications. In addition, the method breaks down the SHAP values of predictors for a given individual to predictor specific SHAP value which examines how the levels of predictors contribute to this individual's total SHAP value. These predictor-level SHAP values can also be aggregated to show how each of the predictor affects the outcome across all individuals. SHAP values enable the interpretation of the model both on a local (e.g., individual prediction) or a global (e.g., population trend) level.
To facilitate the interpretation, the method creates the force, summary, and dependence plots to visualize the local-and global-level SHAP values. On the local level, the force plot shows the individual-level total SHAP value and the predictors that contribute to this value. On the global level, the summary and dependence plots display the predictor specific SHAP value across all individuals. Specifically, the summary plot ranks the predictors' predictive ability to the outcome as well as the direction of the effects; the dependence plots not only display the direction of effects, but also shows the non-linear associations and interactions between predictors. In addition, individual-level predictions can be embedded into an explanation space (i.e., Local explanation embedding) to make the global interpretation. In this space, individuals with a similar combination of individual-level SHAP values were grouped together based on Euclidean distance -an unsupervised distance-based clustering method. These clusters (or patient groupings) allow us to discover common patterns among a subgroup of the population, and to interpret how these patterns together lead to the outcome. Balancing of positive and negative weight.
Fixed at the ratio of two outcome groups num_boost_round Number of gradient boosted trees. Fixed at 1000 for tuning the number of tree, and the optimal number of trees for the final model 1 was determined using 5-fold CV. max_depth Maximum tree depth for base learners. Fixed at 6 for tuning the number of tree, and randomly picked from 3 to 10 in the final model. min_child_weight Minimum sum of instance weights needed in a child.
Fixed at 1 for tuning the number of tree, and randomly picked from 1 to 5 in the final model. colsample_bytree Subsample ratio of columns when constructing each tree.
Fixed at 1 for tuning the number of tree, and uniformly picked from 0.4 to 0.9 in the final model. Subsample Subsample ratio of the training instance.
Uniformly picked from 0.4 to 0.9 in the final model. Reg_alpha L1 regularization Uniformly picked from 10 evenly spaced points between decades 10^-2 and 10^2 Reg_lambda L2 regularization Uniformly picked from 10 evenly spaced points between decades 10^-2 and 10^2 1 A two-stage tuning process was adopted to train the model. In stage I, the initial number of gradient boosted tree (num_boost_round) was tuned in a model with all other parameters fixed at a default value. In stage II, using the number of tree tuned from the stage I model, the other parameters of the models, which included max depth, min child weight, subsample rate, column sample rate, regularized alpha and lambda levels, were further tuned. The model prediction and interpretion were generated using the stage II model with the tuned parameters. 0.43 0.39 0.15 1 All three models were developed with the subset of the training set excluding the patients with grade 1 and 2 complications. 2 MLR model 1 included all significant predictors from univariate logistic regression models and excluded the predictors that were highly correlated with each other. 3 MLR model 2 included CCI score, symptomatic, and previous CRS and HIPEC status. 4 The AUROC and AUPRC of the test set were reported. 5 The threshold to classify the predicted probability was determined at the recall and precision of 40%. 6 Positive class was the total number of patients who had the outcome.