Use of Machine Learning to Develop and Evaluate Models Using Preoperative and Intraoperative Data to Identify Risks of Postoperative Complications

Key Points Question Can machine learning models predict patient risks of postoperative complications related to pneumonia, acute kidney injury, deep vein thrombosis, delirium, and pulmonary embolism? Findings In a cohort study of 111 888 operations at a large academic medical center, machine learning algorithms exhibited high areas under the receiver operating characteristic curve for predicting the risk of postoperative complications related to pneumonia, acute kidney injury, deep vein thrombosis, pulmonary embolism, and delirium. Meaning These findings suggest that machine learning models using preoperative and intraoperative data can predict postoperative complications and generate reliable and clinically meaningful interpretations for supporting clinical decisions along the perioperative care continuum.


Methods
In this appendix, we first describe the following major tasks that we accomplished for data extraction:

(1) Use filters to clean the data
(2) Capture various preoperative laboratory values
(3) Create postoperative outcomes

These tasks are outlined sequentially on the following pages. For the other outcomes (pneumonia, DVT, PE, and delirium), refer to "SATISFY-SOS Pilot Study - Algorithm for Automated Medical Record Review" at the end of this appendix.

• For each lab test listed in Table 1, identify the value that is closest to the anesthesia start time but is still before the anesthesia start time (a code sketch of this selection follows the list).
  o Consider all similarly-named variables as a group. (For example, the preoperative bicarbonate would be the value closest to anesthesia start, regardless of whether it is lab code 4126 or 5297.)
  o Include labs drawn up to 30 days before the anesthesia start time.
  o If a patient did not have a particular lab drawn, then treat that field as missing.
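As a concrete illustration, the following is a minimal pandas sketch of this selection logic. The inputs and column names (labs, anes_start, patient_id, lab_name, draw_time, value) are hypothetical placeholders, not taken from the study's actual extraction code, and similarly-named lab codes are assumed to be already mapped to a single lab_name group.

```python
import pandas as pd

def preop_lab_values(labs: pd.DataFrame, anes_start: pd.Series) -> pd.DataFrame:
    """For each patient and lab group, take the value closest to (but before)
    anesthesia start, looking back at most 30 days.

    labs: columns [patient_id, lab_name, draw_time, value].
    anes_start: anesthesia start time, indexed by patient_id.
    """
    df = labs.merge(anes_start.rename("anes_start"),
                    left_on="patient_id", right_index=True)
    # Keep draws strictly before anesthesia start, within a 30-day lookback.
    window = (df["draw_time"] < df["anes_start"]) & \
             (df["draw_time"] >= df["anes_start"] - pd.Timedelta(days=30))
    df = df[window]
    # The most recent qualifying draw wins; patients with no qualifying draw
    # are absent from the result, so their field stays missing (NaN).
    df = df.sort_values("draw_time").groupby(["patient_id", "lab_name"]).tail(1)
    return df.pivot(index="patient_id", columns="lab_name", values="value")
```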
Thus each patient should have variables called "pre op ALT," "pre op albumin," and so on.

TASK 3: Create two variables for postoperative complications: Acute kidney injury (AKI)

The first variable uses creatinine values to define AKI. A patient has an AKI if the creatinine rises by 0.3 mg/dL or by 50% of its preoperative value within the first 48 hours after surgery.
The second variable uses new onset of renal replacement therapy (dialysis) to define AKI. A patient has an AKI if they were not on dialysis before surgery and they needed dialysis after surgery prior to discharge.
Patients are excluded if they are undergoing kidney transplant or if they are undergoing creation or revision of dialysis access (for example, arteriovenous fistula or arteriovenous graft). eTable 2 at the end of this section contains CPT codes for these procedures, eTable 3 contains ICD-9-CM codes, and eTable 4 contains ICD-10-PCS codes.

(1) Acute Kidney Injury - Creatinine Definition

This calculation requires the CKD_DialysisHistory variable.

• Set the AKI_Creatinine variable to missing if any of the following are true:
  o CKD_DialysisHistory = "ongoing hemodialysis" or "ongoing peritoneal dialysis"
  o Preoperative creatinine is missing
  o Postoperative creatinine peak is missing
  o Procedure code matches a CPT code in eTable 2, an ICD-9-CM code in eTable 3, or an ICD-10-PCS code in eTable 4 (in other words, the patient is undergoing kidney transplant or a dialysis access procedure)
• Set the AKI_Creatinine variable to 1 if any of the following are true:
  o Postoperative peak creatinine >= preoperative creatinine + 0.3
  o Postoperative peak creatinine >= 1.5 * preoperative creatinine
• Otherwise set the AKI_Creatinine variable to 0

(A code sketch of this rule appears below.)
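The following is a minimal Python sketch of this rule. The per-patient field names (preop_cr, postop_peak_cr, dialysis_history, excluded_procedure) are illustrative, not the study's actual variable names; excluded_procedure stands in for the eTable 2-4 code match.

```python
def aki_creatinine(preop_cr, postop_peak_cr, dialysis_history, excluded_procedure):
    """AKI by the creatinine definition; returns 1, 0, or None (missing)."""
    # Missing if on ongoing dialysis, if either creatinine value is
    # unavailable, or if the procedure is a kidney transplant / dialysis
    # access procedure (a code match in eTable 2, 3, or 4).
    if dialysis_history in ("ongoing hemodialysis", "ongoing peritoneal dialysis"):
        return None
    if preop_cr is None or postop_peak_cr is None or excluded_procedure:
        return None
    # AKI if the postoperative peak rises by >= 0.3 mg/dL or by >= 50%.
    if postop_peak_cr >= preop_cr + 0.3 or postop_peak_cr >= 1.5 * preop_cr:
        return 1
    return 0
```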

(2) Acute Kidney Injury - Dialysis Definition

This calculation requires the M_Kidney_Dialysis variable and the CKD_DialysisHistory variable.

• Set the AKI_Dialysis variable to missing if any of the following are true:
  o CKD_DialysisHistory = "ongoing hemodialysis" or "ongoing peritoneal dialysis"
  o Procedure code matches a CPT code in eTable 2, an ICD-9-CM code in eTable 3, or an ICD-10-PCS code in eTable 4

(A code sketch of this definition appears below.)
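A minimal sketch of this definition, assuming (per the prose definition earlier in this task) that M_Kidney_Dialysis flags new renal replacement therapy after surgery and before discharge; the positive/negative assignments are an assumed completion of the rule, and the field names are illustrative.

```python
def aki_dialysis(new_postop_dialysis, dialysis_history, excluded_procedure):
    """AKI by the dialysis definition; returns 1, 0, or None (missing)."""
    if dialysis_history in ("ongoing hemodialysis", "ongoing peritoneal dialysis"):
        return None
    if excluded_procedure:  # kidney transplant or dialysis access procedure
        return None
    # Assumed completion of the rule, following the prose definition:
    # new-onset dialysis after surgery and before discharge counts as AKI.
    return 1 if new_postop_dialysis else 0
```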

TASK 1: Identify the target date range
A. Variables to be used in this procedure
   a. OR_Date - This is the date that triggered our team to send a survey to the patient. It is defined as the first time the patient received a billable anesthesia service at a BJC facility starting two weeks prior to the date of study consent. It is often, but not always, the same as the procedure of interest. There are no missing values.
   b. PAP_Type - This is a categorical variable indicating where the patient underwent preoperative assessment by the anesthesiology department. Possible values include:
      i. "CPAP Clinic" - assessed at the Center for Preoperative Assessment and Planning, an outpatient clinic
      ii. "CPAP-incomplete" - assessed at CPAP, but data form is <75% complete
      iii. "DPAP (holding area)" - assessed on day of surgery in preop holding area
      iv. "DPAP (on ward)" - assessed on day of surgery on hospital ward
      v. "IPAP" - assessed as an inpatient, prior to day of surgery
      vi. "IPAP-incomplete" - assessed as inpatient, but data form is <75% complete
      vii. "TPAP-AP" - assessed by telephone
      viii. "TPAP-RN" - assessed by telephone by a nurse
   c. PAP_Date - This is the date that the patient underwent their CPAP, DPAP, IPAP, or TPAP. In most cases, this is also the date of study consent.
   d. CPAP_DOSPlanned - This is the date of anticipated surgery, as documented in the preoperative assessment note. The value is missing for about 50% of patients in our study. If the patient has a surgery on this date, it is likely the procedure of interest.
B. Identify the date of the procedure of interest
   a. If PAP_Type = "IPAP" or "IPAP-incomplete" or "DPAP (holding area)" or "DPAP (on ward)" or "TPAP-AP" or "TPAP-RN", OR if PAP_Type = [missing]:
      i. Then OR_Date gives the date of the procedure of interest.
Patients who did not go to the CPAP clinic must have been consented during the hospitalization for the procedure of interest. It is safe to assume that these patients did not have additional billable anesthesia services in the past two weeks that were unassociated with the current hospitalization. Therefore, OR_Date accurately gives the date of the procedure of interest.
   b. If PAP_Type = "CPAP Clinic" or "CPAP-incomplete" AND OR_Date = CPAP_DOSPlanned:
      i. Then OR_Date gives the date of the procedure of interest. These criteria select patients who were seen in CPAP, had the procedure of interest on the originally scheduled date, and did not undergo any minor procedure between the CPAP visit and the procedure of interest.
   c. If PAP_Type = "CPAP Clinic" or "CPAP-incomplete" AND OR_Date does not equal CPAP_DOSPlanned:
      i. Search for an "Anesthesia Record" document with a date matching CPAP_DOSPlanned. If there is a match, then CPAP_DOSPlanned gives the date of the procedure of interest. If there is no match, continue to the next step. If CPAP_DOSPlanned is missing, skip this step. This covers situation (1) above.
      ii. If neither of the above searches yields a result, then use OR_Date as the date of the procedure of interest. This covers situations (2), (3), and (4) above. For situation (2), we have identified the incorrect procedure but do not have sufficient information to find the procedure of interest. For situation (3), we have identified the procedure of interest. For situation (4), realize that the patient never had the intended procedure of interest, so chart review will focus on the minor procedure. (A code sketch of this decision logic appears below.)
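The following is a minimal sketch of this decision logic. The helper has_anesthesia_record_on is a hypothetical lookup, not defined in the study materials, and the branch order follows steps B.a through B.c above.

```python
NON_CPAP_TYPES = {"IPAP", "IPAP-incomplete", "DPAP (holding area)",
                  "DPAP (on ward)", "TPAP-AP", "TPAP-RN", None}

def procedure_date(pap_type, or_date, cpap_dos_planned, has_anesthesia_record_on):
    """Return the date of the procedure of interest (Task 1, step B).

    has_anesthesia_record_on: callable(date) -> bool, a hypothetical lookup
    for an "Anesthesia Record" document on a given date.
    """
    # Step B.a: non-CPAP assessments (or missing PAP_Type) were consented
    # during the hospitalization of interest, so OR_Date is accurate.
    if pap_type in NON_CPAP_TYPES:
        return or_date
    # Step B.b: CPAP patients whose surgery occurred on the planned date.
    if or_date == cpap_dos_planned:
        return or_date
    # Step B.c.i: look for an anesthesia record on the planned date.
    if cpap_dos_planned is not None and has_anesthesia_record_on(cpap_dos_planned):
        return cpap_dos_planned
    # Step B.c.ii: fall back to OR_Date (situations 2-4 in the text).
    return or_date
```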
C. Identify the visit number (starts with 701… in most cases) associated with the date of the procedure of interest. This is included in the "Anesthesia Record" document, which all patients in the study will have. It is also in the operative summary document, but some patients in our study undergo procedures (e.g., electrophysiology) that do not generate that document.
D. Restrict the search for complications to data corresponding to this visit number. Examine complications that occurred during the hospitalization for the procedure of interest.
TASK 2: Identify complications occurring within the target date range

For all subsequent steps, restrict the search to data corresponding to the visit number identified in Task 1. In general, searches of ICD-9 diagnosis codes should exclude the admitting diagnosis, as the admitting diagnosis should be the indication for surgery and would be unlikely to represent a postoperative complication. Note that the admitting diagnosis is typically repeated in the list of final diagnoses; it should still be excluded from searches.

Comparison of Imputation Methods

In this section, we compare the 7 most common data imputation techniques: mode, mean, median, dummy indicator, Multiple Imputation by Chained Equations2 (with 10 iterations), MissForest3 (with 10 iterations), and kNN imputation. Because kNN requires the selection of hyperparameters, we implemented kNN both with 3 nearest neighbors and uniform weights and with 5 nearest neighbors and distance-based weights. In total, 8 imputation methods were compared.

The experiment was designed using the pneumonia dataset in 3 steps. First, each imputation method was performed on a copy of the original pneumonia dataset, and the imputed values for nominal (categorical) variables were rounded to the closest values in the original distribution; by doing so, the range of each variable was preserved. With 8 imputation methods implemented, we ended up with 8 imputed datasets. Second, each imputed dataset was processed in the same way as described in the manuscript: categorical variables were split into binary variables by one-hot encoding, and continuous variables were normalized by z-scoring. Last, each processed dataset was evaluated by gradient boosting tree (GBT), random forest (RF), and logistic regression (LR) with 5 random shuffles of cross-validation. The configuration of GBT, RF, and LR was the same as in the manuscript. (A sketch of this pipeline appears below.)
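As an illustration, the following is a minimal scikit-learn sketch of this comparison, assuming a numeric feature matrix X with missing values and binary labels y. MICE and MissForest are approximated here by IterativeImputer with its default estimator and a random-forest estimator, respectively; this, and the omission of the one-hot/z-scoring step, are simplifying assumptions rather than the study's exact implementation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "mode": SimpleImputer(strategy="most_frequent"),
    # Dummy indicator: constant fill plus explicit missingness flags.
    "dummy": SimpleImputer(strategy="constant", fill_value=0, add_indicator=True),
    "mice": IterativeImputer(max_iter=10, random_state=0),
    # MissForest approximated by iterative imputation with random forests.
    "missforest": IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                                   max_iter=10, random_state=0),
    "knn3_uniform": KNNImputer(n_neighbors=3, weights="uniform"),
    "knn5_distance": KNNImputer(n_neighbors=5, weights="distance"),
}

models = {
    "GBT": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(n_estimators=300),
    "LR": LogisticRegression(solver="newton-cg", max_iter=1000),
}

def compare_imputations(X, y):
    """AUROC of each (imputer, model) pair under 5-fold cross-validation."""
    results = {}
    for iname, imputer in imputers.items():
        Xi = imputer.fit_transform(X)
        for mname, model in models.items():
            scores = cross_val_score(model, Xi, y, cv=5, scoring="roc_auc")
            results[(iname, mname)] = np.mean(scores)
    return results
```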
The performance of each imputation method is tabulated in eTable 1. As the table shows, regardless of the imputation method used, GBT had the best performance in terms of both AUROC and AUPRC. Moreover, regardless of the machine learning model used, the imputed dataset produced by the dummy indicator method was the most predictive. This might be explained by the fact that some measurements, especially lab tests, are missing by "intention": when clinicians decide not to perform a lab test on a patient, it reflects the clinicians' opinion that the lab result is expected to be normal. The missingness indicator in the dummy indicator method preserves this "intention" of the clinicians, hence the imputed dataset becomes more informative.

For the support vector machine (SVM), the regularizer was set to the l2 norm to avoid overfitting, and the loss function was set to squared hinge loss. Due to the large number of records in the dataset, a linear SVM was used. In logistic regression (LR), the Newton-CG solver was used for its optimal performance on large datasets. The hyperparameters of the random forest (RF) and deep neural network (DNN) were chosen by grid search. In RF, we varied the number of base learners from 40 to 300, the maximum depth from 20 to 200, and the minimum samples for splits from 1 to 7. The optimal hyperparameters for RF were 300 base learners, a maximum depth of 200, and a minimum of 4 samples for splits. For DNN models, we explored 3-layer, 4-layer, and 5-layer architectures by varying the number of nodes in each layer. When exploring the optimal DNN architecture, we varied the number of nodes in the first layer as 16, 32, 64, or 128; the number of nodes in the second layer was chosen as half of the first layer, the number of nodes in the third layer as half of the second layer, and so on. The last layer of the DNN model was always unchanged and had 2 nodes, as it connected directly to the softmax layer to generate probabilistic output. The optimal DNN configuration had 4 layers with 128, 64, 32, and 2 nodes, respectively. When training the DNN model, we further explored learning rates of 0.0001, 0.001, and 0.01, and batch sizes of 32, 64, 128, 256, 512, 1024, and 2048. The optimal settings were a learning rate of 0.001 and a batch size of 2048. The gradient boosting tree (GBT) was created with tree-based learners and a logistic loss function. Note that different versions of GBT may affect model performance due to parameter settings. The version list and Python code are uploaded to GitHub: xuebing1234/handoff_framework

eAppendix 5. Details of Performance Metrics of Each Model

Because sensitivity, specificity, precision, F-score, and accuracy vary depending on the threshold of the ML models (as shown in the ROC curves), we fixed specificity at 95% for easier comparison between models (a sketch of this threshold selection appears below). As shown in eTable 1 to eTable 5, in most cases the machine learning model with the highest AUROC also had the highest AUPRC, hence model selection based on either AUROC or AUPRC would yield similar results. This observation is consistent with theory: if a model dominates in the ROC curve, then it also dominates in the PRC curve.1 See the reference for a detailed proof. Once the model is determined, the threshold can be carefully adjusted based on clinicians' judgement of the relative weight between sensitivity, specificity, etc.
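As an illustration of fixing specificity at 95%, the following is a minimal sketch using scikit-learn's ROC utilities; y_true and y_score are hypothetical held-out labels and model scores, not outputs from the study's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_at_specificity(y_true, y_score, target_specificity=0.95):
    """Pick the ROC operating point whose specificity is >= the target and
    return (sensitivity, threshold) at that point."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    specificity = 1.0 - fpr
    # roc_curve orders points by decreasing threshold, so specificity is
    # non-increasing; take the last point still meeting the target, which
    # maximizes sensitivity subject to the specificity constraint.
    eligible = np.where(specificity >= target_specificity)[0]
    i = eligible[-1]
    return tpr[i], thresholds[i]
```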
eTable 8: AUROCs of best machine learning models for pneumonia, acute kidney injury (AKI), deep vein thrombosis (DVT), pulmonary embolism (PE), and delirium.

In this section, we show 2 patients with negative predicted risks and 2 patients with positive predicted risks. For the risk overview, we created three candidate graphs: a) comparison of the patient with respect to the average of patients who had pneumonia (Fig 1.a, 2.a, 3.a, and 4.a); b) comparison of the patient with respect to the average of patients who did not have pneumonia (Fig 1.b, 2.b, 3.b, and 4.b); and c) comparison of the patient with respect to both the average of patients who had pneumonia and the average of patients who did not (Fig 1.c, 2.c, 3.c, and 4.c). For the key intraoperative variables (blood pressure in the prediction model of pneumonia), we created detailed visualizations. First, we created a nested pie chart to show how much contribution (measured by SHAP values) each statistical feature makes to the prediction (Fig 1.d, 2.d, 3.d, and 4.d).
The outer circle shows the average of all patients, and the inner circle shows the patient of interest. Second, we created a bar plot to show how much the value of each statistical feature differs from the average of all patients (Fig 1.e, 2.e, 3.e, and 4.e). Each statistical feature was normalized to zero mean and unit variance, so the magnitude reflects the relative difference from the average. (A sketch of these two plots appears below.)

In this section, we show example cases of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for the 5 outcomes based on model interpretation. For the 20 example cases, we show: the comparison of patients with respect to the risk of pneumonia (TP: Fig 6.a, FP: Fig 6.b, TN: Fig 6.c, and FN: Fig 6.d); the comparison with respect to the risk of AKI (TP: Fig 7.a, FP: Fig 7.b, TN: Fig 7.c, and FN: Fig 7.d); the comparison with respect to the risk of DVT (TP: Fig 8.a, FP: Fig 8.b, TN: Fig 8.c, and FN: Fig 8.d); the comparison with respect to the risk of PE (TP: Fig 9.a, FP: Fig 9.b, TN: Fig 9.c, and FN: Fig 9.d); and the comparison with respect to the risk of delirium (TP: Fig 10.a, FP: Fig 10.b, TN: Fig 10.c, and FN: Fig 10.d).
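The following is a minimal sketch of how such plots could be generated with the shap and matplotlib libraries for a tree-based model; the model, feature matrix X, feature_names, and patient index i are hypothetical placeholders, and the study's exact figure styling is not reproduced.

```python
import numpy as np
import matplotlib.pyplot as plt
import shap

def interpretation_plots(model, X, feature_names, i):
    """Nested pie of per-feature |SHAP| contributions (cohort vs. patient i)
    and a bar plot of patient i's z-scored deviation from the average."""
    shap_values = shap.TreeExplainer(model).shap_values(X)
    if isinstance(shap_values, list):    # some explainers return per-class lists
        shap_values = shap_values[1]     # keep SHAP values for the positive class
    abs_shap = np.abs(shap_values)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    # Outer ring: average contribution across all patients;
    # inner ring: contributions for the patient of interest.
    ax1.pie(abs_shap.mean(axis=0), labels=feature_names, radius=1.0,
            wedgeprops=dict(width=0.3))
    ax1.pie(abs_shap[i], radius=0.7, wedgeprops=dict(width=0.3))
    ax1.set_title("Feature contributions (outer: cohort, inner: patient)")

    # Z-score each feature, then plot the patient's deviation from average.
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    ax2.bar(feature_names, z[i])
    ax2.set_title("Deviation from cohort average (z-scored)")
    plt.tight_layout()
    plt.show()
```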
In each figure, we show 4 distinct cases in the prediction of the same outcome. In subplots a and c, most of the patients' preoperative and intraoperative data were consistent with historical cases of positive/negative patients, which resulted in consistently increasing or decreasing risks. In subplots b and d, the patients' data had mixed effects: some measurements were consistent with positive cases while others were consistent with negative cases, hence the overall risk fluctuated. Taking Fig 6 as an example, some measurements, including albumin and hematocrit, were weighted as more important in the predictive model of pneumonia, so the values of these measurements misled the overall risk estimate. Note that the albumin level was missing in the cases of Fig 6.c and 6.d. In the presence of missingness, the machine learning model learned from historical data that a missing measurement indicator lowers the likelihood of pneumonia; however, this is not always true, and such missingness misled the prediction in Fig 6.d.
By looking at different scenarios for each outcome, we argue that model interpretation has several advantages over a simple risk score. First, regardless of the correctness of the prediction, model interpretation can identify the important variables in each case and show how those variables affect the predicted risk in comparison with the positive/negative cohorts. Second, when most measurements consistently contribute in the same direction, such model interpretation can give clinicians more confidence in trusting the predicted risks. Last but not least, outcomes and input variables may not have a simple causal relationship; in the cases of false negatives and false positives, the conflicting factors could highlight issues of wrong data, or alert clinicians to the complexity of the patient's scenario, so that more attention is paid to the details.

Scenario 1 occurs before surgery, when only preoperative data is used for surgery planning. Scenario 2 occurs after emergency surgery is performed, when only intraoperative data is used for the handoff process. Scenario 3 occurs after normal surgery, with both preoperative and intraoperative data used for the handoff process.