Development and Performance of the Pulmonary Embolism Result Forecast Model (PERFORM) for Computed Tomography Clinical Decision Support

Key Points Question Can machine-learning approaches achieve an objective pulmonary embolism risk score by analyzing temporal patient data to accurately inform computed tomographic imaging decisions? Findings In this multi-institutional diagnostic study of 3214 patients, a machine learning model was designed to achieve an accurate patient-specific risk score for pulmonary embolism diagnosis. The model was successfully evaluated in both multi-institutional inpatient and outpatient settings. Meaning Machine learning algorithms using retrospective temporal patient data appear to be a valuable and feasible tool for accurate computation of patient-specific risk score to better inform clinical decision-making for computed tomographic pulmonary embolism imaging.

This supplementary material has been provided by the authors to give readers additional information about their work.

Methodology for Temporal Feature Engineering
Demographics - We considered four static features: gender (male/female), race/ethnicity (White/Black/Asian/Native American/other/unknown), age at time of observation, and smoking habit (yes/no), and coded them as categorical variables (age binned into 10 groups). In case of a change in smoking status, we considered only the current observation and coded it as 'Smoking'/'Non-smoking'.

Vitals - We considered only the primary vital signs of the patient, which include systolic and diastolic blood pressure, height, weight, body mass index (BMI), temperature, respiration rate, pulse oximetry (SpO2), and heart rate. For both the internal and external datasets, the primary vitals are recorded using the LOINC standard coding system 1. To capture temporality, we measured the sensitivity to change in primary vitals within a 30-day window by computing derivatives of each vital sign along the temporal axis, where the first value is the normal range of the targeted vital. The derivative of a vital can be represented as d_i = v_i - v_(i-1), i = 1, ..., n, where v_1, v_2, ..., v_n are the measurements of the vital over time and v_0 is the normal range of the targeted vital. Given that the majority of the targeted population are adults (mean age: Stanford 60.53 years and Duke 70.2 years), we compared vital signs against adult normal values when prior baseline vitals were not available.

Inpatient and outpatient medication - The inpatient and outpatient drug formulary and vocabularies were mapped to the 2016 version of RxNorm 2. Prescription orders were distilled to Pharmacologic class labels, in which active moieties that share scientifically documented properties are grouped on the basis of any combination of three attributes that the FDA has determined to be scientifically valid and clinically meaningful: Mechanism of Action (MOA), Physiologic Effect (PE), and Chemical Structure (CS).
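The vital-sign derivative described above can be sketched as follows. This is a minimal illustration, not the study's code; the function and variable names are our own, and we assume the measurements within the 30-day window are already sorted in time.

```python
import numpy as np

def vital_derivatives(measurements, normal_value):
    """First differences of a vital sign along the temporal axis.

    The series is prepended with the normal-range value (v_0), so the
    first derivative captures deviation from normal when no prior
    baseline measurement is available.
    """
    series = np.concatenate(([float(normal_value)],
                             np.asarray(measurements, dtype=float)))
    return np.diff(series)  # d_i = v_i - v_(i-1)
```

For example, systolic readings of 130, 125, and 140 against a normal value of 120 yield derivatives 10, -5, and 15.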
For drug feature engineering, we considered a 12-month window and identified 641 unique Pharmacologic classes of drugs given to the training-set SHC patients (inpatient and outpatient). We then coded medication usage with two numeric representations: (1) presence/absence of the medication, a binary value that captures whether a medication from a particular Pharmacologic class was given to the patient within the 12-month window; and (2) frequency of the medication, a numeric value that captures how many times the particular medication was repeated within 12 months.

Diagnosis code - Diagnosis codes were considered in ICD-9 format (codes with less than 1% occurrence in the training set were excluded). To limit the learning space, the diagnosis codes were collapsed to the top diagnosis categories using the International Classification of Diseases, Version 9. Expansion to subcategories was performed with review of the ICD-9 taxonomy, such that in total 141 unique diagnosis groupings (see Supplement Table 1) were generated, with each group represented as a binary variable indicating the presence/absence of a particular diagnosis within the 12-month window. To ensure no data leakage, we dropped from our analysis all ICD-9 codes recorded during the same encounter (hospitalization or ED visit) as the CT exam.
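The two medication representations described above (presence/absence and frequency over the 12-month window) can be sketched as follows. The names and the toy vocabulary are illustrative only; the actual pipeline operates on the 641 Pharmacologic classes observed in the training set.

```python
from collections import Counter

def encode_medications(orders, drug_classes):
    """Encode a patient's 12-month medication history.

    orders: Pharmacologic-class labels of all prescriptions given to the
            patient within the 12-month window.
    drug_classes: fixed vocabulary of class labels defining the feature order.

    Returns (presence, frequency): presence is binary per class, frequency
    counts how many times a class was repeated in the window.
    """
    counts = Counter(orders)
    presence = [1 if cls in counts else 0 for cls in drug_classes]
    frequency = [counts.get(cls, 0) for cls in drug_classes]
    return presence, frequency
```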
Laboratory tests - All available laboratory tests were categorized into 22 unique test categories (Supplement Table 2). Laboratory tests are coded as binary presence/absence, and we also captured the latest value of each test. Missing lab data are coded with a '0' value.
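A minimal sketch of this laboratory-test coding (binary presence/absence plus latest value, with missing labs coded as 0). The function name and input layout are assumptions for illustration; observations per category are assumed sorted by time.

```python
def encode_labs(results, lab_categories):
    """Encode lab tests as presence/absence plus latest observed value.

    results: dict mapping lab category -> list of (timestamp, value)
             pairs, sorted by timestamp.
    lab_categories: the fixed list of test categories defining feature order.

    Missing categories are coded as 0 for both presence and value.
    """
    presence, latest = [], []
    for cat in lab_categories:
        obs = results.get(cat, [])
        presence.append(1 if obs else 0)
        latest.append(obs[-1][1] if obs else 0.0)
    return presence, latest
```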
2. Cross-validation performance of the models on SHC patient data
eFigure 1 summarizes the 10-fold cross-validation results on the SHC cases. The ElasticNet model achieved a mean AUC of 0.90 (±0.01) and the neural model a mean AUC of 0.83 (±0.01), with both models showing low variation (±0.01) across folds, which indicates high generalizability. (D-dimer <500: normal.)
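The 10-fold cross-validation setup can be approximated with scikit-learn's elastic-net-penalized logistic regression. This is a hedged sketch on synthetic data, not the study's actual training code or cohort; the hyperparameters shown (l1_ratio, C) are illustrative defaults.

```python
# Sketch of 10-fold cross-validation with an elastic-net-penalized
# logistic regression, evaluated by AUROC on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
aucs = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC {aucs.mean():.2f} +/- {aucs.std():.2f}")
```

Reporting the mean and standard deviation across folds, as above, is what supports the low-variation claim in the text.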

Model interpretability
ElasticNet model - eFigure 2 shows the trends of the 22 most relevant features for predicting PE pre-test risk. From the graph we can clearly see that the presence of pulmonary embolism and infarction, and of neoplasm (cancer), influenced the PE prediction the most. Interestingly, the actual value of the D-dimer lab test is also listed among the top features, not just the presence of the D-dimer test. We can thus assume that these features are relevant for assessing whether a new patient has PE.

PE neural model - We used a method called sensitivity analysis to compute the relevance of each EMR feature. Sensitivity analysis takes the partial derivative of the loss function of the trained neural recurrence model with respect to each input feature to derive its importance for the targeted prediction task. eFigure 3 presents the results of the sensitivity analysis of the inputs for two cases, where the importance scores are plotted as bars and the predicted probability values and ground-truth labels are also shown.
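The sensitivity analysis described above can be illustrated with a finite-difference approximation of the loss gradient with respect to the inputs. The study takes the analytic partial derivative on the trained recurrent model; here, as an assumption for illustration, a simple logistic scorer stands in for that network and central differences stand in for autograd.

```python
import numpy as np

def sensitivity_scores(predict_proba, x, y_true, eps=1e-5):
    """Approximate |d loss / d x_j| for each input feature j.

    Uses central finite differences on the binary cross-entropy loss
    of predict_proba; larger scores mark more influential features.
    """
    def bce(v):
        p = predict_proba(v)
        return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    grads = np.zeros(x.size, dtype=float)
    for j in range(x.size):
        xp, xm = x.astype(float).copy(), x.astype(float).copy()
        xp[j] += eps
        xm[j] -= eps
        grads[j] = (bce(xp) - bce(xm)) / (2 * eps)
    return np.abs(grads)
```

With a logistic scorer whose weight on a feature is zero, that feature's sensitivity score is (near) zero, matching the intuition that the bar plots in eFigure 3 highlight the features the model actually uses.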

EMR grouping criteria: diagnosis code and laboratory exams
In Section 2.2, we described the proposed feature engineering pipeline, which parses the EMR while preserving significant temporal properties; the pipeline uses pre-defined grouping criteria for diagnosis codes and laboratory tests. eTable 2 lists the diagnosis grouping based on the ICD-9 standard, and eTable 3 lists the laboratory test grouping, which was generated through discussion with domain experts from both Stanford and Duke.

Comparison between multiple machine learning models
We experimented with multiple linear and non-linear machine learning models using the same temporal feature vector and report the performance as AUROC and Negative Predictive Value (NPV) in eTable 4. In the manuscript, we described only the ElasticNet model, which achieved the best performance in terms of AUROC and NPV on both the SHC and Duke hold-out test sets.
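The two reported metrics can be computed as in the small self-contained sketch below (rank-based AUROC without tie handling, and NPV from binarized predictions); the function names are ours, not the study's.

```python
import numpy as np

def auroc(y_true, scores):
    """Rank-based AUROC: probability that a random positive case is
    scored above a random negative case (no tie correction)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def npv(y_true, y_pred):
    """Negative Predictive Value: TN / (TN + FN)."""
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tn / (tn + fn)
```

NPV is especially relevant here because a high NPV supports safely deferring CT imaging for patients the model scores as low risk.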