Predictive Modeling for Perinatal Mortality in Resource-Limited Settings

Key Points

Question Can prenatal and postdelivery variables accurately predict the risk of stillbirth and neonatal deaths in resource-limited settings of low- and middle-income countries?

Findings Using advanced machine learning–based modeling techniques on a large multicountry prospective maternal and neonatal database, this cohort study found that the prediction accuracy of models for risk of stillbirth and neonatal death using variables before delivery is low, but the prediction accuracy for neonatal death can be improved by including postdelivery variables. Birth weight was the most important predictor of neonatal mortality.

Meaning Models that include postdelivery variables have good prediction accuracy for neonatal deaths.

For each scenario, records of neonates with missing data among the predictors were excluded from the analyses, so that only neonates with complete data were analyzed. In addition, the delivery/day-1 and post-delivery/day-2 data sets were censored for deaths occurring before the corresponding grouping time points, so that only neonates still alive at those points were included. The full data set had 502,648 records with outcomes. The prenatal analysis excluded 15,006 records with missing predictor data, and the pre-delivery analysis excluded 15,111 records with missing predictor data. For the delivery/day-1 analysis, 487,326 records were available after censoring for prior death; of these, 17,814 records were excluded for missing predictor data.
Finally, for the post-delivery/day-2 analysis, 485,966 records were available after censoring for prior death; of these, 17,610 records were excluded for missing predictor data.
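The exclusion and censoring steps above can be sketched as follows; the column name `death_day` and the helper `build_scenario_dataset()` are hypothetical illustrations, not taken from the study's codebase.

```python
import pandas as pd

def build_scenario_dataset(df, predictors, death_day_col, cutoff_day=None):
    """Complete-case filtering (and optional censoring) for one scenario.

    `death_day_col` is a hypothetical column holding the day of death
    (NaN for survivors). When `cutoff_day` is given, neonates who died
    before the scenario's grouping time point are censored out, as
    described in the text.
    """
    out = df
    if cutoff_day is not None:
        # Keep only neonates still alive at the scenario time point.
        alive = out[death_day_col].isna() | (out[death_day_col] >= cutoff_day)
        out = out[alive]
    # Exclude records with missing data among the predictors.
    return out.dropna(subset=predictors)
```

Applied in sequence per scenario, this reproduces the pattern of record counts reported above: censoring first, then complete-case exclusion.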

Predictive model building process
The process of building and validating the predictive models followed the approach of Eggleston et al., with additional model assessment procedures made possible by the larger amount of available data. 2 As noted above, the analysis utilized 4 scenario datasets, each representing the set of predictors available at a specific time relative to delivery. Each scenario dataset was analyzed using the same process described here.
For a given scenario, we first randomly divided the scenario data set into ten subgroups, and each subgroup was then randomly divided into three analysis data sets: training (60%), test (20%), and validation (20%). 2 After the data splitting, a set of conventional and advanced machine learning predictive models was fit (and tuned where applicable) using the training data set within each of the 10 subgroups for a given scenario. The models fit were logistic regression and five advanced machine learning models: support vector machine with radial basis function kernel (SVM), logistic elastic net (EN), neural network (NN), gradient boosted ensemble (GBE), and random forest (RF).

Data management was completed using SAS 9.4 software (SAS Institute, Cary, NC), 3 model building was completed using the scikit-learn Python module, 4 and graphics were produced using R 4.0.2. 5 The scikit-learn LogisticRegression() function was used for logistic regression; the SGDClassifier() function with log loss and elastic net penalty was used for the elastic net; the MLPClassifier() function was used to fit the neural network; the RandomForestClassifier() function was used to fit the random forest; the GradientBoostingClassifier() function was used to fit the gradient boosting ensemble; and the SVC() function was used to fit the support vector machine. The GridSearchCV() function was used to tune the models, and the roc_auc_score() function was used to estimate the AUC of the ROC curve. The feature_importances_ attribute of the fitted models was used to assess the relative importance of predictors.

All models except logistic regression were tuned using 10-fold cross-validation on the training data, and each tuned model was then applied to the test data for an assessment of predictive accuracy. 2 To assess the consistency of accuracy, the tuning was repeated on the combined training and test data, and the retuned models were applied to the validation data. The predictive accuracy was assessed using the area under the curve (AUC) of the receiver operating characteristic (ROC) curves.
© 2020 Shukla VV et al. JAMA Network Open.
6 The models considered in this manuscript were selected because the authors were agnostic concerning which model would be best suited for predicting neonatal mortality, and each model uses the information in the data in a distinct way. Because the data were split into 10 subsets of training/test/validation data, a second round of validation was performed within each of the 10 subsets by retuning the models on the combined training and test data and assessing predictive accuracy on the validation data. As with the first validation, predictive accuracy was assessed using the area under the curve (AUC) of the receiver operating characteristic (ROC) curves. This process of tuning on the training data using 10-fold cross-validation, assessing on the test data, and then re-evaluating on the validation data after retuning on the combined training and test data, repeated ten times on separate subsets, allowed the authors to identify which model was consistently better than the others considered and to assess the stability of the predictive accuracy.
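A minimal sketch of the repeated splitting scheme, assuming records are addressed by integer index; the subgroup count and the 60/20/20 proportions follow the text, while the generator structure is an illustrative choice.

```python
import numpy as np

def repeated_splits(n_records, n_subgroups=10, seed=0):
    """Randomly divide record indices into `n_subgroups` subgroups,
    then split each subgroup 60/20/20 into training, test, and
    validation indices, mirroring the scheme described in the text."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_records)
    for subgroup in np.array_split(idx, n_subgroups):
        sub = rng.permutation(subgroup)
        n_train = int(0.6 * len(sub))
        n_test = int(0.2 * len(sub))
        yield (sub[:n_train],                       # training (60%)
               sub[n_train:n_train + n_test],       # test (20%)
               sub[n_train + n_test:])              # validation (20%)
```

Each of the ten yielded triples then drives one full round of tuning on training data, assessment on test data, and re-assessment on validation data.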
Because the entire analysis was repeated with each of the ten subsets for each scenario, the authors could calculate, for each tuned model, an average AUC on test data and a separate average AUC on validation data, the standard error of the AUC on test and validation data, and paired t-tests for all possible pairs of model comparisons. For these comparisons, the null hypothesis was no difference in predictive accuracy between the two models in each paired t-test. No adjustments for multiple comparisons were made. Once the validation-based AUC values and paired t-test results were reviewed, a best model was identified for use in developing a potentially improved logistic regression model within a given scenario, provided the validation-based AUC was at least 0.80. Given that the best models in the two scenarios with sufficient validation-based AUC values were tree-based ensemble models, this additional analysis involved identifying the top 15 important predictors within the model using the GINI importance measure. 7 These top 15 predictors were included in a set of potentially important predictors for inclusion in the potentially improved logistic regression model. In addition to the top 15 predictors from the best model, the least absolute shrinkage and selection operator (LASSO) method was used to investigate the value of any two-way interactions between predictors in the original logistic regression model, excluding site variables.
If the LASSO identified an interaction term as potentially valuable, that term was added to the list of potentially important predictors. Additional predictors were added to the list so that no interaction term was present without its main effect terms. With this full set of potentially important predictors, modified logistic regression models were built as surrogate models for interpretation and, if useful, as a basis for developing a portable risk scoring system that could be fully described in print.

Development of portable risk scoring system
For the delivery/day-1 and post-delivery/day-2 scenarios, the modified logistic regression model was used to translate measures of risk into probabilities of mortality by creating risk score weights from the parameter coefficients (eTable3). These risk score weights were created by multiplying the parameter coefficients by 10 and rounding to one decimal place.
These risk score weights were used to create a total risk score by taking the product of each risk score weight and the corresponding risk measure value and then summing the products. The total risk scores were then used to fit a logistic regression model that translates the total risk score into a probability of mortality (eTable4). The logistic regression models reported in eTable4 are somewhat redundant for simply calculating the probability of mortality; however, the total risk score is on a more convenient scale than the linear combination of covariate values and parameter estimates, and the models in eTable4, which use the total risk score as the only covariate, show how changes in the total risk score change the probability of mortality. The predictive accuracy of the models was assessed using the entire validation data to evaluate the quality of the logistic regression-based prediction model relative to the machine learning models.
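The scoring construction can be sketched as follows. The data and base model here are toy placeholders, but the weight construction (coefficients multiplied by 10 and rounded to one decimal place) and the score-only logistic model follow the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the modified logistic regression model's data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
base = LogisticRegression(max_iter=1000).fit(X, y)

# Risk score weights: coefficients x 10, rounded to one decimal place.
weights = np.round(10 * base.coef_[0], 1)

# Total risk score: sum of weight * risk measure value per record.
total_score = X @ weights

# Score-only logistic model translating total risk score into a
# probability of mortality, as in eTable4.
score_model = LogisticRegression().fit(total_score.reshape(-1, 1), y)
prob = score_model.predict_proba(total_score.reshape(-1, 1))[:, 1]
```

Because the score-only model has a single covariate, its intercept and slope fully describe the mapping from total risk score to predicted probability, which is what makes the system portable in print.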

Sample size justification
One rule of thumb for predictive modeling is 10 events per variable; however, machine learning models require a relatively large sample size for good predictive accuracy. For some machine learning models, Ploeg et al. have shown via simulation that an events-per-variable ratio of >200 may be needed to achieve stable estimates of predictive accuracy. 8 Depending on the scenario, the data used in this analysis were deemed sufficiently large either because the events-per-variable ratio was greater than 200, or because the ratio was greater than 10 and the data were split in such a way that models were validated twice and estimates of uncertainty in predictive accuracy were quantified.
Before data splitting, the events-per-variable ratio for the prenatal analysis was 1477 for the combined outcome of total fresh stillbirths and total neonatal deaths. After setting aside 60% of the data for training, the training data set had an events-per-variable ratio of 886. Before data splitting, the events-per-variable ratio for the pre-delivery analysis was 738 for the same combined outcome. After setting aside 60% of the data for training, the training data set had an events-per-variable ratio of 443. Therefore, the available sample size was considered large enough to adequately train models on the prenatal and pre-delivery data.
Before data splitting, the events-per-variable ratios for the delivery/day-1 and post-delivery/day-2 analyses were 152 and 104, respectively, for the outcome of neonatal deaths. The training data sets of 60% of the total data had events-per-variable ratios of 91 and 62, respectively.
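The training-set ratios follow directly from the 60% split, as a quick arithmetic check (the rounding to whole numbers is an assumption about how the reported values were obtained):

```python
def epv(n_events, n_predictors):
    # Events-per-variable ratio: events divided by candidate predictors.
    return n_events / n_predictors

# A 60% training split scales the full-data ratio by 0.6, e.g. for the
# delivery/day-1 and post-delivery/day-2 analyses:
delivery_train_epv = round(0.6 * 152)       # from the reported ratio of 152
postdelivery_train_epv = round(0.6 * 104)   # from the reported ratio of 104
```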
Although the events-per-variable ratios for the delivery/day-1 and post-delivery/day-2 analyses were not greater than 200, they were greater than the rule of thumb of 10 events per variable. Our method of repeating the model fitting process 10 times to estimate predictive accuracy on test data, and then repeating the process again to estimate predictive accuracy on validation data, was considered sufficient to quantify the variability in overall predictive accuracy assessments and to identify cases where the model building process was unreliable.

Table footnotes:
3 Parity must be standardized by subtracting 1.788 and then dividing by 2.1019.
4 Gestational age at delivery must be standardized by subtracting 38.556 and then dividing by 3.5519.
5 Gestational age at enrollment must be standardized and then squared.
6 Gestational age at enrollment must be standardized, and birthweight must be standardized; then the two must be multiplied.
7 Parity must be standardized and then multiplied by the indicator for mother's age between 20-35 years at enrollment.
8 Parity and birthweight must be standardized; then the two must be multiplied.
9 The indicator for vaginal delivery must be multiplied by the indicator for the mother having obstructed/prolonged labor/failure to progress.
10 Birthweight must be standardized, then squared.
11 Birthweight must be standardized and then multiplied by the indicator for bag and mask resuscitation.
12 The indicator for the mother being <20 years old at enrollment must be multiplied by the indicator for the neonate having at least one condition requiring hospitalization.
13 Gestational age at delivery must be standardized and then squared.
14 Birthweight must be standardized and then multiplied by the indicator for the neonate having at least one condition requiring hospitalization.
Risk score calculation as explained in eTable3. AUC=Area under the curve.
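The standardization rules in the footnotes can be written out directly; the constants come from the text, while the helper names are illustrative.

```python
# Standardization constants taken from the table footnotes.
def std_parity(parity):
    return (parity - 1.788) / 2.1019

def std_ga_delivery(ga_weeks):
    return (ga_weeks - 38.556) / 3.5519

# Derived terms follow the same pattern, e.g. the squared term from
# footnote 13 and a standardized-product interaction:
ga_z = std_ga_delivery(40.0)
ga_sq = ga_z ** 2                     # footnote 13: standardize, then square
parity_z = std_parity(2.0)
```

A value equal to the subtraction constant standardizes to zero, and a value one divisor above it standardizes to one, which gives a quick sanity check on the constants.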