Prediction of 2-Year Cognitive Outcomes in Very Preterm Infants Using Machine Learning Methods

Key Points Question Can easily available neonatal data identify very preterm infants who will exhibit cognitive delay later in life? Findings In this prognostic study of cognitive outcomes at 2-year follow-up among 1062 infants born very preterm, a logistic regression model containing 26 neonatal features identified 93% of very preterm infants who screened positive for cognitive delay at 2-year follow up, with a specificity of 46%. Meaning Use of this model could target those very preterm infants at the highest risk of cognitive delay to receive early and effective intervention.


Results of Internal Validation Using Ten-Fold Cross Validation for Models A-D eReferences
This supplementary material has been provided by the authors to give readers additional information about their work.

Rebalancing dataset
An imbalanced dataset is one where the classification categories are not approximately equally represented. 1 In registry data and cohort data this is often the case, where datasets are composed largely of those without the condition or disease of interest.Most machine learning algorithms assume a roughly balanced class distribution.When class imbalance is present the decision boundary is biased toward the majority class. 2 An algorithm which learns from class-imbalanced data tends to have poor predictive accuracy for samples in the minority class when it is shown new data.It tends to classify most new samples as being in the majority class. 1,3In our original dataset there were n=231 infants with cognitive delay at 2 year follow up and n= 831 infants without delay.Training the algorithms on this class imbalanced data would likely result in a classifier bias towards the majority class (i.e.typical cognitive development).To address this, the synthetic minority oversampling technique (SMOTE) was applied to the training dataset. 1 The aim of the SMOTE algorithm is to create a balanced dataset where the outcome classes, in this case 'cognitive delay' and 'typical cognitive development', have approximately equal observations.It achieves this through a combination of oversampling the minority class and undersampling the majority class.Unlike other oversampling techniques which simply duplicate minority class cases, SMOTE creates synthetic examples in the minority class. 1 Each minority class sample can be considered as operating in a 'feature space'. 1 The algorithm randomly selects k minority class samples from a sample's nearest minority class neighbours (k is a parameter set by the user).Then, synthetic examples are created at random locations along the lines joining the k minority class neighbours.This is repeated until the desired amount of oversampling, set by the user using the 'perc.over'parameter, is achieved.In our study, oversampling was set to perc.over = 200 and the number of nearest neighbours used to generate new samples was set to k=5.For each existing sample categorised as 'cognitive delay' in the training dataset (n=162) two new samples were generated using information from the 5 nearest neighbours of each 'cognitive delay' sample, resulting in n=486 samples categorised as 'cognitive delay' in the rebalanced dataset. 1,4Undersampling of the majority class, if required, is achieved by random removal of samples from the majority class.In our study SMOTE was applied only to the training dataset.This ensured that the final model was tested on completely unseen and untouched real life data.

Feature Effect
Feature effect was explored using partial dependence plots (PDPs) created with the 'iml' package. 5The PDPs shows the average predicted probability of cognitive delay (CD) for a given feature value on the x-axis.For every value x of the feature of interest, model predictions are calculated and averaged across every observation, based on the observation having x as the value for that feature. 5A lower value on the y-axis suggests that CD is less likely at that value of the feature on x-axis.Conversely, a higher value on the y-axis suggests a higher probability of CD.PDPs ignore whether the value x is plausible or even possible for that observation, therefore making plot estimations unreliable in the presence of correlated features.Plots should also be interpreted with caution for values with few or no observations.The feature effect for the ten most important features in Model B and Model D are shown in eFigure 1 and eFigure 2. In Model B head circumference was strongly correlated with gestational age (Pearson correlation coefficient (r) = 0.80), birthweight (r = 0.82) and duration of hospitalisation (r = -0.63).Therefore, for the plot showing the effect of head circumference, the correlated features were removed from the model.Similarly, the plot for birthweight excludes the effect of gestational age (r = 0.71) and head circumference (r=0.82).The same procedure was performed for correlated features in Model D. The PDPs show how a machine learning model, such as the gradient boosted machine model in eFigure 3, can model non-linear relationships.On the y-axis of the plots is the probability of cognitive delay and on the x-axis is either the categorical feature label or the continuous feature value.The distribution of feature values is shown along the x-axis.The predicted probability of CD was higher for families with who reported a non-Scandinavian family language, for those who were not discharged to home, for boys, for those who received no breastmilk at discharge, for those with Grade 4 IVH and for those with Non IVH intracranial haemorrhage.As the duration of hospitalisation increased, the predicted probability of CD increased.As head circumference and birthweight increased, the predicted probability of CD decreased.

eFigure 1 .
Feature selection using Boruta algorithm eFigure 1 Legend.CLD -Maternal chronic lung disease, NR -Neonatal resuscitation, cardiac_comp -cardiac compressions, CPAP -Continuous positive airway pressure, TTN -Transient tachypnoea of the newborn, CNS -Central nervous system, vent -ventilation, PPHN -Persistent pulmonary hypertension of the newborn, RBC -Red blood cell, RDS -Respiratory distress syndrome, ROP -Retinopathy of prematurity, Umb_ven_cath -Umbilical venous catheter, Umb_art_cath -Umbilical arterial catheter, IUGR -Intrauterine growth restriction, NEC -Necrotising enterocolitis, PDA -Patent ductus arteriosus, PROM -Preterm rupture of membranes, IVH -Intraventricular hemorrhage, BPD -Bronchopulmonary dysplasia, Gest_diabetes -Gestational diabetes, cPVL -Cystic periventricular leukomalacia, Duration_conv_mech_vent -Duration of conventional mechanical ventilation, HFOV -High frequency oscillatory ventilation.A standardized feature importance score is shown in the y-axis.The blue boxplots represent the minimum, mean, and maximum z-scores of the shadow attributes which are created by randomly shuffling the values of the original features.The green, yellow, and red boxplots represent the z-scores of the confirmed, tentative, and rejected features respectively.

eFigure 2 .
Features selected by Boruta algorithm with correlation coefficients >0.70 eFigure 2 Legend.Gest age -Gestational age, Z.score BW -Z score for birthweight, Head circum -Head circumference, Dur mech vent -Duration of all mechanical ventilation in days, Dur CPAP -Duration of continuous positive airway pressure in days, Dur convent vent -Duration of conventional mechanical ventilation in days, Dur HFOV -Duration of high frequency oscillatory ventilation in days, Dur Extra O2 -Duration of extra oxygen, Dur hospital -Duration of hospitalisation, Dur hosp + home -Duration of hospital and home care eFigure 3. Receiver operating characteristic curves of Models A-D eFigure 3 Legend: The receiver operating characteristic (ROC) curves for Models A-D tested on the unseen dataset are shown.The sensitivity is shown on the y-axis and 1-Specificity on the x-axis.To ensure the performance of Model B (selected as our optimal model) was not affected by bias introduced by a single 70-30 split of the dataset, we performed ten random simulations of the 70-30 split and tested the AUROC across each unseen dataset.The mean AUROC of Model B across the ten simulations of the 70-30 split was 0.76 (95% Confidence Interval 0.69 -0.82).

eFigure 4 .
Feature effect plots for Model B (Logistic Regression) eFigure 4 Legend: a Model excludes gestational age and head circumference.b Model excludes gestational age, birthweight and duration of hospitalisation.

eFigure 5 .
Feature effect plots for Model D (Gradient Boosted Machine) eFigure 5 Legend: a Model excludes gestational age and head circumference, b Model excludes duration of hospitalisation, c Model excludes Duration of CPAP in days, d Model excludes birthweight and gestational age.On the y-axis of the plots is the probability of cognitive delay and on the x-axis is either the categorical feature label or the continuous feature value.The distribution of feature values is shown along the x-axis.For plots examining continuous features, for values where there are few or no observations the plots should be interpreted with caution as estimates may be unreliable.
Features Selected by Boruta Algorithm With Correlation Coefficients >0.70 eFigure 3. Receiver Operating Characteristic Curves of Models A-D eFigure 4. Feature Effect Plots for Model B (Logistic Regression) eFigure 5. Feature Effect Plots for Model D (Gradient Boosted Machine) eTable 1. Features With >25% Missing Values eTable 2. Missing Data for 90 Features Considered in Modelling Process eTable 3. Characteristics of Study Population eTable 4. Final 26 Features Included in Model eTable 5.

Missing data for 90 features considered in modelling process
Categorizations of parental education, timing of rupture of membranes, and timing of antenatal corticosteroids reflect how these variables were reported to the Swedish Neonatal Quality Register.

Final 26 features included in model
Fishers exact test, b Pearson's Chi-squared test, c Welch two sample t test, d Wilcoxan rank sum test e Defined as a registered ICD-10 code for BPD (P27.1) or a registration of supplemental oxygen use at 36 weeks of postmenstrual age f Reasons for not completing: Child declined or would not participant n=25, Inattention, hyperactivity or fatigue n=7, Language barrier n=1, Parent declined n=5, Resources or administrative reason n=6, child unable to complete n=14, reason unclear or not recorded n=6.

Results of internal validation using ten-fold cross validation for Models A-D
An initial random gridsearch was performed first to inform the range of tuning parameters chosen for training process. a