Development and Validation of a Deep Learning Model to Predict the Occurrence and Severity of Retinopathy of Prematurity

Key Points
Question: Can a deep learning system provide reliable prediction of retinopathy of prematurity (ROP) using retinal photographs and clinical characteristics?
Findings: In this prognostic study including data from 815 infants, the mean areas under the receiver operating characteristic curve (AUCs) of the system were 0.90 and 0.87 in predicting the occurrence and severity of ROP, respectively. For the external validation set, the AUCs were 0.94 and 0.88, respectively.
Meaning: These findings suggest the feasibility of using deep learning approaches to predict ROP with high accuracy and generalizability.


eAppendix 1. Retinopathy of Prematurity Screening and Follow-up Schedule
All included infants completed continuous follow-up until 45 weeks of postmenstrual age (PMA) and underwent serial ophthalmoscopic examinations under pupillary dilation according to the current screening guidelines.1 Infants with gestational age (GA) <28 weeks received their initial retinal examination at 31 weeks of PMA, and infants with GA ≥28 weeks had their first retinal examination at 4 weeks of postnatal age.1 Follow-up was also scheduled according to the guidelines: infants with type II ROP or suspected aggressive posterior retinopathy of prematurity (ROP) required follow-up within 1 week; infants with zone II stage 2 ROP without plus disease, within 1 to 2 weeks; infants with zone II stage 1 ROP, within 2 weeks; and infants with zone III ROP of any stage, within 2 to 3 weeks.1

Image Preprocessing
During image preprocessing, saturated pixels with an intensity value of 255 or more in the retinal photographs were discarded, and the block-matching and 3D filtering (BM3D) method was employed to denoise and smooth the retinal photographs. All retinal photographs of each case were then resized to 256×256 pixels for the residual network (ResNet)-50 (Microsoft Research).

Deep Feature Extraction and Feature Vector Construction
In the ResNet-50, convolutional parameter layers were used for iterative filter learning, transforming input images into hierarchical feature maps and learning discriminative features at varying spatial levels without the need for manually tuned parameters. These convolutional layers were positioned sequentially, each layer transforming its input and propagating the output to the next layer. Finally, the learned deep features were extracted from the global average pooling layer, representing the average activations of each unit in that layer and yielding 512 features. The 512 highly abstracted features of each retinal photograph were concatenated with 46 clinical characteristics of the same case into a final 558-dimensional vector for each retinal photograph.
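The preprocessing and feature-vector construction above can be sketched minimally with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the BM3D denoising step is omitted (it would normally come from a dedicated library), saturated pixels are zeroed rather than interpolated, and a nearest-neighbor resize stands in for whatever interpolation was actually used.

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 256) -> np.ndarray:
    """Mask saturated pixels and resize with nearest-neighbor sampling.

    Simplified illustration: BM3D denoising is omitted, and discarded
    saturated pixels are simply set to zero here.
    """
    img = image.astype(np.float32)
    img[img >= 255] = 0.0  # discard saturated pixels
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # nearest-neighbor row indices
    cols = np.arange(size) * w // size  # nearest-neighbor column indices
    return img[rows][:, cols]

def build_feature_vector(deep_features: np.ndarray,
                         clinical: np.ndarray) -> np.ndarray:
    """Concatenate 512 deep features with 46 clinical characteristics."""
    assert deep_features.shape == (512,) and clinical.shape == (46,)
    return np.concatenate([deep_features, clinical])  # 558 dimensions

vec = build_feature_vector(np.zeros(512), np.ones(46))
```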

Prediction Model Training
A deep neural network (DNN) was trained on the representative 558-dimensional feature vector to generate a predictive probability for each retinal photograph. We adopted 3 different training schemes to output the predictive labels based on a probability threshold of 0.30 for the occurrence-network (OC-Net) and 0.45 for the severity-network (SE-Net), respectively. In the OC-Net, the prediction label was 1 for the annotation "ROP" and 0 for the annotation "normal". In the SE-Net, the prediction label was 1 for the annotation "severe ROP" and 0 for the annotation "mild ROP".
First, under the majority voting method, we calculated the mean predictive probability of all retinal photographs from the same case. The prediction label of each case was set to 1 (mean predictive probability equal to or larger than the probability threshold) or 0 (mean predictive probability smaller than the probability threshold). Second, under the one-vote veto method, we set the prediction label of each retinal photograph to 1 (predictive probability equal to or larger than the probability threshold) or 0 (predictive probability smaller than the probability threshold); if the prediction label of any retinal photograph in a case was 1, the case was labeled 1. Third, under the image-level method, we used the ground truth label of each case as the ground truth label of all corresponding retinal photographs. We then set the prediction label of each retinal photograph to 1 or 0 by the same thresholding rule and evaluated the prediction accuracy at the image level. Through these 3 training schemes, patient-level and image-level predictive labels could be respectively obtained at this stage. The PyTorch code used in this manuscript can be found on GitHub at https://github.com/yaoMYZ/ROP.
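The three label-aggregation schemes above can be expressed compactly. This is a minimal sketch of the thresholding logic only, assuming the per-photograph probabilities have already been produced by the DNN:

```python
import numpy as np

def majority_voting(probs, threshold):
    """Case label from the mean probability of all photographs of a case."""
    return int(np.mean(probs) >= threshold)

def one_vote_veto(probs, threshold):
    """Case is positive if any single photograph reaches the threshold."""
    return int(np.any(np.asarray(probs) >= threshold))

def image_level(probs, threshold):
    """Per-photograph labels, each compared with the case ground truth."""
    return [int(p >= threshold) for p in probs]

# Example with the OC-Net threshold of 0.30:
probs = [0.10, 0.25, 0.60]
case_mv = majority_voting(probs, 0.30)  # mean is about 0.317, so label 1
case_ov = one_vote_veto(probs, 0.30)    # one photograph >= 0.30, so label 1
per_img = image_level(probs, 0.30)      # [0, 0, 1]
```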

Cross-validation on the Training Set Using Different Training Schemes
Five-fold cross-validation was used for internal validation of the 3 training schemes. The occurrence dataset and the severity dataset were each randomly and equally divided into 5 independent subsamples, 4 of which were used to train the OC-Net and SE-Net of each training scheme, respectively, with the remaining one used for internal validation and fine-tuning. This procedure was repeated until each subsample had been used as the validation set. Among the 3 training schemes, both the OC-Net and SE-Net of the majority voting scheme achieved the best overall performance (eTable 2 in the Supplement). Thus, the majority voting scheme was deployed in the deep learning system for further validation under the requirement of 100% sensitivity.
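The 5-fold split above can be sketched as follows. This is a minimal NumPy illustration of the rotation over folds; the actual training and fine-tuning of the OC-Net and SE-Net are indicated only by a comment:

```python
import numpy as np

def five_fold_indices(n_cases: int, seed: int = 0):
    """Randomly split case indices into 5 near-equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_cases), 5)

folds = five_fold_indices(100)
for k, val_idx in enumerate(folds):
    # the 4 remaining folds form the training set for this round
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # train OC-Net / SE-Net on train_idx; validate and fine-tune on val_idx
```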

eFigure 1.
Distributions of Birth Weight and Gestational Age of All Included Infants
A, Each bar represents the number of infants whose birth weight was within the given range. B, Each bar represents the number of infants whose gestational age was within the given weeks.

eFigure 2.
Weight Ratios of Different Characteristics
The absolute values of the weights of each characteristic in the last fully connected layers of the deep learning (DL) system were summed to obtain the corresponding importance factors. Then, the importance factors of all 46 characteristics were normalized to obtain the final weight ratios. Each bar indicates the importance of a characteristic for the DL system, averaged over 5 test runs. A, The blue bars represent the weight ratios of the OC-Net. B, The orange bars represent the weight ratios of the SE-Net. The higher the bar, the more important the corresponding clinical characteristic is for the prediction task. Abbreviations: RBC, red blood cell; MV, mechanical ventilation; RDS, respiratory distress syndrome; HIE, hypoxic ischemic encephalopathy; CRP, C-reactive protein; IVF-ET, in vitro fertilization and embryo transfer; WBC, white blood cell.
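The importance-factor computation described in this caption can be sketched as follows. This is an illustration under an assumed layout, namely that the last fully connected layer's weights are available as an array of shape (n_outputs, 46), with one column per clinical characteristic:

```python
import numpy as np

def weight_ratios(fc_weights: np.ndarray) -> np.ndarray:
    """Sum absolute weights per characteristic, then normalize to ratios.

    fc_weights: assumed (n_outputs, n_characteristics) weight matrix of
    the last fully connected layer; the 46 columns would correspond to
    the clinical characteristics.
    """
    importance = np.abs(fc_weights).sum(axis=0)  # importance factors
    return importance / importance.sum()         # normalized weight ratios

ratios = weight_ratios(np.random.default_rng(0).normal(size=(2, 46)))
```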

eFigure 3.
Calibration Plots for Evaluating the Calibration Ability of the Deep Learning System Using the Majority Voting Scheme
Calibration plots for the observed proportion of occurrence and severe type of retinopathy of prematurity versus predictive probability, obtained from the OC-Net (A) and SE-Net (B), respectively.

eFigure 4.
Original Retinal Photographs and Saliency Maps of Normal and Retinopathy of Prematurity Cases From Grad-CAM
Grad-CAM is used to generate color saliency heat maps that highlight the regions playing important roles in the final prediction of the occurrence-network (A and B) and the severity-network (C and D) under the majority voting scheme. In the saliency maps, red regions indicate a stronger contribution than green regions, and blue regions have low to no contribution to the retinopathy of prematurity prediction.
eTable 1. Comparison of Birth Characteristics Between the Included and Excluded Infants
a Chi-square test; b unpaired Mann-Whitney test. Abbreviation: SD, standard deviation.

eTable 3. Comparison of Birth Characteristics of Infants With False-Negative and True-Positive Results for Prediction of Occurrence of Retinopathy of Prematurity
a Chi-square test; b unpaired Mann-Whitney test. Abbreviations: IVF-ET, in vitro fertilization and embryo transfer; MV, mechanical ventilation; CRP, C-reactive protein; RBC, red blood cell; WBC, white blood cell; SD, standard deviation.

Abbreviations: AUC, area under the receiver operating characteristic curve; OC-Net, occurrence-network; SE-Net, severity-network; CI, confidence interval.