Automated Explainable Multidimensional Deep Learning Platform of Retinal Images for Retinopathy of Prematurity Screening
Figure 1.  The Receiver Operating Characteristic (ROC) Curves for System Performance

The ROC for detecting any stage of retinopathy of prematurity (ROP) (A), intraocular hemorrhage (B), and preplus/plus disease (C). The area under the curve (AUC) values of the training, validation, and test sets for each classifier are shown (D). The ROC for referral-warranted ROP (RW ROP) was obtained by aggregating the results of 4 classifiers (stage, hemorrhage, posterior, and preplus/plus). Any positive finding of an ROP-related feature results in an RW classification, and the AUCs at the image, eye, and patient levels are shown.

Figure 2.  Visualization of Stage and Hemorrhage in Heat Maps

The first and second columns indicate the original and preprocessed retinal images, respectively. The third and fourth columns are heat maps generated by Class Activation Mapping (CAM) and DeepSHAP, respectively. A, The original image presents both the stage of ROP and retinal hemorrhage on the peripheral retina. The upper row contains heat maps showing the stage of ROP, whereas the lower row contains heat maps showing retinal hemorrhages. B, The image presents both the stage of ROP and a reflection. Although the lesion shares a similar morphology with the reflection, it is successfully recognized. C, The image shows retinal hemorrhages and many artifacts; however, the hemorrhage area is highlighted by the heat map. DeepSHAP shows a more fine-grained heat map than CAM for each feature.

Table 1.  Demographic Details of Patients From 4 Centers
Table 2.  Performance of 5 Classifiers of the Platform
Table 3.  Performance Comparison Between Human Experts and J-PROP
    Original Investigation
    Ophthalmology
    May 5, 2021

    Automated Explainable Multidimensional Deep Learning Platform of Retinal Images for Retinopathy of Prematurity Screening

    Author Affiliations
    • 1Joint Shantou International Eye Center of Shantou University, the Chinese University of Hong Kong, Shantou, Guangdong, China
    • 2Network and Information Center, Shantou University, Shantou, Guangdong, China
    • 3XuanShi Med Tech (Shanghai) Company Limited, Shanghai, China
    • 4Department of Ophthalmology, The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People’s Hospital, Qingyuan, Guangdong, China
    • 5Department of Ophthalmology, Guangdong Women and Children Hospital, Guangzhou, Guangdong, China
    • 6Shantou University Medical College, Shantou, Guangdong, China
    • 7Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China
    JAMA Netw Open. 2021;4(5):e218758. doi:10.1001/jamanetworkopen.2021.8758
    Key Points

    Question  Can deep learning algorithms achieve a performance comparable with that of ophthalmologists on multidimensional identification of retinopathy of prematurity (ROP) using wide-field retinal images?

    Findings  In this diagnostic study of 14 108 eyes of 8652 preterm infants, a deep learning–based ROP screening platform could identify retinal images using 5 classifiers, including image quality, stages of ROP, intraocular hemorrhage, preplus/plus disease, and posterior retina. The platform achieved an area under the curve of 0.983 to 0.998, and the referral system achieved an area under the curve of 0.9901 to 0.9956; the platform achieved a Cohen κ of 0.86 to 0.98 compared with 0.93 to 0.98 by the ROP experts.

    Meaning  Results suggest that a deep learning platform could identify and classify multidimensional ROP pathological lesions in retinal images with high accuracy and could be suitable for routine ROP screening in general and children’s hospitals.

    Abstract

    Importance  A retinopathy of prematurity (ROP) diagnosis currently relies on indirect ophthalmoscopy assessed by experienced ophthalmologists. A deep learning algorithm based on retinal images may facilitate early detection and timely treatment of ROP to improve visual outcomes.

    Objective  To develop a retinal image–based, multidimensional, automated, deep learning platform for ROP screening and validate its performance accuracy.

    Design, Setting, and Participants  A total of 14 108 eyes of 8652 preterm infants who received ROP screening from 4 centers from November 4, 2010, to November 14, 2019, were included, and a total of 52 249 retinal images were randomly split into training, validation, and test sets. Four main dimensional independent classifiers were developed, including image quality, any stage of ROP, intraocular hemorrhage, and preplus/plus disease. Referral-warranted ROP was automatically generated by integrating the results of 4 classifiers at the image, eye, and patient levels. DeepSHAP, a method based on DeepLIFT and Shapley values (solution concepts in cooperative game theory), was adopted as the heat map technology to explain the predictions. The performance of the platform was further validated as compared with that of the experienced ROP experts. Data were analyzed from February 12, 2020, to June 24, 2020.

    Exposure  A deep learning algorithm.

    Main Outcomes and Measures  The performance of each classifier included true negative, false positive, false negative, true positive, F1 score, sensitivity, specificity, receiver operating characteristic, area under curve (AUC), and Cohen unweighted κ.

    Results  A total of 14 108 eyes of 8652 preterm infants (mean [SD] gestational age, 32.9 [3.1] weeks; 4818 boys [60.4%] of 7973 with known sex) received ROP screening. The performance of all classifiers achieved an F1 score of 0.718 to 0.981, a sensitivity of 0.918 to 0.982, a specificity of 0.949 to 0.992, and an AUC of 0.983 to 0.998, whereas that of the referral system achieved an F1 score of 0.898 to 0.956, a sensitivity of 0.981 to 0.986, a specificity of 0.939 to 0.974, and an AUC of 0.9901 to 0.9956. Fine-grained and class-discriminative heat maps were generated by DeepSHAP in real time. The platform achieved a Cohen unweighted κ of 0.86 to 0.98 compared with a Cohen κ of 0.93 to 0.98 by the ROP experts.

    Conclusions and Relevance  In this diagnostic study, an automated ROP screening platform was able to identify and classify multidimensional pathologic lesions in retinal images. This platform may be able to assist routine ROP screening in general and children’s hospitals.

    Introduction

    Retinopathy of prematurity (ROP) is a leading cause of visual impairment and irreversible blindness in children worldwide, mainly affecting preterm infants with extremely low birth weight and those who are small for gestational age. Approximately 1.2% (184 700 of 14 900 000) of preterm infants worldwide have been estimated to have ROP, of whom approximately 30 000 have permanent visual impairment.1 Poor visual outcomes from ROP can be largely avoided if ROP is detected early and treated appropriately.2

    In clinical scenarios, 3 ROP-related features (the stages of ROP and preplus/plus disease [considered specific features] and intraocular hemorrhage [considered a risk-indicative feature])3-5 have been adopted in ROP detection among preterm infants. According to the International Classification of Retinopathy of Prematurity, stages 1 to 5 of ROP are defined as abnormal response of immature vasculature in the retina, with increasing severity from stage 1 to 5. Preplus/plus disease is a continuum of abnormal changes with dilatation and tortuosity of posterior pole retinal vessels, indicating the need for intensive observation or treatment.6,7 In addition, intraocular hemorrhage is reported as a frequent predictor of the presence of ROP and poor outcomes in preterm infants.8,9 The standard method for ROP diagnosis relies on indirect ophthalmoscopy, which requires assessments performed by experienced ophthalmologists. In remote areas and places where ROP expertise is not readily available, a delayed or missed diagnosis of ROP can lead to vision loss.10 The development of an automated ROP screening platform that can meet the diagnostic criteria should facilitate timely treatment for patients.

    Deep learning (DL) algorithms, especially convolutional neural networks (CNNs), have been widely applied in medical image analysis for different diseases, including glaucoma, intracranial hemorrhage, and lung cancers.11,12 Image-based automated ROP screening systems using deep CNNs have also been developed. Specifically, Hu et al13 focused on the stages of ROP detection at the image level and used the Guided Backpropagation14 algorithm to visualize the lesion based on a data set of 5511 retinal images; Brown et al15 developed a classifier for the differentiation of normal from the preplus and plus diseases using images of the posterior retina; and Wang et al16 developed a model of 2 deep CNN networks for classifying ROP into gradations of normal, minor, and severe. Their model adopted the multi-instance learning17 method and only generated eye-level results. However, these studies focused only on a single-dimensional classifier. Herein, we aimed to develop an automated multidimensional platform for ROP detection and screening using retinal images.

    In this study, to address the previously mentioned problems, we developed an automated classification system covering 4 independent main classifiers (image quality, any stage of ROP, intraocular hemorrhage, and preplus/plus disease) and 1 auxiliary parameter (the posterior retina). We also developed an algorithm for the referral recommendation by integrating different outcomes from multiple dimensional analyses. The performance of our automated platform was further validated and compared with that of the ROP experts. This cloud-based platform was opened for external validation.

    Methods

    This diagnostic study, conducted from September 1, 2018, to June 24, 2020, was performed in compliance with the Declaration of Helsinki18 and approved by the Human Medical Ethics Committee of Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong. Written informed consent was waived because the retinal images used for platform development were deidentified. This study followed the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline.

    Data Sets

    Retinal images of infants taken by corneal contact retinal cameras RetCam II or III (Clarity Medical Systems) for ROP screening were collected from 4 centers in southern China: Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong (JSIEC), Guangdong Women and Children Hospital in Yuexiu branch (Yuexiu) and Panyu branch (Panyu), and the Sixth Affiliated Hospital of Guangzhou Medical University and Qingyuan People’s Hospital (Qingyuan) (eFigure 1 in the Supplement). All retinal images including those of the normal fundus or those displaying any ROP were included. Exclusion criteria comprised (1) nonfundus photos or fundus photos taken by imaging devices other than RetCam; (2) infants with other ocular diseases, eg, congenital cataract, retinoblastoma, or persistent hyperplastic primary vitreous; and (3) any images with disagreeing labels.

    Image Labeling and Grading

    Binary classification was used to categorize each of the 5 dimensions related to ROP retinal image diagnosis; this system met the requirements of the screening application, including (1) image quality (defined as gradable or ungradable; ungradable images were defined as those of poor quality with significant blur, darkness, defocus, poor exposure, or numerous artifacts such that retinal features could not be identified; the remaining images were classified as gradable), (2) any stage of ROP (defined as any stage or nonstage; any stage was assigned to images with any stage of ROP identified, whereas nonstage was assigned to those without any stage of ROP), (3) intraocular hemorrhage (defined as hemorrhage or nonhemorrhage; hemorrhage was assigned to images with any identifiable hemorrhage), (4) preplus/plus disease (defined as preplus/plus or non–preplus/plus; preplus/plus described a spectrum of posterior retinal vessel abnormalities including venous dilation and arteriolar tortuosity, whereas non–preplus/plus described normal vessels in the posterior retina), and (5) posterior retina (defined as posterior or nonposterior). To accurately identify preplus/plus disease, the region of the posterior retina had to be defined. In this study, the posterior retina was defined as a circular area centered at the optic disc with a radius 3 times the diameter of the optic disc. Any portion of the image within this predefined area was classified as within the posterior pole.
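
    The posterior-pole rule above is purely geometric, so it can be expressed directly in code. The following is a minimal sketch (not the authors' implementation), assuming the optic disc center and diameter in pixels are already available from the detection step described in the next subsection; the function name and example values are illustrative.

```python
# Illustrative sketch: posterior-pole mask as a circle centered at the optic disc
# with radius 3x the disc diameter, following the rule stated in the text.
import numpy as np

def posterior_pole_mask(image_shape, disc_center, disc_diameter_px, radius_factor=3.0):
    """Return a boolean mask of the posterior pole region.

    image_shape: (height, width) of the retinal image.
    disc_center: (row, col) of the optic disc center in pixels.
    disc_diameter_px: optic disc diameter in pixels.
    """
    h, w = image_shape
    rows, cols = np.ogrid[:h, :w]
    radius = radius_factor * disc_diameter_px
    dist2 = (rows - disc_center[0]) ** 2 + (cols - disc_center[1]) ** 2
    return dist2 <= radius ** 2

# Example with hypothetical values: fraction of the image lying inside the posterior pole.
mask = posterior_pole_mask((1200, 1600), disc_center=(600, 800), disc_diameter_px=150)
print(mask.mean())
```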

    The ground truth (criterion-standard) labels were determined by a group of ophthalmologists. The graders were trained according to our previously published protocol.19 Briefly, 2 trained junior ophthalmologists labeled independently, and the images with disagreeing labels were submitted to a senior ophthalmologist. If the decision was still uncertain, the label would be determined by an experienced ROP expert (G.Z.). Finally, the images with agreeing labels were kept for the automated system training, validation, and test data sets. Images with disagreeing labels were excluded. In addition, the optic disc and blood vessels for preplus/plus disease identification were labeled by an experienced grader.

    Classifiers and Platform Development

    The pipeline of our system is shown in eFigure 2 in the Supplement. An image was first evaluated for image quality. If it was predicted as ungradable, a recommendation to rephotograph was given. If the image was predicted as gradable, it entered the main pipeline. The main structure of the system was a multilabel classification20 and a postprocessing method that aggregated single-dimensional results to image-level results and image-level results to eye-level and patient-level results using max pooling. On a single image, ROP diagnosis could be viewed as a multilabel classification in that 1 image can present multiple features (stages of ROP, hemorrhage, and preplus/plus disease) simultaneously. This classification system was implemented using multiple independent classifiers based on binary relevance rather than a single multilabel network, because preplus/plus disease classification required an independent and more complex pipeline. For model ensembling, every classification task was implemented using a set of different neural networks. Dynamic data resampling and cost-sensitive learning were used simultaneously to resolve the class imbalance. Model ensembling and test-time image augmentation were used to improve accuracy and make predictions robust to small perturbations.21 Label smoothing22 was used to calibrate the predicted probabilities. Because preplus/plus disease has fewer positive samples than the other classifiers and its presence is attributed only to the blood vessels in the posterior pole, preplus/plus classification was considered a fine-grained classification and was implemented using an independent pipeline. An input image first had to be judged as to whether it was a posterior image; nonposterior images were regarded as non–preplus/plus. If the image was a posterior image, the blood vessels were extracted using a patch-based DL technique called Res-UNet.23,24 The optic disc was detected using a Mask R-CNN,25 and the posterior region was calculated based on the optic disc. Afterward, the blood vessels in the posterior pole area were cropped and input into the final classifier, a set of neural networks that classified whether the image showed preplus/plus disease. More details are available in eMethods 1 in the Supplement.
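
    As a concrete illustration of the binary relevance design described above, the sketch below (PyTorch) shows one possible way to set up independent per-feature classifiers, an ensemble that averages their probabilities, and a label-smoothed binary cross-entropy loss. It is a minimal sketch under stated assumptions, not the authors' code: the backbone (ResNet-50), the ensemble size, and the smoothing factor are placeholders.

```python
# Illustrative sketch: binary relevance = one independent binary classifier per
# ROP-related feature; an ensemble averages sigmoid probabilities, and the loss
# uses label-smoothed targets. Backbone and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

def make_binary_classifier():
    net = models.resnet50(weights=None)        # backbone choice is an assumption
    net.fc = nn.Linear(net.fc.in_features, 1)  # single logit: feature present vs absent
    return net

class EnsembleBinaryClassifier(nn.Module):
    """Averages sigmoid probabilities over several independently trained networks."""
    def __init__(self, n_members=3):
        super().__init__()
        self.members = nn.ModuleList(make_binary_classifier() for _ in range(n_members))

    def forward(self, x):
        probs = [torch.sigmoid(m(x)) for m in self.members]
        return torch.stack(probs).mean(dim=0)

def smoothed_bce(prob, target, eps=0.05):
    """Binary cross-entropy against label-smoothed targets (0 -> eps, 1 -> 1 - eps)."""
    target = target * (1 - 2 * eps) + eps
    return nn.functional.binary_cross_entropy(prob, target)

# One independent ensemble per dimension (binary relevance); preplus/plus uses its own pipeline.
classifiers = {name: EnsembleBinaryClassifier() for name in ("quality", "stage", "hemorrhage")}
```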

    Finally, the image-level referral decision was automatically generated by integrating the results of multiple classifiers, and the eye-level and patient-level referral decisions were generated by integrating multiple image-level results. Details of the methods, especially those of algorithm development, are shown in eMethods 2 and 3 in the Supplement.
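
    The aggregation just described (image level to eye and patient level by max pooling, followed by a referral flag if any feature is positive) can be sketched as follows. This is an illustrative sketch, not the authors' code; the feature names, data layout, and 0.5 thresholds are assumptions rather than the operating points used by J-PROP.

```python
# Illustrative sketch: max-pool per-image probabilities within an eye or patient "bag",
# then mark referral-warranted (RW) ROP if any ROP-related feature is positive.
from collections import defaultdict

FEATURES = ("stage", "hemorrhage", "preplus_plus")
THRESHOLDS = {f: 0.5 for f in FEATURES}  # placeholder thresholds

def aggregate(image_results, key):
    """image_results: list of dicts with an 'eye' ID, a 'patient' ID, and one probability per feature.

    key: 'eye' or 'patient'. Returns one max-pooled probability per feature for each bag.
    """
    groups = defaultdict(lambda: {f: 0.0 for f in FEATURES})
    for result in image_results:
        bag = groups[result[key]]
        for f in FEATURES:
            bag[f] = max(bag[f], result[f])  # max pooling over images in the bag
    return dict(groups)

def referral_warranted(feature_probs):
    """Any positive ROP-related feature triggers a referral recommendation."""
    return any(feature_probs[f] >= THRESHOLDS[f] for f in FEATURES)

# Usage: eye_level = aggregate(per_image_outputs, key="eye"); apply referral_warranted per eye.
```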

    Statistical Analysis

    The data set was randomly split into training, validation, and test data sets at a ratio of 75:10:15 using a patient-based split policy to ensure that all images of a patient were allocated to the same sub–data set for any classifier. The test set was also used to evaluate the performance of the automated referral decision. The performance of each classifier was evaluated by true negative (TN), false positive (FP), false negative (FN), true positive (TP), F1 score, sensitivity, and specificity. Receiver operating characteristic (ROC) analysis and the area under the curve (AUC) with 95% CIs were also calculated. Two-sided 95% CIs for the AUC were calculated with the DeLong method using the open-source package pROC, version 1.14.0 (Xavier Robin). Data were analyzed from July 15, 2019, to June 24, 2020.
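
    A patient-based split can be implemented with grouped splitting so that no patient contributes images to more than one subset. The sketch below uses scikit-learn's GroupShuffleSplit and matches the 75:10:15 ratio; it is an illustrative sketch rather than the authors' code (the study used pROC in R for the AUC confidence intervals), and the random seed is arbitrary.

```python
# Illustrative sketch: 75:10:15 split grouped by patient ID so all images of a
# patient land in the same subset.
from sklearn.model_selection import GroupShuffleSplit

def patient_based_split(image_ids, patient_ids, seed=0):
    """Return (train, validation, test) index lists over the image list."""
    # First carve out the 15% test set, grouped by patient.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=seed)
    trainval_idx, test_idx = next(outer.split(image_ids, groups=patient_ids))

    # Then split the remaining 85% into 75% training and 10% validation (10/85 of the remainder).
    inner_groups = [patient_ids[i] for i in trainval_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=10 / 85, random_state=seed)
    train_rel, val_rel = next(inner.split(trainval_idx, groups=inner_groups))

    train_idx = [trainval_idx[i] for i in train_rel]
    val_idx = [trainval_idx[i] for i in val_rel]
    return train_idx, val_idx, list(test_idx)
```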

    The comparison was carried out between our platform, JSIEC Platform for Retinopathy of Prematurity (J-PROP), and 3 experienced ROP experts (W.G., D.G., and T.L.) from JSIEC on 200 retinal images extracted randomly from the test set. Three ROP-related features were identified, and a diagnosis of referral-warranted (RW) ROP was generated automatically by J-PROP via integration of the feature identification results or generated manually from the results of the ROP experts. A criterion-standard diagnosis originated from the ground truth labels. The Cohen unweighted κ was calculated and displayed in the interobserver heat map with a conventional scale where 0.2 or less was considered to be slight agreement, 0.21 to 0.40 was labeled as fair, 0.41 to 0.60 was labeled as moderate, 0.61 to 0.80 was labeled as strong, and 0.80 to 1.0 was considered to be near-complete agreement. The indexes of TN, FP, FN, TP, F1 score, sensitivity, and specificity were also calculated.
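
    The agreement statistic used in this comparison, the Cohen unweighted κ, is available in standard libraries; a minimal sketch (not the authors' analysis script) is shown below, with hypothetical binary labels.

```python
# Illustrative sketch: Cohen unweighted kappa between a rater (platform or expert)
# and the criterion-standard labels, computed with scikit-learn.
from sklearn.metrics import cohen_kappa_score

def agreement(rater_a, rater_b):
    """Cohen unweighted kappa between two binary label sequences (one entry per image)."""
    return cohen_kappa_score(rater_a, rater_b)

# Hypothetical example: values of 0.81-1.00 would fall in the near-complete agreement band.
platform = [1, 0, 1, 1, 0, 0, 1, 0]
ground_truth = [1, 0, 1, 0, 0, 0, 1, 0]
print(agreement(platform, ground_truth))
```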

    Results

    Of 55 490 retinal images, 3241 (5.8%) were discarded because they were nonfundus photos, were taken by imaging devices other than RetCam, showed other ocular diseases (such as congenital cataract or retinoblastoma) rather than ROP, or lacked agreed-upon labels. A total of 52 249 retinal images from 14 108 eyes of 8652 preterm infants (mean [SD] gestational age, 32.9 [3.1] weeks; 4818 boys [60.4%] of 7973 with known sex) were annotated and included as the ground truth data set (Table 1). With the available data, the mean (SD) birth weight was 1925 (774) g. The data set was randomly split into training (n = 39 029), validation (n = 5140), and test (n = 8080) data sets at a ratio of 75:10:15 using a patient-based split policy. The demographic characteristics of the patients are shown in Table 1, and the data set distribution is listed in eTable 1 in the Supplement.

    System Performance

    The performance of 5 independent classifiers was validated and tested. In the test set, all classifiers achieved an F1 score of 0.718 to 0.981, a sensitivity of 0.918 to 0.982, a specificity of 0.949 to 0.992, and an AUC of 0.9827 to 0.9981 (Table 2, Figure 1, and eFigure 3 in the Supplement). For the ROP-related features, any stage of ROP achieved an F1 score of 0.946, a sensitivity of 0.982, a specificity of 0.985, and an AUC of 0.9981 (95% CI, 0.9974-0.9989), whereas hemorrhage achieved 0.961, 0.972, 0.992, and 0.9977 (95% CI, 0.9963-0.9991), respectively. The performance of preplus/plus disease achieved an F1 score of 0.718, a sensitivity of 0.918, a specificity of 0.970, and an AUC of 0.9827 (95% CI, 0.9706-0.9948).

    In the test data set, the performance of RW ROP detection at the image level achieved an F1 score of 0.956, a sensitivity of 0.981, a specificity of 0.974, and an AUC of 0.9956 (95% CI, 0.9942-0.9970). Eye-level and patient-level F1 scores decreased to 0.915 and 0.898, respectively, whereas the other indexes were similar to those at the image level (Table 2 and Figure 1). The performance of RW ROP detection ignoring the hemorrhage dimension was also analyzed (Table 2). In addition, the performance on 2 subsets, the RetCam II and RetCam III sets, was analyzed separately (eTables 2 and 3 in the Supplement). To compare the single models with the ensemble model, the AUCs were calculated; the ensemble model was more accurate than the single models (for identifying stage, the ensemble model achieved an AUC of 0.9981 [95% CI, 0.9974-0.9989] compared with 0.9968-0.9971 for the single models; for identifying hemorrhage, 0.9977 [95% CI, 0.9963-0.9991] compared with 0.9940-0.9969; for identifying preplus/plus disease, 0.9827 [95% CI, 0.9706-0.9948] compared with 0.9712-0.9809) (eTable 4 in the Supplement).

    Visualization and Explainability

    The features extracted by the neural networks just before the classification head were visualized using t-Distributed Stochastic Neighbor Embedding, a technique for dimensionality reduction (eFigure 4 in the Supplement). DeepSHAP31 was adopted as the heat map technique to provide explainability, and extensive experiments were carried out to compare the heat maps generated by different techniques, including DeepSHAP, Class Activation Mapping (CAM),26,27 Saliency Maps,28 Guided Backpropagation, Integrated Gradients,29 Layer-wise Relevance Propagation (LRP)-Epsilon, and LRP-Z30 (Figure 2 and eFigure 5 in the Supplement).
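
    For readers who want to reproduce a DeepSHAP-style heat map, the sketch below uses the open-source shap package's DeepExplainer with a PyTorch classifier. It is a hedged illustration rather than the J-PROP pipeline: the model, background sample, and preprocessing are placeholders, and the per-pixel map is obtained by summing attributions over the color channels.

```python
# Illustrative sketch: per-pixel attribution heat map via shap.DeepExplainer.
import shap
import torch

def deepshap_heatmap(model, background_images, test_image):
    """Return a 2-D per-pixel attribution map for one image.

    background_images: tensor (N, C, H, W) used as the reference distribution.
    test_image: tensor (1, C, H, W) to be explained.
    """
    model.eval()
    explainer = shap.DeepExplainer(model, background_images)
    shap_values = explainer.shap_values(test_image)
    # Multi-output models return a list (one array per class); a single-logit model returns one array.
    values = shap_values[0] if isinstance(shap_values, list) else shap_values
    return values.sum(axis=1)[0]  # sum over color channels -> (H, W) heat map
```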

    Human-Platform Comparison

    Our platform, J-PROP, was further compared with the ROP experts (Table 3). For the detection of intraocular hemorrhage, preplus/plus disease, and image-level RW ROP, J-PROP achieved a sensitivity of 1.000, whereas the experts achieved an average sensitivity of 0.958 to 1.000. The confusion matrix of the agreements among the ROP experts and criterion-standard diagnosis is shown in eFigure 6 in the Supplement. J-PROP achieved a Cohen κ of 0.93 for any stage of ROP, 0.97 for intraocular hemorrhage, 0.86 for preplus/plus disease, and 0.98 for RW ROP, whereas ROP experts achieved a mean Cohen κ of 0.93 (range, 0.87-1.00), 0.93 (range, 0.91-0.95), 0.98 (range, 0.95-1.00), and 0.95 (range, 0.93-0.99) for the 4 classifiers, respectively.

    Misclassification Analysis

    In the test set, the images misclassified by any independent classifier were analyzed case by case for possible reasons. Poor contrast and artifacts were the most common reasons for misclassification by the stage and hemorrhage classifiers, whereas atypical vessel morphology was the main reason for FN predictions of preplus/plus disease. Some errors were due to incorrect annotations by the junior ophthalmologists rather than to misclassification by J-PROP (eTable 5, eFigure 7, and eFigure 8 in the Supplement).

    After full validation and testing, the neural network models were deployed to the production environment, and our cloud-based platform, J-PROP, was built and is openly accessible (eFigures 9 and 10 in the Supplement).

    Discussion

    Results from this study suggest that (1) we developed a cloud-based DL platform integrating multidimensional classification and multilevel referral strategies that has the potential to meet clinical needs; (2) preplus/plus disease classification was implemented using an independent pipeline; and (3) DeepSHAP, which is a combination of DeepLIFT32 and Shapley values, could be adopted to generate fine-grained and class-discriminative heat maps in real time. Collectively, our automated ROP screening system, J-PROP, covering 4 main dimensions (image quality, any stage of ROP, intraocular hemorrhage, and preplus/plus disease), not only achieved high accuracy in both single-dimensional classification and image-, eye-, and patient-level referral decisions but also generated fine-grained and class-discriminative heat maps for explainability. The J-PROP platform is openly accessible and appears to be promising for ROP screening.

    From the point of view of a single image, ROP diagnosis can be viewed as a multilabel classification,20 wherein 1 image can belong to multiple classes simultaneously. This classification was implemented using multiple independent classifiers based on binary relevance (neglecting class dependence). With multiple images, ROP diagnosis is a multiple-instance learning problem17 wherein the images are instances and the eyes or patients are labeled as “bags.” Our study adopted a multimodal learning with decision-level fusion33 method, which differed from a previous study16 that used the standard multi-instance learning method. In the previous study’s method,16 the training instances came in bags, with all examples in a bag sharing the same label; a neural network had multiple inputs and a single output (Softmax was considered a single output), and predictions were made only at the bag level. In contrast, a single-image classification with a postprocessing method was adopted in this study. The training instances were considered singletons instead of bags. Predictions were made at the instance level, and the postprocessing method was used to generate bag-level results by aggregating (max pooling) the instance-level results. The label of a single image can be given based on that image alone. Because we labeled every image, J-PROP can fully use the samples, and the neural networks can be trained quickly. By contrast, in the traditional multiple-instance method, likely only 1 of multiple images was used to train the feature extractor during 1 backpropagation. In addition to the eye-level results, J-PROP has the potential to provide image-specific results; the explainable heat map of an image would not interfere with that of other images.

    Preplus/plus disease classification is challenging, and its features are easily confused with those of the ROP stages or hemorrhage. For preplus/plus disease classification, there are fewer positive samples than negative samples, and most preplus/plus disease images simultaneously contain features of a stage of ROP or intraocular hemorrhage. Because preplus/plus disease classification is essentially related to the blood vessels in the posterior pole of the retina, it was considered a fine-grained classification and was implemented using an independent pipeline, including blood vessel segmentation, optic disc detection, selection of the blood vessels of the posterior pole region, and preplus/plus disease classification. This design was based on domain knowledge, and the core idea was to ensure that the inputs to the preplus/plus disease classifier contained only the region of interest, removing irrelevant features as much as possible.
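
    To make the sequence of steps concrete, the sketch below chains the components named above (posterior check, vessel segmentation, optic disc detection, posterior-pole selection, final classification). Every component is assumed to be a separately trained model supplied by the caller, and the geometric helper reuses the posterior-pole rule sketched in the Methods section; this is an illustrative outline, not the authors' implementation.

```python
# Illustrative sketch of the fine-grained preplus/plus pipeline; all components are
# assumed callables supplied by the caller, not the authors' models.
import numpy as np

def classify_preplus_plus(image, is_posterior, segment_vessels, detect_disc, classifier,
                          posterior_pole_mask):
    """Keep only posterior-pole vessels, then classify preplus/plus disease."""
    if not is_posterior(image):
        return 0.0                                  # nonposterior images are treated as non-preplus/plus
    vessels = segment_vessels(image)                # e.g., a Res-UNet-style vessel probability map
    center, diameter = detect_disc(image)           # e.g., a Mask R-CNN-style optic disc detector
    mask = posterior_pole_mask(vessels.shape, center, diameter)
    posterior_vessels = np.where(mask, vessels, 0)  # drop vessels outside the posterior pole
    return classifier(posterior_vessels)            # probability of preplus/plus disease
```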

    The reasons for the false predictions of ROP-related features in the test set are shown in eTable 5 in the Supplement. Poor lesion contrast and artifacts were the 2 most common factors interfering with recognition of the stages of ROP and detection of hemorrhage, leading to FN and FP predictions. For preplus/plus disease, atypical morphology was the common cause of misclassification. Notably, various proportions of FP and FN results were caused by incorrect annotations, which may have been due to poor contrast, artifacts, and atypical morphologies, as is commonly found with DL platforms. Although J-PROP was not affected by artifacts in most cases, artifacts and atypical morphologies could still result in false predictions. In the future, we will work to continuously improve the generalization ability by adding more specific samples, such as images with different kinds of artifacts and with atypical morphologies, and we hope to adopt a hard negative mining technique to pay more attention to existing difficult samples.

    Limitations

    There are several limitations in this study. First, according to the International Classification of Retinopathy of Prematurity, there are 5 stages of ROP (stages 1-5) and 3 levels of plus disease (normal, preplus, and plus); however, in this study, each class was divided into only 2 levels (ie, any stage or nonstage and preplus/plus or non–preplus/plus). Second, posterior pole zones I to III proposed by the International Classification of Retinopathy of Prematurity represent an important parameter affecting the severity of ROP, but we did not include the zone factor in our classification. Third, in some cases, DeepSHAP heat maps are fragile, as are many other methods, and do not satisfy sensitivity and implementation invariance29 at the same time. Fourth, our work did not consider the cost and staff training needed to acquire the images; we focused on the development and validation of a cloud-based ROP platform. In the future, we hope to design an edge and cloud platform and to complete a protocol covering running costs and the selection and training of imaging technicians. Finally, in addition to the local explanations, a global understanding of neural networks is needed. Future studies should focus on the following questions: what patterns learned by the neural networks could represent the stages of ROP, and how are the features extracted by the neural network matched to these patterns? Even though the global interpretability of neural networks is an open question, future studies should attempt to understand neural networks in detail by using global interpretability methods, such as activation maximization and filter visualization.

    Conclusions

    This diagnostic study developed a cloud-based DL platform integrating a multidimensional classification and multilevel referral strategy for ROP screening and referral recommendation. Results suggest that the referral decision could be automatically generated at the image, eye, and patient levels. Our platform, J-PROP, has the potential to be applied in neonatal intensive care units, children’s hospitals, and rural primary health care centers for routine ROP screening and may be useful in remote areas lacking ROP expertise.

    Article Information

    Accepted for Publication: February 17, 2021.

    Published: May 5, 2021. doi:10.1001/jamanetworkopen.2021.8758

    Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Wang J et al. JAMA Network Open.

    Corresponding Author: Mingzhi Zhang, MD, Joint Shantou International Eye Center of Shantou University, The Chinese University of Hong Kong, North Dongxia Road, Shantou, Guangdong, China 515041 (zmz@jsiec.org).

    Author Contributions: Dr M. Zhang had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Dr Wang, Dr M. Zhang, and Mr Ji contributed equally to this work.

    Concept and design: Wang, Ji, M. Zhang, Pang.

    Acquisition, analysis, or interpretation of data: Wang, Ji, M. Zhang, Lin, G. Zhang, Gong, Cen, Lu, X. Huang, D. Huang, Li, Ng.

    Drafting of the manuscript: Wang, Ji.

    Critical revision of the manuscript for important intellectual content: Cen, Wang, Ji, M. Zhang, Lin, G. Zhang, Gong, Lu, X. Huang, D. Huang, Li, Ng, Pang.

    Statistical analysis: Wang, Ji, Lin.

    Obtained funding: M. Zhang, Cen.

    Administrative, technical, or material support: Ji, M. Zhang, Lin, Gong, Cen, Lu, X. Huang, D. Huang, Li.

    Supervision: M. Zhang, Ng, Pang.

    Conflict of Interest Disclosures: None reported.

    Funding/Support: This work was supported in part by Science and Technology Innovation Strategy Special Fund Project of Guangdong Province (project code [2018]157-46; Dr M. Zhang), grant 002-18120304 from the Grant for Key Disciplinary Project of Clinical Medicine under the Guangdong High-level University Development Program, China (Dr M. Zhang), and grant 2020LKSFG16B from the Li Ka Shing Foundation cross-disciplinary research grants (Drs M. Zhang and Cen).

    Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

    References
    1.
    Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert C. Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatr Res. 2013;74(suppl 1):35-49. doi:10.1038/pr.2013.205
    2.
    Solebo AL, Teoh L, Rahi J. Epidemiology of blindness in children. Arch Dis Child. 2017;102(9):853-857. doi:10.1136/archdischild-2016-310532
    3.
    International Committee for the Classification of Retinopathy of Prematurity. The international classification of retinopathy of prematurity revisited. Arch Ophthalmol. 2005;123(7):991-999. doi:10.1001/archopht.123.7.991
    4.
    Hutcheson KA, Nguyen ATQ, Preslan MW, Ellish NJ, Steidl SM. Vitreous hemorrhage in patients with high-risk retinopathy of prematurity. Am J Ophthalmol. 2003;136(2):258-263. doi:10.1016/S0002-9394(03)00190-9
    5.
    Daniel E, Ying GS, Siatkowski RM, Pan W, Smith E, Quinn GE; e-ROP Cooperative Group. Intraocular hemorrhages and retinopathy of prematurity in the telemedicine approaches to evaluating acute-phase Retinopathy of Prematurity (e-ROP) study. Ophthalmology. 2017;124(3):374-381. doi:10.1016/j.ophtha.2016.10.040
    6.
    Wallace DK, Freedman SF, Hartnett ME, Quinn GE. Predictive value of pre-plus disease in retinopathy of prematurity. Arch Ophthalmol. 2011;129(5):591-596. doi:10.1001/archophthalmol.2011.63
    7.
    Adams GG, Bunce C, Xing W, et al. Treatment trends for retinopathy of prematurity in the UK: active surveillance study of infants at risk. BMJ Open. 2017;7(3):e013366. doi:10.1136/bmjopen-2016-013366
    8.
    Watts P, Maguire S, Kwok T, et al. Newborn retinal hemorrhages: a systematic review. J AAPOS. 2013;17(1):70-78. doi:10.1016/j.jaapos.2012.07.012
    9.
    Quinn GE, Vinekar A. The role of retinal photography and telemedicine in ROP screening. Semin Perinatol. 2019;43(6):367-374. doi:10.1053/j.semperi.2019.05.010
    10.
    Vander JF, Handa J, McNamara JA, et al. Early treatment of posterior retinopathy of prematurity: a controlled trial. Ophthalmology. 1997;104(11):1731-1735. doi:10.1016/S0161-6420(97)30034-7
    11.
    Medeiros FA, Jammal AA, Thompson AC. From machine to machine: an OCT-trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs. Ophthalmology. 2019;126(4):513-521. doi:10.1016/j.ophtha.2018.12.033
    12.
    Coudray N, Ocampo PS, Sakellaropoulos T, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24(10):1559-1567. doi:10.1038/s41591-018-0177-5
    13.
    Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analysis for retinopathy of prematurity by deep neural networks. IEEE Trans Med Imaging. 2019;38(1):269-279. doi:10.1109/TMI.2018.2863562
    14.
    Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M. Striving for simplicity: the all convolutional net. Accessed May 15, 2020. https://arxiv.org/abs/1412.6806
    15.
    Brown JM, Campbell JP, Beers A, et al; Imaging and Informatics in Retinopathy of Prematurity (i-ROP) Research Consortium. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803-810. doi:10.1001/jamaophthalmol.2018.1934
    16.
    Wang J, Ju R, Chen Y, et al. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine. 2018;35:361-368. doi:10.1016/j.ebiom.2018.08.033
    17.
    Carbonneau M-A, Cheplygina V, Granger E, Gagnon G. Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition. 2018;77:329-353. doi:10.1016/j.patcog.2017.10.009
    18.
    World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191-2194. doi:10.1001/jama.2013.281053
    19.
    Wang J, Zhang G, Lin J, Ji J, Qiu K, Zhang M. Application of standardized manual labeling on identification of retinopathy of prematurity images in deep learning. Chin J Exp Ophthalmol. 2019;37(8):653-657.
    20.
    Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2014;26(8):1819-1837. doi:10.1109/TKDE.2013.39
    21.
    Engstrom L, Tran B, Tsipras D, Schmidt L, Madry A. Exploring the landscape of spatial robustness. Accessed May 10, 2020. https://arxiv.org/abs/1712.02779
    22.
    Müller R, Kornblith S, Hinton G. When does label smoothing help? Accessed May 17, 2020. https://arxiv.org/abs/1906.02629
    23.
    Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. Accessed May 10, 2020. https://arxiv.org/abs/1505.04597
    24.
    Zhang Z, Liu Q, Wang Y. Road extraction by deep residual U-Net. IEEE Geoscience and Remote Sensing Letters. 2018;15:749-753. doi:10.1109/LGRS.2018.2802944
    25.
    He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. Accessed April 13, 2020. https://arxiv.org/abs/1703.06870
    26.
    Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. Accessed February 27, 2020. https://arxiv.org/abs/1512.04150
    27.
    Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Accessed April 5, 2020. https://arxiv.org/abs/1610.02391
    28.
    Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. Accessed May 16, 2020. https://arxiv.org/abs/1312.6034
    29.
    Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. Accessed May 30, 2020. https://arxiv.org/abs/1703.01365
    30.
    Bach S, Binder A, Montavon G, Klauschen F, Müller K-R, Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One. 2015;10(7):e0130140. doi:10.1371/journal.pone.0130140
    31.
    Lundberg S, Lee S-I. A unified approach to interpreting model predictions. Accessed January 3, 2020. https://arxiv.org/abs/1705.07874
    32.
    Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. Accessed November 11, 2019. https://arxiv.org/abs/1704.02685
    33.
    Ramachandram D, Taylor GW. Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Processing Magazine. 2017;34(6):96-108. doi:10.1109/MSP.2017.2738401