Figure 1. Performance Results for 4-Class and 9-Class Age-Related Macular Degeneration (AMD) Severity Scale Problems

Values in parentheses indicate error margins for 95% CIs. Information on the severity scales is given in the AMD Severity Scales subsection of the Methods section, and metrics are defined in the Metrics subsection of the Methods section. DL indicates deep learning.

Figure 2. Whisker Plots for the Unsigned Error and the Signed Error Affecting the Predicted 5-Year Risk Probability

The boxes extend from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Flier (outlier) points are those past the end of the whiskers.

Table. Unsigned Error for the Predicted 5-Year Risk for the 3 Methods and Comparison With Estimated 5-Year Risk
References

1. Bressler NM. Age-related macular degeneration is the leading cause of blindness. JAMA. 2004;291(15):1900-1901. doi:10.1001/jama.291.15.1900
2. Bressler NM, Bressler SB, Congdon NG, et al; Age-Related Eye Disease Study Research Group. Potential public health impact of Age-Related Eye Disease Study results: AREDS report No. 11. Arch Ophthalmol. 2003;121(11):1621-1624. doi:10.1001/archopht.121.11.1621
3. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study system for classifying age-related macular degeneration from stereoscopic color fundus photographs: the Age-Related Eye Disease Study Report Number 6. Am J Ophthalmol. 2001;132(5):668-681. doi:10.1016/S0002-9394(01)01218-1
4. Age-Related Eye Disease Study Research Group. A randomized, placebo-controlled, clinical trial of high-dose supplementation with vitamins C and E, beta carotene, and zinc for age-related macular degeneration and vision loss: AREDS report No. 8. Arch Ophthalmol. 2001;119(10):1417-1436. doi:10.1001/archopht.119.10.1417
5. Davis MD, Gangnon RE, Lee LY, et al; Age-Related Eye Disease Study Group. The Age-Related Eye Disease Study severity scale for age-related macular degeneration: AREDS report No. 17. Arch Ophthalmol. 2005;123(11):1484-1498. doi:10.1001/archopht.123.11.1484
6. Ying GS, Maguire MG, Alexander J, Martin RW, Antoszyk AN; Complications of Age-related Macular Degeneration Prevention Trial Research Group. Description of the Age-Related Eye Disease Study 9-step severity scale applied to participants in the Complications of Age-Related Macular Degeneration Prevention Trial. Arch Ophthalmol. 2009;127(9):1147-1151. doi:10.1001/archophthalmol.2009.189
7. Ferris FL, Davis MD, Clemons TE, et al; Age-Related Eye Disease Study (AREDS) Research Group. A simplified severity scale for age-related macular degeneration: AREDS report No. 18. Arch Ophthalmol. 2005;123(11):1570-1574. doi:10.1001/archopht.123.11.1570
8. Velez-Montoya R, Oliver SCN, Olson JL, Fine SL, Quiroz-Mercado H, Mandava N. Current knowledge and trends in age-related macular degeneration: genetics, epidemiology, and prevention. Retina. 2014;34(3):423-441. doi:10.1097/IAE.0000000000000036
9. US Department of Commerce; US Census Bureau. Statistical Abstract of the United States, 2012. Washington, DC: US Census Bureau; 2012.
10. Juang R, McVeigh E, Hoffmann B, Yuh D, Burlina P. Automatic segmentation of the left-ventricular cavity and atrium in 3D ultrasound using graph cuts and the radial symmetry transform. In: 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2011:606-609. doi:10.1109/ISBI.2011.5872480
11. Burlina P, Freund DE, Dupas B, Bressler N. Automatic screening of age-related macular degeneration and retinal abnormalities. Conf Proc IEEE Eng Med Biol Soc. 2011;2011:3962-3966. doi:10.1109/IEMBS.2011.6090984
12. Burlina P, Billings S, Joshi N, Albayda J. Automated diagnosis of myositis from muscle ultrasound: exploring the use of machine learning and deep learning methods. PLoS One. 2017;12(8):e0184059. doi:10.1371/journal.pone.0184059
13. Ting DSW, Cheung CY, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211-2223. doi:10.1001/jama.2017.18152
14. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216
15. Quellec G, Charrière K, Boudi Y, Cochener B, Lamard M. Deep image mining for diabetic retinopathy screening. Med Image Anal. 2017;39:178-193. doi:10.1016/j.media.2017.04.012
16. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962-969. doi:10.1016/j.ophtha.2017.02.008
17. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135(11):1170-1176. doi:10.1001/jamaophthalmol.2017.3782
18. Grassmann F, Mengelkamp J, Brandl C, et al. A deep learning algorithm for prediction of Age-Related Eye Disease Study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018;125(9):1410-1420. doi:10.1016/j.ophtha.2018.02.037
19. Burlina P, Pacheco KD, Joshi N, Freund DE, Bressler NM. Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated AMD analysis. Comput Biol Med. 2017;82:80-86. doi:10.1016/j.compbiomed.2017.01.018
20. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2016:770-778.
21. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study (AREDS): design implications: AREDS report No. 1. Control Clin Trials. 1999;20(6):573-600. doi:10.1016/S0197-2456(99)00031-8
23. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174. doi:10.2307/2529310
24. Markowitz J, Schmidt AC, Burlina P, Wang IJ. Hierarchical zero-shot classification with convolutional neural network features and semantic attribute learning. In: Proceedings of the Fifteenth IAPR International Conference on Machine Vision Applications (MVA). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2017. doi:10.23919/MVA.2017.7986834
25. Pekala M, Joshi N, Freund DE, Bressler NM, Cabrera Debuc D, Burlina P. Deep learning based retinal OCT segmentation. arXiv. https://arxiv.org/abs/1801.09749. Published January 29, 2018. Accessed August 13, 2018.
Original Investigation
December 2018

Use of Deep Learning for Detailed Severity Characterization and Estimation of 5-Year Risk Among Patients With Age-Related Macular Degeneration

Author Affiliations
  • 1Applied Physics Laboratory, The Johns Hopkins University, Baltimore, Maryland
  • 2Brazilian Center of Vision Eye Hospital, Brasília, Brazil
  • 3The Fourth Affiliated Hospital of China Medical University, Eye Hospital of China Medical University, Shenyang
  • 4Retina Division, Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, Maryland
  • 5Editor, JAMA Ophthalmology
JAMA Ophthalmol. 2018;136(12):1359-1366. doi:10.1001/jamaophthalmol.2018.4118
Key Points

Question  How accurate are deep learning algorithms for characterizing age-related macular degeneration from fundus images, assessing Age-Related Eye Disease Study 4- and 9-step detailed severity scales, and determining 5-year risk of progression to advanced stages of age-related macular degeneration?

Findings  This study used 67 401 color fundus images from 4613 study participants of the Age-Related Eye Disease Study data set. Linearly weighted κ scores for estimating 4- and 9-step severity scale scores showed substantial agreement using gradings from highly trained fundus photograph graders as the criterion standard.

Meaning  The results of this study suggest that deep learning can assist in detailed 9-step severity assessment of age-related macular degeneration and in estimating the 5-year risk of progression to advanced age-related macular degeneration with reasonable accuracy.

Abstract

Importance  Although deep learning (DL) can identify the intermediate or advanced stages of age-related macular degeneration (AMD) as a binary yes or no, stratified gradings using the more granular Age-Related Eye Disease Study (AREDS) 9-step detailed severity scale for AMD provide more precise estimation of 5-year progression to advanced stages. The complexity of the AREDS 9-step detailed scale and its implementation solely by highly trained fundus photograph graders have potentially hampered its clinical use, warranting development of an alternate AREDS simple scale, which, although valuable, has less predictive ability.

Objective  To describe DL techniques for the AREDS 9-step detailed severity scale for AMD to estimate 5-year risk probability with reasonable accuracy.

Design, Setting, and Participants  This study used data collected from November 13, 1992, to November 30, 2005, from 4613 study participants of the AREDS data set to develop deep convolutional neural networks that were trained to provide detailed automated AMD grading on several AMD severity classification scales, using a multiclass classification setting. Two AMD severity classification problems using criteria based on 4-step (AMD-1, AMD-2, AMD-3, and AMD-4 from classifications developed for AREDS eligibility criteria) and 9-step (from AREDS detailed severity scale) AMD severity scales were investigated. The performance of these algorithms was compared with a contemporary human grader and against a criterion standard (fundus photograph reading center graders) used at the time of AREDS enrollment and follow-up. Three methods for estimating 5-year risk were developed, including one based on DL regression. Data were analyzed from December 1, 2017, through April 15, 2018.

Main Outcomes and Measures  Weighted κ scores and mean unsigned errors for estimating 5-year risk probability of progression to advanced AMD.

Results  This study used 67 401 color fundus images from the 4613 study participants. The weighted κ scores were 0.77 for the 4-step and 0.74 for the 9-step AMD severity scales. The overall mean estimation error for the 5-year risk ranged from 3.5% to 5.3%.

Conclusions and Relevance  These findings suggest that DL AMD grading has, for the 4-step classification evaluation, performance comparable with that of humans and achieves promising results for providing AMD detailed severity grading (9-step classification), which normally requires highly trained graders, and for estimating 5-year risk of progression to advanced AMD. Use of DL has the potential to assist physicians in longitudinal care for individualized, detailed risk assessment as well as clinical studies of disease progression during treatment or as public screening or monitoring worldwide.

Introduction

Approximately 8 million people older than 50 years have intermediate-stage age-related macular degeneration (AMD).1 These individuals are at high risk of developing advanced AMD, which, if left untreated, is a leading cause of blindness in the United States.1,2 The intermediate stage is defined by large drusen or retinal pigment epithelial abnormalities identified on fundus examination and rigorously quantified using fundus photographs to measure drusen size, drusen area, or pigmentary abnormalities. To quantify AMD severity and study disease progression, the Age-Related Eye Disease Study (AREDS) developed 2 classification scales based on these retinal abnormalities. In particular, a basic 4-step classification scale, which was not based on an analysis of outcome data, was used at entry into AREDS and defined as no AMD (AMD-1), early AMD (AMD-2), intermediate AMD (AMD-3), and advanced AMD (AMD-4).3 A more detailed, 9-step severity scale (eFigures 1-3 in the Supplement), which was based on outcome data, provided predictive variables for 5-year risk of developing choroidal neovascularization (CNV), central geographic atrophy (GA), or both.4-6 This detailed grading of fundus images can be time-consuming and likely limited to highly trained fundus photograph graders. Given the complexity of the 9-step scale, in the absence of trained graders (eg, from fundus photograph reading centers), most physicians probably do not use the AREDS detailed AMD severity scale. A simpler, more practical scale was therefore judged to be needed and was developed, although with less predictive accuracy for 5-year risk.7

People older than 50 years are at greater risk for developing AMD.8 Currently, there are approximately 1.75 million to 3 million people in the United States with the advanced form of AMD and approximately 110 million more in the at-risk population of people older than 50 years.1,2,9 Worldwide, it was estimated in 2000 that there were more than 600 million people older than 60 years, with that number being projected to increase to 2.4 billion by 2050.8 Use of the 9-step scale performed only by highly skilled human graders to determine an individual’s 5-year risk of developing the advanced form of AMD is time and cost prohibitive.

The objective of this study was to develop automated methods that implement the AREDS 4-step AMD eligibility criteria and the 9-step AMD detailed severity scale using modern deep learning (DL) algorithms to automatically evaluate severity of AMD and risk of progression to advanced AMD from fundus photographs. This automated capability could alleviate the issue of identifying, in a timely manner, individuals at various levels of risk of progression in the population. In addition, it could allow for the objective assessment of detailed disease progression in practice or during enrollment and follow-up of clinical trials for AMD.

Deep learning approaches for automated classification from fundus images differ from traditional machine vision, medical image analysis, and automated retinal image analysis methods, which have relied on computing engineered features from the image.10,11 Instead, DL involves feature representations of images with multiple levels of abstraction not by relying on human-identified features but directly from data.12 Advances in DL were made possible through algorithmic optimization and expanded computational power using graphics processing units and have led to the possibility of using DL evaluation of fundus photographs not only for a 2-step system (referable vs not referable) but also for detailed 4- or 9-step systems.11,13-18 More detailed systems could determine whether the patient was referable or not and provide the health care system more granular classification that allows better predictive accuracy in the absence of highly trained human graders. The output could be used for more precise counseling of the patient and identification of patients at very high risk of progression to advanced AMD. Such patients might warrant more detailed or frequent monitoring (such as with optical coherence tomography [OCT] or OCT angiography) or may be the basis of more efficient clinical trials that could test preventive treatments on very high-risk individuals wherein the sample size could be kept small.

Methods
Data Set

The National Institutes of Health AREDS data set, including data collected from individuals from November 13, 1992, to November 30, 2005, from whom written informed consent was obtained, was derived from a 12-year longitudinal study designed to improve understanding of the frequency and risk factors of AMD progression. More than 130 000 field-2 stereoscopic color fundus photographs were captured during AREDS from 4613 study participants at the baseline and follow-up visits. Images were quantitatively graded by trained and certified graders at a fundus photograph reading center.3 The grades assigned to each image were used as a criterion standard for the multiclass classification problems studied herein. To avoid using essentially the same image twice, we handled cases with stereo pairs by removing the image with the poorer quality.17,19 This approach resulted in a total of 67 401 graded images.17 Use of the AREDS data set followed Johns Hopkins University Institutional Review Board approval. Data were analyzed from December 1, 2017, through April 15, 2018.

Automated Classification

To perform automated classification, we used DL algorithms known as deep convolutional neural networks (DCNNs), specifically the ResNet-50 network.20 The DCNNs use many computational layers that perform convolutions and nonlinear activation operations. This layered approach results in the identification of image features that represent the original image at different levels of abstraction (low-, middle-, and higher-level semantic features).
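The core layer operations can be illustrated with a toy, pure-Python sketch (not the study's ResNet-50, which stacks dozens of such layers with learned kernels and skip connections): a single 3 × 3 convolution followed by a ReLU nonlinearity, applied to a tiny grayscale image.

```python
def conv2d_valid(image, kernel):
    """2-D 'valid' convolution (strictly, cross-correlation, as in DL practice)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            out[r][c] = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
    return out

def relu(feature_map):
    """Elementwise nonlinear activation: max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# A tiny 4x4 "image" (dark above, bright below) and a horizontal-edge kernel.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
kernel = [
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1],
]
# The resulting feature map responds strongly where the edge is present.
feature_map = relu(conv2d_valid(image, kernel))
```

In a trained DCNN the kernel weights are learned from data rather than hand-designed, which is precisely what distinguishes DL from the engineered-feature methods mentioned above.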

AMD Severity Scales
4-Step Scale

The 4-step AREDS scale for eligibility criteria in AREDS is an eye-based scale defined as follows: (1) eyes with no or only small drusen (drusen size, <63 μm) and no pigmentation abnormalities classified as normal were given a score of 1; (2) eyes with multiple small drusen or medium-sized drusen (drusen size, ≥63 and <125 μm) and/or pigmentation abnormalities related to AMD classified as early stage AMD were given a score of 2; (3) eyes with large drusen (drusen size, ≥125 μm) or numerous medium-sized drusen and pigmentation abnormalities classified as intermediate AMD were given a score of 3; (4) eyes with lesions associated with CNV or GA (eg, retinal pigment epithelial detachment, subretinal pigment epithelial hemorrhage) classified as advanced AMD3,6,21 were given a score of 4 if the fellow eye did not have central GA or CNV AMD. This scale was used to provide baseline severity levels of AMD in participants enrolling in AREDS. More recently, this 4-step scale was proposed for use in the public domain to identify individuals with intermediate- or advanced-stage AMD who might be referred to health care practitioners for monitoring for the development of advanced-stage AMD and for consideration of dietary supplementation, such as that used in AREDS, to reduce the risk of progression to advanced-stage AMD. For patients with no AMD or early-stage AMD, a referral and consideration of such supplementation might not be indicated.
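The eligibility logic above can be sketched as a small decision rule. The function below is a hypothetical encoding for illustration only: the input fields, the "numerous drusen" threshold, and the omission of the fellow-eye caveat for score 4 are all simplifications of the reading-center protocol.

```python
NUMEROUS_MEDIUM_DRUSEN = 20  # "numerous" is not quantified in the text; illustrative

def areds_4step(largest_druse_um, n_medium_drusen, pigment_abnormality,
                advanced_lesion):
    """Toy encoding of the 4-step AREDS eligibility scale described above.
    Inputs are hypothetical grader measurements, not AREDS protocol fields."""
    if advanced_lesion:
        return 4  # advanced AMD: lesions associated with CNV or GA
    if largest_druse_um >= 125 or (
        n_medium_drusen >= NUMEROUS_MEDIUM_DRUSEN and pigment_abnormality
    ):
        return 3  # intermediate AMD: large drusen, or numerous medium + pigment
    if 63 <= largest_druse_um < 125 or pigment_abnormality:
        return 2  # early AMD: medium drusen and/or pigment abnormalities
    return 1  # no AMD: no or only small drusen (<63 um), no pigment changes
```

For example, an eye with a 130-µm druse would score 3, and an eye whose largest druse is 80 µm with no pigment changes would score 2.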

9-Step Scale

The 9-step severity scale (eFigures 1-3 in the Supplement), which is based on outcome data from AREDS, is an eye-based scale that is more detailed and refined than the 4-step scale described above; the scale grades intricate quantitative features of total drusen area and pigmentation abnormalities.4-6 Specifically, it combines a 6-step drusen area scale with a 5-step pigmentary abnormality scale to create a 9-step scale with detailed predictability of potential development of advanced AMD in an individual. The 5-year risk of developing the advanced stage, defined as CNV, central GA, or both, increases from approximately 0.3% for an eye at step 1 to 53% for an eye at step 9.5 The eTable in the Supplement provides a detailed description of the quantitative criteria used to define each of the 9 steps and the corresponding 5-year risk factor associated with each step.5 Notations C-1, C-2, I-2, O-2, and 0.5 disc area seen in the eTable in the Supplement refer to standard circles used by trained fundus photograph graders to quantitatively measure areas of drusen and pigment abnormalities.3 Several severity steps have multiple criteria that define the given step, any one of which is sufficient for an eye to be diagnosed at that step. For example, an eye can be diagnosed as step 2 in one of 2 ways: (1) total drusen area greater than or equal to C-1 and less than C-2 and no increased pigment or depigmentation GA or (2) total drusen area less than C-1 and increased depigmentation GA in the questionable category (ie, the grader is at least 50%, but not more than 90%, sure that the abnormality exists) but with less than I-2 and increased pigment in the questionable category. In the AREDS reports, an eye that progressed to central GA (ie, beyond step 9) was given a score of 10, an eye that progressed to CNV was scored 11, and an eye with both central GA and CNV was scored 12. 
Images with a score of 10, 11, or 12 were not included in the analysis of the 9-step severity scale because having any of these stages corresponds trivially to a 5-year risk probability of 1.

DCNN Classification

For each of the classification problems above (4-step and 9-step scales), a separate multiclass classifier based on a DCNN was trained and tested. Additional details are provided in the eMethods in the Supplement.

Estimating 5-Year Risk of Progression to Advanced AMD

Three DL-based methods were used and tested to infer the 5-year risk directly from the fundus image as input. These methods included soft prediction, hard prediction, and regressed prediction.

Soft Prediction

Soft prediction first estimated the 9-step class probabilities (pi, i = 1…9) using the DCNN-based classification described above, taking the softmax output values of the DCNN. The risk estimate Rsp was then computed as the expected value of the class risk under this probability distribution: Rsp = Ep[R] = Σi=1…9 pi·Ri, where Ri is the risk ascribed to class i (eTable in the Supplement).
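Under this formula, soft prediction is a one-line expectation over the softmax output. In the sketch below, only the risks for steps 1, 2, 8, and 9 come from values quoted in this article; the intermediate entries are illustrative placeholders standing in for the eTable.

```python
# 5-year risk per 9-step class. Steps 1, 2, 8, and 9 (0.3%, 0.6%, 47.4%, 53.2%)
# are quoted in the text; the intermediate values are placeholders.
RISK = [0.003, 0.006, 0.02, 0.05, 0.10, 0.15, 0.25, 0.474, 0.532]

def soft_prediction_risk(softmax_probs):
    """Expected 5-year risk under the DCNN softmax probabilities:
    R_sp = sum_i p_i * R_i."""
    assert abs(sum(softmax_probs) - 1.0) < 1e-6  # must be a distribution
    return sum(p * r for p, r in zip(softmax_probs, RISK))

# A prediction spread over steps 2-4 yields a risk between their class risks.
probs = [0.0, 0.1, 0.6, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
r_sp = soft_prediction_risk(probs)
```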

Hard Prediction

As in the previous method, in hard prediction, the 9-step class probabilities (pi, i = 1…9) were estimated using the DCNN-based classification described above. The risk Rhp was then computed as the risk of the class with maximum probability: Rhp = Ri*, where i* = argmaxi=1…9 (pi).
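Hard prediction therefore reduces to an argmax lookup. The sketch below uses a hypothetical 3-class risk table to keep the example small; the study's version would use the 9-step risks from the eTable.

```python
def hard_prediction_risk(softmax_probs, class_risks):
    """Risk of the most probable class: R_hp = R_{i*}, i* = argmax_i p_i."""
    i_star = max(range(len(softmax_probs)), key=lambda i: softmax_probs[i])
    return class_risks[i_star]

# Hypothetical 3-class toy: class 2 (index 1) is most probable.
risks = [0.003, 0.006, 0.532]
probs = [0.2, 0.7, 0.1]
r_hp = hard_prediction_risk(probs, risks)
```

Unlike soft prediction, this discards the probability mass assigned to the other classes, which is why the two estimates can differ when the softmax output is not sharply peaked.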

Regressed Prediction

Unlike the other 2 methods, regressed prediction skipped the step of predicting the 9-step class for the fundus image and, instead, directly performed DL-based regression by mapping the input image to the risk Rrp using a DCNN in regression mode. The DCNN architecture was similar to that used for classification (ResNet), with changes for regression (an L2 loss function instead of cross-entropy and a single node in the last layer).

Data Partition

The data consisted of AREDS color fundus images that were subdivided using a train, validate, and test split of 88%, 2%, and 10%, respectively. This split is within the range of typical values judged to be adequate; the 2% reserved for validation still contains more than 1000 images. Care was taken that all images for a given study participant were contained wholly within a single partition. As described previously,17 a total of 67 401 images were used for the 4-step classification problems. In the case of the 9-step classification problem, some of the 67 401 images had no value assigned to them for the 9-step scale (ie, missing data). These images were removed, and, as noted above, images with severity scores of 10, 11, or 12 were also removed, resulting in 58 370 images available for the 9-step classification problem.
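A patient-level split of this kind can be sketched as follows; the 88/2/10 proportions match those stated above, and the essential detail is that participants, not images, are shuffled and partitioned. The data layout (a dict of image IDs per participant) is an assumption for illustration.

```python
import random

def patient_level_split(image_ids_by_patient, fractions=(0.88, 0.02, 0.10), seed=0):
    """Split patients (not images) into train/validation/test so that all
    images of one participant land in exactly one partition."""
    patients = sorted(image_ids_by_patient)
    random.Random(seed).shuffle(patients)  # deterministic shuffle
    n = len(patients)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    groups = (
        patients[:n_train],
        patients[n_train:n_train + n_val],
        patients[n_train + n_val:],
    )
    # Flatten each patient group back into a list of image IDs.
    return [[img for p in g for img in image_ids_by_patient[p]] for g in groups]

# Toy usage with hypothetical IDs: 100 participants, 2 images each.
data = {f"pt{i}": [f"pt{i}_eye{e}" for e in (1, 2)] for i in range(100)}
train, val, test = patient_level_split(data)
```

Splitting by image instead of by patient would let near-duplicate images of the same eye leak across partitions and inflate test performance, which is the failure mode this construction avoids.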

Human-Machine Comparisons

Because many of the gradings of AREDS photographs were performed in the 1990s, we also compared the DCNN algorithms with 21st century human performance by having an ophthalmologist independently grade a subset of 5000 AREDS images using the criteria defined for the 4-step AMD severity scale. These grades and the machine-generated grades were each compared with the criterion standard AREDS AMD grades from the fundus photograph reading center.17

Metrics

The metrics used to assess the quality of performance were the linearly weighted κ (κw) score, accuracy, and the confusion matrix.3,22,23 Although accuracy is a commonly used metric, κw was a superior measure in this situation (multiclass classification with ordinal classes) for 2 reasons: κw discounts chance agreement and weights errors by the proximity of classes, whereas accuracy penalizes all errors equally, for example, treating the misclassification of an AMD-1 as an AMD-2 the same as its misclassification as an AMD-4. For the confusion matrix (Figure 1), our convention used rows to represent the sample counts for the true class, and columns provided the classifier prediction counts. Thus, the sum of each row equals the total number of images in each class, and the diagonal elements give the total number of images the classifier classified correctly in each class. Off-diagonal elements show the number of images misclassified and how they were misclassified. Accuracies that reflect population-specific class distributions can be computed from the confusion matrix.
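The linearly weighted κ can be computed directly from such a confusion matrix. The sketch below is a generic implementation of the standard formulation, using linear disagreement weights wij = |i − j|/(k − 1); it is not code from the study.

```python
def weighted_kappa(cm):
    """Linearly weighted kappa from a confusion matrix cm, with rows = true
    class and columns = predicted class (the convention described above)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)          # linear disagreement weight
            observed += w * cm[i][j]           # weighted observed disagreement
            expected += w * row_tot[i] * col_tot[j] / n  # chance disagreement
    return 1.0 - observed / expected

# Perfect agreement on a 3-class problem gives kappa = 1.
cm_perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
kw = weighted_kappa(cm_perfect)
```

Note how a step-2-vs-step-1 confusion (w = 1/8 on the 9-step scale) is penalized far less than a step-9-vs-step-1 confusion (w = 1), which is exactly the proximity weighting the text describes.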

Results

This study used 67 401 color fundus images from the 4613 study participants. Figure 1 reports the results of the multiclass classification for the 4-step and 9-step AMD severity scales. Results of the 4-step classification showed machine performance comparable to that of the ophthalmologist: comparing machine vs human, results for the 4-class classification were a κw of 0.773 and an accuracy of 75.7% vs a κw of 0.753 and an accuracy of 73.8%. Both human and machine had the greatest difficulty correctly classifying AMD-2, with human obtaining correct scores on 463 of 1194 images (38.8%) and the machine on 915 of 1711 images (53.5%). Furthermore, both human and machine made the largest percentage error misclassifying AMD-2 as AMD-1, with humans misclassifying 559 of 1194 images (46.8%) and the machines 585 of 1711 images (34.2%), which should have little clinical relevance. The next largest percentage error for each was misclassifying AMD-4 as AMD-3, with human misclassifying 117 of 653 images (17.9%) and the machine misclassifying 171 of 974 images (17.6%).

For the 9-step classification, κw was 0.738, suggesting substantial agreement with the criterion standard. From the confusion matrix for the 9-step classification, the machine had the greatest difficulty correctly classifying patients with severity scores of 3, 2, and 6, achieving accuracy of 9.7% (32 of 330 images) for step 3, 20.8% (151 of 725 images) for step 2, and 25.2% (122 of 484 images) for step 6. The largest percentage of misclassification error for the 9-step classification was misclassifying step 2 as step 1 (462 of 725 images [63.7%]). The next 3 largest misclassification errors were step 8 misclassified as step 7 (121 of 296 images [40.9%]), step 9 misclassified as step 8 (34 of 90 images [37.8%]), and step 3 misclassified as a step 1 (124 of 330 images [37.6%]). These aforementioned misclassifications were all made to adjacent classes with similar 5-year risk factors. Class imbalance and paucity of training exemplars for some classes appear to partly explain where the machine performance decreased. For example, in the 9-step classification, the 4 least represented classes were class 9 (90 images), class 8 (296 images), class 3 (330 images), and class 7 (391 images). On the other hand, the large fraction of class 2 cases that were misclassified as class 1 may be indicative of the intrinsic difficulty in distinguishing class 2 from class 1 as well as the imbalance between the 2 classes (>3 to 1).

The Table summarizes performance for 5-year risk probability prediction for each of the 3 methods: soft, hard, and regressed prediction. Error distributions are shown for all 3 methods in Figure 2, in which box plots are plotted against each step from 1 through 9. The x-axis indicates the values of the 5-year risk for each step (rather than the step value itself) as reported previously5 and ranges from 0.3% and 0.6% for steps 1 and 2 to 47.4% and 53.2% for steps 8 and 9. From the top downward, errors are shown for the soft, hard, and regressed 5-year risk estimates. The left plots show unsigned errors, and the right plots show signed errors. Interquartile ranges of the errors increase consistently with higher AMD step values. Likewise, median values of the unsigned error increase consistently with higher step values.

The overall mean estimation error for the 5-year risk ranged from 3.47% to 5.29% (Table). The error on 5-year estimated risk was smaller for lower-risk classes. Of the 3 methods, the hard prediction performed best for all classes except for class 3, in which the soft prediction outperformed all (mean [SD] prediction error, 1.92% [2.91%]; median, 1.28%), and class 6, in which the regressed prediction outperformed all (mean [SD] prediction error, 7.67% [5.37%]; median, 6.69%). For class 7, the soft prediction had the smallest mean (5.45%), whereas the hard prediction had the smallest median (0%).

Discussion

For the 4-step classification used for eligibility criteria in AREDS, this study matched human performance, as seen previously in a 2-step classification.17 For the classification based on the 4-step severity scale, the machine performed on par with a 21st-century ophthalmologist's grading, and, based on κw, both showed substantial agreement with the AREDS criterion standard grading.3,23 Both human and machine had the greatest difficulty correctly classifying AMD-2. Specifically, both human and machine made the largest percentage error misclassifying AMD-2 as AMD-1, which should have little clinical relevance. The next largest percentage error for each was misclassifying AMD-4 as AMD-3. All misclassification analyses assume that the human fundus photograph grading, as the criterion standard, was always correct.

The classification based on the 9-step severity scale based on outcome data was conducted by the machine compared with prior gradings by AREDS fundus photograph graders. Because of the increased number of classes, the total accuracy for the 9-step classification, as expected, was inferior to that of the 4-step classification. Nevertheless, the κw in this investigation, which took into account errors made to adjacent classes and assessed agreement beyond chance alone, indicated substantial agreement between the machine and the AREDS criterion standard by trained human graders.3,22,23 In contrast to another recently published investigation18 of DCNNs applied to a 12 plus 1-step AREDS detailed scale (in which the additional classes correspond to steps 10, 11, and 12, as well as a class for ungraded images), our study focused on the novel task of estimating the 5-year risk probability using automated identification of a fundus image as 1 of 9 increasingly at-risk steps in AMD progression. Both studies, however, revealed comparable levels of classification performance, considering margins of error. Differences are likely attributable to the differing number of classes (steps 10-12 and ungraded images are not relevant to estimating risk for advanced AMD), data use, and data partitioning (eg, the other investigation18 included stereo pairs, whereas our study did not, mimicking the likely scenario of a single graded image in screening, and partitioned by patient); the other investigation18 also used an ensemble of classifiers. Therefore, direct comparisons of that study and ours cannot precisely be made.

Reading from the confusion matrix for the 9-step classification, the machine had the greatest difficulty correctly classifying patients with severity scores of 3, 2, and 6. The largest percentage misclassification error for the 9-step classification was misclassifying step 2 as step 1. The next 3 largest misclassification errors consisted of step 8 misclassified as step 7, step 9 misclassified as step 8, and step 3 misclassified as step 1. These misclassifications were all made to adjacent classes with similar 5-year risks. Class imbalance and a paucity of training exemplars for some classes likely explain where machine performance decreased: in the 9-step classification, the 4 least represented classes were class 9 (90 images), class 8 (296 images), class 3 (330 images), and class 7 (391 images). On the other hand, the large fraction of class 2 cases misclassified as class 1 may suggest both the intrinsic difficulty of distinguishing class 2 from class 1 and the imbalance between the 2 classes (>3 to 1). These factors should motivate the future use of DL techniques that work with only a few training examples.24
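The per-class error analysis described above can be reproduced mechanically from any confusion matrix by row-normalizing it and ranking the off-diagonal entries. A minimal sketch with a small illustrative matrix (not the study's 9-step matrix):

```python
# Hedged sketch: ranking per-class misclassification rates from a confusion
# matrix. The 3x3 counts below are illustrative only.
import numpy as np

# Rows = criterion-standard class, columns = predicted class.
cm = np.array([
    [90,  8,  2],
    [30, 60, 10],
    [ 5, 15, 80],
])

# Row-normalize to get the fraction of each true class assigned to each
# predicted class, then zero the diagonal to keep only the errors.
rates = cm / cm.sum(axis=1, keepdims=True)
errors = rates.copy()
np.fill_diagonal(errors, 0.0)

# Largest single misclassification rate and the class pair it involves
# (0-based indices converted to 1-based class labels).
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)
print(true_cls + 1, pred_cls + 1, round(float(errors.max()), 2))  # → 2 1 0.3
```

In this toy matrix the dominant error is true class 2 predicted as class 1, mirroring the pattern reported for the 9-step scale, where the largest errors fell between adjacent severity steps.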

In aggregate, these preliminary findings suggest that DL methods may help in fine delineation of fundus and OCT images, which also may help refine retinal diagnostics.25 Furthermore, similar to the 4-step scale, these findings suggest that the 9-step scale from DL may be used in the public domain to screen individuals with referable AMD with greater granularity in predicting progression to advanced AMD, as well as by ophthalmologists assessing an individual's risk of AMD beyond that provided by the simple scale.7 In addition, the greater predictive capability of the 9-step scale may make it more appealing for clinical trials assessing the efficacy of new therapies with small sample sizes, especially when the rate of progression to advanced AMD is higher than that determined by 4-step or simple-scale eligibility criteria, while avoiding the complexity of training individuals to provide reliable and reproducible gradings.

Limitations

A possible limitation of this study is that the AREDS fundus photographs are primarily composed of images from white participants. In addition, as previously noted for the 9-step classification problem, there are large class imbalances among some of the classes. For example, for the 9-step scale, there were 24 411 images classified as step 1 and only 1160 images classified as step 9. To a lesser extent, this issue also exists for the 4-step scale, in which there were a total of 20 801 images classified as AMD-1 and only 9023 images classified as AMD-4.

Conclusions

In summary, DL achieved results comparable to those of a physician, with κw showing substantial agreement with the criterion standard on a large, complex, severity-scale data set, performing well in all categories except those with few training samples. Preliminary results show promise for future use of DL to assist physicians in longitudinal care for individualized, detailed risk assessment, as well as in clinical studies of disease progression during treatment. Deep learning may also eventually be used for public screening or monitoring in developed and developing countries worldwide, assisting in referring individuals to a health care practitioner when indicated and feasible.

Article Information

Accepted for Publication: July 23, 2018.

Corresponding Author: Neil M. Bressler, MD, Wilmer Eye Institute, Johns Hopkins University, 600 N Wolfe St, Maumenee 752, Baltimore, MD 21287-9227 (nmboffice@jhmi.edu).

Published Online: September 14, 2018. doi:10.1001/jamaophthalmol.2018.4118

Author Contributions: Drs Burlina and Joshi had full access to all the data in the study and take full responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Burlina, Joshi, Freund, Bressler.

Acquisition, analysis, or interpretation of data: Burlina, Joshi, Pacheco, Freund, Kong.

Drafting of the manuscript: All authors.

Critical revision of the manuscript for important intellectual content: Burlina, Joshi, Freund, Bressler.

Statistical analysis: Burlina, Joshi, Freund.

Obtained funding: Burlina, Bressler.

Administrative, technical, or material support: Burlina, Joshi, Freund, Kong.

Supervision: Burlina, Freund, Bressler.

Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Drs Burlina, Freund, and Bressler report a patent on a system and method for automated detection of age-related macular degeneration and other retinal abnormalities. No other disclosures were reported.

Funding/Support: This work was supported in part by R21EY024310 from the National Eye Institute, the Johns Hopkins Applied Physics Laboratory, the James P. Gills Professorship, and unrestricted research funds to the Johns Hopkins University School of Medicine Retina Division for Macular Degeneration and Related Diseases Research.

Role of the Funder/Sponsor: The National Eye Institute and Johns Hopkins University had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: Dr Bressler is Editor of JAMA Ophthalmology, but he was not involved in any of the decisions regarding review of the manuscript or its acceptance. The Age-Related Eye Disease Study Database of Genotypes and Phenotypes data set was made available by the National Eye Institute of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Meeting Presentation: This paper was presented at The Retina Society Annual Meeting; September 14, 2018; San Francisco, California.

References
1.
Bressler  NM.  Age-related macular degeneration is the leading cause of blindness.  JAMA. 2004;291(15):1900-1901. doi:10.1001/jama.291.15.1900PubMedGoogle ScholarCrossref
2.
Bressler  NM, Bressler  SB, Congdon  NG,  et al; Age-Related Eye Disease Study Research Group.  Potential public health impact of Age-Related Eye Disease Study results: AREDS report No. 11.  Arch Ophthalmol. 2003;121(11):1621-1624. doi:10.1001/archopht.121.11.1621PubMedGoogle ScholarCrossref
3.
Age-Related Eye Disease Study Research Group.  The Age-Related Eye Disease Study system for classifying age-related macular degeneration from stereoscopic color fundus photographs: the Age-Related Eye Disease Study Report Number 6.  Am J Ophthalmol. 2001;132(5):668-681. doi:10.1016/S0002-9394(01)01218-1PubMedGoogle ScholarCrossref
4.
Age-Related Eye Disease Study Research Group.  A randomized, placebo-controlled, clinical trial of high-dose supplementation with vitamins C and E, beta carotene, and zinc for age-related macular degeneration and vision loss: AREDS report No. 8.  Arch Ophthalmol. 2001;119(10):1417-1436. doi:10.1001/archopht.119.10.1417PubMedGoogle ScholarCrossref
5.
Davis  MD, Gangnon  RE, Lee  LY,  et al; Age-Related Eye Disease Study Group.  The Age-Related Eye Disease Study severity scale for age-related macular degeneration: AREDS report No. 17.  Arch Ophthalmol. 2005;123(11):1484-1498. doi:10.1001/archopht.123.11.1484PubMedGoogle ScholarCrossref
6.
Ying  GS, Maguire  MG, Alexander  J, Martin  RW, Antoszyk  AN; Complications of Age-related Macular Degeneration Prevention Trial Research Group.  Description of the Age-Related Eye Disease Study 9-step severity scale applied to participants in the Complications of Age-Related Macular Degeneration Prevention Trial.  Arch Ophthalmol. 2009;127(9):1147-1151. doi:10.1001/archophthalmol.2009.189PubMedGoogle ScholarCrossref
7.
Ferris  FL, Davis  MD, Clemons  TE,  et al; Age-Related Eye Disease Study (AREDS) Research Group.  A simplified severity scale for age-related macular degeneration: AREDS report No. 18.  Arch Ophthalmol. 2005;123(11):1570-1574. doi:10.1001/archopht.123.11.1570PubMedGoogle ScholarCrossref
8.
Velez-Montoya  R, Oliver  SCN, Olson  JL, Fine  SL, Quiroz-Mercado  H, Mandava  N.  Current knowledge and trends in age-related macular degeneration: genetics, epidemiology, and prevention.  Retina. 2014;34(3):423-441. doi:10.1097/IAE.0000000000000036PubMedGoogle ScholarCrossref
9.
US Department of Commerce; US Census Bureau.  Statistical Abstract of the United States, 2012. Washington, DC: US Census Bureau; 2012.
10.
Juang  R, McVeigh  E, Hoffmann  B, Yuh  D, Burlina  P. Automatic segmentation of the left-ventricular cavity and atrium in 3D ultrasound using graph cuts and the radial symmetry transform. In:  2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro. Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2011:606-609. doi:10.1109/ISBI.2011.5872480
11.
Burlina  P, Freund  DE, Dupas  B, Bressler  N.  Automatic screening of age-related macular degeneration and retinal abnormalities.  Conf Proc IEEE Eng Med Biol Soc. 2011;2011:3962-3966. doi:10.1109/IEMBS.2011.6090984PubMedGoogle Scholar
12.
Burlina  P, Billings  S, Joshi  N, Albayda  J.  Automated diagnosis of myositis from muscle ultrasound: exploring the use of machine learning and deep learning methods.  PLoS One. 2017;12(8):e0184059. doi:10.1371/journal.pone.0184059PubMedGoogle ScholarCrossref
13.
Ting  DSW, Cheung  CY, Lim  G,  et al.  Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes.  JAMA. 2017;318(22):2211-2223. doi:10.1001/jama.2017.18152PubMedGoogle ScholarCrossref
14.
Gulshan  V, Peng  L, Coram  M,  et al.  Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs.  JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216PubMedGoogle ScholarCrossref
15.
Quellec  G, Charrière  K, Boudi  Y, Cochener  B, Lamard  M.  Deep image mining for diabetic retinopathy screening.  Med Image Anal. 2017;39:178-193. doi:10.1016/j.media.2017.04.012PubMedGoogle ScholarCrossref
16.
Gargeya  R, Leng  T.  Automated identification of diabetic retinopathy using deep learning.  Ophthalmology. 2017;124(7):962-969. doi:10.1016/j.ophtha.2017.02.008PubMedGoogle ScholarCrossref
17.
Burlina  PM, Joshi  N, Pekala  M, Pacheco  KD, Freund  DE, Bressler  NM.  Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks.  JAMA Ophthalmol. 2017;135(11):1170-1176. doi:10.1001/jamaophthalmol.2017.3782PubMedGoogle ScholarCrossref
18.
Grassmann  F, Mengelkamp  J, Brandl  C,  et al.  A deep learning algorithm for prediction of Age-Related Eye Disease Study severity scale for age-related macular degeneration from color fundus photography.  Ophthalmology. 2018;125(9):1410-1420. doi:10.1016/j.ophtha.2018.02.037PubMedGoogle ScholarCrossref
19.
Burlina  P, Pacheco  KD, Joshi  N, Freund  DE, Bressler  NM.  Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated AMD analysis.  Comput Biol Med. 2017;82:80-86. doi:10.1016/j.compbiomed.2017.01.018PubMedGoogle ScholarCrossref
20.
He  K, Zhang  X, Ren  S, Sun  J. Deep residual learning for image recognition. In:  Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2016:770-778.
21.
Age-Related Eye Disease Study Research Group.  The Age-Related Eye Disease Study (AREDS): design implications: AREDS report No. 1.  Control Clin Trials. 1999;20(6):573-600. doi:10.1016/S0197-2456(99)00031-8PubMedGoogle ScholarCrossref
23.
Landis  JR, Koch  GG.  The measurement of observer agreement for categorical data.  Biometrics. 1977;33(1):159-174. doi:10.2307/2529310PubMedGoogle ScholarCrossref
24.
Markowitz  J, Schmidt  AC, Burlina  P, Wang  IJ. Hierarchical zero-shot classification with convolutional neural network features and semantic attribute learning. In:  Proceedings of the Fifteenth IAPR International Conference on Machine Vision Applications (MVA). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2017. doi:10.23919/MVA.2017.7986834
25.
Pekala  M, Joshi  N, Freund  DE, Bressler  NM, Cabrera Debuc  D, Burlina  P. Deep learning based retinal OCT segmentation. arXiv. https://arxiv.org/abs/1801.09749. Published January 29, 2018. Accessed August 13, 2018.