Values in parentheses indicate error margins for 95% CIs. Information on the severity scales is given in the AMD Severity Scales subsection of the Methods section, and metrics are defined in the Metrics subsection of the Methods section. DL indicates deep learning.
The boxes extend from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Flier (outlier) points are those past the end of the whiskers.
eMethods. Additional Points of Clarification
eTable. The 9-Step AMD Severity Scale From AREDS Report 17
eFigure 1. Examples of Fundus Images Showing Age-Related Macular Degeneration (AMD) 9-Step Classification Ranging From 1 Through 3 (Left-Right)
eFigure 2. Examples of Fundus Images Showing Age-Related Macular Degeneration (AMD) 9-Step Classification Ranging From 4 Through 6 (Left-Right)
eFigure 3. Examples of Fundus Images Showing Age-Related Macular Degeneration (AMD) 9-Step Classification Ranging From 7 Through 9 (Left-Right)
Burlina PM, Joshi N, Pacheco KD, Freund DE, Kong J, Bressler NM. Use of Deep Learning for Detailed Severity Characterization and Estimation of 5-Year Risk Among Patients With Age-Related Macular Degeneration. JAMA Ophthalmol. 2018;136(12):1359–1366. doi:10.1001/jamaophthalmol.2018.4118
How accurate are deep learning algorithms for characterizing age-related macular degeneration from fundus images, assessing Age-Related Eye Disease Study 4- and 9-step detailed severity scales, and determining 5-year risk of progression to advanced stages of age-related macular degeneration?
This study used 67 401 color fundus images from 4613 study participants of the Age-Related Eye Disease Study data set. Linearly weighted κ scores for estimating 4- and 9-step severity scale scores showed substantial agreement using gradings from highly trained fundus photograph graders as the criterion standard.
The results of this study suggest that deep learning assists in a detailed 9-step severity assessment of age-related macular degeneration and estimating 5-year risk of progression to advanced age-related macular degeneration with reasonable accuracy.
Although deep learning (DL) can identify the intermediate or advanced stages of age-related macular degeneration (AMD) as a binary yes or no, stratified gradings using the more granular Age-Related Eye Disease Study (AREDS) 9-step detailed severity scale for AMD provide more precise estimation of 5-year progression to advanced stages. The AREDS 9-step detailed scale’s complexity and implementation solely with highly trained fundus photograph graders potentially hampered its clinical use, warranting development and use of an alternate AREDS simple scale, which although valuable, has less predictive ability.
To describe DL techniques for the AREDS 9-step detailed severity scale for AMD to estimate 5-year risk probability with reasonable accuracy.
Design, Setting, and Participants
This study used data collected from November 13, 1992, to November 30, 2005, from 4613 study participants of the AREDS data set to develop deep convolutional neural networks that were trained to provide detailed automated AMD grading on several AMD severity classification scales, using a multiclass classification setting. Two AMD severity classification problems using criteria based on 4-step (AMD-1, AMD-2, AMD-3, and AMD-4 from classifications developed for AREDS eligibility criteria) and 9-step (from AREDS detailed severity scale) AMD severity scales were investigated. The performance of these algorithms was compared with a contemporary human grader and against a criterion standard (fundus photograph reading center graders) used at the time of AREDS enrollment and follow-up. Three methods for estimating 5-year risk were developed, including one based on DL regression. Data were analyzed from December 1, 2017, through April 15, 2018.
Main Outcomes and Measures
Weighted κ scores and mean unsigned errors for estimating 5-year risk probability of progression to advanced AMD.
This study used 67 401 color fundus images from the 4613 study participants. The weighted κ scores were 0.77 for the 4-step and 0.74 for the 9-step AMD severity scales. The overall mean estimation error for the 5-year risk ranged from 3.5% to 5.3%.
Conclusions and Relevance
These findings suggest that DL AMD grading has, for the 4-step classification evaluation, performance comparable with that of humans and achieves promising results for providing AMD detailed severity grading (9-step classification), which normally requires highly trained graders, and for estimating 5-year risk of progression to advanced AMD. Use of DL has the potential to assist physicians in longitudinal care for individualized, detailed risk assessment as well as clinical studies of disease progression during treatment or as public screening or monitoring worldwide.
Approximately 8 million people older than 50 years have intermediate-stage age-related macular degeneration (AMD).1 These individuals are at high risk of developing advanced AMD, which if left untreated, is a leading cause of blindness in the United States.1,2 The intermediate stage is defined by large drusen or retinal pigment epithelial abnormalities identified on fundus examination and rigorously quantified using fundus photographs to measure drusen size, drusen area, or pigmentary abnormalities. To quantify AMD severity and study disease progression, the Age-Related Eye Disease Study (AREDS) developed 2 classification scales based on these retinal abnormalities. In particular, a basic 4-step classification scale, which was not based on an analysis of outcome data, was used at entry into AREDS and defined as no AMD (AMD-1), early AMD (AMD-2), intermediate AMD (AMD-3), and advanced AMD (AMD-4).3 A more detailed, 9-step severity scale (eFigures 1-3 in the Supplement), which was based on outcome data, provided predictive variables for 5-year risk of developing choroidal neovascularization (CNV), central geographic atrophy (GA), or both.4-6 This detailed grading of fundus images can be time-consuming and likely limited to highly trained fundus photograph graders. Given the complexity of the 9-step scale, in the absence of trained graders (eg, from fundus photograph reading centers), most physicians probably do not use the AREDS detailed AMD severity scale. A simpler scale was judged to be needed and was developed with more practicality but with less predictive accuracy of 5-year risk.7
People older than 50 years are at greater risk for developing AMD.8 Currently, there are approximately 1.75 million to 3 million people in the United States with the advanced form of AMD and approximately 110 million more in the at-risk population of people older than 50 years.1,2,9 Worldwide, it was estimated in 2000 that there were more than 600 million people older than 60 years, with that number being projected to increase to 2.4 billion by 2050.8 Use of the 9-step scale performed only by highly skilled human graders to determine an individual’s 5-year risk of developing the advanced form of AMD is time and cost prohibitive.
The objective of this study was to develop automated methods that implement the AREDS 4-step AMD eligibility criteria and the 9-step AMD detailed severity scale using modern deep learning (DL) algorithms to automatically evaluate severity of AMD and risk of progression to advanced AMD from fundus photographs. This automated capability could alleviate the issue of identifying, in a timely manner, individuals at various levels of risk of progression in the population. In addition, it could allow for the objective assessment of detailed disease progression in practice or during enrollment and follow-up of clinical trials for AMD.
Deep learning approaches for automated classification from fundus images differ from traditional machine vision, medical image analysis, and automated retinal image analysis methods, which have relied on computing engineered features from the image.10,11 Instead, DL involves feature representations of images with multiple levels of abstraction not by relying on human-identified features but directly from data.12 Advances in DL were made possible through algorithmic optimization and expanded computational power using graphics processing units and have led to the possibility of using DL evaluation of fundus photographs not only for a 2-step system (referable vs not referable) but also for detailed 4- or 9-step systems.11,13-18 More detailed systems could determine whether the patient was referable or not and provide the health care system more granular classification that allows better predictive accuracy in the absence of highly trained human graders. The output could be used for more precise counseling of the patient and identification of patients at very high risk of progression to advanced AMD. Such patients might warrant more detailed or frequent monitoring (such as with optical coherence tomography [OCT] or OCT angiography) or may be the basis of more efficient clinical trials that could test preventive treatments on very high-risk individuals wherein the sample size could be kept small.
The National Institutes of Health AREDS data set, including data collected from individuals from November 13, 1992, to November 30, 2005, from whom written informed consent was obtained, was derived from a 12-year longitudinal study designed to improve understanding of the frequency and risk factors of AMD progression. More than 130 000 field-2 stereoscopic color fundus photographs were captured during AREDS from 4613 study participants at the baseline and follow-up visits. Images were quantitatively graded by trained and certified graders at a fundus photograph reading center.3 The grades assigned to each image were used as a criterion standard for the multiclass classification problems studied herein. To avoid using essentially the same image twice, we handled cases with stereo pairs by removing the image with the poorer quality.17,19 This approach resulted in using a total of 67 401 graded images.17 Use of the AREDS data set was performed after Johns Hopkins University Institutional Review Board approval. Data were analyzed from December 1, 2017, through April 15, 2018.
To perform automated classification, we use DL algorithms known as deep convolutional neural networks (DCNNs) and specifically use the ResNet-50 network.20 The DCNNs use many computational layers that perform convolutions and nonlinear activation operations. This layered approach results in the identification of image features that represent the original image at different levels of abstraction (low-, middle-, and higher-level semantic features).
The 4-step AREDS scale for eligibility criteria in AREDS is an eye-based scale defined as follows: (1) eyes with no or only small drusen (drusen size, <63 μm) and no pigmentation abnormalities classified as normal were given a score of 1; (2) eyes with multiple small drusen or medium-sized drusen (drusen size, ≥63 and <125 μm) and/or pigmentation abnormalities related to AMD classified as early stage AMD were given a score of 2; (3) eyes with large drusen (drusen size, ≥125 μm) or numerous medium-sized drusen and pigmentation abnormalities classified as intermediate AMD were given a score of 3; (4) eyes with lesions associated with CNV or GA (eg, retinal pigment epithelial detachment, subretinal pigment epithelial hemorrhage) classified as advanced AMD3,6,21 were given a score of 4 if the fellow eye did not have central GA or CNV AMD. This scale was used to provide baseline severity levels of AMD in participants enrolling in AREDS. More recently, this 4-step scale was proposed for use in the public domain to identify individuals with intermediate- or advanced-stage AMD who might be referred to health care practitioners for monitoring for the development of advanced-stage AMD and for consideration of dietary supplementation, such as that used in AREDS, to reduce the risk of progression to advanced-stage AMD. For patients with no AMD or early-stage AMD, a referral and consideration of such supplementation might not be indicated.
The 9-step severity scale (eFigures 1-3 in the Supplement), which is based on outcome data from AREDS, is an eye-based scale that is more detailed and refined than the 4-step scale described above; the scale grades intricate quantitative features of total drusen area and pigmentation abnormalities.4-6 Specifically, it combines a 6-step drusen area scale with a 5-step pigmentary abnormality scale to create a 9-step scale with detailed predictability of potential development of advanced AMD in an individual. The 5-year risk of developing the advanced stage, defined as CNV, central GA, or both, increases from approximately 0.3% for an eye at step 1 to 53% for an eye at step 9.5 The eTable in the Supplement provides a detailed description of the quantitative criteria used to define each of the 9 steps and the corresponding 5-year risk factor associated with each step.5 Notations C-1, C-2, I-2, O-2, and 0.5 disc area seen in the eTable in the Supplement refer to standard circles used by trained fundus photograph graders to quantitatively measure areas of drusen and pigment abnormalities.3 Several severity steps have multiple criteria that define the given step, any one of which is sufficient for an eye to be diagnosed at that step. For example, an eye can be diagnosed as step 2 in one of 2 ways: (1) total drusen area greater than or equal to C-1 and less than C-2 and no increased pigment or depigmentation GA or (2) total drusen area less than C-1 and increased depigmentation GA in the questionable category (ie, the grader is at least 50%, but not more than 90%, sure that the abnormality exists) but with less than I-2 and increased pigment in the questionable category. In the AREDS reports, an eye that progressed to central GA (ie, beyond step 9) was given a score of 10, an eye that progressed to CNV was scored 11, and an eye with both central GA and CNV was scored 12. 
Images with a score of 10, 11, or 12 were not included in the analysis of the 9-step severity scale because having any of these stages corresponds trivially to a 5-year risk probability of 1.
For each of the classification problems above (4-step and 9-step scales), a separate multiclass classifier based on a DCNN was trained and tested. Additional details are provided in the eMethods in the Supplement.
Three DL-based methods were used and tested to infer the 5-year risk directly from the fundus image as input. These methods included soft prediction, hard prediction, and regressed prediction.
Soft prediction first estimated 9-step class probabilities ($p_i$, $i = 1, \ldots, 9$) for each of the 9 classes using the DCNN-based classification described above and using the output SoftMax values of the DCNN. Then, the risk estimate $R_{sp}$ was computed as the expected value of the class risk under this probability distribution, $R_{sp} = E_p[R] = \sum_{i=1}^{9} p_i R_i$, where $R_i$ is the risk ascribed to class $i$ (eTable in the Supplement).
As in the previous method, in hard prediction, the 9-step class probabilities ($p_i$, $i = 1, \ldots, 9$) for each of the 9 classes were estimated using the DCNN-based classification described above. The risk $R_{hp}$ was computed as the risk of the class with maximum probability, $R_{hp} = R_{i^*}$, where $i^* = \arg\max_{i = 1, \ldots, 9} p_i$.
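The soft and hard predictions described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names are hypothetical, and the risk table is only partly taken from the text (the values for steps 1, 2, 8, and 9 are those quoted in the Results; the values for steps 3 through 7 are placeholders that would need to be replaced with the eTable values).

```python
# 5-year risk of progression per 9-step class, expressed as fractions.
# Steps 1, 2, 8, and 9 use the values quoted in the text (0.3%, 0.6%, 47.4%, 53.2%).
# Steps 3-7 are PLACEHOLDERS for illustration only, not values from the eTable.
CLASS_RISK = [0.003, 0.006, 0.02, 0.05, 0.06, 0.14, 0.28, 0.474, 0.532]


def soft_prediction(probs, class_risk=CLASS_RISK):
    """Expected risk under the softmax distribution: sum over i of p_i * R_i."""
    if len(probs) != len(class_risk):
        raise ValueError("probs and class_risk must have the same length")
    return sum(p * r for p, r in zip(probs, class_risk))


def hard_prediction(probs, class_risk=CLASS_RISK):
    """Risk of the single most probable class: R_{i*}, with i* = argmax_i p_i."""
    if len(probs) != len(class_risk):
        raise ValueError("probs and class_risk must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_risk[best]


# Hypothetical softmax output split between steps 8 and 9.
example_probs = [0.0] * 7 + [0.6, 0.4]
soft_risk = soft_prediction(example_probs)  # weighted between the two class risks
hard_risk = hard_prediction(example_probs)  # risk of step 8 only
```

When the softmax output is concentrated on one class, the 2 estimates coincide; they diverge when probability is spread across classes with different risks, as in the example.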
Unlike the other 2 methods, regressed prediction skipped the step of predicting the 9-step class for the fundus image and, instead, directly performed DL-based regression by mapping the input image to risk Rrp using a DCNN in regression mode. The specific features of the DCNN are similar to those used for classification (ResNet) with changes for regression (L2 loss function instead of cross-entropy and use of a single node on the last layer).
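The change of loss function named above can be sketched as follows. This is an illustration of the 2 per-image losses only, not the training code used in the study; the function names are hypothetical, and the sketch assumes that the regression target is the 5-year risk associated with the graded step.

```python
import math


def cross_entropy_loss(class_probs, true_class):
    """Classification loss: negative log-probability assigned to the graded class.

    class_probs is the softmax output over classes; true_class is a 0-based index.
    """
    return -math.log(class_probs[true_class])


def l2_loss(predicted_risk, target_risk):
    """Regression loss: squared difference between the single-node output and the target risk."""
    return (predicted_risk - target_risk) ** 2


# Hypothetical values: the network outputs a risk of 0.30 for an image
# whose target risk is assumed to be 0.28.
example_loss = l2_loss(0.30, 0.28)
```

In classification mode the network emits one probability per class, whereas in regression mode a single output node emits the risk estimate directly, so the L2 loss compares 2 scalars.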
The data consisted of AREDS color fundus images that were subdivided using a train, validate, and test split of 88%, 2%, and 10%, respectively. This split is within typical values that are judged to be adequate; the 2% left for validation still contains more than 1000 images. Care was taken that all images for a given study participant were contained wholly within a single partition. As described previously,17 a total of 67 401 images were used for the 4-step classification problem. In the case of the 9-step classification problem, some of the 67 401 images had no value assigned to them for the 9-step scale (ie, missing data). These images were removed, and, as noted above, images with severity scores of 10, 11, or 12 were also removed, resulting in 58 370 images available for the 9-step classification problem.
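A participant-level partition of this kind can be sketched in Python as follows. This is one possible approach under stated assumptions, not the procedure used in the study: the function name, the fixed seed, and the mapping from image identifier to participant identifier are hypothetical, and the fractions are applied to participants, so the image-level proportions are only approximately 88%, 2%, and 10%.

```python
import random


def split_by_participant(image_to_participant, fractions=(0.88, 0.02, 0.10), seed=0):
    """Assign images to train/validate/test so that no participant spans two partitions.

    image_to_participant maps an image identifier to a participant identifier.
    """
    participants = sorted(set(image_to_participant.values()))
    random.Random(seed).shuffle(participants)  # reproducible shuffle

    n_train = round(fractions[0] * len(participants))
    n_validate = round(fractions[1] * len(participants))
    groups = {
        "train": set(participants[:n_train]),
        "validate": set(participants[n_train:n_train + n_validate]),
        "test": set(participants[n_train + n_validate:]),
    }

    split = {name: [] for name in groups}
    for image, participant in image_to_participant.items():
        for name, members in groups.items():
            if participant in members:
                split[name].append(image)
                break
    return split


# Hypothetical data: 100 participants with 3 images each.
example_mapping = {f"img_{p}_{k}": p for p in range(100) for k in range(3)}
example_split = split_by_participant(example_mapping)
```

Splitting on participants rather than on images prevents images of the same eye taken at different visits from appearing in both the training and test partitions.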
Because many of the gradings of AREDS photographs were performed in the 1990s, we also compared the DCNN algorithms with 21st century human performance by having an ophthalmologist independently grade a subset of 5000 AREDS images using the criteria defined for the 4-step AMD severity scale. These grades and the machine-generated grades were each compared with the criterion standard AREDS AMD grades from the fundus photograph reading center.17
The metrics used to assess the quality of performance were linearly weighted κ (κw) score, accuracy, and confusion matrix.3,22,23 Although accuracy is a commonly used metric, κw was a superior measure in this situation (multiclass classification with ordinal classes) for performance evaluation for 2 reasons: κw discounts for chance agreement and weights error based on proximity of classes, whereas accuracy penalizes equally across classes when, for example, erroneously classifying an AMD-1 as an AMD-2 as opposed to AMD-4. For the confusion matrix (Figure 1), our convention used rows to represent the sample count for the true class, and columns provided the classifier prediction counts. Thus, the sum of each row equals the total number of images in each class, and the diagonal elements give the total number the classifier classified correctly in each class. Off-diagonal elements showed the number of images misclassified and how they were misclassified. Accuracies that represent population-specific class distributions can be computed from the confusion matrix.
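The 2 summary metrics can be computed from a confusion matrix laid out as described above (rows for the true class, columns for the predicted class). The following Python sketch uses the standard definition of the linearly weighted κ, with disagreement weights |i − j|/(k − 1) for k ordinal classes; the function names are hypothetical and the matrices in the example are toy data, not results from the study.

```python
def linear_weighted_kappa(confusion):
    """Linearly weighted kappa for a square confusion matrix (rows = true, columns = predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected by chance from the marginals
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 for the most distant classes
            observed += weight * confusion[i][j]
            expected += weight * row_sums[i] * col_sums[j] / total
    if expected == 0:
        raise ValueError("kappa is undefined when all samples fall in a single class")
    return 1.0 - observed / expected


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Toy 3-class matrices with one error each: an adjacent-class error vs a distant-class error.
adjacent_error = [[9, 1, 0], [0, 10, 0], [0, 0, 10]]
distant_error = [[9, 0, 1], [0, 10, 0], [0, 0, 10]]
```

Both toy matrices have the same accuracy, but the weighted κ is higher for the adjacent-class error, which is the property that motivates its use with ordinal severity scales.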
This study used 67 401 color fundus images from the 4613 study participants. Figure 1 reports the results of the multiclass classification for the 4-step and 9-step AMD severity scales. Results of the 4-step classification showed machine performance comparable to that of the ophthalmologist: comparing machine vs human, results for the 4-class classification were a κw of 0.773 and an accuracy of 75.7% vs a κw of 0.753 and an accuracy of 73.8%. Both human and machine had the greatest difficulty correctly classifying AMD-2, with human obtaining correct scores on 463 of 1194 images (38.8%) and the machine on 915 of 1711 images (53.5%). Furthermore, both human and machine made the largest percentage error misclassifying AMD-2 as AMD-1, with humans misclassifying 559 of 1194 images (46.8%) and the machines 585 of 1711 images (34.2%), which should have little clinical relevance. The next largest percentage error for each was misclassifying AMD-4 as AMD-3, with human misclassifying 117 of 653 images (17.9%) and the machine misclassifying 171 of 974 images (17.6%).
For the 9-step classification, κw was 0.738, suggesting substantial agreement with the criterion standard. From the confusion matrix for the 9-step classification, the machine had the greatest difficulty correctly classifying patients with severity scores of 3, 2, and 6, achieving accuracy of 9.7% (32 of 330 images) for step 3, 20.8% (151 of 725 images) for step 2, and 25.2% (122 of 484 images) for step 6. The largest percentage of misclassification error for the 9-step classification was misclassifying step 2 as step 1 (462 of 725 images [63.7%]). The next 3 largest misclassification errors were step 8 misclassified as step 7 (121 of 296 images [40.9%]), step 9 misclassified as step 8 (34 of 90 images [37.8%]), and step 3 misclassified as a step 1 (124 of 330 images [37.6%]). These aforementioned misclassifications were all made to adjacent classes with similar 5-year risk factors. Class imbalance and paucity of training exemplars for some classes appear to partly explain where the machine performance decreased. For example, in the 9-step classification, the 4 least represented classes were class 9 (90 images), class 8 (296 images), class 3 (330 images), and class 7 (391 images). On the other hand, the large fraction of class 2 cases that were misclassified as class 1 may be indicative of the intrinsic difficulty in distinguishing class 2 from class 1 as well as the imbalance between the 2 classes (>3 to 1).
The Table summarizes performance for 5-year risk probability prediction for each of the 3 methods: soft, hard, and regressed prediction. Error distribution is shown for all 3 methods in Figure 2 in which box plots are plotted against each step going from 1 through 9. The x-axis indicates the values of the 5-year risk for each step (rather than the step value itself) as reported previously5 and ranges from 0.3% and 0.6% for steps 1 and 2 to 47.4% and 53.2% for steps 8 and 9. From the top downward, errors are shown for soft, hard, and regressed 5-year risk estimates. The left plots show unsigned errors and the right plots show signed errors. Trends show interquartile values of errors increasing consistently with higher AMD steps values. Likewise, median values of unsigned error increase consistently with higher step value.
The overall mean estimation error for the 5-year risk ranged from 3.47% to 5.29% (Table). The error on 5-year estimated risk was smaller for lower-risk classes. Of the 3 methods, the hard prediction performed best for all classes except for class 3, in which the soft prediction outperformed all (mean [SD] prediction error, 1.92% [2.91%]; median, 1.28%), and class 6, in which the regressed prediction outperformed all (mean [SD] prediction error, 7.67% [5.37%]; median, 6.69%). For class 7, the soft prediction had the smallest mean (5.45%), whereas the hard prediction had the smallest median (0%).
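The signed and unsigned errors summarized above can be computed as in the following Python sketch. The function name and the example values are hypothetical, and the sketch assumes that the reference for each image is the 5-year risk associated with its criterion standard step.

```python
from statistics import mean, median


def risk_errors(predicted, reference):
    """Summarize signed and unsigned errors between predicted and reference 5-year risks."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    signed = [p - r for p, r in zip(predicted, reference)]
    unsigned = [abs(e) for e in signed]
    return {
        "mean_unsigned": mean(unsigned),
        "median_unsigned": median(unsigned),
        "mean_signed": mean(signed),  # positive values indicate overestimated risk
    }


# Hypothetical predictions for 3 images whose reference risk is 0.20.
example_summary = risk_errors([0.10, 0.30, 0.20], [0.20, 0.20, 0.20])
```

Signed errors can cancel when overestimates and underestimates are balanced, which is why the unsigned error is the more informative measure of estimation accuracy.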
For the 4-step classification used for eligibility criteria in AREDS, this study matched human performance as seen previously in a 2-step classification.17 For the classification based on a 4-step severity scale, the machine performed on par with a 21st-century ophthalmologist's grading, and, based on κw, both showed substantial agreement with the AREDS criterion standard grading.3,23 Both human and machine had the greatest difficulty correctly classifying AMD-2. Specifically, both human and machine made the largest percentage error misclassifying AMD-2 as AMD-1, which should have little clinical relevance. The next largest percentage error for each was misclassifying AMD-4 as AMD-3. All misclassifications assume that the human fundus photograph grading, as the criterion standard, was always correct.
For the 9-step severity scale, which is based on outcome data, machine classification was compared with prior gradings by AREDS fundus photograph graders. Because of the increased number of classes, the total accuracy for the 9-step classification, as expected, was inferior to that for the 4-step classification. Nevertheless, the κw in this investigation, which took into account errors made to adjacent classes and assessed agreement beyond chance alone, indicated substantial agreement between the machine and the AREDS criterion standard by trained human graders.3,22,23 In contrast to another recently published investigation18 of DCNNs applied to a 12 plus 1-step AREDS detailed scale (in which the additional classes correspond to steps 10, 11, and 12, as well as a class for ungraded images), our study focused on the novel task of estimating the 5-year risk probability using automated identification of a fundus image as 1 of 9 increasingly at-risk steps in AMD progression. Both studies, however, revealed corresponding levels of classification performance, considering margins of error, with differences likely attributable to differing number of classes (steps 10-12 and ungraded images are not relevant to estimating risk for advanced AMD), data use, and data partitioning (eg, the other investigation18 included stereo pairs, but our study did not, mimicking the likely screening scenario of a single graded image, and performed patient partitioning). In addition, the other investigation18 used an ensemble of classifiers; therefore, direct comparisons of that study and ours cannot precisely be made.
Reading from the confusion matrix for the 9-step classification, the machine had the greatest difficulty correctly classifying patients with severity scores of 3, 2, and 6. The largest percentage of misclassification error for the 9-step classification was misclassifying step 2 as step 1. The next 3 largest misclassification errors consisted of step 8 misclassified as step 7, step 9 misclassified as step 8, and step 3 misclassified as step 1. These above-mentioned misclassifications were all made to adjacent classes with similar 5-year risk factors. Class imbalance and paucity of training exemplars for some classes likely explain where the machine performance decreased: in the 9-step classification, the 4 least represented classes were class 9 (90 images), class 8 (296 images), class 3 (330 images), and class 7 (391 images). On the other hand, the large fraction of class 2 cases that were misclassified as class 1 may be suggestive of the intrinsic difficulty in distinguishing class 2 from class 1 as well as the imbalance between the 2 classes (>3 to 1). These factors should motivate the future use of DL techniques that work with only a few training examples.24
In aggregate, preliminary findings suggest that DL methods may help in fine delineation of fundus and OCT, which also may help refine retinal diagnostics.25 Furthermore, similar to the 4-step scale, these findings suggest that the 9-step scale from DL may be used in the public domain to screen individuals with referable AMD with greater granularity of predictability for progression to advanced AMD as well as for use by ophthalmologists assessing risk of AMD in an individual beyond that provided by the simple scale.7 In addition, the greater predictive capability of the 9-step scale may make it more appealing for use in clinical trials to assess efficacy of new therapies that have small sample sizes, especially when the rate of progression to advanced AMD is higher than that determined by 4-step or simple scale eligibility criteria, while avoiding the complexity of training individuals to provide reliable and reproducible gradings.
A possible limitation of this study is that the AREDS fundus photographs are primarily composed of images from white participants. In addition, as previously noted for the 9-step classification problem, there are large class imbalances among some of the classes. For example, for the 9-step scale, there were 24 411 images classified as step 1 and only 1160 images classified as step 9. To a lesser extent, this issue also exists for the 4-step scale, in which there were a total of 20 801 images classified as AMD-1 and only 9023 AMD-4 images.
In summary, DL achieved results comparable to those of a physician and κw showing substantial agreement with the criterion standard on a large, complex, severity-scale data set, performing well in all categories except those with few training samples. Preliminary results show promise for future use of DL to assist physicians in longitudinal care for individualized, detailed risk assessment as well as clinical studies of disease progression during treatment. Deep learning may also eventually be used for public screening or monitoring in developed and developing countries worldwide that could assist in referring individuals to a health care practitioner when indicated and feasible.
Accepted for Publication: July 23, 2018.
Corresponding Author: Neil M. Bressler, MD, Wilmer Eye Institute, Johns Hopkins University, 600 N Wolfe St, Maumenee 752, Baltimore, MD 21287-9227 (email@example.com).
Published Online: September 14, 2018. doi:10.1001/jamaophthalmol.2018.4118
Author Contributions: Drs Burlina and Joshi had full access to all the data in the study and take full responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Burlina, Joshi, Freund, Bressler.
Acquisition, analysis, or interpretation of data: Burlina, Joshi, Pacheco, Freund, Kong.
Drafting of the manuscript: All authors.
Critical revision of the manuscript for important intellectual content: Burlina, Joshi, Freund, Bressler.
Statistical analysis: Burlina, Joshi, Freund.
Obtained funding: Burlina, Bressler.
Administrative, technical, or material support: Burlina, Joshi, Freund, Kong.
Supervision: Burlina, Freund, Bressler.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Drs Burlina, Freund, and Bressler report a patent on a system and method for automated detection of age-related macular degeneration and other retinal abnormalities. No other disclosures were reported.
Funding/Support: This work was supported in part by R21EY024310 from the National Eye Institute, the Johns Hopkins Applied Physics Laboratory, the James P. Gills Professorship, and unrestricted research funds to the Johns Hopkins University School of Medicine Retina Division for Macular Degeneration and Related Diseases Research.
Role of the Funder/Sponsor: The National Eye Institute and Johns Hopkins University had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Disclaimer: Dr Bressler is Editor of JAMA Ophthalmology, but he was not involved in any of the decisions regarding review of the manuscript or its acceptance. The Age-Related Eye Disease Study Database of Genotypes and Phenotypes data set was made available by the National Eye Institute of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Meeting Presentation: This paper was presented at The Retina Society Annual Meeting; September 14, 2018; San Francisco, California.