A, category 1 or no AMD; B, category 2 or early AMD; C, category 3, intermediate AMD; and D, category 4 or advanced AMD.
Receiver operating characteristic curves for all experiments and algorithms showing also the corresponding area under the curve values.
A indicates algorithm A; AUC, area under the curve; DCNN, deep convolutional neural networks; NS, no stereo; NSG, no stereo gradable; PP, patient partitioning; SP, standard partitioning; WS, with stereo pairs; U, algorithm U.
Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated Grading of Age-Related Macular Degeneration From Color Fundus Images Using Deep Convolutional Neural Networks. JAMA Ophthalmol. 2017;135(11):1170–1176. doi:10.1001/jamaophthalmol.2017.3782
When applying deep learning methods to the automated assessment of fundus images, what is the accuracy for detecting age-related macular degeneration?
This study found that the deep convolutional neural network method ranged in accuracy (SD) between 88.4% (0.5%) and 91.6% (0.1%), with kappa scores close to or greater than 0.8, which is comparable with human expert performance levels.
The results suggest that deep learning–based machine grading can be leveraged successfully to automatically assess age-related macular degeneration from fundus images in a way that is comparable with the human ability to grade age-related macular degeneration from these images.
Importance
Age-related macular degeneration (AMD) affects millions of people throughout the world. The intermediate stage may go undetected, as it typically is asymptomatic. However, the preferred practice patterns for AMD recommend identifying individuals with this stage of the disease to educate them on how to monitor for the early detection of the choroidal neovascular stage before substantial vision loss has occurred and to consider dietary supplements that might reduce the risk of the disease progressing from the intermediate to the advanced stage. Identification, though, can be time-intensive and requires expertly trained individuals.
Objective
To develop methods for automatically detecting AMD from fundus images using a novel application of deep learning methods to the automated assessment of these images and to leverage artificial intelligence advances.
Design, Setting, and Participants
Deep convolutional neural networks that are explicitly trained for performing automated AMD grading were compared with an alternate deep learning method that used transfer learning and universal features and with a trained clinical grader. Age-related macular degeneration automated detection was applied to a 2-class classification problem in which the task was to distinguish the disease-free/early stages from the referable intermediate/advanced stages. In several experiments that entailed different data partitioning, the machine algorithms and human graders were evaluated on over 130 000 images from 4613 patients, deidentified with respect to age, sex, and race/ethnicity, against a gold standard included in the National Institutes of Health Age-related Eye Disease Study data set.
Main Outcomes and Measures
Accuracy, receiver operating characteristics and area under the curve, and kappa score.
Results
The deep convolutional neural network method yielded accuracy (SD) that ranged between 88.4% (0.5%) and 91.6% (0.1%), the area under the receiver operating characteristic curve was between 0.94 and 0.96, and kappa coefficient (SD) between 0.764 (0.010) and 0.829 (0.003), which indicated a substantial agreement with the gold standard Age-related Eye Disease Study data set.
Conclusions and Relevance
Applying a deep learning–based automated assessment of AMD from fundus images can produce results that are similar to human performance levels. This study demonstrates that automated algorithms could play a role that is independent of expert human graders in the current management of AMD and could address the costs of screening or monitoring, access to health care, and the assessment of novel treatments that address the development or progression of AMD.
Age-related macular degeneration (AMD) is associated with the presence of drusen, long-spacing collagen, and phospholipid vesicles between the basement membrane of the retinal pigment epithelium and the remainder of the Bruch membrane.1 The intermediate stage of AMD, which often causes no visual deficit, includes eyes with many medium-sized drusen (the greatest linear dimension ranging from 63 µm to 125 µm) or at least 1 large druse (greater than 125 µm) or geographic atrophy (GA) of the retinal pigment epithelium that does not involve the fovea.1
The intermediate stage often leads to the advanced stage, in which substantial damage to the macula can occur from choroidal neovascularization, also termed the wet advanced form, or GA that involves the center of the macula, which is termed the dry advanced form. Choroidal neovascularization, when not treated, often leads to the loss of central visual acuity,2 which affects daily activities like reading, driving, or recognizing objects. Consequently, the advanced stage can pose a substantial socioeconomic burden on society.3 Age-related macular degeneration is the leading cause of central vision loss among people older than 50 years in the United States; approximately 1.75 million to 3 million individuals have the advanced stage.3-5
While AMD currently has no definite cure, the Age-related Eye Disease Study (AREDS) has suggested benefits of specific dietary supplements for slowing AMD progression among individuals with the intermediate stage in at least 1 eye or the advanced stage only in 1 eye.6 Additionally, vision loss because of choroidal neovascularization can be reversed, stopped, or slowed by administering antivascular endothelial growth factor intravitreous injections.7 Ideally, individuals with the intermediate stage of AMD should be identified, even if asymptomatic, and referred to an ophthalmologist who can monitor for the development and subsequent treatment of choroidal neovascularization. Manual screening of the entire at-risk population of individuals older than 50 years for the development of the intermediate stage of AMD in the United States is not realistic because the at-risk population is large (more than 110 million).8 Screening also is not feasible in US health care environments that have poor access to experts who can identify the development of the intermediate stage of AMD. These same issues may be more pronounced in low- and middle-income countries. Therefore, automated AMD diagnostic algorithms, which identify the intermediate stage of AMD, are a worthy goal for future automated screening solutions for major eye diseases.
While no treatment comparable with antivascular endothelial growth factor currently exists for GA, numerous clinical trials are being conducted to identify treatments for slowing GA growth.9-12 Automated algorithms may play a role in assessing treatment efficacy, in which it is critical to quantify disease worsening objectively under therapy; careful manual grading of this by clinicians can be costly and subjective.
Past algorithms for automated retinal image analysis generally relied on traditional approaches that consisted of manually selecting engineered image features (eg, wavelets, scale-invariant feature transform13-15) that were then used in a classifier13-20 (eg, support vector machines [SVM]15,16 or random forests14). By contrast, deep learning (DL) methods17,21-29 learn task-specific image features with multiple levels of abstraction without relying on manual feature selection. Recent advances in DL have improved performance levels dramatically for numerous image analysis tasks. This progress was enabled by many factors (eg, novel methods to train very deep networks or using graphic processing units).22-26 Recently, DL has been used for conducting retinal image analyses, including tasks such as classifying referable diabetic retinopathy.27,28 A previous study17,21 reported on the use of deep universal features/transfer learning for automated AMD grading. The new study expanded on the previous study by using a data set that is approximately 10 to 20 times larger, using the full scope of deep convolutional neural networks (DCNN).
This study aimed to solve a 2-class AMD classification problem, classifying fundus images of individuals who have either no or early stage AMD (for which dietary supplements and monitoring for progression to advanced AMD are not considered) vs those with intermediate or advanced stage AMD, for which supplements, monitoring, or both are considered. It leveraged DL and DCNN. The primary goal of this study was to measure and compare the performance of the proposed DL method with that of a human clinician; a secondary goal was to compare the performance of 2 DL approaches that entailed different levels of computational effort regarding training.
Our study used the National Institutes of Health AREDS data set collected over a 12-year period. The Age-related Eye Disease Study originally was designed to improve understanding of AMD worsening, treatment, and risk factors for worsening. It includes over 130 000 color fundus images from 4613 patients that were taken with informed consent obtained at each of the clinical sites (Table 1). Color fundus photographs were captured of each patient at baseline and follow-up visits and were subsequently digitized. These images included stereo pairs taken from both eyes. Images were carefully and quantitatively graded by experts for identifying AMD at a US fundus photograph-reading center.2 Graders used graduated circles to measure the location and area of drusen and other retinal abnormalities (eg, retinal elevation and pigment abnormalities) in the fundus images to determine the AMD severity level.2 Each image was then assigned by graders to a category reflecting AMD severity that ranged from 1 to 4, with 1 = no AMD, 2 = early stage, 3 = intermediate stage, and 4 = advanced stage (Figure 1). These severity grades were used as a “gold standard” in our study for performing a 2-class classification of no or early stage AMD (here referred to as class 0) vs potentially referable (intermediate or advanced) stage (class 1). The Age-related Eye Disease Study is a public data set that can be made available on request to the National Institutes of Health.
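The mapping from the 4 AREDS severity categories to the 2 classes described above can be written as a short Python sketch. The function name and the example grades are illustrative assumptions and are not part of the AREDS data set or the study code.

```python
# AREDS severity categories as described in the text:
# 1 = no AMD, 2 = early, 3 = intermediate, 4 = advanced.
REFERABLE_CATEGORIES = {3, 4}


def to_binary_class(category: int) -> int:
    """Map an AREDS category (1-4) to class 0 (not referable) or class 1 (referable)."""
    if category not in (1, 2, 3, 4):
        raise ValueError(f"unexpected AREDS category: {category}")
    return 1 if category in REFERABLE_CATEGORIES else 0


# Made-up example grades for illustration only (not AREDS data).
example_grades = [1, 2, 3, 4, 2, 3]
example_labels = [to_binary_class(g) for g in example_grades]
print(example_labels)
```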
This study used DCNNs. A DCNN is a deep neural network that consists of many repeated processing layers that take as input fundus images that are processed via a cascade of operations with the goal of producing an output class label for each image.23,25,26 One way to think about DCNNs is that they match the input image with successive convolutional filters to generate low-, mid-, and high-level representations (ie, features) of the input image. Deep convolutional neural networks also include layers that pool features together spatially, perform nonlinear operations at various levels, combine these via fully connected layers, and output a final probability value for the class label (here the AMD-referable vs not referable classification). A DCNN is trained to discover and optimize the weights of the convolutional filters that produce these image features via a backpropagation process. This optimization is done directly by using the training images. Therefore, this process is considered to be a data-driven approach and contrasts with past approaches to processing and analyzing fundus imagery that have used engineered features that resulted from an ad hoc, manual, and therefore possibly suboptimal algorithmic design and selection of such features. While the workings of DCNNs are simple to grasp at a notional level, there is currently extensive research being conducted to understand, improve, and extend the current state of the art.
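As a toy illustration of the building blocks named above (a convolutional filter, a nonlinear operation, and spatial pooling), the following pure-Python sketch applies one hand-picked filter to a tiny made-up image. It is not the network used in the study: a DCNN learns its filter weights by backpropagation and stacks many such layers, and the image, filter, and function names here are assumptions chosen for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide a filter over a 2-D image without padding and return the feature map.

    As in most deep learning frameworks, this is a cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit: negative responses are set to 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2x2(feature_map):
    """Keep the largest response in each non-overlapping 2 x 2 window."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]) - 1, 2)
        ]
        for i in range(0, len(feature_map) - 1, 2)
    ]


# Toy 5 x 5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked horizontal-difference filter; a trained DCNN would learn its filters.
edge_filter = [[-1, 1]]

pooled = max_pool2x2(relu(conv2d_valid(image, edge_filter)))
print(pooled)
```

The pooled map responds only in the region that contains the edge, which is the sense in which successive filters produce features of the input image.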
We used the AlexNet (University of Toronto) DCNN model (here called DCNN-A)23 in which the weights of all layers of the network are optimized via training to solve the referable AMD classification problem. This training process involved optimizing over 61 million convolutional filter weights. In addition to the layers mentioned above, this network included dropout, rectified linear unit activation, and contrast normalization steps.23 The dropout step consisted of setting to 0 some of the neuron outputs (chosen randomly), with the effect of encouraging functional redundancy in the network and acting as a regularizer. Our implementation incorporated the Keras and TensorFlow DL frameworks. It used stochastic gradient descent with Nesterov momentum, with an initial learning rate that was set to 0.001. The training scheme used an early stopping mechanism that terminated training after 50 epochs of no improvement of the validation accuracy.23
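The 2 training mechanisms mentioned above, a Nesterov momentum update and early stopping, can be sketched in plain Python without any DL framework. The learning rate of 0.001 and the patience of 50 epochs come from the text; the momentum coefficient of 0.9, the scalar weight, and the function names are assumptions made for illustration, and this is not the Keras/TensorFlow code used in the study.

```python
def nesterov_step(w, v, grad_fn, lr=0.001, momentum=0.9):
    """One stochastic gradient descent update with Nesterov momentum (scalar weight).

    lr=0.001 is the initial learning rate stated in the text;
    momentum=0.9 is an assumed value that the text does not give.
    """
    lookahead = w + momentum * v              # evaluate the gradient at the look-ahead point
    v_new = momentum * v - lr * grad_fn(lookahead)
    return w + v_new, v_new


def early_stopping_epoch(val_accuracies, patience=50):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation accuracies.

    Training stops once `patience` consecutive epochs have passed without
    improving on the best validation accuracy seen so far.
    """
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_accuracies) - 1, best_epoch


# Toy objective f(w) = w^2 with gradient 2w, used only to exercise the update rule.
w, v = 1.0, 0.0
for _ in range(2000):
    w, v = nesterov_step(w, v, lambda x: 2.0 * x)
print(w)

# Made-up validation curve: improves for 2 epochs, then plateaus.
curve = [0.50, 0.60] + [0.55] * 100
print(early_stopping_epoch(curve, patience=50))
```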
For comparison, this study also used another DL approach that focused on reusing a pretrained DCNN and performing transfer learning.21,30 The idea behind transfer learning is to exploit knowledge that is learned from one source task that has a relative abundance of training data (general images of animals, food, etc.) to allow for learning in an alternative target task (AMD classification on fundus images). Here, universal features were computed by using a pretrained DCNN to solve a general classification problem on a large set of images and reuse these features for the AMD task. Our approach17,21 used the pretrained OverFeat (New York University)24 DCNN which was pretrained on more than a million natural images to produce a 4096 dimension feature vector, which was then used to retrain a linear SVM (LSVM)17,21,24 for our specific AMD classification problem from fundus images. We call this method DCNN-U.
The 2 methods (DCNN-A and DCNN-U) used a preprocessing of the input fundus image by detecting the outer boundaries of the retina, cropping images to the square that was inscribed within the retinal boundary, and resizing the square to fit the expected input size of AlexNet or OverFeat DCNNs. Additionally, DCNN-U used a multigrid approach in which the cropped image was coupled with 2 concentric square subimages that were centered in the middle of the inscribed image. The resulting 3 images (the cropped image plus 2 centered subimages) were then fed to the OverFeat DCNN to produce 3 feature vectors of length 4096 (1 for the cropped image and 2 additional vectors for the subimages). The 3 feature vectors for the image were then concatenated to generate a single 12 288-sized feature vector as input to the LSVM. This method is further detailed in previous reports.17,21
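The geometry of this preprocessing and the size of the concatenated feature vector can be checked with a short Python sketch. It assumes that the retinal boundary is approximated by a circle with a known center and radius, and the subimage scales (0.75 and 0.5) are hypothetical because the text does not give them; the feature vectors are placeholders rather than OverFeat outputs.

```python
import math

FEATURE_DIM = 4096  # per-image feature length stated in the text


def inscribed_square(cx, cy, radius):
    """Axis-aligned square inscribed in a circle, returned as (left, top, side)."""
    side = radius * math.sqrt(2)
    return cx - side / 2, cy - side / 2, side


def concentric_subsquare(square, scale):
    """Smaller square sharing the same center; scale is a fraction of the side."""
    left, top, side = square
    new_side = side * scale
    offset = (side - new_side) / 2
    return left + offset, top + offset, new_side


def multigrid_boxes(cx, cy, radius, scales=(0.75, 0.5)):
    """Cropped image plus 2 concentric subimages (the scales are hypothetical)."""
    full = inscribed_square(cx, cy, radius)
    return [full] + [concentric_subsquare(full, s) for s in scales]


def concatenate(feature_vectors):
    """Join per-image feature vectors into one classifier input."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


boxes = multigrid_boxes(cx=1000.0, cy=800.0, radius=700.0)
# Placeholder all-zero features standing in for the pretrained network's outputs.
features = [[0.0] * FEATURE_DIM for _ in boxes]
print(len(concatenate(features)))
```

With 3 images and 4096 features per image, the concatenated vector has 3 × 4096 = 12 288 entries, which matches the size given in the text.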
This study considered several experiments that used the entire AREDS fundus image data set as well as different subsets of AREDS. It also used different partitionings and groupings of the AREDS image data set. The different subsets of AREDS used are described here. The set of all AREDS images (133 821) was used, including stereo pairs (ensuring that stereo pairs from the same eye did not appear in the training and testing data sets). We called this set WS for “with stereo pairs.” We called the next set NS for “no stereo.” In this data set, only 1 of the stereo images was kept from each eye, which resulted in 67 401 images. We called the next set NSG for “no stereo, gradable.” Because AREDS images are collected under a variety of conditions (eg, lighting or eye orientation) and therefore are not of uniform quality, an ophthalmologist (KP) was tasked to annotate a subset of images (n = 7775, 5.8%) for “gradability” as a basic measure of fundus image quality. Subsequently, a machine learning method was used to extend the index of gradability over the entire image data set NS to exclude automatically the most egregious low-quality images. The NSG was derived from NS by removing the 458 images with the smallest “gradability” index (0.34% of all AREDS images, or approximately 0.68% of NS). The final set was called H for human. For comparison with human performance levels, we tasked a physician to independently and manually grade a subset of AREDS images (n = 5000, 3.7%). The grades that were generated by the physician and the machine were compared with the AREDS gold standard AMD scores. The number of images that were used in each set, broken down by class, is reported in Table 1.
These data sets were further subdivided into training and testing subsets. We used a conventional K-fold crossvalidation performance evaluation method, with K = 5, in which 4 folds were used for training and 1 was used for testing (with a rotation of the folds). Additionally, because images from patients were collected over multiple visits, and because DCNN performance depends on having as large a number as possible of patient examples, we considered 2 types of experiments that corresponded to 2 types of data grouping and partitioning. In the baseline partitioning method (termed standard partitioning [SP]), images taken at each patient visit (occurring approximately every 2 years) were considered unique. For SP, when both stereo pairs were used (WS), care was taken that they always appeared together in the same fold. In a second partitioning method (termed patient partitioning [PP]), we ensured that all images of the same patient appeared in the same fold. Standard partitioning views each patient visit as a unique entity, while PP considers that each patient (not each visit) forms a unique entity. Therefore, PP is a more stringent partitioning method that provides fewer patients to the classifier to train on; any patient with a highly abnormal or atypical retina will be represented in only one of the folds.
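Patient partitioning can be illustrated with a minimal Python sketch that assigns folds per patient rather than per image, so that no patient contributes images to both the training and testing subsets. The round-robin assignment, the function names, and the patient identifiers are assumptions made for illustration; the study does not describe how patients were allocated to folds.

```python
def patient_partition(image_patient_ids, k=5):
    """Assign each image to 1 of k folds so that all images of a patient share a fold.

    Patients are allocated round-robin in sorted order (an assumed rule);
    a real experiment would typically shuffle the patients first.
    """
    patients = sorted(set(image_patient_ids))
    fold_of_patient = {p: i % k for i, p in enumerate(patients)}
    return [fold_of_patient[p] for p in image_patient_ids]


def train_test_indices(folds, test_fold):
    """Split image indices into training (k - 1 folds) and testing (1 fold)."""
    train = [i for i, f in enumerate(folds) if f != test_fold]
    test = [i for i, f in enumerate(folds) if f == test_fold]
    return train, test


# Made-up patient identifiers, 1 entry per image (several images per patient).
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6", "p6"]
folds = patient_partition(ids, k=5)

for test_fold in range(5):  # rotation of the folds
    train, test = train_test_indices(folds, test_fold)
    train_patients = {ids[i] for i in train}
    test_patients = {ids[i] for i in test}
    print(test_fold, sorted(test_patients), train_patients.isdisjoint(test_patients))
```

Under standard partitioning, by contrast, the unit assigned to a fold would be the patient visit, so images of the same patient from different visits could fall on both sides of the split.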
The performance metrics used included accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and kappa score, which accounts for the possibility of agreement by chance.1,31 Because any classifier trades off between sensitivity and specificity, to compare methods we used receiver operating characteristic (ROC) curves that plot the detection probability (ie, sensitivity) vs the false alarm rate (ie, 100% minus specificity) for each algorithm/experiment. To compare with human performance levels, we also showed the operating point that demonstrated the human clinician operating performance level. We also computed the area under the curve for each algorithm/experiment.
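For readers who want to see how these metrics relate to one another, the following Python sketch computes them from the 4 cells of a 2 × 2 confusion matrix and computes the area under the ROC curve from classifier scores using the rank-based (Mann-Whitney) formulation. The counts and scores are made up for illustration and are not results from this study; the function names are assumptions.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, predictive values, and Cohen kappa."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


def roc_auc(labels, scores):
    """Area under the ROC curve: probability that a class 1 image outscores a class 0 image."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up confusion-matrix counts (not study data).
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Made-up labels and scores (not study data).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In this toy example the accuracy is 0.85 but the kappa is 0.70, because half of the observed agreement would be expected by chance given the marginal totals.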
The experiments used the AREDS fundus images with the different subsets and partitioning that were previously explained. Performance levels are reported in Table 2 (SP) and Table 3 (PP) for sets H, WS, NS, and NSG, and for the 2 algorithms (DCNN-A and DCNN-U) and the human performance levels. Receiver operating characteristic curves and areas under the curve are reported in Figure 2.
In aggregate, performance results for both DL approaches show promising outcomes when considering all metrics. Accuracy (SD) ranged from 90.0% (0.6%) to 91.6% (0.1%) for DCNN-A (Table 2) and 88.4% (0.5%) to 88.8% (0.7%) (Table 3); for DCNN-U, it ranged from 83.2% (0.2%) to 83.9% (0.4%) (Table 2) and 82.4% (0.5%) to 83.1% (0.5%) (Table 3). As seen in Table 2, Table 3, and the ROCs, DCNN-A consistently outperformed DCNN-U. This can be explained by the fact that DCNN-A was specifically trained to solve the AMD classification problem by optimizing all of the DCNN weights over all layers of the network while for DCNN-U, with its simpler training requirement, the training only affected the final (LSVM) classification stage.
Table 2 and Table 3 also suggest that the DCNN-A results are comparable to human performance levels. Based on accuracy and kappa scores, in Table 3, DCNN-A performance (a = 88.7% [0.7]; κ = 0.770 [0.013]) is close to or comparable with human performance levels (a = 90.2% and κ = 0.800), and in Table 2 it slightly exceeds the human performance levels (a = 91.6% [0.1], κ = 0.829 [0.003]). In Tables 2 and 3, the kappa scores for DCNN-A (κ, 0.764 [0.010]-0.829 [0.003]) and the human grader (κ = 0.800) show substantial to near perfect agreement with the AREDS AMD gold standard grading, while DCNN-U exhibits substantial agreement (κ, 0.636 [0.011]-0.700 [0.008]). Receiver operating characteristic curves also show similar human and machine performance levels. The other metrics in Table 2 and Table 3 also echo these observations.
To test algorithms on images that are representative of the quality that one would expect in actual practice, we did not perform extensive eliminations of images based on their quality. In particular, data sets WS and NS used all images while data set NSG removed only 458 (approximately 0.68%) of the worst-quality images. When looking at the performance of NSG vs NS, there was a small but measurable improvement in performance levels, as seen when comparing the DCNN-A accuracy of 90.7% for NSG vs 90.0% for NS (Table 2).
Experiments that used PP showed a small degradation in performance levels when compared with experiments that used SP. This is because, for patient partitioning, the classifier was trained on 923 fewer patients (20%). The performance in SP was reflective of a scenario in which training would take advantage of knowledge that was gained during a longitudinal study, vs PP experiments that take a strict view on grouping to remove any possible correlation between fundus images across visits. In aggregate, after accounting for network and partition differences, the results that were obtained for WS, NSG, and NS were close, with a preference for WS (since there were more data to train from) and NSG (because some low-quality images were removed) over NS. For example, DCNN-A accuracies (SD) are 91.6% (0.1%) (WS), 90.7% (0.5%) (NSG), and 90.0% (0.6%) (NS) (Table 2).
We described using DL methods for the automated assessment of AMD from color fundus images. These experimental results show promising performance levels in which deep convolutional neural networks appear to perform a screening function that has clinical relevance with performance levels that are comparable with physicians. Specifically, the AREDS data set is, to our knowledge, the largest annotated fundus image data set that is currently available for AMD. Therefore, this study may constitute a useful baseline for future machine-learning methods to be applied to AMD.
One limitation of this data set is a mild class imbalance regarding the number of fundus images in class 1 vs 0, which may have a moderate effect on performance levels. Another potential limitation is that this data set uses digitized images that were taken from analog photographs. This possibly can negatively affect quality and machine performance when compared with digital fundus acquisition, but this possibility cannot be determined from this investigation because none of the images were digital.
Another limitation of this study is that it relies exclusively on AREDS and does not make use of a separately collected clinical data set for performance evaluation, as was done in the diabetic retinopathy studies27 (eg, training a model on EyePACS [EyePACS LLC] and testing on Methods to Evaluate Segmentation and Indexing Techniques in the Field of Retinal Ophthalmology [MESSIDOR]). The situation is different, however, for AMD in which there is currently no large reference clinical data set for use other than AREDS.
Future clinical translation of DL approaches would require validation on separate clinical data sets and using more human clinicians for comparison. While this study offers a promising foray into using DL for automated AMD analysis, future work could involve using more sophisticated networks to improve performance, expanding to lesion delineation and exploiting other modalities (eg, optical coherence tomography).
This study showed that automated algorithms can play a role in addressing several clinically relevant challenges in the management of AMD, including cost of screening, access to health care, and the assessment of novel treatments. The results of this study, using more than 130 000 images from AREDS, suggest that new DL algorithms can perform a screening function that has clinical relevance with results similar to human performance levels to help find individuals that likely should be referred to an ophthalmologist in the management of AMD. This approach could be used to distinguish among various retinal pathologies and subsequently classify the severity level within the identified pathology.
Corresponding Author: Neil M. Bressler, MD, Wilmer Eye Institute, Johns Hopkins University, 600 N Wolfe St, Maumenee 752, Baltimore, MD 21287-9227 (firstname.lastname@example.org).
Accepted for Publication: August 1, 2017.
Published Online: September 28, 2017. doi:10.1001/jamaophthalmol.2017.3782
Author Contributions: Dr Burlina and Mr Joshi had full access to all the data in the study and take full responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Burlina, Joshi, Pacheco, Freund, Bressler.
Acquisition, analysis, or interpretation of data: Burlina, Joshi, Pacheco, Freund, Bressler.
Drafting of the manuscript: Burlina, Joshi, Pekala, Pacheco, Freund.
Critical revision of the manuscript for important intellectual content: Burlina, Joshi, Freund, Bressler.
Statistical analysis: Burlina, Joshi, Pekala, Pacheco, Freund.
Obtained funding: Burlina, Bressler.
Administrative, technical, or material support: Burlina, Joshi, Freund.
Supervision: Burlina, Pacheco, Bressler.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Drs Burlina, Freund, and Bressler report holding a patent on a system and method for the automated detection of age-related macular degeneration and other retinal abnormalities. No other disclosures were reported.
Funding/Support: This work was supported by award R21EY024310 from the National Eye Institute, the James P. Gills Professorship, and unrestricted research funds to the Johns Hopkins University School of Medicine Retina Division for Macular Degeneration and Related Diseases Research.
Role of the Funder/Sponsor: The National Eye Institute and the Johns Hopkins University had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Additional Information: The AREDS dbGAP dataset was made available from the National Eye Institute of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Disclaimer: Dr Bressler is the Editor of JAMA Ophthalmology, but he was not involved in any of the decisions regarding review of the manuscript or its acceptance.