Key Points
How will an artificial intelligence (AI)–based grading system for diabetic retinopathy perform in a real-world clinical setting?
In a diagnostic study evaluating 193 patients (386 images), the AI system judged 17 as having diabetic retinopathy of sufficient severity to require referral. While the system correctly identified the 2 patients with true disease (severe diabetic retinopathy), the positive predictive value was only 12%, with 15 patients misclassified as needing referral.
Grading of diabetic retinopathy using AI has both potential benefits and challenges, and further study in real-world settings is needed.
Importance
There has been wide interest in using artificial intelligence (AI)–based grading of retinal images to identify diabetic retinopathy, but such a system has never been deployed and evaluated in clinical practice.
Objective
To describe the performance of an AI system for diabetic retinopathy deployed in a primary care practice.
Design, Setting, and Participants
Diagnostic study of patients with diabetes seen at a primary care practice with 4 physicians in Western Australia between December 1, 2016, and May 31, 2017. A total of 193 patients consented for the study and had retinal photographs taken of their eyes. Three hundred eighty-six images were evaluated by both the AI-based system and an ophthalmologist.
Main Outcomes and Measures
Sensitivity and specificity of the AI system compared with the gold standard of ophthalmologist evaluation.
Results
Of the 193 patients (93 [48%] female; mean [SD] age, 55 [17] years [range, 18-87 years]), the AI system judged 17 as having diabetic retinopathy of sufficient severity to require referral. The system correctly identified 2 patients with true disease and misclassified 15 as having disease (false-positives). The resulting specificity was 92% (95% CI, 87%-96%), and the positive predictive value was 12% (95% CI, 8%-18%). Many false-positives were driven by inadequate image quality (eg, dirty lens) and sheen reflections.
Conclusions and Relevance
The results demonstrate both the potential and the challenges of using AI systems to identify diabetic retinopathy in clinical practice. Key challenges include the low incidence rate of disease and the related high false-positive rate as well as poor image quality. Further evaluations of AI systems in primary care are needed.
Diabetic retinopathy (DR), if untreated, leads to progressive visual impairment and eventual blindness [1]. Timely identification and referral to ophthalmologists could reduce blindness and disease complications. Those with poorly controlled diabetes should be screened for DR at least annually [2]; however, only half of such patients receive screening [3]. Screening currently requires referral to an eye specialist, and patients may not visit the specialist because of logistical barriers, cost of the visit, or lack of an eye specialist in their community.
One method of improving access to DR screening is for primary care practices to obtain color fundus images and send these to ophthalmologists or optometrists for reading [4]. While such programs increase screening rates [5], there are logistical barriers, costs, and time delays in having the images read by ophthalmologists or optometrists.
These limitations have driven interest in computer assessment of images through fully automated artificial intelligence (AI)–based grading systems. Such a system would decide in real time whether a patient needs referral and could potentially be much cheaper than having eye experts conduct screening. Several studies have used repositories of retinal images to test the performance of AI grading systems in detecting DR [6-10], and in April 2018 the US Food and Drug Administration approved an AI algorithm, developed by IDx, used with a Topcon fundus camera (Topcon Medical) for DR identification [11].
Despite enthusiasm about the potential of AI-based grading systems, to our knowledge, there has never been an evaluation of the performance of an AI system in a real-world clinical setting. In this pilot study, we describe the performance of an AI system in a primary care practice.
The study design and patient information and informed consent forms for study participants were approved by the Human Research Ethics Committee at the University of Notre Dame, Fremantle, Australia, and patients provided written informed consent. We conducted the study according to the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline.
AI-Based Grading System for DR
Our AI system is based on deep learning and rule-based models for DR. It was developed and evaluated based on manually outlined pathologies using color fundus images from several training data sets (altogether 30 000 images) including the DiaRetDB1 [12] and Kaggle [13] (EyePACS) databases and our own Australian Tele-eye care DR database. The model was retrained using the images from the 3 data sets. The deep learning model was a deep convolutional neural network. We used the convolutional layers from the deep learning model Inception-v3 as our base model and connected it to our customized top model with several fully connected layers for the purpose of DR image classification. Using transfer learning, model training comprised the following steps: (1) manually classify the selected image data into 2 categories, DR disease and no DR disease; (2) divide the categorized image data into 2 parts, a training data set (80% of the total) and a test data set (20%), keeping the balance of the 2 categories in each data set; (3) normalize all the images and resize them to 299 × 299 pixels; (4) load the pretrained base model weights and use the training data set to train the top model initially; (5) use the training data set to retrain the whole model; and (6) monitor the accuracy and loss on the training data set and test data set and select the best-performing model. The rule-based model applied selection criteria to produce 3 outcomes: (1) a binary identification of disease or no disease for clinically significant DR, (2) identification of specific pathologies (eg, microaneurysms and exudates) related to DR, and (3) the severity of DR based on the International Clinical Diabetic Retinopathy Disease Severity Scale criteria [14]. The AI system is compatible with most retinal imaging cameras (eg, Canon, Zeiss, and DRS cameras).
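Step 2 of the training procedure (an 80%/20% split that keeps the 2 categories balanced in each part) can be illustrated with a minimal sketch. This is not the authors' code: the function name stratified_split, the fixed seed, and the example label counts are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), splitting each class separately
    so that both parts keep the same class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical labels (not the study data): 1 = DR disease, 0 = no DR disease.
example_labels = [1] * 300 + [0] * 700
train_idx, test_idx = stratified_split(example_labels)

train_dr_share = sum(example_labels[i] for i in train_idx) / len(train_idx)
test_dr_share = sum(example_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_dr_share, test_dr_share)
```

Splitting within each class, rather than over the pooled images, is what keeps the share of DR images the same in the training and test parts.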
The image quality control system used deep learning techniques to check the quality of the images. We manually classified selected images from the data sets into 2 classes: adequate image quality for DR grading and inadequate image quality for DR grading. We then used only the adequate-quality images to train the convolutional neural network model. However, the quality of some images was ambiguous between adequate and inadequate, which was expected to influence some outcomes.
Deployment in a Primary Care Practice
We deployed the AI system for 6 months (December 1, 2016, to May 31, 2017) at a primary care practice in Midland, Western Australia, that employed 4 primary care physicians. The tele-retinal and AI system includes a color fundus camera (Canon CR-2 AF), a cloud computing server, and a web application server.
Over roughly 1 to 2 weeks we trained 2 nurses to use the fundus camera and our tele-retinal screening software. All patients with diabetes seen at the primary care clinic were invited to participate in the study. Macula-centered images were acquired and 1 to 3 images per eye were allowed depending on the image quality (confirmed by quality control software). After completing the imaging process, the system sent the patient information and related images to a web server using Digital Imaging and Communications in Medicine format. The DR grading system provided a binary disease or no-disease DR grade to the primary care physician via an email. Patients with moderate or severe DR were referred to an ophthalmologist immediately.
All images were also sent to an ophthalmologist for evaluation using our tele-retinal system. If the ophthalmologist’s reading differed from the AI system’s, the ophthalmologist’s reading was relayed to the physician.
The binary reading (disease or no disease) by an ophthalmologist was used as the gold standard and compared with the grading obtained from our AI system. The sensitivity of the disease grading was true-positive/(true-positive + false-negative), specificity was true-negative/(true-negative + false-positive), positive predictive value was true-positive/(true-positive + false-positive), and negative predictive value was true-negative/(true-negative + false-negative).
During the study period, the practice saw 216 patients with diabetes. Of the 193 patients who agreed to DR screening, 93 (48%) were women. The mean (SD) age was 55 (17) years with a range of 18 to 87 years.
The nurse took approximately 10 to 15 minutes to obtain images for both eyes, and the AI system provided reading outcomes in less than 3 minutes. Three hundred eighty-six images were reviewed.
Based on grading by an ophthalmologist, of the 193 patients, 183 had no signs of retinopathy, 8 had mild nonproliferative DR, and 2 had clinically significant DR (1 with moderate nonproliferative DR, 1 with severe nonproliferative DR). The 2 patients with moderate or severe disease required referral to an ophthalmologist (Table 1).
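The referral rule implied by these counts can be written as a short sketch. The 5 levels follow the International Clinical Diabetic Retinopathy Disease Severity Scale, and the moderate-or-worse threshold follows the study's description of which patients were referred. The names Severity and requires_referral are hypothetical, and treating proliferative DR as referable is an assumption, since no such case occurred in this cohort.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Levels of the International Clinical DR Disease Severity Scale."""
    NO_DR = 0
    MILD_NPDR = 1
    MODERATE_NPDR = 2
    SEVERE_NPDR = 3
    PROLIFERATIVE_DR = 4  # not observed in this study; included for completeness


def requires_referral(severity):
    """Moderate nonproliferative DR or worse counts as clinically significant."""
    return severity >= Severity.MODERATE_NPDR


# Ophthalmologist grades reported for the 193 patients.
ophthalmologist_grades = {
    Severity.NO_DR: 183,
    Severity.MILD_NPDR: 8,
    Severity.MODERATE_NPDR: 1,
    Severity.SEVERE_NPDR: 1,
}

total_patients = sum(ophthalmologist_grades.values())
referable_patients = sum(
    count
    for severity, count in ophthalmologist_grades.items()
    if requires_referral(severity)
)
print(total_patients, referable_patients)  # 193 2
```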
Our AI system classified 17 patients as having clinically significant DR and 176 without disease. The system classified the 2 patients with true moderate and severe DR as having disease, indicating that they should be referred to ophthalmologists. It also identified all 8 mild DR cases correctly. Of the 17 patients classified as having clinically significant disease, 15 were false-positives. This resulted in a specificity of 92% (95% CI, 87%-96%) and a positive predictive value of 12% (95% CI, 8%-18%) (Table 2).
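The reported specificity and positive predictive value can be checked from the patient-level counts. In the sketch below, the true-negative and false-negative counts are derived rather than quoted: the system classified 176 patients as without disease and flagged both patients with true disease, which implies 176 true-negatives and 0 false-negatives. The function name diagnostic_metrics is hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard 2 x 2 diagnostic accuracy measures.

    Assumes every denominator is nonzero, which holds for the counts below.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Patient-level counts: 2 true-positives, 15 false-positives,
# 176 true-negatives (derived), 0 false-negatives (derived).
study = diagnostic_metrics(tp=2, fp=15, tn=176, fn=0)

for name, value in study.items():
    print(f"{name}: {value:.3f}")
# specificity is 176/191, about 0.92; ppv is 2/17, about 0.12
```

The sensitivity and negative predictive value both come out at 1.0 here, but they rest on only 2 patients with disease and should be read with that in mind.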
There were several factors that led to the 15 false-positive results. Six had drusen that were similar in appearance to exudates. Other false positives were driven by dirty lens reflections or uneven light exposure at the rim of images that our image quality control process could not fully identify. The AI system also identified exudates that were sheen reflections around the optic disc, the papillomacular area, and the macula.
We evaluated the performance of an AI system that reads retinal images to identify DR in a real-world clinical setting. The system was successfully deployed and detected both patients with clinically significant (moderate or severe) DR requiring referral. Though the sample size was limited, the AI system was effective in ruling out disease. However, the system had a high rate of false-positives, with a specificity of 92% and a positive predictive value of just 12%.
The specificity of the deployed system (92%) is similar to our prior validation using a database of retinopathy images (93%) and similar to other AI systems for reading retinopathy images (93.4%) [9,10]. The high proportion of false-positives among patients flagged for referral was driven by the low incidence of disease (2 of 193 [1%]). Prior validations of AI systems for identifying DR have used data from retinal image databases, and images were preselected such that the incidence of disease was much higher (roughly 1 of 3). For a given sensitivity and specificity, the lower the disease incidence, the lower the positive predictive value. This is consistent with other screening programs where false-positives are common, such as mammograms [15]. The low incidence rate of DR we observed in our study is the norm in primary care; therefore, false-positives are likely to be an issue unless the specificity of our system or other systems is much higher. Given this limitation, we believe retinopathy images identified as showing disease by an AI system should be reviewed by an ophthalmologist before a referral is made.
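The dependence of positive predictive value on how common the disease is can be made concrete with Bayes' rule. The sketch below holds sensitivity and specificity at the values observed in this study (2 of 2 and 176 of 191) and compares the observed disease frequency (2 of 193) with the roughly 1 of 3 typical of preselected image databases. The function name positive_predictive_value is hypothetical, and the second scenario is an illustration, not a result from the study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: share of positive results that are true-positives."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


observed_sensitivity = 2 / 2
observed_specificity = 176 / 191

# Disease frequency seen in this primary care practice.
ppv_primary_care = positive_predictive_value(
    observed_sensitivity, observed_specificity, prevalence=2 / 193
)

# Hypothetical enriched data set in which about 1 image in 3 shows disease.
ppv_enriched = positive_predictive_value(
    observed_sensitivity, observed_specificity, prevalence=1 / 3
)

print(f"primary care: {ppv_primary_care:.2f}")  # about 0.12
print(f"enriched:     {ppv_enriched:.2f}")      # about 0.86
```

With the same test characteristics, the share of flagged patients who truly have disease rises from about 12% to about 86% when disease is common, which is why results from enriched image databases overstate the positive predictive value that a primary care practice should expect.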
Despite these limitations, we believe the AI system has potential for improving the efficiency of screening for DR in primary care. Roughly 92% of all patients were immediately told at their primary care practice they had no DR and therefore no referral was needed. In this case, the proportion of patients whose images would have to be reviewed by an ophthalmologist was less than 10%. The ability to provide real-time eye screening at familiar primary care physician practices has many practical advantages, including comprehensive chronic disease management at a single location for patients with diabetes. There is also the potential for the AI system to be improved. Further training of the AI system to differentiate drusen, sheen reflections, and exudates could improve the specificity.
There were 2 key limitations of this study. The first is the small sample size, with only 2 of the screened patients having clinically significant disease. The second is generalizability. Our study was limited to 1 primary care practice in Western Australia and used a single AI system.
Our evaluation demonstrates both the promise and challenges of using AI systems to identify DR in clinical practice. Evaluations of AI systems should be conducted in real-world clinical practice before they are deployed widely.
Accepted for Publication: July 2, 2018.
Published: September 28, 2018. doi:10.1001/jamanetworkopen.2018.2665
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 Kanagasingam Y et al. JAMA Network Open.
Corresponding Author: Yogesan Kanagasingam, PhD, CSIRO Australian e-Health Research Centre, 65 Brockway Rd, Floreat, Western Australia 6009, Australia (firstname.lastname@example.org).
Author Contributions: Drs Kanagasingam and Xiao had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Kanagasingam, Xiao, Vignarajan, Tay-Kearney.
Acquisition, analysis, or interpretation of data: Xiao, Preetham, Mehrotra.
Drafting of the manuscript: Kanagasingam, Xiao, Vignarajan, Preetham, Tay-Kearney.
Critical revision of the manuscript for important intellectual content: Mehrotra.
Statistical analysis: Xiao.
Obtained funding: Xiao.
Administrative, technical, or material support: Vignarajan, Preetham, Tay-Kearney.
Supervision: Kanagasingam, Preetham, Mehrotra.
Conflict of Interest Disclosures: Dr Kanagasingam reported a patent to Remote-I telemedicine system pending and licensed. Dr Xiao reported grants from the National Health and Medical Research Council during the conduct of the study. Mr Vignarajan reported grants from the National Health and Medical Research Council during the conduct of the study. No other disclosures were reported.
Funding/Support: The National Health and Medical Research Council of Australia provided funding for the research and development of the machine learning system for diabetic retinopathy.
Role of the Funder/Sponsor: The National Health and Medical Research Council of Australia had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
et al. Improving diabetic retinopathy screening ratios using telemedicine-based digital retinal imaging technology: the Vine Hill study. Diabetes Care. 2007;30(3):574-578. doi:10.2337/dc06-1509
Y. Deep learning for automatic detection and classification of microaneurysms, hard and soft exudates, and hemorrhages for diabetic retinopathy diagnosis. Invest Ophthalmol Vis Sci. 2016;57(12):5962.