Variation in the accuracy and reliability of remote retinopathy of prematurity (ROP) diagnosis. A, Nasal disease classified by ophthalmoscopy and 3 readers as ROP requiring treatment. B, Nasal disease classified by ophthalmoscopy and 2 readers as mild ROP and by 1 reader as type 2 prethreshold ROP.
Chiang MF, Keenan JD, Starren J, Du YE, Schiff WM, Barile GR, Li J, Johnson RA, Hess DJ, Flynn JT. Accuracy and Reliability of Remote Retinopathy of Prematurity Diagnosis. Arch Ophthalmol. 2006;124(3):322-327. doi:10.1001/archopht.124.3.322
To determine the accuracy and reliability of retinopathy of prematurity (ROP) diagnosis using remote review of digital images by 3 masked ophthalmologist readers.
An atlas was compiled of 410 retinal photographs from 163 eyes of 64 low-birth-weight infants taken using a wide-angle digital fundus camera. All the images were independently reviewed by 3 readers, and the diagnosis in each eye was classified into 1 of 4 ordinal categories: no ROP, mild ROP, type 2 prethreshold ROP, or ROP requiring treatment. Findings were compared with a reference standard of dilated indirect ophthalmoscopy with scleral depression performed by an experienced pediatric ophthalmologist.
Sensitivities/specificities of the diagnosis of any ROP were 0.845/0.910 for the first reader, 0.816/0.955 for the second reader, and 0.864/0.493 for the third reader. Sensitivities/specificities of the diagnosis of ROP requiring treatment were 0.850/0.960 for the first reader, 0.850/0.973 for the second reader, and 0.900/0.953 for the third reader. When ROP was classified into ordinal categories, the overall weighted κ for interreader reliability was 0.743. Intrareader reliability for detection of low-risk prethreshold ROP or worse was 100% for all readers.
The accuracy, interreader reliability, and intrareader reliability of remote diagnosis of clinically relevant ROP based on digital imaging are substantial.
Store-and-forward telemedicine is an emerging technology in which static medical data and images are captured and transmitted to a remote storage device for subsequent retrieval and interpretation by an expert.1,2 This technology has the potential to improve the quality and delivery of medical care, research, and education. However, the widespread adoption of telemedicine has been limited by technical obstacles that have only recently been overcome, by the lack of reimbursement for providers, and by the lack of substantive evaluation data regarding its diagnostic efficacy, reliability, acceptability to patients and providers, and cost-effectiveness.3-6
Retinopathy of prematurity (ROP) is an ideal disease for telemedicine research for several reasons. (1) There are established, evidence-based standards for disease classification, diagnosis, and treatment.7-9 Similarly, there are accepted guidelines for identifying high-risk premature infants who require examination and follow-up involving dilated ophthalmoscopy with scleral depression performed by an experienced ophthalmologist.10 (2) If detected early, ROP may be treated effectively with cryotherapy or laser.8,9,11 (3) Retinopathy of prematurity continues to be a leading cause of childhood blindness in the United States and throughout the world. The direct and indirect costs of infancy-acquired blindness are enormous, with an estimated annual governmental cost of $38.3 to $64.9 million in the United States alone.12 (4) Current ROP examination methods are costly, time intensive, and frequently impractical. Depressed scleral examinations are physiologically stressful to premature infants and are associated with systemic complications, such as apnea, bradycardia, and aspiration.13 (5) Adequate ophthalmic expertise is often limited to larger academic centers worldwide and, therefore, may be unavailable at the point of care.
Pilot studies14,15 have shown that remote interpretation of wide-angle digital retinal images may have adequate sensitivity and specificity to identify cases of severe ROP that warrant ophthalmologic referral, although other studies16,17 have raised concerns about the quality of photographs obtained by existing image capture devices.7 However, little published research has compared the accuracy and reproducibility of remote image interpretation among multiple readers. This is an important gap in knowledge because a large-scale telemedicine screening strategy would require high accuracy across multiple readers and sufficient interreader and intrareader reliability of diagnostic classification.
The purpose of this study is to determine the sensitivity, specificity, interreader reliability, and intrareader reliability of remote ROP diagnosis among 3 trained readers reviewing the same set of digital images compared with a reference standard of dilated indirect ophthalmoscopy with scleral depression performed by an experienced pediatric ophthalmologist. All the retinal images were captured using a commercially available wide-angle digital fundus camera.
This study was approved by the institutional review boards at Columbia University Medical Center and the University of Miami School of Medicine.
Infants at Jackson Memorial Hospital who met the criteria for ROP examination between January 1, 1999, and December 31, 2000, and whose parents provided informed consent for imaging were included in this study. Specific criteria were birth weight less than 1300 g, or birth weight of 1300 to 1800 g with greater than 72 hours of oxygen therapy. Patients were excluded if they had structural ocular anomalies or systemic malformations, if they had previously been treated for ROP with laser photocoagulation or other ocular surgery, or if they were considered by their neonatologist to be unstable for examination because of poor general health.
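The examination criteria above can be written as a small predicate. This is only an illustrative sketch: the function name and parameters are hypothetical, the 1300 to 1800 g range is assumed to include both endpoints, and the exclusion criteria (ocular anomalies, prior treatment, clinical instability) are not modeled.

```python
def meets_rop_exam_criteria(birth_weight_g: float, oxygen_hours: float) -> bool:
    """Return True if an infant meets the birth-weight/oxygen criteria stated above.

    Assumption: the 1300-1800 g band is inclusive at both ends.
    """
    if birth_weight_g < 1300:
        return True
    # Heavier infants qualify only with more than 72 hours of oxygen therapy.
    return birth_weight_g <= 1800 and oxygen_hours > 72


if __name__ == "__main__":
    print(meets_rop_exam_criteria(812, 0))     # birth weight alone is sufficient
    print(meets_rop_exam_criteria(1500, 48))   # too little oxygen therapy
```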
Each infant in this study underwent 2 examinations, which were sequentially performed under topical anesthesia at the neonatal intensive care unit bedside. First, dilated indirect ophthalmoscopy with scleral depression was performed by an experienced pediatric ophthalmologist (J.T.F.) based on well-established protocols.10,16 The presence or absence of ROP, its location and extent, and the presence or absence of plus disease were documented according to the international classification of ROP.7 Subsequently, wide-angle retinal imaging was performed by an experienced ophthalmic photographer (D.J.H.) using a digital camera system (RetCam-120; Clarity Medical Systems, Pleasanton, Calif) based on manufacturer guidelines. This imaging was performed independent of clinical examination, with the aim of capturing pictures of the posterior pole and as much of the peripheral retina as feasible.
Photographs were selected for compilation into an image atlas by an ophthalmologist (J.T.F.) masked to examination findings. No patients were excluded because of poor image quality or inability to obtain digital images. To test intrareader reliability for interpreting the same images at different times, the atlas included 7 sets of duplicated images. To maintain patient confidentiality, atlas images were not annotated with any individually identifiable data, such as dates, birth weight, or gestational age. Therefore, remote diagnosis was based on the appearance of the retinal images alone.
Three image readers participated in this study: 2 fellowship-trained retina specialists (readers A and B) and 1 board-certified general ophthalmologist (reader C). Readers were trained to analyze digital images by a pediatric ophthalmologist experienced in ROP (J.T.F.) using a set of teaching photographs unrelated to the retinal atlas compiled for this study. After readers correctly classified at least 8 of 10 ROP images, they proceeded to interpret study images. Masked readers independently interpreted each image set and recorded the following information for each eye: the presence of ROP and the zone, stage, and clock hours of ROP.
Interpretations were converted to an ordinal scale based on established criteria from the multicenter Cryotherapy for Retinopathy of Prematurity8 and Early Treatment for Retinopathy of Prematurity (ETROP)9 studies: (1) no ROP; (2) mild ROP, defined as ROP less than ETROP type 2 prethreshold disease; (3) ETROP type 2 prethreshold ROP (zone I, stage 1 or 2 ROP without plus disease; or zone II, stage 3 ROP without plus disease)9; and (4) ROP requiring treatment, defined as ETROP type 1 prethreshold ROP (zone I, any stage ROP with plus disease; zone I, stage 3 ROP without plus disease; or zone II, stage 2 or 3 ROP with plus disease)9 or threshold ROP (≥5 contiguous or ≥8 noncontiguous clock hours of stage 3 ROP in zone I or II in the presence of plus disease).8
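The conversion rules above can be sketched as a function of zone, stage, and plus disease. This is an illustration under stated assumptions, not the procedure the study used: the names `RopCategory` and `classify_rop` are hypothetical, stage 0 is used here to encode "no ROP," and only stages 1 to 3 are handled because the definitions above do not cover other stages. Clock hours are not needed in this sketch because any eye meeting the threshold definition (stage 3 in zone I or II with plus disease) already meets a type 1 criterion.

```python
from enum import IntEnum


class RopCategory(IntEnum):
    NO_ROP = 1
    MILD = 2
    TYPE2_PRETHRESHOLD = 3
    REQUIRING_TREATMENT = 4


def classify_rop(zone: int, stage: int, plus: bool) -> RopCategory:
    """Map zone (1-3), stage (0-3), and plus disease to the 4-level ordinal scale.

    Assumption: stage 0 means no ROP; stages above 3 are rejected because the
    definitions in the text do not address them.
    """
    if stage == 0:
        return RopCategory.NO_ROP
    if zone not in (1, 2, 3) or stage not in (1, 2, 3):
        raise ValueError("sketch handles zones 1-3 and stages 0-3 only")

    # ETROP type 1 prethreshold (also covers threshold ROP as defined above).
    if zone == 1 and (plus or stage == 3):
        return RopCategory.REQUIRING_TREATMENT
    if zone == 2 and plus and stage in (2, 3):
        return RopCategory.REQUIRING_TREATMENT

    # ETROP type 2 prethreshold.
    if zone == 1 and not plus:  # stage 1 or 2; stage 3 was handled above
        return RopCategory.TYPE2_PRETHRESHOLD
    if zone == 2 and stage == 3 and not plus:
        return RopCategory.TYPE2_PRETHRESHOLD

    # Anything less than type 2 prethreshold disease.
    return RopCategory.MILD


if __name__ == "__main__":
    print(classify_rop(zone=2, stage=3, plus=False).name)
```

Using an `IntEnum` keeps the categories ordered, so a cutoff such as "type 2 prethreshold ROP or worse" becomes a simple `>=` comparison.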
All the data were tabulated and analyzed using statistical software (SPSS 13.0; SPSS Inc, Chicago, Ill). Sensitivity, specificity, intrareader reliability, and interreader reliability for digital imaging were determined for the presence of any ROP, the presence of type 2 prethreshold ROP or worse, and the presence of ROP requiring treatment. The κ statistic was used to measure chance-adjusted agreement for the presence of disease based on an accepted scale (0-0.20 = slight agreement, 0.21-0.40 = fair agreement, 0.41-0.60 = moderate agreement, 0.61-0.80 = substantial agreement, and 0.81-1.00 = almost-perfect agreement).18 The weighted κ statistic was also used for analysis of agreement for classification of ROP into ordinal categories because it adjusts for small vs large disagreements in grading.19 Results of dilated ophthalmoscopic examination were used as the reference standard.
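For readers unfamiliar with these statistics, the following sketch shows how κ and weighted κ can be computed for two sets of ordinal ratings and mapped to the agreement scale quoted above. It is not the SPSS procedure used in the study. The function names are hypothetical, and linear disagreement weights are an assumption, since the weighting scheme is not stated here. The label for negative values is also an assumption, because the quoted scale starts at 0.

```python
def cohen_kappa(ratings_a, ratings_b, categories, weighted=False):
    """Cohen's kappa for two raters; linear disagreement weights if weighted=True."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least 2 categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)  # assumed linear weights
        return 0.0 if i == j else 1.0

    d_obs = sum(disagreement(i, j) * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - d_obs / d_exp


def agreement_label(kappa):
    """Translate kappa into the agreement scale quoted in the text."""
    if kappa < 0:
        return "less than chance"  # assumption: the quoted scale starts at 0
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost-perfect"


if __name__ == "__main__":
    # Hypothetical ratings on the 4-level ordinal scale (1 = no ROP ... 4 = treatment).
    reader = [1, 2, 3, 3]
    reference = [1, 2, 3, 4]
    for w in (False, True):
        value = cohen_kappa(reference, reader, categories=[1, 2, 3, 4], weighted=w)
        print(w, round(value, 3), agreement_label(value))
```

In the toy data the only disagreement is between adjacent categories, so the weighted statistic penalizes it less than the unweighted one, which is the property the text describes.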
The retinal atlas included 163 unique sets of digital images (81 right eyes and 82 left eyes) from 64 consecutive infants whose parents consented to participation. Each image set consisted of 1 to 7 photographs from a single eye taken at 32 to 45 weeks' postmenstrual age. Several infants contributed multiple image sets that were taken at 1- to 2-week intervals: 3 infants contributed 1 examination of 1 eye only, 48 contributed 1 examination of both eyes, 10 contributed 2 examinations of both eyes, 1 contributed 3 examinations of both eyes, 1 contributed 4 examinations of both eyes, and 1 contributed 5 examinations of both eyes. Because bedside imaging was performed under typical working conditions, it was not possible to capture a standard set of photographs on each infant.
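The infant and image-set totals can be checked against the per-infant breakdown above with a short tally. The variable names are illustrative only; the counts are taken directly from the preceding paragraph.

```python
# (number of infants, examinations per infant, eyes imaged per examination)
contributions = [
    (3, 1, 1),   # 1 examination of 1 eye only
    (48, 1, 2),  # 1 examination of both eyes
    (10, 2, 2),  # 2 examinations of both eyes
    (1, 3, 2),   # 3 examinations of both eyes
    (1, 4, 2),   # 4 examinations of both eyes
    (1, 5, 2),   # 5 examinations of both eyes
]

total_infants = sum(n for n, _, _ in contributions)
total_image_sets = sum(n * exams * eyes for n, exams, eyes in contributions)

print(total_infants, total_image_sets)  # 64 163
```

The tally reproduces the 64 infants and 163 unique image sets reported above.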
Mean birth weight of study infants was 812 g (range, 480-1440 g), and mean gestational age was 26 weeks (range, 23-32 weeks). Results of reference standard ophthalmoscopic examination classifications are given in Table 1. The overall incidence of any ROP in the study population was 60.7%. Based on reference standard examinations, the observed agreement between ordinal disease classifications of right and left eyes in the same patient was 74.4%.
To test intrareader reliability, 7 image sets (4 right eyes and 3 left eyes) were duplicated at random intervals in the image atlas. Therefore, the final atlas contained a total of 410 images from 163 eyes of 64 infants, including the duplicated image sets. Among the 7 duplicated image sets, ophthalmoscopic classification showed that 3 had no ROP, 2 had mild ROP, none had type 2 prethreshold ROP, and 2 had ROP requiring treatment.
The sensitivity, specificity, and κ statistic for remote diagnosis by the 3 readers are given in Table 2 for 3 cutoff values: detection of any ROP, detection of type 2 prethreshold ROP or worse, and detection of ROP requiring treatment. Readers A and B had sensitivity of 75% or greater and specificity greater than 90% at all the cutoff values. Reader C had similarly high sensitivity at each cutoff value, and similarly high specificity for type 2 prethreshold ROP (98.5%) and ROP requiring treatment (95.3%). However, reader C had significantly lower specificity for any ROP compared with the other 2 readers (49.3%; P<.001 by McNemar test).
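To make the notion of a cutoff value concrete, the sketch below dichotomizes ordinal classifications at a chosen cutoff and computes sensitivity and specificity against the reference standard. The function name, the 1-to-4 coding of the categories, and the example ratings are all hypothetical; the data are not from the study.

```python
def sensitivity_specificity(reader, reference, cutoff):
    """Sensitivity and specificity when 'positive' means category >= cutoff.

    Categories are assumed to be coded 1 (no ROP) to 4 (ROP requiring treatment).
    """
    if len(reader) != len(reference):
        raise ValueError("reader and reference must have equal length")
    tp = fn = tn = fp = 0
    for r, ref in zip(reader, reference):
        reader_pos, ref_pos = r >= cutoff, ref >= cutoff
        if ref_pos and reader_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif reader_pos:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference standard must contain both positives and negatives")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical toy ratings for six eyes.
    reference = [1, 1, 2, 3, 4, 4]
    reader = [2, 1, 2, 2, 4, 3]
    cutoffs = {"any ROP": 2, "type 2 prethreshold or worse": 3, "requiring treatment": 4}
    for label, cutoff in cutoffs.items():
        print(label, sensitivity_specificity(reader, reference, cutoff))
```

In this toy example, a reader who overcalls mild disease loses specificity only at the "any ROP" cutoff, the same pattern described for reader C.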
The κ statistic for each of these 3 cutoff values showed substantial to almost-perfect agreement with reference standard diagnoses for readers A and B (Table 2). For reader C, the κ for detection of type 2 prethreshold ROP and ROP requiring treatment showed substantial agreement with the reference standard, whereas the κ for detection of any ROP showed only fair agreement. Table 3 displays κ and weighted κ for ordinal classification by each reader compared with the reference standard. Reader A had moderate-to-substantial agreement, reader B had substantial agreement, and reader C had moderate agreement with the reference standard.
Table 4 provides κ and weighted κ values based on ordinal ROP classification and comparison of interpretations by each pair of readers. Readers A and B had substantial to almost-perfect agreement, readers B and C had moderate agreement, and readers A and C had substantial agreement. For interreader reliability of ordinal classification among all readers, the overall κ and weighted κ (SE) were 0.622 (0.037) and 0.743 (0.031), respectively, indicating substantial agreement.
For the 7 duplicated image sets, intrareader reliabilities for readers A and B were perfect (κ = 1.00 for ordinal classification). For reader C, intrareader reliability for ability to detect any ROP was lower than would be expected by pure chance (κ = −0.235). However, intrareader reliability for reader C for ability to detect type 2 prethreshold ROP or ROP requiring treatment was perfect (κ = 1.00). For intrareader reliability of ordinal classification among all readers, the κ and weighted κ (SE) were 0.599 (0.246) and 0.775 (0.135), respectively, indicating moderate-to-substantial agreement.
This study evaluates the accuracy and reliability of ROP diagnosis by 3 masked readers based on remote digital image interpretation compared with a reference standard of dilated ophthalmoscopy with scleral depression. For data analysis, examination findings were categorized into an ordinal system: no ROP, mild ROP, type 2 prethreshold ROP, or ROP requiring treatment.
There was variation in accuracy among the 3 readers (Table 2 and Figure). For detecting a cutoff value for type 2 prethreshold ROP, all 3 readers had sensitivity greater than 72% and specificity greater than 90%. Similarly, for detecting a cutoff value for ROP requiring treatment, all 3 readers had sensitivity of 85% or greater and specificity of 95% or greater. This compares with results from Ells et al15 showing 100% sensitivity and 96% specificity for the detection of “referral-warranted ROP,” which was defined similarly to type 2 prethreshold ROP in the present study. For detecting a cutoff value for any ROP, 2 of the readers had sensitivity greater than 81% and specificity of 91% or greater, whereas the third reader had high sensitivity (86.4%) but significantly lower specificity (49.3%). This compares with results from Roth et al16 showing 82% sensitivity and 94% specificity for the detection of any ROP. Taken together, these findings suggest 2 conclusions. First, accurate diagnosis of ROP using telemedicine is technically feasible by a range of readers, although careful selection and training are required. None of the readers in the present study were actively performing ROP examination or treatment. Readers A and B were retina specialists with limited experience in examination and treatment, whereas reader C was a general ophthalmologist with only limited experience in examination. Second, the accuracy of an image-based telemedicine strategy depends on the cutoff value used to define “abnormal” readings. In particular, reader C had low specificity for the detection of any ROP because of mistakenly classifying images as mild ROP when reference standard examinations found no ROP.
The objective of a remote telemedicine screening strategy is to diagnose disease of sufficient severity to warrant referral for specialist workup. This approach offers potential benefits, such as improved access and reduced barriers to care, decreased waiting time, and reduced transportation costs.1,2 Patients with mild ROP as defined in this study (ie, stages 1 and 2) are typically observed every 1 to 2 weeks without other intervention. Therefore, the type 2 prethreshold ROP and ROP requiring treatment cutoff values are likely to be more clinically relevant than the detection of any ROP in assessing the accuracy of a telemedicine screening strategy.15 The high specificity for the type 2 prethreshold ROP and ROP requiring treatment cutoff values suggests that remote telemedicine diagnosis may be an effective and practical method for ruling in clinically significant ROP. Future research must examine whether the false-negative rate of this method is acceptable from clinical and cost-effectiveness perspectives.
It is worthwhile to compare our accuracy findings for ROP with results from a larger body of literature examining the accuracy of diabetic retinopathy diagnosis based on dilated ophthalmoscopy or remote image interpretation.20 Clinical diabetes mellitus research has been supported by the wide acceptance of stereoscopic 7-field 35-mm color photographs as a reference standard for the detection and classification of diabetic retinopathy and the establishment of a national Fundus Photograph Reading Center for interpretation by trained readers.21-23 The agreement between this reference standard and ophthalmoscopy by a retina specialist has been demonstrated to be imperfect,24 with a κ of 0.49 in 1 study.25 Compared with the reference standard, single-field nonmydriatic photography was found to have 61% sensitivity and 85% specificity for detecting moderate-to-severe nonproliferative retinopathy26 and to have 78% sensitivity and 86% specificity for detecting retinopathy of sufficient severity to warrant ophthalmologic referral.27 These results suggest that remote classification of ROP into ordinal categories using digital imaging has comparable accuracy to the classification of diabetic retinopathy using dilated ophthalmoscopy by a retina specialist or single-field photography.
Although there is a widely accepted international classification system for ROP, clinical ROP research has typically relied on ophthalmoscopic examination as a reference standard.7,9,16,17,28 The development of a photographic reference standard for ROP, along with protocols for interpretation by trained readers, may help support clinical care and research in this area.29 Similar standards for the photographic capture and interpretation of other findings may become increasingly important for telemedicine, particularly because ophthalmologic diagnosis is heavily image based.
No published research, to our knowledge, has examined the reliability of remote image interpretation for ROP diagnosis. However, the reliability findings in this study may be compared with results from similar studies involving diabetic retinopathy. Bursell et al30 showed that the interreader reliability between 2 readers of stereo nonmydriatic digital-video retinal images for the presence of diabetic retinopathy lesions was moderate to substantial (unweighted κ = 0.45-0.76), whereas the intrareader reliability ranged from moderate to almost-perfect (unweighted κ = 0.41-1.00). In the Early Treatment Diabetic Retinopathy Study 7-field photographic reference standard, interreader reliability for grading lesions related to diabetic retinopathy was found to be moderate to substantial, depending on the type of lesion (weighted κ = 0.41-0.80).22 Based on these results, the interreader and intrareader reliabilities for the 3 readers in this study were comparable with findings from diabetic retinopathy studies, including the Early Treatment Diabetic Retinopathy Study reference standard.
Several factors and limitations regarding this study should be noted. (1) To protect patient confidentiality, images were not annotated with any clinical data. This may have biased against the readers' ability to interpret photographs accurately. (2) Data were analyzed by eye in this study, which may have biased the readers because ROP findings in right and left eyes of the same patient are not independent. (3) The accuracy and reliability of the reference standard—dilated examination by an experienced ophthalmologist—have not been established, to our knowledge. It would be informative to determine the frequency of photographically documented ROP that is missed by dilated ophthalmoscopy. (4) None of the 3 readers had extensive clinical experience in ROP, although 2 were retina specialists. Comparison of our results with future studies involving “expert ROP readers” may be revealing. (5) Although failure to obtain retinal images from participating infants did not occur in this study, existing technologies require image capture by a skilled photographer. Standardized pictures could not be taken in this study, even by an experienced ophthalmic photographer. As imaging technologies improve, this is likely to become easier.
Telemedicine is an emerging technology that has the potential to improve the delivery, quality, speed, and cost of ophthalmic care.1-4 This study demonstrates that remote diagnosis of ROP is technically feasible based on the finding that images captured using a digital camera system may be interpreted with high accuracy, interreader reliability, and intrareader reliability. Careful selection and training of readers and the use of clinically relevant cutoff values for referral must be considered when planning remote screening strategies. In addition, numerous other research questions must be answered before telemedicine can become a practical mechanism for diagnosing ophthalmic disease. Strategies must be developed for deploying imaging devices and training local personnel in their use. Technical factors, such as network maintenance and data security, need to be evaluated. Finally, the cost-benefit impact and acceptability of telemedicine to patients and medical providers must be demonstrated before its benefits can be realized.
Correspondence: Michael F. Chiang, MD, MA, Department of Ophthalmology, Columbia University College of Physicians and Surgeons, 635 W 165th St, Box 92, New York, NY 10032 (firstname.lastname@example.org).
Submitted for Publication: February 2, 2005; final revision received April 12, 2005; accepted April 17, 2005.
Author Contributions: Dr Chiang had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Financial Disclosure: None.
Funding/Support: This study was supported by a Career Development Award from Research to Prevent Blindness, New York (Dr Chiang), and by grant EY13972 from the National Eye Institute of the National Institutes of Health, Bethesda, Md (Dr Chiang).