What is the level of interobserver and intraobserver agreement among 5 glaucoma specialists in the use of visual field testing and optical coherence tomography software to detect glaucoma progression?
In this study, while intraobserver agreement was substantial for both software programs, interobserver agreement was moderate and associated with the variability among tests. Interobserver agreement was substantial in patients with no progression but was only fair in patients with questionable glaucoma progression or glaucoma progression.
The moderate interobserver agreement using both software programs indicates that these programs used in isolation are insufficient for decision making regarding glaucoma damage progression.
It is important to evaluate intraobserver and interobserver agreement using visual field (VF) testing and optical coherence tomography (OCT) software in order to understand whether the use of this software is sufficient to detect glaucoma progression and to make decisions regarding its treatment.
To evaluate agreement in VF and OCT software among 5 glaucoma specialists.
Design, Setting, and Participants
The printout pages from VF progression software and OCT progression software from 100 patients were randomized, and the 5 glaucoma specialists subjectively and independently evaluated them for glaucoma progression. Each image was classified as having no progression, questionable progression, or progression. The principal investigator classified the patients previously as without variability (normal) or with high variability among tests (difficult). Using both software programs, the specialists also evaluated whether the glaucoma damage had progressed and whether treatment change was needed. One month later, the same observers reevaluated the patients in a different order to determine intraobserver reproducibility.
Main Outcomes and Measures
Intraobserver and interobserver agreement was estimated using κ statistics and Gwet second-order agreement coefficient. The agreement was compared with other factors.
Of the 100 observed patients, half were male and all were white; the mean (SD) age was 69.7 (14.1) years. Intraobserver agreement was substantial to almost perfect for VF software (overall κ [95% CI], 0.59 [0.46-0.72] to 0.87 [0.79-0.96]) and similar for OCT software (overall κ [95% CI], 0.59 [0.46-0.71] to 0.85 [0.76-0.94]). Interobserver agreement among the 5 glaucoma specialists with the VF progression software was moderate (κ, 0.48; 95% CI, 0.41-0.55) and similar to OCT progression software (κ, 0.52; 95% CI, 0.44-0.59). Interobserver agreement was substantial in images classified as having no progression but only fair in those classified as having questionable glaucoma progression or glaucoma progression. Interobserver agreement was fair regarding questions about glaucoma progression (κ, 0.39; 95% CI, 0.32-0.48) and consideration about treatment changes (κ, 0.39; 95% CI, 0.32-0.48). The factors associated with agreement were the glaucoma stage and case difficulty.
Conclusions and Relevance
There was substantial intraobserver agreement but moderate interobserver agreement among glaucoma specialists using 2 glaucoma progression software packages. These data suggest that these glaucoma progression software packages are insufficient to obtain high interobserver agreement in both devices except in patients with no progression. The low agreement regarding progression or treatment changes suggests that both software programs used in isolation are insufficient for decision making.
Progression of glaucomatous damage is one of the most important and difficult challenges in the management of glaucoma. There are no accepted gold-standard criteria for determining worsening (progression) of glaucoma changes in the visual field (VF) or retinal nerve fiber layer (RNFL) thickness. The challenges of identifying glaucoma progression include difficulty discriminating true disease-related changes from variability, artifacts, patient cooperation, and natural age-related changes in structural or functional measurements.1,2 Clinicians most often review serial VFs or optic disc stereophotographs to detect progression. However, these methods are time consuming and highly subjective.3,4
Several computational approaches that rely on trend or event analysis have recently become available to analyze serial VF or RNFL measurements for progression. Glaucoma Progression Analysis (Carl Zeiss Meditec) is commercially available software that applies an algorithmic method to identify glaucomatous VF progression with the Humphrey Field Analyzer II (Carl Zeiss Meditec). Guided Progression Analysis software analyzes the progression of optic disc and RNFL damage using Cirrus optical coherence tomography (OCT) (Carl Zeiss Meditec).5 The goals of the current study were to evaluate the interobserver and intraobserver agreement of both software programs, to determine if the interobserver and intraobserver agreement was better with the VF algorithm or the OCT algorithm used to characterize progressive change, and to understand the factors associated with interobserver agreement when glaucoma progresses.
One hundred patients with glaucoma or ocular hypertension were recruited to evaluate the damage resulting from glaucoma progression. The principal investigator (J.M.-M.) selected the 100 patients. The sample included patients with stable glaucoma or progression over time based on subjective interpretation of the overview printouts and retrospective evaluation of the clinical reports (intraocular pressure, glaucoma type, and stage of damage). The principal investigator defined difficult cases as those with low change or progression and high variability among tests in trend analysis in either VF or OCT software programs.
The demographic data from these 100 patients are shown in Table 1. The initial VF was classified previously as glaucomatous according to the Glaucoma Staging System (GSS).6 Patients were excluded who had myopia (>6 D), astigmatism (>3 D), or hyperopia (>6 D). The Ethics Committee of the Clinica Universidad de Navarra approved the study, which adhered to the Declaration of Helsinki. According to this committee, no written informed consent was needed, because this evaluation was performed on data that are normally obtained in clinical practice.
The printout pages of the analysis from the Humphrey Field Analyzer II (1 page) and the Cirrus-OCT (2 pages) were merged into one 3-page PDF file (Adobe Acrobat Reader, Adobe Systems Software), and the images were numbered from 1 to 100. The files were uploaded to a folder on the Basic Dropbox Business (Dropbox) file-sharing site. The intraocular pressure, glaucoma type, number of treatments, and duration of glaucoma were not included in the information provided to the specialists. All included patients had reliable VF tests using the Swedish Interactive Threshold Algorithm (Carl Zeiss Meditec). Patients with an improved visual field index (VFI) in the 3 initial and consecutive tests were considered part of the learning curve and excluded from the study. One experienced operator, who was not one of the examiners who performed the VF testing and was masked to the other findings, performed all OCT examinations. All OCT scans had a signal strength of 6 or higher.
Five glaucoma specialists (A.A., J.M.L., J.M.-d.-l.-C., G.R., and F.U.) from 5 different university departments of ophthalmology evaluated the 100 images. These specialists have published glaucoma studies and had used both software programs. All specialists answered the following questions: (1) Do you think that the VF damage is progressing? (2) Do you think that the OCT damage is progressing? (3) According to the changes in the VF and the OCT images, do you think the glaucoma is progressing? (4) Do you think a treatment change is needed? Each question could be answered with no, “questionable,” or “progression,” except for question 4, for which the answers were no, “questionable,” and “definitely.” No other comments or instructions were provided.
The specialists received an Excel file (Microsoft Corporation) with the 100 images ordered randomly and differently to avoid a fatigue effect in the answers. Only the cited answers were included in each box of the Excel file. After answering all questions, the specialists sent the Excel file to the principal investigator, and the 100 files were removed from the Dropbox shared folder. No expert knew the names of the other specialists or their answers.
One or more months later, the 100 images were reordered randomly, assigned a new identification number from 1 to 100, and uploaded again to a new shared Dropbox folder. A new Excel file was sent to the specialists using a new randomized order of images to answer. The questions remained the same as in the first analysis. The specialists sent the Excel file with the answers, and the files were removed from Dropbox. The principal investigator and an assistant (V.A.) combined the answers to evaluate the intraobserver and interobserver agreement.
Agreement among the different specialists was reported using nominal κs (ignoring the sequence) and weighted κs (using linear weights),7 prevalence-adjusted and bias-adjusted κs,8 and Gwet second-order agreement coefficients,9 with PEPI-for-Windows version 11.15 (Microsoft). Multiple raters’ κs were used to assess the agreement among the specialists. Confidence intervals for these κ values were computed using bootstrap (bias corrected and accelerated method; 100 000 repetitions) with Stata SE version 12.1 (StataCorp). All κ values were interpreted according to the guidelines of Landis and Koch.10
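As an illustration of the multirater κ and its interpretation, the following minimal Python sketch computes Fleiss' κ for several raters and maps a κ value to the Landis and Koch categories. It is not the authors' implementation: the analysis above was run in PEPI and Stata, and the text does not state which multirater κ formula was used, so Fleiss' κ is an assumption here. The function names and the example ratings are hypothetical.

```python
from collections import Counter

# Answer categories used for questions 1 to 3 in this study.
CATEGORIES = ("no", "questionable", "progression")


def fleiss_kappa(ratings, categories=CATEGORIES):
    """Fleiss' kappa for subjects that are each rated by the same number of raters.

    ratings[i] is the list of answers that all raters gave for subject i.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    for row in ratings:
        if len(row) != n_raters or any(a not in categories for a in row):
            raise ValueError("each subject needs n_raters answers from the known categories")
    counts = [Counter(row) for row in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    p_obs = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_subjects

    # Chance agreement, from the overall share of answers in each category.
    shares = [sum(c[k] for c in counts) / (n_subjects * n_raters) for k in categories]
    p_exp = sum(s * s for s in shares)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_obs - p_exp) / (1.0 - p_exp)


def landis_koch(kappa):
    """Label a kappa value with the Landis and Koch agreement categories."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical answers from 5 raters for 3 images (not study data).
    demo = [
        ["no", "no", "no", "no", "questionable"],
        ["progression", "progression", "questionable", "no", "progression"],
        ["no", "no", "no", "no", "no"],
    ]
    kappa = fleiss_kappa(demo)
    print(round(kappa, 2), landis_koch(kappa))
```

The sketch treats the 3 answers as unordered (nominal) categories. Weighted κs, prevalence-adjusted and bias-adjusted κs, Gwet coefficients, and bootstrap confidence intervals are not covered by it.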
Descriptive statistical analyses were performed using SPSS version 20.0 (IBM Corporation). Interobserver agreement was classified as high agreement (the same answer from 5 or 4 specialists) and low agreement (the same answer from 3 or 2 specialists). This level of interobserver agreement was compared with the other parameters using logistic regression analysis with odds ratios (ORs). These parameters included the initial VFI, final VFI, change between initial and final VFIs, number of VFs, VF follow-up time, Glaucoma Staging System classification, initial RNFL, final RNFL, change in the initial and final RNFLs, number of OCT images, OCT follow-up time, and case difficulty.
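The high/low dichotomy described above can be sketched in a few lines of Python. This is only an illustration under the stated rule (the most frequent answer is shared by 5 or 4 specialists for high agreement, and by 3 or 2 for low agreement); the function name and the example answers are hypothetical, and the logistic regression step is not reproduced.

```python
from collections import Counter


def agreement_level(answers):
    """Classify one image's 5 answers as 'high' or 'low' interobserver agreement.

    'high' means 4 or 5 specialists gave the same answer; 'low' means the most
    frequent answer was shared by only 2 or 3 specialists.
    """
    if len(answers) != 5:
        raise ValueError("expected one answer from each of the 5 specialists")
    # Size of the largest group of identical answers.
    modal_count = Counter(answers).most_common(1)[0][1]
    return "high" if modal_count >= 4 else "low"


if __name__ == "__main__":
    # Hypothetical answers for two images (not study data).
    print(agreement_level(["no", "no", "no", "no", "questionable"]))
    print(agreement_level(["no", "no", "questionable", "questionable", "progression"]))
```

With 5 raters and 3 possible answers, at least 2 raters always share an answer, so every image falls into one of the two levels.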
The κ statistics for the intraobserver analysis among the glaucoma specialists and the 4 questions are shown in Table 2. According to the Landis and Koch classification,10 the agreement for question 1 about VF progression was almost perfect for 4 specialists and substantial for 1 specialist; similarly, the κ values were almost perfect regarding question 2 on progression on the OCT images. No difference was found between the VF and OCT intraobserver agreement. The κ for question 3 on glaucoma damage progression showed slightly lower agreement, with substantial to almost perfect agreement among the examiners. Finally, question 4 on change in treatment showed moderate to almost perfect agreement, depending on the examiner. Agreement was generally high, although it differed among the examiners.
Table 3 shows the overall κ statistics for interobserver agreement. The combined VF agreement was moderate, although it was substantial in the answer reflecting no progression. The lowest κ values were associated with the questionable or progression answers in all questions. The agreement was similar in the combined OCT progression and slightly higher in answers reflecting no progression. The differences in interobserver agreement between question 1 and question 2 were small in either the first (κ difference, −0.03; 95% CI, −0.12 to 0.06; P = .53) or the second (κ difference, −0.06; 95% CI, −0.15 to 0.03; P = .18) evaluations. The κ values from question 3 were 0.39 (95% CI, 0.32-0.48) and 0.38 (95% CI, 0.30-0.47) in the first and second evaluations, respectively, but the agreement was substantial in patients with no progression. Similar low agreement was obtained for question 4.
The agreement among the specialists according to the classification of the principal investigator regarding the ease or difficulty of the images for detecting progression is shown in Table 4. The κ values were higher in the normal images than in the difficult images regarding combined agreement and agreement in patients with no progression.
Comparison of Interobserver Agreement With Other Factors
Interobserver agreement regarding the VF was associated with the Glaucoma Staging System classification (OR, 1.52; 95% CI, 1.07-2.16; P = .01), initial VFI (OR, 0.98; 95% CI, 0.97-0.99; P = .03), final VFI (OR, 0.99; 95% CI, 0.96-0.99; P = .005), difference in initial to final VFI (OR, 1.04; 95% CI, 1.00-1.09; P = .04), and easy/difficult diagnosis (OR, 8.40; 95% CI, 2.53-27.85; P = .001). Thus, the agreement was higher in early-stage glaucoma, with high initial VFI, with low final VFI, and with higher changes in the VFI. No association was found between interobserver agreement in the VF and the number of VFs (OR, 1.05; 95% CI, 0.94-1.18; P = .40) or months of follow-up (OR, 1.02; 95% CI, 0.99-1.04; P = .12). Interobserver agreement of the OCT values was associated with the difference in the initial and final RNFL thicknesses (OR, 0.93; 95% CI, 0.87-0.99; P = .04). No association was found between interobserver agreement in the OCT and number of OCT images (OR, 1.04; 95% CI, 0.67-1.62; P = .85) or follow-up time (OR, 1.00; 95% CI, 0.98-1.00; P = .73), initial RNFL thickness (OR, 0.98; 95% CI, 0.96-1.02; P = .35), final RNFL thickness (OR, 0.99; 95% CI, 0.97-1.03; P = .97), or easy/difficult diagnosis (OR, 0.40; 95% CI, 0.13-1.24; P = .11).
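For readers who want to relate a reported OR to its 95% CI, the following Python sketch recovers an approximate standard error, z statistic, and 2-sided P value. It assumes a Wald-type interval that is symmetric on the log-odds scale, which is a common default for logistic regression but is not stated in the text; the function names are hypothetical, and values recomputed from rounded estimates will only approximate the published ones.

```python
import math

Z_95 = 1.959964  # 2-sided 95% quantile of the standard normal distribution


def wald_from_or_ci(odds_ratio, ci_low, ci_high):
    """Approximate SE, z, and 2-sided P value from an OR and its 95% CI.

    Assumes the interval is exp(log(OR) +/- Z_95 * SE), i.e. a Wald interval.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    z = math.log(odds_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return se, z, p_value


def ci_geometric_midpoint(ci_low, ci_high):
    """Geometric mean of the CI limits; equals the OR under the Wald assumption."""
    return math.sqrt(ci_low * ci_high)


if __name__ == "__main__":
    # Easy/difficult diagnosis for VF agreement: OR, 8.40; 95% CI, 2.53-27.85.
    se, z, p = wald_from_or_ci(8.40, 2.53, 27.85)
    print(f"SE={se:.3f}, z={z:.2f}, P={p:.4f}")
    print(f"CI midpoint={ci_geometric_midpoint(2.53, 27.85):.2f}")
```

Under this assumption, the geometric midpoint of a CI should be close to the reported OR, which gives a quick consistency check on any OR and CI pair.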
A goal in the follow-up of patients with glaucoma is to distinguish whether the observed changes are associated with test variability, artifacts, or actual disease progression. Different software to measure glaucomatous progression, such as those used in the current study, automatically adjust each follow-up test to the baseline test to detect subtle changes in event or trend analyses. These algorithms are designed to limit the background noise and, therefore, to provide a more valid analysis of longitudinal data, theoretically increasing the ability to discern true progression of glaucomatous disease.
Previous articles on agreement among specialists using different functional or structural tests to detect progression have been published. Viswanathan et al11 examined the level of agreement among 5 clinicians in assessing progressive deterioration in a VF series using standard Humphrey printouts and linear regression analysis. The authors reported a κ of 0.59 using Progressor analysis and a κ of only 0.32 using Humphrey printouts (P = .006). Other authors have reported similar levels of agreement.12-15 However, Tanna et al,16 who studied the agreements in 5 VF tests in 100 eyes evaluated by 5 glaucoma subspecialists, reported a moderate interobserver agreement (κ, 0.45) that did not improve when each specialist had access to Humphrey progression software.16 The authors suggested that reviewers varied regarding the degree to which they agreed with the software and that their determinations regarding progression were infrequently altered by the software data. Specialist readers might have dismissed certain progression patterns classified as such by the progression software as an artifact or error. In the current study, intraobserver agreement was high, but interobserver agreement was only moderate (κ, 0.48) using this progression software, similar to previous articles.11,16
The current study also evaluated the agreement in the progression analysis from the Cirrus-OCT machine. Our results suggest only moderate interobserver agreement. To our knowledge, the current study is the first to analyze the agreement in this software from the Cirrus-OCT instrument. However, previous studies of the agreement of other structural tests have been published. Breusegem et al17 evaluated the interobserver κ coefficients among 3 specialists of a set of 2 serial optic disc color stereoslides for glaucomatous changes in 40 patients. The authors reported moderate interobserver agreement (κ, 0.51; 95% CI, 0.33-0.69).17 In another report, using Heidelberg Retina Tomograph (Heidelberg Engineering) printouts for 168 patients, 3 experienced observers found moderate to good agreement among all methods from the Heidelberg Retina Tomograph.18 In the current study, the κ value for agreement on OCT progression was slightly higher than that for VF progression, but the data did not show a clear difference between them. These results suggest that the agreement values between structural and functional software regarding progression are similar and that one is not superior to the other. However, the agreement differed in the 3 answers to each question. Specifically, agreement on the negative answers (no progression) was moderate to substantial, whereas agreement was fair to moderate for answers of questionable progression or progression. The different levels of agreement in the 3 evaluated answers for each question suggested that the progression software programs used alone are more useful in patients with no progression than in those with questionable glaucoma progression or glaucoma progression.
The main cause of low agreement regarding the answers of questionable or progression might be the variability of the different tests. There are many causes of variability, such as patient-reported alertness, increasing age, test-retest variability, glaucoma stage, test duration, and fixational eye movements.19-22 Previous studies have suggested that even experienced examiners have difficulty assessing progression in reliable VF series when tests are variable.1 A consensus has indicated that measurement variability affects the ability of any device to detect progression.23 The goal of the current study was not to analyze the causes of variability in OCT scans or VFI from the Humphrey software. There are 4 frequently used measures of variability: range, interquartile range, variance, and standard deviation. The only available data used to analyze the variability of the printout pages from the Humphrey progression software were the 95% CIs of the VFI trend analysis, whereas the standard deviation of VFI trend analysis was used in the progression software in the Cirrus-OCT instrument. Among the current 100 patients, the 95% CI was higher (greater than 1%) than the mean VFI in 32 patients, and the standard deviation was higher than the mean in 31 patients for the average RNFL thickness from OCT. In patients with a high change value in the VFI or RNFL thickness with high variability among tests, the case will be diagnosed as progression. However, in patients with a low change value but with high variability among tests, it is difficult to distinguish between variability and true progression (Figure). A previous report24 showed that the VF fluctuation is low in normal eyes and those with early glaucoma and increases from stage 0 to 4 from the Glaucoma Staging System. Similar findings were published indicating that a thin RNFL in patients with advanced glaucoma was associated with more variability in the RNFL thickness.25 These results suggest that the variability among different tests is the cause of the low agreement in the current study.
Questions 3 and 4 showed only fair to moderate agreement. These questions are important because they indicate whether the software used to measure progression is useful alone to make clinical decisions in daily practice. The patients in this study had early and advanced glaucoma. Previous studies have suggested that glaucoma progression could be better detected using OCT in patients with early glaucoma, while VF testing could better detect advanced glaucoma.26,27 Our results suggest that the examiners have different criteria to detect progression or determine treatment changes, probably because no other clinical parameters were included to facilitate decisions. Currently, neither of the most widely used algorithms meant to inform ophthalmologists whether a VF or an RNFL thickness has deteriorated is sufficiently reliable for determining whether a particular patient with glaucoma has worsened. Therefore, basing therapeutic decisions on these algorithms in isolation is inaccurate and could increase the risk of damage in patients with glaucoma.
Our study had limitations. First, the sample included patients with variability among different tests, which caused more doubts when trying to identify progression or no progression. However, many patients in clinical practice show variability among tests. Also, the same examiner performed all OCT scans, which excluded the examiner as a variability factor. Second, the Cirrus-OCT scans were obtained without an eyetracker. An eyetracker has only recently been included in the Cirrus software. Third, the specialists received no recommendations for evaluating the printout pages; thus, some specialists might have placed more emphasis on the event analysis and others on the trend analysis. However, we preferred not to provide advice so that the specialists could interpret glaucoma progression as they did in daily clinical practice. Fourth, the follow-up time was short (up to 101 months). However, the Cirrus is the newest-generation OCT device from Carl Zeiss Meditec, and it has only been available for clinical use for the last 8 years. Additionally, the number of scans or the follow-up time was unrelated to interobserver agreement. Finally, only subjective agreement was evaluated, not accuracy or precision; the goal of this study was to determine agreement regarding progression using both software packages in a group of patients to facilitate clinical decision making about progression or treatment changes. Despite these limitations, our results might contribute to the knowledge about the advantages and problems of both software packages in detecting glaucoma progression.
In conclusion, substantial or almost perfect intraobserver agreement was found in the evaluated questions. Nevertheless, only moderate interobserver agreement was found in the combined results. Answers of no progression showed more agreement than answers of questionable progression or progression. The factors associated with interobserver agreement depended on variability. In patients with low mean and high variability values in the trend analysis, the diagnosis of progression was questionable. The interobserver agreement levels of the global progression of glaucoma damage and treatment changes were only fair. Our results suggest that both progression software packages have similar agreement and are more clinically useful in patients with no progression than in patients with questionable progression or progression. Hence, the data suggest that these glaucoma progression software packages used alone are insufficient to obtain high interobserver agreement in patients with glaucoma progression. More effort must be made to improve the software to reduce the variability and facilitate diagnosis of glaucoma progression.
Corresponding Author: Javier Moreno-Montañés, MD, PhD, Department of Ophthalmology, Clínica Universidad de Navarra, Universidad de Navarra, Av. Pío XII, 36, Pamplona, Spain.
Accepted for Publication: January 8, 2017.
Published Online: February 23, 2017. doi:10.1001/jamaophthalmol.2017.0017
Author Contributions: Dr Moreno-Montañés had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Moreno-Montañés, A. Anton, Larrosa.
Acquisition, analysis, or interpretation of data: V. Anton, A. Anton, Larrosa, Martínez-de-la-Casa, Rebolleda, García-Granero.
Drafting of the manuscript: Moreno-Montañés, Martínez-de-la-Casa, García-Granero.
Critical revision of the manuscript for important intellectual content: Moreno-Montañés, V. Anton, A. Anton, Larrosa, Martínez-de-la-Casa, Rebolleda, García-Granero.
Statistical analysis: Moreno-Montañés, Martínez-de-la-Casa, García-Granero.
Administrative, technical, or material support: Moreno-Montañés, V. Anton, Martínez-de-la-Casa.
Study supervision: Moreno-Montañés, Larrosa, Martínez-de-la-Casa, Rebolleda.
Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported.
Funding/Support: Supported in part by Instituto de Salud Carlos III, “Red temática de Investigación Cooperativa, Proyecto RD07/0063. OftaRed: Patología ocular del envejecimiento, calidad visual y calidad de vida.”
Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Previous Presentation: This article was presented in part at the Annual Meeting of the Association for Research in Vision and Ophthalmology; May 1, 2016; Seattle, Washington.
J. Comparison of methods to detect visual field progression in glaucoma. Ophthalmology. 1997;104(8):1228-1236.
KI. Variability of automated visual fields in clinically stable glaucoma patients. Invest Ophthalmol Vis Sci. 1989;30(6):1083-1089.
B. Diagnosis of early glaucoma with flicker comparisons of serial disc photographs. Invest Ophthalmol Vis Sci. 1989;30(11):2376-2384.
Carl Zeiss Meditec Inc. Details Define Your Decisions. Jena, Germany: Carl Zeiss Meditec; 2007.
et al. Categorizing the stage of glaucoma from pre-diagnosis to end-stage disease. Am J Ophthalmol. 2006;141(1):24-30.
MC. Statistical Methods for Rates and Proportions. 3rd ed. New York, NY: John Wiley & Sons; 2003.
KL. Handbook of Inter-Rater Reliability: the Definitive Guide to Measuring the Extent of Agreement Among Raters. 2nd ed. Gaithersburg, MD: Advances Analytics; 2010:76-78, 80-81.
et al. Interobserver agreement on visual field progression in glaucoma: a comparison of methods. Br J Ophthalmol. 2003;87(6):726-730.
et al. A comparison of experienced clinical observers and statistical tests in detection of progressive visual field loss in glaucoma using automated perimetry. Arch Ophthalmol. 1988;106(5):619-623.
EB. Technique for determining glaucomatous visual field progression by using animation graphics. Am J Ophthalmol. 1994;118(4):485-491.
P. Agreement in detecting glaucomatous visual field progression by using guided progression analysis and Humphrey overview printout. Eur J Ophthalmol. 2011;21(5):573-579.
et al. Agreement of visual field interpretation among glaucoma specialists and comprehensive ophthalmologists: comparison of time and methods. Br J Ophthalmol. 2011;95(6):828-831.
et al. Interobserver agreement and intraobserver reproducibility of the subjective determination of glaucomatous visual field progression. Ophthalmology. 2011;118(1):60-65.
T. Agreement and accuracy of non-expert ophthalmologists in assessing glaucomatous changes in serial stereo optic disc photographs. Ophthalmology. 2011;118(4):742-746.
RN, Martinez de la Casa et al. Clinicians agreement in establishing glaucomatous progression using the Heidelberg retina tomograph. Ophthalmology. 2009;116(1):14-24.
WH. Variability of visual field measurements is correlated with the gradient of visual sensitivity. Vision Res. 2007;47(7):925-936.
AM. What reduction in standard automated perimetry variability would improve the detection of visual field progression? Invest Ophthalmol Vis Sci. 2011;52(6):3237-3245.
DF. The relationship between variability and sensitivity in large-scale longitudinal visual field data. Invest Ophthalmol Vis Sci. 2012;53(10):5985-5990.
et al; CIGTS (Collaborative Initial Glaucoma Treatment Study) Study Group. The collaborative initial glaucoma treatment study: baseline visual field and test-retest variability. Invest Ophthalmol Vis Sci. 2003;44(6):2613-2620.
FA. Progression of Glaucoma. Amsterdam, the Netherlands: Kugler Publications; 2011:137. Consensus Series; Vol 8.
et al. Long-term perimetric fluctuation in patients with different stages of glaucoma. Br J Ophthalmol. 2011;95(2):189-193.
et al. Factors associated with variability in retinal nerve fiber layer thickness measurements obtained by optical coherence tomography. Ophthalmology. 2007;114(8):1505-1512.
RN. The structure and function relationship in glaucoma: implications for detection of progression and measurement of rates of change. Invest Ophthalmol Vis Sci. 2012;53(11):6939-6946.
et al. Retinal nerve fibre layer and visual function loss in glaucoma: the tipping point. Br J Ophthalmol. 2012;96(1):47-52.