Hematoxylin-eosin stained biopsy samples (original magnification ×20). A, Arrowheads indicate mitotic figures in invasive melanoma. B, Arrowheads indicate eosinophils in perivascular and interstitial zones of spongiotic dermatosis.
Kent MN, Olsen TG, Feeser TA, et al. Diagnostic Accuracy of Virtual Pathology vs Traditional Microscopy in a Large Dermatopathology Study. JAMA Dermatol. 2017;153(12):1285–1291. doi:10.1001/jamadermatol.2017.3284
Is diagnosis from whole-slide images (WSI) noninferior to diagnosis from glass slides for cutaneous specimens?
This large retrospective study of 499 representative dermatopathology cases found 94% intraobserver concordance between WSI and traditional microscopy (TM). Diagnosis from WSI was found statistically noninferior to diagnosis from glass slides compared with ground truth consensus diagnosis.
In most cases, we found diagnosis from WSI noninferior to TM. The WSI method approached, but did not achieve, noninferiority for the subset of melanocytic lesions. Discordance in TM diagnosis of this challenging group is broadly recognized and further investigation of melanocytic neoplasms is recommended.
Digital pathology represents a transformative technology that impacts dermatologists and dermatopathologists from residency to academic and private practice. Two concerns are accuracy of interpretation from whole-slide images (WSI) and effect on workflow. Studies of considerably large series involving single-organ systems are lacking.
To evaluate whether diagnosis from WSI on a digital microscope is inferior to diagnosis of glass slides from traditional microscopy (TM) in a large cohort of dermatopathology cases with attention on image resolution, specifically eosinophils in inflammatory cases and mitotic figures in melanomas, and to measure the workflow efficiency of WSI compared with TM.
Design, Setting, and Participants
Three dermatopathologists established interobserver ground truth consensus (GTC) diagnosis for 499 previously diagnosed cases proportionally representing the spectrum of diagnoses seen in the laboratory. Cases were distributed to 3 different dermatopathologists who diagnosed by WSI and TM with a minimum 30-day washout between methodologies. Intraobserver WSI/TM diagnoses were compared, followed by interobserver comparison with GTC. Concordance, major discrepancies, and minor discrepancies were calculated and analyzed by paired noninferiority testing. We also measured pathologists’ read rates to evaluate workflow efficiency between WSI and TM. This retrospective study was carried out in an independent, national, university-affiliated dermatopathology laboratory.
Main Outcomes and Measures
Intraobserver concordance of diagnoses between WSI and TM methods and interobserver variance from GTC, following College of American Pathologists guidelines.
Mean intraobserver concordance between WSI and TM was 94%. Mean interobserver concordance was 94% for WSI and GTC and 94% for TM and GTC. Mean interobserver concordance between WSI, TM, and GTC was 91%. Diagnoses from WSI were noninferior to those from TM. Whole-slide image read rates were commensurate with WSI experience, achieving parity with TM by the most experienced user.
Conclusions and Relevance
Diagnosis from WSI was found equivalent to diagnosis from glass slides using TM in this statistically powerful study of 499 dermatopathology cases. This study supports the viability of WSI for primary diagnosis in the clinical setting.
Whole-slide images (WSI) are virtual slides produced by digitally scanning conventional glass slides. Advantages of WSI over traditional microscopy (TM) are found in training and education, tumor boards, research, frozen tissue diagnosis, permanent archiving, teleconsultation, collaboration, inclusion in patient electronic medical records, and quality assurance testing.1-5 Whole-slide images permit manual and computer-aided quantitative image analysis of diagnostic and prognostic protein and genetic biomarkers, with several algorithms having US Food and Drug Administration (FDA) approval for clinical use.6 Whole-slide image biorepositories provide for the development of pattern recognition algorithms as a diagnostic tool.7-10 Digital WSI allow annotation, insertion of comments, and enhanced accessibility to specialists, especially from geographically remote locations.11
The ability to diagnose from WSI is uniquely pertinent to dermatology, where residency curricula require training and competence that qualifies dermatologists to interpret their own dermatopathology specimens.12 Moreover, dermatopathology is ideal for WSI diagnosis owing to the relatively small specimen size, low slide counts, and overall higher case volume found in the specialty.13
Although WSI are accepted adjunctively in various clinical applications and in several specific tests, lack of commonly accepted validation study standards has impeded the acceptance of WSI for primary diagnosis in the United States. However, after years of scrutiny, the FDA announced on April 12, 2017, approval of the first WSI system for primary diagnosis.14 We interpret this decision as a major step to facilitate widespread adoption of digital pathology in many workflow settings and anticipate that additional WSI systems will receive FDA approval in the near future.15 During the interim, individual laboratories are not prohibited from validation of WSI systems for their own use as laboratory developed tests.16
A further concern for acceptance of WSI for primary diagnosis is the considerable variability between organ system subdisciplines that may require system by system validation.16 There have been a number of studies evaluating concordance between WSI and TM; however, studies are often small, lack specificity of specialties, and are dominated by interobserver rather than intraobserver validation methodologies.17 More recent investigations have included dermatopathology-specific cases and emphasis on intraobserver observations.17-19 Similar studies have appeared outside the dermatologic literature.20
Here, we investigate intraobserver concordance of diagnosis from WSI compared with TM in a large cohort of 499 cases proportionally representing the spectrum of diagnoses encountered in our high-volume dermatopathology laboratory. We compare these diagnoses with interobserver ground truth consensus (GTC) diagnosis. We tested the hypothesis that interpretation from WSI was noninferior to that from TM; and that the discordance observed is no greater than the inherent variability in diagnosis among pathologists using TM.19,21-23
An efficiency study of WSI vs TM read rates by pathologists with different levels of experience using WSI is included.
Prior to the study, institutional review board (IRB) review was sought, and the study was deemed exempt by the IRB at the Boonshoft School of Medicine, Wright State University. Whole-slide images were viewed, diagnosed, and reported using a patented in-house–developed Laboratory Information System (LIS) platform (Clearpath). The WSI system was previously validated for primary diagnosis of hematoxylin-eosin (H&E) stained slides per the College of American Pathologists (CAP) guidelines. Special stains and immunohistochemistries were not included in this study because they require separate validation for each stain.16
A total of 505 cases, proportionally representing the 30 most frequent diagnoses (approximately 95% of total diagnoses reported in our laboratory), were sequentially pulled from cases received in the 3 months prior to commencement of the study. Eight melanomas (2 in situ and 6 invasive malignant melanomas) were added for a total of 15 melanomas. A total of 513 cases were initially considered for the study, reflecting the spectrum and complexity of specimen types and diagnoses encountered during routine practice.
Case materials were deidentified and entered into a database that included reference number, patient demographics, available clinical information, and original diagnosis. Glass slides and data, including the original diagnosis, were viewed simultaneously on a multi-headed microscope by 3 board-certified dermatopathologists (T.O., S.S., and M.M.) and interobserver GTC diagnosis established. If consensus could not be achieved, the case was not included in the study. A total of 499 cases were included and are listed in Table 1.
Glass slides were digitized as WSI at ×20 magnification (Aperio AT2 Image Scope, Leica Biosystems). Cases were divided into 3 groups of 166, 166, and 167 cases, each proportionally representing the spectrum of cases seen in the laboratory.
Three board-certified dermatopathologists (intraobserver group, J.M., M.C., and M.J.K.) diagnosed one half of their cases by TM and the second half by WSI. Glass slides were read on conventional microscopes, while WSI were read on an in-house WSI system (Clearpath). Following a minimum 30-day washout period, each dermatopathologist diagnosed the same cases using the alternate method.
Intraobserver concordance between TM and WSI was determined, as well as interobserver concordance between each interpretation and GTC diagnosis. Assignment of major or minor discrepancies between diagnoses was determined by effect on patient care and outcome, according to the criteria used by Bauer et al.20 Major discrepancy represented a difference in diagnosis associated with an alteration in patient care. Minor discrepancy was defined as having no significant clinical impact.
Statistical analysis was performed by the Statistical Consulting Center at Wright State University. Discrepancies were analyzed by paired noninferiority tests with a 4% noninferiority margin. Major and minor discrepancies were analyzed separately with their own noninferiority tests of hypothesis. A Bonferroni correction was applied to control for potentially inflated type I error from performing 2 tests, which resulted in a level of significance of α = .025 that was used throughout. Wald-type test statistics and confidence intervals were calculated using methods described by Liu et al.24
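The testing scheme described above can be illustrated with a minimal sketch of a generic Wald-type noninferiority test for paired binary outcomes. This is not necessarily the exact formulation of Liu et al; the function name, the one-sided use of α = .025, and the example counts are assumptions made for illustration only and are not the study's data. Here `b` counts cases in which only the WSI reading was discrepant from GTC, and `c` counts cases in which only the TM reading was discrepant.

```python
from math import sqrt
from statistics import NormalDist


def paired_noninferiority_wald(b, c, n, margin=0.04, alpha=0.025):
    """Generic Wald-type noninferiority test for paired binary data.

    b: pairs where only the new method (eg, WSI) is discrepant from the reference
    c: pairs where only the control method (eg, TM) is discrepant
    n: total number of paired cases
    H0: p_new - p_control >= margin   vs   H1: p_new - p_control < margin
    """
    diff = (b - c) / n                       # estimated difference in discrepancy rates
    se = sqrt((b + c) / n - diff ** 2) / sqrt(n)
    if se == 0:
        raise ValueError("no discordant pairs; Wald statistic is undefined")
    z = (diff - margin) / se                 # strongly negative z favors noninferiority
    p_value = NormalDist().cdf(z)            # one-sided P value
    upper = diff + NormalDist().inv_cdf(1 - alpha) * se
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "upper_bound": upper,                # noninferior when this lies below the margin
        "noninferior": p_value < alpha,
    }


# Bonferroni correction for 2 tests (major and minor discrepancies): 0.05 / 2
ALPHA = 0.05 / 2

# Hypothetical counts, NOT taken from the study
result = paired_noninferiority_wald(b=10, c=9, n=499, margin=0.04, alpha=ALPHA)
print(result["noninferior"], round(result["upper_bound"], 4))
```

In this formulation, rejecting the null hypothesis is equivalent to the upper confidence bound on the difference in discrepancy rates falling below the 4% margin.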
Separately, we compared read rates of 3 dermatopathologists (ratio of cases read/h by WSI vs cases read/h by TM) using a proprietary in-house digital pathology system (Clearpath) for WSI. Two dermatopathologists read WSI and TM cases at 1 hour intervals, for a total of 2 hours of reading by each methodology. A third pathologist read 1 hour by WSI and 2 hours by TM. Cases were included regardless of number of slides and without concern for level of difficulty; however, cases requiring additional stains or recuts were not included. Timers were stopped for interruptions such as phone calls or extraneous transactions to allow comparison of actual reading times.25
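The read-rate ratio described above can be sketched as follows. The function name and the case counts are hypothetical and serve only to show how unequal reading times (for example, 1 hour by WSI and 2 hours by TM) are normalized before the ratio is taken.

```python
def read_rate_ratio(wsi_cases, wsi_hours, tm_cases, tm_hours):
    """Ratio of cases read per hour by WSI to cases read per hour by TM.

    A value of 1.0 indicates parity; values below 1.0 indicate slower WSI reading.
    """
    wsi_rate = wsi_cases / wsi_hours
    tm_rate = tm_cases / tm_hours
    return wsi_rate / tm_rate


# Hypothetical counts, NOT the study's data: 20 cases in 1 h by WSI, 50 cases in 2 h by TM
ratio = read_rate_ratio(wsi_cases=20, wsi_hours=1, tm_cases=50, tm_hours=2)
print(f"{ratio:.2f}")
```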
Intraobserver concordance between WSI and TM diagnosis was 94% (471 of 499). Twenty-eight (6%) cases were designated discordant diagnoses; 14 were minor and 14 were major. Of the major discrepancies shown in Table 2, there were 6 melanocytic lesions, 2 inflammatory lesions and 6 cases involving epithelial dysplasia/atypia vs epithelial hyperplasias/hypertrophies. Major discrepancies among melanocytic lesions included 2 cases of invasive melanoma vs in situ melanoma, 1 case of melanoma vs severely atypical nevus, and 3 cases where the margins were involved and coinciding histologically with a moderate or mild degree of atypism vs a benign nevus.26-30 Major discrepancies involving 2 inflammatory cases included spongiotic dermatitis by WSI vs benign keratosis by TM and another case of spongiotic dermatitis where TM interpretation was granuloma annulare. The 6 major discrepancies of nonmelanocytic lesions involved interpretation of disorganization and cytologic atypia of actinic keratosis vs various epithelial hyperplasias and/or hypertrophies.
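As a quick arithmetic check of the concordance figures, the following sketch recomputes them from the counts reported above (499 cases, 14 major and 14 minor discrepancies). The helper name is illustrative only.

```python
def concordance(total, discordant):
    """Fraction of cases in which the two readings agree."""
    return (total - discordant) / total


TOTAL_CASES = 499
MAJOR = 14
MINOR = 14

# Any discrepancy (major or minor) counts as disagreement: 471 of 499
strict = concordance(TOTAL_CASES, MAJOR + MINOR)

# Only major discrepancies count as disagreement: 485 of 499
major_only = concordance(TOTAL_CASES, MAJOR)

print(f"{strict:.1%}")      # 94.4%
print(f"{major_only:.1%}")  # 97.2%
```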
Interobserver concordance between WSI and GTC diagnosis was 94%, while interobserver concordance between TM and GTC diagnosis was 94%. Concordance between all observations (GTC, WSI, and TM) was 91%.
Melanocytic lesions comprised 30% of total cases, including 15 malignant melanomas, and inflammatory cases represented 7%. Mitotic figures within melanomas were not difficult to identify by WSI. Inflammatory cases, particularly spongiotic dermatoses, yielded a mixed infiltrate with eosinophils that were easily visualized by WSI (Figure).
Noninferiority was determined comparing discrepancies of paired samples by each readout method to GTC, allowing a 4% noninferiority margin between methods. Rates of both major and minor discrepancies of WSI and TM for all cases were not significantly different. When considering major discrepancies in 3 subgroups, we found no significant difference between methodologies for nonmelanocytic lesions and inflammatory lesions; however, major discrepancies for melanocytic lesions fell outside the 4% noninferiority margin (Table 3).
Pathologist’s efficiency, measured by ratio of number of WSI cases read per minute to TM cases read per minute, suggests that WSI efficiency can achieve parity with TM, and efficiency correlated with pathologist’s experience with WSI (Table 4).
Acceptance of WSI for primary diagnosis requires rigorous demonstration that patient care will not be adversely affected. The accuracy of this methodology may well be dependent on individual subspecialty validation. The primary outcome in this large dermatopathology validation study is that diagnosis of dermatologic cases from WSI is not inferior to diagnosis by TM. Noninferiority tests are a standard intended to show that the effect of a new method is not worse than that of an active control by more than a specified margin. This testing methodology was the foundation for FDA approval of the first WSI system approved for primary diagnosis.14 Our findings support increasing evidence that using WSI does not compromise patient care and allows pathologists to gain confidence in WSI for primary diagnosis.
Unlike previous noninferiority testing of WSI,17,20 we used paired testing, with the advantage of comparing results from the same sample by the 2 different methods. Our study provides a statistically powerful and comprehensive addition to previous validation studies.
A recent publication by Shah et al17 lends additional support to the dermatopathology subspecialty in their finding that intraobserver and interobserver concordance rates between WSI and TM were equivalent in 181 dermatopathology cases when adhering to the 2013 CAP validation guidelines. In an earlier, smaller study, Al-Janabi et al18 found WSI and TM intraobserver concordance to be 94% in 100 dermatopathology cases. The 6 discordant cases were graded as minor, having no impact on patient care. In 2016, Goacher et al31 published a systematic review of 38 validation studies across all pathology subspecialties and found the mean diagnostic concordance of WSI and TM, weighted by the number of cases per study, was 92.4%. Our study would meet the criteria for inclusion in the Goacher et al review and provides further support for confirmation of the efficacy of WSI.
We found total (major and minor) intraobserver concordance between WSI and TM was unequivocal in 94% of all cases. Further, intraobserver concordance between WSI and TM achieved 97% when minor discrepancies were included as agreement. Importantly, our 94% interobserver concordance of WSI to GTC and TM to GTC supports the position that diagnostic methodology does not impact a pathologist’s interpretation.
Among the 499 cases reviewed, we evaluated 3 subgroupings: nonmelanocytic lesions, melanocytic lesions, and inflammatory cases, similar to the selected clustering of cases reported by other investigators.13,17,19 The 14 major intraobserver discrepancies by case and subgroup provide a frame of reference for discussion (Table 2).
The highest intraobserver concordance between WSI and TM was in the nonmelanocytic grouping, with 98% agreement. The 6 major discrepancies involved interpretation of an actinic keratosis vs a variety of benign epithelial hyperplasias, including seborrheic keratoses, verrucoid keratoses, and verruca vulgaris. Cases involving discordance between actinic keratoses, carcinoma in situ (Bowen disease) and invasive squamous cell carcinoma were not observed. We interpret the occurrence of major discrepancies, both intraobserver and interobserver, not as a failure of WSI methodology but rather as a reflection of the inherent subjectivity of pattern recognition and integration of degrees of dysplasia when biopsies are taken from chronically sun-damaged skin. While WSI yielded a less malignant diagnosis vs TM in 4 of 6 cases, WSI was concordant with GTC in 5 of 6 cases. Shah et al17 published similar results in terms of the common occurrence of actinic keratosis as a discordant diagnosis in the nonmelanocytic lesion grouping, as well as 2 discordant cases of an actinic keratosis vs invasive squamous cell carcinoma.
In melanocytic lesions, although WSI vs TM concordance was 96% for major discrepancies, TM agreement with GTC was greater than WSI agreement with GTC, and the 2.5% lower confidence bound fell below the 4% noninferiority margin. This is not unexpected because diagnostic discordance among pathologists for this challenging group of lesions by TM alone has long been documented in the literature.32-35 We believe the method of readout (WSI or TM) is not a factor in the variability inherent in diagnosing this subgroup. It should also be noted that of the 15 melanomas in the study, WSI and TM each had 1 major disagreement with GTC.
As WSI emerges as a new methodology for diagnosing pathology cases, the intraobservational and interobservational discordance in the diagnosis of melanocytic lesions remains evident. Okada et al36 found 100% concordance after reviewing 23 benign melanocytic lesions and 12 malignant melanomas, both in situ and invasive malignant melanoma. Leinweber et al,37 in a larger series of 560 melanocytic neoplasms, reported a concordance rate of 94.4% to 96.4%, but using a binary (benign or malignant) reporting system. Shah et al17 recorded an overall lower concordance rate of melanocytic lesions at 75.6%, yet the figure corrected to a 97.4% intraobserver concordance using the binary system methodology. We expect that further, more statistically powerful WSI studies of melanocytic lesions, considering the effect of scanning magnification and including immunohistochemistry stains, will better elucidate the noninferiority of WSI and TM vs GTC for this challenging group. Case 1 reflected the major discrepancy of in situ melanoma vs invasive malignant melanoma with WSI favoring an in situ process and an invasive melanoma diagnosis by TM (Table 2). The GTC favored invasive malignant melanoma. On rereview of WSI, a section was identified with invasive melanoma. Case 2 was again a situation of an in situ melanoma diagnosed by WSI and GTC, as opposed to TM diagnosis of invasive melanoma. Further review was unable to reproduce the TM diagnosis of invasive melanoma. In both cases, one could reasonably attribute differences in interpretation to variations in logistical habit sign-out patterns when one is faced with multiple blocks that produce multiple fields of view under the microscope or scanned by WSI.
The remaining 4 intraobserver major discrepancies included a severely atypical nevus by TM vs melanoma by WSI and 3 cases where WSI diagnosis of nevus was less severe than dysplastic diagnoses by TM, with 2 of the 3 cases being moderate atypia and the third mild. These results might suggest a lower degree of image resolution by WSI vs TM, particularly because our scanning magnification was ×20 vs the ×40 magnification used by Shah et al17 and Al Habeeb and colleagues.13 Indeed, Al Habeeb et al noted that among their surveyed staff pathologists, participating majority opinion indicated ×40 magnification by WSI produced a superior image to the microscope. Nonetheless, with the more severe diagnosis of invasive melanoma by WSI vs severely atypical nevus by TM, in addition to the lack of disparities between WSI vs GTC and TM vs GTC, respectively, we attribute the differences to subjective nuances observed in melanocytic neoplasms.17,32-35
Our data reflected a high level of concordance with inflammatory lesions, with 32 of 34 cases, representing a wide range of dermatoses, being concordant. In fact, even though the concordance rate for this subgroup was not as high as for melanocytic lesions, noninferiority was achieved, as WSI achieved greater concordance with GTC than did TM (Table 3). One major discrepancy yielded a diagnosis of spongiotic dermatitis vs verrucoid keratosis. This variance, namely inflammation vs neoplasm, has been observed by others and attributed to the lack of a complete data set, such as clinical photographs. The second major discrepancy, spongiotic dermatitis vs granuloma annulare, was rereviewed with the diagnosis of granuloma annulare being an interpretative error, with the variance attributed to ropelike basophilic collagen bundles interpreted as mucin. Although Massone et al38 have reported a low 75% concordance among inflammatory lesions, the authors acknowledge limitations of their study to include incomplete clinical data, unfamiliarity with the virtual microscope, and intrinsic difficulties that accompany inflammatory skin pathologic analyses. Because of the high level of concordance of inflammatory lesions in our series, we believe inflammatory skin pathologic abnormalities can be readily interpreted using WSI for primary diagnosis, assuming an awareness and focus on clinicopathological correlations and clinical photographs available for review.
There has been little agreement regarding effect of WSI on pathologist workflow efficiency.39-44 This is not unexpected as the nascent technology is developing, innovative approaches to workflow design and implementation are being tested, and a new generation of pathologists, many used to digital lifestyles, come of age. It may also be that, as with implementation of electronic medical records, efficiencies in one area may compromise those in another, even as the total system provides gains for patient care.44-46 Our preliminary findings suggest that experience is a key factor and that parity can be achieved.
Although our efficiency data are limited and relatively uncontrolled, after having the opportunity to work with WSI for 5 years it is our assessment that if the workflow involves 100 cases or fewer, WSI efficiencies can equal the microscope with training and practice. This is particularly relevant if there is an integrative laboratory information management system (LIS) and order and result interfaces.
We did not evaluate intraobserver TM to TM or WSI to WSI concordance in this study; however, data from the literature and interobserver observations of WSI to GTC and TM to GTC are in agreement with our intraobserver WSI vs TM results. The study did not include rare dermatologic diagnoses or cases referred to other specialists (ie, lymphoreticular processes and cytopathologic diseases). We did not include granulomatous pathologic abnormalities or foreign bodies because, to our knowledge, WSI does not currently include polarizing capabilities.
Diagnosis from WSI was found to be noninferior compared with diagnosis from TM in this validation study of 499 cases reflecting the spectrum and complexity of specimen types and diagnoses encountered in a dermatopathology practice. Our substudy regarding reading efficiency suggests that pathologists experienced with WSI can achieve parity with TM. As WSI systems improve and pathologists gain experience in this transformative technology, diagnostic time should not be a barrier in adoption of WSI for primary diagnosis.
Corresponding Author: Thomas G. Olsen, MD, Department of Dermatology, Boonshoft School of Medicine, Wright State University, Dermatopathology Laboratory of Central States, 7835 Paragon Rd, Dayton, OH 45459 (email@example.com).
Accepted for Publication: July 9, 2017.
Published Online: October 11, 2017. doi:10.1001/jamadermatol.2017.3284
Author Contributions: Drs Olsen and Kent had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Kent, Olsen, Feeser, Tesno, Moad, Stephenson, Peacock, Brumfiel.
Acquisition, analysis, or interpretation of data: Kent, Olsen, Feeser, Tesno, Moad, Conroy, Kendrick, Murchland, Khan, Peacock, Brumfiel, Bottomley.
Drafting of the manuscript: Kent, Olsen, Feeser, Tesno, Khan, Peacock.
Critical revision of the manuscript for important intellectual content: Kent, Olsen, Feeser, Tesno, Moad, Conroy, Kendrick, Stephenson, Murchland, Peacock, Brumfiel, Bottomley.
Statistical analysis: Khan, Bottomley.
Administrative, technical, or material support: Kent, Olsen, Feeser, Tesno, Moad, Conroy, Kendrick, Murchland, Khan, Peacock.
Study supervision: Kent, Olsen, Feeser.
Conflict of Interest Disclosures: Dr Olsen is minority owner of Clearpath, the Laboratory Information System (LIS) used to view and report the scanned whole slide images. No other disclosures are reported.
Additional Contributions: Keith J. Kaplan, MD, editor of http://www.tissuepathology.com, provided advice for the study design. Joel Crockett, MD, Dermatopathology Laboratory of Central States, provided technical contribution to the efficiency study. The contributors did not receive compensation.