Stoler MH, Schiffman M, for the Atypical Squamous Cells of Undetermined Significance–Low-grade
Squamous Intraepithelial Lesion Triage Study (ALTS) Group. Interobserver Reproducibility of Cervical Cytologic and Histologic Interpretations: Realistic Estimates From the ASCUS-LSIL Triage Study. JAMA. 2001;285(11):1500–1505. doi:10.1001/jama.285.11.1500
Author Affiliations: University of Virginia Health System, Charlottesville (Dr Stoler); and National Cancer Institute, Bethesda, Md (Dr Schiffman).
Toward Optimal Laboratory Use Section Editor: David H. Mark, MD, MPH, Contributing Editor.
Context Despite a critical presumption of reliability, standards of interpathologist
agreement have not been well defined for interpretation of cervical pathology.
Objective To determine the reproducibility of cytologic, colposcopic histologic,
and loop electrosurgical excision procedure (LEEP) histologic cervical specimen
interpretations among multiple well-trained observers.
Design and Setting The Atypical Squamous Cells of Undetermined Significance–Low-grade
Squamous Intraepithelial Lesion (ASCUS-LSIL) Triage Study (ALTS), an ongoing
US multicenter clinical trial.
Subjects From women enrolled in ALTS during 1996-1998, 4948 monolayer cytologic
slides, 2237 colposcopic biopsies, and 535 LEEP specimens were interpreted
by 7 clinical center and 4 Pathology Quality Control Group (QC) pathologists.
Main Outcome Measures κ values calculated for comparison of the original clinical center
interpretation and the first QC reviewer's masked interpretation of specimens.
Results For all 3 specimen types, the clinical center pathologists rendered
significantly more severe interpretations than did reviewing QC pathologists.
The reproducibility of monolayer cytologic interpretations was moderate (κ
= 0.46; 95% confidence interval [CI], 0.44-0.48) and equivalent to the reproducibility
of punch biopsy histopathologic interpretations (κ = 0.46; 95% CI, 0.43-0.49)
and LEEP histopathologic interpretations (κ = 0.49; 95% CI, 0.44-0.55).
The lack of reproducibility of histopathology was most evident for less severe
Conclusions Interpretive variability is substantial for all types of cervical specimens.
Histopathology of cervical biopsies is not more reproducible than monolayer
cytology, and even the interpretation of LEEP results is variable. Given the
degree of irreproducibility that exists among well-trained pathologists, realistic
performance expectations should guide use of their interpretations.
The interpretive reproducibility of cervical cytology and histopathology
is critical to cervical cancer prevention programs. There is a societal presumption
of high reproducibility of cytologic screening. In the medical community,
histopathologic interpretations are generally considered the reference standard
upon which treatment of cervical disease is based. No test or interpretation
is perfect, and both society and the medical profession may have excessively
high expectations. In fact, realistic standards of interpathologist agreement
for cytology and histology have not been well defined by rigorous studies
of large series of specimens.
Cytology screening interpretations define which women require focused
clinical attention. Organized screening programs based on periodic conventional
Papanicolaou (Pap) smears are successful in greatly reducing cervical cancer
deaths.1 In recent years, however, cervical
cytologic screening has come under attack because of a growing awareness of
the test's imperfections, including irreproducibility and false negativity.2-12
Problems with specimen collection and preparation may be partly ameliorated
with monolayer preparations.1,13
Ultimately, cervical cytologic screening is entirely predicated on the combined
judgment of the cytotechnologist and pathologist. Although clinicians vary
in their management of women with abnormal cytology, different diagnostic
interpretations of any given cytologic specimen may lead to radically different
Clinicians often evaluate abnormal cytologic findings using colposcopy,
with guided biopsies of visually abnormal areas. Colposcopy itself is not
well standardized14,15 and the
reproducibility of biopsy interpretation might actually be as variable and
problematic as cytologic interpretation.6,16-27
Previous studies on the reproducibility of cervical preneoplasia interpretation
have been limited in size and for the most part statistically inadequate.
Many clinicians now treat significant intraepithelial neoplasia, either
proven or suspected, using the loop electrosurgical excision procedure (LEEP)
to remove the cervical transformation zone. This procedure produces a large
histology specimen that is processed similarly to a cone biopsy, oriented
as a clock face in 12 sections. The resultant pathology report further defines
the grade of neoplasia and guides the patient's subsequent management. Despite
the widespread use of LEEP for the treatment of substantial cervical neoplasia,
the reproducibility of LEEP histopathology has not been rigorously evaluated.
We evaluated the reproducibility of cytology, biopsy histopathology,
and LEEP histopathology among multiple, well-trained observers in the context
of an ongoing multicenter clinical trial.
The Atypical Squamous Cells of Undetermined Significance–Low-grade
Squamous Intraepithelial Lesion (ASCUS-LSIL) Triage Study (ALTS) is a multicenter
randomized clinical trial with 3 study arms designed to evaluate the management
of mildly abnormal cytology findings by 3 alternative methods: immediate colposcopy,
conservative cytologic follow-up, or triage by human papillomavirus (HPV)
DNA testing.28 At enrollment into ALTS during
1996-1998, women referred for ASCUS or LSIL conventional Pap smears had a
repeat cytologic interpretation on monolayer cytology (ThinPrep, Cytyc, Boxborough,
Mass). Women triaged to colposcopy as required by the study protocol underwent
biopsy if lesions were visible upon application of acetic acid. Histologically
confirmed cervical intraepithelial neoplasia (CIN) grade 2 or 3 was treated
by LEEP. The few cases of prevalent, invasive carcinoma were treated more
extensively as appropriate. A full description of the study is available elsewhere.28
During enrollment, the ALTS clinical centers interpreted 4948 monolayer
cytology slides, 2237 biopsies (taking only the most severe result for each
woman, as described below), and 535 LEEP specimens that were independently
reviewed by the Pathology Quality Control Group (QC). There were 1 to 2 staff
pathologists per clinical center (7 in all), who worked independently. No
conferences were held regarding cases. The initial QC review was randomly
assigned to 1 of the 4 QC pathologists. The QC review was masked to the clinical
center interpretation and all other test results. The present analysis is
based on the comparison of clinical center to first QC interpretations. When
the first QC reviewer disagreed with the original clinical center interpretation,
additional reviews were performed.28 While
the QC algorithms and panel interpretations were used for ALTS to define disease
end points, the patients were managed by the original clinical center interpretations
unless CIN3 or cancer was suspected, in which case the final QC opinion was
For cytologic specimens, cytotechnologists' screening marks were not
removed during rescreening at the QC center at Johns Hopkins University. Quality
control histology reviews were performed on all the original slides interpreted
at the clinical centers. No recuts or substitute slides were used. Interpretations
were coded using the Bethesda System squamous intraepithelial lesion (SIL)
categories. Histologic and LEEP interpretations were categorized by severity
of overall interpretation for a case rather than by individual block, analogous
to actual clinical management. Thus, no woman contributed more than 1 interpretation
to the data tables for a given specimen type. Analyses were repeated to look
for trends in subgroups. These included dividing the data by each individual
QC pathologist, dividing the data by each of 4 clinical centers, and analyzing
the data over time to see if interpretative reproducibility varied over the
period of enrollment. Finally, the results of HPV testing were briefly considered
in association with the cytology and histology interpretations.28
Reactive, reparative, and inflammatory changes were grouped as negative
for this analysis. The very few invasive cancer interpretations were included
in the high-grade intraepithelial group. For histology, the results were combined
into cytology-like (Bethesda System SIL) groupings although the CIN terminology
was retained (eg, CIN2 or 3 is analogous to high-grade squamous intraepithelial
lesion [HSIL]), to permit comparisons across equal-sized data tables with
κ values were calculated to test for reproducibility while taking
chance agreement into account.29,30
We composed 4 × 4 diagnostic tables, as well as more condensed 2 ×
2 diagnostic tables, using each possible binary cutpoint (ie, negative vs ≥ASCUS, ≤ASCUS
vs ≥LSIL, and ≤LSIL vs ≥HSIL). Both unweighted and weighted κ
values were considered. Weighted values, with weights inversely proportional
to the number of categories of distance between 2 ratings, are sensitive to
severe disagreements as opposed to 1-category disagreements. Specifically,
the weights were 1.00 for data cells on the diagonal (ie, exact agreement),
0.67 for cells adjacent to the diagonal, 0.33 for cells 2 units from the diagonal,
and 0 for cells 3 units from the diagonal. Weighted κ values are higher
than unweighted values when disagreements are common but tend to be close
to the diagonal. As a rough guide, a κ value of less than 0 indicates
poor agreement, 0 to 0.2 represents slight agreement, 0.2 to 0.4 is fair agreement,
0.4 to 0.6 indicates moderate agreement, 0.6 to 0.8 shows substantial agreement,
and 0.8 to 1.0 is almost perfect agreement.
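The κ calculation described above can be sketched in a few lines. This is a minimal illustration, not the study's actual analysis code: the function names and the 4 × 4 table are hypothetical, the counts are invented for demonstration and are not ALTS data, and the handling of values that fall exactly on a boundary of the rough guide (eg, 0.4) is an assumption. Linear weights of the form 1 - |i - j|/(k - 1) reproduce the stated weights of 1.00, 0.67, 0.33, and 0 for a 4-category table.

```python
from typing import Sequence


def cohen_kappa(table: Sequence[Sequence[float]], weighted: bool = False) -> float:
    """Cohen's kappa for a square rater-by-rater table of counts.

    With weighted=True, linear agreement weights 1 - |i - j| / (k - 1) are
    used, which gives 1.00, 0.67, 0.33, 0 for a 4 x 4 table.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        if weighted:
            return 1.0 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    # Observed (weighted) agreement and the agreement expected by chance.
    p_obs = sum(weight(i, j) * table[i][j] for i in range(k) for j in range(k)) / n
    p_exp = sum(
        weight(i, j) * row_tot[i] * col_tot[j] for i in range(k) for j in range(k)
    ) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)


def agreement_label(kappa: float) -> str:
    """Rough verbal guide from the text; boundary handling is an assumption."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# HYPOTHETICAL counts (not ALTS data). Rows: clinical center; columns: QC.
# Category order: negative, ASCUS, LSIL, HSIL.
hypothetical = [
    [40, 8, 2, 0],
    [10, 20, 8, 2],
    [3, 9, 25, 5],
    [1, 2, 6, 15],
]

unweighted = cohen_kappa(hypothetical)
linear = cohen_kappa(hypothetical, weighted=True)
print(f"unweighted kappa = {unweighted:.2f} ({agreement_label(unweighted)})")
print(f"weighted kappa   = {linear:.2f} ({agreement_label(linear)})")
```

In this invented table most disagreements lie next to the diagonal, so the weighted value exceeds the unweighted one, which is the pattern the text describes.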
However, κ values were compared cautiously, for 2 reasons. First,
the interpretation of the κ statistic is affected by large differences
in disease prevalence.31 When disease prevalence
is very high or very low (rather than intermediate), the κ values are
decreased relative to the percentage of agreement, which does not take chance
agreement into account. The presentation notes when this might affect interpretation.
In general, as measured by the κ statistic, the rates of agreement observed
in this ALTS referral population would tend to be lower (relative to percentage
of agreement) in a screening population in which disease is rare.31 Second, κ statistics vary by the number of
diagnostic categories. Hence, κ statistics were computed for 4 ×
4 tables and 2 × 2 tables; the two sets of values cannot be directly compared. For the
4 × 4 tables, the symmetry χ² statistic
was used to compare the severity of clinical center vs QC interpretations.
Analogously, for the 2 × 2 tables, the McNemar statistic was used.
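The dependence of κ on prevalence noted above can be seen in a small worked example. The two 2 × 2 tables below are hypothetical (not ALTS data) and the function name is illustrative. Both tables have 80% raw agreement, but in the second the positive category is rare, chance agreement is higher, and κ is correspondingly lower.

```python
def kappa_2x2(a: int, b: int, c: int, d: int) -> float:
    """Unweighted kappa for a 2 x 2 table laid out as [[a, b], [c, d]].

    a and d are the concordant cells; b and c are the discordant cells.
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Chance agreement from the row and column marginal totals.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)


# HYPOTHETICAL tables with identical percentage agreement (80 of 100).
balanced = (40, 10, 10, 40)   # positive category in about half of readings
rare = (5, 10, 10, 75)        # positive category in 15% of readings

for name, cells in [("balanced", balanced), ("rare", rare)]:
    a, b, c, d = cells
    agreement = (a + d) / sum(cells)
    print(f"{name}: agreement = {agreement:.0%}, kappa = {kappa_2x2(*cells):.2f}")
```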
The primary comparison data for each specimen type are listed in Table 1 for monolayer cytology, cervical
biopsies, and LEEP specimens, respectively. For each of the 3 data sets, the
shaded diagonal represents the proportion of concordant specimens. The boxed
data cells indicate the most discordant comparisons. There was only moderate
interobserver reproducibility, regardless of specimen type. The κ statistics
are compared in Table 2. The modest
increase in κ values based on weighting suggests that most disagreements
were relatively close. There was significant asymmetry in each class of comparison.
This suggested a systematic pattern of disagreement between the clinical center
and the QC pathologists, with the QC pathologists tending to give less severe
interpretations for all 3 types of specimens.
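The asymmetry comparison can be sketched as follows. This assumes that the "symmetry χ² statistic" refers to the standard Bowker test of symmetry and that the McNemar statistic is computed without a continuity correction; the text does not state either detail. The function names and counts are hypothetical and the counts are not ALTS data.

```python
import math
from typing import Sequence, Tuple


def mcnemar(b: int, c: int) -> Tuple[float, float]:
    """McNemar chi-square (1 df, no continuity correction) and its P value.

    b and c are the two discordant cells of a 2 x 2 table.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def bowker_symmetry(table: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Bowker symmetry chi-square and its degrees of freedom for a k x k table.

    Each off-diagonal pair (i, j), (j, i) contributes (n_ij - n_ji)^2 / (n_ij + n_ji);
    pairs with no observations are skipped and do not count toward the df.
    """
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    return stat, df


# HYPOTHETICAL counts. Rows: clinical center; columns: QC reviewer.
hypothetical = [
    [40, 8, 2, 0],
    [10, 20, 8, 2],
    [3, 9, 25, 5],
    [1, 2, 6, 15],
]
stat, df = bowker_symmetry(hypothetical)
print(f"Bowker symmetry chi-square = {stat:.2f} on {df} df")
print("McNemar (b=10, c=2): chi-square = %.2f, P = %.3f" % mcnemar(10, 2))
```

Converting the Bowker statistic to a P value requires the χ² distribution with the returned degrees of freedom; a statistics library can supply it (for example, `scipy.stats.chi2.sf(stat, df)`), which is left out here to keep the sketch dependency-free.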
Not surprisingly, the greatest source of disagreement in monolayer cytology
results involved ASCUS interpretations (Table 1). Of 1473 original interpretations of ASCUS, the QC reviewer
concurred in only 43.0%, rendering less severe readings for most of the rest.
Another significant source of variation was HSIL, for which concordance
was only 47.1%, with 27.0% and 22.6% of the remainder interpreted as LSIL
or ASCUS, respectively, by the QC reviewers.
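The category-specific concordance figures quoted above are row percentages: for each original clinical center interpretation, the share of cases that the QC reviewer placed in each category. A minimal sketch follows, using a hypothetical table (not the ALTS counts) and an illustrative function name.

```python
from typing import List, Sequence


def row_percentages(table: Sequence[Sequence[int]]) -> List[List[float]]:
    """Distribution of QC interpretations within each clinical center category.

    Rows are clinical center interpretations, columns are QC interpretations.
    The diagonal entry of each returned row is that category's concordance.
    """
    result = []
    for row in table:
        total = sum(row)
        result.append([count / total if total else 0.0 for count in row])
    return result


labels = ["Negative", "ASCUS", "LSIL", "HSIL"]

# HYPOTHETICAL counts for illustration only.
hypothetical = [
    [40, 8, 2, 0],
    [10, 20, 8, 2],
    [3, 9, 25, 5],
    [1, 2, 6, 15],
]

for label, shares in zip(labels, row_percentages(hypothetical)):
    formatted = ", ".join(f"{name} {share:.1%}" for name, share in zip(labels, shares))
    print(f"Center {label}: QC read as {formatted}")
```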
Histologic interpretative reproducibility on biopsies was no better
overall than cytologic reproducibility. However, histologic variability derived
largely from disagreements about grade CIN1 (including koilocytotic atypia).
An interpretation of CIN1 by the clinical center was corroborated by the QC
group in only 42.6% of 887 biopsies. Virtually an equal proportion of originally
diagnosed CIN1 biopsies (41.0%) were interpreted as negative by the pathology
An equivocal histologic interpretation (ie, a histologic equivalent
of ASCUS) was rarely used, although ALTS is a study in which originally equivocal
cytologic interpretations predominate. Most of these problematic cases were
due to sample limitations (eg, quality of staining, crush or thermal artifact)
that caused difficulty in making subtle distinctions between SIL and normal/reactive.
Clinical center pathologists rendered an equivocal histologic interpretation
in only 8.2% of 2237 biopsies; similarly, the QC pathologists used an equivocal
categorization for biopsy interpretation in only 3.5%. The extremes of interpretation,
ie, biopsies categorized as negative or high grade (≥CIN2), demonstrated
good concordance in 90.8% and 76.9% of cases diagnosed by the clinical centers, respectively.
Histologic reproducibility based on LEEP was not better than for other
specimen types. Of note, LEEP specimens were much more likely than either
cytology or biopsies to represent grades of CIN2 or higher. Despite smaller
and skewed numbers of specimens, it was observed that the interpretation of
CIN1 was still poorly reproduced, with only 43.8% of original interpretations
being corroborated by the QC group.
More condensed 2 × 2 diagnostic tables were composed using each
possible binary cutpoint (ie, negative vs ≥ASCUS, ≤ASCUS vs ≥LSIL,
and ≤LSIL vs ≥HSIL). The κ statistics for these are shown in Table 3. Cytologic interpretations equaling
or exceeding HSIL were uncommon enough to merit caution over comparisons of
this κ statistic to others in the table. Cytologic interpretations of
LSIL were more reproducible than histologic interpretations of comparable
severity. High-grade colpobiopsy and LEEP biopsy interpretations showed substantial reproducibility.
Subgroup analyses by individual QC pathologist or by individual clinical
center did not significantly alter any of the results or reveal any significant
trends over time.
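The condensed 2 × 2 tables used in these results can be derived from a 4 × 4 table by collapsing categories at each binary cutpoint. The sketch below assumes the category order negative, ASCUS, LSIL, HSIL; the function name and counts are hypothetical and the counts are not ALTS data.

```python
from typing import List, Sequence


def collapse(table: Sequence[Sequence[int]], cut: int) -> List[List[int]]:
    """Collapse a k x k table into 2 x 2 at a cutpoint.

    Categories with index < cut are "below" and index >= cut are "at or above".
    Returns [[both below, center below / QC above],
             [center above / QC below, both above]].
    """
    k = len(table)
    out = [[0, 0], [0, 0]]
    for i in range(k):
        for j in range(k):
            out[int(i >= cut)][int(j >= cut)] += table[i][j]
    return out


# HYPOTHETICAL counts. Rows: clinical center; columns: QC reviewer.
hypothetical = [
    [40, 8, 2, 0],
    [10, 20, 8, 2],
    [3, 9, 25, 5],
    [1, 2, 6, 15],
]

cutpoints = {
    1: "negative vs >=ASCUS",
    2: "<=ASCUS vs >=LSIL",
    3: "<=LSIL vs >=HSIL",
}
for cut, name in cutpoints.items():
    (a, b), (c, d) = collapse(hypothetical, cut)
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    kappa = (p_obs - p_exp) / (1.0 - p_exp)
    print(f"{name}: table = {[[a, b], [c, d]]}, kappa = {kappa:.2f}")
```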
Compared to prior studies, the data available from the ALTS trial are
significant for the size of the data set and for the ability to compare common
cytologic and histopathologic findings directly. This analysis shows that
the interobserver reproducibility of cytologic and histologic interpretations
is similar and only moderate. Finer distinctions may be difficult when broader
ones are only moderately successful. Nonetheless, an additional study is being
pursued to evaluate whether CIN2 can be reproducibly separated from CIN3 within
the high-grade group, particularly for histology, where the Bethesda SIL terminology
is less accepted. This issue is important for case definition in studies such
as the ALTS trial. Obviously, the distinction also has management implications
for the minority of clinicians who believe that CIN2 is a reproducible category,
not a true cancer precursor, and can be managed differently than CIN3.
For monolayer cytology, the greatest source of variability between the
clinical centers and the QC pathologists was the interpretation of ASCUS,
which the QC group interpreted as negative in 38.6% of cases. These cytologic
smears had a 37% HPV positivity rate, much like the HPV positivity rate of
31% among the concordant negative cases (data not shown). Thus, if HPV testing
is used as an independent adjudicator of this process, QC-revised interpretations
seem likely to be accurate. The other major source of interpretative variability
for cytology was the fraction of HSIL clinical center interpretations that
were reviewed as either LSIL or ASCUS by QC. These represent different problematic
sets of cases. Transitions of HSIL to LSIL undoubtedly reflect the difficulty
referenced in the literature of trying to separate mild from moderate dysplasia.
On the other hand, transitions from HSIL to ASCUS represent the current controversy
surrounding small atypical cells of immature metaplastic type and whether
these hard-to-interpret cells represent an entity distinct from HSIL.10,32
On biopsy, the κ values were remarkably similar in magnitude to
the cytology data. However, the overwhelming source of interpretative variability
was the marked tendency of the QC group to review clinical center CIN1 biopsy
interpretations as negative. This reflects problems in implementation of criteria
for recognizing HPV cytopathic effect in tissue. Although CIN1 is a frequently
overcalled interpretation in cervical pathology practice, most of the CIN1
cases reviewed by QC as negative were HPV DNA-positive on the correlated monolayer
cytology, suggesting that these disagreements may have been excessive (data
not shown). The data for LEEP were similar in this trend. Further direct HPV
testing of these samples after microdissection may help clarify the accuracy
of the revised interpretations, compared to correlations using HPV tests derived
from the temporally related monolayer cytology specimens.
Neither the clinical center nor the QC group was free from important
error. The final QC interpretations considered both clinical center and first
QC interpretations, as well as additional QC reviews in case of discrepancy.
Notably, using the final interpretation had no effect on the clinical
center agreement rates for LEEP, produced an intermediate improvement on
biopsy, and produced the most improvement on cytology (data not shown). However,
significant variability was still present and the reasons for these trends
are not known.
Two points mitigate the appearance of mediocre reproducibility. First,
it is possible that the ALTS population provided slightly heightened diagnostic
challenges, compared with a typical pathology case load. All of the women
were referred for a mild cytologic abnormality. Women with easily reproducible,
completely negative cytology results or with obviously high-grade results
were underrepresented. Second, the reproducibility of high-grade colpobiopsies
and LEEP biopsies was substantial, which has important treatment implications.
Histologic confirmation of low-grade lesions is more suspect, suggesting that
the management of many women is subject to chance. Interestingly, the cytologic
diagnosis of LSIL appeared to be more reproducible than the histologic diagnosis.
We speculate that these differences are based on the reliability of the criteria
applied to individual cells in excellent cytologic preparations compared with
the rigor with which these same criteria are applied in histologic sections.
Caveats aside, the data reported in this study probably underestimate
the level of variability between groups of pathologists nationally. Most of
the pathologists in the trial are academic gynecologic pathologists with a
research interest in cervical cancer precursor interpretation and management.
In contrast, in many clinical practices, Pap smears are often read in large
commercial laboratories whereas biopsies are read locally by community hospital
pathologists. Cytohistologic correlation opportunities are decreasing with
this unfortunate economic reorganization of cytopathology practice. This problem
can potentially be addressed in the future by using the ALTS data set to clarify
criteria, revise classification systems, and implement educational tools to
help improve interpretative reproducibility within the pathology community.
In future work, we will focus on whether ASCUS is a useful interpretative
entity, what constitutes a CIN1 histologic pattern, and clarification of the
variations of HSIL including atypical immature squamous metaplastic cells.
Finally, the need for reproducible interpretations is self-evident.
Beyond clinical needs, today's medicolegal environment requires adequate documentation
of what is and is not diagnostically possible. Unrealistic expectations of
accuracy, reproducibility, and truth determination fuel many malpractice
actions. It is possible to achieve moderate to substantial reproducibility
in a highly refined environment. However, substantial does not equal perfect,
and this should provide some basis for understanding and defense in cases
based on differences of expert opinion. Indeed, if experts were required to
present data on their personal levels of intraobserver and interobserver reproducibility
(ideally developed in a standardized objective manner with independent HPV
adjudication), then the testimony of many so-called experts would be easier
to evaluate. In this regard, the results of ALTS could stand as a benchmark
for the current state of realistic interpretive reproducibility.