Santucci M, Biggeri A, Feller AC, Burg G, for the European Organization for Research and Treatment of Cancer (EORTC) Cutaneous Lymphoma Project Group. Accuracy, Concordance, and Reproducibility of Histologic Diagnosis in Cutaneous T-Cell Lymphoma: An EORTC Cutaneous Lymphoma Project Group Study. Arch Dermatol. 2000;136(4):497-502. doi:10.1001/archderm.136.4.497
Objective
To assess the level of observer variability in the histologic identification of cutaneous T-cell lymphoma (CTCL) and its discrimination from diseases with similar histologic features.
Design
Cutaneous T-cell lymphoma specimens and randomly mixed controls were evaluated twice by 3 examiners.
Setting
The European Organization for Research and Treatment of Cancer (EORTC) Cutaneous Lymphoma Project Group.
Patients
The study was conducted with histologic specimens from 32 patients with mycosis fungoides (MF). In addition, 13 specimens of spongiotic, lichenoid, or psoriasiform simulators of MF were blindly and randomly mixed with the CTCL specimens as controls.
Main Outcome Measures
To evaluate the accuracy and concordance among and individual reproducibility of raters of histologic diagnoses.
Results
Overall, the concordance among raters was fair to moderate (range, 0.283-0.562; weighted overall κ, 0.412). Individual reproducibility of examiners ranged from moderate to almost perfect (range, 0.473-0.896; weighted overall κ, 0.709) and was not significantly different for the definite lymphoma (range, 0.551-0.921; overall κ, 0.802) and nonlymphoma (range, 0.368-0.950; overall κ, 0.793) categories. Accuracy was similarly variable among raters: sensitivity ranged from 49.3% to 78.1% (overall κ, 0.654), and specificity (control series) ranged from 46.2% to 69.2% (overall κ, 0.595). Adding the diagnoses of probable lymphoma to those of definite lymphoma, sensitivity ranged between 73.5% and 84.9%. Although for each examiner there was a trend toward a lower sensitivity in the detection of early lesions compared with later lesions, the difference in sensitivity between the 2 groups was not statistically significant.
Conclusions
The levels of concordance and reproducibility found in this investigation were similar to those obtained with comparable studies in the most varied fields of pathology, confirming that the identification of CTCL for our observers did not cause particular problems. Our findings also revealed that pitfalls in CTCL identification are not only limited to early lymphomatous lesions, as commonly postulated.
Researchers in many fields have become increasingly aware of the observer (rater or interviewer) as an important source of measurement error. Of all the sources of data that are analyzed in medicine, human observation is the least standardized. Although the observations made by clinicians, radiologists, or pathologists provide critical information for the diagnosis and treatment of sick people, the observers are seldom subjected to the type of scientific testing that is imposed on inanimate equipment. Only during the past few decades have reliability studies been conducted in experimental or survey situations to assess the level of observer variability in the measurement procedures used in data acquisition, namely when physicians inspect roentgenograms, perform physical examinations, take medical histories, or interpret cytologic and histologic specimens.1
The correct identification of cutaneous T-cell lymphomas (CTCLs) and their proper differentiation from both inflammatory dermatoses and reactive lymphoid hyperplasias often pose vexing challenges, especially in the initial phases of the lymphomatous process.2-6 Even advanced and experimental diagnostic techniques, such as immunophenotyping, quantitative DNA cytophotometry, and molecular genetic analysis, have proven unsuitable for solving the problem. Thus, light microscopy remains the criterion standard.7
The research reported herein was prompted by the discovery, during a European cooperative study on CTCL,8 that many discrepancies had emerged when a series of histologic specimens was classified by a group of dermatopathologists and histopathologists. These discrepancies had at least 2 potential sources: (1) Dermatopathologists and histopathologists may have used different histologic criteria. (2) Regardless of the particular criteria that were used, the different raters may have used the criteria inconsistently.
The object of this study was to explore these possible sources of disagreement. The divergence in criteria (interobserver variability) can be ascertained by having the same specimen interpreted by different observers (concordance), and the inconsistency of single observers (intraobserver variability) can be ascertained by having them interpret the same specimen repeatedly (reproducibility).
This study is unique in many respects. (1) The specimens (slides) used were from patients with complete follow-up data (from the onset of the disease until death), thus leaving no doubt as to the diagnosis. (2) Randomly mixed controls with adequate follow-up data (specimens of eczematous, spongiotic, or psoriasiform simulators of CTCL) were used in the study to exclude possible CTCL in the initial stages of the disease. (3) Interpretation of slides was performed by raters with different professional backgrounds (dermatopathology, hematopathology, or surgical pathology). (4) Raters were blinded (ie, provided with no clinical information or follow-up data) in order to test the actual reliability of histopathologic readings. (5) We assessed the accuracy, concordance, and reproducibility of histologic diagnoses. (6) Statistical analyses were used to correct for chance agreement.9
The specimens for this study were collected by the European Organization for Research and Treatment of Cancer (EORTC) Cutaneous Lymphoma Project Group. The referring physicians contributed specimens for which histologic material was available from the beginning of the disease. The specimens were taken from a series of 32 patients (total slides, 73; mean biopsy specimens per patient, 2.3; range, 1-3 biopsy specimens), all of whom had complete clinical information (ie, age of lesions, staging, treatment, and duration of the disease) and follow-up data. For all of these patients, the diagnosis of lymphoma was unequivocally established by clinical events, namely, later development of plaques, nodules, or tumors, and/or death caused by lymphoma. In addition, 13 specimens (slides) of spongiotic, lichenoid, or psoriasiform simulators of mycosis fungoides (MF) were blindly and randomly added to the CTCL specimens to serve as controls. This was done by a person who did not take part in the histologic evaluation.
Controls were obtained from the files of the Department of Dermatology of the University of Würzburg, Germany, and selected according to the following 3 criteria: (1) The histologic features of the specimens were highly suspicious for or indistinguishable from early MF. (2) The clinical differential diagnosis did not include MF in any of these cases. (3) Long-term follow-up documented the absence of progressive disease, including the development of lesions suspicious for MF either clinically or histologically. The final diagnoses of the control specimens were as follows: allergic contact dermatitis (4 cases), drug eruption (4 cases), lichen striatus (2 cases), erythema multiforme (2 cases), and psoriasis (1 case). All specimens were large wedge biopsy samples; the tissue fragments were fixed in buffered formalin, routinely processed, and stained with hematoxylin-eosin and Giemsa stain.
The test panel included 3 raters who were well trained in the histopathology of lymphoproliferative disorders; each rater had a different professional background (a dermatologist with expertise in dermatopathology with special reference to cutaneous lymphomas [G.B.], a pathologist with expertise in hematopathology [A.C.F.], and a surgical pathologist with experience in dermatopathology [M.S.]).
In order to assess the reliability and the consistency of the diagnostic criteria used by each investigator as well as the interobserver and intraobserver variability, all participants independently studied the specimens twice with an interval longer than 9 months between the 2 sessions. Additionally, none of the investigators knew the original diagnoses or the other investigators' findings or had access to clinical and follow-up data. For the first reading, the slides containing the specimens were randomly numbered from 1 to 86. For the second interpretation, the slides were renumbered with figures chosen from a table of randomly coupled numbers. The labeling of slides was performed by a person who did not take part in the histologic evaluation.
In order to determine the accuracy of the histologic diagnoses and taking into account the degree of variation linked to the presence of initial lymphomatous lesions in the present series, sensitivity and false-negative rates were calculated both for the whole series and separately for early and later lymphomatous lesions; specificity was evaluated for the control series. For this purpose, specimens representative of the initial phases of the lymphomatous process were those obtained from patients with stage IA disease (namely, limited plaques, papules, or eczematous patches covering less than 10% of the body surface) at least 5 years before any progression of the disease towards more advanced stages. Twenty-four specimens from 18 patients fulfilled these criteria. The remaining 49 specimens represented later stages of disease.
The 3 investigators did not meet individually or collectively to discuss the histopathologic criteria or definitions, nor did they meet to agree on any approach to histologic evaluation in preparation for the study. Each investigator reviewed the specimens with his own experience and understanding of CTCL, including criteria crucial for the identification of early lesions and their proper differentiation from both inflammatory dermatoses and reactive lymphoid hyperplasias.
Diagnoses were identified as follows:
Definite lymphoma: lymphoma without any doubt;
Probable lymphoma: the histologic features are consistent with CTCL, but a diagnosis of lymphoma cannot be confidently made;
Possible lymphoma: the histologic features are not consistent with CTCL, but a diagnosis of lymphoma cannot be confidently excluded; and
Nonlymphoma: nonlymphoma without any doubt.
The data obtained were statistically analyzed using the SPSS-X program (SPSS Inc, Chicago, Ill) for preparation of frequency tables and cross-tables.10
Interrater agreement was assessed by cross-tabulating the whole set of paired observations of the first reading into a symmetric square contingency table11 and producing separate tables for each single pair. The data layout for the analysis of intraobserver agreement consisted of contingency tables reporting the number of slides assigned by each rater to the different diagnostic categories on the first and second reading.
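The pooling step can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS-X procedure the authors ran: the function name `pooled_pair_table`, the category labels, and the toy ratings are all hypothetical. It assumes that "the whole set of paired observations" means every ordered pair of raters for every specimen, which is one reading consistent with the 86 × 3 × 2 count reported in the results.

```python
from collections import Counter
from itertools import permutations

# Hypothetical labels: the 4 diagnostic categories plus "impossible to evaluate".
CATEGORIES = ["definite", "probable", "possible", "nonlymphoma", "not_evaluable"]


def pooled_pair_table(ratings_by_rater, categories=CATEGORIES):
    """Cross-tabulate every ordered pair of raters into one square table.

    ratings_by_rater: one list of category labels per rater, all in the
    same specimen order. Counting both (A, B) and (B, A) makes the table
    symmetric about its diagonal.
    """
    n_specimens = len(ratings_by_rater[0])
    if any(len(r) != n_specimens for r in ratings_by_rater):
        raise ValueError("every rater must rate every specimen")
    counts = Counter()
    for first, second in permutations(ratings_by_rater, 2):
        counts.update(zip(first, second))
    # Counter returns 0 for category pairs that never occurred.
    return [[counts[(row, col)] for col in categories] for row in categories]


# Toy ratings invented for illustration: 3 raters, 4 specimens.
rater_a = ["definite", "definite", "nonlymphoma", "probable"]
rater_b = ["definite", "probable", "nonlymphoma", "probable"]
rater_c = ["definite", "definite", "possible", "nonlymphoma"]

table = pooled_pair_table([rater_a, rater_b, rater_c])
total_pairs = sum(map(sum, table))  # specimens x raters x (raters - 1)
```

Passing only two raters' lists to the same function gives the symmetric version of a single-pair table.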
For the present study, we used the κ statistic of Cohen.12 This measure incorporates a correction for chance and therefore indicates the degree of agreement over and above that which would be expected by chance alone. For example, κ values greater than 0.61 may be taken to represent substantial to perfect agreement beyond chance, values between 0.21 and 0.61 represent fair to moderate agreement, and values below 0.21 represent slight to poor agreement. Negative values denote less than chance agreement.13
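As a minimal sketch of the chance correction just described, the function below computes Cohen's κ from a square contingency table as (observed agreement − expected agreement) / (1 − expected agreement). The function names and the 2 × 2 example counts are hypothetical. The verbal labels use the 0.21 and 0.61 cut points (21% and 61%) quoted in the text, and how values exactly on a boundary are labeled is an assumption of this sketch.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater 1, cols: rater 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected if the two raters assigned categories independently.
    expected = sum(row_share[i] * col_share[i] for i in range(k))
    if expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


def describe(kappa):
    """Verbal label following the cut points quoted in the text."""
    if kappa < 0:
        return "less than chance"
    if kappa < 0.21:
        return "slight to poor"
    if kappa <= 0.61:
        return "fair to moderate"
    return "substantial to perfect"


# Invented counts: the raters agree on 35 of 50 slides (70%),
# but 50% agreement would be expected by chance alone.
example = [[20, 5],
           [10, 15]]
kappa = cohen_kappa(example)  # (0.70 - 0.50) / (1 - 0.50)
label = describe(kappa)
```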
Specific κ values were calculated for each category. The original table was collapsed into a 2 × 2 table according to presence or absence of the specific categories analyzed. Overall κ values and χ2 tests for homogeneity were used when appropriate. Ninety-five percent confidence intervals were computed according to the method of Fleiss.9
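The collapsing step can be illustrated as follows. This is a hedged sketch with hypothetical function names and invented counts, and it does not reproduce the Fleiss confidence intervals. The 2 × 2 κ uses the closed form 2(ad − bc) / [(a + b)(b + d) + (a + c)(c + d)], which is algebraically identical to the usual (observed − expected) / (1 − expected) definition.

```python
def collapse_to_2x2(table, index):
    """Collapse a square table to 'category present' vs 'category absent'."""
    n = sum(sum(row) for row in table)
    a = table[index][index]                    # both readings chose the category
    b = sum(table[index]) - a                  # only the row reading chose it
    c = sum(row[index] for row in table) - a   # only the column reading chose it
    d = n - a - b - c                          # neither reading chose it
    return [[a, b], [c, d]]


def kappa_2x2(two_by_two):
    """Cohen's kappa for a 2 x 2 table, in closed form."""
    (a, b), (c, d) = two_by_two
    denominator = (a + b) * (b + d) + (a + c) * (c + d)
    if denominator == 0:
        raise ValueError("kappa is undefined for this table")
    return 2 * (a * d - b * c) / denominator


# Invented 3-category table, for illustration only.
full_table = [[10, 2, 1],
              [3, 5, 2],
              [1, 2, 8]]

first_category = collapse_to_2x2(full_table, 0)
specific_kappa = kappa_2x2(first_category)
```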
In addition, a weighted κ value was calculated.14 The weighted κ value takes into account the degree to which disagreements concern neighboring categories. We used the Fleiss-Cohen weights. When raters felt that they were unable to assign the biopsy specimen to 1 of the 4 categories, it was categorized as impossible to evaluate. This category was assigned a weight of zero.
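A short sketch of a weighted κ with Fleiss-Cohen (quadratic) agreement weights, w_ij = 1 − (i − j)² / (k − 1)², may help make the idea of neighboring categories concrete. The function names and counts are hypothetical. The sketch covers only the ordered diagnostic categories; the zero weight given to the "impossible to evaluate" category is not reproduced here, because the text does not spell out exactly how it entered the weight matrix.

```python
def fleiss_cohen_weights(k):
    """Quadratic agreement weights: 1 on the diagonal, falling to 0 at the corners."""
    return [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]


def weighted_kappa(table, weights):
    """Weighted kappa for a square table of counts and a matching weight matrix."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    observed = sum(weights[i][j] * table[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_share[i] * col_share[j]
                   for i in range(k) for j in range(k))
    return (observed - expected) / (1 - expected)


# Invented table over 3 ordered categories, for illustration only.
counts = [[10, 2, 1],
          [3, 5, 2],
          [1, 2, 8]]

identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
unweighted = weighted_kappa(counts, identity)                # ordinary Cohen's kappa
weighted = weighted_kappa(counts, fleiss_cohen_weights(3))   # partial credit for near misses
```

In this toy table most disagreements fall in adjacent categories, so the weighted value is higher than the unweighted one, the same direction of change the study reports when penalty weights are introduced.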
In the intraobserver analysis, we evaluated both the crude agreement and the specific agreement on particular diagnoses (ie, the conditional probability of a specimen being reassigned to the same category, given that it had been assigned to that category once).
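The two intraobserver measures can be sketched as below. Crude agreement is the share of slides given the same diagnosis on both readings. For specific agreement, this sketch assumes the standard index 2·n_kk / (n_k· + n_·k), which matches the conditional-probability description in the text but is not confirmed by the source as the exact formula used. Function names and counts are hypothetical.

```python
def crude_agreement(table):
    """Share of specimens assigned to the same category on both readings."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def specific_agreement(table, index):
    """Probability that a reading in this category is matched by the other reading.

    Uses 2 * n_kk / (row total + column total), treating the first and
    second readings symmetrically.
    """
    row_total = sum(table[index])
    col_total = sum(row[index] for row in table)
    if row_total + col_total == 0:
        return float("nan")  # category never used on either reading
    return 2 * table[index][index] / (row_total + col_total)


# Invented first-reading (rows) by second-reading (columns) table for one rater.
readings = [[10, 2, 1],
            [3, 5, 2],
            [1, 2, 8]]

crude = crude_agreement(readings)
specific_first = specific_agreement(readings, 0)
```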
To assess accuracy, we tabulated specimens by true status and rater response separately for each rater. Sensitivity and false-negative rates were computed for both the whole series and early and later CTCL lesions. Specificity was analogously calculated for the control series. Ninety-five percent confidence intervals were computed from the binomial variance. Overall values were obtained as precision-weighted averages, and χ2 homogeneity test results were calculated.
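The accuracy calculations can be sketched under stated assumptions: a normal-approximation (Wald) interval built from the binomial variance p(1 − p)/n, an inverse-variance ("precision") weighted average across raters, and a homogeneity χ² computed as the weighted sum of squared deviations from that average, with one degree of freedom fewer than the number of raters. The source does not give its formulas in this detail, so these are common textbook forms rather than the authors' confirmed method, and the counts are invented.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def proportion_with_ci(successes, n, z=Z_95):
    """Proportion with a Wald interval based on the binomial variance."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def pool_across_raters(counts):
    """Precision-weighted average and homogeneity chi-square.

    counts: one (successes, n) tuple per rater.
    Returns (pooled proportion, chi-square statistic, degrees of freedom).
    """
    proportions, weights = [], []
    for successes, n in counts:
        p = successes / n
        variance = p * (1 - p) / n
        if variance == 0:
            raise ValueError("a proportion of 0 or 1 has zero binomial variance")
        proportions.append(p)
        weights.append(1 / variance)
    pooled = sum(w * p for w, p in zip(weights, proportions)) / sum(weights)
    chi_square = sum(w * (p - pooled) ** 2 for w, p in zip(weights, proportions))
    return pooled, chi_square, len(counts) - 1


# Invented counts: lymphoma slides called "definite lymphoma" by each of 3 raters.
hypothetical = [(40, 80), (60, 80), (50, 80)]

sensitivity, lower, upper = proportion_with_ci(*hypothetical[0])
false_negative_rate = 1 - sensitivity
pooled, chi_square, dof = pool_across_raters(hypothetical)
```

Specificity follows the same pattern, with control slides correctly called nonlymphoma as the successes.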
For each specimen, only the first reading was used. Three raters reading 86 specimens produced a total of 258 results (Table 1). In 7 instances, the raters were unable to assign the biopsy specimens to 1 of the 4 categories; these 7 readings were categorized as impossible to evaluate. The distribution of the possible 516 (86 × 3 × 2) paired ratings is presented in Table 1. The last column of Table 1 gives the κ values and the 95% confidence intervals for each category and for the whole series. The multirater κ value was 0.284, suggesting that there was a fair level of agreement among the 3 raters. Assigning different penalty weights to different degrees of disagreement resulted in a higher value for agreement among raters (weighted κ, 0.412). The degree of agreement varied widely among categories; it was moderate (κ, 0.500) for the definite lymphoma category, fair (κ, 0.273) for the nonlymphoma category, and slight to poor for the other 3 categories.
Both unweighted and weighted κ values are presented for each single pair of raters in Table 2. The category-specific κ values for the diagnosis of definite lymphoma and nonlymphoma for each single pair of raters are presented in Table 3. The results of χ2 analyses revealed the absence of significant heterogeneity among raters.
Table 4 shows the crude and specific agreement and κ values for the individual raters between the first and second readings. Overall κ and weighted κ values ranged from 0.391 and 0.473 to 0.797 and 0.896, respectively. The results of χ2 analyses revealed highly significant heterogeneity among the 3 raters.
Lesion-specific sensitivity and false-negative rates estimated for each observer are reported in Table 5. For the control series, specificity was calculated in the same way as sensitivity. For each specimen, only the first reading was used. The results of χ2 analyses demonstrated highly significant heterogeneity among the 3 raters in identifying lymphoma cases; conversely, the differences observed in identifying controls were not statistically significant.
All anatomical pathology diagnoses are formed by value judgments that result from conscious interpretation of histologic imagery. Because these value judgments are ultimately subjective, it is no surprise that interpretative variability exists, even among experienced pathologists.
Previous studies in other fields (eg, lung cancer,15 proliferative breast lesions,16,17 cervical intraepithelial neoplasia,11,18 cutaneous pigmented lesions19) have suggested that the ability of histopathologists to identify or subclassify certain lesions consistently and reproducibly has become a matter of legitimate concern. This was especially true when participants were asked to use the diagnostic criteria they employed in their daily practices and no attempt was made to standardize the diagnostic criteria among the participants before evaluations of the study cases.15,16,18,19 Conversely, there was generally less interobserver variation in the categorization when raters agreed to use the same diagnostic criteria and all study participants were provided with educational materials to maximize the likelihood that each had a similar level of understanding of these criteria.17
However, the fact that cutaneous biopsy specimens are so widely used to make diagnoses of CTCL was one of the reasons we deliberately avoided the definitions, agreements, or discussions of histologic criteria given in the literature.2-8,20-26 This enabled us to have some index of observer reproducibility, to have an understanding of each rater's concept of CTCL, and to determine the reliability of histologic criteria that, at the beginning of the study, were thought by all participants to be well understood and not in need of strict definition.
After the completion of the study, however, the 3 raters collectively met and discussed the criteria used by each of them for this investigation. Surprisingly, despite the differences observed in both sensitivity and specificity, the criteria used by the panelists were found to be almost identical; ie, they used those criteria already reported in detail in a previous study by the EORTC Cutaneous Lymphoma Project Group.8 Therefore, the differences observed were possibly a result of the different tuning or weighting of these criteria more than the use of personal criteria. In particular, the 3 raters unanimously agreed that the crucial features to establish a diagnosis of definite lymphoma were the presence of cells that were considered neoplastic and a disproportionate epidermotropism, while a diagnosis of nonlymphoma was confidently made only when the typical constellation of cytologic and architectural criteria2-8,20-26 considered indicative or suspicious for lymphoma was absent.
The difficulties in diagnosing CTCL are well known.2-8 Accurate diagnosis has as much to do with years of experience as it does with strict histologic criteria that can be applied by less-seasoned pathologists. Diagnostic accuracy and the reliability of conventional histopathologic features of CTCL, especially in the initial stages of the disease, have been reported to be extremely limited, even in the hands of experienced and well-trained pathologists.8,27,28 In addition, investigations have documented that major interrater variability and intrarater variability among pathologists and dermatopathologists were common when evaluating skin biopsy specimens for the diagnosis of CTCL.8,28 However, the real extent of the problem is not presently known, since studies dealing with the histologic diagnosis of CTCL have almost always been biased by several major problems. First, the real nature of the diseases featured in the slides was not determined: study cases did not generally have long-term follow-up data documenting the progression of the diseases or death of the patient caused by the diseases, leaving doubt as to the neoplastic nature of the lymphoproliferative disorder. Second, there was an absence of proper controls. Third, there were no clear statements of the protocol of the studies (ie, whether diagnoses were made with or without knowledge of clinical data). Fourth, there were inadequate statistical evaluations of the results obtained. Our investigation was designed to take these problems into account and minimize their impact on results.
We found levels of concordance and reproducibility among examiners that significantly exceeded those expected by chance. The levels of concordance and reproducibility found in this investigation are similar to those obtained with comparable studies in the most varied fields of pathology,11,19,28-35 indicating that a certain degree of observer-linked variability is a common phenomenon inherent in human judgment and that CTCL does not pose the particular diagnostic problems that are commonly postulated.
In the present study, histologic diagnoses were made without the raters being given any clinical information. This may have negatively affected the diagnostic accuracy (Table 5). In fact, in other fields of pathology, when clinical information is provided, diagnostic accuracy increases,36 and errors in the reporting of biopsy findings are reduced to an acceptable minimum.37 This assumption is strengthened by the relatively high numbers of diagnoses of probable lymphoma (data not shown) that, if added to the diagnoses of definite lymphoma, would significantly raise sensitivity (79.6% for rater A, 73.5% for rater B, 84.9% for rater C). However, we chose a blinded design because the interpretation of clinical information would have been yet another source of variation among the raters and might have biased the measurement of the reliability and reproducibility of the histologic diagnoses, which was the primary objective of our investigation. This objective is of particular importance for examining relationships between the histopathologic and clinical findings for CTCL. In fact, these data are important to the clinician treating the individual patient.
The level at which the histopathologic categories can be defined also affects reliability. We postulate that as the inconsistency of lesions increases, the accuracy of diagnoses will decrease. In fact, many authorities have stated that a specific diagnosis of CTCL cannot be made in the initial lymphomatous stages and more reliance has to be placed on the clinical picture than on the histologic features.
Our results did not confirm this. In fact, although for each examiner there was a trend toward a lower sensitivity in the detection of early lesions compared with later ones, the difference in sensitivity between the 2 groups was not statistically significant. This may be owing to the relatively small sample size of the early group, and further studies may help to determine if this difference is a significant one.
Our results stress the need for clearer definition and standardization of the histologic features of CTCL to improve the reliability of histologic diagnosis. This is crucial for the accurate identification of CTCL and its distinction from diseases with similar histologic features.
Accepted for publication August 18, 1999.
The material for this study was collected by the European Organization for Research and Treatment of Cancer (EORTC) Cutaneous Lymphoma Project Group (Chairman: Günter Burg, MD) for the Symposia on the Histopathology of Early Mycosis Fungoides Lesions held in Ghent, Belgium, May 5-7, 1989, and in Würzburg, Germany, September 14-16, 1990.
The authors are indebted to the following colleagues who secured biopsy material with the pertinent clinical information and follow-up data used for the present investigation: M. Aelbrecht, MD, Ghent, Belgium; M. F. Avril, MD, Villejuif, France; E. Berti, MD, Milan, Italy; N. Bourgeois, MD, Antwerp, Belgium; G. Burg, MD, Würzburg, Germany; M. M. Delaunay, MD, Bordeaux, France; C. De Wolf-Peeters, MD, Leuven, Belgium; T. Estrach, MD, Barcelona, Spain; M. L. Geerts, MD, Ghent, Belgium; H. Kerl, MD, Graz, Austria; I. Koller, MD, Salzburg, Austria; K. Meissner, MD, Hamburg, Germany; C. Neumann, MD, Hannover, Germany; M. Nilles, MD, Giessen, Germany; E. Ralfkiaer, MD, Copenhagen, Denmark; N. Sepp, MD, Innsbruck, Austria; J. Wechsler, MD, Créteil, France. The authors pay particular tribute to Susanne Ziffer, MD, for selecting and randomizing the control slides for this study.
This work was done in part in the Departments of Dermatology and Pathology of the University of Würzburg School of Medicine, Würzburg, Germany; in the Department of Dermatology of the University of Zürich School of Medicine, Zürich, Switzerland; and the Institute of Anatomic Pathology of the University of Florence Medical School, Florence, Italy.
Corresponding author: Marco Santucci, MD, Dipartimento di Patologia Umana ed Oncologia, Università degli Studi di Firenze, Viale G. B. Morgagni 85, I-50134 Firenze, Italia (e-mail: Marco.Santucci@UNIFI.IT).