Figure 1. Confocal images from the test set (0.5 × 0.5 mm). Level: superficial layer. A, Honeycombed pattern in a nevus. The honeycombed architecture is clearly distinguishable throughout the whole image. This image achieved 100% agreement. B, Pagetoid cells in a melanoma. Pleomorphic large cells with bright cytoplasm and dark nucleus (white circles) are present within a disarranged epidermal architecture. This image also achieved 100% agreement (original magnification ×30; numerical aperture, 0.9).
Figure 2. Confocal images from the test set (0.5 × 0.5 mm). Level: dermal-epidermal junction. A, Mild atypical cells. Three round to oval cells with bright cytoplasm and a dark nucleus (arrowheads) can be identified. In this case, the observers showed disagreement for the presence of mild atypia. B, Marked atypia. Numerous and pleomorphic bright nucleated cells are clearly present in this picture within the basal layer. The presence of inflammatory plump bright cells within dermal papillae (asterisk) seemed not to be a confounding factor for this parameter. This image achieved 100% agreement (original magnification ×30; numerical aperture, 0.9).
Pellacani G, Vinceti M, Bassoli S, Braun R, Gonzalez S, Guitera P, Longo C, Marghoob AA, Menzies SW, Puig S, Scope A, Seidenari S, Malvehy J. Reflectance Confocal Microscopy and Features of Melanocytic Lesions: An Internet-Based Study of the Reproducibility of Terminology. Arch Dermatol. 2009;145(10):1137-1143. doi:10.1001/archdermatol.2009.228
Copyright 2009 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
To test the interobserver and intraobserver reproducibility of the standard terminology for description and diagnosis of melanocytic lesions in in vivo confocal microscopy.
A dedicated Web platform was developed to train the participants and to allow independent distant evaluations of confocal images via the Internet.
Department of Dermatology, University of Modena and Reggio Emilia, Modena, Italy.
The study population was composed of 15 melanomas, 30 nevi, and 5 Spitz/Reed nevi. Six expert centers were invited to participate in the study.
Evaluation of 36 features in 345 confocal microscopic images from melanocytic lesions.
Main Outcome Measure
Interobserver and intraobserver agreement, assessed by calculating the Cohen κ statistic for each descriptor.
High overall levels of reproducibility were shown for most of the evaluated features. In both the training and test sets there was a parallel trend of decreasing κ values as deeper anatomic skin levels were evaluated. All of the features, except 1, used for melanoma diagnosis, including roundish pagetoid cells, nonedged papillae, atypical cells in basal layer, cerebriform clusters, and nucleated cells infiltrating dermal papillae, showed high overall levels of reproducibility. However, less-than-ideal reproducibility was obtained for some descriptors, such as grainy appearance of the epidermis, junctional thickening, mild atypia in basal layer, plump bright cells, small bright cells, and reticulated fibers in the dermis.
The standard consensus confocal terminology useful for the evaluation of melanocytic lesions was reproducibly recognized by independent observers.
In vivo reflectance mode confocal microscopy (RCM) is an emerging technique that allows for noninvasive imaging of the epidermis and the upper dermis at near-histologic resolution.1 The clinical application of RCM in skin oncology seems to be very promising, particularly for evaluation of melanocytic lesions, owing to the high contrast provided by hyperreflective melanin and melanosomes.2 Since 2001, RCM features that are useful for the distinction between melanocytic nevi and melanomas (MMs) have been identified and have been correlated with histopathologic features.3-7 A diagnostic algorithm was recently proposed and blindly tested on a large population of equivocal melanocytic lesions, resulting in high diagnostic accuracy.8,9
Reflectance mode confocal microscopy has the potential to improve diagnostic confidence and diagnostic accuracy.9-11 Similar to other branches of medicine that require visual analysis of images, RCM image interpretation remains subjective and operator dependent. However, despite this subjectivity, a high degree of agreement among different observers is achievable. Toward this end, it is paramount to first establish a glossary of terms for RCM evaluation of melanocytic lesions; a consensus terminology along with illustrative images was recently published by experts in the field.12 The next step requires validation of this new terminology. We tested the interobserver and intraobserver reproducibility of the standard terms used for description and diagnosis of melanocytic lesions. For this purpose, we developed a dedicated Web platform that allowed for training of study participants with an online RCM tutorial and for testing the reproducibility of their RCM image evaluations by interobserver and intraobserver agreement.
Based on a critical review of the scientific literature, a list of 36 RCM descriptive terms was previously agreed on by expert consensus.12 A Web-based tutorial was provided to the participants, and it contained the definitions of the RCM descriptive terms and illustrative images for each term. The tutorial and the whole data set are available online.13 The study was entirely conducted online, using the dedicated Web site. In Italy, ethics committee review of a deidentified database is not required.
The tutorial and training sets were based on individual images that were previously used in a published study by Pellacani et al.8 The training set consisted of 45 images that were acquired from benign and malignant melanocytic lesions at 3 different depths based on the anatomic levels of the skin (15 images from superficial epidermis, 15 from the dermal-epidermal junction, and 15 from the papillary dermis). No mosaics were evaluated.
The test set consisted of 270 images, corresponding to 90 images per skin layer. It included images from 50 new cases of melanocytic lesions that were not used in previous studies, corresponding to 15 consecutive MMs, 5 consecutive Spitz nevi, and a random set of 30 melanocytic nevi, acquired from March 1, 2006, to June 30, 2006, at the Department of Dermatology at the University of Modena and Reggio Emilia, Modena, Italy. All lesions were diagnostically challenging on clinical and/or dermoscopic examination and were excised to rule out MM. All RCM imaging was acquired according to a standardized protocol previously published.8 On average, 250 RCM images were acquired per lesion, and all images were classified according to the anatomic depth or level into superficial epidermis, dermal-epidermal junction (including the basal cell layer of the epidermis), or papillary dermis. In addition, all images were also classified according to image quality. The test set consisted of randomly selected images of good quality, and for each lesion at least 1 image was selected from the superficial epidermis level, 1 from the dermal-epidermal junction, and 1 from the papillary dermis. An image was judged to have good quality based on a lack of artifacts and on a good overall contrast (excluding too bright or too dim images derived from acquisition mistakes). Finally, from each lesion, no more than 3 images per anatomical layer were selected and included in the final test set cases. Overall, each participant performed 300 test evaluations. To test for intraobserver agreement, 30 randomly selected images were duplicated within the test set. The order of images in the test set was randomized. Overall, more than 20 000 observations were considered for statistical analysis, enough to acquire a satisfactory statistical power.
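The assembly of the presentation sequence described above (270 images, 30 of them duplicated at random, shown in randomized order) can be illustrated with a minimal sketch. The image identifiers, the function name `build_test_sequence`, and the fixed seed are hypothetical and are not taken from the study platform; the sketch only reproduces the counts reported in the text.

```python
import random

def build_test_sequence(image_ids, n_duplicates=30, seed=0):
    """Return a shuffled presentation order in which n_duplicates images appear twice."""
    rng = random.Random(seed)  # fixed seed only so the illustration is repeatable
    repeated = rng.sample(image_ids, n_duplicates)  # images shown a second time
    sequence = list(image_ids) + repeated
    rng.shuffle(sequence)
    return sequence, set(repeated)

# Hypothetical identifiers: 90 images for each of the 3 skin levels.
levels = ("superficial", "junction", "dermis")
image_ids = [f"{level}_{i:02d}" for level in levels for i in range(90)]

sequence, repeated = build_test_sequence(image_ids)
print(len(image_ids), len(sequence), len(repeated))  # 270 300 30
```

The 300 entries in the sequence correspond to the 300 test evaluations performed by each participant.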
Study participants were invited to view the tutorial. Six centers, of which 4 were in Europe (Madrid and Barcelona, Spain; Modena, Italy; and Zurich, Switzerland), 1 in the United States (New York, New York), and 1 in Australia (Sydney), with renowned expertise in the use of RCM for diagnosing melanocytic lesions, were involved. Subsequently, the training set of 45 images was used to allow participants to test and refine their skills of applying the descriptive terms to the RCM images. Each participant filled out an online electronic response form consisting of questions and accompanying choices for each image. The form included questions regarding all descriptive terms used in the study stratified by the anatomic skin level. Participants were required to first submit their evaluation of each image, after which they were able to compare their evaluation with that proposed in consensus by 2 of us (G.P. and J.M.). This real-time feedback was designed to assist the participants in applying the criteria to the RCM images. After the completion of the training set evaluations, the investigators were asked to access the test set. The same electronic data sheets were used, this time without feedback. Participants could evaluate only 1 image at a time, and each case had to be submitted before participants could access the next image. Although images from the tutorial and training sets were accessible at all times, participants could not go back to previously analyzed images in the test set. Each participant was given 4 months (January 1–April 30, 2007) to complete the evaluation of the images.
Data analysis was carried out using the SPSS statistical package (release 10.0.06; SPSS Inc, Chicago, Illinois) and Stata statistical software (release 10.0; StataCorp LP, College Station, Texas). We evaluated interobserver and intraobserver agreement by calculating the Cohen κ statistic for each descriptor, considering P < .01 as significant.
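For readers unfamiliar with the statistic, the Cohen κ compares the observed proportion of agreement (p_o) with the agreement expected by chance from each rater's marginal frequencies (p_e), as κ = (p_o − p_e)/(1 − p_e). The following is a minimal sketch for 2 raters, not the SPSS or Stata procedure used in the study; the function name and the example ratings are hypothetical, and the text does not specify how ratings from 6 observers were combined.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of images rated identically by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement computed from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings (1 = feature present, 0 = absent) for 10 images.
first = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
second = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohen_kappa(first, second), 2))  # 0.6
```

In this example the raters agree on 8 of 10 images (p_o = 0.8) and chance agreement is 0.5, giving κ = 0.6.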
The interobserver reproducibility on the training set was assessed for each study participant and for each RCM feature by calculating the percentage of ratings concordant with the proposed ratings, which were considered the gold standard. The overall interrater reproducibility was determined on the test set by calculating the proportion of concordant outcomes, considering reproducibility to be in full agreement if evaluation of an RCM feature in an image was concordant for all 6 participants and in good agreement if 5 of the 6 participants were concordant. Disagreement was defined as concordance among fewer than 5 of the 6 participants. For each feature, the frequency of positive observations, corresponding to the percentage of observations reporting the feature as present, was calculated. For each RCM feature across the entire test set, we considered a feature to have a high overall level of reproducibility if more than 70% of the images showing this RCM feature resulted in good or full agreement (ie, disagreement in fewer than 30% of images). The intraobserver reproducibility was assessed in a subset of 30 images, randomly selected from the test set, and submitted twice to each study participant.
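The agreement categories defined above can be expressed as a short sketch: concordance among all 6 participants is full agreement, among 5 of 6 is good agreement, and anything less is disagreement. The function names and example ratings are hypothetical, and the sketch assumes the caller supplies only the images counted as showing the feature, a selection the text does not define in detail.

```python
from collections import Counter

def agreement_level(ratings):
    """Classify the ratings that all observers gave one image for one feature."""
    largest_group = max(Counter(ratings).values())
    if largest_group == len(ratings):
        return "full"
    if largest_group == len(ratings) - 1:
        return "good"
    return "disagreement"

def is_highly_reproducible(ratings_per_image, threshold=0.70):
    """True if more than `threshold` of the images reach good or full agreement."""
    concordant = sum(agreement_level(r) != "disagreement" for r in ratings_per_image)
    return concordant / len(ratings_per_image) > threshold

# Hypothetical ratings from 6 observers (1 = present, 0 = absent) for 4 images.
example = [
    [1, 1, 1, 1, 1, 1],  # full agreement
    [1, 1, 1, 1, 1, 0],  # good agreement (5 of 6)
    [1, 1, 1, 0, 0, 1],  # disagreement (4 of 6)
    [0, 0, 0, 0, 0, 0],  # full agreement
]
print([agreement_level(r) for r in example])
print(is_highly_reproducible(example))  # 3 of 4 images concordant
```

Here 3 of the 4 images (75%) reach at least good agreement, so the hypothetical feature would exceed the 70% threshold.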
To overcome the intrinsic limitations of the κ statistics interpretation in evaluating reproducibility on observations lacking a gold standard and showing an imbalance of positive vs negative responses, we decided to consider also the proportion of concordant evaluations. Thus, the interpretation of the results should take into account both the proportion of concordant evaluations (full agreement and good agreement) and the κ value. To avoid misinterpretation, only significant κ values were considered, and a very low P value (P < .01) was considered significant in order to obtain more consistent and stable results.
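The effect of an imbalance between positive and negative responses on κ can be illustrated with a minimal sketch. Both hypothetical 2 × 2 tables below have the same proportion of concordant evaluations (90%), but the κ value is much lower when the feature is rarely reported. The function name and all counts are illustrative assumptions, not study data.

```python
def kappa_from_2x2(both_present, only_a, only_b, both_absent):
    """Cohen's kappa from the 4 cells of a 2-rater present/absent table."""
    n = both_present + only_a + only_b + both_absent
    p_o = (both_present + both_absent) / n  # proportion of concordant ratings
    a_present = both_present + only_a       # images rater A marked present
    b_present = both_present + only_b       # images rater B marked present
    # Chance agreement on "present" plus chance agreement on "absent".
    p_e = (a_present * b_present + (n - a_present) * (n - b_present)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical tables of 100 images each, both with 90% concordant ratings.
balanced = kappa_from_2x2(45, 5, 5, 45)    # feature reported in about half of images
imbalanced = kappa_from_2x2(2, 5, 5, 88)   # feature reported in about 7% of images
print(round(balanced, 2), round(imbalanced, 2))  # 0.8 0.23
```

This is why the proportion of concordant evaluations and the κ value are best read together, particularly for rare features.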
Training set results are summarized in Table 1. A high proportion of evaluations were concordant with the proposed ones for each parameter, ranging from 68% for isolated cells in the papillary dermis to 99% for a honeycombed epidermal pattern. Although not significant (P = .17), reproducibility of RCM features showed a decreasing trend from the superficial layers to the dermis; overall, concordance in RCM feature recognition was 86% for the features observable in the superficial epidermis, 83% for terms related to the dermal-epidermal junction, and 81% for dermal aspects.
Test set results are reported in Table 2. In the superficial epidermis, high overall levels of reproducibility (>70% of the images showing the feature achieved agreement by at least 5 participants) were obtained for honeycombed and cobblestone patterns and their variants (broadened honeycombed and cobblestone with small nucleated cells) (Figure 1A). Complete disarrangement or uneven pattern of the superficial epidermis (ie, atypical honeycombed and atypical cobblestone patterns) resulted in high overall levels of reproducibility, with good to full agreement seen in over 74% of images. A grainy pattern was not reproducibly recognized, resulting in disagreement in almost half of images showing this feature. For pagetoid cells, high overall levels of reproducibility were obtained both for the recognition of the presence of pagetoid cells and for their morphologic description (Figure 1B), obtaining good agreement in more than 80% of images, with the exception of the descriptors “small cells” and “cell density.”
Evaluating the dermal-epidermal junction (including the basal cell layer of the epidermis), edged and nonedged papillae, junctional clusters, marked atypia, and sheetlike structures showed high overall levels of reproducibility, whereas the distinction of junctional thickening, typical cells, or mild atypia resulted in disagreement in approximately half of the images (Figure 2).
Within the papillary dermis, identification of cell clusters and their distinction into different subtypes showed high overall levels of reproducibility, with the exception of dishomogeneous clusters, which showed disagreement in approximately 35% of cases. Although nucleated cells within the papillary dermis were reproducibly recognized, identification of plump bright cells or small bright cells and particles resulted in disagreement in almost 45% of image evaluations. For collagen structures, identification of collagen bundles showed high overall levels of reproducibility, whereas identification of reticulated fibers was not reproducible.
Table 3 lists the κ values for the intraobserver reproducibility. With the exception of a grainy pattern in superficial epidermis, mild atypia, sheetlike structures at dermal-epidermal junction, dishomogeneous and sparse clusters, and small bright cells and particles in the dermis, all the other 28 parameters showed good intraobserver reproducibility, with κ values greater than 0.4. In 11 features, the κ value was at least 0.7. Two parameters (collagen bundles and cerebriform clusters) were not eligible for intraobserver evaluation owing to their scarcity in the set of 30 reevaluated images.
In vivo RCM is a noninvasive, morphologic characteristic–based technique that may have application in the field of dermatological oncology. The use of this technique resulted in high diagnostic accuracy rates for MM7,9 and basal cell carcinoma.14 In a recent publication,15 a blinded evaluation of a series of more than 300 lesions, including 136 MMs, showed that the diagnosis based on RCM was much more specific than dermoscopic diagnosis, whereas sensitivity values were comparable between the 2 modalities. These findings suggest that the use of RCM may be a useful adjunct to clinical and dermoscopic examination. Although an overall good agreement was reported in the previous studies,9,15 both studies were conducted in close collaboration by 2 medical centers, and each evaluation concerned the whole lesion in context, introducing a possible bias in image interpretation by adjusting the interpretations of individual features in accordance with the overall diagnostic impression. In the study described herein, we focused on the reproducibility of RCM terminology, evaluating the possibility of recognizing features within single confocal images out of the lesion context, including evaluators from independent medical centers. Although physicians are becoming increasingly aware of the value of RCM, with more and more groups engaging in RCM research, interpretation of RCM features has lacked standardization and has relied mostly on personal expertise. For this reason, we previously proposed a standard RCM terminology for melanocytic lesions that was based on a consensus among experts.12 In the present study, we evaluated the interobserver and intraobserver reproducibility in recognition of these standard RCM parameters. The fact that RCM analysis is image-based enabled us to design the study as a Web-based platform, offering remote tutorials for participants, followed by blinded evaluation of a large test set of images.
The intraobserver agreement was assessed in 30 images and showed good κ values for all assessed parameters. However, cerebriform clusters and collagen bundles were not assessed because there were not enough cases present in the test set to allow for intraobserver statistical analysis.
On the training set of images, all study participants showed consistency in RCM feature recognition. Preselection of a highly illustrative set of images may account for the overall higher agreement on feature recognition compared with the subsequent test set analysis. However, in both the training and test sets there was a parallel trend of decreasing κ values as deeper anatomic skin levels were evaluated. This may be partially explained by decreasing image resolution with increasing imaging depth.
Notably, pagetoid infiltration (Figure 1B) showed high overall levels of reproducibility; over 80% of the images that included pagetoid infiltration resulted in agreement among at least 5 of the 6 observers. The good agreement was maintained in morphologic characterization of the pagetoid cells (eg, cell size and shape). Because pagetoid infiltration was previously reported to be one of the most relevant RCM features for MM diagnosis,9 these data further support the inclusion of this parameter in an RCM-based diagnostic algorithm. Distinction of typical and mildly atypical cells at the dermal-epidermal junction was difficult, showing good agreement in a little more than 50% of evaluated images and suggesting a deficiency in the definition or intrinsic limitations in the differentiation of both cellular morphologic characteristics in the dermal-epidermal junction; however, marked atypia, a specific feature for MM diagnosis,9 was strongly reproducible. The greater variability in the background architecture of the dermal-epidermal junction, comprising edged papillae, nonedged papillae, and/or junctional clusters or thickenings, could be responsible for the greater difficulty in consistently assessing cellular morphologic characteristics in the absence of striking atypia (Figure 2). In contrast, pagetoid cells usually stand out in sharp contrast with the more homogeneous background arrangement of the keratinocytes in the superficial epidermis (ie, small polygonal hyporeflective structures in a honeycombed pattern or round bright structures with a cobblestone architecture). Single, large, nucleated bright cells, corresponding to atypical melanocytes, were recognizable with good agreement within dermal papillae, probably due again to the sharp contrast between the bright cells and the usually dark homogeneous background of the dermis.
Cellular aggregate recognition was reproducible, particularly for large, roundish to oval structures forming junctional or dermal clusters. Analyzing different cluster subtypes, a good agreement was maintained in all 3 subtypes (dense, sparse-cell, and cerebriform). The differences observed in κ values were influenced by the prevalence of the feature in the image set, which was high for dense clusters and rare for sparse-cell and cerebriform clusters. Junctional thickenings were not reproducibly recognized. Similarly, evaluating dermal-epidermal architecture, edged papillae, which are rings of bright cells (mostly pigmented keratinocytes at the basal layer) surrounding dermal papillae, were highly reproducible, whereas nonedged papillae, lacking a clear rim of bright polygonal cells, obtained a good, but lower, agreement. Both small bright and plump bright cells, which correspond to an inflammatory infiltrate, and collagen structures obtained a good agreement in just over 50% of observed images.
Comparing intraobserver and interobserver reproducibility for features showing lower concordance, we speculate that some features were not sufficiently described to obtain an unequivocal interpretation (grainy epidermis or mild atypia), whereas others were probably subjectively but consistently interpreted (junctional thickenings, plump bright cells, and small bright particles). Better definitions and consequent further tutorial and training are required to improve agreement on these attributes.
On the whole, 5 of the 6 features used for MM diagnosis in the recently proposed RCM algorithm,8,9 corresponding to roundish pagetoid cells, nonedged papillae, atypical cells in basal layer, cerebriform clusters, and nucleated cells infiltrating dermal papillae, showed high overall levels of reproducibility, with concordance in 5 of 6 observers in over 70% of evaluated images; identification of mild atypia in basal layers showed lower agreement and is potentially a weaker diagnostic criterion. Diffuse pagetoid infiltration, corresponding to the sixth feature of the RCM algorithm, was not evaluated because it requires the evaluation of a wider lesion area (using RCM mosaic images) than that assessed in the individual high-resolution study images.
In conclusion, the standard consensus RCM terms useful for the evaluation of melanocytic lesions were reproducibly recognized by independent RCM observers. Specifically, the parameters that were recently described in an RCM algorithm for MM diagnosis showed a high level of agreement among observers. Considering the disagreement among pathologists on the diagnosis of MM vs benign melanocytic lesions and the less-than-ideal reproducibility of some histologic features of dysplastic nevi and MM,16 we may state that the identified RCM features can be reproducibly identified by trained observers. The good level of concordance for RCM descriptors was notably obtained after allowing participants access to a common training resource that consisted of a tutorial and a short training set of images. Of note, all participants were skilled in RCM use and interpretation of its images and knowledgeable of the literature and standard RCM terminology. Thus, more elaborate training may be required for beginners to achieve this level of feature recognition. Such training may be offered via didactic lectures and online tutorials. Finally, it is also important to stress that the final diagnosis with RCM is based not only on the evaluation of an isolated image but on the global evaluation of the many more images obtained per tumor, including VivaBlocks (mosaics of 1 × 1 to 8 × 8 mm of images at the same level of the epidermis), VivaStacks (several images at different depths), or videos. In the present study, only the reproducibility of isolated images was evaluated and not the reproducibility of a complete evaluation.
Correspondence: Giovanni Pellacani, MD, Department of Dermatology, University of Modena and Reggio Emilia, Via del Pozzo 71, 41100 Modena, Italy (email@example.com).
Accepted for Publication: March 17, 2009.
Author Contributions: All of the authors contributed to the study in a manner that meets authorship criteria, saw the final draft of the manuscript, and approved the validity of the data presented. Dr Pellacani had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Pellacani, Puig, Seidenari, and Malvehy. Acquisition of data: Pellacani, Bassoli, Braun, Longo, Marghoob, and Malvehy. Analysis and interpretation of data: Pellacani, Vinceti, Bassoli, Braun, Gonzalez, Guitera, Marghoob, Menzies, Puig, Scope, and Malvehy. Drafting of the manuscript: Pellacani. Critical revision of the manuscript for important intellectual content: Pellacani, Vinceti, Bassoli, Braun, Gonzalez, Guitera, Longo, Marghoob, Menzies, Puig, Scope, Seidenari, and Malvehy. Statistical analysis: Vinceti. Study supervision: Pellacani, Puig, Seidenari, and Malvehy.
Financial Disclosure: None reported.
Funding/Support: This study was supported in part by a grant from the Istituto Superiore di Sanità (ISS), Italy (project No. 527/B/3A/4).
Additional Contributions: Fabio Cannillo, informatics engineer, created the dedicated Web site and provided technical assistance.