Figure 1. Dermoscopic image of lesion 25, a superficial spreading melanoma (Breslow depth, 0.65 mm).
Figure 2. Dermoscopic image of lesion 3, a level I superficial spreading melanoma.
Dolianitis C, Kelly J, Wolfe R, Simpson P. Comparative Performance of 4 Dermoscopic Algorithms by Nonexperts for the Diagnosis of Melanocytic Lesions. Arch Dermatol. 2005;141(8):1008-1014. doi:10.1001/archderm.141.8.1008
To assess 4 dermoscopy methods in a nonexpert setting.
Sixty-one medical practitioners, mainly primary care physicians in Australia, were trained in 4 dermoscopy algorithms. Participants then assessed macroscopic and dermoscopic images of 40 melanocytic skin lesions. Each of the dermoscopic images was assessed with pattern analysis, the 7-point checklist, the ABCD rule, and the Menzies method.
The Menzies method showed the highest sensitivity, 84.6%, for the diagnosis of melanoma, followed by the 7-point checklist (81.4%), the ABCD rule (77.5%), pattern analysis (68.4%), and assessment of a macroscopic image (60.9%). Pattern analysis and assessment of the macroscopic image showed the highest specificity, 85.3% and 85.4%, respectively. The ABCD rule showed a specificity of 80.4%; the Menzies method, 77.7%; and the 7-point checklist, 73%. The Menzies method had a diagnostic accuracy of 81.1%; the ABCD rule, 79.0%; the 7-point checklist, 77.2%; pattern analysis, 76.8%; and clinical assessment, 73.2%.
All algorithms performed well in the hands of relatively inexpert practitioners who had undertaken self-guided training provided on compact disc. The Menzies method showed the highest diagnostic accuracy and sensitivity for melanoma diagnosis and was preferred by study participants.
Over recent decades, there has been an increasing incidence of melanoma.1-3 The most effective management strategy for a melanoma is early diagnosis and removal while it is still thin, when the chance of metastasis is low.4 As the features that suggest the presence of melanoma tend to develop as the tumor grows, however, thin melanomas are more difficult to diagnose.5 This difficulty of early clinical diagnosis has led to the development of dermoscopy (also known as dermatoscopy, epiluminescence microscopy, and oil immersion microscopy), which has been shown to improve the early diagnosis of melanomas.6 This technique has been widely adopted, and dermoscopic training has been shown to increase diagnostic accuracy.7,8 Dermoscopy has also been shown to increase the sensitivity for melanoma diagnosis by primary care physicians.9
Some of the dermoscopic algorithms currently in use are the 7-point checklist, the ABCD rule, pattern analysis, and the Menzies method.10-13 A recent study involving 40 expert dermoscopists concluded that pattern analysis has the highest diagnostic performance of these 4 methods.14 Other algorithms had sensitivity similar to that of pattern analysis but poorer specificity.
Clinicians with significant dermoscopic experience tend to look at a dermoscopic image and immediately process all the information leading to a diagnosis without using any specific algorithm. However, dermoscopic algorithms are helpful in reaching a diagnosis for those with less experience in dermoscopy. This study aims to determine which dermoscopic diagnostic algorithm shows the highest sensitivity and diagnostic accuracy for the diagnosis of melanoma in a less expert setting.
Primary care physicians, dermatologists, and dermatology trainees were invited to participate in this study, which was advertised at several medical meetings and on a Web site for primary care physicians. The study was conducted over a 12-month period from July 2001 to June 2002. The Alfred Hospital ethics committee granted approval.
Participants were required to use a computer to access study materials. They were given explanatory written material as well as 3 compact discs (CDs). Two CDs contained educational material on dermoscopy, one from the American Academy of Dermatology and the other from the Web site www.dermoscopy.org. Participants were advised to work through all the educational material prior to assessing the test set of images. The third CD contained 5 test sets, each containing 40 images of melanocytic skin lesions as well as 1 example. Test set 1 contained the macroscopic images of the skin lesions. The corresponding 40 dermoscopic images appeared on each of the remaining 4 test sets for assessment by 1 of 4 dermoscopic methods: pattern analysis for test set 2, ABCD for test set 3, the Menzies method for test set 4, and the 7-point checklist for test set 5. Nonmelanocytic lesions and clinical history were excluded to allow a better comparison between dermoscopic methods.
Test set 1 was assessed first by all participants, but to avoid recall bias, the macroscopic image was not shown with the corresponding dermoscopic image. The order of lesions was randomly arranged in each test set so that participants assessed the lesions in a different order for each test set. Also, the random order of the dermoscopy images in test sets 2 through 5 made it difficult for participants to compare the results of a particular dermoscopic image using different methods. Participants were randomly allocated to 1 of 4 groups (A-D) to determine the order in which they were asked to assess the images in test sets 2 through 5: the test set order for group A was 2, 3, 4, 5; group B, 5, 4, 3, 2; group C, 3, 5, 2, 4; and group D, 4, 2, 5, 3. These precautions in the design of the study were used to minimize the chance of learning bias.
The test set of lesions was randomly selected by one author (C.D.) from a collection of dermoscopic images belonging to another author (J.K.). Only good-quality macroscopic and dermoscopic images in which the whole lesion, including its entire periphery, was visible were included. Dermoscopic images had a ×10 magnification. Twenty melanomas and 20 nonmelanomas were chosen. For the ABCD rule, lesions scoring higher than 4.75 ("lesions of concern") were grouped with lesions scoring higher than 5.45 (melanoma) as positive diagnoses in the final assessment.
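For readers unfamiliar with these cutoffs, the ABCD rule combines its 4 criteria into a total dermoscopy score (TDS) with fixed weights. The sketch below is our illustrative implementation of that scoring and of the thresholds quoted above; it is not software from the study, which scored paper worksheets by hand.

```python
def abcd_tds(asymmetry, border, colors, structures):
    """Total dermoscopy score (TDS) under the ABCD rule.

    asymmetry:  0-2 (number of axes of asymmetry)
    border:     0-8 (segments with an abrupt pigment cutoff)
    colors:     1-6 (number of colors present)
    structures: 1-5 (number of dermoscopic structural components)
    """
    return 1.3 * asymmetry + 0.1 * border + 0.5 * colors + 0.5 * structures


def abcd_classify(tds):
    """Apply the cutoffs used in this study: >5.45 melanoma,
    >4.75 'lesion of concern' (counted as positive here), else benign."""
    if tds > 5.45:
        return "melanoma"
    if tds > 4.75:
        return "lesion of concern"
    return "benign"
```

For example, a maximally atypical lesion (asymmetry 2, border 8, 6 colors, 5 structures) scores 8.9 and is classified as melanoma, whereas a fully symmetric, uniform lesion scores 1.0 and is classified as benign.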
Participants noted their answers on worksheets and had no time restraints. These worksheets were returned to the research team in postage-paid envelopes together with responses to a short questionnaire on the participants’ experience. At the conclusion of the study, the worksheet results were entered into 2 separate computer databases, and the resulting 2 data sets were compared to assess any discrepancies. Discrepancies were checked against the original worksheet, and a final data set was produced with the discrepancies corrected. This allowed for the estimation of the rate of error in the data entry process.
To summarize interclinician agreement in diagnosis and identification of specific features within diagnostic methods, κ statistics were calculated for binary features and quadratic-weighted κ statistics were calculated for features on an ordinal scale. Intraclinician agreement was assessed by comparing responses made to similar features on different diagnostic methods for the same lesion, again using κ statistics. The κ statistic is equal to 0 when the amount of agreement is exactly what is expected by chance and 1 when there is perfect agreement. We used the following interpretations of intermediate values (Landis and Koch): <0.0, poor; 0.0 to <0.2, slight; 0.2 to <0.4, fair; 0.4 to <0.6, moderate; 0.6 to 0.8, substantial; and >0.8, almost perfect agreement.
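Cohen's κ compares the observed proportion of agreement with the agreement expected by chance from each rater's marginal frequencies. The sketch below illustrates the unweighted binary case together with the Landis and Koch bands quoted above; the function names are ours, not the study's, and the weighted variant used for ordinal features is omitted.

```python
def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' binary (0/1) labels."""
    n = len(rater_a)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal frequency of "1".
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Interpretation bands quoted in the text."""
    if kappa < 0.0:
        return "poor"
    if kappa < 0.2:
        return "slight"
    if kappa < 0.4:
        return "fair"
    if kappa < 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"
```

Note that two raters who agree exactly as often as chance predicts (eg, independent coin flips with matched marginals) score κ = 0 even though their raw agreement may be 50%.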
Confidence intervals (CIs) were calculated for diagnostic accuracy, sensitivity, and specificity using logistic regression, with “robust” standard errors calculated using the information sandwich formula to allow for any correlation within lesion assessments by the same clinician. To calculate CIs and P values to compare diagnostic accuracy, sensitivity, and specificity between pairs of methods, we used a model for risk differences because the interpretation from this analysis is simpler than for a logistic regression model for odds ratios and we found the conclusions from the 2 sets of analyses were identical for our data. Robust standard errors were also calculated for these comparisons. Positive likelihood ratios were calculated to quantify the relative increase in odds of a lesion being a melanoma from before an assessment is done (ie, based on community prevalence of melanoma) to after a positive assessment is obtained. All statistical analyses were performed using Stata Statistical Software, version 7.0 (StataCorp LP, College Station, Tex).
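The positive likelihood ratio has a simple closed form, LR+ = sensitivity / (1 − specificity). As an illustrative check (our calculation, not a figure from the article's tables), plugging in the Menzies method's reported sensitivity (84.6%) and specificity (77.7%) gives an LR+ of roughly 3.8:

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+: the factor by which the odds of melanoma increase
    after a positive assessment."""
    return sensitivity / (1.0 - specificity)

# Illustrative values taken from this study's reported results.
menzies_lr = positive_likelihood_ratio(0.846, 0.777)  # ~3.79
```

A lesion with prior odds of melanoma p/(1 − p) would thus have its odds multiplied by about 3.8 after a positive Menzies assessment.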
Questionnaires and study materials were sent to 200 medical practitioners who showed interest in the project: 61 (30.5%) completed and returned the questionnaires, including 10 dermatologists, 16 dermatology trainees, and 35 general practitioners. They had a range of experience levels with assessment of skin lesions, as outlined in Table 1, and a significant number were novices in dermoscopy.
Table 2 lists the 20 benign lesions (including, among others, 7 dysplastic nevi, 3 Spitz nevi, 3 junctional nevi, and 2 compound nevi) and 20 melanomas (18 superficial spreading melanomas and 2 lentigo maligna melanomas) included in the study. Six of the 20 melanomas were level I melanomas. The 14 invasive melanomas had a median thickness of 0.50 mm. This set of melanomas represents a group of early thin melanomas.
For some lesions, diagnostic accuracy was good across the different methods (eg, lesion 9), but for other lesions, the accuracy of the diagnosis was poor (eg, lesion 37). In some lesions, the clinical diagnosis stood out as performing worse than the other methods (eg, lesions 19, 33, and 34) (Table 2).
As summarized in Table 3, the Menzies method showed the highest diagnostic accuracy and sensitivity. Its sensitivity was significantly higher than that of all other methods except the 7-point checklist, for which the difference approached significance (P = .07) (Table 4). Clinical assessment had the highest specificity overall, although the specificity of pattern analysis was effectively identical. There is a trade-off: the methods with higher sensitivity fared worst on specificity. All methods had similar likelihood ratios (Table 3). For each participant, classifying a lesion as melanoma when any of the 4 dermoscopic methods did so yielded a sensitivity of 93.8%; when all 4 methods showed the lesion to be benign, this was correct 58.8% of the time.
There was accurate diagnosis of some individual lesions across the different methods (eg, lesion 9, a 0.46-mm-thick, level II, superficial spreading melanoma), but for other lesions, the accuracy of the diagnosis was poor (eg, lesions 36 and 37, both dysplastic nevi, and lesion 25, a level IV, 0.65-mm, superficial spreading melanoma [Figure 1]). Lesion 25 showed a sensitivity of only 21.3% with the ABCD method because the scores for each of the 4 main features were in or below the middle range, leading to a total score lower than 4.75 in 79% of cases. With the Menzies method, 48% of participants scored the lesion to have radial streaming and/or peripheral black dots and/or globules, thus leading to a diagnosis of melanoma. With the 7-point checklist, participants found irregular streaks, irregular dots and/or globules, or irregular pigmentation most frequently, but the score was 3 or higher in only 36% of cases. Using pattern analysis, 25% of participants classified lesion 25 as a melanoma, with starburst pattern, irregular streaks, atypical pattern, and irregular dots and/or globules being the most common positive findings.
In some lesions, the results of clinical diagnosis were much less accurate than the results of the dermoscopic methods (eg, lesions 19, 33, and 34). Lesion 3, a level I superficial spreading melanoma (Figure 2), was diagnosed more accurately by the Menzies method than by other methods. The sensitivity of the Menzies method was 83% for lesion 3, with participants noting peripheral black dots and/or globules, multiple brown dots, and/or pseudopods. The ABCD method showed a sensitivity of 57% for this lesion, with participants rating the lesion 1.3 most commonly for asymmetry and noting 2 or 3 colors and 2 to 4 dermoscopic features. The 7-point checklist showed a sensitivity of 49%, with atypical pigment network, irregular streaks, irregular pigmentation, and irregular dots and/or globules being found most frequently. Pattern analysis showed 36% sensitivity despite a high level of participants rating the global pattern as multicomponent. Atypical pigment network, irregular streaks, and irregular dots and/or globules were the most common findings.
Intrarater agreement was moderate for blue-white veil (κ = 0.54) and regression structures (κ = 0.51), and poorer agreement was seen for other features such as the presence of streaks or pigment network (Table 5). The κ statistics were strongly influenced by the proportion of lesions that were positive for the feature, and this in part explained the observed patterns in κ statistics.
Overall melanoma diagnosis showed moderate agreement between raters (Table 6), with the Menzies method doing best and macroscopic image–based diagnosis worst. Within each method, the interrater agreement for individual features ranged from slight to moderate, though in most cases, the agreement for the final diagnosis of melanoma or nonmelanoma was higher than the agreement for individual features within the method. For the individual features in the 7-point checklist and the pattern method, the agreement was fair or worse, except for moderate agreement for regression structures. Within the Menzies method, the presence of a single color, symmetry of pattern, and scarlike depigmentation were the only features for which moderate agreement among clinicians was observed, whereas the assessment of pseudopods, peripheral black dots, and broadened network showed only slight agreement between observers. Agreement for the ABCD rule ranged from slight for branched streaks to moderate for the presence of light brown color and asymmetry analysis.
When comparing the performance of the 17 clinicians who preferred the Menzies method with that of the 44 not known to prefer it, we found that these 17 clinicians performed better (higher sensitivity and specificity) with the Menzies method in our study. However, they also performed better with all the other methods, and this improvement was approximately equal across methods. Clinicians with no preferred method prior to the commencement of the study showed slightly higher sensitivities and slightly poorer specificities for all methods compared with clinicians with a preferred method.
Errors made by a physician within a method occurred in 1.8% of assessments for the 7-point checklist, 2.4% for the Menzies method, 2.7% for pattern analysis, and 10.3% for the ABCD rule. However, these errors did not necessarily lead to an error in diagnosis (relative to what would have been diagnosed had no errors been made).
For the Menzies method, errors included 37 instances (1.6% of all physician assessments) in which one or both negative features were present but the clinician diagnosed melanoma. In 18 instances (0.8%), negative features were not present and at least 1 positive feature was and still the clinician diagnosed the lesion as nonmelanoma.
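Both error classes follow from the structure of the Menzies rule, which diagnoses melanoma only when neither negative feature is present and at least 1 positive feature is found. A minimal sketch of that decision logic (our illustration, not study software):

```python
def menzies_diagnosis(has_negative_feature, n_positive_features):
    """Menzies method: melanoma iff neither negative feature
    (symmetry of pattern, presence of a single color) is present
    AND at least 1 positive feature is found."""
    return (not has_negative_feature) and n_positive_features >= 1
```

The first error class described above corresponds to calling melanoma despite `has_negative_feature` being true; the second to calling a lesion benign despite the rule's conditions for melanoma being met.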
Thirty-five primary care physicians, 16 dermatology trainees, and 10 dermatologists completed this study. Fifty-seven percent of participants used the dermoscope at least daily; 60% of participants diagnosed fewer than 5 melanomas per year. This latter group was composed of practitioners who assessed pigmented skin lesions regularly but were not highly expert in dermoscopy. The time necessary to complete the study (10-20 hours, as indicated by participants) was a significant factor contributing to nonparticipation, with only 30.5% of people who initially showed interest completing the study.
It could be argued that the conditions under which the test sets of images were assessed were somewhat artificial. There was no clinical history, no macroscopic image to accompany the dermoscopic image, and all images were viewed on a computer screen. Since this study aimed specifically to assess the dermoscopic methods, clinical history was left out. The corresponding macroscopic image was excluded to remove the bias that seeing this image might have had on diagnostic decision making. Participants were asked to assess the 40 clinical images first and then to assess the 4 test sets of dermoscopic images in a specified order. Having participants evaluate the test sets with the 4 diagnostic methods in different orders minimized learning bias toward any particular method.
The melanomas included in this study were early thin melanomas, at the stage where dermoscopic assessment might help in making the diagnosis. It should be noted that the melanomas chosen were lesions to which the algorithms could be applied, ie, pigmented melanomas with an associated radial growth phase. Amelanotic and nodular melanomas were excluded and are a source of selection and verification bias in studies such as this. The sensitivities and specificities quoted in this and most other studies apply only to pigmented melanomas with an associated radial growth phase. As many as 15% to 20% of melanomas are nodular or amelanotic and, as such, are not addressed by these algorithms, thus inflating the apparent accuracy of diagnosis compared with the assessment of all melanomas together.
Education of participants and assessment of the test set of lesions were possible with the use of a computer. Nonexperts in dermoscopy have been shown to benefit from computer-based training.15 Digital dermoscopic images have been shown to be as informative as photographic slides,16 so the use of digital images for assessment of the diagnostic methods is justified.
Only a few studies have compared dermoscopic algorithms. Carli et al17 concluded that pattern analysis is the most reliable diagnostic dermoscopic method and should be taught to trainees in dermatology. This conclusion was based on the finding that pattern analysis showed the best mean diagnostic accuracy. In their study, however, the 7-point checklist showed the highest sensitivity of the 3 methods examined (ABCD, pattern analysis, and 7-point checklist). The Menzies method was not examined.
With the first part of the 2-part algorithm, differentiating melanocytic from nonmelanocytic lesions,18 high specificity is important to rule out melanoma. In the second part of this algorithm, high sensitivity is more important, to detect as many melanomas as possible. When these methods are combined, diagnostic accuracy is a more representative measure of which method has better results. Methods showing the highest sensitivity for the diagnosis of melanoma will be the most useful in the primary care setting, resulting in fewer melanomas being missed.
Our results differ significantly from those of Argenziano et al,14 who conducted a somewhat similar study with highly expert dermoscopists. In that study, participants were given the clinical history, a macroscopic clinical image, and then the dermoscopic image to be assessed with the 4 diagnostic algorithms. The authors found sensitivities of the different methods to range from 82.6% to 85.7% compared with 68.4% to 84.6% in our study, with the Menzies method being highest in both. The range of specificities in both studies was similar, with pattern analysis showing the highest specificity in both studies. By eliminating the macroscopic image, we were able to make a comparison between dermoscopic algorithms and the naked-eye assessment. The latter showed good specificity.
One explanation for the difference in results between our study and that by Argenziano et al14 is that our study included many participants who were not expert in dermoscopy. It is likely that the experts in the study of Argenziano et al used their experience to make an intuitive (gestalt) dermoscopic diagnosis and were influenced by this in their application of the algorithms. Other clues to the diagnosis provided in their study were the clinical history and macroscopic images, and these may also have influenced the use of algorithms and reduced their role in achieving the diagnosis.
With regard to specific dermoscopic features, intraobserver agreement was highest for blue-white veil and for regression structures. The interobserver agreement was highest for assessment of lesion symmetry, regression structures, and presence of a single color. The interobserver agreement showed generally poor κ values for most other individual features but much better results for the method as a whole. Similar findings in the study by Argenziano et al14 were explained by the overall dermoscopic gestalt impression of a pigmented skin lesion leading to the correct diagnosis independent of agreement for individual dermoscopic features. Many of our participants did not have sufficient experience in dermoscopy to account for the development of significant dermoscopic gestalt. However, education in the 4 dermoscopic methods may have contributed to development of some intuitive assessment, as suggested by the observation that critical physician errors in application of algorithms did not always lead to the wrong diagnosis.
The number of errors made in the application of diagnostic methods was small relative to the number of lesions assessed by all participants. The rate of errors was a few percent of lesion assessments for the 7-point checklist and the Menzies method. Although the ABCD method showed a worryingly high error rate, these errors often did not lead to changes in diagnosis. In fact, the ABCD and Menzies methods had a similar rate of errors leading to misdiagnosis. The rates of error are more difficult to determine with pattern analysis. Features such as the blue-white veil, regression structures, and the multicomponent pattern corresponded with a diagnosis of melanoma 70% to 80% of the time.
The Menzies method was the preferred method of diagnosis for 28% of participants prior to completing our study, and this preference may have been a source of bias. However, the clinicians who preferred the Menzies method prior to the study performed equally well across methods, which suggests that this preference was not a major influence on their performance in other methods. Possible reasons for this stated preference include a preference by inexpert dermoscopists for a simpler method and bias toward a method that is locally described and taught.
In the study by Argenziano et al,14 most participants preferred pattern analysis, which was found to have the best overall performance in that study. Pattern analysis is the most difficult method to learn and requires the most experience, but it is likely to be the best method for experts, as that study demonstrated.
On the other hand, for those who wish to learn dermoscopy and for primary care physicians, the Menzies method is simple to learn and has the highest sensitivity for the diagnosis of melanoma. When combined with the macroscopic view (which demonstrated high specificity in our study) and the first part of the dermoscopic algorithm, there is a good basis for effective early diagnosis of melanoma.
We conclude that the Menzies method shows the highest sensitivity and accuracy for the diagnosis of melanoma, and this may be the best method for those who wish to learn dermoscopy and for primary care physicians.
Correspondence: Con Dolianitis, MB, BS, 191 Hotham St, Elsternwick 3185, Australia (firstname.lastname@example.org).
Accepted for Publication: April 23, 2005.
Funding/Support: This study was supported by a grant from the Scientific Research Fund of the Australasian College of Dermatology.
Acknowledgment: We thank the American Academy of Dermatology, Washington, DC, for their permission to use the 1999 educational CD Dermoscopy: A Practical Guide.
Financial Disclosure: None.