After the first-step diagnostic algorithm, the 3387 SLs evaluated as melanocytic underwent the next 4 diagnostic algorithms. ABCD indicates asymmetry, border, color, and diameter; SLs, skin lesions.
a Melanocytic vs nonmelanocytic SLs.
b Melanoma vs benign melanocytic SLs.
c Malignant vs benign SLs.
The first-step algorithm (study 1) aimed to differentiate between melanocytic and nonmelanocytic lesions. The pattern analysis algorithm (study 1) aimed to differentiate between melanoma and benign melanocytic lesions. The 3-point checklist algorithm (study 2) aimed to differentiate between malignant (melanoma and basal cell carcinoma) and benign skin lesions. With increasing group size, the true-positive rate increased and the false-positive rate decreased for each combination of collective intelligence approach and diagnostic algorithm. Data are expressed as mean values. The dashed lines represent the mean individual true- and false-positive rates (ie, group size of 1).
Each point is obtained by setting a different quorum threshold, ranging from 0 to 1 in increments of 0.05. Data are shown for the first-step, pattern analysis, and 3-point checklist diagnostic algorithms. Data are based on a group size of 11.
eFigure 1. True-Positive Rate of Melanoma and Basal Cell Carcinoma in Study 2
eFigure 2. Effect of Increasing Group Size for ABCD Rule
eFigure 3. Effect of Increasing Group Size for Menzies Method
eFigure 4. Effect of Increasing Group Size for 7-Point Checklist
eFigure 5. ROC Curves When Varying the Quorum Threshold
Ralf H. J. M. Kurvers, Jens Krause, Giuseppe Argenziano, Iris Zalaudek, Max Wolf. Detection Accuracy of Collective Intelligence Assessments for Skin Cancer Diagnosis. JAMA Dermatol. 2015;151(12):1346–1353. doi:10.1001/jamadermatol.2015.3149
Importance
Incidence rates of skin cancer are increasing globally, and the correct classification of skin lesions (SLs) into benign and malignant tissue remains a continuous challenge. A collective intelligence approach to skin cancer detection may improve accuracy.
Objective
To evaluate the performance of 2 well-known collective intelligence rules (majority rule and quorum rule) that combine the independent conclusions of multiple decision makers into a single decision.
Design, Setting, and Participants
Evaluations were obtained from 2 large and independent data sets. The first data set consisted of 40 experienced dermoscopists, each of whom independently evaluated 108 images of SLs during the Consensus Net Meeting of 2000. The second data set consisted of 82 medical professionals with varying degrees of dermatology experience, each of whom evaluated a minimum of 110 SLs. All SLs were evaluated via the Internet. Image selection of SLs was based on high image quality and the presence of histopathologic information. Data were collected from July through October 2000 for study 1 and from February 2003 through January 2004 for study 2 and evaluated from January 5 through August 7, 2015.
Main Outcomes and Measures
For both collective intelligence rules, we determined the true-positive rate (ie, the hit rate or sensitivity) and the false-positive rate (ie, the false-alarm rate or 1 − specificity) and compared these rates with the performance of single decision makers. Furthermore, we evaluated the effect of group size on true- and false-positive rates.
Results
One hundred twenty-two medical professionals performed 16 029 evaluations. Use of either collective intelligence rule consistently outperformed single decision makers. The groups achieved an increased true-positive rate and a decreased false-positive rate. For example, individual decision makers in study 1, using pattern analysis as the diagnostic algorithm, achieved a true-positive rate of 0.83 and a false-positive rate of 0.17. Groups of 3 individuals achieved a true-positive rate of 0.91 and a false-positive rate of 0.14. These improvements increased with increasing group size.
Conclusions and Relevance
Collective intelligence might be a viable approach to increase diagnostic accuracy in skin cancer and reduce skin cancer–related mortality.
Incidence rates of skin cancer have been increasing during the past 5 decades in the United States and many parts of Europe.1,2 The key to reducing the mortality rate due to skin cancer includes early detection and correct classification of skin lesions (SLs).3,4 During the past 2 decades, several different approaches have been shown to increase diagnostic accuracy, most notably dermoscopy4-7 and computer-aided diagnosis.8-11 We herein focus on an alternative and complementary collective intelligence approach that, to our knowledge, has received little attention in skin cancer research.
Collective intelligence refers to the ability of groups to outperform single individuals when performing cognitive tasks.12-16 Well-known examples of collective intelligence include the prediction of election outcomes and memory retrieval and number estimation tasks.12,17,18 Although collective intelligence can thus be used in a diverse range of tasks, at present, little is known about the scope for collective intelligence in medical diagnostics. We herein investigate whether a collective intelligence approach can be used to improve diagnostic accuracy in skin cancer detection.
Although most research on skin cancer detection focuses on single raters, King et al19 investigated a crowdsourcing approach in the context of skin self-examination. In their study, participants with little to no experience in dermatology independently classified 40 images as melanoma or nonmelanoma. Afterward, these decisions were combined into a single collective decision. Compared with decisions made by individuals, collective decisions achieved a higher true-positive rate (ie, hit rate or sensitivity) but also a higher false-positive rate (ie, false-alarm rate or 1 − specificity). However, collective decisions consisted of the mean ratings of 400 individuals, and the study did not include experienced dermatologists, who outperform individuals with little experience in dermatology. Thus, our understanding of how combining independent assessments by dermatologists affects diagnostic accuracy is limited, and the extent to which a collective intelligence approach can improve skin cancer detection is unclear.
Herein we investigated the potential of a collective intelligence approach in skin cancer detection. We used 2 large and independent data sets to investigate whether 2 well-known collective intelligence rules (majority rule and quorum rule) that combine the independent assessments of multiple raters can improve diagnostic accuracy when differentiating between different types of SLs.
We used 2 data sets based on 2 published studies in which patient data were deidentified.20,21 Institutional review board approval was waived by the Second University of Naples for both studies because they did not affect the routine procedures during clinical practice. Data were collected from July through October 2000 for study 1 and from February 2003 through January 2004 for study 2. Brief descriptions of each study follow.
The first study was based on a consensus meeting via the Internet, known as the Consensus Net Meeting on Dermoscopy.20 In this study, 40 experienced clinical dermoscopists (with ≥5 years of experience in dermoscopy practice, teaching, and publishing) independently diagnosed 128 digital images of SLs. Skin lesion images were obtained from the Department of Dermatology, University Federico II; the Department of Dermatology, University of L’Aquila; the Department of Dermatology, University of Graz; the Sydney Melanoma Unit, Royal Prince Alfred Hospital; and the Skin and Cancer Associates, Plantation, Florida.20 Skin lesions were selected based on the photographic quality of the clinical and dermoscopic images. Histopathologic specimens of all SLs were available and judged by a histopathology panel. Diagnostic categories included melanoma (n = 33), benign melanocytic SLs (n = 70), basal cell carcinoma (n = 10), and other nonmelanocytic SLs (n = 15, including 10 seborrheic keratoses, 2 vascular lesions, 2 dermatofibromas, and 1 lichen planus–like keratosis). Participants evaluated the dermoscopic images of the SLs via the Internet. Dermoscopists first underwent a training procedure consisting of 20 SLs, during which they received web-based tutorials to familiarize them with the definitions and procedures. Then, dermoscopists evaluated the remaining 108 SLs (Figure 1). First, each participant was asked to evaluate each SL using the first-step diagnostic algorithm, which differentiates melanocytic from nonmelanocytic lesions. Whenever a participant evaluated an SL as melanocytic, the participant was asked to classify the SL as melanoma or a benign melanocytic lesion. For this classification, each participant was instructed to use the following 4 diagnostic algorithms sequentially: (1) pattern analysis; (2) the ABCD (asymmetry, border, color, and diameter) rule; (3) the Menzies method; and (4) a 7-point checklist.20
The second Internet-based study included 165 digital images of SLs and 170 participants.21 The 165 SLs were seen and selected at a specialized pigmented lesion clinic established by the Department of Dermatology, Second University of Naples. Skin lesions were selected based on high image quality and the presence of melanin or hemoglobin pigmentation in all or part of the lesion. Whereas study 1 focused on the differentiation between melanocytic and nonmelanocytic lesions (and within the melanocytic lesions, on the differentiation between melanoma and benign melanocytic lesions), study 2 focused on the differentiation between malignant (including melanoma and basal cell carcinoma) and benign SLs. Results of a histopathologic examination classified each lesion as malignant (n = 49) or benign (n = 116). The participants varied in their dermatology experience (Table 1). The participants evaluated the SLs via the Internet. After a training procedure consisting of 15 SLs, the participants were asked to evaluate the remaining 150 SLs using the 3-point checklist as a diagnostic algorithm. The 3-point checklist is based on 3 dermoscopic criteria (asymmetry, atypical network, and blue-white structure), whereby the presence of 2 or more criteria is considered indicative of malignancy. Not all the 170 participants evaluated all the images. We excluded all individuals who evaluated fewer than 110 images, which resulted in 82 participants.
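The 3-point checklist decision rule described above can be sketched in a few lines. This is only an illustration of the "2 or more criteria" rule as stated in the text; the function name, argument names, and the string labels are hypothetical and not taken from the study.

```python
def three_point_checklist(asymmetry: bool,
                          atypical_network: bool,
                          blue_white_structure: bool) -> str:
    """Classify a lesion with the 3-point checklist as described in the text.

    The presence of 2 or more of the 3 dermoscopic criteria is treated as
    indicative of malignancy. The labels returned here are illustrative.
    """
    # Booleans count as 0 or 1, so the sum is the number of criteria present.
    score = sum([asymmetry, atypical_network, blue_white_structure])
    return "malignant" if score >= 2 else "benign"


if __name__ == "__main__":
    print(three_point_checklist(True, True, False))   # two criteria present
    print(three_point_checklist(False, True, False))  # one criterion present
```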
To summarize, the participants in study 1 first used a diagnostic algorithm (first step) to differentiate melanocytic from nonmelanocytic SLs, and if a participant evaluated an SL as melanocytic, then 4 different diagnostic algorithms (pattern analysis, the ABCD rule, the Menzies method, and a 7-point checklist) were used to differentiate melanoma from benign melanocytic lesions. Participants in study 2 used a single diagnostic algorithm (a 3-point checklist) to differentiate malignant from benign SLs. To test the performance and robustness of a collective intelligence approach, we investigated the performance of both collective intelligence rules for each diagnostic algorithm.
We tested the performance of 2 well-known collective intelligence rules: the majority and the quorum rules. These rules aggregate the independent assessments of multiple raters. We applied the majority and the quorum rules to each of the 6 diagnostic algorithms described above. In the following description we will use the first-step diagnostic algorithm as an example. All other diagnostic algorithms were analyzed following the same procedure.
The majority rule classifies each SL according to the majority opinion of the raters.22-24 For the first-step algorithm, majority rule implies that whenever the majority of the group members classifies an SL as melanocytic, the SL is classified as melanocytic; otherwise it is classified as nonmelanocytic. For a given group size (n; range, 1-11, only using odd numbers to avoid a tie-breaker rule), we randomly drew n evaluations for each SL in the data set. For each SL, we then determined the number of evaluations supporting melanocytic and nonmelanocytic classifications. Each SL was classified based on the option that received the most support. After classifying each SL in this way, we used the histopathologic records to determine the true- and false-positive rates of the majority rule. For each group size n, we repeated this procedure 2000 times. We report the mean (SEM) true- and false-positive rates per group size.
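The majority-rule resampling procedure described above might be sketched as follows. This is a minimal illustration, not the authors' code: the data layout (`ratings[i]` holding all available evaluations of lesion `i` as booleans, `truth[i]` holding the histopathologic label) and all function names are assumptions, and drawing evaluations without replacement is also an assumption, since the text only says that n evaluations were drawn at random.

```python
import random


def majority_vote(evaluations):
    """True when more than half of the evaluations are positive (odd n avoids ties)."""
    return sum(evaluations) * 2 > len(evaluations)


def rates(decisions, truth):
    """Return (true-positive rate, false-positive rate) against the reference labels."""
    tp = sum(d and t for d, t in zip(decisions, truth))
    fp = sum(d and not t for d, t in zip(decisions, truth))
    positives = sum(truth)
    negatives = len(truth) - positives
    return tp / positives, fp / negatives


def simulate_majority(ratings, truth, n, repeats=2000, seed=0):
    """Mean TPR and FPR of the majority rule for group size n.

    In each repeat, n evaluations are drawn per lesion (assumed: without
    replacement) and the lesion is classified by majority vote.
    """
    rng = random.Random(seed)
    tpr_sum = fpr_sum = 0.0
    for _ in range(repeats):
        decisions = [majority_vote(rng.sample(lesion, n)) for lesion in ratings]
        tpr, fpr = rates(decisions, truth)
        tpr_sum += tpr
        fpr_sum += fpr
    return tpr_sum / repeats, fpr_sum / repeats
```

Calling `simulate_majority(ratings, truth, n)` for n = 1, 3, ..., 11 would give one mean true- and false-positive rate per group size, mirroring the structure of the analysis; the standard error reported in the study is not computed in this sketch.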
The quorum rule uses a so-called quorum threshold to classify an SL. Each SL is classified as condition present whenever the fraction of evaluations for condition present is at or above the quorum threshold; otherwise the SL is classified as condition absent. For example, for the first-step diagnostic algorithm and a quorum threshold of 0.3, an SL is classified as melanocytic whenever at least 30% of the raters in a group classify it as melanocytic; otherwise it is classified as nonmelanocytic. Compared with single raters, groups using the quorum rule are predicted to increase true-positive results and decrease false-positive results whenever the quorum threshold is set between the mean true- and false-positive rates of the raters.25,26 The procedure went as follows. First, we randomly assigned half of the SLs to a training set and the other half to a validation set. The training set was used to determine the quorum threshold. We calculated the mean true- and false-positive rates of the participants in the training set and set the quorum threshold halfway between both values (alternative ways of setting the quorum threshold are described below). We then investigated the performance of this quorum threshold in the validation set. For all SLs in the validation set, we randomly drew n (range, 1-11, only using odd numbers) evaluations. For each SL in the validation set, we then determined the fraction of evaluations supporting the condition-present classification (for the first-step algorithm, melanocytic). Whenever this fraction was higher than (or equal to) the quorum threshold, the SL was classified as condition present; otherwise as condition absent. After classifying each SL in the validation set in this way, we used the histopathologic records to determine the true- and false-positive rates of the quorum rule in the validation set. For each group size, we repeated this procedure 2000 times. We report the mean (SEM) true- and false-positive rates per group size.
The SLs in the validation set are different from those in the training set. This cross-validation procedure ensures an independent evaluation of the performance of the quorum threshold, thereby preventing overfitting.
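One way to picture the quorum rule with its training and validation split is the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the data layout (`ratings[i]` as a list of boolean evaluations of lesion `i`, `truth[i]` as the histopathologic label), the function names, and the pooling of all evaluations to obtain the mean individual rates (which matches a per-rater mean only when every rater evaluated every lesion) are all hypothetical choices.

```python
import random


def rates(decisions, truth):
    """Return (true-positive rate, false-positive rate)."""
    tp = sum(d and t for d, t in zip(decisions, truth))
    fp = sum(d and not t for d, t in zip(decisions, truth))
    positives = sum(truth)
    return tp / positives, fp / (len(truth) - positives)


def quorum_decision(evaluations, threshold):
    """Condition present when the fraction of positive evaluations is >= threshold."""
    return sum(evaluations) / len(evaluations) >= threshold


def midpoint_threshold(ratings, truth):
    """Quorum threshold halfway between the mean individual TPR and FPR.

    Assumption: rates are pooled over all evaluations of the training lesions.
    """
    pos = [e for lesion, t in zip(ratings, truth) if t for e in lesion]
    neg = [e for lesion, t in zip(ratings, truth) if not t for e in lesion]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def split_half(n_lesions, rng):
    """Randomly assign half of the lesion indices to training, the rest to validation."""
    idx = list(range(n_lesions))
    rng.shuffle(idx)
    half = n_lesions // 2
    return idx[:half], idx[half:]


def evaluate_quorum(ratings, truth, train, valid, n, rng):
    """Fit the threshold on the training lesions, then score groups of n on validation."""
    threshold = midpoint_threshold([ratings[i] for i in train],
                                   [truth[i] for i in train])
    decisions = [quorum_decision(rng.sample(ratings[i], n), threshold)
                 for i in valid]
    return rates(decisions, [truth[i] for i in valid])
```

Repeating `split_half` followed by `evaluate_quorum` many times per group size and averaging the results would follow the structure of the procedure. Both halves need at least one lesion of each class for the rates to be defined, which this sketch does not enforce.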
We determined the performance of the best individual in each group using a similar cross-validation procedure. First, we randomly assigned half of the SLs to a training set and the other half to a validation set. The training set was used to identify the best individual. For a given group size n (range, 1-11, only using odd numbers), we randomly drew n individuals and determined the performance of these individuals in the training set. In the training set, we calculated the true-positive and true-negative rates of each individual and selected the best individual, giving equal weight to that individual’s true-positive and true-negative rates. We then calculated the true- and false-positive rates of this best individual using the SLs from the validation set. We repeated this procedure 2000 times per group size.
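The best-individual selection step can be sketched as below. The text states only that true-positive and true-negative rates were given equal weight; using their unweighted mean as the selection score, resolving ties by taking the first rater, and the per-rater data layout (`by_rater[r][i]` as rater `r`'s boolean evaluation of lesion `i`) are assumptions made for illustration.

```python
def balanced_score(evals, truth):
    """Unweighted mean of the true-positive and true-negative rates."""
    tp = sum(e and t for e, t in zip(evals, truth))
    tn = sum((not e) and (not t) for e, t in zip(evals, truth))
    positives = sum(truth)
    negatives = len(truth) - positives
    return (tp / positives + tn / negatives) / 2


def best_individual(by_rater, truth, train, group):
    """Return the rater in `group` with the highest balanced score on the training lesions.

    Ties go to the rater listed first in `group` (an arbitrary choice).
    """
    train_truth = [truth[i] for i in train]
    return max(
        group,
        key=lambda r: balanced_score([by_rater[r][i] for i in train], train_truth),
    )
```

The selected rater's true- and false-positive rates would then be computed on the validation lesions only, so that selection and evaluation use different SLs.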
Data were analyzed from January 5 through August 7, 2015. We analyzed the effect of group size (ie, the number of independent evaluations) on the true- and false-positive rates using generalized linear models with binomial errors and a logit-link function because true- and false-positive rates are bounded between 0 and 1. We used the built-in generalized linear modeling function in R (version 3.2.0; R Development Core Team 2010). Significance levels were derived from the z scores and associated P values and set at P < .05.
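For readers unfamiliar with this model class, a binomial generalized linear model with a logit link and group size as the predictor can be written as follows. The exact model specification is not given in the text, so this form, and the symbols p_n (the true- or false-positive rate at group size n), β0, and β1, are assumptions for illustration only.

```latex
\operatorname{logit}(p_n) = \ln\frac{p_n}{1 - p_n} = \beta_0 + \beta_1 n
\qquad\Longleftrightarrow\qquad
p_n = \frac{1}{1 + e^{-(\beta_0 + \beta_1 n)}}
```

Under this form, a positive β1 corresponds to a rate that increases with group size and a negative β1 to a rate that decreases, while the fitted rate always stays between 0 and 1.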
Table 1 provides an overview of the demographics of the participants in studies 1 and 2. Figure 2 shows the results of applying the majority and quorum rules to the first-step, pattern analysis, and 3-point checklist diagnostic algorithms. We found that an increasing group size increased the diagnostic accuracy independently of the diagnostic algorithms. Compared with single decision makers, groups using the majority or the quorum rule achieved higher true-positive rates (Figure 2A, C, and E) and lower false-positive rates (Figure 2B, D, and F). These effects already occurred at a group size of 3 and further increased with increasing group size. To illustrate, the mean individual true-positive rate under the pattern analysis algorithm is 0.83 (Figure 2C) and the mean individual false-positive rate is 0.17 (Figure 2D). In contrast, combining 3 independent raters using the majority rule results in a true-positive rate of 0.91 (Figure 2C) and a false-positive rate of 0.14 (Figure 2D). When we compared the 2 different types of malignant lesions in study 2 (melanoma and basal cell carcinoma), we found that the true-positive rate increased with increasing group size for both types (eFigure 1 in the Supplement). Generally, improvements stabilized at a group size of approximately 10.
Groups using the quorum rule outperformed the best individual in that group. The quorum rule achieves higher true-positive rates (Figure 2A) and lower false-positive rates (Figure 2B) or higher true-positive rates (Figure 2C and E) and comparable or slightly higher false-positive rates (Figure 2D and F). The majority rule outperforms the best individual in some of the cases (Figure 2A-D) and achieves higher true-positive rates (Figure 2E) but also higher false-positive rates (Figure 2F) in other cases.
When applying the collective intelligence rules to any of the other 3 diagnostic algorithms aimed at differentiating between melanoma and benign melanocytic lesions (the ABCD rule, the Menzies method, and the 7-point checklist in study 1), we obtained qualitatively similar results (Table 2 and eFigures 2-4 in the Supplement).
In the analysis above, we always set the quorum threshold halfway between the mean true- and false-positive rates of the raters. However, a key advantage of the quorum rule compared with the majority rule is that it can be adjusted to put more weight on improving the true- or the false-positive rate.27,28 Generally, higher quorum thresholds tend to decrease true- and false-positive rates because more evaluations for condition present (eg, melanocytic SL) are needed to classify an SL as condition present. Conversely, lower quorum thresholds tend to increase true- and false-positive rates. This trade-off between true- and false-positive rates is well-known for individual decision makers,25,29 and here it is present at the group level.
To illustrate the flexibility of the quorum threshold, we used a range of different quorum thresholds (range, 0-1, with increments of 0.05) and calculated for each threshold the associated true- and false-positive rates. Figure 3 shows the results of these analyses for the first-step, pattern analysis, and 3-point checklist diagnostic algorithms illustrating the trade-off between the true- and false-positive rates at the collective level. The other 3 diagnostic algorithms showed a similar pattern (eFigure 5 in the Supplement).
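The threshold sweep behind these curves can be sketched as follows. The input layout (`fractions[i]` as the share of raters in a group who rated lesion `i` as condition present, `truth[i]` as the histopathologic label) and the function name are hypothetical; the use of "greater than or equal to" follows the classification rule given in the Methods.

```python
def roc_points(fractions, truth, step=0.05):
    """Return (threshold, TPR, FPR) for quorum thresholds from 0 to 1 in equal steps."""
    positives = sum(truth)
    negatives = len(truth) - positives
    steps = round(1 / step)
    points = []
    for k in range(steps + 1):
        # Dividing by the step count avoids accumulating floating-point error.
        threshold = k / steps
        decisions = [f >= threshold for f in fractions]
        tp = sum(d and t for d, t in zip(decisions, truth))
        fp = sum(d and not t for d, t in zip(decisions, truth))
        points.append((threshold, tp / positives, fp / negatives))
    return points
```

Plotting the true-positive rate against the false-positive rate across the returned points gives one curve per diagnostic algorithm; a threshold of 0 classifies every SL as condition present, so both rates equal 1 at that end of the curve.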
Our results show that 2 well-known collective intelligence rules that combine the independent assessments of multiple raters can improve performance in detection of skin cancer (ie, increase the true-positive rate and decrease the false-positive rate). Specifically, we show that collective intelligence increases the diagnostic accuracy of melanoma in study 1 and of melanoma and basal cell carcinoma in study 2. Given that we tested our collective intelligence approach in different scenarios (2 independent data sets and 6 diagnostic algorithms), this result appears to be particularly robust.
The majority rule requires no prior information before implementation because for any given SL, it follows the decision of the majority of the raters. In contrast, the quorum rule requires information before implementation because it is based on a quorum threshold that has to be set below the mean true-positive rate and above the mean false-positive rate of the raters. Although the quorum rule thus requires some prior information, it is more flexible than the majority rule. First, the majority rule only works well when the mean true-positive rate is well above 50% and the mean false-positive rate is well below 50%.22-24 The quorum rule, however, is more flexible and is predicted to work whenever the quorum threshold is set between the true- and false-positive rates of the raters.25 A second benefit of the quorum rule is that the threshold can be shifted upward or downward, depending on which of the 2 types of errors (ie, false-positive or false-negative) is deemed more important to prevent. In dermoscopy, the main goal is to maximize the true-positive rate while maintaining an acceptable degree of false-positive findings.30 A majority rule has a fixed threshold and thus does not allow for such adjustments.
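The 50% condition for the majority rule can be illustrated with a simple binomial calculation. The sketch below assumes an idealized case in which all raters are independent and share the same probability of a positive call; that assumption is not claimed by the study, and the function name is hypothetical.

```python
from math import comb


def majority_positive_probability(p: float, n: int) -> float:
    """Probability that more than half of n independent raters vote positive.

    Each rater is assumed to vote positive with the same probability p.
    For a lesion with the condition, p plays the role of the individual
    true-positive rate; for a lesion without it, the false-positive rate.
    """
    k_min = n // 2 + 1  # smallest number of positive votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    for n in (1, 3, 5, 7, 9, 11):
        print(n,
              round(majority_positive_probability(0.83, n), 3),
              round(majority_positive_probability(0.17, n), 3))
```

With the individual rates reported for pattern analysis (0.83 and 0.17), this idealized model gives roughly 0.92 and 0.08 for groups of 3, whereas the observed values were 0.91 and 0.14. The gap may reflect raters whose errors are partly correlated or who differ in skill, neither of which this sketch captures. When p is exactly 0.5, the majority probability stays at 0.5 for every odd n, which is why the rule offers no gain near that boundary.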
An important future question is to understand the mechanisms underlying the observed collective improvement. A first mechanism could be that some poor performers bring down the mean individual accuracy. When combining decisions, these erroneous decisions get filtered out. This explanation seems unlikely because in our data sets, the vast majority of individual raters is outperformed by the collective approach. A second possibility is that raters differ in their relative ability to evaluate the different cues used for diagnosis. For example, the 3-point checklist requires the evaluation of 3 cues (asymmetry, atypical network, and blue-white structure). If the errors that different raters make when evaluating these different cues are not perfectly correlated, then this could give rise to collective improvement.31,32 Such a scenario would be an example of the importance of diversity for collective intelligence.33,34
At present, collective intelligence is rarely used in medical decision making, and few studies have investigated the potential of such an approach, including King et al,19 Hukkinen et al,35 Duijm et al,36 Farnetani et al,37 and Wolf et al.38 Technological developments could play an important role in facilitating a collective intelligence approach. Online exchange of information (eg, images) avoids the necessity of seeing a medical specialist and would allow for a relatively quick assessment by multiple experts. In skin cancer diagnostics, a number of technological developments are ongoing in this direction. For example, mobile teledermatology investigates the possibility of people taking pictures of SLs with mobile-phone apps, which are then made available to dermoscopists. This approach shows promising rates of accuracy39-42 and would be highly compatible with a collective intelligence approach.
An important cost of a collective intelligence approach is the extra viewing time by medical specialists. These additional costs have to be weighed against the potential benefits: a higher true-positive rate could decrease mortality risk, and a lower false-positive rate could reduce financial (fewer erroneous additional workups) and emotional costs. Further investigations will be necessary to quantify the precise costs and benefits of a collective approach.
Although we evaluated 2 independent data sets and used different diagnostic algorithms and collective classifiers, future studies should address the generality of our results within skin cancer diagnostics and medical diagnostics in general. Further, although the observations were made in a setting closely resembling clinical practice (eg, experienced dermoscopists evaluating images of real SLs), the setting was not identical to clinical practice. In clinical practice, the SL of a patient normally undergoes direct evaluation by a dermatologist using a dermatoscope.
We show that a collective intelligence approach can improve diagnostic accuracy substantially in skin cancer detection by increasing true-positive and decreasing false-positive rates. Our results, in combination with rapid developments in technological possibilities, suggest that collective intelligence might be a viable method in the ongoing efforts to reduce skin cancer–related mortality rates.
Corresponding Author: Ralf H. J. M. Kurvers, PhD, Center for Adaptive Rationality, Max Planck Institute for Human Development, Lentzeallee 94, Berlin 14195, Germany.
Accepted for Publication: July 25, 2015.
Published Online: October 21, 2015. doi:10.1001/jamadermatol.2015.3149.
Author Contributions: Dr Kurvers had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Kurvers, Krause, Wolf.
Acquisition, analysis, or interpretation of data: Kurvers, Krause, Argenziano, Zalaudek.
Drafting of the manuscript: Kurvers, Krause, Wolf.
Critical revision of the manuscript for important intellectual content: Kurvers, Krause, Argenziano, Zalaudek.
Statistical analysis: Kurvers.
Obtained funding: Krause, Wolf.
Administrative, technical, or material support: Zalaudek, Wolf.
Study supervision: Krause.
Conflict of Interest Disclosures: None reported.
Funding/Support: This study was supported by Leibniz Competition Grant SAW-2013-IGB-2 from the Leibniz Association (Drs Krause and Wolf) and by the Rubicon Grant 825.11.014 from the Netherlands Organisation for Scientific Research (Dr Kurvers).
Role of the Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.