Selection of studies according to inclusion and exclusion criteria at different stages of the meta-analysis.
FN indicates false negative; FP, false positive; TN, true negative; and TP, true positive.
Bivariate summary ROC curves comparing computer-aided diagnosis (CAD) and dermatologists for the detection of melanoma vs benign lesions in studies when both methods are available (A). Bivariate summary ROC curves comparing studies on automated systems for the detection of melanoma vs benign lesions using independent and nonindependent test sets (B), different CAD methods (C), and public or proprietary test data sets (D).
eTable 1. General Characteristics of All Included Studies (n=132)
eTable 2. Assessment of Bias Risk and Applicability Concerns of the Included Studies (n=132) Using the QUADAS-2 Tool
eTable 3. Sensitivity, Specificity and Covariable Effects Calculated With Two Identified Outliers (Wolf 2013 and Jiji 2017) Excluded
eFigure. Risk of Bias, Applicability Concerns, and Methodologic Quality
Dick V, Sinz C, Mittlböck M, Kittler H, Tschandl P. Accuracy of Computer-Aided Diagnosis of Melanoma: A Meta-analysis. JAMA Dermatol. 2019;155(11):1291–1299. doi:10.1001/jamadermatol.2019.1375
What is the accuracy of computer-aided diagnosis of melanoma and how does it translate to clinical practice?
In this meta-analysis of 70 studies, the accuracy of computer-aided diagnosis is comparable to that of human experts. However, current studies are heterogeneous and most deviate significantly from real-world scenarios and are prone to biases.
Although computer-aided diagnosis for melanoma appears to be accurate according to the included studies, more standardized and realistic study settings are required to explore its full potential in clinical practice.
The recent advances in the field of machine learning have raised expectations that computer-aided diagnosis will become the standard for the diagnosis of melanoma.
To critically review the current literature and compare the diagnostic accuracy of computer-aided diagnosis with that of human experts.
The MEDLINE, arXiv, and PubMed Central databases were searched to identify eligible studies published between January 1, 2002, and December 31, 2018.
Studies that reported on the accuracy of automated systems for melanoma were selected. Search terms included melanoma, diagnosis, detection, computer aided, and artificial intelligence.
Data Extraction and Synthesis
Evaluation of the risk of bias was performed using the QUADAS-2 tool, and quality assessment was based on predefined criteria. Data were analyzed from February 1 to March 10, 2019.
Main Outcomes and Measures
Summary estimates of sensitivity and specificity and summary receiver operating characteristic curves were the primary outcomes.
The literature search yielded 1694 potentially eligible studies, of which 132 were included and 70 offered sufficient information for a quantitative analysis. Most studies came from the field of computer science. Prospective clinical studies were rare. Combining the results for automated systems gave a melanoma sensitivity of 0.74 (95% CI, 0.66-0.80) and a specificity of 0.84 (95% CI, 0.79-0.88). Sensitivity was lower in studies that used independent test sets than in those that did not (0.51; 95% CI, 0.34-0.69 vs 0.82; 95% CI, 0.77-0.86; P < .001); however, the specificity was similar (0.83; 95% CI, 0.71-0.91 vs 0.85; 95% CI, 0.80-0.88; P = .67). Compared with dermatologists' diagnoses, computer-aided diagnosis showed similar sensitivity and a specificity that was 10 percentage points lower, but the difference was not statistically significant. Studies were heterogeneous, and a substantial risk of bias was found in all but 4 of the 70 studies included in the quantitative analysis.
Conclusions and Relevance
Although the accuracy of computer-aided diagnosis for melanoma detection is comparable to that of experts, the real-world applicability of these systems is unknown and potentially limited owing to overfitting and the risk of bias of the studies at hand.
The rising incidence of melanoma, the benefits of early diagnosis, and the limited access to dermatologic services in some countries have spurred efforts to develop diagnostic systems that are independent of human expertise. Most such systems fall into the category of image-based, automated diagnostic systems and use either clinical or dermoscopic images. The hope is that computer-aided diagnosis (CAD) could provide decision support for physicians or could screen large numbers of images for teleconsultation services. Early studies on CAD of skin lesions relied on hand-crafted feature engineering and segmentation masks. These methods showed promising results and, in experimental settings, reached a diagnostic accuracy comparable to that of human raters.1 In the only prospective controlled trial to date, however, the accuracy of the automated diagnostic system in a real-life setting was lower than expected.2 Recent advances in computer science3 and the introduction of convolutional neural networks and deep-learning–based approaches have revolutionized medical image classification.4 A substantial number of new studies on this topic have appeared since the last meta-analysis was published in 2003,1 but to our knowledge, no study summarizes this body of literature. The aims of this meta-analysis were to critically review the current literature on CAD for melanoma, evaluate its diagnostic accuracy in comparison with that of dermatologists, analyze the association between methodologic differences and performance measures, and explore the applicability of CAD in real-world settings.
We searched the online databases MEDLINE, arXiv, and PubMed Central, using specific search terms for each database for articles published between January 1, 2002, and December 31, 2018, without any additional limitations, as well as the reference lists of included articles. Data were analyzed from February 1 to March 10, 2019. The key words used were melanoma and (diagnosis or detection) for MEDLINE, the word melanoma included in abstracts for arXiv, and melanoma and (diagnosis or detection) and (computer aided or artificial intelligence) for PubMed Central. Additional studies were identified by 2 of us (H.K. and P.T.).
Studies were eligible for inclusion if they investigated the accuracy of CAD systems that were or could be used in a screening setting for cutaneous melanoma. Diagnostic methods for lesions that have already been excised, methods that differentiate only between different types of malignant skin lesions, or methods processing information gained by invasive techniques were excluded. If an article discussed more than 1 diagnostic method, only the best-performing method was included.
The titles and abstracts of retrieved articles were screened by 2 of us (V.D. and H.K.). At this stage, articles were excluded if they were not published in English or German, if an abstract was unavailable, or if the content was not relevant to the research question. The full texts of articles that were not excluded during initial screening were retrieved and studied by the same readers. At that time, articles were excluded if they did not present original data or if their content was not relevant to the research question. Discrepancies regarding inclusion or exclusion of specific studies were discussed and resolved by consensus. One of us (P.T.) was available to be the decision maker in case no consensus could be reached.
We used a standardized data extraction sheet to collect data from all included studies. The extracted data fields were determined in advance and included study, test, and sample characteristics, and outcome measures.
We extracted information on the selection of the study sample, characteristics of included lesions, type of diagnostic reference standard, method of automated analysis, type of classifier, preprocessing, segmentation, and extracted features, if applicable. With regard to the method of automated analysis, we differentiated between hardware-based methods, image analysis with feature extraction (computer vision), and deep learning. According to our definition, hardware-based methods use specific devices beyond simple consumer cameras or smartphones (eg, spectroscopy, multispectral imaging, or photometric stereo device). To obtain outcome measures, we extracted the raw numbers of true and false positives and true and false negatives from each study to calculate summary statistics for the diagnostic accuracy of automated diagnostic systems and, if available, of dermatologists.
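The accuracy measures derived from the extracted confusion-matrix counts reduce to simple ratios. A minimal sketch of that calculation (the function name and example counts are our own, for illustration only):

```python
def diagnostic_accuracy(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Compute sensitivity and specificity from confusion-matrix counts.

    Sensitivity: proportion of melanomas correctly flagged, TP / (TP + FN).
    Specificity: proportion of benign lesions correctly cleared, TN / (TN + FP).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Illustrative counts: 74 of 100 melanomas detected, 84 of 100 benign lesions cleared
sens, spec = diagnostic_accuracy(tp=74, fp=16, tn=84, fn=26)  # → (0.74, 0.84)
```

The same two ratios were computed for dermatologists whenever a study reported their raw reading results alongside those of the automated system.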
We assessed applicability and risk of bias according to the QUADAS-2 tool,5 which we adapted for our specific purpose with regard to sample selection, index test, reference standard, and flow and timing. The studies were also evaluated using the good-quality criteria suggested by Rosado et al,1 which, if followed, should ensure transferability of the results to a real-world setting. These criteria, however, were not applicable to studies from the field of computer science.
For the quantitative meta-analysis, we used the results of studies that either presented absolute numbers for true and false positives and true and false negatives or offered sufficient information to calculate these numbers for the detection of melanoma vs benign lesions. If the same group of authors published more than 1 study with overlapping sets of lesions, only 1 publication was included in the statistical analysis. If studies directly compared CAD with dermatologists, the corresponding sensitivities and specificities of dermatologists were extracted in the same way.
The R Statistics6 package mada7 and the SAS8 macro MetaDAS9 were used for the analyses. A coupled forest plot of sensitivity and specificity was created using RevMan, version 5.3.10 Summary receiver operating characteristic (ROC) curves and mean estimates of sensitivity and specificity with the corresponding 95% CIs were calculated with the bivariate model of Reitsma et al.11 Heterogeneity and the presence of outliers were checked visually, and the presence of between-study variance was tested.12 A bivariate meta-regression with potential covariables was modeled to reduce the heterogeneity noted between the studies. For all studies, the use of independent test sets, the use of proprietary or public test sets, and the method of analysis (computer vision, deep learning, or hardware based) were available and investigated. In sensitivity analyses, potential outliers were excluded to assess their association with the results. All tests were 2 sided, with a significance level of .05.
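The bivariate model of Reitsma et al jointly pools logit-transformed sensitivity and specificity with random effects. As a rough illustration of the pooling idea only, and not of the full bivariate random-effects model the authors fitted with mada and MetaDAS, an inverse-variance fixed-effect pool of one proportion on the logit scale might look like this (all names and study counts here are hypothetical):

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pool_proportion(counts: list[tuple[int, int]]) -> float:
    """Inverse-variance pooling of per-study proportions on the logit scale.

    counts: (events, total) pairs per study, e.g. (TP, TP + FN) for sensitivity.
    A 0.5 continuity correction guards against zero cells.
    """
    num = den = 0.0
    for events, total in counts:
        p = (events + 0.5) / (total + 1.0)
        # Approximate variance of the logit of a binomial proportion
        var = 1.0 / (events + 0.5) + 1.0 / (total - events + 0.5)
        weight = 1.0 / var
        num += weight * logit(p)
        den += weight
    return inv_logit(num / den)

# Pool hypothetical per-study sensitivities given as (TP, TP + FN)
pooled_sens = pool_proportion([(40, 50), (70, 100), (18, 30)])
```

The actual bivariate model additionally estimates between-study variances and the correlation between logit sensitivity and logit specificity, which is what allows a summary ROC curve to be drawn.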
We identified 1694 potentially eligible articles, of which 132 were included in the qualitative analysis and 70 provided sufficient data for a quantitative meta-analysis (Figure 1, Figure 2, and Figure 3).2,13-81 We attributed 105 articles to the field of computer science and 27 to the field of medicine. The methods used were computer vision (n = 58), deep learning (n = 55), and hardware based (n = 19). Artificial neural networks and support vector machines were the most commonly applied machine-learning techniques for classification (eTable 1 in the Supplement).
Fifty studies included only melanocytic lesions, while nonmelanocytic lesions were included in 67 studies. Fifteen studies did not specify whether nonmelanocytic lesions were included (eTable 1 in the Supplement). Twenty-two studies reported melanoma thickness and 28 studies noted the inclusion of in situ melanomas. The median thickness of invasive melanomas ranged from 0.2 to 1.5 mm. Publicly available images were used in 76 studies, while 56 studies used proprietary data sets. Most studies (n = 119) did not select lesions randomly and 13 studies used consecutively collected samples.
According to the QUADAS-2 tool,5 13 studies showed moderate applicability, and the concerns about the applicability of the remaining studies were judged as low (eTable 2 in the Supplement). The risk of bias was judged as high in at least 1 category in all but 4 studies, and 58 studies presented a high risk of bias in at least 2 categories (eTable 2 and eFigure in the Supplement). The quality assessment of the 27 studies from the medical field, using the quality criteria proposed by Rosado et al,1 showed that between 1 and 7 of 9 quality criteria were met (eFigure in the Supplement). The general characteristics of the 70 studies that were included in the quantitative analysis are shown in eTable 1 in the Supplement.
Based on the 70 studies that were included in the quantitative analysis, the summary estimate for the melanoma sensitivity of CAD systems was 0.74 (95% CI, 0.66-0.80) and the specificity was 0.84 (95% CI, 0.79-0.88) (Table). The sensitivity was significantly lower for the 45 studies that used independent test sets (0.51; 95% CI, 0.34-0.69 vs 0.82; 95% CI, 0.77-0.86; P < .001). The summary estimates for the corresponding specificities were similar (0.83; 95% CI, 0.71-0.91 vs 0.85; 95% CI, 0.80-0.88; P = .67).
The 33 studies that used proprietary test sets had a significantly higher sensitivity than the 37 studies that used publicly available test sets (0.87; 95% CI, 0.82-0.91 vs 0.57; 95% CI, 0.44-0.68; P < .001); however, the specificity was significantly lower (0.72; 95% CI, 0.63-0.79 vs 0.91; 95% CI, 0.88-0.94; P < .001).
Computer-aided diagnosis systems using deep learning achieved a sensitivity of 0.44 (95% CI, 0.30-0.59; P < .001) and a specificity of 0.92 (95% CI, 0.89-0.95; P < .001) and behaved significantly differently from the other 2 methods. The 35 studies using computer vision achieved a sensitivity of 0.85 (95% CI, 0.80-0.88; P < .001) and a specificity of 0.77 (95% CI, 0.69-0.84; P < .001), and 9 studies using hardware-based methods reached a sensitivity of 0.86 (95% CI, 0.77- 0.92; P < .001) and a specificity of 0.70 (95% CI, 0.54-0.82; P = .001). Studies based on computer vision and hardware-based methods were not significantly different with respect to sensitivity and specificity.
A multiple bivariate meta-regression model showed that sensitivity and specificity depended significantly on the test set characteristics, the test set source, and the method of analysis. Sensitivity was significantly lower for independent (0.51; 95% CI, 0.34-0.69) versus nonindependent test sets (0.82; 95% CI, 0.77-0.86; P = .002). Specificity was significantly higher for deep learning (0.92; 95% CI, 0.89-0.95) than for computer vision (0.77; 95% CI, 0.69-0.84; P < .001) or hardware-based methods (0.70; 95% CI, 0.54-0.82; P < .001) (Table). Analyses were repeated with studies of Wolf et al77 and Wiselin Jiji38 excluded because their distance from the summary ROC indicated potential outliers; however, their influence on the outcome was minor (eTable 3 in the Supplement). Although the bivariate meta-regression reduced heterogeneity in the meta-analysis, a significant between-study heterogeneity remained in all subgroups (Figure 4).
A subset of 14 studies compared CAD with dermatologists, who reached a sensitivity of 0.88 (95% CI, 0.79-0.93) and a specificity of 0.78 (95% CI, 0.76-0.79). Dermatologists and CAD attained similar sensitivity (sensitivity of CAD: 0.89; 95% CI, 0.87-0.91; P = .496); the specificity of CAD, however, was 10 percentage points lower, although this difference was not statistically significant (0.68; 95% CI, 0.60-0.77; P = .052).
Computer-aided diagnosis of melanoma is an instructive example of the current mismatch between expectations and the actual outcome of machine-learning approaches for accurate predictions and diagnoses in health care. Despite numerous breakthrough studies that demonstrate expert-level accuracy of CAD for melanoma, existing devices or applications are not widely used. A potential reason for this mismatch may be that the results of the studies conducted in this field cannot be transferred directly to clinical practice. We performed this meta-analysis with the aim to better characterize the studies at hand and identify factors that explain the mismatch between expectations and reality.
Most studies on CAD came from the field of computer science, whereas clinical studies were sparse. The studies from the field of computer science typically focused on technical issues, such as preprocessing of images, image augmentation, segmentation, feature extraction and architecture, and fine tuning of the classification algorithm. These studies usually did not address typical limitations of diagnostic studies, such as the potential ambiguity of pathologic reports, the complexity of clinical decision making in the presence of uncertainty, and the types of biases involved in such studies. Clinical information, such as age, anatomic site, and history of melanoma, was rarely used, although it may significantly improve the accuracy for melanoma detection.82
Computer-aided diagnosis studies were highly heterogeneous and at high risk for bias. Half of the studies, and practically all studies from the field of computer science, were conducted in an experimental setting and used images from publicly available databases. Most used convenience samples or, at best, retrospectively collected consecutive samples. Such data sets are usually prone to selection and verification bias. Overfitting is an inherent problem of machine learning that results in a lack of generalizability, especially if the training set and the test set differ from the group of lesions encountered in clinical practice. It is therefore not surprising that studies that used independent test sets reached a lower sensitivity than the remaining studies. That specificity is not affected by overfitting may be explained by class imbalance: most data sets used for training and testing are imbalanced and contain more nevi than melanomas. Because overfitting is more likely if the sample size is small, the sensitivity for melanoma is more likely to be affected by overfitting than the specificity.
Ideally, CAD should be trained and tested in the setting of its intended use. The clinical setting may vary from general population screening to surveillance of high-risk patients with multiple nevi and a personal history of melanoma. Most clinical studies were conducted in specialized referral centers with high melanoma prevalence. The systems were not tested in the general population or as screening tools.
Dermoscopy was the most widely used imaging method for classification. Dermoscopic images can be obtained with different devices, including smartphones, which makes them widely available. Although dermoscopy is regarded as the state-of-the-art in vivo technique for the diagnosis of melanoma, most prospective clinical studies used other methods, such as spectroscopy or multispectral imaging, which require dedicated hardware. Prospective, controlled clinical studies of automated systems based on dermoscopic images or conventional close-up photographs are currently missing.
The restriction to melanocytic lesions was another limitation found in most of the studies in this meta-analysis. If nonmelanocytic lesions were included, it was usually by chance. The restriction to melanocytic lesions limits the applicability of such systems in clinical practice. In a population of individuals with extensive chronic sun damage, a significant portion of the pigmented lesions that are excised or biopsied for diagnostic reasons are nonmelanocytic. A system that is trained to differentiate melanoma from nevi will not be suitable in a setting in which a significant portion of lesions are seborrheic keratoses, solar lentigines, basal cell carcinomas, actinic keratoses, or Bowen disease. Such a system would need preselection of melanocytic lesions by experts, but if experts are needed to operate the system, it would defeat its own purpose. The lack of generalizability and the problem of out-of-distribution lesions, such as rare or unknown disease categories, are limitations that are not addressed by current studies. In a recent study, an otherwise accurate CAD missed amelanotic melanomas, most probably because they were underrepresented in the training set.83
When compared directly, CAD differentiated melanoma from nevi with sensitivity similar to that of dermatologists but with lower specificity. This difference, however, was not statistically significant and may also be attributable to a threshold effect. The choice of an optimal threshold and the tradeoff between sensitivity and specificity remain a problem. Although metrics exist that take into account the consequences of diagnostic decisions, they are rarely used in the realm of machine learning, a point also raised recently in an editorial by Shah et al.84 As shown in Figure 2, most automated diagnostic systems used thresholds that balanced sensitivity and specificity and avoided extreme values. This choice of thresholds makes sense clinically, because a system that maximizes sensitivity at the expense of an intolerably low specificity would be useless in clinical practice.
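The threshold effect can be made concrete: sliding a single operating threshold over a classifier's melanoma scores trades sensitivity against specificity, tracing out the ROC curve. A toy sketch with invented scores (none of these numbers come from the included studies):

```python
def sens_spec_at(threshold: float,
                 melanoma_scores: list[float],
                 benign_scores: list[float]) -> tuple[float, float]:
    """Sensitivity and specificity when lesions scoring >= threshold are called melanoma."""
    sens = sum(s >= threshold for s in melanoma_scores) / len(melanoma_scores)
    spec = sum(s < threshold for s in benign_scores) / len(benign_scores)
    return sens, spec

# Toy classifier scores (higher = more melanoma-like)
mel = [0.9, 0.8, 0.7, 0.4]
ben = [0.6, 0.3, 0.2, 0.1]

lenient = sens_spec_at(0.35, mel, ben)  # high sensitivity, lower specificity
strict = sens_spec_at(0.75, mel, ben)   # lower sensitivity, high specificity
```

The same underlying classifier thus yields different sensitivity-specificity pairs depending only on where the threshold is set, which is why a comparison of single operating points can mask an equivalent underlying discrimination ability.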
This meta-analysis has limitations. Because of the heterogeneity of the studies, the summary estimates of the quantitative part must be interpreted with caution and in light of the methodologic quality of the studies. Two studies in particular appeared as visual outliers in summary ROC space. Wiselin Jiji et al38 used a support vector machine with significantly lower performance than other competitors in the International Symposium on Biomedical Imaging 2017 challenge, suggesting that the implementation may have been suboptimal. Wolf et al77 used a smartphone-based approach on a retrospective convenience sample without specific details on the technical implementation.
It is likely that part of the mismatch between promising experimental results and limited usefulness in reality can be attributed to issues beyond accuracy. Dermatologists, who are regarded as the experts in the field of melanoma diagnosis, probably benefit the least and feel threatened the most. There is a fear that less-skilled physicians or even nonmedical personnel will use such systems to deliver a service that should be restricted to dermatologists. One could argue that not all dermatologists are experts in dermoscopy and that even experts could benefit from computer assistance with repetitive tasks, such as comparing sequential images. A successful CAD would therefore most probably enhance and support dermatologists rather than replace them. It is currently unclear in which setting and for which task CAD systems are most useful, but if the setting and the tasks are unclear, the systems cannot be trained and tested sufficiently. If the systems are used in a setting in which they are not accepted or for a task for which they have not been trained, they will be wasted, even if the technology is exciting and accurate.
Accepted for Publication: April 17, 2019.
Corresponding Author: Harald Kittler, MD, ViDIR Group, Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090 Vienna, Austria (email@example.com).
Published Online: June 19, 2019. doi:10.1001/jamadermatol.2019.1375
Author Contributions: Drs Kittler and Tschandl had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Dick, Kittler.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Dick, Mittlböck, Kittler, Tschandl.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Dick, Mittlböck.
Administrative, technical, or material support: Kittler, Tschandl.
Supervision: Kittler, Tschandl.
Conflict of Interest Disclosures: Dr Kittler reported nonfinancial support from Fotofinder and nonfinancial support from Derma Medical Systems outside the submitted work. Dr Tschandl reported grants from MetaOptima Technology Inc outside the submitted work. No other disclosures were reported.