Figure 1.  Representative Example to Calculate the Highest Malignancy Output With Multiple Photographs of 1 Patient

A, The blob detector detects numerous blobs from the unprocessed clinical image. B, The fine image selector determines whether each detected blob is a skin lesion and excludes images with inadequate quality or those with a nonspecific diagnosis. C, The disease classifier analyzes each lesion and produces 178 outputs. For example, the output for the perinasal nodular lesion was (Top1: basal cell carcinoma, 0.67; Top2: seborrheic keratosis, 0.19; Top3: wart, 0.05; Top178: infantile eczema, 0.00). D, Calculation of the malignancy output uses the 178 outputs of each lesion with the following formula: malignancy output = (basal cell carcinoma output + squamous cell carcinoma output + squamous cell carcinoma in situ output + keratoacanthoma output + malignant melanoma output) + 0.2 × (actinic keratosis output + ulcer output). The number shown is the malignancy output multiplied by 100. The perinasal lesion was marked with a red box because its final malignancy output (67) was higher than the fixed T90/T80 thresholds (the thresholds at which the sensitivity of the algorithm was 90% or 80%) (25.45/46.87). Among all malignancy outputs in 1 patient, the highest malignancy output was used to draw the receiver operating characteristic curve. The image was generated by a style-based generative adversarial network and is not an actual person.19

Figure 2.  Example of the Detection of Basal Cell Carcinoma of the Periocular Area of a 79-Year-Old Man From the Dermatology Data Set

The rectangles were colored when the malignancy output was higher than the threshold (T80, red; T90, orange). The lesional blobs of basal cell carcinoma were strongly detected in the periocular area. There were several seborrheic keratoses on the cheek and brow, and the algorithm correctly diagnosed them as benign. The final malignancy output for the test individual in this series of 3 photographs was 94, which was the highest malignancy output.

Figure 3.  Receiver Operating Characteristic (ROC) Curves With the Dermatology (DER) and Plastic Surgery (PS) Validation Data Sets and the Comparison With Experts and the General Public

A, DER + PS data set (325 images from 80 patients; area under the ROC curve [AUC] of the algorithm = 0.919). B, DER + PS data set (2844 images from 673 patients; AUC = 0.910). C, DER data set (170 images from 40 patients; AUC = 0.868). D, DER data set (1570 images from 386 patients; AUC = 0.896). E, PS data set (155 images from 40 patients; AUC = 0.983). F, PS data set (1274 images from 287 patients; AUC = 0.954). Addition symbols indicate the test participants’ mean sensitivity and specificity for malignant vs nonmalignant, and multiplication symbols indicate their mean sensitivity and specificity for whether a biopsy is required; both lie near or slightly to the upper left of the algorithm’s curve, as shown in A, C, and E. Compared with the dermatologists’ individual sensitivity and specificity, the algorithm demonstrated relatively better performance in C than in A or E; fewer dermatologists and dermatology residents are located above and to the left of the algorithm’s curve in C. The black star and diamond points are the sensitivity and specificity of the algorithm at the 90% (T90) and 80% (T80) sensitivity thresholds. The general public was asked whether there was a possibility of skin cancer necessitating a visit to a dermatologist, and their sensitivity and specificity were calculated from these answers. The sensitivity of the general public was 50.1%, indicating that half of the malignant lesions could be overlooked.

Table 1.  Summary of the Validation Data Set and the Corresponding Demographic Information
Table 2.  Summary of the AUC, F1 Score, and Youden Index Score of the Algorithm and Test Participantsa
Original Investigation
December 4, 2019

Keratinocytic Skin Cancer Detection on the Face Using Region-Based Convolutional Neural Network

Author Affiliations
  • 1I Dermatology Clinic, Seoul, Korea
  • 2Department of Dermatology, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
  • 3LG Sciencepark, Seoul, Korea
  • 4Department of Plastic and Reconstructive Surgery, Kangnam Sacred Hospital, Hallym University College of Medicine, Seoul, Korea
  • 5Department of Plastic and Reconstructive Surgery, Chonnam National University Medical School, Gwangju, Korea
  • 6Department of Dermatology, Seoul National University Bundang Hospital, Seongnam, Korea
  • 7Department of Dermatology, Asan Medical Center, Ulsan University College of Medicine, Seoul, Korea
JAMA Dermatol. 2020;156(1):29-37. doi:10.1001/jamadermatol.2019.3807
Key Points

Question  Can an algorithm using a region-based convolutional neural network detect skin lesions in unprocessed clinical photographs and predict risk of skin cancer?

Findings  In this diagnostic study, a total of 924 538 training image-crops including various benign lesions were generated with the help of a region-based convolutional neural network. The area under the receiver operating characteristic curve for the validation data set (2844 images from 673 patients comprising 185 malignant, 305 benign, and 183 normal conditions) was 0.910, and the algorithm’s F1 score and Youden index score were comparable with those of dermatologists and surpassed those of nondermatologists.

Meaning  With unprocessed photographs, the algorithm may be able to localize and diagnose skin cancer without manual preselection of suspicious lesions by dermatologists.

Abstract

Importance  Detection of cutaneous cancer on the face using deep-learning algorithms has been challenging because various anatomic structures create curves and shades that confuse the algorithm and can potentially lead to false-positive results.

Objective  To evaluate whether an algorithm can automatically locate suspected areas and predict the probability of a lesion being malignant.

Design, Setting, and Participants  Region-based convolutional neural network technology was used to create 924 538 possible lesions by extracting nodular benign lesions from 182 348 clinical photographs. After these possible lesions were annotated manually or automatically based on image findings, convolutional neural networks were trained with 1 106 886 image crops to locate and diagnose cancer. Validation data sets (2844 images from 673 patients; mean [SD] age, 58.2 [19.9] years; 308 men [45.8%]; 185 patients with malignant tumors, 305 with benign tumors, and 183 free of tumor) were obtained from 3 hospitals between January 1, 2010, and September 30, 2018.

Main Outcomes and Measures  The area under the receiver operating characteristic curve, F1 score (harmonic mean of precision and recall; range, 0.000-1.000), and Youden index score (sensitivity + specificity −1; 0%-100%) were used to compare the performance of the algorithm with that of the participants.

Results  The algorithm analyzed a mean (SD) of 4.2 (2.4) photographs per patient and reported the malignancy score according to the highest malignancy output. The area under the receiver operating characteristic curve for the validation data set (673 patients) was 0.910. At a high-sensitivity cutoff threshold, the sensitivity and specificity of the model with the 673 patients were 76.8% and 90.6%, respectively. With the test partition (325 images; 80 patients), the performance of the algorithm was compared with the performance of 13 board-certified dermatologists, 34 dermatology residents, 20 nondermatologic physicians, and 52 members of the general public with no medical background. When the disease screening performance was evaluated at high-sensitivity areas using the F1 score and Youden index score, the algorithm showed a higher F1 score (0.831 vs 0.653 [0.126], P < .001) and Youden index score (0.675 vs 0.417 [0.124], P < .001) than those of nondermatologic physicians. The accuracy of the algorithm was comparable with that of dermatologists (F1 score, 0.831 vs 0.835 [0.040]; Youden index score, 0.675 vs 0.671 [0.100]).

Conclusions and Relevance  The results of the study suggest that the algorithm could localize and diagnose skin cancer without preselection of suspicious lesions by dermatologists.

Introduction

Convolutional neural networks (CNNs) have shown expert-level performance in the fields of ophthalmology, dermatology, and radiology.1-9 CNNs have produced successful results in ophthalmologic emergency determination,10 brain hemorrhage detection with computed tomographic scans,4 and multiple-class cardiopulmonary disease classification with radiographic images.7 In ophthalmology, the IDx-DR diagnostic system (IDx Technologies Inc)11 was approved by the US Food and Drug Administration to be used independently in screening for diabetic retinopathy.

Several studies have demonstrated expert-level performance of CNNs in the diagnosis of skin cancer, particularly in distinguishing between malignant melanoma and nevus.2,3,5,6,8,9,12-14 However, the training and validation of previous studies were limited to cases from hospital archives, which portends a risk of selection bias because only photographs of lesions suspected of being malignant by dermatologists were included. In addition, the images were tested with the ideal composition and quality provided by dermatologists. To confirm its scalability, the algorithm should be able to distinguish malignant lesions from various benign lesions, normal skin structures, and even general objects. In addition, the algorithm should be able to localize lesions of interest autonomously within a clinical photograph.

Basal cell carcinoma and squamous cell carcinoma are the most frequently diagnosed skin cancers in white persons,15 and most of these carcinomas occur on sun-exposed areas, such as the head and neck. There are several normal anatomic structures creating curves and shades in the head and neck areas, such as the nose, ears, and eyes, which are easily distinguished by humans but can confuse the algorithm. The fact that various anatomic structures could affect the algorithm makes it difficult to detect cancer in the head, face, and neck using a deep learning algorithm. However, considering the substantial cosmetic and functional disability that may result from delayed diagnosis of skin cancer in the head and neck areas, early screening and detection are important.

The region-based CNN (R-CNN) used in this research is a particular type of deep learning technology that can detect the location of desired objects.16 We sought to demonstrate that our algorithm, which uses both R-CNN and CNN, can automatically locate suspected areas and predict the probability of a lesion being malignant in an Asian population.

Methods

We created a training data set (primary, secondary, and tertiary training data set; eTable 1 in the Supplement) and a validation data set (dermatology [DER] from Asan Medical Center; plastic surgery [PS] from Chonnam National University Department of Plastic Surgery and Hallym University Department of Plastic Surgery) (Table 1) from the 3 university hospitals as well as web archives. The study was approved with waiver of informed consent by the Hallym University and Asan Medical Center, Seoul, Korea, institutional review boards.

To train the algorithm, we used as the primary training data set the clinical photographs from Asan Medical Center collected from 2003 to 2016, the MED-NODE data set,17 the Seven-Point Checklist Dermatology data set,18 and images from search engines. Because the lesions of common benign disorders (ie, nevus, lentigo, acne, and seborrheic keratosis) and lesions of normal skin structures (ie, nose, ear) were observed in the primary training data set, we cropped and annotated these lesions based on image findings to create a secondary training data set (eFigure 1 in the Supplement). Unlike the secondary training data set, which was annotated manually, the tertiary training data set was annotated fully automatically using temporal models that were trained with the primary and secondary training data sets. We used the primary, secondary, and tertiary training data sets as the final training data set, which comprised 1 106 886 images (eTable 1 in the Supplement).

Our algorithm comprised 3 parts: (1) the blob detector (a blob is a point or region in the image that differs in properties, such as brightness or color, from its surroundings), which detects possible lesions of interest and generates numerous raw blobs; (2) the fine image selector, which excludes inadequate image blobs and general object blobs; and (3) the disease classifier, which predicts the chance of cancer (Figure 1). We trained the blob detector using faster-RCNN,20 and we trained both the fine image selector and the disease classifier using CNNs (SENet,21,22 SE-ResNeXt-50, and SE-ResNet-50). The detailed process of the model training is described in the eMethods in the Supplement. A web demonstration of the model (http://rcnn.modelderm.com) is available and accessible via smartphones or personal computers.
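As a rough illustration of this 3-stage flow, the sketch below chains 3 stand-in callables for the blob detector, fine image selector, and disease classifier; the function names, interfaces, and image-array slicing are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the 3-stage inference flow; the callables stand in for
# the trained faster-RCNN blob detector, the fine image selector CNN, and the
# 178-class disease classifier (names and interfaces are assumptions).
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

def screen_photograph(
    image,
    detect_blobs: Callable[[object], Sequence[Box]],   # stage 1: blob detector
    is_adequate: Callable[[object], bool],             # stage 2: fine image selector
    classify: Callable[[object], Dict[str, float]],    # stage 3: disease classifier
) -> List[Dict[str, float]]:
    """Return one class-probability dict per accepted lesional blob."""
    per_blob_outputs: List[Dict[str, float]] = []
    for (x1, y1, x2, y2) in detect_blobs(image):
        crop = image[y1:y2, x1:x2]        # assumes a NumPy-style image array
        if not is_adequate(crop):         # drop non-lesion and poor-quality blobs
            continue
        per_blob_outputs.append(classify(crop))
    return per_blob_outputs
```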

To validate our model, we made 2 separate validation data sets that comprised photographs of the face from the clinical images of 3 university hospitals (Table 1 and eTables 2 and 3 in the Supplement). The DER data set comprised benign and malignant tumors, whereas the PS data set comprised malignant tumors and individuals without tumors. Overall, the validation data sets included 2844 images from 673 patients (mean [SD] age, 58.2 [19.9] years; 308 men [45.8%]); 185 patients with malignant tumors, 305 with benign tumors, and 183 free of tumor.

The photographs included in the DER data set (1570 images from 386 patients; mean [SD] age, 58.5 [20.1] years; 189 men [49.0%]) were collected from January 1 to June 30, 2018. We included all head and neck photographs for which the initial clinical diagnosis corresponded to any of the 10 tumorous disorders (basal cell carcinoma, squamous cell carcinoma, malignant melanoma, squamous cell carcinoma in situ, seborrheic keratosis, actinic keratosis, hemangioma, pyogenic granuloma, melanocytic nevus, and dermatofibroma). To prevent data overlap between the training and validation data sets, we selected validation images taken after 2018 at the time of initial diagnosis and biopsy. After reviewing the pathologic diagnoses, we confirmed that 81 patients were diagnosed with 3 malignant tumorous disorders (basal cell carcinoma, squamous cell carcinoma, and squamous cell carcinoma in situ), 251 patients with 6 benign tumorous disorders (seborrheic keratosis, actinic keratosis, hemangioma, pyogenic granuloma, melanocytic nevus, and dermatofibroma), and 54 patients with other benign disorders. Because there was no biopsy-proven melanoma case in the validation data set, the validation data set included 3 keratinocytic malignant tumors and various benign tumorous conditions. The photographs of the DER data set were taken at various locations, such as the outpatient room, photography studio, and treatment room; therefore, the composition, background, and lighting were not uniform. Both close-up and entire facial images were included at various angles.

The photographs of the PS data set (1274 images from 287 patients; mean [SD] age, 57.9 [19.6] years; 119 men [41.5%]) were acquired from the departments of plastic surgery of Hallym University and Chonnam National University. The images from Hallym University were acquired from January 1, 2014, to September 30, 2018, and those from Chonnam National University were collected from January 1, 2010, to December 31, 2017. All photographs of the PS data set were taken in a photography studio within the plastic surgery clinic. The photographs were taken from frontal, side, and diagonal angles in a unified composition with constant background and lighting; however, there were no close-up images of the lesions. Two dermatologists (S.S.H., I.J.M.) confirmed the location and diagnosis of the lesion for the entire PS data set.

The algorithm analyzed all photographs of each patient, with a mean (SD) of 4.2 (2.4) images per patient. At first, the blob detector and fine-image selector detected lesional blobs with adequate quality. Then, the disease classifier analyzed each blob to calculate the probability of malignancy. The malignancy output was defined as the sum of outputs of malignant and premalignant disorders: malignancy output = (basal cell carcinoma output + squamous cell carcinoma output + squamous cell carcinoma in situ output + keratoacanthoma output + malignant melanoma output) + 0.2 × (actinic keratosis output + ulcer output). The receiver operating characteristic (ROC) curve was drawn using the highest malignancy output value among all malignancy outputs obtained from all clinical images of each patient.
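The calculation below is a direct transcription of the malignancy output formula and the per-patient maximum described above; the class-name keys used to index the 178 outputs are assumed labels, not the classifier's actual output format.

```python
# Sketch of the malignancy output calculation and the per-patient maximum.
FULL_WEIGHT = ("basal cell carcinoma", "squamous cell carcinoma",
               "squamous cell carcinoma in situ", "keratoacanthoma",
               "malignant melanoma")
PARTIAL_WEIGHT = ("actinic keratosis", "ulcer")   # weighted by 0.2

def malignancy_output(class_outputs: dict) -> float:
    """Combine one lesion's 178 class outputs into a single malignancy score."""
    full = sum(class_outputs.get(name, 0.0) for name in FULL_WEIGHT)
    partial = sum(class_outputs.get(name, 0.0) for name in PARTIAL_WEIGHT)
    return full + 0.2 * partial

def patient_malignancy_score(all_blob_outputs: list) -> float:
    """Highest malignancy output over every blob in every photograph of one patient."""
    return max((malignancy_output(o) for o in all_blob_outputs), default=0.0)
```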

Considering the limited concentration span of the human reader, we randomly chose 325 test images of 80 patients (20 patients free of tumor, 20 patients with benign tumors, and 40 patients with malignant tumors) for reader testing. The test participants comprised 13 board-certified dermatologists, 34 dermatology residents, 20 nondermatologic physicians, and 52 members of the general public with no medical training. After reviewing all photographs of a patient, each physician classified the patient as skin cancer (malignancy), possible skin cancer (biopsy recommended), or benign lesion or no lesion. The general public classifications were possible skin cancer (dermatologist visit recommended) or benign lesion or no lesion. The ROC area of the participants was calculated with 2 sensitivity and specificity points (biopsy or not and malignancy or not) using the following formula3,23: (a, b) = (sensitivity, specificity) of malignancy or not; (c, d) = (sensitivity, specificity) of biopsy or not; ROC area = ((100.0 − b) × a × 0.5 + (b − d) × (c + a) × 0.5 + d × (100.0 + c) × 0.5)/10 000.0.
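The snippet below transcribes the 2-point ROC area formula above; the example operating points are illustrative only.

```python
def roc_area_two_points(a: float, b: float, c: float, d: float) -> float:
    """Two-point ROC area (returned as a 0-1 fraction) from percentages:
    (a, b) = (sensitivity, specificity) for malignancy or not,
    (c, d) = (sensitivity, specificity) for biopsy or not."""
    return ((100.0 - b) * a * 0.5
            + (b - d) * (c + a) * 0.5
            + d * (100.0 + c) * 0.5) / 10_000.0

# Illustrative reader: 70%/90% for malignancy or not, 85%/80% for biopsy or not.
print(roc_area_two_points(a=70.0, b=90.0, c=85.0, d=80.0))  # 0.8525
```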

Statistical Analysis

Because the main purpose of our model was disease screening, we evaluated the performance of the algorithm at the high-sensitivity areas of the ROC curve. We defined T90 and T80 as the thresholds at which the sensitivity of the algorithm was 90% or 80%, respectively, with the benign and malignant nodular disorders of the DER data set. At the fixed thresholds (T90 or T80), we compared the F1 score (harmonic mean of precision and recall; range, 0.000-1.000) and Youden index score (sensitivity + specificity −1; 0%-100%) of the model with those of the participants. For comparison of the algorithm with the participants, a 2-tailed, 1-sample t test was applied. Findings were considered significant at P < .05.
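A minimal sketch of the screening metrics and the reader comparison follows, assuming fractional inputs and SciPy for the 1-sample t test (the authors' tooling for this test is not specified); the reader values are illustrative only.

```python
from scipy import stats

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def youden_index(sensitivity: float, specificity: float) -> float:
    """Sensitivity + specificity - 1, with both given as fractions."""
    return sensitivity + specificity - 1.0

# 2-tailed, 1-sample t test: do the readers' scores differ from the algorithm's
# single score at a fixed threshold? (Reader values below are illustrative.)
reader_f1 = [0.62, 0.71, 0.55, 0.68, 0.74, 0.60]
t_stat, p_value = stats.ttest_1samp(reader_f1, popmean=0.831)
print(t_stat, p_value, p_value < .05)
```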

For calculating the average precision (AP) score, 2 dermatologists (S.S.H., I.J.M.) annotated the location of the malignant lesion on validation images to determine the ground truth. If the box proposed by the algorithm overlapped with that of the ground truth, the lesion was regarded as malignant. Then, we calculated the AP score using the sklearn.metrics.average_precision_score function (y_true = intersection over union >0 for malignant area; y_score = malignancy output). We used R, version 3.5.3 (pROC package, version 1.14; R Foundation), for calculating the area under the ROC curve (AUC).
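A minimal example of the AP calculation with the sklearn function named above; the y_true and y_score values are illustrative only.

```python
from sklearn.metrics import average_precision_score

# y_true: 1 if the proposed blob overlaps (IoU > 0) a malignant ground-truth box.
# y_score: the blob's malignancy output. All values below are illustrative.
y_true = [1, 0, 1, 0, 0, 1, 0]
y_score = [0.67, 0.05, 0.42, 0.12, 0.02, 0.81, 0.30]
print(average_precision_score(y_true, y_score))
```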

Results

Using R-CNN and CNNs, we created a skin cancer screening algorithm comprising a blob detector, fine-image selector, and disease classifier (Figure 1). The blob detector generates numerous possible blobs from a given image, and the fine-image selector excludes inadequate blobs. The disease classifier then predicts the probability that a lesion is malignant, as shown in the example presented in Figure 2.

To validate our algorithm, we used the DER validation data set (1570 images; 386 patients) and PS validation data set (1274 images; 287 patients). The AUC of the disease classifier was 0.910 (DER + PS), 0.896 (DER), and 0.954 (PS). The AP score for the malignancy class was 0.632 (DER + PS), and the AP score for the normal class was 0.992 (DER + PS). The mean (SD) malignancy output for the blobs with a malignant ground truth was 0.346 (0.328), and that for the blobs with a benign ground truth was 0.030 (0.039). Because cancer screening is the main purpose of our algorithm, the performance of the algorithm was tested at high-sensitivity areas. At the threshold of T90, the sensitivity and specificity of the model were 89.2% and 77.9%, respectively (DER + PS). At the threshold of T80, the sensitivity and specificity of the model were 76.8% and 90.6% (DER + PS), respectively (Figure 3D).

The reader test was conducted with the test partition, which included 325 images from 80 patients, with a total of 119 human participants tested (Figure 3 and Table 2). With the test partition (325 images; 80 patients), the AUC of the algorithm was 0.919. The ROC area of the dermatologists was 0.906 (0.021), and that of the nondermatologic physicians was 0.725 (0.068). The F1 score (0.835 [0.040]) and Youden index score (0.671 [0.100]) of the 13 dermatologists were comparable with the F1 score (0.831 at T90 and 0.831 at T80) and Youden index score (0.625 at T90 and 0.675 at T80) of the model (Table 2). The F1 score of the dermatology residents (0.815 [0.045]) was lower than that of the model (0.831 at T90, P = .04; 0.831 at T80, P = .04). In addition, the F1 score of the nondermatologic physicians (0.653 [0.126]) was lower than that of the model (0.831 at T90, P < .001; 0.831 at T80, P < .001). The Youden index score of the nondermatologic physicians (0.417 [0.124]) was also lower than that of the algorithm (0.625 at T90, P < .001; 0.675 at T80, P < .001).

Discussion

Several studies have implemented deep learning models for dermatologic diagnosis and shown their performance to be on par with or superior to that of dermatologists.2,3,5,6,8,9,12-14,24,25 In those studies, the algorithms had been trained with images from hospital archives; the images were primarily of malignant and suspicious benign lesions (ie, malignant melanoma and dysplastic nevus). The trained algorithms showed expert-level performance in skin cancer diagnosis. Esteva et al2 reported an AUC of 0.96 for distinguishing malignant melanoma from nevus and nonmelanoma skin cancer from seborrheic keratosis with macroscopic images. Haenssle et al3 reported an AUC of 0.86 using dermoscopic images, which outperformed most of the 58 dermatologists. In addition, Tschandl et al8 suggested that a combined CNN, which analyzed both dermoscopic and clinical images, could outperform 95 human experts. These advances are expected to assist dermatologists in distinguishing between malignant and benign lesions more accurately if the algorithms are validated in clinical settings.

Because deep learning algorithms require both positive and negative data for training, they sometimes show uncertainty when faced with untrained problems.26 Owing to the diversity of negative data encountered in real-world practice, deep learning algorithms can produce many false-positive results. Hospital archives have few images of common benign lesions because dermatologists mainly evaluate malignant or premalignant disorders. Even though a deep learning algorithm could distinguish between dysplastic nevus and malignant melanoma better than dermatologists, the same algorithm may perform with lower accuracy in discriminating between malignant melanoma and untrained benign lesions, such as subungual hematoma, or between malignant melanoma and untrained normal structures, such as the nostrils.

The algorithm could be used both at the screening stage before the patient visits the physician and at the disease confirmation stage after the consultation. However, an algorithm trained with only hospital-based data sets may have various drawbacks if it were to be used for disease screening purposes owing to the following problems: (1) trivial or nonspecific lesions, because the algorithm could misdiagnose inflammatory acne lesions as malignant owing to the irregular shapes of acne and acne scars; (2) normal structures, such as the dimple on the earlobe due to ear piercings being mistaken for an umbilicated nodule of basal cell carcinoma; (3) lesions with a nonuniform background, such as the curves of the ear, crown around the eyes, and shade under the nose, can affect the result; (4) accessories or general objects, such as tattoos, glasses, and earrings, that should be excluded before assessment; and (5) inadequate photograph quality because the algorithm may perform poorly in diagnosing blurry images if it was trained only with high-quality images.27

For the abovementioned reasons, an artificial intelligence model trained with images from hospital archives may show numerous false-positive results despite good performance with a high-quality validation data set acquired from hospital archives. Considering the low incidence of cutaneous malignant lesions, a high false-positive rate would result in unnecessary referrals, making the algorithm unreliable.

To minimize false-positive results, we tried to collect normal and common benign lesional blobs with the help of R-CNN, a type of CNN that detects the position of a desired object by verifying each proposed region with a CNN after region proposal is performed. Currently, various R-CNNs, such as faster-RCNN,20 YOLO,28 SSD,29 and Mask R-CNN,30 are available. The mean AP values of faster-RCNN on the Visual Object Classes Challenges 2007 and 2012 were 78.8% and 75.9%, respectively.20

There have been several reports on the successful use of R-CNN in fracture detection using radiographic images31,32 and in tumor detection in mammography.33 In a study on onychomycosis, faster-RCNN was used to detect the nail plate and automatically build a training data set of 49 567 onychomycosis images based on image findings.34

The ROC curve of a CNN model is drawn by plotting all sensitivities and specificities for incremental changes in the threshold from 0 to 1. However, a human rater is unfamiliar with using a visual analog scale to evaluate test images and instead usually responds yes or no to the question of malignancy or biopsy. Therefore, the result of a human rater appears as a single point (x-axis = 1 − specificity; y-axis = sensitivity) on the ROC plot, which makes a quantitative comparison between the algorithm’s curve and the dermatologists’ points difficult. In addition, even though the AUC of the ROC curve is an important performance evaluation metric, the performance in the target area of the ROC curve is more important depending on whether the actual use of the model is for disease screening or disease confirmation. In this study, we compared the F1 score and Youden index score of the model with those of the experts at high-sensitivity areas (T90 or T80) for the disease screening task.

Overall, the performance of our algorithm was comparable with that of dermatologists, whereas the algorithm showed superior performance to nondermatologists. In terms of the F1 score at thresholds of T90 and T80, our model (0.831) outperformed nondermatologists (0.653 [0.126]; P < .001). As shown in Figure 3A, the sensitivity of the general public was only 50.1% for malignant cases. Because the algorithm-based test can be periodically repeated with minimal cost, we expect that regular use of the algorithm as a screening tool may improve public health.

Limitations

Although our study demonstrated the potential of deep learning models to be used for screening skin cancer, our models present several limitations. Above all, our algorithm was validated with 1 race (Asian) and within 1 region (South Korea). Because malignant melanoma is a rare cancer among Asian persons, there was no case of malignant melanoma in the validation data sets. In addition, normal findings can vary depending on race/ethnicity, region, and culture. Second, in real-world practice, physicians incorporate far more information in their evaluation than clinical photographs alone. Recently, a multimodal approach has been attempted to reflect various patient information in artificial intelligence decision-making.24 Third, our model demonstrated a tendency to mistake skin markings for lesions, particularly those around malignancy-prone areas, such as the nose. There has recently been a report on the importance of erasing skin markings before obtaining dermoscopic images because they may lead to overdiagnosis.35 In addition, some of the small lesions around the eyes were missed (eFigure 2 in the Supplement). Finally, we took a multistage approach in this study, combining CNN with R-CNN to perform skin cancer screening. However, we expect to be able to train a model in an end-to-end fashion once sufficient data for both benign and malignant lesions, annotated with the diagnosis as well as the location, are available. Showing comparisons with end-to-end trained models will be important in future studies.

Conclusions

In this study, we used the R-CNN technology to build a large data set comprising normal and benign images to solve the problem of false-positive findings in skin cancer detection. The generated data set was used to train the fine-image selector and disease classifier, which successfully localized and diagnosed malignant lesions on the face. In terms of the F1 score and Youden index score, the algorithm accuracy was comparable with that of dermatologists, whereas the diagnostic accuracy of the algorithm was superior to that of nondermatologic physicians. Further research is warranted to assess the algorithm’s performance on individuals of different races/ethnicities as well as additional validation with malignant melanoma.

Article Information

Accepted for Publication: October 14, 2019.

Corresponding Authors: Sung Eun Chang, MD, PhD, Department of Dermatology, Asan Medical Center, Ulsan University College of Medicine, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea (csesnumd@gmail.com); Seong Hwan Kim, MD, Department of Plastic and Reconstructive Surgery, Kangnam Sacred Hospital, Hallym University College of Medicine, 1 Singil-ro, Yeongdeungpo-gu, Seoul 07441, Korea (kalosmanus@naver.com).

Published Online: December 4, 2019. doi:10.1001/jamadermatol.2019.3807

Author Contributions: Drs Han and Moon contributed equally to this work, had full access to all of the data in the study, and take responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Han, Lim, Suh, Lee, Na, Kim, Chang.

Acquisition, analysis, or interpretation of data: Han, Moon, Lim, Na, Kim.

Drafting of the manuscript: Han, Moon, Lim, Suh, Lee, Chang.

Critical revision of the manuscript for important intellectual content: Han, Moon, Suh, Lee, Na, Kim, Chang.

Statistical analysis: Han, Lim, Suh, Lee, Kim.

Administrative, technical, or material support: Han, Lim, Na, Kim, Chang.

Supervision: Han, Suh, Lee, Kim, Chang.

Conflict of Interest Disclosures: Dr Lim is employed by LG Sciencepark. However, the company did not have any role in the study design, data collection and analysis, the decision to publish, or the preparation of this manuscript. No other disclosures were reported.

Additional Contributions: We thank the professors and clinicians who participated in the tests. Kim Sohyun, MA (I Dermatology Clinic), assisted with the survey part of the investigation. There was no financial compensation. We thank the patient depicted in Figure 2 for granting permission to publish this information.

Additional Information: The image in Figure 1 was generated by a style-based generative adversarial network and is not an actual person.

References
1.
Gulshan  V, Peng  L, Coram  M,  et al.  Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs.  JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216PubMedGoogle ScholarCrossref
2.
Esteva  A, Kuprel  B, Novoa  RA,  et al.  Dermatologist-level classification of skin cancer with deep neural networks.  Nature. 2017;542(7639):115-118. doi:10.1038/nature21056PubMedGoogle ScholarCrossref
3.
Haenssle  HA, Fink  C, Schneiderbauer  R,  et al; Reader study level-I and level-II Groups.  Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists.  Ann Oncol. 2018;29(8):1836-1842. doi:10.1093/annonc/mdy166PubMedGoogle ScholarCrossref
4.
Chilamkurthy  S, Ghosh  R, Tanamala  S,  et al.  Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study.  Lancet. 2018;392(10162):2388-2396. doi:10.1016/S0140-6736(18)31645-3PubMedGoogle ScholarCrossref
5.
Fujisawa  Y, Otomo  Y, Ogata  Y,  et al.  Deep-learning–based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis.  Br J Dermatol. 2019;180(2):373-381. doi:10.1111/bjd.16924PubMedGoogle ScholarCrossref
6.
Han  SS, Kim  MS, Lim  W, Park  GH, Park  I, Chang  SE.  Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm.  J Invest Dermatol. 2018;138(7):1529-1538. doi:10.1016/j.jid.2018.01.028PubMedGoogle ScholarCrossref
7.
Rajpurkar  P, Irvin  J, Ball  RL,  et al.  Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists.  PLoS Med. 2018;15(11):e1002686. doi:10.1371/journal.pmed.1002686PubMedGoogle Scholar
8.
Tschandl  P, Rosendahl  C, Akay  BN,  et al.  Expert-level diagnosis of nonpigmented skin cancer by combined convolutional neural networks.  JAMA Dermatol. 2019;155(1):58-65. doi:10.1001/jamadermatol.2018.4378PubMedGoogle ScholarCrossref
9.
Tschandl  P, Codella  N, Akay  BN,  et al.  Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study.  Lancet Oncol. 2019;20(7):938-947. doi:10.1016/S1470-2045(19)30333-XPubMedGoogle ScholarCrossref
10.
De Fauw  J, Ledsam  JR, Romera-Paredes  B,  et al.  Clinically applicable deep learning for diagnosis and referral in retinal disease.  Nat Med. 2018;24(9):1342-1350. doi:10.1038/s41591-018-0107-6PubMedGoogle ScholarCrossref
11.
Abràmoff  MD, Lou  Y, Erginay  A,  et al.  Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning.  Invest Ophthalmol Vis Sci. 2016;57(13):5200-5206. doi:10.1167/iovs.16-19964PubMedGoogle ScholarCrossref
12.
Cho  SI, Sun  S, Mun  J,  et al.  Dermatologist-level classification of malignant lip diseases using a deep convolutional neural network.  [published online August 26, 2019].  Br J Dermatol. doi:10.1111/bjd.18459PubMedGoogle Scholar
13.
Maron  RC, Weichenthal  M, Utikal  JS,  et al; Collaborators.  Systematic outperformance of 112 dermatologists in multiclass skin cancer image classification by convolutional neural networks.  Eur J Cancer. 2019;119:57-65. doi:10.1016/j.ejca.2019.06.013PubMedGoogle ScholarCrossref
14.
Brinker  TJ, Hekler  A, Enk  AH,  et al.  A convolutional neural network trained with dermoscopic images performed on par with 145 dermatologists in a clinical melanoma image classification task.  Eur J Cancer. 2019;111:148-154. doi:10.1016/j.ejca.2019.02.005PubMedGoogle ScholarCrossref
15.
Leiter  U, Eigentler  T, Garbe  C.  Epidemiology of skin cancer.  Adv Exp Med Biol. 2014;810:120-140.PubMedGoogle Scholar
16.
Girshick  R, Donahue  J, Darrell  T, Malik  J. Rich feature hierarchies for accurate object detection and semantic segmentation. Preprint. Posted online October 22, 2014. arXiv 1311.2524.
17.
Giotis  I, Molders  N, Land  S, Biehl  M, Jonkman  MF, Petkov  N.  MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images.  Expert Syst Appl. 2015;42(19):6578-6585. doi:10.1016/j.eswa.2015.04.034Google ScholarCrossref
18.
Kawahara  J, Daneshvar  S, Argenziano  G, Hamarneh  G.  Seven-point checklist and skin lesion classification using multi-task multi-modal neural nets.  [published online April 9, 2018].  IEEE J Biomed Health Inform. 2018. doi:10.1109/JBHI.2018.2824327PubMedGoogle Scholar
19.
Karras  T, Laine  S, Aila  T. A style-based generator architecture for generative adversarial networks. Preprint. Posted online March 29, 2019. arXiv 1812.04948.
20.
Ren  S, He  K, Girshick  R, Sun  J. Faster R-CNN: towards real-time object detection with region proposal networks. Preprint. Posted online January 6, 2016. arXiv 1506.01497.
21.
Hu  J, Shen  L, Sun  G. Squeeze-and-excitation networks. Preprint. Posted online May 16, 2019. arXiv 1709.01507.
22.
He  K, Zhang  X, Ren  S, Sun  J. Deep residual learning for image recognition. Preprint. Posted online December 10, 2015. arXiv 1512.03385.
23.
van den Hout  WB.  The area under an ROC curve with limited information.  Med Decis Making. 2003;23(2):160-166. doi:10.1177/0272989X03251246PubMedGoogle ScholarCrossref
24.
Yap  J, Yolland  W, Tschandl  P.  Multimodal skin lesion classification using deep learning.  Exp Dermatol. 2018;27(11):1261-1267. doi:10.1111/exd.13777PubMedGoogle ScholarCrossref
25.
Yu  C, Yang  S, Kim  W,  et al.  Acral melanoma detection using a convolutional neural network for dermoscopy images.  PLoS One. 2018;13(3):e0193321. doi:10.1371/journal.pone.0193321PubMedGoogle Scholar
26.
Gal  Y. Uncertainty in deep learning [dissertation]. Cambridge, England: University of Cambridge; 2016.
27.
Dodge  S, Karam  L. A study and comparison of human and deep learning recognition performance under visual distortions. Preprint. Posted online May 6, 2017. arXiv 1705.02498.
28.
Redmon  J, Divvala  S, Girshick  R, Farhadi  A. You only look once: unified, real-time object detection. Preprint. Posted online May 9, 2016. arXiv 1506.02640.
29.
Liu  W, Anguelov  D, Erhan  D,  et al. SSD: single shot multibox detector. Preprint. Posted online December 29, 2016. arXiv 1512.02325.
30.
He  K, Gkioxari  G, Dollár  P, Girshick  R. Mask R-CNN. Preprint. Posted online January 24, 2018. arXiv 1703.06870.
31.
Thian  YL, Li  Y, Jagmohan  P, Sia  D, Chan  VEY, Tan  RT. Convolutional neural networks for automated fracture detection and localization on wrist radiographs [published online January 30, 2019]. Radiology Artificial Intelligence. doi:10.1148/ryai.2019180001
32.
Gan  K, Xu  D, Lin  Y,  et al.  Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments.  Acta Orthop. 2019;90(4):394-400. doi:10.1080/17453674.2019.1600125PubMedGoogle ScholarCrossref
33.
Akselrod-Ballin  A, Karlinsky  L, Alpert  S, Hasoul  S, Ben-Ari  R, Barkan  E. A region based convolutional network for tumor detection and classification in breast mammography. In: Carneiro  G,  et al, eds.  Deep Learning and Data Labeling for Medical Applications. Cham, Switzerland: Springer; 2016:197-205. doi:10.1007/978-3-319-46976-8_21
34.
Han  SS, Park  GH, Lim  W,  et al.  Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: automatic construction of onychomycosis datasets by region-based convolutional deep neural network.  PLoS One. 2018;13(1):e0191493. doi:10.1371/journal.pone.0191493PubMedGoogle Scholar
35.
Winkler  JK, Fink  C, Toberer  F,  et al.  Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition  [published online August 14, 2019].  JAMA Dermatol. 2019. doi:10.1001/jamadermatol.2019.1735PubMedGoogle Scholar