The deep learning–based automatic detection algorithm (DLAD) showed consistently high image-wise classification (area under the receiver operating characteristic curve [AUROC], 0.973-1.000) (A) and lesion-wise localization (area under the alternative free-response receiver operating characteristic curve [AUAFROC], 0.923-0.985) (B) performances in external validation tests. In comparison of performance with physicians, the DLAD showed significantly high classification (AUROC, 0.983 vs 0.814-0.932) (C) and localization (AUAFROC, 0.985 vs 0.781-0.907) (D) performances than all observer groups.
A, The chest radiograph (CR) shows nodular opacity at the right lower lung field (arrowhead), which was initially detected by 2 of 15 observers. B, The corresponding computed tomographic (CT) image reveals a nodule at the right middle lobe. C, The deep learning–based automatic detection algorithm (DLAD) correctly localized the lesion (probability score, 0.291). Four observers additionally detected the lesion after checking the output.
A, The chest radiograph (CR) shows subtle patchy increased opacity at the left middle lung field, which was initially missed by all 15 observers. B, The corresponding computed tomographic (CT) image shows patchy ground glass opacity at the left upper lobe. C, The deep learning–based automatic detection algorithm (DLAD) correctly localized the lesion (probability score, 0.371). Seven observers correctly detected the lesion after checking the result.
eAppendix. Supplemental Material
eFigure 1. Flow Diagram of Data Inclusion, Curation, and Allocation
eFigure 2. Architecture of the DLAD Algorithm
eFigure 3. User Interface of the Observer Performance Test
eFigure 4. Comparison of Performances Between Physician Only Reading and Physician Assisted by DLAD
eFigure 5. Representative Case From the Observer Performance Test (Active Pulmonary Tuberculosis)
eFigure 6. Representative Case From the Observer Performance Test (Pneumothorax)
eFigure 7. Confusion Matrices for Differentiation of Abnormal CRs
eFigure 8. Examples of the Differentiation of Abnormal CRs by DLAD
eFigure 9. Examples of Differentiation of False-Positively Classified Normal CRs by DLAD
eTable 1. Inclusion and Exclusion Criteria for the Development and External Validation Datasets
eTable 2. Demographic Description of the Five External Validation Datasets
eTable 3. Sensitivities of DLAD for Individual Diseases in the 5 External Validation Datasets
eTable 4. Performance of Individual Non-Radiology Physician Readers
eTable 5. Performance of Individual Board-Certified Radiologist Readers
eTable 6. Performance of Individual Thoracic Radiologist Readers
eTable 7. Performance of DLAD in the Differentiation of Abnormal CRs
eTable 8. Differentiation of False-Positively Classified Normal CRs by DLAD
Customize your JAMA Network experience by selecting one or more topics from the list below.
Identify all potential conflicts of interest that might be relevant to your comment.
Conflicts of interest comprise financial interests, activities, and relationships within the past 3 years including but not limited to employment, affiliation, grants or funding, consultancies, honoraria or payment, speaker's bureaus, stock ownership or options, expert testimony, royalties, donation of medical equipment, or patents planned, pending, or issued.
Err on the side of full disclosure.
If you have no conflicts of interest, check "No potential conflicts of interest" in the box below. The information will be posted with your response.
Not all submitted comments are published. Please see our commenting policy for details.
Hwang EJ, Park S, Jin K, et al. Development and Validation of a Deep Learning–Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs. JAMA Netw Open. 2019;2(3):e191095. doi:10.1001/jamanetworkopen.2019.1095
Can a deep learning–based algorithm accurately discriminate abnormal chest radiograph results showing major thoracic diseases from normal chest radiograph results?
In this diagnostic study of 54 221 chest radiographs with normal findings and 35 613 with abnormal findings, the deep learning–based algorithm for discrimination of chest radiographs with pulmonary malignant neoplasms, active tuberculosis, pneumonia, or pneumothorax demonstrated excellent and consistent performance throughout 5 independent data sets. The algorithm outperformed physicians, including radiologists, and enhanced physician performance when used as a second reader.
A deep learning–based algorithm may help improve diagnostic accuracy in reading chest radiographs and assist in prioritizing chest radiographs, thereby increasing workflow efficacy.
Interpretation of chest radiographs is a challenging task prone to errors, requiring expert readers. An automated system that can accurately classify chest radiographs may help streamline the clinical workflow.
To develop a deep learning–based algorithm that can classify normal and abnormal results from chest radiographs with major thoracic diseases including pulmonary malignant neoplasm, active tuberculosis, pneumonia, and pneumothorax and to validate the algorithm’s performance using independent data sets.
Design, Setting, and Participants
This diagnostic study developed a deep learning–based algorithm using single-center data collected between November 1, 2016, and January 31, 2017. The algorithm was externally validated with multicenter data collected between May 1 and July 31, 2018. A total of 54 221 chest radiographs with normal findings from 47 917 individuals (21 556 men and 26 361 women; mean [SD] age, 51  years) and 35 613 chest radiographs with abnormal findings from 14 102 individuals (8373 men and 5729 women; mean [SD] age, 62  years) were used to develop the algorithm. A total of 486 chest radiographs with normal results and 529 with abnormal results (1 from each participant; 628 men and 387 women; mean [SD] age, 53  years) from 5 institutions were used for external validation. Fifteen physicians, including nonradiology physicians, board-certified radiologists, and thoracic radiologists, participated in observer performance testing. Data were analyzed in August 2018.
Deep learning–based algorithm.
Main Outcomes and Measures
Image-wise classification performances measured by area under the receiver operating characteristic curve; lesion-wise localization performances measured by area under the alternative free-response receiver operating characteristic curve.
The algorithm demonstrated a median (range) area under the curve of 0.979 (0.973-1.000) for image-wise classification and 0.972 (0.923-0.985) for lesion-wise localization; the algorithm demonstrated significantly higher performance than all 3 physician groups in both image-wise classification (0.983 vs 0.814-0.932; all P < .005) and lesion-wise localization (0.985 vs 0.781-0.907; all P < .001). Significant improvements in both image-wise classification (0.814-0.932 to 0.904-0.958; all P < .005) and lesion-wise localization (0.781-0.907 to 0.873-0.938; all P < .001) were observed in all 3 physician groups with assistance of the algorithm.
Conclusions and Relevance
The algorithm consistently outperformed physicians, including thoracic radiologists, in the discrimination of chest radiographs with major thoracic diseases, demonstrating its potential to improve the quality and efficiency of clinical practice.
Chest radiographs (CRs) have been used as a first-line examination for the evaluation of various thoracic diseases worldwide.1,2 In fact, CR is the most commonly performed radiologic examination today, accounting for up to 26% of all diagnostic radiologic examinations.3 Interpretation of CR, however, remains a challenging task requiring both experience and expertise, as various anatomic structures can overlap in a single 2-dimensional image, and various physiological and pathological changes may appear similar or a single pathology may exhibit various features.4 Thus, interpretation is prone to errors, with a previous study5 reporting that 22% of all errors in diagnostic radiology were made in CRs. Compounding this difficulty is an increase in the number of examinations at a rate much faster than the increase in the number of qualified radiologists, which has led to an increased workload for radiologists.6
Thus, it is not surprising that computer-aided diagnosis (CAD) for CRs has remained an attractive topic for researchers.7 Indeed, there have been several CAD systems that have achieved successful results for various thoracic diseases, including pulmonary nodules,8,9 pulmonary tuberculosis,10,11 and pneumothorax.12 However, few CAD systems are presently used in clinical practice owing to their suboptimal performance (sensitivity, 47%-76% with 1.7-3.3 false-positives per image for lung nodules; area under the receiver operating characteristic curve [AUROC], 0.71-0.84 for pulmonary tuberculosis),10,13 and specific target diseases limit their utility in general practice.
Recently, the deep learning technique demonstrated promising results in medical image analyses, including detecting diabetic retinopathy in fundus photographs,14 classifying skin cancer from skin photographs,15 and detecting metastasis on pathologic images.16 As for CRs, several studies have reported notable performance of the deep learning algorithm for image classification.17-19 However, algorithms in those studies have not yet been fully validated in unseen data sets,17-19 limiting the generalizability of results.
Previously, we investigated deep learning–based automatic detection algorithms (DLADs) for classification of CRs with malignant nodules20 and active pulmonary tuberculosis.21 However, those algorithms had limited clinical utility, as there are various pathologies and abnormalities other than malignant neoplasms and pulmonary tuberculosis in real-world clinical practice. For a CAD system to have practical value in this setting, it should work on CRs with various abnormalities, particularly major thoracic diseases that account for most thoracic abnormalities observed on CRs. Therefore, the purpose of our study was to develop a DLAD for major thoracic diseases on CRs and to validate its performance using independent data sets in comparison with physicians.
This study was approved by the institutional review boards of all participating institutions, which waived the requirement for patient consent. This report followed the Standards for Reporting of Diagnostic Accuracy (STARD) 2015 reporting guideline.
Among the development data set, CRs with normal findings (n = 54 221) and some CRs with abnormal findings of pulmonary malignant neoplasms (n = 13 926) and active pulmonary tuberculosis (n = 6768) were used in our previous studies.20,21 In those studies, however, tasks of the developed algorithms were classification of CRs with pulmonary malignant neoplasms20 and active pulmonary tuberculosis from normal CRs, which is different from that of our current study.21
We defined the target diseases of our DLAD as major thoracic diseases that are common, clinically important, and detectable on CRs. Specifically, we included pulmonary malignant neoplasms (including primary lung cancers and metastasis), active pulmonary tuberculosis, and pneumonia, which are among the top 5 respiratory diseases in terms of global burden.22,23 We added pneumothorax as a target disease, as it is relatively common, yet can cause mortality without prompt and accurate detection through CRs.24,25
The primary goal of our DLAD was binary classification of CRs: CRs with abnormal findings including any of the target diseases vs normal CRs. The subsidiary goal of our DLAD was to differentiate CRs with abnormal results into 1 of 4 major thoracic diseases.
For the development of the DLAD, a total of 57 481 CRs with normal results and 41 140 CRs with abnormal results were retrospectively collected between November 1, 2016, and January 31, 2017, from a single institution (institution A). The CRs with normal findings were collected based on their radiology reports and were double-checked by board-certified radiologists. The CRs with abnormal findings were obtained from patients with pathologically proven or clinically and/or radiologically confirmed diseases, of which the detailed inclusion criteria are summarized in eTable 1 in the Supplement. Abnormal findings of CRs included the following 4 disease categories: pulmonary malignant neoplasms, active pulmonary tuberculosis, pneumonia, and pneumothorax.
For data curation, all CRs were reviewed by at least 1 of 15 board-certified radiologists (7-14 years of experience in reading CRs). The data curation process comprised 2 steps. The first step, image labeling, was performed to confirm whether each CR was categorized correctly and whether abnormalities of CRs with abnormal results were visible on the CRs. The second step, image annotation, marked the exact location of each abnormal finding on the CR. During data curation, CRs originally designated as normal but showing significant abnormality (3260 CRs) and CRs read as abnormal but without detectable abnormal finding (5527 CRs) were excluded from the data set by reviewing radiologists. Finally, 54 221 CRs with normal results from 47 917 individuals (21 556 men and 26 361 women; mean [SD] age, 51  years) and 35 613 CRs with abnormal results from 14 102 individuals (8373 men and 5729 women; mean [SD] age, 62  years) were used for the development of the DLAD. Annotations were performed in 35.6% of CRs with abnormal results (12 696 of 35 613).
All CRs were randomly assigned into 1 of the 3 following data sets: (1) training data set comprising 53 621 CRs with normal findings and 34 074 CRs with abnormal findings to optimize network weights; (2) tuning data set comprising 300 CRs with normal findings and 750 CRs with abnormal findings to optimize hyperparameters; and (3) in-house validation data set comprising 300 CRs with normal findings and 789 CRs with abnormal findings to evaluate the detection performance of the trained DLAD (eFigure 1 in the Supplement). To prevent CRs of a single individual from being assigned to different data sets, patient-based assignment was performed, and the patients of the 3 data sets were excluded from the other data sets.
Detailed description of the network architecture and the training of the DLAD is provided in the eAppendix and eFigure 2 in the Supplement. In brief, we adopted a deep convolutional neural network with dense blocks26 comprising 5 parallel classifiers. Four classifiers were designed for each disease, and the final classifier was designed for classification of CRs with abnormal results reflecting any of the target diseases. To train the algorithm to classify abnormal CRs with major thoracic diseases as well as to localize abnormalities, 2 types of losses were used to train the algorithm: classification loss and localization loss. Both CRs with and without annotations were used in training, while localization losses were calculated only from the CRs with annotations.
Finally, for each input CR, the DLAD provided continuous value between 0 and 1 as the image-level probability of abnormal CR. Per-pixel localization probability maps for each target disease and the entire target disease, overlaid on the input CR, were also provided.
First, the performance of the DLAD was evaluated using an in-house validation data set, part of the development data set not used for training. Thereafter, external validation tests were performed using 5 independent data sets separately collected and curated between May 1 and July 31, 2018 at different institutions (4 hospitals in Korea [institutions A-D] and 4 hospital in France [institution E]) to validate the consistency of the DLAD’s performance. The inclusion criteria for the external validation data sets are summarized in eTable 1 in the Supplement. Each CR with abnormal results contained only 1 disease finding of the 4 target diseases. All CRs with normal and abnormal findings, except CRs with pneumothorax, had corresponding chest computed tomographic images to define firm reference standards. Five board-certified radiologists (7-14 years of experience) in each of the participating institutions selected and labeled the CRs and annotated the exact locations of the abnormalities on each CR with abnormal results. The external validation data set from institution A was temporally different from the development data set. Demographic information regarding the external validation data sets appears in eTable 2 in the Supplement. Overall, a total of 486 CRs with normal results and 529 with abnormal results (1 from each participant; 628 men and 387 women; mean [SD] age, 53  years) were used for external validation.
To compare the performances between the DLAD and physicians and to evaluate whether a DLAD can improve physicians’ diagnostic performance, an observer performance test was conducted. An observer panel of 15 physicians with varying experience (5 thoracic radiologists [9-14 years of experience]; 5 board-certified radiologists [5-7 years of experience]; and 5 nonradiology physicians) was constructed. For this test, the external validation data set from institution A was used. The radiologist who defined the reference standard for the data set did not participate in the test. The test included 2 sessions. In session 1, observers independently assessed every CR, without assistance of the DLAD, to classify CRs into those with significant abnormal findings requiring treatment or further evaluation and those without. Observers were also asked to localize the abnormal finding via free-hand annotation, along with a confidence score of a continuous value between 0 and 1 for each annotation. In session 2, observers reevaluated every CR with the assistance of the DLAD and were asked to modify their original decision if necessary (eFigure 3 in the Supplement).
All statistical analyses were performed in August 2018 using R statistical software version 3.5.1 (R Project for Statistical Computing)27 with package RJafroc.28 Receiver operating characteristic (ROC) analyses and jackknife alternative free-response ROC (JAFROC) analyses were performed to evaluate image-wise classification and lesion-wise localization performances, respectively. For the DLAD, the image-wise probability value of each CR and the maximum pixel-wise probability value in the predefined reference standard annotation were considered to be confidence scores for ROC and JAFROC analyses, respectively. For physicians, the highest confidence score among annotated lesions in each image was used as their confidence score for image-wise classification.29 The AUROCs and area under the alternative free-response ROC curve (AUAFROCs) were used as performance measures of ROC and JAFROC analyses, respectively. Statistical significances were evaluated using a method suggested by Dorfman et al.30 Both the observers and cases were treated as random effects for analyses in the physician groups, while only cases were treated as random effects for analyses in individual physicians.31
In addition, sensitivities, specificities, precision, and F1 scores for image-wise classification were evaluated. For the DLAD, 2 different probability values were selected as classification thresholds based on the results of in-house validation: a high sensitivity threshold (95% sensitivity on in-house validation) and a high specificity threshold (95% specificity on in-house validation). For physicians, any detected lesion was regarded to be positive. Comparison of sensitivities and specificities were performed using McNemar tests.
For performance evaluation in differentiating CRs with abnormal findings, the disease with the highest probability value among the DLAD’s output was regarded as the differential diagnosis of the DLAD. Thereafter, confusion matrices were drawn and overall accuracies and producer’s accuracies of each target disease were calculated.32
All results with 2-sided P values less than .05 were considered to indicate a statistically significant difference, and the Holm-Bonferroni method was used to correct for multiple comparisons.33
The performance of the DLAD in our in-house validation data set revealed an AUROC of 0.965 (95% CI, 0.955-0.975) for image-wise classification. From this result, operating thresholds were defined as a probability of 0.16 (high sensitivity threshold; sensitivity, 0.951; specificity, 0.750) and 0.46 (high specificity threshold; sensitivity, 0.920; specificity, 0.950).
For external validation, the DLAD showed a median (range) AUROC of 0.979 (0.973-1.000), which was greater than the results of in-house validation. Median (range) sensitivity and specificity were 0.979 (0.913-1.000) and 0.880 (0.566-1.000), respectively, using the high sensitivity threshold; median (range) sensitivity and specificity were 0.945 (0.845-1.000) and 0.980 (0.848-1.000), respectively, using the high specificity threshold. For individual diseases, sensitivities of the DLAD were between 0.833 and 1.000 using the high sensitivity threshold, and between 0.808 and 1.000 using the high specificity threshold (eTable 3 in the Supplement). Detailed performances are described in Table 1 and Figure 1.
In-house performance of the DLAD in lesion-wise localization was an AUAFROC of 0.916 (95% CI, 0.900-0.932). In the external validation data sets, the DLAD showed a median (range) AUAFROC of 0.972 (0.923-0.985) (Table 1 and Figure 1).
In session 1 of the observer performance test, pooled AUROCs for nonradiology physicians, board-certified radiologists, and thoracic radiologists were 0.814, 0.896, and 0.932, respectively. The performance of the DLAD (AUROC 0.983) was significantly higher than all 3 observer groups (All P < .005) (Table 2 and Figure 1).
For the lesion-wise localization, pooled AUAFROCs for nonradiology physicians, board-certified radiologists, and thoracic radiologists were 0.781, 0.870, and 0.907, respectively. The performance of the DLAD (AUAFROC 0.985) was significantly higher than all observer groups (All P < .001).
Regarding the performances of individual observers, the DLAD showed significantly better image-wise classification performance than 14 of 15 observers (median [range] AUROC, 0.906 [0.779-0.959]), and significantly better lesion-wise localization performance than all observers (median [range] AUAFROC, 0.877 [0.742-0.938]) (eTables 4-6 in the Supplement).
In session 2 of the observer performance test, AUROCs of nonradiology physicians, board-certified radiologists, and thoracic radiologists were 0.904, 0.939, and 0.958, respectively. Increments of AUROCs were 0.090, 0.043, and 0.026, respectively, all of which were statistically significant (all P < .005) (Table 2; eFigure 4 in the Supplement).
For lesion-wise localization, AUAFROCs of nonradiology physicians, board-certified radiologists, and thoracic radiologists were 0.873, 0.919, and 0.938, respectively. Increments of AUAFROCs were 0.092, 0.049, and 0.031, respectively, all of which were statistically significant (all P < .001).
In terms of sensitivities and specificities, significant improvement in sensitivities (0.699-0.876 in session 1; 0.835-0.924 in session 2; all P < .001) were observed in all 3 physician groups, while specificities (0.901-0.946 in session 1; 0.924-0.948 in session 2) were significantly improved only in nonradiology physicians.
In terms of individual observers, significant improvements in AUROCs (median [range] increment, 0.040 [0.007-0.111]) and AUAFROCs (median [range] increment, 0.051 [0.015-0.108]) were observed in 14 of 15 physicians (eTables 4-6 in the Supplement).
Figure 2, Figure 3, and eFigure 5 and eFigure 6 in the Supplement show representative images from the observer performance test.
In the evaluation of the DLAD’s performance in differentiating CRs with abnormal findings, the pooled overall accuracy was 0.686 (95% CI, 0.645-0.726), while the producer’s accuracies for pulmonary malignant neoplasm, active pulmonary tuberculosis, pneumonia, and pneumothorax were 0.840 (95% CI, 0.774-0.892), 0.209 (95% CI, 0.143-0.290), 0.731 (95% CI, 0.642-0.808), and 0.950 (95% CI, 0.893-0.981), respectively. Detailed results are shown in eTable 7, eTable 8, and eFigures 7 to 9 in the Supplement.
We developed a DLAD algorithm that is able to classify CRs with normal vs abnormal findings indicative of major thoracic diseases including pulmonary malignant neoplasms, active pulmonary tuberculosis, pneumonia, and pneumothorax. Our algorithm consistently demonstrated high performance across independent data sets, even outperforming physicians, including thoracic radiologists. Furthermore, we demonstrated improved physician performance with the assistance of the DLAD.
The strengths of our study can be summarized as follows. First, the development data set underwent extensive data curation by radiologists. It has been shown that the performance of deep learning–based algorithms depends not only on the quantity of the training data set, but also on the quality of the data labels.34 As for CRs, several open-source data sets are currently available; however, those data sets remain suboptimal for the development of deep learning–based algorithms because they are weakly labeled by radiologic reports35 or lack localization information.36 In contrast, in the present study, we initially collected data from the radiology reports and clinical diagnosis; then experienced board-certified radiologists meticulously reviewed all of the collected CRs. Furthermore, annotation of the exact location of each abnormal finding was done in 35.6% of CRs with abnormal results, which we believe led to the excellent performance of our DLAD.37
Second, our DLAD showed consistently excellent performance in the 5 external validation data sets, each of which was independently collected from different institutions across different continents (AUROC, 0.973-1.000; AUAFROC, 0.923-0.985). This consistent performance of the DLAD across the external validation data sets acquired from different populations suggests that our DLAD’s performance may be generalized to various populations.
Third, we compared the performance of our DLAD with the performance of physicians with various levels of experience. The stand-alone performance of a CAD system can be influenced by the difficulty of the test data sets and can be exaggerated in easy test data sets. However, observer performance tests may provide a more objective measure of performance by comparing the performance between the CAD system and physicians. Impressively, the DLAD demonstrated significantly higher performance both in image-wise classification and lesion-wise localization than all physician groups, even the thoracic radiologist group.
Fourth, our DLAD provided localization information as well as image-wise classification capabilities. Although the exact localization of abnormalities on CR may not be a clinically relevant task, it may be an important consideration in the reliability of an algorithm. The explainable output of deep learning algorithms can be critical for the reliability of the algorithms, particularly in the medical field.38 Localization information provided by the DLAD can help visualize the logical background of the classification output, which is the ultimate goal of our DLAD. Indeed, the improvement of physicians’ performances with the assistance of the DLAD suggests that it provides a reliable explanation.
The primary goal of our DLAD was to classify CRs with normal vs abnormal results indicating any of the major thoracic diseases. In most clinical situations, CRs serve as the initial diagnostic examination for various thoracic diseases. Detection of such abnormalities would lead to further diagnostic workups with other radiologic or laboratory examinations to make a specific diagnosis. Therefore, the initial detection of such clinically relevant abnormalities is of paramount importance in the interpretation of CRs in real clinical practice. The target diseases of our DLAD, although they did not cover all of the thoracic diseases, were the most common, clinically relevant diseases. Pulmonary malignant neoplasm, tuberculosis, and pneumonia, which are responsible for 1.6, 1.4, and 4 million global deaths per year, respectively, are 3 of the most important diseases among all thoracic diseases that can be detected on CRs.23 While it causes less mortality than the 3 other target diseases, pneumothorax is still an important global health burden, with an annual incidence of 18 to 28 and 1.2 to 6 cases per 100 000 males and females, respectively.39 Furthermore, it is critical to detect pneumothorax on CRs, as CRs are typically the final diagnostic examination to confirm the diagnosis.
The high performance of the DLAD in classification of CRs with normal and abnormal findings indicative of major thoracic diseases, outperforming even thoracic radiologists, suggests its potential for stand-alone use in select clinical situations. It may also help improve the clinical workflow by prioritizing CRs with suspicious abnormal findings requiring prompt diagnosis and management. It can also improve radiologists’ work efficiency, which would partially alleviate the heavy workload burden that radiologists face today and improve patients’ turnaround time. Furthermore, the improved performance of physicians with the assistance of the DLAD indicates the potential of our DLAD as a second reader. The DLAD can contribute to reducing perceptual error of interpreting physicians by alerting them to the possibility of major thoracic diseases and visualizing the location of the abnormality. In particular, the more obvious increment of performance in less-experienced physicians suggests that our DLAD can help improve the quality of CR interpretations in situations in which expert thoracic radiologists may not be available.
Providing a differential diagnosis among the CRs with abnormal results was a subsidiary task of our DLAD. According to our study results, the DLAD showed promising but suboptimal performance in this task (pooled overall accuracy of 0.686). Actually, one of the most important challenges in the interpretation of CRs is that there are substantial overlaps between the radiological findings of various diseases. Therefore, it is often impossible to provide a specific differential diagnosis using only CRs. As expected, our DLAD showed substantial misclassification among pulmonary malignant neoplasms, tuberculosis, and pneumonia owing to these overlaps in radiologic findings. For pneumothorax, on the other hand, which is one of the few examples in which a specific diagnosis can be made with CRs alone because of the condition’s clearly different findings from other diseases, the DLAD showed excellent differentiating performance (pooled producer’s accuracy of 0.950).
There are several limitations in the present study. First, validation was performed using experimentally designed data sets; however, real-world situations may be substantially different from these data sets, particularly regarding disease prevalence and diversity of abnormalities on CRs. In this regard, further validation and clinical utility tests in various clinical settings are warranted. Second, our DLAD covers only 4 major thoracic disease categories. However, we believe our DLAD algorithm can detect a substantial proportion of clinically relevant diseases in actual practice. At present, training our DLAD to detect all kinds of thoracic diseases, including rare or clinically irrelevant abnormalities, may not be practical because it can cause many false-positive classifications, hampering its clinical utility. Third, each abnormal CR in the external validation data sets represented only 1 target disease, as we attempted to set a strict reference standard for CRs with abnormal findings. However, CRs with multiple target diseases are not uncommon in real-world situations and, thus, warrant future investigations.
We developed a DLAD algorithm that can classify CRs with normal and abnormal findings indicating major thoracic diseases with consistently high performance, outperforming even radiologists, which may improve the quality and efficiency of the current clinical workflow.
Accepted for Publication: January 24, 2019.
Published: March 22, 2019. doi:10.1001/jamanetworkopen.2019.1095
Correction: This article was corrected on April 12, 2019, to fix incorrect wording in the Methods section and Table 1 as well as incorrect data in eFigure 1 in the Supplement.
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Hwang EJ et al. JAMA Network Open.
Corresponding Author: Chang Min Park, MD, PhD, Department of Radiology and Institute of Radiation Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul 03080, South Korea (firstname.lastname@example.org).
Author Contributions: Drs Hwang and C. M. Park had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Dr Hwang and Mr S. Park equally contributed to this work.
Concept and design: Hwang, S. Park, Lee, Goo, C. M. Park.
Acquisition, analysis, or interpretation of data: Hwang, S. Park, Jin, Kim, Choi, Goo, Aum, Yim, Cohen, Ferretti, C. M. Park.
Drafting of the manuscript: Hwang, S. Park, Kim.
Critical revision of the manuscript for important intellectual content: Hwang, S. Park, Jin, Choi, Lee, Goo, Aum, Yim, Cohen, Ferretti, C. M. Park.
Statistical analysis: Hwang, S. Park, Lee.
Obtained funding: C. M. Park.
Administrative, technical, or material support: Hwang, S. Park, Kim, Choi, Goo, Yim, Ferretti, C. M. Park.
Supervision: Goo, Yim, C. M. Park.
Conflict of Interest Disclosures: Dr Goo reported grants from Lunit Inc during the conduct of the study. Dr Ferretti reported personal fees from Boehringer, Roche, Bristol-Myers Squibb, and GEMS and nonfinancial support from Guerbet outside the submitted work. No other disclosures were reported.
Funding/Support: This study was supported by the Seoul National University Hospital Research Fund (grant 06-2016-3000), Lunit Inc, and the Seoul Research & Business Development Program (grant FI170002).
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Group Members: The DLAD Development and Evaluation Group includes the following 10 members: Dong Hyeon Kim, MD, Seoul National University Hospital College of Medicine, Seoul, South Korea; Sungmin Woo, MD, Armed Forces Daejon Hospital, Daejon, South Korea; Wonseok Choi, MD, Seoul National University Hospital College of Medicine; In Pyung Hwang, MD, Seoul National University Hospital College of Medicine; Yong Sub Song, MD, Seoul National University Hospital College of Medicine; Jiyeon Lim, MD, Seoul National University Hospital College of Medicine; Hyungjin Kim, MD, Seoul National University Hospital College of Medicine; Jae Yeon Wi, MD, Asan Medical Center, Seoul, South Korea; Su Suk Oh, MD, Seoul National University Hospital, Seoul, South Korea; Mi-Jin Kang, MD, Inje University Sanggyepaik Hospital, Seoul, South Korea; Nyoung Keun Lee, MD, Sungmin Hospital, Incheon, South Korea; Jin Young Yoo, MD, Chungbuk National University Hospital, Cheongju, South Korea; Young Joo Suh, MD, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea.