Breast cancer is the most common non–skin-related cancer among women and the leading cause of cancer-related deaths among women worldwide. While mammography screening has been shown to reduce breast cancer morbidity and mortality, the intrinsic limitation of a mammogram, namely, that it is a 2-dimensional projection of a 3-dimensional structure, increases the complexity of the cancer detection task. In the United States, a single radiologist interprets screening mammograms to carry out this task, whereas most of Europe and parts of Asia use 2 radiologists (double reading). In the single-radiologist reading scenario, false-negative rates range between 10% and 30%,1 and false-positive rates are such that 49% of women screened annually for 10 years will experience at least 1 false-positive mammogram result.2 Clearly, this is a task that can benefit from artificial intelligence (AI), and indeed breast cancer applications account for 12% of all AI applications in radiology.3 In particular, deep learning applications in digital mammography are on the rise.4-6 However, most studies to date have used small data sets, which could have resulted in algorithm overtraining and overenthusiastic predictions about the algorithms' usability in the breast imaging clinic.
Schaffter and colleagues7 provide the results of a large international, crowd-sourced trial on the application of deep learning to digital mammograms. This extraordinary, well-organized challenge was backed by powerful computational resources. In this challenge, 126 teams from 44 countries had a single goal: to meet or beat radiologists in determining whether a given screening mammogram belonged to a woman who would be diagnosed with cancer within a 12-month time frame. Two large databases were used, 1 from the United States and 1 from Sweden, amounting to more than 300 000 mammographic examinations from more than 150 000 women. To understand the magnitude of this database, consider that a high-volume breast radiologist reading 5000 mammogram cases per year would need 60 years to read it in its entirety (300 000 ÷ 5000 = 60).
To appreciate the contributions of this study to the field, I will summarize its main results and limitations. At the outset, this study issued 2 challenges. In the first challenge, the algorithms would use only the mammograms to make their predictions. In the second challenge, the algorithms would also use previous examinations (if available), in addition to clinical and demographic data about the patients, to make the same predictions. Given the large number of cases available to train and validate the algorithms and the level of talent involved, it would be easy to expect that radiologists would be beaten in these challenges, but surprisingly that is not what happened. When operating based on the mammograms alone (challenge 1), the top-performing algorithm had lower specificity than the radiologists at the radiologists' operating sensitivity for both databases. Moreover, no significant improvement was observed when the algorithms had access to prior imaging as well as clinical and demographic data about the patients (challenge 2). At this point, the organizers invited the top 8 teams to collaborate on an ensemble model, aptly named the challenge ensemble model (CEM). The specificity of this model was compared with that of the radiologists at a fixed sensitivity, and once again the radiologists came out ahead by a wide margin (at a sensitivity of 85.9%, the radiologists' specificity was 90.5%, whereas that of the CEM was 76.1%). Only when the radiologists' own assessments of the cases were added to the CEM (CEM + R) did its specificity improve to 92%, resulting in AI finally beating the radiologists.
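To make the comparison at a matched operating point concrete, the sketch below shows how an algorithm's specificity can be read off at the radiologists' sensitivity of 85.9%. This is a minimal illustration in Python, not the challenge's actual evaluation code; the labels, scores, and sample size are simulated placeholders.

```python
# Minimal sketch (not the challenge's evaluation code): read off an algorithm's
# specificity at a fixed, radiologist-matched sensitivity. Data are simulated.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
n = 10_000
y_true = rng.integers(0, 2, size=n)                        # 1 = cancer diagnosed within 12 months
y_score = np.clip(0.4 + 0.3 * y_true + rng.normal(0, 0.2, size=n), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)                   # ROC points for the algorithm
target_sensitivity = 0.859                                 # radiologists' operating sensitivity
idx = np.searchsorted(tpr, target_sensitivity)             # first threshold reaching that sensitivity
specificity_at_target = 1.0 - fpr[idx]
print(f"Algorithm specificity at {target_sensitivity:.1%} sensitivity: {specificity_at_target:.1%}")
```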
There is a lot to unpack from these results. First and foremost, despite the fact that all algorithms developed for the challenge are freely available, no breast clinic would have the computational resources to deploy the ensemble model with live patient data for a gain in specificity of 1.5 percentage points over the radiologists' specificity at a given sensitivity (85.9%). In the challenge, the issue of computational resources was handled by having all algorithms run in the cloud (remote, shared computing infrastructure), using the same computational resources, but in real life it would be very difficult for breast clinics to upload identifiable patient data to the cloud for processing. Even if the trained ensemble algorithm were made available to breast clinics, implemented either on a specialized workstation or on a laptop, doing so would effectively freeze the algorithm in time and prevent it from learning the characteristics of the population being screened. At that point, it is fair to question whether this is still AI or just a fancy new computer-aided detection system. Second, when compared with double reading, no gains were observed, even for the CEM + R model, which suggests that 2 radiologists working together still beat a radiologist working with AI. Third, challenge 1 used breast-level performance, whereas the ensemble model used case-level performance. Neither of these models involved marking the actual location where the algorithm(s) believed a cancer to be present, which is different from the radiologists' assessments, in which recall is based on suspicion in a given area. This was because the data presented to the algorithms were weakly labeled (ie, they did not contain the locations of the cancers), but it is an important distinction between what radiologists do in their daily practice and what the challenge algorithms were asked to do.
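For readers unfamiliar with ensembling, the following is a minimal sketch of one common approach, weighted averaging of several models' case-level suspicion scores. It is only illustrative; the actual CEM was built collaboratively by the top 8 teams, and the model scores and weights below are hypothetical.

```python
# Illustrative sketch only: combine several models' case-level suspicion scores
# by (weighted) averaging. This is not the challenge ensemble model (CEM);
# the scores and weights are hypothetical.
import numpy as np

def ensemble_scores(score_matrix, weights=None):
    """Combine per-model scores (rows = cases, columns = models) into one score per case."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    if weights is None:                                    # default: simple average across models
        weights = np.full(score_matrix.shape[1], 1.0 / score_matrix.shape[1])
    return score_matrix @ weights

# Example: three hypothetical models scoring four screening cases.
scores = [
    [0.10, 0.05, 0.20],
    [0.80, 0.75, 0.60],
    [0.30, 0.40, 0.35],
    [0.05, 0.10, 0.02],
]
print(ensemble_scores(scores))  # one combined suspicion score per case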
These limitations are important primarily because this has been, to my knowledge, the largest effort to date to develop AI that could aid radiologists in the reading of screening mammograms, and for the most part it showed that AI is not there yet. In addition, with many calling for a total overhaul of radiology owing to AI, studies like this strongly suggest that radiologists will remain masters of their domain for quite some time, as the task of image interpretation is significantly more complex than it is often given credit for. Nonetheless, it is necessary not to lose sight of the fact that the results presented by Schaffter and colleagues7 are temporary. They do not show that AI is never going to be useful for screening mammography; they simply show that today, even with an incredible amount of resources, arguably the best AI teams in the world could not meet or beat the radiologists.
Published: March 2, 2020. doi:10.1001/jamanetworkopen.2020.0282
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2020 Mello-Thoms C. JAMA Network Open.
Corresponding Author: Claudia Mello-Thoms, MS, PhD, Department of Radiology, University of Iowa, 200 Hawkins Drive, 3922 JPP, Iowa City, IA 52242 (claudia-mello-thoms@uiowa.edu).
Conflict of Interest Disclosures: None reported.
5. Hinton B, Ma L, Mahmoudzadeh AP, et al. Deep learning networks find unique mammographic differences in previous negative mammograms between interval and screen-detected cancers: a case-case study. Cancer Imaging. 2019;19(1):41. doi:10.1186/s40644-019-0227-3
7. Schaffter T, Buist DSM, Lee CI, et al; DM DREAM Consortium. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open. 2020;3(3):e200265. doi:10.1001/jamanetworkopen.2020.0265