Figure.  Receiver Operating Characteristic Curves for the 3 Artificial Intelligence Computer-Aided Detection Algorithms

As a comparison, for first-reader radiologists, the vertical dashed line represents the mean 1 − specificity (1 − 0.966 = 0.034); and the horizontal dashed line, the mean sensitivity (0.774). The right panel is a magnification of the vertical line and horizontal line intersection.

Table 1.  Area Under the Receiver Operating Characteristic Curve for the 3 Artificial Intelligence Algorithms
Table 2.  Screening Performance Benchmarks for Artificial Intelligence Algorithms and for Radiologists in 739 Women Who Received a Diagnosis of Breast Cancer and 112 924 Healthy Women
Table 3.  Number of Abnormal Interpretations and Cases Positive for Cancer Detected by Algorithms and Readers Alone and by Algorithms Combined With the Assessment of the First, Second, or Both Readers
References
1. Marmot MG, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. Br J Cancer. 2013;108(11):2205-2240. doi:10.1038/bjc.2013.177
2. Seely JM, Alhassan T. Screening for breast cancer in 2018—what should we be doing today? Curr Oncol. 2018;25(suppl 1):S115-S124. doi:10.3747/co.25.3770
3. Giess CS, Wang A, Ip IK, Lacson R, Pourjabbar S, Khorasani R. Patient, radiologist, and examination characteristics affecting screening mammography recall rates in a large academic practice. J Am Coll Radiol. 2019;16(4, pt A):411-418. doi:10.1016/j.jacr.2018.06.016
4. Barlow WE, Chi C, Carney PA, et al. Accuracy of screening mammography interpretation by characteristics of radiologists. J Natl Cancer Inst. 2004;96(24):1840-1850. doi:10.1093/jnci/djh333
5. Lehman CD, Arao RF, Sprague BL, et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology. 2017;283(1):49-58. doi:10.1148/radiol.2016161174
6. Elmore JG, Jackson SL, Abraham L, et al. Variability in interpretive performance at screening mammography and radiologists’ characteristics associated with accuracy. Radiology. 2009;253(3):641-651. doi:10.1148/radiol.2533082308
7. Lehman CD, Wellman RD, Buist DSM, Kerlikowske K, Tosteson AN, Miglioretti DL; Breast Cancer Surveillance Consortium. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med. 2015;175(11):1828-1837. doi:10.1001/jamainternmed.2015.5231
8. Ciatto S, Del Turco MR, Risso G, et al. Comparison of standard reading and computer aided detection (CAD) on a national proficiency test of screening mammography. Eur J Radiol. 2003;45(2):135-138. doi:10.1016/S0720-048X(02)00011-6
9. Rodriguez-Ruiz A, Lång K, Gubern-Merida A, et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J Natl Cancer Inst. 2019;111(9):916-922. doi:10.1093/jnci/djy222
10. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89-94. doi:10.1038/s41586-019-1799-6
11. Akselrod-Ballin A, Chorev M, Shoshan Y, et al. Predicting breast cancer by applying deep learning to linked health records and mammograms. Radiology. 2019;292(2):331-342. doi:10.1148/radiol.2019182622
12. Wu K, Wu E, Wu Y, et al. Validation of a deep learning mammography model in a population with low screening rates. Preprint. arXiv:1911.00364v1. Posted online November 1, 2019.
13. Dembrower K, Lindholm P, Strand F. A multi-million mammography image dataset and population-based screening cohort for the training and evaluation of deep neural networks—the Cohort of Screen-Aged Women (CSAW). J Digit Imaging. 2020;33(2):408-413. doi:10.1007/s10278-019-00278-0
14. Törnberg S, Kemetli L, Ascunce N, et al. A pooled analysis of interval cancer rates in six European countries. Eur J Cancer Prev. 2010;19(2):87-93. doi:10.1097/CEJ.0b013e32833548ed
15. Kim H-E, Kim HH, Han B-K, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. Published online February 6, 2020. doi:10.1016/S2589-7500(20)30003-0
16. Wu N, Phang J, Park J, et al. Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE Trans Med Imaging. 2020;39(4):1184-1194. doi:10.1109/TMI.2019.2945514
17. Rodríguez-Ruiz A, Krupinski E, Mordang J-J, et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology. 2019;290(2):305-314. doi:10.1148/radiol.2018181371
18. Schaffter T, Buist DSM, Lee CI, et al; DM DREAM Consortium. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open. 2020;3(3):e200265. doi:10.1001/jamanetworkopen.2020.0265
19. Buist DSM, Porter PL, Lehman C, Taplin SH, White E. Factors contributing to mammography failure in women aged 40-49 years. J Natl Cancer Inst. 2004;96(19):1432-1440. doi:10.1093/jnci/djh269
20. Boyd NF, Guo H, Martin LJ, et al. Mammographic density and the risk and detection of breast cancer. N Engl J Med. 2007;356(3):227-236. doi:10.1056/NEJMoa062790
21. Klemi PJ, Toikkanen S, Räsänen O, Parvinen I, Joensuu H. Mammography screening interval and the frequency of interval cancers in a population-based screening. Br J Cancer. 1997;75(5):762-766. doi:10.1038/bjc.1997.135
22. Tabár L, Faberberg G, Day NE, Holmberg L. What is the optimum interval between mammographic screening examinations? An analysis based on the latest results of the Swedish two-county breast cancer screening trial. Br J Cancer. 1987;55(5):547-551. doi:10.1038/bjc.1987.112
23. Domingo L, Hofvind S, Hubbard RA, et al. Cross-national comparison of screening mammography accuracy measures in U.S., Norway, and Spain. Eur Radiol. 2016;26(8):2520-2528. doi:10.1007/s00330-015-4074-8
24. McDonald ES, Oustimov A, Weinstein SP, Synnestvedt MB, Schnall M, Conant EF. Effectiveness of digital breast tomosynthesis compared with digital mammography: outcomes analysis from 3 years of breast cancer screening. JAMA Oncol. 2016;2(6):737-743. doi:10.1001/jamaoncol.2015.5536
25. Houssami N, Kirkpatrick-Jones G, Noguchi N, Lee CI. Artificial intelligence (AI) for the early detection of breast cancer: a scoping review to assess AI’s potential in breast screening practice. Expert Rev Med Devices. 2019;16(5):351-362. doi:10.1080/17434440.2019.1610387
    Original Investigation
    August 27, 2020

    External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms

    Author Affiliations
    • 1Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden
    • 2Department of Radiology, Karolinska University Hospital, Stockholm, Sweden
    • 3Department of Medical Radiation Physics and Nuclear Medicine, Karolinska University Hospital, Stockholm, Sweden
    • 4Department of Physiology and Pharmacology, Karolinska Institute, Stockholm, Sweden
    • 5Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden
    • 6Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm, Sweden
    • 7Division of Computational Science and Technology, KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden
    • 8KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden
    • 9Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
    • 10Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
    JAMA Oncol. Published online August 27, 2020. doi:10.1001/jamaoncol.2020.3321
    Key Points

    Question  Are there currently commercially available artificial intelligence (AI) algorithms that perform as well as or above the level of radiologists in mammography screening assessment?

    Findings  In this case-control study that included 8805 women, 1 of the 3 externally evaluated AI computer-aided detection algorithms was more accurate than first-reader radiologists in assessing screening mammograms. However, the highest number of cases positive for breast cancer was detected by combining this best algorithm with first-reader radiologists.

    Meaning  One commercially available AI algorithm performed independent reading of screening mammograms with sufficient diagnostic performance to act as an independent reader in prospective clinical studies.

    Abstract

    Importance  A computer algorithm that performs at or above the level of radiologists in mammography screening assessment could improve the effectiveness of breast cancer screening.

    Objective  To perform an external evaluation of 3 commercially available artificial intelligence (AI) computer-aided detection algorithms as independent mammography readers and to assess the screening performance when combined with radiologists.

    Design, Setting, and Participants  This retrospective case-control study was based on a double-reader population-based mammography screening cohort of women screened at an academic hospital in Stockholm, Sweden, from 2008 to 2015. The study included 8805 women aged 40 to 74 years who underwent mammography screening and who did not have implants or prior breast cancer. The study sample included 739 women who were diagnosed as having breast cancer (positive) and a random sample of 8066 healthy controls (negative for breast cancer).

    Main Outcomes and Measures  Positive follow-up findings were determined by pathology-verified diagnosis at screening or within 12 months thereafter. Negative follow-up findings were determined by a 2-year cancer-free follow-up. Three AI computer-aided detection algorithms (AI-1, AI-2, and AI-3), sourced from different vendors, yielded a continuous score for the suspicion of cancer in each mammography examination. For a decision of normal or abnormal, the cut point was defined by the mean specificity of the first-reader radiologists (96.6%).

    Results  The median age of study participants was 60 years (interquartile range, 50-66 years) for 739 women who received a diagnosis of breast cancer and 54 years (interquartile range, 47-63 years) for 8066 healthy controls. The cases positive for cancer comprised 618 (84%) screen-detected and 121 (16%) clinically detected within 12 months of the screening examination. The area under the receiver operating characteristic curve for cancer detection was 0.956 (95% CI, 0.948-0.965) for AI-1, 0.922 (95% CI, 0.910-0.934) for AI-2, and 0.920 (95% CI, 0.909-0.931) for AI-3. At the specificity of the radiologists, the sensitivities were 81.9% for AI-1, 67.0% for AI-2, 67.4% for AI-3, 77.4% for the first-reader radiologists, and 80.1% for the second-reader radiologists. Combining AI-1 with first-reader radiologists achieved 88.6% sensitivity at 93.0% specificity (abnormal defined by either of the 2 making an abnormal assessment). No other examined combination of AI algorithms and radiologists surpassed this sensitivity level.

    Conclusions and Relevance  To our knowledge, this study is the first independent evaluation of several AI computer-aided detection algorithms for screening mammography. The results of this study indicated that a commercially available AI computer-aided detection algorithm can assess screening mammograms with a sufficient diagnostic performance to be further evaluated as an independent reader in prospective clinical trials. Combining the first readers with the best algorithm identified more cases positive for cancer than combining the first readers with second readers.

    Introduction

    Population-wide mammography screening resulting in earlier detection of tumors has decreased breast cancer mortality by 20% to 40%.1,2 Nevertheless, the workload for radiologists is high and the quality of assessment varies.3,4 Having a computer algorithm that performs at, or above, the level of radiologists in mammography assessment would be valuable. An added benefit of artificial intelligence (AI) computer-aided detection (CAD) algorithms would be to reduce the broad variation in performance among human readers that has been shown in previous studies.5,6 Computer-aided detection can take on 2 different roles: as a concurrent assistant directing the radiologist's attention to suspicious areas in the mammogram and as an independent reader making an overall assessment of the whole examination without radiologist intervention. Until recently, most commercial CAD systems operated as concurrent assistants and were based on a limited set of programmer-defined features used to identify suspicious areas in the mammogram.7 This approach was never convincingly successful, with some early reports showing an increased sensitivity but later studies showing no clear improvement.7,8 Furthermore, additional time was required from the radiologist to consider each CAD marking. In recent years, academic and commercial researchers have worked to leverage the capabilities of AI, or more specifically of deep neural networks, to enable CAD for independent mammography assessments.9-12 The reported performance levels have in several cases been on par with radiologists.
However, across these published studies there have been various issues: the source population was not a screening cohort,9 the radiologists with which the AI CAD program was compared showed a poor performance,10 and the AI CAD algorithms have often not been publicly available.10-12 Most importantly, none of these studies involved third-party external validation with comparisons among competing AI CAD algorithms. In the present study, we compare the results of applying 3 different AI CAD algorithms as independent readers of a large set of mammographic examinations from a public mammography screening program for which the algorithm developers had no access to images and had no involvement in the evaluation process.

    Methods

    The study sample was derived from the Cohort of Screen-Aged Women (CSAW) data set,13 which consists of all women 40 to 74 years of age in Stockholm county who were invited for screening examinations between 2008 and 2015. The screening interval in Stockholm county is 24 months. However, until 2012, the interval was only 18 months for women between 40 and 49 years of age. In the present study, all screening examinations were from one institution, the Karolinska University Hospital. This retrospective case-control study was approved by the Stockholm ethical review board, which waived the requirement for individual informed consent.

    We included all women aged 40 to 74 years from the CSAW cohort who were diagnosed as having breast cancer between 2008 and 2015, who had a complete screening examination prior to diagnosis, who had no prior breast cancer, and who did not have implants. We excluded 419 examinations with a cancer diagnosis that had more than 12 months between the examination date and diagnosis owing to the lower likelihood of cancer being present at the time of screening. In a secondary analysis, we added 174 women who had received a cancer diagnosis between 12 and 23 months after screening. Random sampling of healthy women was carried out to enable efficient computer processing while maintaining representativeness. We included a random sample of 10 000 healthy women. Of those women, we excluded 995 who had less than 2 years of cancer-free follow-up, 909 who had examinations after December 31, 2015, 26 who had implants, and 99 who had examinations with an unknown radiologist identification number (eFigure 1 in the Supplement). All images were acquired on Hologic full-field digital mammography equipment. Prospectively recorded screening assessments for each examination were extracted from the Regional Cancer Center Stockholm-Gotland screening register. The mammography screening system in Sweden requires 2-view mammography of each breast. All examinations are assessed by double reading, with a binary decision by each reader: normal or abnormal (“flagged” for discussion). There were 25 different first-reader radiologists and 20 different second-reader radiologists. There is no defined designation of breast radiologists into first or second readers. However, the second reader is often more experienced than the first reader. In addition, when performing an assessment, the second reader can access the assessment already performed by the first reader. For any abnormal assessment, the examination proceeds to consensus discussion with another binary decision: normal or recall.
Data on cancer diagnoses, including tumor characteristics and radiologic assessments, were obtained through linkage with the Regional Cancer Center Stockholm-Gotland breast cancer quality register and screening register using Swedish personal identity numbers. Positive follow-up findings were determined by pathology-confirmed diagnosis at screening or within 12 months thereafter.

    All images were processed locally on our hardware by 3 different commercial AI CAD algorithms (AI-1, AI-2, and AI-3). The AI CAD algorithms have not been approved by the US Food and Drug Administration for use as independent readers. The vendors asked to remain anonymous, with the possibility for each vendor to later decide to waive anonymity. Each algorithm was described by the vendor according to the structure devised by the authors of this study (eAppendix in the Supplement). All 3 AI CAD algorithms processed the images, and no other data, and yielded a prediction score for each breast ranging between 0 and 1 for the suspicion of cancer, where 1 denotes the highest suspicion level. All analyses in this study were carried out on the examination level, which is equivalent to the patient level, based on the maximum score of the left or the right breast, whichever scored highest. The area under the receiver operating characteristic curve (AUC) was calculated for each of the 3 AI systems overall and by subgroups of age, mammographic density, and detection mode. To enable a comparison with the recorded binary decisions of radiologists, the output of each algorithm was dichotomized at a cut point corresponding to a specificity as close as possible to that of the first-reader radiologists (ie, 96.6%). Because our study sample was enriched with positive cases, we applied stratified bootstrapping (1000 samples) with a 14:1 ratio of healthy women to women receiving a diagnosis of cancer to mimic the ratio in the source screening cohort (approximately 0.5% screen-detected cancer among all screened women).
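
The dichotomization and resampling steps described in this paragraph can be sketched in a few lines. The following is an illustrative reconstruction under the stated design, not the study's actual code; the function names are our own:

```python
import numpy as np

def cut_point_for_specificity(neg_scores, target_spec=0.966):
    """Threshold on the continuous AI score such that ~target_spec of the
    healthy (negative) examinations fall at or below it (assessed normal);
    scores above it are dichotomized as abnormal."""
    return np.quantile(np.asarray(neg_scores, dtype=float), target_spec)

def stratified_bootstrap(pos_scores, neg_scores, ratio=14, n_boot=1000, seed=None):
    """Resample healthy:cancer examinations at ratio:1 (with replacement)
    to mimic the prevalence of the source screening cohort; yields one
    (positive, negative) score sample per bootstrap replicate."""
    rng = np.random.default_rng(seed)
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    n_pos = len(pos_scores)
    for _ in range(n_boot):
        yield (rng.choice(pos_scores, size=n_pos, replace=True),
               rng.choice(neg_scores, size=ratio * n_pos, replace=True))
```

The 96.6% quantile of the healthy-control scores reproduces the study's operating point; other target specificities (eg, the 88.9% Breast Cancer Surveillance Consortium benchmark) drop in the same way.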
We determined performance levels for all AI CAD algorithms and for all radiologist assessments (first reader, second reader, and consensus) for the following diagnostic metrics: sensitivity (true positives divided by all cases positive for cancer), specificity (true negatives divided by all negative cases), accuracy (true positives plus true negatives divided by all examinations), abnormal interpretation rate (abnormal assessments per 1000 examinations), cancer detection rate (true positives per 1000 examinations), false-negative rate (false negatives per 1000 examinations), and positive predictive value (true positives divided by all abnormal assessments). We also investigated whether an association existed between the number of abnormal interpretations and the number of cases positive for cancer detected for all 3 AI CAD algorithms alone and combined with the assessment of the first reader, the second reader, or both readers (the joint assessment was considered abnormal if at least 1 of the AI CAD algorithms or readers made an abnormal assessment). We examined the sensitivity and specificity when combining all 3 AI CAD algorithms (the joint assessment was considered abnormal if at least 1 AI CAD algorithm made an abnormal assessment), as well as when combining all algorithms and radiologists (the joint assessment was considered abnormal if at least 2 of the AI CAD algorithms or radiologists made an abnormal assessment).
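
These metric definitions and combination rules can be expressed compactly. The sketch below is our own illustration with hypothetical function names, not the study's analysis code:

```python
import numpy as np

def screening_metrics(flag, truth):
    """Screening performance from binary decisions.
    flag: 1 = abnormal assessment; truth: 1 = cancer within 12 months."""
    flag = np.asarray(flag, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(flag & truth))
    tn = int(np.sum(~flag & ~truth))
    fn = int(np.sum(~flag & truth))
    n = len(flag)
    return {
        "sensitivity": tp / truth.sum(),                 # TP / all cancers
        "specificity": tn / (~truth).sum(),              # TN / all healthy
        "accuracy": (tp + tn) / n,
        "abnormal_rate_per_1000": 1000 * flag.sum() / n,
        "cancer_detection_per_1000": 1000 * tp / n,
        "false_negative_per_1000": 1000 * fn / n,
        "ppv": tp / flag.sum() if flag.sum() else 0.0,   # TP / all abnormal
    }

def combine_or(*flags):
    """Joint assessment: abnormal if at least 1 reader/algorithm flags."""
    return np.logical_or.reduce([np.asarray(f, dtype=bool) for f in flags])

def combine_at_least(k, *flags):
    """Joint assessment: abnormal if at least k of the flags are abnormal."""
    return np.sum([np.asarray(f, dtype=bool) for f in flags], axis=0) >= k
```

`combine_or` corresponds to pairing a reader with an algorithm, and `combine_at_least(2, ...)` to the at-least-2-of-5 rule used when all 3 algorithms and both readers are pooled.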

    The computer software Stata, version 15.1 (StataCorp), was used for all statistical analyses. All statistical tests were 2-sided. The level for statistical significance was set at α = .05. We tested for differences in the AUC using the DeLong method. The AUC CIs were estimated by the sandwich variance estimator.
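
For reference, the AUC itself can be computed from the raw scores via the Mann-Whitney identity. This is a quadratic-time sketch for illustration only; the DeLong variance used for the CIs and tests above is omitted:

```python
import numpy as np

def auc_mann_whitney(pos_scores, neg_scores):
    """AUC = P(score_cancer > score_healthy) + 0.5 * P(tie),
    ie, the normalized Mann-Whitney U statistic, computed over
    all (cancer, healthy) score pairs."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())
```

A perfectly separating score yields 1.0, and an uninformative one 0.5; rank-based implementations scale better for cohorts of this size.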

    Results

    The final study sample, as described in eTable 1 in the Supplement, consisted of 8805 women and their screening examinations, of whom 739 received a diagnosis of breast cancer (positive) and a random sample of 8066 were healthy controls (negative). The median age at screening was 54.5 years (interquartile range, 47.4-63.5 years), and the median age at diagnosis was 59.8 years (interquartile range, 49.8-65.8 years). The median age for healthy controls was 54 years (interquartile range, 47-63 years). The median age for cases was 60 years (interquartile range, 50-66 years). The positive cases consisted of 618 (84%) screen-detected cancer cases and 121 (16%) clinically detected cancer cases within 12 months of the screening examination. Of those, 640 cases of cancer had an invasive component and 85 were in situ only.

    Table 1 reports the AUC for cancer detection for each AI algorithm overall and by subgroups. Overall, the AUC was 0.956 (95% CI, 0.948-0.965) for AI algorithm 1 (AI-1), 0.922 (95% CI, 0.910-0.934) for AI-2, and 0.920 (95% CI, 0.909-0.931) for AI-3. The differences between AI-1 and each of the other 2 AI CAD algorithms (AI-2 and AI-3) were statistically significant (P < .001), whereas there was no significant difference between AI-2 and AI-3 (P = .68). Within all analyzed subgroups, AI-1 had a significantly higher AUC than AI-2 and AI-3 (P < .001), whereas there was no significant difference between AI-2 and AI-3 for any subgroup. Specifically, the AUC for clinically detected cancer after negative radiologist assessment was 0.810 (95% CI, 0.767-0.852) for AI-1, 0.728 (95% CI, 0.677-0.779) for AI-2, and 0.744 (95% CI, 0.696-0.792) for AI-3. In addition, for all algorithms, the AUC was significantly lower for younger vs older women and for mammograms with higher vs lower density. For AI-1, the AUC was 0.974 for women 55 years or older and 0.925 for women younger than 55 years; 0.933 for mammograms with high percent density and 0.976 for mammograms with low percent density. In a secondary analysis, after extending the study population with the 174 women who received a diagnosis of cancer between 12 and 23 months after screening, the AUC was 0.916 (95% CI, 0.905-0.928) for AI-1, 0.859 (95% CI, 0.843-0.874) for AI-2, and 0.877 (95% CI, 0.964-0.890) for AI-3. A box plot of the raw estimated scores of each algorithm is shown in eFigure 4 in the Supplement.

    The results of the comparisons with radiologists’ assessments are presented in Table 2. The total simulated screening population consisted of 113 663 examinations (of which 739 were diagnosed as positive for breast cancer). The sensitivity was 81.9% (95% CI, 78.9%-84.6%) for AI-1, 67.0% (95% CI, 63.5%-70.4%) for AI-2, 67.4% (95% CI, 63.9%-70.8%) for AI-3, 77.4% (95% CI, 74.2%-80.4%) for the first reader, 80.1% (95% CI, 77.0%-82.9%) for the second reader, and 85.0% (95% CI, 82.2%-87.5%) for the consensus discussion. There was a significant sensitivity difference between AI-1 and the other 2 AI CAD algorithms (P < .001) as well as between AI-1 and the first reader (P = .03). However, the analysis did not show a difference between AI-1 and the second reader (P = .40) or the consensus discussion (P = .11). Specificity for the AI CAD algorithms was preselected to match the specificity of the first reader and should therefore not be compared. The specificity for the second reader was 97.2% (95% CI, 97.1%-97.3%), and for the consensus discussion, it was 98.5% (95% CI, 98.4%-98.6%). Potential cut points for the continuous score of each algorithm to achieve various sensitivity levels are presented in eTable 2 in the Supplement. When choosing an operating point corresponding to the Breast Cancer Surveillance Consortium benchmark of 88.9% specificity, the sensitivity was 88.6% for AI-1, 80.0% for AI-2, and 80.2% for AI-3 (eTable 3 in the Supplement). Examples of mammograms of cancer identified by AI CAD but missed by radiologists, and vice versa, are shown in eFigure 2 and eFigure 3 in the Supplement.

    The results of the combined assessment across all 3 algorithms showed a sensitivity of 86.7% (95% CI, 84.2%-89.2%) and a specificity of 92.5% (95% CI, 92.3%-92.7%). Compared with the best algorithm, that is, AI-1, the combined system had a marginally higher sensitivity (P = .01) but a much lower specificity (P < .001). As a comparison, AI-1 alone achieved 86.3% sensitivity at 92.5% specificity, and 79.3% sensitivity at 98.0% specificity (Figure).

    Table 3 gives the simulated scenarios in which the binary decisions by the 3 AI CAD algorithms and the readers were combined. Of 739 total cancer cases, there were 655 screening examinations assessed as abnormal for the first reader combined with AI-1 (88.6% sensitivity), 620 combined with AI-2 (83.9% sensitivity), 623 combined with AI-3 (84.3% sensitivity), and 640 combined with the second reader (86.6% sensitivity). Of 113 663 total examinations in the simulated screening cohort, there were 7851 examinations assessed as abnormal for the first reader combined with AI-1 (93.0% specificity), 7998 combined with AI-2 (92.9% specificity), 7847 combined with AI-3 (92.9% specificity), and 5484 combined with the second reader (95.1% specificity). For the first reader, the relative increase in cancer detection was 15% when adding AI-1 and 12% when adding the second reader; the relative increase in abnormal interpretations was 78% when adding AI-1 and 24% when adding the second reader. These results, separated into in situ cancer, invasive cancer, and stage II or higher breast cancer, are presented in eTable 4 in the Supplement.
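
As a sanity check, the relative increases quoted in this paragraph can be reconstructed from the counts reported in the text. This is a back-of-the-envelope illustration using the sensitivity and specificity given earlier, not the authors' computation:

```python
# Counts reported in the text: 739 cancers, 112 924 healthy women;
# first reader alone: 77.4% sensitivity, 96.6% specificity.
tp_first = round(0.774 * 739)            # cancers flagged by first reader alone
fp_first = round((1 - 0.966) * 112924)   # first reader's false positives
abn_first = tp_first + fp_first          # abnormal interpretations alone

# 655 cancers flagged by first reader + AI-1; 640 by first + second reader;
# 7851 abnormal interpretations for first reader + AI-1.
rel_detect_ai1 = 655 / tp_first - 1      # ~0.15, matching the reported 15%
rel_detect_second = 640 / tp_first - 1   # ~0.12, matching the reported 12%
rel_abnormal_ai1 = 7851 / abn_first - 1  # ~0.78, matching the reported 78%
```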

    When combining all 3 algorithms and 2 reader radiologists (at least 2 had to make a positive assessment), the estimated sensitivity was 87.4% (95% CI, 85.0%-89.8%), and the estimated specificity was 95.9% (95% CI, 95.7%-96.0%). The sensitivity of the combined algorithms and radiologists was higher than AI-1 (P = .003) and higher than second readers (P < .001). The specificity was somewhat lower than AI-1 (P < .001), the second readers (P < .001), and the consensus decision (P < .001).

    Discussion

    The present study had 3 main findings. First, a difference was found in the AUC among the 3 AI CAD algorithms, from 0.920 to 0.956. Second, the best computer algorithm reached, and in some comparisons surpassed, the performance level of radiologists in assessing screening mammograms, obtaining 81.9% sensitivity when operating at 96.6% specificity in a simulated study population of 113 663 screening examinations based on an original sample of 8805 women from a population-based screening cohort. Third, combining the first reader with the best algorithm identified more cancer cases than combining the first and second readers.

    The proportion of clinically detected interval cancer in our study was 16%, which is lower than the 28% reported in a prior European study.14 The lower proportion may be explained by our exclusion criteria for cancer diagnosed later than 12 months after screening, which was chosen to increase the likelihood that cancer was present in the breast at the time of examination.

    The best-performing algorithm, AI-1, had an overall AUC of 0.956 for the detection of cancer at screening or within 12 months thereafter. The 2 other AI CAD algorithms had an overall AUC of 0.922 and 0.920. Prior studies have reported AUC values between 0.840 and 0.959.9-11,15-18 The subgroup analysis of the AUC in our study showed a decreased performance for younger vs older women and for higher vs lower breast density on mammography. This is in line with prior studies showing that there is an increase of interval cancer cases, that is, decreased mammographic sensitivity, for younger women and for women with higher mammographic density.19-22 In our specific analysis of interval cancer detected within 12 months after a negative screening examination, AI-1 achieved an AUC of 0.810, suggesting that there is potential for the AI algorithms to promote earlier cancer detection and that there are suspicious findings present in many of those mammograms. The AI-1 algorithm was superior to the other 2 algorithms across all subgroups. The differences between AI-2 and AI-3 were minimal across all subgroups. The cause of the stronger performance of AI-1 was not the subject of this study. However, our reading of the algorithm descriptions submitted by the vendors (eAppendix in the Supplement) revealed the following differences, which might be part of the explanation: AI-1 was trained on more data than the other 2 were, had pixel-level annotations for training, and had a higher capacity backbone (ResNet34); in addition, the data augmentation included adjustment of contrast and brightness. The largest training population for the superior-performing AI-1 consisted of images from GE equipment and images of South Korean women. Although we do not have ethnic descriptors of our study population, the vast majority of women in Stockholm are White, and all images in our study were acquired on Hologic equipment.
Against this background, the superior performance of AI-1 is an interesting example of robustness. In training AI algorithms for mammographic cancer detection, matching ethnic and equipment distributions between the training population and the clinical test population may not be of highest importance.
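As an illustration of the AUC metric discussed above (not the study's actual code, and with made-up scores and labels), the area under the ROC curve can be computed directly as the probability that a randomly chosen cancer case receives a higher score than a randomly chosen healthy case:

```python
# Illustrative sketch: AUC as the Mann-Whitney statistic, equivalent to the
# trapezoidal area under the ROC curve. All scores and labels are invented.

def auc(scores, labels):
    """P(score of a random cancer case > score of a random healthy case),
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical examination-level AI scores (0-1) and cancer labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   1,   0]

print(round(auc(scores, labels), 3))  # → 0.75
```

The same function applied separately within subgroups (eg, by age or breast density) yields the kind of subgroup AUCs reported above.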

    When comparing binary decisions by the AI CAD algorithms, operating at the specificity level of the first reader, with the actual recorded assessments by radiologists, we concluded that AI-1 was superior both to the other 2 AI CAD algorithms and to the first readers. The specificity was 96.6% by design, and the resulting sensitivity of AI-1 was 81.9%; at an operating point corresponding to 88.9% specificity, the resulting sensitivity was 88.6%. This can be compared with the Breast Cancer Surveillance Consortium benchmarks of 86.9% sensitivity at 88.9% specificity.5 In this retrospective analysis, it is apparent that AI-1 fulfills both the specificity and sensitivity criteria. It is well known that the specificity of European breast radiologists is generally higher, and the sensitivity lower, than that of their US colleagues.23 It is important to bear in mind that, in contrast to the original readers, the AI CAD algorithms had access neither to prior mammograms nor to patient reports of breast symptoms.
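The operating-point construction above can be sketched as follows: pick the score threshold that achieves a target specificity on the healthy examinations, then read off the sensitivity on the cancer cases at that threshold. This is a minimal sketch with invented scores, not the study's procedure in detail:

```python
# Sketch: fix an operating point at a target specificity, then measure
# the resulting sensitivity. Scores below are made up for illustration.
import math

def threshold_at_specificity(healthy, target_spec):
    """Flag abnormal when score >= t. Specificity is the fraction of healthy
    scores below t; return t just above the required quantile."""
    s = sorted(healthy)
    k = math.ceil(target_spec * len(s))  # true negatives required
    return s[k - 1] + 1e-9

def sensitivity_at(cancer, t):
    return sum(c >= t for c in cancer) / len(cancer)

healthy = [i / 100 for i in range(100)]  # 100 hypothetical healthy scores
cancer = [0.99, 0.98, 0.97, 0.5, 0.2]    # 5 hypothetical cancer scores

t = threshold_at_specificity(healthy, 0.966)
print(round(sensitivity_at(cancer, t), 2))  # → 0.6
```

In the study's setting, the healthy-case threshold would be derived at 96.6% specificity (matching the first readers) for each algorithm separately.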

    The foregoing discussion focuses on comparing the algorithms and radiologists applied separately. However, we know from double-reading screening programs that combining assessments can improve performance. Taking the first-reader assessments as the starting point, we found that 2 human readers agreed more on abnormal interpretations and on false positives than an algorithm and a human reader did. Adding an AI CAD algorithm to the first reader would likely find more true-positive cases, but a much larger proportion of false-positive examinations would have to be handled in the ensuing consensus discussion. Changing the perspective and taking the AI CAD assessments as the starting point, cancer detection was estimated to increase by merely 8% when combined with the first reader, whereas abnormal assessments (true positives plus false positives) increased by 77%. Even though there is an absolute diagnostic gain when adding the first reader to the AI CAD algorithm, a cost-benefit analysis is required, in any given setting, to determine the economic implications of adding a human reader at all.
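The combination rule implied above, where an examination is flagged abnormal if either the AI algorithm or the reader flags it, can be sketched with made-up decisions (all values below are invented, not study data):

```python
# Sketch: "either flags it" (logical OR) combination of an AI algorithm
# and a human reader, counting abnormal interpretations and detected cancers.

def combine_or(ai_flags, reader_flags, cancer_labels):
    flagged = [a or r for a, r in zip(ai_flags, reader_flags)]
    abnormal = sum(flagged)                                   # workload
    true_pos = sum(f and y for f, y in zip(flagged, cancer_labels))
    return abnormal, true_pos

# Hypothetical per-examination decisions and ground truth.
ai     = [1, 1, 0, 0, 1, 0]
reader = [1, 0, 1, 0, 0, 0]
cancer = [1, 1, 1, 0, 0, 0]

print(combine_or(ai, reader, cancer))  # → (4, 3)
```

The trade-off described in the text is visible even in this toy example: the OR rule detects more cancers than either party alone, but every additional flag, true or false, feeds into the consensus discussion.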

    When assessing the combination of the 3 algorithms through voting systems, we found that combining all 3 did not achieve a markedly higher performance than using the best algorithm alone. Likewise, compared with the diagnostic performance of combining the best algorithm with the first reader, we found no clear advantage of a voting system involving all algorithms and all readers. Given that commercial AI systems will likely be priced at a notable proportion of the cost of the corresponding radiologist time, we view the most realistic implementation as 1 radiologist plus 1 algorithm. However, as shown by the 77% increase in abnormal findings, even though this implementation would obviate the need for 1 radiologist in screening assessment, it would increase the workload for the 2 radiologists involved in the consensus discussions. Before clinical practice can change, it is critically important to conduct prospective clinical studies as well as a thorough examination of the ethical, legal, and societal aspects of replacing a medical professional with computer software.
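A voting system of the kind evaluated above can be sketched as a simple threshold on agreement: flag an examination as abnormal when at least k of the 3 algorithms flag it. The decisions below are invented for illustration:

```python
# Sketch: k-of-3 voting over the algorithms' binary decisions.
# k=2 is majority voting; k=1 reduces to logical OR; k=3 to logical AND.

def vote(decisions, k):
    """decisions: per-algorithm binary flags for one examination."""
    return sum(decisions) >= k

exams = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]  # hypothetical flags
print([vote(d, 2) for d in exams])  # → [True, False, True, False]
```

Varying k trades sensitivity (low k) against specificity (high k), which is why the voting systems in the study were compared at matched operating points.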

    As a development of regular 2-dimensional (2-D) mammography, on which the present study was based, digital breast tomosynthesis, or 3-D mammography, has become increasingly accepted as an alternative screening modality. There are reports that the interval cancer rate decreases with the use of tomosynthesis.24 Because tomosynthesis appears to make some signs of cancer more conspicuous and easier for radiologists to identify, it will be an interesting topic for future studies to examine whether AI algorithms applied to 2-D mammography reduce the clinical utility of 3-D mammography, or whether AI algorithms trained on 3-D mammography can further improve diagnostic performance.

    Strengths and Limitations

    The major strength of our study is that we performed an independent evaluation of several AI CAD algorithms, none of which had ever been exposed to images from our institution. Additional strengths include the comparative analysis of AI CAD algorithms and radiologists, and the fact that our large set of examinations was chosen in a representative manner from a population-based screening cohort. The study limitations are that our results applied to a version of each algorithm that has since been replaced by a more recent version, that the examinations were from a Swedish setting, and that we did not analyze performance for women with implants or prior breast cancer. For computational reasons, we used a cancer-enriched cohort; to address the issues raised for studies of cancer-enriched populations,25 we therefore used inverse probability weighted bootstrapping to simulate a study population with a cancer prevalence matching that of a screening cohort. A weakness of our study is that the AI CAD algorithms did not consider prior mammograms, hormonal medication, or breast symptoms, which puts them at a disadvantage compared with radiologists.
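The prevalence-matching resampling mentioned above can be sketched as follows. This is a deliberately simplified illustration, not the study's weighting scheme: each bootstrap draw is a cancer case with probability equal to the target screening prevalence, and a healthy case otherwise, so the expected prevalence of the resampled population matches a screening cohort. Cohort sizes and the prevalence value are invented:

```python
# Sketch: prevalence-matched bootstrap resampling of a cancer-enriched
# cohort. The 0.5% target prevalence and cohort sizes are made up.
import random

def prevalence_matched_bootstrap(labels, target_prev, n, rng):
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Each draw is a cancer case with probability target_prev.
    return [rng.choice(pos) if rng.random() < target_prev else rng.choice(neg)
            for _ in range(n)]

rng = random.Random(0)                    # fixed seed for reproducibility
labels = [1] * 739 + [0] * 1000           # hypothetical enriched cohort
sample = prevalence_matched_bootstrap(labels, 0.005, 10_000, rng)
prev = sum(labels[i] for i in sample) / len(sample)
print(round(prev, 3))
```

Repeating such draws and recomputing the performance metric on each resample yields confidence intervals appropriate for a screening-prevalence population rather than the enriched one.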

    Conclusions

    In conclusion, our results suggest that the best computer algorithm evaluated in this study assessed screening mammograms with a diagnostic performance on par with, or exceeding, that of radiologists in a retrospective cohort of women undergoing regular screening. This achievement is considerable, bearing in mind that the radiologists, but not the AI algorithms, had access to additional information such as prior mammograms and reported breast symptoms. We believe the time has come to evaluate AI CAD algorithms as independent readers in prospective clinical studies within mammography screening programs.

    Article Information

    Accepted for Publication: June 2, 2020.

    Corresponding Author: Fredrik Strand, MD, PhD, Department of Oncology-Pathology, Karolinska Institute, Karolinska vägen, A2:07, 171 64 Solna, Sweden (fredrik.strand@ki.se).

    Published Online: August 27, 2020. doi:10.1001/jamaoncol.2020.3321

    Author Contributions: Drs Salim and Strand had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

    Concept and design: Salim, Azavedo, Smith, Eklund, Strand.

    Acquisition, analysis, or interpretation of data: Salim, Wåhlin, Dembrower, Foukakis, Liu, Eklund, Strand.

    Drafting of the manuscript: Salim, Eklund, Strand.

    Critical revision of the manuscript for important intellectual content: All authors.

    Statistical analysis: Salim, Smith, Eklund, Strand.

    Obtained funding: Strand.

    Administrative, technical, or material support: Wåhlin, Dembrower, Liu, Strand.

    Supervision: Azavedo, Foukakis, Smith, Eklund, Strand.

    Conflict of Interest Disclosures: Dr Foukakis reported receiving grants from Pfizer outside the submitted work. Dr Eklund reported receiving grants from Swedish Research Council and from the Swedish Cancer Society during the conduct of the study. Dr Strand reported receiving grants from Stockholm City Council during the conduct of the study; and receiving personal fees from Collective Minds Radiology outside the submitted work. No other disclosures were reported.

    Funding/Support: This study was funded by the Stockholm County Council Dnr 20170802 award.

    Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

    Additional Contributions: The 3 commercially available artificial intelligence algorithms assessed in this study were provided by the 3 companies that developed them; no exchange of financial funds or other benefits was involved.

    Additional Information: This work originated in the Department of Oncology-Pathology, Karolinska Institute.

    References
    1. Marmot MG, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. Br J Cancer. 2013;108(11):2205-2240. doi:10.1038/bjc.2013.177
    2. Seely JM, Alhassan T. Screening for breast cancer in 2018—what should we be doing today? Curr Oncol. 2018;25(suppl 1):S115-S124. doi:10.3747/co.25.3770
    3. Giess CS, Wang A, Ip IK, Lacson R, Pourjabbar S, Khorasani R. Patient, radiologist, and examination characteristics affecting screening mammography recall rates in a large academic practice. J Am Coll Radiol. 2019;16(4, pt A):411-418. doi:10.1016/j.jacr.2018.06.016
    4. Barlow WE, Chi C, Carney PA, et al. Accuracy of screening mammography interpretation by characteristics of radiologists. J Natl Cancer Inst. 2004;96(24):1840-1850. doi:10.1093/jnci/djh333
    5. Lehman CD, Arao RF, Sprague BL, et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology. 2017;283(1):49-58. doi:10.1148/radiol.2016161174
    6. Elmore JG, Jackson SL, Abraham L, et al. Variability in interpretive performance at screening mammography and radiologists’ characteristics associated with accuracy. Radiology. 2009;253(3):641-651. doi:10.1148/radiol.2533082308
    7. Lehman CD, Wellman RD, Buist DSM, Kerlikowske K, Tosteson AN, Miglioretti DL; Breast Cancer Surveillance Consortium. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med. 2015;175(11):1828-1837. doi:10.1001/jamainternmed.2015.5231
    8. Ciatto S, Del Turco MR, Risso G, et al. Comparison of standard reading and computer aided detection (CAD) on a national proficiency test of screening mammography. Eur J Radiol. 2003;45(2):135-138. doi:10.1016/S0720-048X(02)00011-6
    9. Rodriguez-Ruiz A, Lång K, Gubern-Merida A, et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J Natl Cancer Inst. 2019;111(9):916-922. doi:10.1093/jnci/djy222
    10. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89-94. doi:10.1038/s41586-019-1799-6
    11. Akselrod-Ballin A, Chorev M, Shoshan Y, et al. Predicting breast cancer by applying deep learning to linked health records and mammograms. Radiology. 2019;292(2):331-342. doi:10.1148/radiol.2019182622
    12. Wu K, Wu E, Wu Y, et al. Validation of a deep learning mammography model in a population with low screening rates. Preprint. arXiv:1911.00364v1. Posted online November 1, 2019.
    13. Dembrower K, Lindholm P, Strand F. A multi-million mammography image dataset and population-based screening cohort for the training and evaluation of deep neural networks—the Cohort of Screen-Aged Women (CSAW). J Digit Imaging. 2020;33(2):408-413. doi:10.1007/s10278-019-00278-0
    14. Törnberg S, Kemetli L, Ascunce N, et al. A pooled analysis of interval cancer rates in six European countries. Eur J Cancer Prev. 2010;19(2):87-93. doi:10.1097/CEJ.0b013e32833548ed
    15. Kim H-E, Kim HH, Han B-K, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. Published online February 6, 2020. doi:10.1016/S2589-7500(20)30003-0
    16. Wu N, Phang J, Park J, et al. Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE Trans Med Imaging. 2020;39(4):1184-1194. doi:10.1109/TMI.2019.2945514
    17. Rodríguez-Ruiz A, Krupinski E, Mordang J-J, et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology. 2019;290(2):305-314. doi:10.1148/radiol.2018181371
    18. Schaffter T, Buist DSM, Lee CI, et al; DM DREAM Consortium. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open. 2020;3(3):e200265. doi:10.1001/jamanetworkopen.2020.0265
    19. Buist DSM, Porter PL, Lehman C, Taplin SH, White E. Factors contributing to mammography failure in women aged 40-49 years. J Natl Cancer Inst. 2004;96(19):1432-1440. doi:10.1093/jnci/djh269
    20. Boyd NF, Guo H, Martin LJ, et al. Mammographic density and the risk and detection of breast cancer. N Engl J Med. 2007;356(3):227-236. doi:10.1056/NEJMoa062790
    21. Klemi PJ, Toikkanen S, Räsänen O, Parvinen I, Joensuu H. Mammography screening interval and the frequency of interval cancers in a population-based screening. Br J Cancer. 1997;75(5):762-766. doi:10.1038/bjc.1997.135
    22. Tabár L, Faberberg G, Day NE, Holmberg L. What is the optimum interval between mammographic screening examinations? an analysis based on the latest results of the Swedish two-county breast cancer screening trial. Br J Cancer. 1987;55(5):547-551. doi:10.1038/bjc.1987.112
    23. Domingo L, Hofvind S, Hubbard RA, et al. Cross-national comparison of screening mammography accuracy measures in U.S., Norway, and Spain. Eur Radiol. 2016;26(8):2520-2528. doi:10.1007/s00330-015-4074-8
    24. McDonald ES, Oustimov A, Weinstein SP, Synnestvedt MB, Schnall M, Conant EF. Effectiveness of digital breast tomosynthesis compared with digital mammography: outcomes analysis from 3 years of breast cancer screening. JAMA Oncol. 2016;2(6):737-743. doi:10.1001/jamaoncol.2015.5536
    25. Houssami N, Kirkpatrick-Jones G, Noguchi N, Lee CI. Artificial Intelligence (AI) for the early detection of breast cancer: a scoping review to assess AI’s potential in breast screening practice. Expert Rev Med Devices. 2019;16(5):351-362. doi:10.1080/17434440.2019.1610387