Table. US Patient Cohorts Used for Training Clinical Machine Learning Algorithms, by State
Research Letter
September 22/29, 2020

Geographic Distribution of US Cohorts Used to Train Deep Learning Algorithms

Author Affiliations
  • 1Department of Bioengineering, Stanford University, Stanford, California
  • 2Department of Radiology, Stanford University School of Medicine, Stanford, California
JAMA. 2020;324(12):1212-1213. doi:10.1001/jama.2020.12067

Advances in machine learning, specifically the subfield of deep learning, have produced algorithms that perform image-based diagnostic tasks with accuracy approaching or exceeding that of trained physicians. Despite their well-documented successes, these machine learning algorithms are vulnerable to cognitive and technical bias,1 including bias introduced when an insufficient quantity or diversity of data is used to train an algorithm.2,3 We investigated an understudied source of systemic bias in clinical applications of deep learning—the geographic distribution of patient cohorts used to train algorithms.

Methods

We searched PubMed for peer-reviewed articles published online or in print between January 1, 2015, and December 31, 2019, that trained a deep learning algorithm to perform an image-based diagnostic task and benchmarked performance against (or in tandem with) physicians across 6 clinical disciplines: radiology, ophthalmology, dermatology, pathology, gastroenterology, and cardiology. Search terms included deep learning and the clinical specialties of interest, along with Medical Subject Heading synonyms. Results were supplemented by searching reference lists of relevant publications and reviews. Studies that used at least 1 US patient cohort for algorithm training were included. All authors gave input to the search strategy. One author (A.K.) performed the search, screened articles, and extracted data, then repeated the process a second time after a washout period. The final set of included articles and extracted data was reviewed by all authors.
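
To make the retrieval step concrete, the following is a minimal sketch of how a date-restricted PubMed search of this kind could be issued programmatically, using Biopython's Entrez E-utilities wrapper. The letter does not report the exact query strings, so the query term, contact email, and result cap below are illustrative assumptions rather than the authors' actual search.

from Bio import Entrez

# NCBI asks for a contact address with every E-utilities request (placeholder here).
Entrez.email = "researcher@example.org"

# Illustrative query: "deep learning" combined with one specialty and a MeSH synonym;
# the search described above covered 6 specialties and additional synonyms.
query = '"deep learning" AND ("radiology" OR "Radiology"[MeSH Terms])'

handle = Entrez.esearch(
    db="pubmed",
    term=query,
    datetype="pdat",        # restrict by publication date
    mindate="2015/01/01",
    maxdate="2019/12/31",
    retmax=5000,            # assumed cap on returned PMIDs
)
record = Entrez.read(handle)
handle.close()

print(f"{record['Count']} records matched; first PMIDs: {record['IdList'][:5]}")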

For each state, the number of studies that used at least 1 patient cohort from that state was determined. Patient cohorts provided by a hospital or health system were attributed to the home state of the institution unless an alternate method for assembling the cohort was described. If cohorts were ambiguous, we communicated with corresponding authors for clarification. Cohorts used only for testing or validation of an algorithm were not included.

Some patient cohorts were intrinsically geographically heterogeneous or ambiguous, such as cohorts from large National Institutes of Health (NIH) studies or clinical trials (spanning 5 or more states) and data from industry repositories. These cohorts were labeled “multisite,” and their number and type were characterized separately.
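
As a minimal sketch of the tallying described in the preceding two paragraphs, the example below attributes each study's training cohorts to states, sets multisite cohorts aside, and counts each study at most once per state. The data structure and the example records are hypothetical; they stand in for the data extracted during screening.

from collections import defaultdict

# Hypothetical extracted data: each included study maps to the labels of its
# training cohorts, either a 2-letter state code or "multisite" (NIH/consortium,
# industry, online atlas, or second opinion cohorts).
study_training_cohorts = {
    "study_001": ["CA", "NY"],        # two geographically identifiable cohorts
    "study_002": ["multisite"],       # e.g., an NIH consortium dataset
    "study_003": ["CA", "multisite"],
}

studies_per_state = defaultdict(set)
multisite_only = []

for study, cohorts in study_training_cohorts.items():
    states = {c for c in cohorts if c != "multisite"}
    for state in states:
        # A study counts once for a state regardless of how many cohorts it used there.
        studies_per_state[state].add(study)
    if not states:
        multisite_only.append(study)

for state, studies in sorted(studies_per_state.items()):
    print(f"{state}: {len(studies)} studies used at least 1 training cohort from this state")
print(f"{len(multisite_only)} studies trained exclusively on multisite cohorts")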

Results

Of the 2606 studies identified by the search, 74 met inclusion criteria: radiology (n = 35), ophthalmology (n = 16), dermatology (n = 11), pathology (n = 8), gastroenterology (n = 2), and cardiology (n = 2). (The list of studies is available from the authors on request.)

Fifty-six studies (76%) trained algorithms using at least 1 geographically identifiable cohort. Cohorts from California appeared in 22 of the 56 studies (39%), cohorts from Massachusetts in 15 (27%), and cohorts from New York in 14 (25%) (Table). Forty of 56 studies (71%) used a patient cohort from at least 1 of these 3 states. Among the remaining 47 states, 34 did not contribute any patient cohorts, and the other 13 each contributed between 1 and 5 cohorts (Table).

Eighteen of 74 studies (24%) used multisite cohorts exclusively; across all studies, 23 multisite cohorts were identified. Thirteen (57%) of 23 were from existing NIH studies or consortia, 7 (30%) were from industry trials or databases, 2 (9%) were from online image atlases, and 1 (4%) was from an academic second opinion service.
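
The proportions reported above follow directly from the stated counts; the short check below reproduces them, rounding to whole percentages. All counts are taken verbatim from this Results section.

# Reproducing the reported proportions from the stated counts (Results text above).
counts = {
    "geographically identifiable training cohort": (56, 74),
    "California cohort": (22, 56),
    "Massachusetts cohort": (15, 56),
    "New York cohort": (14, 56),
    "cohort from at least 1 of CA, MA, or NY": (40, 56),
    "multisite cohorts used exclusively": (18, 74),
    "multisite cohorts from NIH studies or consortia": (13, 23),
    "multisite cohorts from industry trials or databases": (7, 23),
    "multisite cohorts from online image atlases": (2, 23),
    "multisite cohorts from a second opinion service": (1, 23),
}

for label, (numerator, denominator) in counts.items():
    print(f"{label}: {numerator}/{denominator} = {round(100 * numerator / denominator)}%")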

Discussion

In clinical applications of deep learning across multiple disciplines, algorithms trained on US patient data were disproportionately trained on cohorts from California, Massachusetts, and New York, with little to no representation from the remaining 47 states. California, Massachusetts, and New York may have economic, educational, social, behavioral, ethnic, and cultural features that are not representative of the entire nation; algorithms trained primarily on patient data from these states may generalize poorly, which is an established risk when implementing diagnostic algorithms in new geographies.4-6

Limitations include the restriction of the search to a single database, reliance on a single individual for the search and data extraction, and the lack of routinely available patient-level demographic data. Geographic location represents only one measure of the diversity of a cohort.

Both for technical performance and for fundamental reasons of equity and justice, the biomedical research community—academia, industry, and regulatory bodies—should take steps to ensure that machine learning training data mirror the populations for which algorithms ultimately will be used.

Section Editor: Jody W. Zylke, MD, Deputy Editor.
Article Information

Corresponding Author: Amit Kaushal, MD, PhD, Department of Bioengineering, Stanford University, 443 Via Ortega Dr, MC 4245, Shriram Room 219, Stanford, CA 94305 (akaushal@stanford.edu).

Accepted for Publication: June 22, 2020.

Author Contributions: Dr Kaushal had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: All authors.

Acquisition, analysis, or interpretation of data: All authors.

Drafting of the manuscript: Kaushal.

Critical revision of the manuscript for important intellectual content: All authors.

Statistical analysis: Kaushal.

Administrative, technical, or material support: All authors.

Supervision: Altman, Langlotz.

Conflict of Interest Disclosures: Dr Altman reported that he is a cofounder, advisor, and shareholder of Personalis; is a consultant or advisor to Pfizer, GlaxoSmithKline, Withhealth, Cogen Therapeutics, Goldfinch Bio, United Health Group, Myome, BridgeBio, and Primavera Capital; and is an advisor to UK Biobank and Swiss Personalized Health Network. Dr Langlotz reported that he serves on the board of directors and is a shareholder of BunkerHill; is an advisor and option holder to whiterabbit.ai, Nines, GalileoCDS, and Sirona Medical; has received honorarium and travel reimbursement from Canon Medical and travel reimbursement from Siemens; and receives institutional support by grants and gifts from GE Healthcare, Siemens Medical, Philips, Google, Carestream, IBM, IDEXX, Nines, and Hospital Israelita Albert Einstein. No other disclosures were reported.

Additional Contributions: We thank John Borghi, PhD, Lane Medical Library, Stanford University School of Medicine, for consulting on the development of the literature search (without compensation).

References
1. Liu Y, Chen PC, Krause J, Peng L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA. 2019;322(18):1806-1816. doi:10.1001/jama.2019.16489
2. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
3. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 2018;154(11):1247-1248. doi:10.1001/jamadermatol.2018.2348
4. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15(11):e1002683. doi:10.1371/journal.pmed.1002683
5. Wang X, Liang G, Zhang Y, Blanton H, Bessinger Z, Jacobs N. Inconsistent performance of deep learning models on mammogram classification. J Am Coll Radiol. 2020;17(6):796-803. doi:10.1016/j.jacr.2020.01.006
6. Warner HR, Toronto AF, Veasey LG, Stephenson R. A mathematical approach to medical diagnosis: application to congenital heart disease. JAMA. 1961;177:177-183. doi:10.1001/jama.1961.03040290005002