
Lack of Transparency and Potential Bias in Artificial Intelligence Data Sets and Algorithms: A Scoping Review

Author Affiliations
  • 1Stanford Department of Dermatology, Stanford School of Medicine, Redwood City, California
  • 2Stanford Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California
  • 3Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York
  • 4currently a medical student at Icahn School of Medicine at Mount Sinai, New York, New York
  • 5Dermatology Service, Memorial Sloan Kettering Cancer Center, New York, New York
  • 6Department of Electrical Engineering, Stanford University, Stanford, California
  • 7Department of Biomedical Data Science, Stanford University, Stanford, California
  • 8Chan Zuckerberg Biohub, San Francisco, California
JAMA Dermatol. Published online September 22, 2021. doi:10.1001/jamadermatol.2021.3129
Key Points

Question  How transparent are the data sets used to develop artificial intelligence (AI) algorithms in dermatology, and what potential pitfalls exist in the data?

Findings  In this scoping review of 70 studies addressing the intersection of dermatology and AI that were published between January 1, 2015, and November 1, 2020, most data set descriptions were inadequate for analysis and replication, disease labels did not meet the gold standard, and information on patient skin tone and race or ethnicity was often not reported. In addition, most data sets and models have not been shared publicly.

Meaning  These findings suggest that the applicability and generalizability of AI algorithms rely on high-quality training and testing data sets; the sparsity of data set descriptions, lack of data set and model transparency, inconsistency in disease labels, and lack of reporting on patient diversity present concerns for the clinical translation of these algorithms.


Importance  Clinical artificial intelligence (AI) algorithms have the potential to improve clinical care, but fair, generalizable algorithms depend on the clinical data on which they are trained and tested.

Objective  To assess whether data sets used for training diagnostic AI algorithms addressing skin disease are adequately described and to identify potential sources of bias in these data sets.

Data Sources  In this scoping review, PubMed was used to search for peer-reviewed research articles published between January 1, 2015, and November 1, 2020, with the following paired search terms: deep learning and dermatology, artificial intelligence and dermatology, deep learning and dermatologist, and artificial intelligence and dermatologist.

Study Selection  Studies that developed or tested an existing deep learning algorithm for triage, diagnosis, or monitoring using clinical or dermoscopic images of skin disease were selected, and the articles were independently reviewed by 2 investigators to verify that they met selection criteria.

Consensus Process  Data set audit criteria were determined by consensus of all authors after reviewing existing literature to highlight data set transparency and sources of bias.

Results  A total of 70 unique studies were included. Among these studies, 1 065 291 images were used to develop or test AI algorithms, of which only 257 372 (24.2%) were publicly available. Only 14 studies (20.0%) included descriptions of patient ethnicity or race in at least 1 data set used. Only 7 studies (10.0%) included any information about skin tone in at least 1 data set used. Thirty-six of the 56 studies developing new AI algorithms for cutaneous malignant neoplasms (64.3%) met the gold standard criteria for disease labeling. Public data sets were cited more often than private data sets, suggesting that public data sets contribute more to new development and benchmarks.

Conclusions and Relevance  This scoping review identified 3 issues in data sets that are used to develop and test clinical AI algorithms for skin disease that should be addressed before clinical translation: (1) sparsity of data set characterization and lack of transparency, (2) nonstandard and unverified disease labels, and (3) inability to fully assess patient diversity used for algorithm development and testing.
