Assessing Biases in Medical Decisions via Clinician and AI Chatbot Responses to Patient Vignettes

This cross-sectional study compares clinician and artificial intelligence (AI) chatbot responses to patient vignettes used to identify bias in medical decisions.


Introduction
Artificial intelligence (AI) chatbots have transformed how we access information provided by large language models. However, AI models may carry inherent biases, often mirroring the systematic inequalities present in our society.1 As patients and clinicians increasingly adopt these tools, it is essential to identify and mitigate biases to ensure the technology helps reduce health disparities rather than propagate them. This study aimed to evaluate AI chatbot responses to clinical questions previously tested in large samples of clinicians, using published vignettes to examine established biases in medicine related to gender, race and ethnicity, and socioeconomic status (SES).

Methods
This cross-sectional study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. The study was deemed exempt by the Stanford University institutional review board, and informed consent was waived because the study did not involve human participants. We selected 19 clinical vignettes in cardiology, emergency medicine, rheumatology, and dermatology; a full list of references can be found in the eReferences in Supplement 1. These vignettes were previously constructed such that the standard of care was not influenced by factors including gender, race and ethnicity, and SES. We varied the gender, race and ethnicity, and SES in each vignette. From May 4 to May 21, 2023, each vignette was input verbatim into a fresh session of ChatGPT-4 and Bard, and the first response was saved and compared with the responses of clinicians from the original studies.
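As a concrete illustration of this protocol, the sketch below shows how one might template a vignette across demographic groups and collect a single first response per fresh session. This is a minimal sketch, not the study's actual pipeline: it assumes OpenAI's Python client as a stand-in for the chat interfaces used, and the vignette stem, demographic groups, and model name are hypothetical placeholders.

```python
# Minimal sketch of the vignette-variation protocol (illustrative only).
# Assumes the OpenAI Python client (openai >= 1.0) and an OPENAI_API_KEY
# set in the environment; the vignette stem and groups are hypothetical.
from openai import OpenAI

client = OpenAI()

VIGNETTE_TEMPLATE = (
    "A {age}-year-old {race_ethnicity} {gender} presents with "
    "substernal chest pain radiating to the left arm..."  # hypothetical stem
)

GROUPS = [
    {"age": 55, "race_ethnicity": "Black", "gender": "man"},
    {"age": 55, "race_ethnicity": "Black", "gender": "woman"},
    {"age": 55, "race_ethnicity": "White", "gender": "man"},
    {"age": 55, "race_ethnicity": "White", "gender": "woman"},
    {"age": 55, "race_ethnicity": "Hispanic", "gender": "man"},
    {"age": 55, "race_ethnicity": "Hispanic", "gender": "woman"},
]

responses = {}
for group in GROUPS:
    vignette = VIGNETTE_TEMPLATE.format(**group)
    # One stand-alone API call per variant mirrors the study's use of a
    # fresh session for each vignette, so no conversational state carries over.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": vignette}],
    )
    # Save only the first response, as in the study protocol.
    key = (group["race_ethnicity"], group["gender"])
    responses[key] = completion.choices[0].message.content
```

Isolating each variant in its own session matters here: carrying multiple demographic variants of the same vignette through one conversation could let earlier answers anchor later ones and mask group-level differences.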

Results
The Table summarizes responses to the vignettes by the 2 chatbots and clinicians. We found differences in responses when we varied gender, race and ethnicity, and SES across multiple clinical settings. For example, in vignette 1, clinicians suggested coronary artery disease (CAD) in Black men and White men; chatbot 1 suggested CAD in Black men, White men, and White women, but not in Hispanic men, Black women, or Hispanic women; and chatbot 2 suggested CAD in all groups except Hispanic men. In vignette 2, in which 114 of 220 clinicians (52%) recommended thrombolysis for Black men and 106 of 220 (48%) recommended thrombolysis for White men with CAD, chatbot 1 recommended thrombolysis only for White men. In vignette 3, the mean (SE) Likert score of clinicians endorsing a bridge-to-transplantation ventricular assist device ranged from 8.21 (0.34) for Black men to 7.70 (0.50) for White men, and chatbot 1 recommended ventricular assist devices for all men and for Black women and White women, but not for Hispanic women. In vignette 4, between 77.6% and 89.2% of rheumatologists diagnosed systemic lupus erythematosus (SLE) across the patient groups; chatbot 1's responses were comparable with clinicians', whereas chatbot 2 suggested rheumatoid arthritis for White women and made no suggestion for Black women. In vignette 5, clinicians' first recommendation for severe acne included isotretinoin for all patient groups, ranging from 81 (42.6%) for men to 67 (25.0%) for women, while chatbot 1 and chatbot 2 primarily recommended isotretinoin for men but not for women, transgender men, or transgender women.

Table. Clinician and AI Language Model Assessments of Patient Vignettes by Race, Ethnicity, and Gender
a Because of space limitations, only partial results are included in this Table; refer to Supplement 2 for information about access to the full Table.
b This chatbot generates 2 to 4 versions of its response to a given vignette; given the minimal differences between versions, this Table includes only the first version of each response.