Effect of Physician Gender and Race on Simulated Patients’ Ratings and Confidence in Their Physicians

This randomized trial examines whether physician gender and race affect patient ratings of satisfaction and confidence in the physician during a simulated clinical encounter for an emergency department visit.


Randomization and Estimation Procedures
The randomization scheme in this experiment is a two-step process for each subject. First, one of four race/gender pairs is drawn from a uniform distribution:

$$f(Z = z) = \tfrac{1}{4}, \quad z \in \{BF, BM, WF, WM\}.$$

Next, conditional on the realized value of $Z$, one of 10 potential doctors is sampled from that group:

$$f(K = k \mid Z = z) = \tfrac{1}{10}, \quad k \in \{1, \dots, 10\}.$$

For example, $f(K = 10 \mid Z = BF) = 0.10$ is the probability that doctor #10 is selected, given that the doctor is a Black Female. The joint distribution of $K$ and $Z$ is simply

$$f(K = k, Z = z) = f(K = k \mid Z = z)\, f(Z = z) = \tfrac{1}{40}.$$

Because we fix the "control" to be White Male (per our pre-analysis plans), we have three target parameters for a given outcome variable (rather than six): $\tau_{BF}$, $\tau_{BM}$, $\tau_{WF}$. Each target parameter is an "average treatment contrast", for example $\tau_{BF} = E[Y_i(BF) - Y_i(WM)]$. For a given individual, we observe only one of the 4 potential outcomes. An unbiased estimator for each average treatment contrast, for a given outcome variable $Y$, is simply the difference in means:

$$\hat{\tau}_{BF} = \frac{1}{N_{BF}} \sum_{i:\, Z_i = BF} Y_i \;-\; \frac{1}{N_{WM}} \sum_{i:\, Z_i = WM} Y_i,$$

where $N_{BF}$ denotes the number of subjects that receive the Black Female treatment, and $N = N_{BF} + N_{BM} + N_{WF} + N_{WM}$.

Linear regression of $Y_i$ on $Z_i$ is a straightforward estimator, where $Z_i$ is a 4-level factor with $WM$ as the "omitted category" and the $4 - 1$ coefficients (excluding the intercept) provide unbiased estimates of each treatment contrast. This is equivalent to a linear regression of $Y_i$ on three indicator variables:

$$Y_i = \beta_0 + \beta_1 \mathbb{1}\{Z_i = BF\} + \beta_2 \mathbb{1}\{Z_i = BM\} + \beta_3 \mathbb{1}\{Z_i = WF\} + \varepsilon_i.$$

This is also equivalent to three "difference-in-means" comparisons where the "control group" observations are those that received the White Male treatment.
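The equivalence between the indicator regression and the three difference-in-means comparisons can be checked numerically. The following is a minimal sketch on simulated data, not the study's analysis code: the sample size, the outcome model, and the `true_shift` values are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
arms = ["WM", "WF", "BM", "BF"]  # index 0 (WM) is the control arm
n = 2000                         # hypothetical sample size

# Step 1: draw a race/gender pair uniformly (probability 1/4 each).
z = rng.integers(0, 4, size=n)
# Step 2: draw one of 10 doctors within the pair (probability 1/10 each).
k = rng.integers(1, 11, size=n)

# Hypothetical outcome: an arm-specific shift plus noise (illustration only).
true_shift = np.array([0.0, 0.5, -0.3, 0.2])
y = 70.0 + true_shift[z] + rng.normal(0.0, 10.0, size=n)

# Three difference-in-means contrasts against the WM control group.
control_mean = y[z == 0].mean()
diff_in_means = np.array([y[z == a].mean() - control_mean for a in (1, 2, 3)])

# OLS of Y on an intercept and three arm indicators (WM omitted).
X = np.column_stack([np.ones(n)] + [(z == a).astype(float) for a in (1, 2, 3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, b, d in zip(arms[1:], beta[1:], diff_in_means):
    print(f"{name}: OLS coefficient = {b:.4f}, difference in means = {d:.4f}")
```

Because the regression is saturated in the treatment factor, the intercept equals the control-group mean and each coefficient equals the corresponding difference in means up to floating-point error.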
The covariate-adjusted difference-in-means estimator is simply the difference-in-means estimator adjusted for differences in background characteristics to improve precision (1). All estimates of treatment contrasts presented here, and in the manuscript, are from the covariate-adjusted difference-in-means estimator.

Prior to participating in the clinical vignette, subjects were provided with a description of their symptoms (Figure e1). The description was displayed for 30 seconds while the "continue" button was disabled, so that each subject was required to spend at least 30 seconds reviewing the instructions. Next, the subject was asked two questions as part of an "attention check" procedure (2, 3). The first question simply asked "How long have you been experiencing abdominal pain?" with potential responses "For a couple of hours"; "For about one day" (correct answer); or "Weeks". Next, the subject completed the "drag and drop" task in Figure e2. If the subject passed both attention check questions, they were advanced to a new page and provided with additional instructions about the task (Figure e3). If the subject failed, they were redirected to the scenario instructions for a second review. In Study 1, 82% of subjects passed on the first attempt; in Study 2, 57% passed on the first attempt. The significant difference in pass rates between the MTurk (Study 1) and Lucid (Study 2) samples is consistent with prior research finding that MTurk workers are substantially more attentive to instructions than student research subjects or research subjects drawn from the general population (4, 5). None of the subjects who failed the attention check in these studies were excluded from the analyses reported here or in the manuscript (6).

Selection of Simulated Physicians
Images within each treatment arm were selected based on CFD raters' evaluations along the following dimensions: 1) perceived age between 27 and 39 years old, the group of physicians more likely to experience discrimination (7); 2) 90% agreement among raters on the perceived race and gender of the face; 3) perceived trustworthiness and attractiveness between 3 and 5 on a 7-point Likert scale, excluding those perceived to be unusually trustworthy (or untrustworthy) or unusually attractive (or unattractive).

Covariate Balance
Tables e1 and e2 show summary statistics for background covariates across all four treatment arms. One implication of random assignment is that background characteristics should be poor predictors of treatment assignment. Rather than conducting separate tests covariate-by-covariate, for each treatment arm and each study, we use randomization inference (RI) to conduct two omnibus tests of the null hypothesis of covariate balance across treatment arms, one per study (see Chapter 3 of (8) for a textbook treatment). This test is performed by regressing the treatment assignment vector on background covariates. When the null hypothesis of covariate balance is true, the observed F-statistic from this regression will not be unusual when compared to the null distribution implied by the experimental design.
We approximate the null distribution using 10,000 permutations of the experimental design. The RI P-value is then the proportion of permuted F-statistics that are at least as extreme as the one observed. For example, if none of the permuted F-statistics were as extreme as the one observed, the P-value would be zero, providing strong evidence of covariate imbalance. The observed estimates, along with the 0.025 and 0.975 quantiles of the distribution of permuted estimates and the RI P-values, are presented for each study in Table e3.
The RI P-value for Study 1 is 0.90; that is, approximately 90% of the simulated F-statistics were at least as extreme as the observed F-statistic of 0.53. The RI P-value for Study 2 is 0.18; that is, approximately 18% of the simulated F-statistics were at least as extreme as the observed F-statistic of 1.35. Thus, we fail to reject the null hypothesis of covariate balance in both Study 1 and Study 2, as implied by the experimental designs.
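The logic of the randomization-inference balance test can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce Table e3: it uses a single binary treatment indicator rather than the four-arm assignment, simulated covariates, and simple reshuffling of the assignment vector as a stand-in for re-running the experimental design. The function names `f_statistic` and `ri_balance_test` are hypothetical.

```python
import numpy as np

def f_statistic(d, X):
    """Overall F-statistic from regressing d on X (intercept added here)."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, d, rcond=None)
    rss = np.sum((d - Xc @ beta) ** 2)   # residual sum of squares
    tss = np.sum((d - d.mean()) ** 2)    # total sum of squares
    return ((tss - rss) / p) / (rss / (n - p - 1))

def ri_balance_test(d, X, n_perm=2000, seed=0):
    """Observed F and the share of permuted F-statistics at least as large."""
    rng = np.random.default_rng(seed)
    observed = f_statistic(d, X)
    permuted = np.array(
        [f_statistic(rng.permutation(d), X) for _ in range(n_perm)]
    )
    return observed, float(np.mean(permuted >= observed))

# Simulated example: covariates unrelated to a randomly assigned treatment.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 5))                      # hypothetical covariates
d = rng.integers(0, 2, size=n).astype(float)     # hypothetical assignment
f_obs, p_value = ri_balance_test(d, X)
print(f"observed F = {f_obs:.2f}, RI p-value = {p_value:.3f}")

# Contrast case: assignment that depends on the first covariate.
d_bad = (X[:, 0] > 0).astype(float)
f_bad, p_bad = ri_balance_test(d_bad, X)
print(f"imbalanced F = {f_bad:.2f}, RI p-value = {p_bad:.3f}")
```

A large P-value, as in the balanced case, indicates that the observed F-statistic is typical of what the design produces by chance; the contrast case shows how the P-value collapses when assignment is predictable from covariates.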

Primary Outcomes
The primary outcome measures used in Study 1 and Study 2 are enumerated below:
1) Patient Confidence: 1a) "How confident are you that this doctor made the correct diagnosis?" 1b) "How confident are you that this doctor recommended the correct treatment plan?"
2) Believes Symptom Checker: "Which diagnosis do you think is more likely to be correct?" [1 = "The symptom checker"; 0 = "The doctor"].
3) Requests more tests: "Would you ask the doctor to perform additional diagnostic tests? (Such as the CT scan recommended by the Symptom Checker)." [5 = "Definitely"; 4 = "Probably"; 3 = "Might or might not"; 2 = "Probably not"; 1 = "Definitely not"]
4) Patient Satisfaction: "What number would you use to rate your care during this emergency room visit?"
5) Likelihood to Recommend: "Would you recommend this doctor to your friends and family?" [5 = "Definitely"; 4 = "Probably"; 3 = "Might or might not"; 2 = "Probably not"; 1 = "Definitely not"]
The patient confidence outcome for each study participant is simply the unweighted average of their ratings on questions 1a and 1b. All other primary outcomes are based on a single survey item. In Study 1, outcomes 1a, 1b and 4 were measured using 0-100 point scales (Figure e4-e5). In Study 2, outcomes 1a and 1b were measured using 5-point scales, and outcome 4 was measured using a 10-point scale (Figure e6-

Survey Measures of Racial Prejudice and Sexism
In Study 1 and Study 2, the measure of racial prejudice is an explicit (survey-based) measure that captures negative beliefs about group-level differences between blacks and whites on four dimensions: trustworthiness, violence, work ethic and intelligence (9, 10). We scale responses for each of the 4 individual items (e.g. Figure e12 shows the "trustworthiness" item) so that a positive difference for "whites" versus "blacks" indicates belief in group-level white superiority. The White-Black differences for each of the items are summarized in Figure e13 (for Study 1) and Figure e14 (for Study 2). The modal respondent, in both studies, did not endorse group-level differences between "blacks" and "whites". We combined these individual items into a single measure by summing across all 4 items to create a racial prejudice index with range -24 to 24. Approximately 40% of subjects in Study 1 and 34% of subjects in Study 2 scored above zero on the index, and therefore believed in the group-level superiority of whites over blacks.
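The construction of the racial prejudice index can be illustrated with a short sketch. It assumes that each group is rated on a 7-point scale (1 to 7) per item, coded so that higher values are more favorable; this coding is inferred from the stated index range of -24 to 24 over 4 items and is not confirmed by the text. The function name `prejudice_index` and the example ratings are hypothetical.

```python
import numpy as np

def prejudice_index(white_ratings, black_ratings):
    """Sum of White-minus-Black rating differences across the 4 items.

    Both inputs have shape (n_subjects, 4). Ratings are assumed to lie on
    a 1-7 scale with higher values more favorable, so each item difference
    lies in [-6, 6] and the index lies in [-24, 24].
    """
    white = np.asarray(white_ratings, dtype=float)
    black = np.asarray(black_ratings, dtype=float)
    return (white - black).sum(axis=1)

# Three hypothetical respondents (columns: trustworthiness, violence,
# work ethic, intelligence, each already coded so higher = more favorable).
white = np.array([[4, 4, 4, 4],
                  [7, 7, 7, 7],
                  [5, 4, 4, 3]])
black = np.array([[4, 4, 4, 4],
                  [1, 1, 1, 1],
                  [4, 4, 5, 4]])

index = prejudice_index(white, black)
share_above_zero = float(np.mean(index > 0))
print(index, share_above_zero)
```

The first respondent endorses no group-level differences and scores zero, which corresponds to the modal response pattern described above; the share scoring above zero is the quantity reported as approximately 40% (Study 1) and 34% (Study 2).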
In Study 1, hostile sexism was measured using components from the ambivalent sexism inventory (11), each with a 6-point scale from Strongly disagree (0) to Strongly agree (5), where higher levels of agreement reflect higher levels of sexism:
1) sexism1: Women exaggerate problems they have at work.
2) sexism2: Once a woman gets a man to commit to her, she usually tries to put him on a tight leash.
3) sexism3: Women are too easily offended.
4) sexism4: Many women are actually seeking special favors, such as hiring policies that favor them over men, under the guise of asking for "equality."
The distribution of responses for each individual item is plotted in Figure e15. The modal response category, for each item, was Strongly disagree (0). We created a sexism index ranging from 1 to 6 by taking the average across the 5 individual items.
In Study 2, we used a 2-dimensional measure that distinguished between hostile (2 items) and benevolent (3 items) sexism (12). Each individual item was captured using a 5-point scale from Strongly agree (5) to Strongly disagree (1) with a neutral midpoint of "Neither Agree nor Disagree" (3).
Hostile Sexism: 1) asi_hostile1: Women seek power by gaining control over men.
2) asi_hostile2: Once a woman gets a man to commit to her, she usually tries to put him on a tight leash.
Benevolent Sexism: 1) asi_benevolent1: Women should be cherished and protected by men.
2) asi_benevolent2: Women have a quality of purity that few men possess.
We constructed a hostile sexism index (range 1 to 5) and a benevolent sexism index (range 1 to 5) for each respondent by averaging across the individual items. The distribution of responses for each individual item is plotted in Figure e16 for Hostile Sexism and Figure e17 for Benevolent Sexism.

BART Estimated Treatment Effects
Recall that we are interested in 3 treatment contrasts: $\tau_{BF}$, $\tau_{BM}$ and $\tau_{WF}$. Rather than estimate an average treatment contrast, e.g. $\tau_{BF}$, the Bayesian Additive Regression Trees (BART) algorithm seeks to estimate the individual-level potential outcomes for each unit (e.g. $Y_i(BF)$ and $Y_i(WM)$) by taking into account background covariates and allowing for higher-order interactions between covariates and treatment. BART considers each of these individual-level responses to be a random variable, conditional on the observed data. Estimation and inference for BART proceed by taking many draws from the posterior distribution of each potential outcome, for each individual.
We implemented the BART algorithm using the dbarts package in R, sampling 5,000 draws from the posterior distribution using Markov Chain Monte Carlo (MCMC), with 1,000 iterations of burn-in, 200 trees, 4 independent chains, and thinning every 5 iterations. Thus, for each individual we obtained 20,000 = 5,000 × 4 draws from the posterior distribution of each potential outcome.

Table e8 provides an overall summary of the BART estimated treatment effects for each study across each of the three treatment contrasts. The first column reports the proportion of BART estimated (individual-level) treatment effects that were predicted to be positive. For no subject, whether predicted to have a positive or a negative treatment effect, does the corresponding 95% credible interval exclude zero. The second column reports the mean of the BART estimated treatment effects, the third reports the standard deviation, and the final two columns report the 0.025 and 0.975 quantiles.
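The summaries reported in Table e8 can be illustrated with a sketch of the post-processing step. The draws below are simulated stand-ins, not output from dbarts, and the number of subjects and all distribution parameters are assumptions. The sketch also assumes that the table's mean, standard deviation and quantile columns are computed over the subject-level posterior means, which the text does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_subjects = 20_000, 50   # 5,000 draws x 4 chains; 50 subjects is hypothetical

# Stand-in posterior draws of two potential outcomes per subject
# (rows: posterior draws, columns: subjects). Not real BART output.
subject_effect = rng.normal(0.0, 0.5, size=n_subjects)
y_wm = rng.normal(70.0, 3.0, size=(n_draws, n_subjects))
y_bf = y_wm + subject_effect + rng.normal(0.0, 2.0, size=(n_draws, n_subjects))

# Posterior draws of each individual-level treatment effect, BF versus WM.
tau_draws = y_bf - y_wm
tau_hat = tau_draws.mean(axis=0)                        # posterior mean per subject
lower, upper = np.quantile(tau_draws, [0.025, 0.975], axis=0)
excludes_zero = (lower > 0) | (upper < 0)               # 95% interval excludes zero?

summary = {
    "prop_positive": float(np.mean(tau_hat > 0)),
    "mean": float(tau_hat.mean()),
    "sd": float(tau_hat.std(ddof=1)),
    "q025": float(np.quantile(tau_hat, 0.025)),
    "q975": float(np.quantile(tau_hat, 0.975)),
}
print(summary)
print("subjects whose 95% interval excludes zero:", int(excludes_zero.sum()))
```

In this simulated setup the individual-level intervals are wide relative to the effects, so none excludes zero, mirroring the pattern described for the studies.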

Secondary Outcomes
The secondary outcome measures differed across Study 1 and Study 2. In Study 1, physician warmth and competence were measured using the instrument presented in Figure e8 (13, 14). In Study 2, physician warmth and competence were measured using the instrument presented in Figure e9 (15). Warmth and competence scales in Study 1 and Study 2 were constructed by taking the first principal component from a PCA on all individual scale items. The fairness of the ER visit was measured using the instrument in Figure e10, and the willingness to punish was measured using the instrument in Figure e11. Table e9 presents the estimated treatment effects for all secondary outcome variables measured in Study 1. Table e10 presents the analogous results for Study 2.
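Constructing a scale from the first principal component can be sketched as below. This is an illustration under assumptions rather than the authors' procedure: the items are simulated, they are standardized before the decomposition (the text does not say whether the PCA used standardized items), and the sign of the component is fixed so that loadings sum to a positive value. The function name `first_principal_component` is hypothetical.

```python
import numpy as np

def first_principal_component(items):
    """Scores, loadings and explained-variance share of the first component.

    `items` has shape (n_subjects, n_items). Items are standardized first
    (an assumption), then decomposed with an SVD.
    """
    X = np.asarray(items, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    loadings = vt[0]
    if loadings.sum() < 0:          # fix the arbitrary sign of the component
        loadings = -loadings
    scores = Z @ loadings
    explained = float(s[0] ** 2 / np.sum(s ** 2))
    return scores, loadings, explained

# Simulated example: 4 items driven by one latent trait plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=300)
items = latent[:, None] + rng.normal(scale=0.5, size=(300, 4))

scores, loadings, explained = first_principal_component(items)
print("loadings:", np.round(loadings, 3))
print("share of variance explained:", round(explained, 3))
```

When the items share a single underlying dimension, as in this simulation, the component scores track the latent trait closely and the loadings are of similar size across items.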