Association of Gender With Learner Assessment in Graduate Medical Education

This cross-sectional study evaluates the association of gender with assessment of internal medicine residents.


Introduction
Implicit gender bias refers to how culturally established gender roles and beliefs unconsciously affect our perceptions and actions 1 and may influence the continuum of the medical profession, including students, 2-4 trainees, 5-9 and practicing physicians. 10-12 Gender bias has been cited as a potential threat to the integrity of resident assessment. 13 Competency-based medical education as implemented in the Next Accreditation System of the Accreditation Council for Graduate Medical Education (ACGME) relies on meaningful assessment to inform judgments about resident progress. 14 Bias in assessment is of heightened concern in competency-based medical education, given its implications for resident time in training and readiness to practice.
Evidence of gender bias in resident assessment using the Next Accreditation System competency-based framework is limited. 5,15,16 A 2017 study of emergency medicine training programs found that faculty ascribed higher Milestone levels to male residents at the end of training compared with their female peers. 5 However, a 2019 national study of Milestones reported to the ACGME found that emergency medicine programs' clinical competency committees reported similar Milestone levels for male and female residents with small but significant differences noted in 4 subcompetencies. 16 The need to assess for gender bias within competency-based resident assessment is critical.
This study examines the influence of gender on faculty assessment of resident performance in internal medicine residency training.

Methods
We conducted a retrospective, cross-sectional study of faculty assessments of residents in 6 internal medicine residency programs. Residents spend 2 weeks to 1 month on each rotation, while attendings rotate in 1-week to 1-month blocks. Faculty evaluate each resident under their supervision. Faculty assessments inform overall performance evaluations of resident progress as reported to the ACGME using the Next Accreditation System framework for competency-based assessment.
Assessment data included 20 of 22 internal medicine-specific reporting Milestones and 6 core competencies (patient care, medical knowledge, systems-based practice [SBP], practice-based learning and improvement [PBLI], professionalism, and interpersonal and communication skills [ICS]). See eTable 1 in the Supplement for data collected for Milestones and core competencies across sites. 17 Each site used a unique assessment tool, which, in aggregate, included 130 quantitative questions, 45 of which used exact wording of the ACGME's reporting Milestones and 85 of which used variations of the Milestones wording. Eight team members (R.K., N.N.U., J.K., A.V., E.D.S., S.S., V.T., K.A.J.) independently and blindly matched question stems to the most appropriate Milestone (96% agreement), with disagreement resolved through discussion.
Rating scales varied across programs. To address this, we converted raw ratings to standardized scores. Within each site, we calculated the rating distribution (mean and SD) for each Milestone, then used these values to compute standardized scores for that Milestone.
We calculated standardized scores for each Milestone and each core competency at each site and used standardized scores in aggregate for analysis. Standardized scores are expressed as SDs from the mean.
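For concreteness, the standardization described above amounts to a per-site, per-Milestone z-score: each raw rating is expressed as the number of SDs it falls from that site's mean for that Milestone. A minimal sketch (the function name and the 1-5 rating scale are illustrative assumptions, not the study's actual tooling):

```python
from statistics import mean, stdev

def standardize_scores(raw_ratings):
    """Convert raw ratings for one Milestone at one site into
    standardized scores (SDs from that site's mean)."""
    m = mean(raw_ratings)
    sd = stdev(raw_ratings)  # sample SD
    return [(r - m) / sd for r in raw_ratings]

# Hypothetical ratings on a 1-5 scale at a single site:
z = standardize_scores([3, 4, 4, 5, 2])
# By construction, the standardized scores have mean 0 and SD 1,
# which puts sites with different rating scales on a comparable footing.
```

Because each site's ratings are centered and scaled within site, a standardized score of +0.5 means half an SD above that site's mean for that Milestone, which is what permits aggregation across programs.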
We also collected resident and faculty demographic data as well as rotation setting and date.
Resident demographics included gender, PGY, and baseline internal medicine In-Training Examination (ITE) percentile rank, defined as the percentile rank on the first ITE examination required by each program. Faculty demographics included gender, specialty, academic rank, and residency educational role.
We used male and female gender designations, and gender was determined by participants' professional gender identity. Demographic data were obtained from residency management systems and search of institution websites. The program director or associate program director at each site not involved in the study reviewed and verified gender designations. Data were deidentified before analysis.

Statistical Analysis
Data were analyzed from June 7 to November 6, 2019. We computed summary statistics for all variables. We evaluated the association of standardized scores for Milestones and core competencies with resident gender, PGY, and faculty gender with a random-intercept mixed model adjusted for clustering of residents and faculty within programs.
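A rough analogue of this model can be written in Python with statsmodels (the study itself used SAS; the variable names and noise-only synthetic outcome below are assumptions for illustration, not study data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "program": rng.integers(0, 6, n),          # 6 residency programs
    "res_gender": rng.choice(["F", "M"], n),   # resident gender
    "fac_gender": rng.choice(["F", "M"], n),   # faculty gender
    "pgy": rng.choice([1, 2, 3], n),           # postgraduate year
})
df["score"] = rng.normal(0.0, 1.0, n)          # standardized score (noise)

# Random-intercept mixed model: fixed effects for resident gender, PGY,
# and faculty gender (including their three-way interaction), with a
# random intercept per program to account for clustering within programs.
model = smf.mixedlm("score ~ C(res_gender) * C(pgy) * C(fac_gender)",
                    data=df, groups=df["program"])
fit = model.fit()
```

The fixed-effect terms correspond to the main effects and interactions tested in the study, and the per-program random intercept plays the role of the clustering adjustment described above.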
After testing for the individual main effects of the 3 variables above, we assessed for the interaction of resident gender, PGY, and faculty gender. We adjusted for resident ITE percentile rank, faculty rank (professor, associate professor, assistant professor/instructor/chief resident, or no rank/clinical associate), faculty specialty (general medicine, hospital medicine, or subspecialty), rotation setting (university, Veterans Administration, public, or community hospital), and rotation time of year (July-September, October-December, January-March, or April-June). Analyses were conducted in SAS, version 9.4 (SAS Institute, Inc). A 2-sided P < .05 was considered statistically significant.


Results
Findings were consistent across sites (eTable 5 in the Supplement) and across the Milestones assessed in our study. Figure 2 and Table 3 depict the adjusted standardized scores in core competencies for PGY cohorts by resident and faculty gender. With male faculty, there was no significant difference between male and female residents' scores in the PGY1 and PGY2 cohorts. Male faculty rated male PGY3 residents higher than female PGY3 residents in all competencies, reaching statistical significance in medical knowledge.

Discussion
To our knowledge, this is the first multisite quantitative study of the association of gender with assessment scores of internal medicine residents using a Milestone- and competency-based framework. Our findings indicate that (1) resident gender was a significant factor associated with assessment; (2) gender-based differences in assessment of internal medicine residents were associated with PGY; and (3) faculty gender was a notable factor associated with gender-based differences in assessment.
First, we found that resident gender was a significant factor associated with assessment. This is consistent with findings in assessment of emergency medicine residents. 5 Many prior studies that did not show a gender-based difference in resident assessment 15,18-21 were limited by low power, a low proportion of female participants, single-institution settings, or reliance on older assessment tools. A competency-based assessment framework did not appear to mitigate the influence of gender on faculty assessment of resident performance.
Second, we found that the gender-based differences in assessment of internal medicine residents were linked to PGY. Remarkably, the association of gender with assessment was not consistent across PGY cohorts. Male and female PGY1 residents scored similarly. In PGY2, when residents first assume the role of ward team leader, female residents earned higher marks than their male peers. However, this finding was reversed in PGY3, when male residents outscored female residents.
A peak-and-plateau pattern in female residents' scores was noted whereby scores peaked in PGY2 and then did not improve beyond this level in PGY3 (Figure 1). Noted in all 6 competencies, the peak-and-plateau pattern of female residents' scores contrasts with the positive trajectory of male residents' scores. Studies that have indicated a link between time in training and gender-based differences in assessment have largely focused on gender-based differences at the end of training. 5,6,16 This peak-and-plateau pattern may represent a glass ceiling in resident assessment.
Traditionally reported in career advancement, the glass ceiling is a metaphor for invisible, unacknowledged barriers that become more pronounced at higher professional levels that impede the professional advancement of women and minorities. 22 It is plausible that a phenomenon akin to the glass ceiling may manifest in residency, given its hierarchical nature.
In addition, we found that faculty gender was a notable factor in the gender-based differences in resident assessment. Gender-congruent resident-faculty pairings seemed to benefit male residents more than female residents in terms of assessment scores. The peak-and-plateau pattern in female residents' scores was noted with both male and female faculty evaluators. The interaction among resident gender, PGY, and faculty gender was significant in the patient care competency, which had the most assessment data in our study and is arguably the most summative competency.
Interestingly, a national study of Milestones reported by US emergency medicine programs also found statistically significant differences only in patient care subcompetencies. 16 Prior efforts to discern the association of faculty gender and gender pairings with resident assessment have yielded a limited picture. 5-7,18,19,21 Of the studies that noted differences, findings suggest the male resident-male faculty dyad received higher scores than the female resident-male faculty dyad. 7,18 We found gender-based differences in assessment with both male and female faculty.
Evidence suggests that both women and men may display gender bias, 23,24 and women's own experiences with bias may influence this. 25 Consideration must be given to potential sources of gender-based differences in assessment noted in our work. This includes the assessment framework, faculty evaluators, and resident learners.
Gender-based differences in assessment have been reported using a variety of frameworks, including the Milestone- and competency-based assessment framework noted herein. 13 Differing faculty expectations of residents may play a role in our findings. In our context, there is no explicit difference in the role of a PGY2 and PGY3 ward team leader in terms of responsibilities and duties. However, faculty may have different implicit expectations for PGY2 and PGY3 resident team leaders, which may enable implicit gender bias in assessment.
Gender bias may arise when gender-based normative behaviors and expectations misalign with professional roles and behaviors. 26 It may emerge in specific contexts, such as a team leader role in which residents direct others in managing patient care. Research indicates that women successful in traditionally male fields may face a "likability penalty" that may impede career trajectory, which may explain the peak-and-plateau pattern we noted. 23,24,26 Female residents are more often assessed using communal descriptors and less often in agentic terms. 8,9,27 Female residents may be rewarded for adopting a communal leadership style in PGY2 and face a likability penalty for transitioning to a more assertive, independent leadership style in PGY3. A study of feedback provided to female emergency medicine PGY3 residents reported a faculty focus on autonomy, assertiveness, and receptiveness to oversight, which may suggest implicit faculty expectations around these issues for female residents. 6 In a finding not previously reported, female faculty rated male PGY2 residents lower than female residents, but this pattern reversed in PGY3. Score patterns for male residents may reflect a mismatch between confidence and competence or traits ascribed to the traditional male gender role.
Evidence suggests male medical students and residents may overestimate confidence. 28-30 Overestimation of confidence relative to competence may be seen as more detrimental in PGY2 than PGY3. Alternately, the traditional male gender role reinforces stoicism, independence, and less inclination to seek help, traits that may be seen as beneficial in PGY3 but not PGY2. 31 Gender-based differences in assessment may reflect differences in resident performance. Given this study's retrospective, cross-sectional design, it is possible that findings might reflect a difference between PGY cohorts. However, we noted this pattern across multiple sites in our study, suggesting that systematic differences between resident cohorts are less likely the root cause of the gender-based differences noted.
We incorporated baseline ITE percentile rank as an objective measure of baseline medical knowledge. Although we observed no significant overall gender-based difference in baseline ITE, we did note a difference in baseline ITE among PGY3 residents. This alone is likely insufficient to explain gender-based differences in assessments of resident performance. Although ITE performance is associated with board certification pass rates, evidence supporting the ITE as an estimate of clinical performance is limited. 32,33 National trends in ITE performance by gender warrant further study.

JAMA Network Open | Medical Education
Given the variable expression of these differences across training, it seems unlikely that gender-based differences in scores are solely explained by deficiencies in clinical skill. Although evidence suggests that female residents experience strain when their professional role requires them to act counter to gender-based normative behaviors, it is unclear whether this affects performance. 34 Finally, the discordant, nonspecific feedback received by female residents may affect their growth trajectory. 6 We must consider the potential implications of these findings in graduate medical education.
Because faculty assessment informs program determinations of resident progress, gender-based differences in assessment may have implications for resident time in training and readiness to practice. 5 Faculty assessment data may influence professional opportunities accessible to residents. 13 Finally, gender-based differences in assessment imply a difference in the training experience of male and female residents. Any evidence of disparities in training warrants attention and remedy.

Limitations
Study limitations include the retrospective, cross-sectional design. Differences between resident groups and variability in the number of evaluations between sites may influence findings. Although reproducibility across sites strengthens our findings, longitudinal study is warranted. Variability in assessment tools across sites was a limitation, although we used a rigorous approach to enable comparison. We used binary gender designations determined by participants' professional gender identity, which does not adequately capture those identifying as gender nonbinary. Other factors, such as race and time spent observing resident performance, may influence assessment; study of these factors is ongoing. Finally, our study included only academic training programs, which may limit generalizability.

Conclusions
Our study provides novel evidence of and insights into gender bias in assessment in graduate medical education. Further study of the factors that underlie these gender-based differences is warranted to inform evidence-based interventions to address them.