Figure 1. Significant Single-Word (Unigram) Associations by Faculty Gender. OR indicates odds ratio.

Figure 2. Significant 2-Word (Bigram) Associations by Faculty Gender. OR indicates odds ratio.

Table. Demographic Information of Faculty Evaluation Cohort
1 Comment for this article
Another problem that women faculty face
Frederick Rivara, MD, MPH | University of Washington
We are all aware of the many problems that women in medicine face, including lower pay, differential treatment, and barriers to promotion. This study now shows us another: trainees use different words in their comments to describe male vs female faculty. As the article states, these biases can have far-reaching consequences.
CONFLICT OF INTEREST: Editor in Chief, JAMA Network Open
Original Investigation
Medical Education
May 10, 2019

Assessment of Gender-Based Linguistic Differences in Physician Trainee Evaluations of Medical Faculty Using Automated Text Mining

Author Affiliations
  • 1Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia
  • 2Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia
  • 3Palliative and Advanced Illness Research Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia
  • 4Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania, Philadelphia
  • 5Center for Healthcare Improvement and Patient Safety, Perelman School of Medicine, University of Pennsylvania, Philadelphia
  • 6Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia
JAMA Netw Open. 2019;2(5):e193520. doi:10.1001/jamanetworkopen.2019.3520
Key Points

Question  Are there gender-based linguistic differences in narrative evaluations of medical faculty written by physician trainees?

Findings  In this cohort study of 7326 narrative evaluations for 521 medical faculty members, the words art, trials, master, and humor were significantly associated with evaluations of male faculty, whereas the words empathetic, delight, and warm were significantly associated with evaluations of female faculty. Two-word phrase associations included run rounds, big picture, and master clinician in male faculty evaluations and model physician and attention (to) detail in female faculty evaluations.

Meaning  The findings suggest that quantitative linguistic differences between free-text comments about male and female medical faculty may occur in physician trainee evaluations.

Abstract

Importance  Women are underrepresented at higher ranks in academic medicine. However, the factors contributing to this disparity have not been fully elucidated. Implicit bias and unconscious mental attitudes toward a person or group may be factors. Although academic medical centers use physician trainee evaluations of faculty to inform promotion decisions, little is known about gender bias in these evaluations. To date, no studies have examined narrative evaluations of medical faculty by physician trainees for differences based on gender.

Objective  To characterize gender-associated linguistic differences in narrative evaluations of medical faculty written by physician trainees.

Design, Setting, and Participants  This retrospective cohort study included all faculty teaching evaluations completed for the department of medicine faculty by medical students, residents, and fellows at a large academic center in Pennsylvania from July 1, 2015, through June 30, 2016. Data analysis was performed from June 1, 2018, through July 31, 2018.

Main Outcomes and Measures  Word use in faculty evaluations was quantified using automated text mining by converting free-text comments into unique 1- and 2-word phrases. Mixed-effects logistic regression analysis was performed to assess associations of faculty gender with frequencies of specific words and phrases present in a physician trainee evaluation.

Results  A total of 7326 unique evaluations were collected for 521 faculty (325 men [62.4%] and 196 women [37.6%]). The individual words art (odds ratio [OR], 7.78; 95% CI, 1.01-59.89), trials (OR, 4.43; 95% CI, 1.34-14.69), master (OR, 4.24; 95% CI, 1.69-10.63), and humor (OR, 2.32; 95% CI, 1.44-3.73) were significantly associated with evaluations of male faculty, whereas the words empathetic (OR, 4.34; 95% CI, 1.56-12.07), delight (OR, 4.26; 95% CI, 1.35-13.40), and warm (OR, 3.45; 95% CI, 1.83-6.49) were significantly associated with evaluations of female faculty. Two-word phrases associated with male faculty evaluations included run rounds (OR, 7.78; 95% CI, 1.01-59.84), big picture (OR, 7.15; 95% CI, 1.68-30.42), and master clinician (OR, 4.02; 95% CI, 1.21-13.36), whereas evaluations of female faculty were more likely to be associated with model physician (OR, 7.75; 95% CI, 1.70-35.39), just right (OR, 6.97; 95% CI, 1.51-32.30), and attention (to) detail (OR, 4.26; 95% CI, 1.36-13.40).

Conclusions and Relevance  The data showed quantifiable linguistic differences between free-text comments about male and female faculty in physician trainee evaluations. Further evaluation of these differences, particularly in association with ongoing gender disparities in faculty promotion and retention, may be warranted.

Introduction

Although women compose half of all US medical school graduates,1 recent evidence has shown ongoing sex disparities in academic job achievements and academic promotion.2-8 Specifically, women in medicine represent only 38% of academic faculty and 15% of senior leadership positions across the United States.9 This underrepresentation of women has been attributed in part to high rates of attrition at multiple points along the promotion pathway. Although there is robust evidence2-9 showing the presence of these sex differences, the factors leading to disparities in faculty retention and promotion have not been fully elucidated.

One potential influence may be implicit bias, namely unconscious mental attitudes toward a person or group.10 Evidence suggests that implicit biases pervade academia, influencing hiring processes,11 mentorship,12 publication,4 and funding opportunities.13-15 In previous studies,16-19 implicit gender bias also has been quantitatively and qualitatively shown in evaluation of nonmedical academic faculty. Similar biases have also been noted in medical education within narrative evaluations of physician trainees written by medical faculty.20,21 However, despite the critical role of evaluations from physician trainees in promotion decisions for medical faculty, the presence of gender bias within evaluations of medical faculty written by physician trainees has not been adequately assessed.

Thus far, studies22,23 assessing gender differences in faculty evaluations have focused exclusively on numeric scales evaluating teaching effectiveness, with the largest such study showing no significant differences in numeric rating by gender. However, both numeric ratings and written comments are used by academic medical centers for promotion purposes. No studies, to our knowledge, have examined gender differences in narrative evaluations of medical faculty by physician trainees, which could identify implicit biases not captured by numeric ratings.

The use of computer-based language analysis or natural language processing techniques may provide a quantitative method to identify subtle linguistic differences in narrative evaluation that may not be captured by qualitative coding and thematic analysis. The practice of n-gram analysis, which converts text into a sequence of words and phrases for further quantitative analysis,24 has been successfully used to explore linguistic differences between genders in a variety of other settings.25-28 We hypothesized that use of an n-gram analysis could help quantify gender-associated differences in the vocabulary used by physician trainees in narrative evaluations of medical faculty.

Methods
Setting and Participants

We performed a single-center retrospective cohort analysis of all evaluations of department of medicine faculty written by physician trainees at a large academic medical center from July 1, 2015, to June 30, 2016. Anonymous evaluations, completed by medical students, residents, and fellows through a web-based evaluation platform at the end of each clinical rotation (both inpatient and outpatient, usually representing 1 to 2 weeks of exposure to the attending physician), consisted of a single cumulative numeric score rating global teaching effectiveness (5-point scale: 1, poor; 2, fair; 3, good; 4, very good; and 5, excellent) plus mandated free-text comments; there were no other separate ratings of faculty. The institutional review board of the University of Pennsylvania, Philadelphia, approved this study and granted a waiver of consent for participation because the research was retrospective, posed no more than minimal risk to participants, and could not have been performed without a waiver. Our reporting adheres to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline for cohort studies.

Data Collection and Processing

The data set included faculty sex, faculty division within the department of medicine, physician trainee sex, physician trainee level (medical student, resident, or fellow), numeric rating scores (the teaching effectiveness score), and free-text comments. To preserve anonymity, unique physician trainee identifiers were not associated with any evaluations, and a third party deidentified all evaluations before analysis. Text from the narrative portion of each evaluation was anonymized by replacing all first and last names with placeholders (eg, first name and last name). In addition, text was converted to lowercase, special characters were normalized, all punctuation was removed during preprocessing, and stop words (such as a, the, and this) were removed from the text.
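For illustration, a minimal preprocessing sketch in Python (the study reports using Stata, version 15.1, for its analyses); the column name and stop-word list here are assumptions, not taken from the original study:

```python
import re

import pandas as pd

# Illustrative stop-word list; the study's exact list is not specified.
STOP_WORDS = {"a", "an", "and", "the", "this", "to", "of", "or"}

def preprocess(comment: str) -> str:
    """Lowercase, strip punctuation and digits, and drop stop words."""
    text = comment.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

df = pd.DataFrame({"comment": ["The attending taught the team skills."]})
df["clean"] = df["comment"].apply(preprocess)
print(df["clean"].iloc[0])  # -> "attending taught team skills"
```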

N-gram Text Analysis

To analyze the narrative data, we used n-gram text mining to convert the preprocessed free-text comments into discrete words and phrases. N-grams are the 1- and 2-word phrases found in the text. We recorded the presence of each n-gram in each document and then created a matrix of binary indicators for unigrams (single words) and bigrams (2-word phrases) for analysis. Because bigrams are formed from adjacent words in the text, each discrete word contributed both to the unigram set and to the bigrams containing it. For example, the phrase “the attending taught the team skills” would be reduced to “attending taught team skills” during preprocessing and then broken into 4 unigrams (attending, taught, team, and skills) and 3 bigrams (attending taught, taught team, and team skills). Bigrams were included to provide useful local contextual features that unigrams were unable to capture. N-grams occurring fewer than 10 times in the entire corpus were excluded from analysis because the study was not sufficiently powered to detect differences for such rarely occurring terms. Words describing a specific service (such as cardiology or nephrology), location (medical intensive care unit), or procedure (bronchoscopy) were also excluded, given the potential effect of gender imbalance among faculty within specific specialties.
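One way to implement the unigram/bigram extraction described above is with scikit-learn's CountVectorizer; this is a sketch under that assumption, not the study's actual implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of preprocessed comments (the study analyzed 7326 evaluations)
comments = [
    "attending taught team skills",
    "warm empathetic teacher",
    "master clinician ran efficient rounds",
]

# Binary presence indicators for unigrams and bigrams. On the full corpus,
# min_df=10 would approximate the study's exclusion of n-grams occurring
# fewer than 10 times (min_df counts documents rather than total occurrences);
# min_df=1 keeps this toy example runnable.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=1)
X = vectorizer.fit_transform(comments)            # documents x n-grams matrix
ngram_names = vectorizer.get_feature_names_out()  # e.g., "master", "master clinician"
```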

Statistical Analysis

Data analysis was performed from June 1, 2018, through July 31, 2018. We assessed the associations of faculty gender with word or phrase use in the evaluations. The exposure variables were the frequencies of words or phrases mined by the n-gram analysis, and the dependent variable was faculty gender. Given the sparsity of the data matrix after initial n-gram text mining, variable selection was conducted using elastic net regression,29,30 a regularized regression method that shrinks the coefficients of uninformative covariates to zero while preserving correlated variables and guarding against overfitting. To select an appropriate value for the tuning parameter (λ), we used 10-fold cross-validation and chose the value that minimized the mean cross-validated error, reducing potential overfitting within the model. In addition to the variables selected by elastic net regression, several unigrams were preselected for univariate analysis based on previous literature describing gender associations, specifically teacher, first name, and last name.27
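A sketch of the elastic net variable-selection step, using scikit-learn's LogisticRegressionCV as a stand-in for the study's Stata implementation; the L1/L2 mixing weight and the outcome coding are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def select_ngrams(X, y, ngram_names, n_folds: int = 10):
    """Elastic net variable selection: keep n-grams with nonzero coefficients.

    X is the binary document-by-n-gram matrix, y codes faculty gender
    (1 = female, 0 = male; coding is illustrative), and ngram_names is the
    array returned by the vectorizer sketch above.
    """
    enet = LogisticRegressionCV(
        penalty="elasticnet",
        solver="saga",            # the only scikit-learn solver supporting elastic net
        l1_ratios=[0.5],          # assumed L1/L2 mixing weight; not reported in the study
        Cs=10,                    # grid of inverse regularization strengths (1/lambda)
        cv=n_folds,               # 10-fold cross-validation, as described above
        scoring="neg_log_loss",   # choose the value minimizing cross-validated error
        max_iter=5000,
    )
    enet.fit(X, y)
    return ngram_names[np.flatnonzero(enet.coef_.ravel())]
```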

The selected unigrams and bigrams were then fitted in a multivariable logistic regression model to assess their associations with faculty gender (using odds ratios [ORs] and 95% CIs). To assess the association of each individual n-gram with gender, each unigram and bigram was entered into a distinct mixed-effects logistic regression model (clustered on individual faculty members to account for the influence of instructor characteristics). To ensure anonymity of physician trainees, additional demographics (age and race/ethnicity) were not available in the data set, and unique physician trainee identifiers could not be used in the mixed-effects model. A secondary logistic regression was performed to evaluate the association of physician trainee gender with word choice.
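As a rough Python stand-in for the per-n-gram models (a logistic regression with cluster-robust standard errors rather than the study's exact mixed-effects specification; the column names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def ngram_odds_ratio(df: pd.DataFrame):
    """OR and 95% CI for one n-gram indicator, with SEs clustered on faculty.

    Expects illustrative columns: female_faculty (1/0), ngram_present (1/0),
    and faculty_id identifying the evaluated faculty member.
    """
    fit = smf.logit("female_faculty ~ ngram_present", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["faculty_id"]}, disp=0
    )
    odds_ratio = np.exp(fit.params["ngram_present"])
    ci_low, ci_high = np.exp(fit.conf_int().loc["ngram_present"])
    return odds_ratio, ci_low, ci_high
```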

For all analyses, statistical significance was determined with the Benjamini-Hochberg approach for multiple comparisons, using a false discovery rate of 0.10 given the exploratory purpose of our analysis.31,32 This method controls the false discovery rate by comparing each 2-sided P value with the Benjamini-Hochberg critical value (i/m)Q, where i is the rank of the P value, m is the total number of tests, and Q is the false discovery rate.
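The Benjamini-Hochberg step can be reproduced with statsmodels' multipletests; a sketch assuming the P values come from the per-n-gram models above:

```python
from statsmodels.stats.multitest import multipletests

def significant_ngrams(p_values, ngram_names, q: float = 0.10):
    """Flag n-grams whose 2-sided P values pass the BH critical value (i/m)Q."""
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=q, method="fdr_bh")
    return [name for name, keep in zip(ngram_names, reject) if keep]
```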

Overall word counts by comment were tabulated for all evaluations in the cohort, and median word counts were compared across faculty gender, physician trainee gender, and physician trainee level using the Wilcoxon rank sum test (α = .05).

To evaluate quantitative differences between ratings by faculty gender, summary statistics were calculated across all submitted evaluations within the cohort. We performed a descriptive analysis of the overall teaching effectiveness score by comparing mean numeric ratings across faculty gender, physician trainee gender, and physician trainee level using the Wilcoxon rank sum test. All statistical analyses and additional text analysis were completed using Stata, version 15.1 (StataCorp).
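A sketch of the Wilcoxon rank sum comparisons using SciPy's Mann-Whitney U test (equivalent under the two-sided alternative); the data frame columns are assumptions:

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def compare_word_counts(df: pd.DataFrame) -> float:
    """Wilcoxon rank sum (Mann-Whitney U) P value for word count by faculty gender.

    Expects illustrative columns: female_faculty (1/0) and word_count.
    """
    female = df.loc[df["female_faculty"] == 1, "word_count"]
    male = df.loc[df["female_faculty"] == 0, "word_count"]
    _, p_value = mannwhitneyu(female, male, alternative="two-sided")
    return p_value
```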

Results
Demographic Characteristics

A total of 7326 unique evaluations were collected during the 2015 through 2016 academic year for 521 faculty members in the Department of Medicine at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, including 325 men (62.4%) and 196 women (37.6%). Complete data (including the sex of the faculty member and physician trainee) were available for 6840 (93.4%) evaluations. Individual faculty members had a mean (SD) of 14.1 (13.8) evaluations (range, 1.0-76.0) included in the cohort. Most of the evaluations (5732 [78.2%]) were submitted by physicians in residency or fellowship training, and the remainder were submitted by medical students during clerkship, elective, and subinternship rotations (Table). Male physician trainees completed 3724 of 6840 faculty evaluations (50.8%) with complete data.

Faculty members (N = 521) represented all of the following divisions within the department of medicine: general medicine (n = 111; 21.3%); cardiology (n = 88; 16.9%); gastroenterology (n = 51; 9.8%); geriatrics (n = 14; 2.7%); hematology and oncology (n = 76; 14.6%); infectious diseases (n = 47; 9.0%); pulmonary, allergy, and critical care (n = 62; 11.9%); rheumatology (n = 25; 4.8%); renal (n = 36; 6.9%); and sleep medicine (n = 11; 2.1%).

N-gram Text Analysis

After removal of stop words, n-gram text analysis was used to extract text features, creating a matrix of binary indicators for 1610 unigrams and 1698 bigrams. Variable selection with elastic net regression identified a subset of 151 unigrams and 165 bigrams for further analysis. The complete list of selected variables is included in eTable 1 and eTable 2 in the Supplement.

Written evaluations containing art (OR, 7.78; 95% CI, 1.01-59.89), trials (OR, 4.43; 95% CI, 1.34-14.69), master (OR, 4.24; 95% CI, 1.69-10.63), and humor (OR, 2.32; 95% CI, 1.44-3.73) were associated with male faculty, whereas evaluations containing empathetic (OR, 4.34; 95% CI, 1.56-12.07), delight (OR, 4.26; 95% CI, 1.35-13.40), and warm (OR, 3.45; 95% CI, 1.83-6.49) were associated with female faculty. Figure 1 highlights all significant single-word associations with faculty gender (based on the Benjamini-Hochberg threshold), and the full list of gender-associated unigrams is provided in eTable 1 in the Supplement. Analysis of words previously reported to be associated with gender, specifically teacher, first name, and last name, did not reveal significant associations with faculty gender.

The 2-word phrases associated with male faculty evaluations included run rounds (OR, 7.78; 95% CI, 1.01-59.84), big picture (OR, 7.15; 95% CI, 1.68-30.42), and master clinician (OR, 4.02; 95% CI, 1.21-13.36), whereas female faculty evaluations were more likely to contain model physician (OR, 7.75; 95% CI, 1.70-35.39), just right (OR, 6.97; 95% CI, 1.51-32.30), and attention (to) detail (OR, 4.26; 95% CI, 1.36-13.40). All 2-word phrases significantly associated with gender are displayed in Figure 2, and the full bigram associations are included in eTable 2 in the Supplement. A secondary analysis did not reveal significant differences in word choice based on the gender of the physician trainee after adjustment for multiple comparisons (eTable 3 in the Supplement).

Descriptive Analysis of Narrative Comments and Numeric Rating (Teaching Effectiveness Score)

Narrative evaluations had a median word count of 32 (IQR, 12-60) words. The median word count was significantly higher for evaluations of female faculty (33 words; IQR, 14-62) than for evaluations of male faculty (30 words; IQR, 12-58) (P < .001). Neither physician trainee gender nor training level was significantly associated with narrative evaluation word count.

There was no significant difference in the mean numeric ratings of overall teaching effectiveness between female and male faculty (mean difference, 0.02; 95% CI, –0.05 to 0.03). There was also no significant difference between mean numeric ratings assigned by female and male residents or fellows (0.02; 95% CI, –0.05 to 0.03). Compared with residents, medical students assigned significantly higher ratings to the faculty (0.22; 95% CI, 0.17-0.26; P < .001).

Discussion

Our data showed quantitative linguistic differences in free-text comments based on faculty gender that persisted after adjustment for evaluator gender and level of training. These word choice differences, while not innately negative, reflected notable contrasts in the ways male and female faculty were perceived and evaluated. These differences are particularly relevant if job qualifications are defined using stereotypically male characteristics, which may preclude women from these opportunities if evaluations use different descriptors. To our knowledge, this was the first study to focus on narrative evaluation of faculty within the medical field. Because narrative evaluations are frequently used to guide decisions about reappointment, promotion, salary increases, and consideration for leadership positions, bias in these evaluations may have far-reaching consequences for the career trajectories of medical faculty.

The word associations identified in our cohort affirmed linguistic trends by gender previously reported in nonmedical settings,11,12,16-19,33-35 specifically the use of ability terms (such as master and complexity) associated with men and emotive terms (such as empathetic, delight, and warm) associated with women. This mirrors previous qualitative work21,33-35 highlighting gender-based differences in patterns of word choice used in letters of recommendation and performance evaluations that identified differential use of grindstone traits (eg, hardworking), ability traits, and standout adjectives (eg, best). Our results were also consistent with previous findings in other educational settings where trainees (irrespective of age and generational characteristics) displayed biases in ratings of teachers.16,19,20 The linguistic differences noted in our study suggest the need for further evaluation of potential biases in narrative evaluations of faculty.

Despite the linguistic differences, we did not identify differences in the numeric rating of teaching effectiveness by faculty gender. We suspect that the numeric rating, given its significant positive skew and limited spread, may be insensitive to the magnitude and/or types of differences associated with gender bias in this setting. This limited range of ratings may, however, lend greater weight to the written comments, in which differential word choice may reveal subtle imbalances not appreciated in the ratings data. Although the numeric ratings were similar, the descriptors physician trainees used to evaluate faculty differed, and academic medical centers may value the characteristics mentioned differently. Specifically, recent evidence36 highlights that the presence of humor is significantly associated with performance evaluation and assessment of leadership capability. Further study of how the different descriptors used in faculty evaluation are associated with performance appraisal and promotion in academic medical centers is warranted.

Although previous qualitative research has shown the presence of bias in narrative evaluation of physician trainees, this study offers novel insights into the potential for gender bias throughout the spectrum of narrative evaluations in medicine. In addition, the use of natural language processing techniques enabled a robust analysis of a large number of evaluations, exposing word choice differences that may not have been perceived by using qualitative techniques alone. Further study is warranted to explore the effect of gender-based linguistic differences in narrative evaluations on decisions about reappointment, promotion, and salary increases. Clarifying the ultimate effect of these differences on career trajectories of women in medicine may help identify interventions (such as implicit bias training for physician trainees and/or designing faculty evaluation forms to encourage the use of standard descriptors for both genders) to prevent or mitigate implicit bias.

Strengths and Limitations

The strengths of our study included the size and composition of our faculty cohort, which represented all clinical divisions within the department of medicine, as well as the inclusion of physician trainee raters spanning undergraduate medical education (medical school) through graduate medical education (residency and fellowship). In addition, the use of n-gram analysis allowed for objective quantification of word use within the sample. Although n-gram analysis is limited in its ability to capture longer phrases or context, other natural language processing methods, such as embedding vectors or recurrent neural network models, may be more appropriate and are an important area of future work. Further qualitative research may be warranted to examine the context of the word choice differences identified in this study.

A major limitation of the current study is generalizability, since our data set included faculty in the department of medicine within a single large academic institution. However, because the cohort consisted of narrative evaluations written by physician trainees from various geographic backgrounds and previous training institutions, we believe that the results are likely to be representative of academic faculty evaluations at other institutions. Future work may be warranted to determine whether similar associations exist in specialties other than internal medicine and its subspecialties.

Although we were unable to ascertain objective differences in faculty attributes based on gender, our study highlights clear differences in the attributes that physician trainee evaluators perceived and reported. Existing biases about the inherent value of these characteristics may play a role in how promotion committees interpret and weigh these comments. We were also unable to assess the effect of underrepresented minority status, given the small numbers of faculty and physician trainees with this status within our institution and the observational nature of this study. Finally, although our n-gram analysis identified differences in the unigram and bigram descriptors applied to male vs female faculty members' performance, it did not fully capture contextual differences in comments, which would require additional qualitative study and/or additional computational techniques.

Conclusions

This study identified quantitative linguistic differences in physician trainee evaluations of faculty on the basis of gender. Stereotypical gender phrases were identified in physician trainee evaluations of medical faculty (such as ability and cognitive terms in male faculty evaluations and emotive terms in female faculty evaluations). There were no significant differences in numeric ratings between female and male faculty members, which may add greater importance to the language used in narrative comments. Overall, these findings suggest the presence of gender differences in narrative evaluations of faculty written by physician trainees. Further research may be warranted to explore the context of the specific words and phrases identified and to understand the downstream effect of such evaluation differences on the career trajectories of women in medicine.

Article Information

Accepted for Publication: March 20, 2019.

Published: May 10, 2019. doi:10.1001/jamanetworkopen.2019.3520

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Heath JK et al. JAMA Network Open.

Corresponding Author: Janae K. Heath, MD, MSCE, Department of Medicine, Perelman School of Medicine, Hospital of the University of Pennsylvania, 3400 Spruce St, 5045 West Gates Bldg, Philadelphia, PA 19104 (janae.heath@uphs.upenn.edu).

Author Contributions: Dr Heath had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Heath, Clancy, Farrar, Dine.

Acquisition, analysis, or interpretation of data: All authors.

Drafting of the manuscript: Heath.

Critical revision of the manuscript for important intellectual content: All authors.

Statistical analysis: Heath, Weissman, Shou, Farrar, Dine.

Administrative, technical, or material support: Clancy.

Supervision: Shou, Farrar, Dine.

Conflict of Interest Disclosures: Dr Heath reported receiving T32 grants from the National Institutes of Health (NIH) National Heart, Lung, and Blood Institute during the conduct of the study. Dr Clancy reported receiving grants from the NIH during the conduct of the study and receiving personal fees from the American College of Physicians outside the submitted work. Dr Farrar reported receiving the NIH-T32 grant during the conduct of the study and receiving personal fees from Analgesic Solutions, Aptinyx, Biogen, Campbell Alliance Group Inc, Daiichi Sankyo, DepoMed (now Assertio Therapeutics Inc), Evidera, Janssen Pharmaceuticals Inc, Eli Lilly and Co, Novartis AG, Pfizer Inc, and Cara Therapeutics outside the submitted work. No other disclosures were reported.

Funding/Support: This study was made possible by grants 4T32HL007891-19 and 5T32HL007891-20 from a postdoctoral fellowship funded by the National Heart, Lung, and Blood Institute and the NIH (Dr Heath).

Role of the Funder/Sponsor: The funding organization had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: The design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, and approval of the manuscript; and decision to submit the manuscript for publication are solely the responsibility of the authors. The article’s contents do not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the NIH.

Additional Contributions: Rachel Kohn, MD, MSCE, Cameron Baston, MD, MSCE, and Jessica Lee, MD (Perelman School of Medicine, University of Pennsylvania), helped in the writing and editing of the final manuscript. No compensation was provided for their support.

References
1. Lautenberger DM, Dandar VM, Raezer CL, Sloane RA. The State of Women in Academic Medicine: The Pipeline and Pathways to Leadership, 2013-2014. Washington, DC: Association of American Medical Colleges; 2014.
2. Jena AB, Khullar D, Ho O, Olenski AR, Blumenthal DM. Sex differences in academic rank in US medical schools in 2014. JAMA. 2015;314(11):1149-1158. doi:10.1001/jama.2015.10680
3. Jagsi R, Griffith KA, Stewart A, Sambuco D, DeCastro R, Ubel PA. Gender differences in salary in a recent cohort of early-career physician-researchers. Acad Med. 2013;88(11):1689-1699. doi:10.1097/ACM.0b013e3182a71519
4. Jagsi R, Guancial EA, Worobey CC, et al. The “gender gap” in authorship of academic medical literature—a 35-year perspective. N Engl J Med. 2006;355(3):281-287. doi:10.1056/NEJMsa053910
5. Jagsi R, Griffith KA, Stewart A, Sambuco D, DeCastro R, Ubel PA. Gender differences in the salaries of physician researchers. JAMA. 2012;307(22):2410-2417. doi:10.1001/jama.2012.6183
6. Reed DA, Enders F, Lindor R, McClees M, Lindor KD. Gender differences in academic productivity and leadership appointments of physicians throughout academic careers. Acad Med. 2011;86(1):43-47. doi:10.1097/ACM.0b013e3181ff9ff2
7. Kaplan SH, Sullivan LM, Dukes KA, Phillips CF, Kelch RP, Schaller JG. Sex differences in academic advancement: results of a national study of pediatricians. N Engl J Med. 1996;335(17):1282-1289. doi:10.1056/NEJM199610243351706
8. Jena AB, Olenski AR, Blumenthal DM. Sex differences in physician salary in US public medical schools. JAMA Intern Med. 2016;176(9):1294-1304. doi:10.1001/jamainternmed.2016.3284
9. Association of American Medical Colleges. US medical school applications and matriculants by school, state of legal residence, and sex, 2018-2019. https://www.aamc.org/download/321442/data/factstablea1.pdf. Published November 9, 2018. Accessed September 25, 2018.
10. Staats C. State of the science: implicit bias review. http://kirwaninstitute.osu.edu/wp-content/uploads/2014/03/2014-implicit-bias.pdf. Published March 2, 2014. Accessed December 15, 2018.
11. Gorman EH. Gender stereotypes, same-gender preferences, and organizational variation in the hiring of women: evidence from law firms. Am Sociol Rev. 2005;70(4):702-728. doi:10.1177/000312240507000408
12. Moss-Racusin CA, Dovidio JF, Brescoll VL, Graham MJ, Handelsman J. Science faculty’s subtle gender biases favor male students. Proc Natl Acad Sci USA. 2012;109(41):16474-16479. doi:10.1073/pnas.1211286109
13. Pohlhaus JR, Jiang H, Wagner RM, Schaffer WT, Pinn VW. Sex differences in application, success, and funding rates for NIH extramural programs. Acad Med. 2011;86(6):759-767. doi:10.1097/ACM.0b013e31821836ff
14. Hechtman LA, Moore NP, Schulkey CE, et al. NIH funding longevity by gender. Proc Natl Acad Sci USA. 2018;115(31):7943-7948. doi:10.1073/pnas.1800615115
15. Zhou CD, Head MG, Marshall DC, et al. A systematic analysis of UK cancer research funding by gender of primary investigator. BMJ Open. 2018;8(4):e018625. doi:10.1136/bmjopen-2017-018625
16. Young S, Rush L, Shaw D. Evaluating gender bias in ratings of university instructors’ teaching effectiveness. Int J Scholarsh Teach Learn. 2009;3(2):1-14. doi:10.20429/ijsotl.2009.030219
17. Hirshfield LE. ‘She’s not good with crying’: the effect of gender expectations on graduate students’ assessments of their principal investigators. Gend Educ. 2014;26(6):601-617. doi:10.1080/09540253.2014.940036
18. Laube H, Massoni K, Sprague J, Ferber AL. The impact of gender on the evaluation of teaching: what we know and what we can do. NWSA J. 2007;19(3):87-104.
19. Mengel F, Sauermann J, Zölitz U. Gender bias in teaching evaluations. J Eur Econ Assoc. 2018;17(2):535-566. doi:10.1093/jeea/jvx057
20. Dayal A, O’Connor DM, Qadri U, Arora VM. Comparison of male vs female resident milestone evaluations by faculty during emergency medicine residency training. JAMA Intern Med. 2017;177(5):651-657. doi:10.1001/jamainternmed.2016.9616
21. Mueller AS, Jenkins TM, Osborne M, Dayal A, O’Connor DM, Arora VM. Gender differences in attending physicians’ feedback to residents: a qualitative analysis. J Grad Med Educ. 2017;9(5):577-585. doi:10.4300/JGME-D-17-00126.1
22. McOwen KS, Bellini LM, Guerra CE, Shea JA. Evaluation of clinical faculty: gender and minority implications. Acad Med. 2007;82(10)(suppl):S94-S96. doi:10.1097/ACM.0b013e3181405a10
23. Morgan HK, Purkiss JA, Porter AC, et al. Student evaluation of faculty physicians: gender differences in teaching evaluations. J Womens Health (Larchmt). 2016;25(5):453-456. doi:10.1089/jwh.2015.5475
24. Brown PF, deSouza PV, Mercer RL, Della Pietra VJ, Lai JC. Class-based n-gram models of natural language. Comput Linguist. 1992;18(4):468-479.
25. Schmader T, Whitehead J, Wysocki VH. A linguistic comparison of letters of recommendation for male and female chemistry and biochemistry job applicants. Sex Roles. 2007;57(7-8):509-514. doi:10.1007/s11199-007-9291-4
26. Recasens M, Danescu-Niculescu-Mizil C, Jurafsky D. Linguistic models for analyzing and detecting biased language. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria: Association for Computational Linguistics; 2013:1650-1659.
27. Terkik A, Prud’hommeaux E, Alm CO, Homan C, Franklin S. Analyzing gender bias in student evaluations. In: Proceedings of COLING 2016, 26th International Conference on Computational Linguistics. Osaka, Japan: The COLING 2016 Organizing Committee; 2016:868-876.
28. Aull L. Gender in the American anthology apparatus: a linguistic analysis. Adv in Lit Stud. 2014;2(1):38-45. doi:10.4236/als.2014.21008
29. Marafino BJ, Boscardin WJ, Dudley RA. Efficient and sparse feature selection for biomedical text classification via the elastic net: application to ICU risk stratification from nursing notes. J Biomed Inform. 2015;54:114-120. doi:10.1016/j.jbi.2015.02.003
30. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc B. 2005;67(2):301-320. doi:10.1111/j.1467-9868.2005.00503.x
31. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57(1):289-300.
32. Pike N. Using false discovery rates for multiple comparisons in ecology and evolution. Methods Ecol Evol. 2011;2(3):278-282. doi:10.1111/j.2041-210X.2010.00061.x
33. Trix F, Psenka C. Exploring the color of glass: letters of recommendation for female and male medical faculty. Discourse Soc. 2003;14(2):191-220. doi:10.1177/0957926503014002277
34. Fischbach A, Lichtenthaler PW, Horstmann N. Leadership and gender stereotyping of emotions. J Pers Psychol. 2015;14(3):153-162. doi:10.1027/1866-5888/a000136
35. Eagly AH, Makhijani MG, Klonsky BG. Gender and the evaluation of leaders: a meta-analysis. Psychol Bull. 1992;111(1):3-22. doi:10.1037/0033-2909.111.1.3
36. Evans JB, Slaughter JE, Ellis APJ, Rivin JM. Gender and the evaluation of humor at work [published online February 7, 2019]. J Appl Psychol. 2019. doi:10.1037/apl0000395