Gender Disparities in Invited Commentary Authorship in 2459 Medical Journals

This case-control study examines gender differences in authorship of invited commentaries published in medical journals from 2013 to 2017, controlling for field of expertise, seniority, and publication metrics.


Gender inference details
In our study, we used a four-step process to infer author gender. In Step 1, we submitted the author's first name and country of origin to the software service genderize.io. 1 If Step 1 failed, we moved to Step 2: querying genderize.io with the first name only. In Steps 1 and 2, we required the following criteria in order to assign a gender: (a) the name must appear in the genderize.io dictionary at least five times; (b) the probability of that name being either male or female must be 85% or more according to the statistics provided by genderize.io.
If Steps 1 and 2 both failed to return a gender, we moved to Step 3, which matched author first and last name to a dictionary of names and countries from the journal Nature. Finally, if Steps 1, 2, and 3 all failed, in Step 4 we matched first names to a dictionary of Japanese first names and their genders. We note that in tests using a set of names with known genders, Steps 3 and 4 did not significantly improve accuracy of gender inference.
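The four-step cascade can be sketched as follows. This is a minimal illustration, not the authors' code: the lookup callables, the dictionary layouts, and the function names are hypothetical stand-ins for genderize.io queries and the two name dictionaries. Only the thresholds (at least five occurrences, probability of at least 85%) come from the text.

```python
# Thresholds stated in the text for Steps 1 and 2.
MIN_COUNT = 5
MIN_PROBABILITY = 0.85


def accept(result):
    """Apply criteria (a) and (b) to a (gender, probability, count) tuple."""
    if result is None:
        return None
    gender, probability, count = result
    if count >= MIN_COUNT and probability >= MIN_PROBABILITY:
        return gender
    return None


def infer_gender(first, last, country, by_name_country, by_name,
                 name_dict, japanese_dict):
    """Return 'male', 'female', or None (unknown) using the four-step cascade.

    by_name_country and by_name are hypothetical callables standing in for
    genderize.io lookups; name_dict and japanese_dict are hypothetical
    dictionaries standing in for the Step 3 and Step 4 resources.
    """
    # Step 1: first name plus country of origin.
    gender = accept(by_name_country(first, country))
    if gender:
        return gender
    # Step 2: first name only.
    gender = accept(by_name(first))
    if gender:
        return gender
    # Step 3: first and last name matched against a name dictionary.
    gender = name_dict.get((first, last))
    if gender:
        return gender
    # Step 4: first name matched against a dictionary of Japanese first names.
    return japanese_dict.get(first)
```

A name that fails all four steps returns None, which corresponds to the "unknown gender" category analysed in the sensitivity analyses.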

Technical details
We identified potential controls based on published article abstracts available in Scopus, the largest abstract and citation database of peer-reviewed literature. The matching algorithm accepts as input the set of all available Scopus abstracts (no restrictions on journal) and produces as output a measure of the similarity of the abstracts of all authors in Scopus to the abstracts of each case author. The degree of similarity determined by the algorithm depends both on the semantic concepts identified in an author's abstracts and on the frequency with which the author produces abstracts that refer to each concept. Typical controls thus either publish prolifically on at least one topic that the case author works on (a strong match on some concepts), publish to some extent on all concepts that the case author works on (a moderate match on all concepts), or strike a balance of the two.
Each Scopus publication was characterized by a set of noun phrases extracted from title and abstract, where the term "noun phrase" refers to a group of words that behaves like a noun and often has a noun as nucleus (see eFigure 1(a) and (b) for examples). The term frequency (TF) of a noun phrase was multiplied by its inverse document frequency (IDF) to obtain the TF-IDF, a well-validated measure of how important the phrase is to the publication in the collection of all publications in the time range of interest. Within each document, noun-phrases were then ranked according to TF-IDF value (see eFigure 1(c) and eFigure 2(a) for examples). Note that the noun-phrases were identified from all abstracts available in Scopus, not just abstracts from the journal where the case ICC was published.
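The TF-IDF ranking step can be illustrated with a small sketch. The text does not state which TF-IDF variant was used, so the raw-count TF and the log(N/df) IDF below are assumptions, as are the function name and the toy noun phrases. Noun-phrase extraction itself is taken as given.

```python
import math
from collections import Counter


def tfidf_rankings(docs):
    """Rank the noun phrases of each publication by TF-IDF.

    docs: list of lists of noun phrases, one list per publication.
    Returns, per publication, its phrases sorted from highest to lowest score.
    """
    n_docs = len(docs)
    # Document frequency: number of publications containing each phrase.
    df = Counter(phrase for doc in docs for phrase in set(doc))
    rankings = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term count times log(N / df).
        scores = {p: tf[p] * math.log(n_docs / df[p]) for p in tf}
        # Descending score; ties broken alphabetically for determinism.
        rankings.append(sorted(scores, key=lambda p: (-scores[p], p)))
    return rankings


# Toy corpus of three publications (hypothetical phrases).
docs = [
    ["radon_exposure", "lung_cancer", "radon_exposure"],
    ["lung_cancer", "smoking_cessation"],
    ["lung_cancer", "clinical_trial"],
]
rankings = tfidf_rankings(docs)
```

In this toy corpus, a phrase that appears in every publication receives an IDF of zero and so falls to the bottom of each ranking, which is the intended behaviour of the weighting.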
Next, author expertise profiles were generated by computing the rank of each phrase, averaged across all of the author's publications, as a measure of the importance of that phrase for the author (see eFigure 2(b) for an example). Only authors that had at least five publications in the time range of interest (2013 through 2017) were included to ensure a sufficiently rich semantic representation. Rare noun phrases that occurred in only a few documents across the whole database were not included in the profiles. We then compared the profile of each case author to the profiles of all other authors in the database using the BM25 ranking function, 2,3 which has shown strong performance in information retrieval ranking tasks. The top 50 most similar authors for each case were taken as potential matched controls.
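As a rough illustration of the profile comparison, the sketch below scores candidate profiles against a case author's phrases with a standard Okapi BM25 formula. The text does not specify how the rank-based profiles were fed to BM25, so treating each profile as a bag of phrases is an assumption. The parameter values k1 = 1.5 and b = 0.75, the non-negative IDF variant, and all names are likewise assumed for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, profiles, k1=1.5, b=0.75):
    """Score every candidate profile against the query phrases with BM25.

    query: phrases from the case author's profile.
    profiles: dict mapping author id -> list of phrases (repeats allowed).
    k1 and b are conventional defaults, not values taken from the study.
    """
    n = len(profiles)
    avg_len = sum(len(p) for p in profiles.values()) / n
    df = Counter(ph for p in profiles.values() for ph in set(p))
    scores = {}
    for author, phrases in profiles.items():
        tf = Counter(phrases)
        # Length normalisation: longer profiles are penalised when b > 0.
        norm = k1 * (1 - b + b * len(phrases) / avg_len)
        score = 0.0
        for ph in set(query):
            if tf[ph] == 0:
                continue
            idf = math.log(1 + (n - df[ph] + 0.5) / (df[ph] + 0.5))
            score += idf * tf[ph] * (k1 + 1) / (tf[ph] + norm)
        scores[author] = score
    return scores


def top_matches(query, profiles, k=50):
    """Return the k highest-scoring authors (k = 50 in the text)."""
    scores = bm25_scores(query, profiles)
    return sorted(scores, key=lambda a: (-scores[a], a))[:k]


# Toy candidate pool (hypothetical); the case author is not among the candidates.
profiles = {
    "A": ["radon_exposure", "radon_exposure", "lung_cancer"],
    "B": ["lung_cancer", "smoking_cessation", "clinical_trial"],
    "C": ["hip_fracture", "bone_density", "clinical_trial"],
}
query = ["radon_exposure", "lung_cancer"]
best_two = top_matches(query, profiles, k=2)
```

Author A matches both query phrases and uses one of them repeatedly, so A ranks above B, who matches only the more common phrase. Author C shares no phrase with the case author and scores zero.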

Validity of the matching algorithm
We used TF-IDF and BM25 to rank potential controls for each case author based on published abstracts. This system has previously been used in expert search tasks similar to our application, 4 has been well-tested experimentally in a wide range of other document information retrieval tasks, 5,6 and typically rivals or outperforms competing algorithms. 7 A previous study using TF-IDF/BM25 and the same natural language processing methods used here demonstrated good performance in matching unsubmitted manuscripts to potential journals. 8

Comparison with other text matching methods
Among other tools that implement text similarity, the approach adopted in this study is most similar to the system Jane, which uses text similarity to suggest journals and experts. 9 The most important algorithmic differences concern the ranking function, the modelling of the text, and the matching at researcher level. The Jane system uses plain TF-IDF values whereas the current approach uses a more powerful ranking function (BM25). Importantly, Jane models only isolated words (e.g. radon, exposure) whereas the current approach captures meaningful compound words and phrases (e.g. radon_exposure) that better describe the content. Finally, the current approach aggregates to researcher profiles first, carefully balancing the contribution of individual articles. An in-depth comparison is beyond the scope of this study, but each of the differences with respect to Jane is known to contribute towards higher accuracy in semantic similarity matching. 2,7

Estimation of journal-specific odds ratios
Estimation of journal-specific odds ratios involved fitting conditional logistic regression models separately to data for each journal. Three models were fit for each journal: Model 1: effect of gender adjusted only for field of expertise through matching; Model 2: further adjusted for percentiles of years active, h-index, and number of publications as covariates in the regression model; and Model 3: including an interaction term between years active percentile and gender.
For Models 2 and 3, most journals did not have sufficient data to permit use of spline terms, which we used in our pooled model of data from all journals (see Statistical Analyses section of main manuscript). Based on the functional forms for the effects of years active, h-index and number of publications estimated in our pooled model (see eFigures 3 and 4), we included the following terms in Model 2: a linear term for years active percentile, and linear and quadratic terms for each of h-index percentile and number of publications percentile. In Model 3, an interaction term was included between gender and the linear effect of years active percentile.
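For concreteness, the journal-specific models can be written out as below. The notation is illustrative and not taken from the source: F is an indicator for inferred female gender, and y, h, and n denote the percentiles of years active, h-index, and number of publications for author i in matched set s. The last line is the standard conditional likelihood contribution of a matched set containing one case, indexed c(s).

```latex
% Illustrative notation (assumed, not from the source).
% Model 1 uses only the gender term: \eta^{(1)}_{is} = \beta_F F_{is}.
\begin{align*}
\eta^{(2)}_{is} &= \beta_F F_{is} + \beta_1 y_{is} + \beta_2 h_{is} + \beta_3 h_{is}^2
                 + \beta_4 n_{is} + \beta_5 n_{is}^2 \\
\eta^{(3)}_{is} &= \eta^{(2)}_{is} + \beta_6 F_{is}\, y_{is} \\
L_s(\beta) &= \frac{\exp\!\left(\eta_{c(s),s}\right)}{\sum_{i \in s} \exp\!\left(\eta_{is}\right)}
\end{align*}
```

Under Models 1 and 2 the journal-specific odds ratio for gender is exp(β_F). Under Model 3 the odds ratio varies with years active percentile through the interaction coefficient β_6.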
Because of small sample sizes, we were able to fit Model 1 for only 1,410 journals, Model 2 for 1,196 journals, and Model 3 for 1,087 journals. Journal-specific results for Models 1 and 2 can be found in the eAppendix. Results for Model 3 were used for random effects meta-analysis, discussed in the next section.

Sensitivity analysis 1: two-stage random effects meta-analysis
In our main analysis, we concatenated the datasets for all journals and estimated the overall OR using conditional logistic regression. This approach to combining data from multiple sources (journals, in our study) is known as one-stage meta-analysis in the context of meta-analyses where individual-level data are available. 10 In a sensitivity analysis, we compared results from the one-stage meta-analysis to a two-stage meta-analysis. The two-stage approach involved combining estimates of the journal-specific log odds ratios and their estimated variances using random effects meta-analysis. Two-stage meta-analysis accounts for between-journal heterogeneity in exposure and covariate effect sizes, while one-stage meta-analysis is biased in the presence of such heterogeneity. 11 However, the two-stage approach necessarily excluded data from journals with sample sizes too small to permit estimation of a journal-specific log odds ratio. Estimates of the journal-specific log odds ratios were obtained as described in the previous section.
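The second stage of a two-stage analysis can be sketched as follows. The text does not name the between-journal variance estimator, so the DerSimonian-Laird method used here is an assumption, and the function name is hypothetical. The sketch also returns an I-squared heterogeneity statistic computed from Cochran's Q.

```python
import math


def random_effects_meta(estimates, variances):
    """Pool journal-specific log odds ratios with a random effects model.

    Assumes the DerSimonian-Laird estimator for the between-journal variance.
    Returns (pooled_log_or, standard_error, tau2, i2_percent).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted dispersion around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add the between-journal variance to each journal.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    # I-squared: share of total variation attributed to true heterogeneity.
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, i2
```

The pooled odds ratio is then exp(pooled), with a 95% confidence interval of exp(pooled ± 1.96 × se) under a normal approximation.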
We also repeated our secondary analyses using two-stage meta-analysis. To investigate the effect of journal topic on the odds ratio for gender, we pooled journal-specific estimates using random effects meta-analysis for all journals having particular All Science Journal Classification (ASJC) codes. To investigate the effect of journal citation impact on the odds ratio for gender, we conducted a meta-regression of journal-specific log odds ratios on journal CiteScore.

Sensitivity analysis 2: multiple imputation for missing gender data
Overall, 21.0% (35,230 of 167,705) of unique authors in our dataset could not be assigned a gender. This missingness was related to case status, Asian country of origin, years active, number of publications and h-index (eTable 1).
Having an unknown inferred gender may also be related to the author's true gender. Genderize.io is known to return "unknown gender" more often for Asian names, 12 and researchers with Asian names may have a different gender ratio than other researchers. 13,14 If having an Asian name is also related to the chance of authoring an invited commentary, this missingness could bias our results. We defined Asian country of origin as having at least one publication in the author's first year of data in Scopus where the affiliation address could be determined and was in an Asian country. In our dataset, 30,823 (18.4%) unique authors were determined to have Asian country of origin (eTable 1). Gender could not be inferred for 15,743 (51.1%) Asian researchers, compared to 19,054 (14.0%) non-Asian researchers.
We hypothesized that the gender data are approximately missing at random (MAR). Specifically, we claim that missingness in the gender variable is likely to be independent of gender after accounting for author-level characteristics including having an Asian name, case status, years active, number of publications, and h-index. Multiple imputation is therefore an appropriate method to account for missing gender in our data.
We built a mixed effects logistic regression imputation model for gender that included the following variables. Asian country of origin was included as a binary variable. Non-linear effects for percentiles of years active, number of publications and h-index were included using natural cubic splines with internal knots at 0.25, 0.5, and 0.75. For consistency with our outcome model, we also included interactions between case status and the linear terms for each of years active, number of publications, and h-index. Finally, we included a random effect for matched set to account for the association between field of scientific expertise (the matching variable) and gender.
After running the above model, we generated predicted probabilities of being female for all authors with missing gender information. We then generated ten datasets with missing gender imputed randomly based on these probabilities. We ran our conditional logistic regression outcome models using each of these datasets and pooled the regression coefficients using Rubin's rules for multiple imputation.
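The imputation draw and the pooling step can be sketched as below. The function names, the fixed random seed, and the 0/1 coding of gender are assumptions made for illustration. The pooling formula is the standard form of Rubin's rules for a single coefficient.

```python
import math
import random


def impute_gender(prob_female, n_datasets=10, seed=0):
    """Draw imputed gender indicators (1 = female) from predicted probabilities.

    prob_female: predicted probability of being female for each author with
    missing gender. Returns one list of 0/1 draws per imputed dataset.
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in prob_female]
            for _ in range(n_datasets)]


def rubin_pool(estimates, variances):
    """Pool one regression coefficient across imputed datasets.

    estimates: coefficient estimate from each imputed dataset.
    variances: squared standard error from each imputed dataset.
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                  # pooled point estimate
    within = sum(variances) / m                 # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between      # total variance
    return q_bar, math.sqrt(total)
```

The between-imputation term widens the pooled standard error to reflect uncertainty about the missing genders, which a single imputed dataset would ignore.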

Sensitivity analysis 3: de-duplication for case authors present in multiple journals, excluding reply articles, and increased stringency of matching criteria
In a third sensitivity analysis, we repeated our main analyses using a dataset based on more conservative assumptions. This allowed us to examine the potential impact of three issues: (1) correlation due to multiple cases representing the same author; (2) the presence of articles that may not have been invited, and; (3) variation in the match quality of controls.
First, in our dataset, one author acts as a case once for each distinct journal in which they authored an invited commentary over the study period. This leads to "duplicate" records for authors with commentaries in more than one journal. These authors may have an inflated impact on our results, especially when the number of journals is large. Among male cases, 25.1% authored ICCs in multiple journals (range of number of journals per author: 1 to 22), compared to 16.0% of female cases (range: 1 to 10). To account for this, we removed duplicates so that each author appeared at most once as a case.
Second, our outcome definition, intra-citing commentaries (ICCs), may include some article types that are arguably not invited. Some of these article types could not be identified with the available data (see Limitations section of the manuscript). However, one such article type, replies to other articles, is typically indicated by its title. In many medical journals, authors are given the opportunity to respond to commentaries or letters concerning an article they have authored, and this response may be published alongside the commentary/letter. These replies typically include phrases like "response to", "a reply", or "authors' response" in their titles, and cite the article they are responding to, such that they meet our definition of an ICC. We searched for articles in our dataset with titles containing at least one of the words "reply", "replies", "response", "responses", "respond", or "responds". This definition was chosen to be inclusive in order to capture the maximum number of reply articles; however, articles with these words in their titles are not necessarily replies. There were 9,354 such articles in our dataset (9.2% of eligible articles). These articles were excluded.
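The title search can be sketched with a regular expression. The six words come from the text; matching them as whole words and ignoring case are assumptions, as are the names used and the example titles.

```python
import re

# Whole-word, case-insensitive match on the six words listed in the text.
REPLY_PATTERN = re.compile(
    r"\b(reply|replies|response|responses|respond|responds)\b", re.IGNORECASE
)


def is_possible_reply(title):
    """Return True if the title contains any of the listed reply words."""
    return REPLY_PATTERN.search(title) is not None
```

Under these assumptions, a hypothetical title such as "Dose-response relationship in radon exposure" would also be flagged, which illustrates why articles caught by this inclusive definition are not necessarily replies.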
Third, in our main analysis, we included up to ten controls per case. The quality of the match between each control and the corresponding case depends on the similarity index generated when comparing Scopus abstracts. This similarity index does not have interpretable units; hence, the choice of cut-off for this index is somewhat arbitrary.

Sensitivity analysis 1: two-stage random effects meta-analysis
After excluding 1,139 journals that had insufficient data to obtain a journal-specific estimate, the random effects meta-analysis included data from 1,410 journals with a total of 43,572 matched sets. Adjusted results, shown in eTable 6, were very similar to those of our one-stage meta-analysis, shown in eTables 2 and 3. eFigure 8 shows that topic-specific ORs were broadly similar to those from our sub-group analysis using one-stage meta-analysis, shown in Figure 3 of the main text. eFigure 9 shows that the association between journal-specific ORs and journal CiteScore estimated using meta-regression was very similar to the analogous one-stage result shown in Figure 4 of the main text.

Sensitivity analysis 2: multiple imputation for missing gender data
Results from multiple imputation analyses are shown in eTable 7. Accounting for missing gender data using multiple imputation slightly increased the magnitude of our point estimates. The odds ratio adjusted for field of expertise, years active, h-index, and number of publications was 0.76 (95%CI: 0.74 to 0.78) after accounting for missing data, compared to 0.78 (95%CI: 0.76 to 0.80) in a complete case analysis.

Sensitivity analysis 3: de-duplication for case authors present in multiple journals, excluding reply articles, and increased stringency of matching criteria
After excluding duplicate records for the same author, excluding possible reply articles, and keeping only the top two most closely matched controls per case, 31,821 matched sets were included in this sensitivity analysis. eTable 8 shows that results were very similar to our original analysis.

Caption: Gender was unknown if we could not infer it from author first name and country of origin as described in the eMethods (gender inference details). Cases are authors who published at least one intra-citing commentary (ICC) article in an eligible journal during the study period (2013 through 2017). Controls were matched to cases based on field of expertise as determined using natural language processing of abstracts. Case authors also can act as controls for other case authors published in a different journal, but they cannot act as controls for case authors published in the same journal. Asian country of origin was defined as having at least one publication in the author's first year of data in Scopus where the affiliation address could be determined and was in an Asian country. Years active was defined as years since first publication in Scopus.

Number of publications percentile
See eFigure 3 a
a Years since first publication, h-index, and number of publications were included in models as percentiles and were adjusted for using natural cubic splines to allow for non-linear effects. The odds ratio as a function of these variables is displayed in eFigure 3, since coefficients for spline terms are not interpretable. Numeric results are available from the authors upon request, or by accessing full results at github.com/emgthomas/gender_and_invited_commentaries.

See eFigure 4 a
a The interaction between gender and years active is visualized in eFigures 4(a) and 7(a).
b Years since first publication, h-index, and number of publications were included in models as percentiles and were adjusted for using natural cubic splines to allow for non-linear effects. The odds ratio as a function of these variables is displayed in eFigure 4, since coefficients for spline terms are not interpretable. Numeric results are available from the authors upon request, or by accessing full results at github.com/emgthomas/gender_and_invited_commentaries.

See eFigure 6 a
a The interaction between gender and number of publications is visualized in eFigures 6(a) and 7(c).
b Years since first publication, h-index, and number of publications were included in models as percentiles and were adjusted for using natural cubic splines to allow for non-linear effects. The odds ratio as a function of these variables is displayed in eFigure 6, since coefficients for spline terms are not interpretable. Numeric results are available from the authors upon request, or by accessing full results at github.com/emgthomas/gender_and_invited_commentaries.

Caption: The table shows results obtained using random effects meta-analysis to pool journal-specific effect estimates. This analysis included 1,410 journals with sufficient data to obtain journal-specific effects. $I^2$ is an estimate of the percent of total between-journal variation in effect size that can be attributed to true between-journal heterogeneity, rather than sampling variability. $p_h$ is the p-value from a test of the null hypothesis of no between-journal heterogeneity in true effect sizes.

eFigure 8. 2-Stage Meta-Analysis Results by Journal Topic
Caption: The figure shows results of sub-group analyses for journals by topic, as denoted by All Science Journal Classification (ASJC) codes. Journals may have multiple ASJC codes; thus, the topics are overlapping with respect to journals included. Unadjusted models control for authors' fields of expertise through matching. Adjusted models further control for years active percentile, h-index percentile, and total number of publications percentile. Estimates that could not be obtained due to small sample sizes are shown as N/A.

eFigure 9. Meta-Regression of Journal-Specific Effect Sizes Against CiteScore
(a) Odds ratio adjusted for field of expertise

(b) Odds ratio adjusted for field of expertise, years active, h-index and number of publications
Caption: The line, confidence interval and prediction interval represent the predicted odds ratio for women vs. men of authoring invited commentaries as a function of journal CiteScore, as estimated via meta-regression. Each circle represents the odds ratio estimated for a single journal. Circle diameter is inversely proportional to the standard error of the log odds ratio estimate. For clarity, only journals with more than 50 matched sets are shown. Unadjusted models control for authors' fields of expertise through matching. Adjusted models further control for years active percentile, h-index percentile, and total number of publications percentile.