Quantifying Sex Bias in Clinical Studies at Scale With Automated Data Extraction

Key Points Question What is the magnitude of female underrepresentation in clinical studies? Findings In this cross-sectional study, machine reading to extract sex data from 43 135 published articles and 13 165 clinical trial records showed substantial underrepresentation of female participants, with studies as the measurement unit, in 7 of 11 disease categories, especially HIV/AIDS, chronic kidney diseases, and cardiovascular diseases. Sex bias in articles for all categories combined was unchanged over time with studies as the measurement unit but improved with participants as the measurement unit. Meaning This study suggests that sex bias against female participants in clinical studies persists, but results differ when studies vs participants are the measurement units.

The requirements of the PubMed-Extract procedure were illustrated with an example table showing the key regularities that made the procedure tractable (eTable 4). Table information had to be organized in rows, with some participant row headers defined by terms indicating sex, such as male, female, men, or women. PubMed-Extract required arithmetic consistency within tables; in the example, the sum of male and female participants had to equal the total number of participants. In some instances, tables were excluded because of arithmetic inconsistency caused by numeric rounding; e.g., 3069 participants × 44% = 1350 participants, not 1355 participants. PubMed-Extract allowed participant numbers to be spread across multiple columns, such as columns representing different treatments (e.g., Treatment A, Treatment B, Placebo, Total; no. of participants for Treatment A + Treatment B + Placebo = Total no. of participants). Furthermore, PubMed-Extract accepted various formats for table cells that contained information about numbers of participants, e.g., 20, 20%, 20.2, 20.2%, 20/30, 20 (50.2%), 20 (50), or (66%).
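The rounding example can be made concrete with a minimal consistency check. The sketch below is our own illustration, not the published implementation: the function name, the exact-equality rule, and the illustrative female count of 1714 (i.e., 3069 − 1355) are assumptions.

```python
# Minimal sketch of the arithmetic consistency requirement; the function
# name and exact-equality rule are assumptions, not the authors' code.
def counts_consistent(male: int, female: int, total: int) -> bool:
    """True only when the sex counts sum exactly to the reported total."""
    return male + female == total

# True: the counts reported in the table sum exactly to the total.
counts_consistent(1355, 1714, 3069)
# False: 1350, reconstructed from the rounded 44%, breaks the sum.
counts_consistent(1350, 1714, 3069)
```

A count reconstructed from a rounded percentage (3069 × 44% = 1350) fails such a check even though the table's true count (1355) is internally consistent, which is why rounding could cause exclusion.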

PubMed-Extract Functions: Extraction of the Number of Male and Female Participants from Tables
PubMed-Extract used 3 functions to extract female and male counts from each table that had been parsed within an article.
(1) subdivide: To extract participant sex counts, we needed to know which rows and columns contained or did not contain sex counts. We specified which rows and columns had headers whose text indicated the type of row or column. This function subdivided each table into 3 sections: a row header section comprising 0 or more of the first columns, a column header section comprising 0 or more of the first rows, and the numerical portion of the table below and to the right of the row and column headers (eTable 4). There typically was an additional section at the top left corner of the table, at the intersection of the row and column headers, that we discarded because it was uninformative (eTable 4).
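As an illustration, the subdivision step might be sketched as follows; the function name, signature, and representation of a table as a list of string rows are our assumptions, not the published code.

```python
# Hedged sketch of the subdivide step, assuming a table arrives as a list
# of rows of strings and header extents are given as counts of leading
# rows and columns.
def subdivide(table, n_header_rows, n_header_cols):
    """Split a table into column headers, row headers, and the numeric body.

    The top-left corner (intersection of the two header regions) is
    discarded as uninformative, as in PubMed-Extract.
    """
    col_headers = [row[n_header_cols:] for row in table[:n_header_rows]]
    row_headers = [row[:n_header_cols] for row in table[n_header_rows:]]
    body = [row[n_header_cols:] for row in table[n_header_rows:]]
    return col_headers, row_headers, body

table = [
    ["Characteristic", "Treatment A", "Placebo", "Total"],
    ["Male", "20", "22", "42"],
    ["Female", "25", "24", "49"],
]
subdivide(table, n_header_rows=1, n_header_cols=1)
```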
(2) parse_sex_rows: This function evaluated each cell in the table, performed a series of regular expression (regex) transformations, and extracted the number and percentage of participants. Rows with any successfully parsed values were retained for further analysis. In the example table, the 2 rows with headings for P values would be deleted (eTable 4).
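The cell formats listed earlier (e.g., 20, 20.2%, 20/30, 20 (50.2%), (66%)) suggest regex patterns along the following lines. These patterns are illustrative assumptions, not the published regexes.

```python
import re

# Hedged sketch of cell parsing; pattern set and names are assumptions.
CELL_PATTERNS = [
    # count with a parenthesized percentage: "20 (50.2%)" or "20 (50)"
    re.compile(r"^(?P<count>\d+)\s*\((?P<pct>\d+(?:\.\d+)?)%?\)$"),
    # count out of a denominator: "20/30"
    re.compile(r"^(?P<count>\d+)/(?P<denom>\d+)$"),
    # bare parenthesized percentage: "(66%)"
    re.compile(r"^\((?P<pct>\d+(?:\.\d+)?)%?\)$"),
    # bare count or percentage: "20", "20.2", "20%", "20.2%"
    re.compile(r"^(?P<count>\d+(?:\.\d+)?)(?P<pct_sign>%)?$"),
]

def parse_cell(cell):
    """Return the named groups of the first matching pattern, or None."""
    for pattern in CELL_PATTERNS:
        m = pattern.match(cell.strip())
        if m:
            return m.groupdict()
    return None
```

A cell such as "P < .05" matches no pattern and returns None, which is how rows with only unparseable values would be dropped.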
(3) extract_male_female_counts_from_table: This function combined information from individual cells to create a single set of counts per table, by finding which rows contained female and male counts, extracting male and female counts for each column within those rows, and combining these counts across columns.
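A simplified version of this combination step might look as follows, assuming sex rows have already been reduced to integer counts per column and that the columns are disjoint treatment arms (handling of Total columns is a separate step). Names and logic here are our assumptions, not the authors' code.

```python
# Hedged sketch: combine parsed sex rows into one set of counts per table.
def extract_sex_counts(rows):
    """rows: list of (row header, [counts per column]); returns totals by sex."""
    totals = {}
    for header, counts in rows:
        label = header.strip().lower()
        if label in ("male", "men"):
            totals["male"] = totals.get("male", 0) + sum(counts)
        elif label in ("female", "women"):
            totals["female"] = totals.get("female", 0) + sum(counts)
    return totals

# Rows without a sex header (e.g., a P value row) are simply ignored.
extract_sex_counts([("Male", [20, 22]), ("Female", [25, 24]), ("P value", [])])
```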
Subsequently, PubMed-Extract selected participant numbers from the tables that had extracted male and female counts and exhibited the most convincing regularity. The function extract_male_female_counts_from_table produced a notification summarizing the column and row regularity condition, such as the following outputs: (1) all columns added up to another column, and there was a single column named Total (e.g., there were columns such as Treatment A, Placebo, and Total, and the counts for Treatment A and Placebo added up to the Total); (2) all columns added up to another column, but no single column was named Total (e.g., there were columns such as Treatment A, Placebo, and Otherwise Named Column, and the counts for Treatment A and Placebo added up to the Otherwise Named Column, which was not obviously a Total column by name but happened to be the exact or near sum of the other 2 columns); (3) all columns added up to a number of participants smaller than the biggest total column; or (4) there were multiple named total columns, and all but 1 of these columns added up to a column of grand totals. The table with the most highly ranked diagnostic message was selected as the table from which the final participant numbers were taken. Ties in ranking were resolved by choosing the table that appeared earliest in the article, because Table 1 often contained the number of participants.
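The first 2 regularity conditions can be sketched as a single check that looks for a column equal to the sum of the others and ranks it higher when it is literally named Total. This is a simplified illustration under our own assumptions (exact equality, a dict of column name to count), not the published ranking logic.

```python
# Hedged sketch of the column-regularity check for conditions (1) and (2).
def find_total_column(counts):
    """counts: dict mapping column name to an integer count.

    Returns (name, rank): rank 1 if the summing column is named "Total"
    (condition 1), rank 2 if it merely equals the sum of the other columns
    (condition 2), or None when no column sums the others.
    """
    for name, value in counts.items():
        others = sum(v for k, v in counts.items() if k != name)
        if value == others:
            return name, (1 if name.strip().lower() == "total" else 2)
    return None

find_total_column({"Treatment A": 20, "Placebo": 22, "Total": 42})
```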

Distant Supervision with Aggregate Analysis of ClinicalTrials.gov
The table parsing core of the PubMed-Extract algorithm was not our first choice. A limitation of PubMed-Extract is that we did not have a source of ground truth. The original idea was to connect published articles with their AACT records and use these AACT records as ground truth. This would have enabled us to (1) train a machine learning algorithm (which needs ground truth) and (2) evaluate the accuracy of PubMed-Extract automatically instead of relying on manual, time-consuming annotations. Therefore, we initially attempted to extract participant numbers from the text of articles, rather than from their tables, using distant supervision from the AACT records, so that information from the AACT records could guide the extraction of counts from published articles (a single method of data extraction from published articles and AACT records), but this was unsuccessful.
As an artificial example of this method, the AACT database might have reported 41 female and 52 male participants in a hypothetical clinical trial. We considered searching for all articles in PubMed that linked to this AACT trial identification number and had full article text from Semantic Scholar, and then searching for the numbers 41 and 52 in the article text. This method was unsuccessful because (1) numbers such as 41 and 52 appeared in more places than expected, (2) the numbers of participants frequently were in tables rather than text, and (3) the parse of the published article file was noisy, precluding clean extraction. The numbers of participants in the AACT records did not match those reported in the published article for 48% of a subset of 1400 studies (Results).
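Failure reason (1) can be shown with a toy illustration; the sentence below is invented for this example, not drawn from any article.

```python
import re

# Toy illustration of why matching AACT counts against article text is
# ambiguous: the searched number can match unrelated quantities.
def find_count_mentions(text, count):
    """Return character offsets of every standalone occurrence of count."""
    return [m.start() for m in re.finditer(rf"\b{count}\b", text)]

text = "Of 93 participants, 41 were women; mean age was 52 (SD 41) years."
find_count_mentions(text, 41)  # matches both the sex count and the SD
```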

Use of Medical Subject Heading Terms to Map Articles to Disease Categories
The selection of 250 Medical Subject Heading (MeSH) terms was arbitrary and based on a perceived tradeoff: too few MeSH terms might not suffice for mapping, whereas too many would require excessive processing time. Of the 250 MeSH terms, 167 (67%) mapped to the 11 disease categories and 83 (33%) did not map to any disease category (eTable 5). With 250 MeSH terms, 147 807 articles were mapped to ≥ 1 disease category, yielding 43 135 articles that enabled extraction of the numbers of male and female participants, which was judged sufficient for this study. The use of more MeSH terms would have yielded more articles, at the expense of greater processing time.
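The mapping itself amounts to a lookup from MeSH term to disease category. The sketch below is illustrative only: the dictionary contents are a tiny assumed subset (the actual 167-term mapping is in eTable 5), although the category names match those in this study.

```python
# Hedged sketch of MeSH-to-category mapping; the dict is an assumed
# illustrative subset, not the study's full 167-term mapping (eTable 5).
MESH_TO_CATEGORY = {
    "HIV Infections": "HIV/AIDS",
    "Myocardial Infarction": "Cardiovascular diseases",
    "Renal Insufficiency, Chronic": "Chronic kidney diseases",
}

def categories_for_article(mesh_terms):
    """Map an article's MeSH terms to the set of disease categories it hits."""
    return {MESH_TO_CATEGORY[t] for t in mesh_terms if t in MESH_TO_CATEGORY}

# Terms outside the mapping (33% of the 250 terms) simply contribute nothing.
categories_for_article(["HIV Infections", "Humans"])
```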