Covariates above the dashed line are statistically significant. Covariates with the most statistically significant associations for each trial characteristic are as follows. A, Participant age with trial phase, number of treatment arms, and industry sponsorship; malignant tumor of urinary bladder with multisite trials and overall enrollment; malignant tumor of lung with use of a data monitoring committee (DMC); and use of opioids with randomization. B, Malignant neoplastic disease with trial phase and overall enrollment; antithrombotic agents with industry sponsorship; primary malignant neoplasm of prostate with randomization; immunosuppressant medications with use of a DMC; and heart disease with multisite trial.
Covariates above the dashed line are statistically significant. Covariates with the most statistically significant associations for each trial characteristic are as follows. A, Viral hepatitis C infection with trial phase; heart disease with blinding; renal impairment with multisite trials and overall enrollment; antithrombotic agents with number of treatment arms and industry sponsorship; and immunosuppressant medications with use of a data monitoring committee (DMC). B, Age with randomization; peripheral vascular disease with trial phase and overall enrollment; atrial fibrillation with number of treatment arms; hyperlipidemia with blinding; heart disease with industry sponsorship; and heart failure with use of a DMC.
eFigure. Overview of primary methodology
eTable 1. Attrition of trials and participants
eTable 2. Covariate comparisons between trial participants and nonparticipants (all trials)
eTable 3. Associations between trial participants’ covariates and trial characteristics, neoplastic disease trials
eTable 4. Associations between trial participants’ covariates and trial characteristics, disorder of digestive system trials
eTable 5. Associations between trial participants’ covariates and trial characteristics, inflammatory disorder trials
eTable 6. Associations between trial participants’ covariates and trial characteristics, disorder of cardiovascular system trials
Rogers JR, Liu C, Hripcsak G, Cheung YK, Weng C. Comparison of Clinical Characteristics Between Clinical Trial Participants and Nonparticipants Using Electronic Health Record Data. JAMA Netw Open. 2021;4(4):e214732. doi:10.1001/jamanetworkopen.2021.4732
Are there differences in clinical characteristics between clinical trial participants and nonparticipants as captured by electronic health record data?
In this cross-sectional study of 1645 clinical trial participants and an aggregated set of 1645 matched nonparticipants, most of the trial participants had fewer underlying conditions and less medication use than nonparticipants.
These findings suggest that a more comprehensive approach to evaluating trials may be beneficial for addressing concerns about the generalizability of clinical trial results.
Importance
Assessing generalizability of clinical trials is important to ensure appropriate application of interventions, but most assessments provide minimal granularity on comparisons of clinical characteristics.
Objective
To assess the extent of underlying clinical differences between clinical trial participants and nonparticipants by using a combination of electronic health record and trial enrollment data.
Design, Setting, and Participants
This cross-sectional study used data obtained from a single academic medical center between September 1996 and January 2019 to identify 1645 clinical trial participants from a diverse set of 202 available trials conducted at the center. Using an aggregated resampling procedure, nonparticipants were matched to participants 1:1 based on trial conditions, number of recent visits to a health care professional, and calendar time.
Exposures
Clinical trial enrollment vs no enrollment.
Main Outcomes and Measures
The primary outcome was standardized differences in clinical characteristics between participants and nonparticipants in clinical trials stratified into the 4 most common disease domains.
Results
This cross-sectional study included 1645 participants from 202 trials (929 [56.5%] male; mean [SD] age, 54.65 [21.38] years) and an aggregated set of 1645 nonparticipants (855 [52.0%] male; mean [SD] age, 57.24 [21.91] years). The most common disease domains for the selected trials were neoplastic disease (86 trials; 737 participants), disorders of the digestive system (31 trials; 321 participants), inflammatory disorders (28 trials; 276 participants), and disorders of the cardiovascular system (27 trials; 319 participants); trials could qualify for multiple disease domains. Among 31 conditions, the percentage of conditions for which the prevalence was lower among participants than among nonparticipants per standardized differences was 64.5% (20 conditions) for neoplastic disease trials, 61.3% (19) for digestive system trials, 58.1% (18) for inflammatory disorder trials, and 38.7% (12) for cardiovascular system trials. Among 17 medications, the percentage of medications for which use was less among participants than among nonparticipants per standardized differences was 64.7% (11) for neoplastic disease trials, 58.8% (10) for digestive system trials, 88.2% (15) for inflammatory disorder trials, and 52.9% (9) for cardiovascular system trials.
Conclusions and Relevance
Using a combination of electronic health record and trial enrollment data, this study found that clinical trial participants had fewer comorbidities and less use of medication than nonparticipants across a variety of disease domains. Combining trial enrollment data with electronic health record data may be useful for better understanding of the generalizability of trial results.
Clinical trials are considered one of the best study designs for generating medical evidence. A common challenge for individuals interpreting clinical trials is assessing generalizability, which is the practice of determining how reasonably relevant the results are to a particular group of individuals who were not part of the trial.1 A variety of these comparisons have shown disparities across many disease domains. In cancer trials, many studies2-5 have noted that trial participants tend to be younger, with more promising prognoses and fewer comorbidities. In cardiovascular (CV) disease trials, participants are more likely to be male, with less risk for developing CV outcomes.3,6 In a meta-analysis of end-stage kidney disease trials, participants were found to be younger and to have different comorbidity profiles.7 Other disease domains in which similar concerns are noted include mental health, psoriasis, and type 2 diabetes.3,8-10 Understanding differences between trial participants and nonparticipants is important because underlying characteristics can influence the estimated effect of an intervention, ultimately impacting its clinical meaningfulness.1 However, many of these prior assessments examined only aggregated estimates between trial participants and nonparticipants with comparisons based on tabular data reported by the selected trials, providing limited insight when comparing the 2 groups. A novel combination of prior trial enrollment data and electronic health records (EHRs) may provide more granular and detailed assessments by leveraging individual participants’ medical history. To our knowledge, prior literature reviews on use of EHR data and other similar data sources for generalizability purposes have found no study exploring this particular linkage.3,11,12
The primary aim of this study was to compare clinical profiles of trial participants with those of nonparticipants as collected from their EHR profiles across different disease domains. We hypothesized that clinical differences would exist regardless of the disease domain. As a secondary aim, we examined associations between participant covariates and trial parameters (eg, randomization use and number of treatment arms). We hypothesized that some covariates would be associated with certain trial parameters, suggesting that certain covariates are evaluated only in certain types of trials.
This cross-sectional study used data obtained from a single academic medical center between September 1996 and January 2019 to identify 1645 clinical trial participants from a diverse set of 202 available trials conducted at the center. The eFigure in the Supplement provides an overview of the methods. This study was approved by the Columbia University institutional review board and qualified for a waiver of informed consent per the Code of Federal Regulations (45 CFR 46.116). The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
The study used 3 data sources. The first was EHR data from the Columbia University Irving Medical Center (CUIMC), an academic medical center in New York, New York. The database contains more than 4.5 million inpatient and outpatient records collected from October 1985 to March 2020. The data are stored in the format of the Observational Medical Outcomes Partnership common data model, version 05, developed and maintained by the Observational Health Data Science and Informatics collaborative.13,14 Data elements of interest were demographic characteristics, medical conditions, and medication prescriptions.
The second data source was an internal report on the participation status of 4022 individuals in 297 interventional medication trials that involved the CUIMC as either the primary site of the trial or a recruitment site for a multisite trial, with records collected from September 1996 to January 2019. The report only contains data regarding patients’ medical record numbers, trial identifiers, status (eg, randomized, completed), and dates of status. The data in the report and patients’ EHR data are linkable through patients’ medical record numbers. The CUIMC setting for these 2 data sources was conducive for the analysis because it is a large medical center in a dense metropolitan area, thus allowing for a large number of patients available, and it has a well-established research environment that supports numerous trials in tandem with clinical care.
The third data source used in this study was the Aggregate Analysis of ClinicalTrials.gov (AACT) database, a publicly available relational database containing records from ClinicalTrials.gov.15 The AACT database and the CUIMC internal report are linkable through trial identifiers. Data were extracted from the AACT database on May 13, 2020.
For each trial participant, we first identified the earliest status date, limiting each participant to their earliest trial. Based on the chosen date, each participant’s record was checked for at least 1 relevant condition code within the 365 days before and including the status date. A relevant condition code was defined as a condition of focus listed in the participant’s trial description per the AACT database. Extracted conditions were converted to standardized codes used in the Observational Medical Outcomes Partnership common data model; if all codes for a trial did not have available conversions, the trial and its accompanying patients were excluded. Likewise, if no code was found in a participant’s record, that participant was excluded. The code in the participant’s record had to be either a direct match or a descendant. The code closest to the status date was used to define the index date for that participant. As a reassurance requirement to increase the confidence of the participant having the index condition, each participant was also required to have the designated index condition or 1 of its descendants recorded as present 365 days before, but not including, the index date.
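The index-date rule described above can be sketched in a few lines. This is a minimal illustration, not the study's code (the analysis was run in R against the common data model). The function name `find_index_date`, the tuple layout of the records, and the toy codes are hypothetical. Descendant matching is simplified to membership in a precomputed set, the reassurance check accepts only the exact index code, and treating "within the 365 days before and including" as an inclusive window is an assumption.

```python
from datetime import date, timedelta


def find_index_date(status_date, condition_records, relevant_codes, window_days=365):
    """Return (index_date, index_code) for one participant, or None if excluded.

    condition_records: list of (record_date, code) tuples (hypothetical layout).
    relevant_codes: set holding the trial's condition codes and their
        descendants, assumed to be precomputed from the vocabulary hierarchy.
    """
    window_start = status_date - timedelta(days=window_days)
    # Relevant codes recorded in the window before and including the status date.
    candidates = [
        (d, c)
        for d, c in condition_records
        if c in relevant_codes and window_start <= d <= status_date
    ]
    if not candidates:
        return None  # no relevant code found: participant excluded
    # The code closest to the status date defines the index date.
    index_date, index_code = max(candidates, key=lambda rec: rec[0])
    # Reassurance requirement: the same condition must also be recorded in the
    # 365 days before, but not including, the index date.
    prior_start = index_date - timedelta(days=window_days)
    has_prior = any(
        c == index_code and prior_start <= d < index_date
        for d, c in condition_records
    )
    return (index_date, index_code) if has_prior else None


records = [(date(2015, 1, 10), "A"), (date(2015, 6, 1), "A"), (date(2015, 6, 20), "B")]
result = find_index_date(date(2015, 6, 15), records, {"A"})
```

In this toy record, the code dated June 1 is the one closest to the status date, and the January entry satisfies the reassurance requirement.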
To identify the pool of nonparticipants, candidates were identified based on the index condition codes used for the trial participants. The aforementioned reassurance requirement was also applied to each candidate. Each participant was then matched to 1 randomly selected nonparticipant based on (1) the index condition, meaning the same condition code; (2) the calendar month and year of the index to control for potential temporal biases in how the data were recorded; and (3) the number of visits to a health care professional within the 365 days before the index date, which was chosen to approximate health care use. Each nonparticipant could be matched to only 1 participant. This matching procedure was repeated 1000 times. If a participant could not be matched in all 1000 iterations, that participant was excluded. Although this procedure relied on straightforward variable-to-variable matching, we chose it over more sophisticated techniques such as propensity scores because the latter would potentially interfere with the study’s primary outcome. Specifically, involvement of clinical covariates for matching could minimize differences between the 2 groups, but these differences were precisely the primary focus of this study.
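As a rough sketch of this matching procedure, the following code performs exact 1:1 matching without replacement and repeats it over many iterations. It is an illustration under stated assumptions rather than the study's implementation: the names `match_once` and `resample_matches`, the dictionary layout, and the reading that a participant is retained only if matched in every iteration are all hypothetical.

```python
import random
from collections import defaultdict


def match_once(participants, candidates, rng):
    """One iteration of exact 1:1 matching without replacement.

    participants, candidates: dicts mapping an id to a matching key of
    (condition_code, "YYYY-MM" index month, visit count in prior 365 days).
    Returns a dict of participant id -> matched nonparticipant id.
    """
    pool = defaultdict(list)
    for cand_id, key in candidates.items():
        pool[key].append(cand_id)
    for ids in pool.values():
        rng.shuffle(ids)  # random selection among candidates with the same key
    matches = {}
    for part_id, key in participants.items():
        if pool[key]:
            # pop() removes the candidate, so each nonparticipant is used once
            matches[part_id] = pool[key].pop()
    return matches


def resample_matches(participants, candidates, n_iter=1000, seed=0):
    """Repeat the matching n_iter times; keep participants matched every time."""
    rng = random.Random(seed)
    runs = [match_once(participants, candidates, rng) for _ in range(n_iter)]
    kept = {pid for pid in participants if all(pid in run for run in runs)}
    return kept, runs


participants = {"p1": ("C1", "2015-06", 3), "p2": ("C2", "2015-06", 1)}
candidates = {
    "n1": ("C1", "2015-06", 3),
    "n2": ("C1", "2015-06", 3),
    "n3": ("C2", "2016-01", 1),
}
kept, runs = resample_matches(participants, candidates, n_iter=50, seed=1)
```

Because no clinical covariate enters the matching key, any covariate differences between the matched groups remain observable, which is the rationale the text gives for avoiding propensity scores.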
Descriptive statistics for trials, participants, and nonparticipants are presented. Trial characteristics were selected based on available study design information from the AACT database. Clinical characteristics (ie, covariates) were based on demographic characteristics, medical conditions, and medication history and were stratified by clinical trial disease domain. Disease domains were derived from ancestor codes for the trials’ condition(s) of focus; we focused on disease domains with the most trials.
All data analyses were performed using R statistical software, version 3.5.1 (R Project for Statistical Computing). Covariates were derived using the Observational Health Data Science and Informatics FeatureExtraction package, version 2.2.5, for R.16 We chose these covariates because they constitute a diverse clinical profile, including prior malignancies, CV diseases, medication prescriptions, and other underlying comorbidities. An individual qualified for a covariate if there was at least 1 code in the individual’s record within the 365 days before and including the index date (relevant code definitions are available on request). To establish descriptive statistics for nonparticipants, we used the mean of the 1000 estimates. Standardized differences were used for assessment because they provide a robust analysis for evaluating covariate imbalance between 2 groups and provide a streamlined approach to identifying covariates that differ most substantially; the cutoff to find differences was set at an absolute difference greater than or equal to 0.1.17,18
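For a binary covariate, a standardized difference is commonly computed from the 2 group proportions as shown below. The text does not spell out the exact formula used, so this widely used form is an assumption, and the function names are hypothetical. The example proportions are the hypertensive disorder prevalences reported for neoplastic disease trials (31.8% of participants vs 42.7% of nonparticipants).

```python
import math


def standardized_difference(p_participants, p_nonparticipants):
    """Standardized difference for a binary covariate (assumed common formula):
    d = (p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    """
    p1, p2 = p_participants, p_nonparticipants
    pooled_var = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    if pooled_var == 0:
        # Both proportions are 0 or 1: no difference, or an unbounded one.
        return 0.0 if p1 == p2 else math.copysign(math.inf, p1 - p2)
    return (p1 - p2) / math.sqrt(pooled_var)


def is_substantial(d, cutoff=0.1):
    """Apply the cutoff from the text: absolute difference of at least 0.1."""
    return abs(d) >= cutoff


# Hypertensive disorder, neoplastic disease trials (values from the Results).
d_hypertension = standardized_difference(0.318, 0.427)
flagged = is_substantial(d_hypertension)
```

Under this formula the hypertension example gives a standardized difference of roughly -0.23, which exceeds the 0.1 cutoff in absolute value.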
As a secondary analysis, we examined the association between trial parameters and participant covariates, stratified by disease domain. We performed χ2 (or Fisher exact) tests for each pairing, with the 2-tailed significance level set at P < .01. To account for multiple testing, we applied a Bonferroni correction within each disease domain. Manhattan-like plots were created to visualize results using the ggplot2 package, version 3.3.2, for R.19
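The testing scheme can be illustrated with a small pure-Python sketch. It is limited to 2 × 2 tables, uses the Pearson χ2 statistic without a continuity correction, and divides the significance level by the number of tests in a domain; each of these choices is an assumption, since the text does not give these details, and the function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the p value is
    P(|Z| > sqrt(statistic)), which equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be nonzero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def bonferroni_significant(p_values, alpha=0.01):
    """Flag p values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


stat_null, p_null = chi_square_2x2(10, 10, 10, 10)
stat_strong, p_strong = chi_square_2x2(20, 0, 0, 20)
flags = bonferroni_significant([0.00001, 0.004, 0.5])
```

With 3 tests, the corrected threshold is 0.01 / 3, so a P value of .004 that would pass the uncorrected level is no longer flagged.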
After applying cohort requirements, a total of 202 trials with 1645 participants were available for analysis (eTable 1 in the Supplement); 929 (56.5%) were male, and the mean (SD) age was 54.65 (21.38) years. Of the aggregated set of 1645 nonparticipants, 855 (52.0%) were male, and the mean (SD) age was 57.24 (21.91) years (additional baseline information is available in eTable 2 in the Supplement). The most common disease domains were neoplastic disease (86 trials; 737 participants), disorders of the digestive system (31 trials; 321 participants), inflammatory disorders (28 trials; 276 participants), and disorders of the CV system (27 trials; 319 participants); trials could qualify for multiple disease domains, so the disease domains were not mutually exclusive. The most common disease in the neoplastic domain in terms of both the number of trials and the number of participants was lymphoma (22 trials; 146 patients). Hepatitis C virus was the most common disease among digestive system disorders and inflammatory disorders (17 trials and 146 patients in each disease domain). For disorders of the CV system, hypertensive disorder was the most common in terms of the number of trials (8), and myocardial disease was the most common in terms of the number of participants (77).
Table 1 summarizes the trials’ characteristics. The most common trial phase across all disease domains was phase 2 with the exception of CV system trials, which were mostly phase 3. Neoplastic disease was the only disease domain in which most trials did not mention the use of a randomization procedure, whereas the CV system domain had the highest proportion of trials that used a randomization procedure. Across all disease domains, the majority of trials were multisite, had an industry sponsor, involved a data monitoring committee, and recruited fewer than 20 patients at the institution.
Table 2 and Table 3 provide covariate comparisons between trial participants and nonparticipants for each disease domain (nonstratified comparisons are shown in eTable 2 in the Supplement). For demographic covariates, substantial differences (ie, absolute value of the standardized difference ≥0.1) in age and race existed between trial participants and nonparticipants in digestive system trials, inflammatory disorder trials, and CV system trials. Differences between participants and nonparticipants in ethnicity were found for neoplastic trials, digestive system trials, and CV system trials. In addition, participants in digestive system trials and CV system trials were more likely to be male (201 of 321 participants [62.6%] in digestive system trials and 217 of 319 participants [68.0%] in CV system trials).
For comorbidities, participants generally had fewer underlying conditions across all trials. Among the 31 conditions, participants had substantially lower prevalence of conditions compared with their nonparticipant counterparts in neoplastic trials (64.5% [20 conditions]), with the largest difference being for hypertensive disorder (234 [31.8%] vs 315 [42.7%]); in digestive system trials (61.3% [19 conditions]), with the largest difference being for hyperlipidemia (33 [10.3%] vs 65 [20.3%]); in inflammatory disorder trials (58.1% [18 conditions]), with the largest difference being for heart failure (9 [3.3%] vs 27 [9.9%]); and in CV system trials (38.7% [12 conditions]), with the largest difference being for cerebrovascular disease (44 [13.8%] vs 80 [25.1%]). In contrast, nonparticipants had substantially lower prevalence of underlying conditions in neoplastic trials (6.4% [2 conditions]), in digestive system trials (6.4% [2 conditions]), in inflammatory disorder trials (9.7% [3 conditions]), and in CV system trials (9.7% [3 conditions]). In neoplastic trials in particular, after hypertension, the largest differences between trial participants and nonparticipants were found for prevalence of heart disease (26.6% vs 36.9%), renal impairment (9.8% vs 17.3%), ischemic heart disease (1.8% vs 5.5%), and coronary arteriosclerosis (6.8% vs 12.4%), indicating that the largest differences tend to be for CV diseases. Consequently, for CV trials, the prevalence of malignant neoplastic disease was lower among trial participants than among nonparticipants (6.6% vs 13.0%).
For medication history, trial participants generally had fewer prescriptions than nonparticipants within the 17 medication classes assessed. Participants had substantially lower prevalence of prescriptions compared with nonparticipants in neoplastic trials (64.7% [11 medication classes]), with the largest difference being for antithrombotic agents (305 [41.4%] vs 397 [53.9%]); in digestive system trials (58.8% [10 medication classes]), with the largest difference being for drugs for treatment of obstructive airway diseases (39 [12.1%] vs 75 [23.5%]); in inflammatory disorder trials (88.2% [15 medication classes]), with the largest difference being for antiepileptics (17 [6.2%] vs 42 [15.4%]); and in CV system trials (52.9% [9 medication classes]), with the largest difference being for immunosuppressants (10 [3.1%] vs 26 [8.2%]). In contrast, nonparticipants had substantially lower prevalence of prescriptions than participants in digestive system trials (5.9% [1 medication class]) and in CV system trials (17.6% [3 medication classes]).
Figure 1 and Figure 2 show the associations between trial participants’ covariates and trial characteristics for each disease domain (data for each individual data point are shown in eTables 3-6 in the Supplement). Neoplastic disease trials had the fewest statistically significant associations; the most prominent associations were for (1) malignant tumor of urinary bladder and multisite trials and overall enrollment and (2) age and phase, number of treatment arms, and industry sponsorship. Regarding age associations specifically, for industry sponsorship, there was a negative association between the inclusion of children younger than 18 years in a trial and industry-sponsor funding (odds ratio, 0.14; 95% CI, 0.09-0.25); 33 of the 69 children (47.8%) included in this study were part of an industry-sponsored trial, compared with 575 of 668 adults (86.1%). For trial phase, no children were involved in phase 1 trials; this is in contrast to 90 of 331 adults (27.2%) aged 18 to 64 years and 63 of 337 elderly participants (18.7%) aged 65 years or older who participated in a phase 1 trial. Among the 26 statistically significant associations for digestive system trials and the 36 statistically significant associations for inflammatory disorder trials, 18 associations overlapped between the 2 disease domains. For CV system trials, the most statistically significant associations were for peripheral vascular disease and phase and overall enrollment.
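The age and industry sponsorship association can be checked approximately from the counts reported above (33 of 69 children and 575 of 668 adults in industry-sponsored trials). The sketch below computes a crude odds ratio with a Wald 95% CI; the text does not state which estimator was used, so this choice and the function name `odds_ratio_wald` are assumptions.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald CI for the 2 x 2 table [[a, b], [c, d]].

    a, b: group 1 with and without the characteristic.
    c, d: group 2 with and without the characteristic.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Counts from the text: industry-sponsored trial vs not, children vs adults.
children_industry, children_total = 33, 69
adults_industry, adults_total = 575, 668

or_, lo, hi = odds_ratio_wald(
    children_industry,
    children_total - children_industry,
    adults_industry,
    adults_total - adults_industry,
)
```

This crude estimate is approximately 0.15, with an interval of about 0.09 to 0.25, which is in line with the reported odds ratio of 0.14 (95% CI, 0.09-0.25); a small gap in the point estimate is expected if a different estimator or rounding rule was used.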
In this cross-sectional study, we used a novel combination of EHR data and trial enrollment data to compare trial participants and nonparticipants, and we examined associations between participant covariates and trial parameters. We found that trial participants had fewer comorbidities and fewer medication prescriptions than did nonparticipants across 4 different disease domains, similar to the findings of prior work.2-6 We also found statistically significant associations among a variety of participant covariates and trial parameters.
In neoplastic disease trials, trial participants had fewer comorbidities than nonparticipants. The largest differences between trial participants and nonparticipants were found for hypertensive disorder (31.8% of participants vs 42.7% of nonparticipants), heart disease (26.6% vs 36.9%), renal impairment (9.8% vs 17.3%), ischemic heart disease (1.8% vs 5.5%), and coronary arteriosclerosis (6.8% vs 12.4%). The observations for CV disease may be associated with CV-related exclusion criteria20 because many cancer therapies are associated with CV toxic effects.21 However, given the large prevalence of trial nonparticipants who had a CV comorbidity, this finding suggests a need for trials expressly focused on this subpopulation to find safer therapeutic alternatives; none of the neoplastic disease trials included in this study qualified as a CV system trial.
Regarding associations between participant covariates and trial parameters, 2 prominent findings were an association between participant age and industry sponsorship and between participant age and trial phase. Industry sponsors might be cautious about funding pediatric trials because of increased liability, more restrictive regulatory oversight, and minimal financial gain.22-24 Regarding trial phase, the most prominent observation was in phase 1 trials, in which no children were involved. Phase 1 trials typically focus on initial safety assessments of interventions given to humans for the first time, usually to establish a maximum tolerated dose.25-27 Despite no such pediatric trials in this study’s data, some trials were designated as phase 1/phase 2, in which finding the maximum tolerated dose was incorporated as part of the study design for ultimately assessing efficacy. This hybrid design may be particularly important for the pediatric population because it allows for a timelier evaluation of efficacy while also attempting to mitigate potential toxic effects.28
Digestive system disorders and inflammatory disorders were the second and third most common disease domains for which trials were conducted, but trials in both domains primarily focused on hepatitis C virus; approximately half of the trial participants in both disease domains had viral hepatitis C infection. Consequently, the 2 sets of covariate differences observed between trial participants and nonparticipants in these disease domains were fairly similar, albeit with differing magnitudes for each covariate. One possible explanation is that many of the hepatitis C trials included in this study required a surgical component (ie, liver transplant). Individuals undergoing such a procedure are required to display adequate health, such as having no severe CV disease or no severe renal dysfunction, to ensure tolerance of postsurgery medications and to minimize concerns that may jeopardize the success of the procedure.29 This is supported by the finding in this study of a higher prevalence of immunosuppressant medication use in the trial participant groups (although there was a discrepancy between the number of immunosuppressant medications and the diagnoses of viral hepatitis C infection, this may reflect pretransplant vs posttransplant timing of trial initiation). Many of the associations observed in these 2 disease domains were found in the hepatitis C trials. For example, many of the viral hepatitis C trials were phase 2 trials, and thus many covariates in these trials were significantly associated with trial phase, including viral hepatitis C, chronic liver disease, lesions of the liver, and use of immunosuppressant medications.
Although CV system trials had the fewest prominent differences between participants and nonparticipants, observations consistent with prior studies persisted,3,6 particularly for differences in the prevalence of cerebrovascular disease (present in 13.8% of trial participants vs 25.1% of nonparticipants) and malignant neoplastic disease (present in 6.6% of trial participants vs 13.0% of nonparticipants) as well as in female participation (32.0% of trial participants vs 45.8% of nonparticipants). The difference in the prevalence of cerebrovascular disease might be a result of individuals experiencing a cerebrovascular event that led to cognitive impairment or a debilitating disability, thus precluding trial participation.30-32 Alternatively, the difference might result from some trials designating this covariate as a safety outcome, which would exclude individuals with cerebrovascular disease because they would begin the trial at increased risk.33,34 Regardless, one-fourth of nonparticipants were found to have prior cerebrovascular disease, and overlooking these individuals in trials may hinder how relevant the results are to this group. Regarding the low prevalence of female participation, a possible explanation is that females may perceive greater risk in trial participation than males do, and thus, they may forgo participation; another possibility is that females present with CV disease at later ages, which may exclude them from certain trials.3,35-38 In addition, the difference in the prevalence of malignant neoplastic disease among participants and nonparticipants in CV system trials echoes the aforementioned concern that some chemotherapy regimens may cause cardiotoxic effects.21
This study has limitations. First, there was potential misclassification when selecting nonparticipants. In particular, matching on conditions did not consider trials defined by multiple simultaneous conditions. Likewise, EHR data are susceptible to data-quality concerns, such as missing data elements and erroneous documentation of irrelevant elements, that can affect how patients are assessed.39 We tried to mitigate these concerns by matching trial participants and nonparticipants based on condition, calendar time, and number of visits to a health care professional. Second, trial characteristics were based on ClinicalTrials.gov entries, which represent a condensed summary of protocols.40 Third, we had access to patients from only a single center, resulting in having information regarding only some of the participants in multisite trials. Fourth, the study data were from a single large academic medical center, which may have affected the types of patients available and the types of trials conducted. For example, the CUIMC houses many specialty services, such as oncology clinics and transplant centers, for patients with complex medical histories, which may have led to data from a higher proportion of patients who had greater comorbidity burdens compared with patients in other clinical environments.
In this cross-sectional study, combining data on prior trial enrollment with EHR data provided a source of information to evaluate the generalizability of trials and to inform the designs of future trials. The findings of this analysis support prior observations, highlight potentially overlooked subgroups, and provide insight regarding why certain patient characteristics may be associated with certain trial characteristics. The results also suggest that linking EHR data with data on prior trial enrollment may enhance the interpretation of clinical trial findings.
Accepted for Publication: February 14, 2021.
Published: April 7, 2021. doi:10.1001/jamanetworkopen.2021.4732
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Rogers JR et al. JAMA Network Open.
Corresponding Author: Chunhua Weng, PhD, Department of Biomedical Informatics, Columbia University, 622 W 168th St, PH-20, Rm 407, New York, NY 10032 (firstname.lastname@example.org).
Author Contributions: Dr Weng had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Rogers, Hripcsak, Weng.
Acquisition, analysis, or interpretation of data: Rogers, Liu, Cheung, Weng.
Drafting of the manuscript: Rogers.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Rogers, Liu, Cheung.
Administrative, technical, or material support: Liu, Hripcsak, Weng.
Conflict of Interest Disclosures: Mr Rogers and Dr Weng reported receiving grants from the National Library of Medicine during the conduct of the study. Dr Hripcsak reported receiving grants from the National Institutes of Health during the conduct of the study. No other disclosures were reported.
Funding/Support: This research was funded by grants R01LM009886 (Dr Weng) and 5T15LM007079 (Dr Hripcsak) from the National Library of Medicine.
Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.