Comparison of Clinical Characteristics Between Clinical Trial Participants and Nonparticipants Using Electronic Health Record Data

This cross-sectional study uses electronic health records and clinical trial enrollment data to assess differences in the clinical characteristics of trial participants and nonparticipants.


Introduction
Clinical trials are considered one of the best study designs for generating medical evidence. A common challenge for individuals interpreting clinical trials is assessing generalizability, which is the practice of determining how reasonably relevant the results are to a particular group of individuals who were not part of the trial. 1 A variety of comparisons between trial participants and nonparticipants have shown disparities across many disease domains. In cancer trials, many studies [2][3][4][5] have noted that trial participants tend to be younger, with more promising prognoses and fewer comorbidities. In cardiovascular (CV) disease trials, participants are more likely to be male, with less risk for developing CV outcomes. 3,6 In a meta-analysis of end-stage kidney disease trials, participants were found to be younger and to have different comorbidity profiles. 7 Other disease domains in which similar concerns are noted include mental health, psoriasis, and type 2 diabetes. 3,[8][9][10] Understanding differences between trial participants and nonparticipants is important because underlying characteristics can influence the estimated effect of an intervention, ultimately impacting its clinical meaningfulness. 1 However, many of these prior assessments examined only aggregated estimates between trial participants and nonparticipants with comparisons based on tabular data reported by the selected trials, providing limited insight when comparing the 2 groups. A novel combination of prior trial enrollment data and electronic health records (EHRs) may provide more granular and detailed assessments by leveraging individual participants' medical history. To our knowledge, prior literature reviews on use of EHR data and other similar data sources for generalizability purposes have found no study exploring this particular linkage. 3,11,12
The primary aim of this study was to compare clinical profiles of trial participants with those of nonparticipants as collected from their EHR profiles across different disease domains. We hypothesized that clinical differences would exist regardless of the disease domain. As a secondary aim, we examined associations between participant covariates and trial parameters (eg, randomization use and number of treatment arms). We hypothesized that some covariates would be associated with certain trial parameters, suggesting that certain covariates are evaluated only in certain types of trials.

Methods
This cross-sectional study used data obtained from a single academic medical center between September 1996 and January 2019 to identify 1645 clinical trial participants from a diverse set of 202 available trials conducted at the center. The eFigure in the Supplement provides an overview of the methods. This study was approved by the Columbia University institutional review board and qualified for a waiver of informed consent per the Code of Federal Regulations (45 CFR 46.116). The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.

Data Sources
The study used 3 data sources. The first was EHR data from the Columbia University Irving Medical Center (CUIMC), an academic medical center in New York, New York. The database contains more
The second data source was an internal report on the participation status of 4022 individuals in 297 interventional medication trials that involved the CUIMC as either the primary site of the trial or a recruitment site for a multisite trial, with records collected from September 1996 to January 2019.
The report only contains data regarding patients' medical record numbers, trial identifiers, status (eg, randomized, completed), and dates of status. The data in the report and patients' EHR data are linkable through patients' medical record numbers. The CUIMC setting for these 2 data sources was conducive for the analysis because it is a large medical center in a dense metropolitan area, thus allowing for a large number of patients available, and it has a well-established research environment that supports numerous trials in tandem with clinical care.
The third data source used in this study was the Aggregate Analysis of ClinicalTrials.gov (AACT) database, a publicly available relational database containing records from ClinicalTrials.gov. 15 The AACT database and the CUIMC internal report are linkable through trial identifiers. Data were extracted from the AACT database on May 13, 2020.
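The linkage described above can be illustrated with a minimal Python sketch. It assumes the enrollment report, the EHR data, and the AACT records are already loaded as plain dictionaries; the field names (mrn, trial_id, status, status_date) and the function link_sources are hypothetical and are not taken from the study's data.

```python
def link_sources(enrollment_rows, ehr_by_mrn, aact_by_trial):
    """Join enrollment rows to EHR records (by medical record number)
    and to AACT trial records (by trial identifier).

    Rows that cannot be linked on both keys are dropped.
    """
    linked = []
    for row in enrollment_rows:
        patient = ehr_by_mrn.get(row["mrn"])
        trial = aact_by_trial.get(row["trial_id"])
        if patient is None or trial is None:
            continue  # no matching EHR record or no matching AACT entry
        linked.append({**row, "ehr": patient, "trial": trial})
    return linked


# Hypothetical toy data for illustration only.
enrollment_rows = [
    {"mrn": "M1", "trial_id": "T1", "status": "randomized", "status_date": "2015-07-01"},
    {"mrn": "M2", "trial_id": "T9", "status": "completed", "status_date": "2016-02-01"},
]
ehr_by_mrn = {"M1": {"conditions": ["lung_cancer"]}, "M2": {"conditions": []}}
aact_by_trial = {"T1": {"phase": "Phase 2", "conditions": ["neoplasm"]}}

linked = link_sources(enrollment_rows, ehr_by_mrn, aact_by_trial)
```

In this toy example only the first row is retained, because trial T9 has no entry in the hypothetical AACT dictionary.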

Selection of Trial Participants
For each trial participant, we first identified the earliest status date, limiting each participant to their earliest trial. Based on the chosen date, each participant's record was checked for at least 1 relevant condition code within the 365 days before and including the status date. A relevant condition code was defined as a condition of focus listed in the participant's trial description per the AACT database.
Extracted conditions were converted to standardized codes used in the Observational Medical Outcomes Partnership common data model; if all codes for a trial did not have available conversions, the trial and its accompanying patients were excluded. Likewise, if no code was found in a participant's record, that participant was excluded. The code in the participant's record had to be either a direct match or a descendant. The code closest to the status date was used to define the index date for that participant. As a reassurance requirement to increase the confidence of the participant having the index condition, each participant was also required to have the designated index condition or 1 of its descendants recorded as present 365 days before, but not including, the index date.
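The selection rules above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the function names, the record layout (a list of date and code pairs), and the descendants mapping (each code mapped to the set of all its descendant codes) are hypothetical, and the 365-day windows are assumed to be inclusive at their start.

```python
from datetime import date, timedelta


def earliest_enrollment(status_rows):
    """Limit a participant to the trial with the earliest status date."""
    return min(status_rows, key=lambda row: row["status_date"])


def expand_with_descendants(codes, descendants):
    """Return the given codes plus all of their descendant codes."""
    expanded = set(codes)
    for code in codes:
        expanded |= descendants.get(code, set())
    return expanded


def select_index(records, status_date, trial_codes, descendants, lookback_days=365):
    """Pick the relevant condition record closest to the status date.

    Only records within the lookback window before and including the
    status date are considered. Returns (index_date, code) or None.
    """
    relevant = expand_with_descendants(trial_codes, descendants)
    window_start = status_date - timedelta(days=lookback_days)
    hits = [(d, c) for d, c in records
            if c in relevant and window_start <= d <= status_date]
    if not hits:
        return None  # participant would be excluded
    return max(hits, key=lambda hit: hit[0])


def passes_reassurance(records, index_date, index_code, descendants, lookback_days=365):
    """Require the index condition (or a descendant) before, but not on, the index date."""
    family = expand_with_descendants({index_code}, descendants)
    window_start = index_date - timedelta(days=lookback_days)
    return any(c in family and window_start <= d < index_date for d, c in records)


# Hypothetical toy data for illustration only.
descendants = {"neoplasm": {"lung_cancer"}}
records = [
    (date(2015, 1, 10), "lung_cancer"),
    (date(2015, 6, 1), "lung_cancer"),
    (date(2015, 6, 20), "hypertension"),
]
enrollment = earliest_enrollment([
    {"trial_id": "T2", "status_date": date(2016, 3, 1)},
    {"trial_id": "T1", "status_date": date(2015, 7, 1)},
])
index = select_index(records, enrollment["status_date"], {"neoplasm"}, descendants)
```

Here the June 1, 2015, record becomes the index, and the January 10, 2015, record satisfies the reassurance requirement.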

Selection of Nonparticipants
To identify the pool of nonparticipants, candidates were identified based on the index condition codes used for the trial participants. The aforementioned reassurance requirement was also applied to each candidate. Each participant was then matched to 1 randomly selected nonparticipant based on (1) the index condition, meaning the same condition code; (2) the calendar month and year of the index to control for potential temporal biases in how the data were recorded; and (3) the number of visits to a health care professional within the 365 days before the index date, which was chosen to approximate health care use. Each nonparticipant could be matched to only 1 participant. This matching procedure was repeated 1000 times. If a participant could not be matched in all 1000 iterations, that participant was excluded. Although this procedure relied on straightforward variable-to-variable matching, we chose it over more sophisticated techniques such as propensity scores because the latter would potentially interfere with the study's primary outcome. Specifically, involvement of clinical covariates for matching could minimize differences between the 2 groups, but these differences were precisely the primary focus of this study.
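A minimal Python sketch of this kind of repeated 1:1 exact matching is shown below. It rests on several assumptions that the text does not confirm: the visit count is matched exactly, participants are processed in random order within each iteration, and a participant is retained only if matched in every iteration. The function names and record fields are hypothetical.

```python
import random
from collections import defaultdict


def match_once(participants, candidates, rng):
    """One iteration of 1:1 exact matching without replacement.

    Each person is a dict with keys: id, condition, index_month, visits.
    Returns a dict mapping participant id to matched nonparticipant id.
    """
    pools = defaultdict(list)
    for cand in candidates:
        key = (cand["condition"], cand["index_month"], cand["visits"])
        pools[key].append(cand["id"])
    for pool in pools.values():
        rng.shuffle(pool)  # random selection within each stratum

    order = list(participants)
    rng.shuffle(order)  # avoid favoring participants listed first
    matches = {}
    for person in order:
        key = (person["condition"], person["index_month"], person["visits"])
        pool = pools.get(key)
        if pool:
            # pop() removes the candidate so it cannot be reused in this iteration
            matches[person["id"]] = pool.pop()
    return matches


def repeated_matching(participants, candidates, iterations=1000, seed=0):
    """Repeat the matching and keep participants matched in every iteration."""
    rng = random.Random(seed)
    runs = [match_once(participants, candidates, rng) for _ in range(iterations)]
    kept = {p["id"] for p in participants if all(p["id"] in run for run in runs)}
    filtered = [{pid: nid for pid, nid in run.items() if pid in kept} for run in runs]
    return filtered, kept


# Hypothetical toy data for illustration only.
participants = [
    {"id": "P1", "condition": "A", "index_month": "2015-06", "visits": 3},
    {"id": "P2", "condition": "B", "index_month": "2016-01", "visits": 1},
]
candidates = [
    {"id": "N1", "condition": "A", "index_month": "2015-06", "visits": 3},
    {"id": "N2", "condition": "A", "index_month": "2015-06", "visits": 3},
]
runs, kept = repeated_matching(participants, candidates, iterations=50)
```

In the toy data, P2 has no candidate in its stratum and is therefore dropped, whereas P1 is matched to either N1 or N2 in each iteration.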

Statistical Analysis
Descriptive statistics for trials, participants, and nonparticipants are presented. Trial characteristics were selected based on available study design information from the AACT database. Clinical characteristics (ie, covariates) were based on demographic characteristics, medical conditions, and medication history and were stratified by clinical trial disease domain. Disease domains were derived from ancestor codes for the trials' condition(s) of focus; we focused on disease domains with the most trials.
All data analyses were performed using R statistical software, version 3.5.1 (R Project for Statistical Computing). Covariates were derived using the Observational Health Data Science and Informatics FeatureExtraction package, version 2.2.5, for R. 16 We chose these covariates because they constitute a diverse clinical profile, including prior malignancies, CV diseases, medication prescriptions, and other underlying comorbidities. An individual qualified for a covariate if there was at least 1 code in the individual's record within the 365 days before and including the index date (relevant code definitions are available on request). To establish descriptive statistics for nonparticipants, we used the mean of the 1000 estimates. Standardized differences were used for assessment because they provide a robust analysis for evaluating covariate imbalance between 2 groups and provide a streamlined approach to identifying covariates that differ most substantially; the cutoff to find differences was set at an absolute difference greater than or equal to 0.1. 17,18
As a secondary analysis, we examined the association between trial parameters and participant covariates, stratified by disease domain. We performed χ² (or Fisher exact) tests for each pairing, with the 2-tailed significance level set at P < .01. To account for multiple testing, we applied a Bonferroni correction within each disease domain. Manhattan-like plots were created to visualize results using the ggplot2 package, version 3.3.2, for R. 19
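The two calculations described above can be illustrated with a short Python sketch. The study itself used R; this sketch substitutes Python's standard library and makes simplifying assumptions: the standardized difference uses the common formula for binary covariates, the χ² test is limited to 2 × 2 tables without a continuity correction, and the Bonferroni threshold divides the significance level by the number of tests in a domain. All function names are hypothetical.

```python
import math


def standardized_difference(p1, p2):
    """Standardized difference for a binary covariate.

    p1, p2: prevalence (as proportions) in participants and nonparticipants.
    """
    pooled_var = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    if pooled_var == 0:
        return 0.0 if p1 == p2 else math.copysign(math.inf, p1 - p2)
    return (p1 - p2) / math.sqrt(pooled_var)


def mean_estimate(estimates):
    """Average an estimate across repeated matching iterations."""
    return sum(estimates) / len(estimates)


def chi2_2x2_pvalue(a, b, c, d):
    """Pearson chi-square P value for the 2 x 2 table [[a, b], [c, d]].

    With 1 degree of freedom, the upper tail equals erfc(sqrt(stat / 2)).
    No continuity correction is applied in this sketch.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2))


def bonferroni_significant(p_values, alpha=0.01):
    """Flag P values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Prevalences of 31.8% vs 42.7% give a standardized difference of about -0.23,
# which exceeds the 0.1 cutoff in absolute value.
d = standardized_difference(0.318, 0.427)
substantial = abs(d) >= 0.1

# Hypothetical counts and P values for illustration only.
p = chi2_2x2_pvalue(10, 20, 20, 10)
flags = bonferroni_significant([0.0001, 0.004, 0.2])
```

With 3 tests and a significance level of .01, the corrected threshold is approximately .0033, so only the first hypothetical P value is flagged.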

Trial Characteristics
After applying cohort requirements, a total of 202 trials with 1645 participants were available for analysis (eTable 1 in the Supplement); 929 (56.5%) were male, and the mean (SD) age was 54.65
Table 1 summarizes the trials' characteristics. The most common trial phase across all disease domains was phase 2 with the exception of CV system trials, which were mostly phase 3. Neoplastic disease was the only disease domain in which most trials did not mention the use of a randomization procedure, whereas the CV system domain had the highest proportion of trials that used a randomization procedure. Across all disease domains, the majority of trials were multisite, had an industry sponsor, involved a data monitoring committee, and recruited fewer than 20 patients at the institution.
Table 2 and Table 3 provide covariate comparisons between trial participants and nonparticipants for each disease domain (nonstratified comparisons are shown in eTable 2 in the Supplement). For demographic covariates, substantial differences (ie, absolute value of the standardized difference ≥0.1) in age and race existed between trial participants and nonparticipants in digestive system trials, inflammatory disorder trials, and CV system trials. Differences between participants and nonparticipants in ethnicity were found for neoplastic trials, digestive system trials, and CV system trials. In addition, participants in digestive system trials and CV system trials were more likely to be male [2]), in inflammatory disorder trials (9.7% [3]), and in CV system trials (9.7% [3]).

Comparison of Trial Participants and Nonparticipants
In neoplastic trials in particular, after hypertension, the largest differences between trial participants and nonparticipants were found for prevalence of heart disease (26.6% vs 36.9%), renal impairment (9.8% vs 17.3%), ischemic heart disease (1.8% vs 5.5%), and coronary arteriosclerosis (6.8% vs 12.4%), indicating that the largest differences tend to be for CV diseases. Conversely, in CV trials, the prevalence of malignant neoplastic disease was lower among trial participants than among nonparticipants (6.6% vs 13.0%).
For medication history, trial participants generally had fewer prescriptions than nonparticipants within the 17 medication classes assessed. Participants had substantially lower prevalence of

Discussion
In this cross-sectional study, we used a novel combination of EHR data and trial enrollment data to compare trial participants and nonparticipants, and we examined associations between participant covariates and trial parameters. We found that trial participants had fewer comorbidities and fewer medication prescriptions than did nonparticipants across 4 different disease domains, similar to the findings of prior work. [2][3][4][5][6] We also found statistically significant associations among a variety of participant covariates and trial parameters.
In neoplastic disease trials, trial participants had fewer comorbidities than nonparticipants. The largest differences between trial participants and nonparticipants were found for hypertensive disorder (31.8% of participants vs 42.7% of nonparticipants), heart disease (26.6% vs 36.9%), renal impairment (9.8% vs 17.3%), ischemic heart disease (1.8% vs 5.5%), and coronary arteriosclerosis (6.8% vs 12.4%). The observations for CV disease may be associated with CV-related exclusion criteria 20 because many cancer therapies are associated with CV toxic effects. 21 However, given the large prevalence of trial nonparticipants who had a CV comorbidity, this finding suggests a need for trials expressly focused on this subpopulation to find safer therapeutic alternatives; none of the neoplastic disease trials included in this study qualified as a CV system trial.
Regarding associations between participant covariates and trial parameters, 2 prominent findings were an association between participant age and industry sponsorship and an association between participant age and trial phase. Industry sponsors might be cautious about funding pediatric trials because of increased liability, more restrictive regulatory oversight, and minimal financial gain. [22][23][24] Regarding trial phase, the most prominent observation was in phase 1 trials, in which no children were involved. Phase 1 trials typically focus on initial safety assessments of interventions given to humans for the first time, usually to establish a maximum tolerated dose. [25][26][27] Despite no such pediatric trials in this study's data, some trials were designated as phase 1/phase 2, in which finding the maximum tolerated dose was incorporated as part of the study design for ultimately assessing efficacy. This hybrid design may be particularly important for the pediatric population because it allows for a timelier evaluation of efficacy while also attempting to mitigate potential toxic effects. 28
Digestive system disorders and inflammatory disorders were the second and third most common disease domains for which trials were conducted, but trials in both domains primarily focused on hepatitis C virus; approximately half of the trial participants in both disease domains had viral hepatitis C infection. Consequently, the 2 sets of covariate differences observed between trial participants and nonparticipants in these disease domains were fairly similar, albeit with differing magnitudes for each covariate. One possible explanation is that many of the hepatitis C trials included in this study required a surgical component (ie, liver transplant).
Individuals undergoing such a procedure are required to display adequate health, such as having no severe CV disease or no severe renal dysfunction, to ensure tolerance of postsurgery medications and to minimize concerns that may jeopardize the success of the procedure. 29 This is supported by the finding in this study of a higher prevalence of immunosuppressant medication use in the trial participant groups (although there was a discrepancy between the number of immunosuppressant medications and the diagnoses of viral hepatitis C infection, this may reflect pretransplant vs posttransplant timing of trial initiation). Many of the associations observed in these 2 disease domains were found in the hepatitis C trials. For example, many of the viral hepatitis C trials were phase 2 trials, and thus many covariates in these trials were significantly associated with trial phase, including viral hepatitis C, chronic liver disease, lesions of the liver, and use of immunosuppressant medications.
Figure legend. Covariates above the dashed line are statistically significant. Covariates with the most statistically significant associations per each trial characteristic are as follows. A, Participant age with trial phase, number of treatment arms, and industry sponsorship; malignant tumor of urinary bladder with multisite trials and overall enrollment; malignant tumor of lung with use of a data monitoring committee (DMC); and use of opioids with randomization. B, Malignant neoplastic disease with trial phase and overall enrollment; antithrombotic agents with industry sponsorship; primary malignant neoplasm of prostate with randomization; immunosuppressant medications with use of a DMC; and heart disease with multisite trial.
Although CV system trials had the fewest prominent differences between participants and nonparticipants, observations consistent with prior studies persisted, 3,6 particularly for differences in the prevalence of cerebrovascular disease (present in 13.8% of trial participants vs 25.1% of nonparticipants) and malignant neoplastic disease (present in 6.6% of trial participants vs 13.0% of nonparticipants) as well as in female participation (32.0% of trial participants vs 45.8% of nonparticipants). The difference in the prevalence of cerebrovascular disease might be a result of individuals experiencing a cerebrovascular event that led to cognitive impairment or a debilitating disability, thus precluding trial participation. [30][31][32] Alternatively, the difference might result from some trials designating this covariate as a safety outcome, which would exclude individuals with cerebrovascular disease because they would begin the trial at increased risk. 33,34 Regardless, one-fourth of nonparticipants were found to have prior cerebrovascular disease, and overlooking these individuals in trials may hinder how relevant the results are to this group. Regarding the low prevalence of female participation, a possible explanation is that females may perceive greater risk in trial participation than males do, and thus, they may forgo participation; another possibility is that females present with CV disease at later ages, which may exclude them from certain trials. 3,[35][36][37][38] In addition, the difference in the prevalence of malignant neoplastic disease among participants and nonparticipants in CV system trials echoes the aforementioned concern that some chemotherapy regimens may cause cardiotoxic effects. 21
Figure legend. Covariates above the dashed line are statistically significant. Covariates with the most statistically significant associations per each trial characteristic are as follows. A, Viral hepatitis C infection with trial phase; heart disease with blinding; renal impairment with multisite trials and overall enrollment; antithrombotic agents with number of treatment arms and industry sponsorship; and immunosuppressant medications with use of a data monitoring committee (DMC). B, Age with randomization; peripheral vascular disease with trial phase and overall enrollment; atrial fibrillation with number of treatment arms; hyperlipidemia with blinding; heart disease with industry sponsorship; and heart failure with use of a DMC.

Limitations
This study has limitations. First, there was potential misclassification when selecting nonparticipants.
In particular, matching on conditions did not consider trials defined by multiple simultaneous conditions. Likewise, EHR data are susceptible to data-quality concerns, such as missing data elements and erroneous documentation of irrelevant elements, that can affect how patients are assessed. 39 We tried to mitigate these concerns by matching trial participants and nonparticipants based on condition, calendar time, and number of visits to a health care professional. Second, trial characteristics were based on ClinicalTrials.gov entries, which represent a condensed summary of protocols. 40 Third, we had access to patients from only a single center, resulting in having information regarding only some of the participants in multisite trials. Fourth, the study data were from a single large academic medical center, which may have affected the types of patients available and the types of trials conducted. For example, the CUIMC houses many specialty services, such as oncology clinics and transplant centers, for patients with complex medical histories, which may have led to data from a higher proportion of patients who had greater comorbidity burdens compared with patients in other clinical environments.

Conclusions
In this cross-sectional study, combining data on prior trial enrollment with EHR data provided a source of information to evaluate the generalizability of trials and to inform the designs of future trials. The findings of this analysis support prior observations, highlight potentially overlooked subgroups, and provide insight regarding why certain patient characteristics may be associated with certain trial characteristics. The results also suggest that linking EHR data with data on prior trial enrollment may enhance the interpretation of clinical trial findings.