eAppendix. Multiplicity Coding Manual
eTable 1. Examples of Multiple Analyses and Multiple Outcome Variables
eTable 2. Recommendations on Multiplicity Error in Clinical Trials
Khan MS, Khan MS, Ansari ZN, et al. Prevalence of Multiplicity and Appropriate Adjustments Among Cardiovascular Randomized Clinical Trials Published in Major Medical Journals. JAMA Netw Open. 2020;3(4):e203082. doi:10.1001/jamanetworkopen.2020.3082
What is the prevalence of multiplicity among cardiovascular randomized clinical trials published in 6 medical journals with a high impact factor, and how frequently are multiplicity adjustments made in these trials?
In this cross-sectional study, data were collected from the past issues of 6 journals with high impact factors published between August 1, 2015, and July 31, 2018. Of 511 cardiovascular randomized clinical trials included in this analysis, 300 had some form of multiplicity; of these 300, only 85 adjusted for multiplicity.
Among contemporary cardiovascular randomized clinical trials, it appears that multiplicity adjustments are infrequently reported.
Importance
Multiple analyses in a clinical trial can increase the probability of inaccurately concluding that there is a statistically significant treatment effect. However, to date, it is unknown how many randomized clinical trials (RCTs) perform adjustments for multiple comparisons, the lack of which could lead to erroneous findings.
Objective
To assess the prevalence of multiplicity and whether appropriate multiplicity adjustments were performed among cardiovascular RCTs published in 6 medical journals with a high impact factor.
Design, Setting, and Participants
In this cross-sectional study, cardiovascular RCTs were selected from all over the world, characterized as North America, Western Europe, multiregional, and rest of the world. Data were collected from past issues of 3 cardiovascular journals (Circulation, European Heart Journal, and Journal of the American College of Cardiology) and 3 general medicine journals (JAMA, The Lancet, and The New England Journal of Medicine) with high impact factors published between August 1, 2015, and July 31, 2018. Supplements and trial protocols of each of the included RCTs were also searched for multiplicity. Data were analyzed December 20 to 27, 2018.
Data from the selected RCTs were extracted and verified independently by 2 researchers using a structured data instrument. In case of disagreement, a third reviewer helped to achieve consensus. An RCT was considered to have multiple treatment groups if it had more than 2 arms; multiple outcomes were defined as having more than 1 primary outcome, and multiple analyses were defined as analysis of the same outcome variable in multiple ways. Multiplicity was examined only for the analysis of the primary end point.
Main Outcomes and Measures
Outcomes of interest were percentages of primary analyses that performed multiplicity adjustment of primary end points.
Results
Of 511 cardiovascular RCTs included in this analysis, 300 (58.7%) had some form of multiplicity; of these 300, only 85 (28.3%) adjusted for multiplicity. Intervention type and funding source had no statistically significant association with the reporting of multiplicity risk adjustment. Trials that assessed mortality vs nonmortality outcomes were more likely to contain a multiplicity risk in their primary analysis (66.3% [177 of 267] vs 50.4% [123 of 244]; P < .001), and larger trials were less likely than smaller trials to make any adjustments for multiplicity (21.4% [33 of 154] vs 35.6% [52 of 146]; P = .001).
Conclusions and Relevance
Findings from this study suggest that cardiovascular RCTs published in medical journals with high impact factors demonstrate infrequent adjustments to correct for multiple comparisons in the primary end point. These parameters may be improved by more standardized reporting.
Previous studies1-4 have raised concerns about selective reporting of outcomes in randomized clinical trials (RCTs). However, few reports have focused on multiplicity, which (along with incomplete reporting) is a major factor contributing to nonreproducibility of published claims.5 Multiplicity refers to the “potential inflation of type I error rate as a result of multiple testing, for example because of multiple subgroup comparisons, comparisons across multiple treatment arms, analysis of multiple outcomes, and multiple analyses of the same outcome at different times.”6
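The inflation named in this definition can be illustrated with a minimal sketch. It assumes k independent tests of true null hypotheses, each run at a nominal significance level of .05; this is an idealization, because analyses within a single trial are usually correlated. The function name is ours and is not taken from the cited sources.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false-positive result among k independent
    tests of true null hypotheses, each run at significance level alpha.

    Assumption: the tests are independent, which real trial analyses
    rarely are, so this is an illustrative upper-level sketch only.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 2, 3, 5, 8):
        rate = familywise_error_rate(k)
        print(f"{k} tests: familywise error rate = {rate:.3f}")
```

Under these assumptions, 2 unadjusted tests already carry a familywise error rate of nearly 10%, and 8 tests carry a rate of roughly 34%.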
Negative consequences associated with multiplicity could be prevented by complete and accurate reporting of analyses outlined in the registered trial protocols. Multiplicity could also be mitigated by statistical adjustment when multiple analyses are specified a priori. Several statistical methods, such as defining coprimary outcome variables, performing various stepwise procedures,7-10 applying methods for multiple-group comparisons11,12 and including gatekeeping or hierarchical testing, have been proposed for multiplicity adjustment.13,14
To our knowledge, no study has reported on the prevalence of multiplicity among cardiovascular RCTs and, when applicable, whether appropriate multiplicity adjustments were implemented. To fill this knowledge gap, we conducted a cross-sectional study of cardiovascular RCTs published in medical journals with high impact factors to assess the reporting quality of statistical analyses, including the frequency with which multiplicity adjustments were reported.
This cross-sectional study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.15 It also followed methods from the American Heart Association on standards for cardiac prevention and treatment studies.16
Three cardiovascular journals (Circulation, European Heart Journal, and Journal of the American College of Cardiology) and 3 general medicine journals (JAMA, The Lancet, and The New England Journal of Medicine) published between August 1, 2015, and July 31, 2018, were searched for general trial characteristics, multiplicity error, and multiplicity correction to assess the pool of recent and contemporary cardiovascular clinical trials. Data were analyzed December 20 to 27, 2018. These journals were chosen based on their high impact factor, broad readership, and reputation of publishing important clinical trials used in the development of guidelines. Supplements and trial protocols of each of the included RCTs were also searched for general trial characteristics, multiplicity error, and multiplicity correction.
Articles were selected if they reported results of cardiovascular RCTs and compared at least 2 treatment groups. Excluded were brief communications, research letters, and animal studies. Data from the selected RCTs were extracted and verified by 2 of us (M.S.K. and Z.N.A.) independently using a structured data instrument and then cross-checked by another of us (T.J.S).
Data were extracted from both the primary and secondary articles. Primary articles were defined as reports on an empirical research study conducted by the authors analyzing data collected for the first time, while secondary articles were studies derived from data collected and analyzed from primary articles. We analyzed and extracted data only from the analysis of the primary end point of each RCT because multiplicity in a secondary analysis is generally exploratory or hypothesis generating. A multiplicity coding manual was developed to investigate the reporting of primary statistical analyses, multiple analyses, and adjustments for multiplicity issues (eAppendix in the Supplement). The multiplicity coding manual was pretested and modified by coding 15 articles initially. Two of us (M.S.K. and Z.N.A.) coded each article separately and discussed any inconsistencies in the data and modified the multiplicity coding manual accordingly. The rest of the articles were then coded according to this multiplicity coding manual. The complete published articles were searched for general trial characteristics, multiplicity error, and multiplicity correction by the coders, along with additional supplementary material (eg, trial protocols and appendixes if they were referred to in the article). The order of the articles was randomized for each coder.
To measure the extent of agreement between the 2 independent coders, the κ statistic was used and calculated according to the methods by Landis and Koch.17 The frequency of discrepancies between the coders was computed using the Kappa Calculator (Statistics Solutions),18 and the κ statistics were assessed for several outcomes. There was substantial agreement in reproducibility for the presence of multiplicity, with κ = 0.76 (95% CI, 0.51-0.90) in the main text and κ = 0.78 (95% CI, 0.55-0.89) after adjusting for these multiplicity errors. Overall interobserver agreement in extracting data was good, and any discrepancies were resolved after discussion. When consensus could not be reached, another of us (T.J.S.) arbitrated. Finally, a post hoc search was done to assess if the authors of articles stated that their trial was exploratory or hypothesis generating in any section of the article.
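As an illustration of the agreement statistic, the following is a minimal sketch of Cohen's κ for 2 coders. The counts are hypothetical and are not the study's data (the study used an online calculator), and the function name is ours.

```python
from typing import Sequence


def cohen_kappa(table: Sequence[Sequence[int]]) -> float:
    """Cohen's kappa for two raters from a square agreement table.

    table[i][j] counts the items that coder A placed in category i
    and coder B placed in category j.
    """
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("agreement table is empty")
    n_cat = len(table)
    # Observed agreement: share of items on the diagonal.
    observed = sum(table[i][i] for i in range(n_cat)) / total
    # Chance agreement: product of the two coders' marginal proportions.
    row_marg = [sum(table[i]) / total for i in range(n_cat)]
    col_marg = [sum(table[i][j] for i in range(n_cat)) / total for j in range(n_cat)]
    expected = sum(r * c for r, c in zip(row_marg, col_marg))
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts for 50 articles (NOT the study's data):
# rows = coder A (multiplicity present, absent); columns = coder B.
example = [[25, 3], [3, 19]]
kappa = cohen_kappa(example)
print(f"kappa = {kappa:.2f}")
```

With these hypothetical counts, κ falls in the 0.61 to 0.80 band that Landis and Koch labeled substantial agreement.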
The following information was extracted from the RCTs: (1) the number of randomized participants; (2) region of the world where the trial was conducted (North America, Western Europe, multiregional, or rest of the world [multiregional was defined as any trial that had multiple sites across the world, and rest of the world was defined as any trial having sites in a region that was not located in either North America or Western Europe]); (3) intervention type (drugs, procedures [eg, a different approach or method of implementing treatment], medical devices, surgery, testing or imaging, or other [eg, diet]); and (4) funding source. Trial size was extracted (as a proxy for trial phase considering its inconsistent definition) and categorized as small (≤500 participants per group) or large (>500 participants per group). Trial type (prespecification of a primary end point) was categorized and extracted as either a mortality trial (defined as any trial where the primary outcome was mortality during treatment) or a nonmortality trial.
Data were also extracted for whether the article had risk of multiplicity (ie, contained multiple analyses, a term that encompasses any of the following: multiple treatment groups, multiple outcome variables, and multiple analyses of the same outcome variable) and whether the authors defined the methods used for multiplicity correction. An RCT was considered to have multiple treatment groups if it had more than 2 arms, multiple outcome variables were defined as having more than 1 primary outcome, and multiple analyses were defined as analysis of the same outcome variable in multiple ways. All 3 of these scenarios were weighted equally. We considered multiplicity adjustment sufficient when an article outlined that it attempted to adjust for multiple comparisons.
Descriptive statistics were used to assess the proportion of RCTs with (1) multiple primary analyses and (2) a multiplicity adjustment for the analysis of the primary end point. Also recorded were the class of multiplicity in the primary analysis (multiple treatment groups, multiple outcome variables, or multiple analyses of the same outcome variable) and frequencies of each of the methods used to adjust for multiplicity in the primary analysis. Multiplicity was examined only for the analysis of the primary end point; this examination was deemed unnecessary for secondary analyses, which are generally exploratory or hypothesis generating. Outcomes of interest were percentages of primary analyses that performed multiplicity adjustment of primary end points. Two-sided χ2 tests were used to examine the association of (1) intervention type, (2) funding source, (3) trial size, and (4) trial type with whether risk for a familywise error because of multiple comparisons was present and with whether the RCT adjusted for multiple comparisons. The method described by Holm8 was used to adjust for multiple comparisons across intervention type, funding source, trial size, and trial type. According to this method, the smallest P value from all planned comparisons is compared with a significance level of .05 divided by K, where K represents the number of comparisons to be made. If the null hypothesis is rejected, the next smallest P value is compared with a significance level of .05 divided by K − 1, and so on until the null hypothesis can no longer be rejected. In this scenario, a total of 8 comparisons were made; therefore, the significance level was set to α = .006 (ie, .05 ÷ K, where K = 8) in the initial step and to α = .05 in the last step (α being the Holm-corrected significance level). A statistical software package (SPSS, version 23; IBM) was used for all analyses.
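The step-down logic of the Holm method can be sketched as follows. This is a minimal illustration, not the SPSS routine used in the study; the function name, the comparison labels, and the 8 P values are hypothetical.

```python
from typing import Dict


def holm_step_down(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Holm step-down procedure.

    Returns, for each named comparison, whether its null hypothesis is
    rejected while holding the familywise error rate at alpha.
    """
    k = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    rejected = {name: False for name in p_values}
    for step, (name, p) in enumerate(ordered):
        # Thresholds run alpha/K, alpha/(K - 1), ..., alpha/1.
        threshold = alpha / (k - step)
        if p <= threshold:
            rejected[name] = True
        else:
            # Stop at the first non-rejection; all larger P values are retained.
            break
    return rejected


# Hypothetical P values for 8 comparisons (NOT the study's results).
hypothetical = {
    "c1": 0.0004, "c2": 0.001, "c3": 0.012, "c4": 0.03,
    "c5": 0.2, "c6": 0.35, "c7": 0.6, "c8": 0.8,
}
result = holm_step_down(hypothetical)
print(result)
```

In this hypothetical set, the first threshold is .05 ÷ 8 = .00625 and the second is .05 ÷ 7. Comparisons c3 and c4 would pass an unadjusted .05 threshold but are not rejected after the Holm correction, because c3 fails its threshold of .05 ÷ 6 and the procedure stops there.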
The initial search identified 2166 trials, which were transferred to a reference management software program (EndNote, Clarivate Analytics). The titles and abstracts of the identified studies were then screened to exclude irrelevant studies. Full-text studies were subsequently obtained and evaluated for the remaining 1273 reports. After assessing for relevance, 511 articles were included in the final analysis.
Of 511 cardiovascular RCTs included in this analysis, 123 (24.1%) were published in Journal of the American College of Cardiology, 112 (21.9%) in Circulation, 107 (20.9%) in European Heart Journal, 71 (13.9%) in The New England Journal of Medicine, 55 (10.8%) in The Lancet, and 43 (8.4%) in JAMA (Table 1). Approximately half (248 [48.5%]) of the trials were industry funded, and approximately half (243 [47.6%]) were large trials. Approximately half (251 [49.1%]) of the trials used a drug intervention. A total of 229 trials (44.8%) made use of composite outcomes as their primary outcome variable.
Of 511 cardiovascular RCTs included in this analysis, 300 (58.7%) had some form of multiplicity (282 of 511 [55.2%] did not mention whether they did or did not adjust for multiplicity). Of these 300 trials, 81 (27.0%) had multiple treatment groups, 45 (15.0%) identified multiple outcome variables as primary, 170 (56.7%) had multiple analyses of the same outcome variable, 3 (1.0%) had multiple treatment groups and multiple outcome variables, and 1 (0.3%) had multiple treatment groups and multiple analyses (Table 2).
Among the 300 RCTs with multiplicity, only 85 (28.3%) adjusted for multiplicity for all primary analyses (Table 2). Of 511 trials, 289 (56.6%) did not mention whether they did or did not attempt to adjust for multiple comparisons. Of the 85 trials that adjusted for multiplicity, 41 (48.2%) had multiple analyses of the same outcome variable, 22 (25.9%) had multiple treatment groups, and 19 (22.4%) had multiple outcome variables. The individual multiplicity correction tests are listed in Table 3.
Of 300 trials with multiplicity error risk, 19 (6.3%) were exploratory or hypothesis generating. Twelve of these trials mentioned this exploratory nature in the Discussion section of the article, 5 mentioned it in the Methods section, and 2 mentioned it in more than 1 section of the article. Of the 85 trials that adjusted for multiplicity, 68 (80.0%) mentioned that they adjusted for multiplicity in the main text of the article, and 17 (20.0%) only mentioned it in the supplement or trial protocol.
Intervention type and funding source had no statistically significant association with the reporting of multiplicity risk adjustment (Table 4). Trials that assessed mortality vs nonmortality outcomes were more likely to contain a multiplicity risk in their primary analysis (66.3% [177 of 267] vs 50.4% [123 of 244]; P < .001). Although larger trials had no association with specifying an analysis of the primary end point or containing a multiplicity error risk within their analysis, they were less likely than smaller trials to make any adjustments to correct for multiplicity issues (21.4% [33 of 154] vs 35.6% [52 of 146]; P = .001). These 2 associations remained statistically significant after application of the Holm test.
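The mortality vs nonmortality comparison can be checked with a minimal sketch of a Pearson χ2 test on a 2 × 2 table, using only the counts reported in the text. It assumes no continuity correction; whether the authors' SPSS analysis applied one is not stated, so this is not a reproduction of their computation. The function name is ours.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Pearson chi-square statistic (no continuity correction) and
    two-sided P value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must contain at least one count")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Counts from the text: (multiplicity present, multiplicity absent).
mortality = (177, 267 - 177)
nonmortality = (123, 244 - 123)
stat, p = chi_square_2x2([mortality, nonmortality])
print(f"chi-square = {stat:.2f}, P = {p:.4f}")
```

Under these assumptions, the statistic is about 13.3 and the P value is below .001, in line with the reported P < .001.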
Our report demonstrates that 58.7% of 511 cardiovascular RCTs included in this analysis contained multiple analyses within their methods and that 55.2% of the total RCTs did not report whether they adjusted for multiple comparisons. Trials that assessed mortality were more likely than nonmortality trials to have some form of multiplicity, which is not unexpected because mortality is usually not the sole end point. However, because of the exigent nature of the mortality component and because some researchers consider mortality a safety end point as well, authors might be inclined to claim an association even if the overall end point fails to show effectiveness. These results have important implications for the performance and interpretation of cardiovascular RCTs.
Articles mentioning that they did not adjust for multiplicity often provided some justification, such as stating that their study was exploratory or hypothesis generating. Some justifications were unique to the trial; for example, 1 article mentioned that a chance finding could not be ruled out because of multiple testing and the sample size of the subgroups.19 It is possible that results might change before and after multiplicity adjustment; for example, in a trial where P = .046 for the primary outcome, the adjusted P value may have been different after multiplicity correction.20
Among 85 of 511 included articles that adjusted for multiplicity for all primary analyses (Table 3), half of the trials used a composite outcome as their primary outcome. Composite outcomes allow increased statistical precision and efficiency with fewer participants to detect a statistically significant difference among comparators,21 especially in the case of total mortality, which is a rare event requiring more power and an extended follow-up to show a difference between interventions.22 Although the use of composite end points is acceptable and in some instances beneficial, it may also increase the risk of introducing a multiplicity error if the observed treatment effect was associated with a softer clinical end point.21,23 For example, in a trial where all-cause mortality, myocardial infarction, and recurrent angina are components of a composite end point, recurrent angina might be considered the softest of the 3 components. The present study did not examine the details of each composite end point of every trial or consider composite end points to be a source of multiplicity; however, it suggests the need for future investigators and researchers to apply methods to avoid the possibility of a multiplicity error, as described by Sankoh et al.21
Uncertainty in interpreting research results is common and may be attributable to a lack of statistical power or the use of questionable research practices, or it may reflect decisions a researcher makes to conduct a trial.24 These uncertainties might explain gaps in the reporting of multiplicity and adjustments made. Conversely, one could also argue that such gaps are less a reflection of multiplicity issues but rather reflect the unavailability of the trial protocol and statistical analysis plan. We suggest that all RCTs in medical journals should describe the trial protocol–specific analytic plan, including the methods used to adjust for multiple comparisons or acknowledgment of the lack of correction for multiplicity. We believe that this inclusion is especially relevant because most clinicians do not have statistical expertise.
Because the criteria used to classify trial phase (ie, phase 1, 2, or 3) were inconsistent among the RCTs in this study, trial size was used as a proxy for trial phase. For drug interventions, smaller trials (≤500 participants per group) may more likely reflect early to middle stages of development, and larger trials (>500 participants per group) may more likely reflect confirmatory stages.25,26 Among 161 RCTs, Gewandter et al27 found no association of trial size or funding source with multiplicity adjustment, most likely because of the limited power of the study to perform such an analysis. Our analysis included a larger sample of both industry-sponsored RCTs (n = 248) and large RCTs with multiplicity issues (n = 154) (Table 4). We found that smaller trials were more likely to be adjusted for multiplicity. Funding source had no association with adjusting for multiplicity. This observation suggests that RCTs of drugs in early to middle stages of development may be more likely to adjust for multiplicity.
The appropriateness of testing procedures is guided by information on statistical features of a study design or analytic strategy and differs depending on whether there is a single source of multiplicity or several sources and whether there are multiple treatment groups, multiple outcome variables, or multiple analyses of the same outcome variable. Dmitrienko and D’Agostino23 provide some guidance on how to choose the most appropriate test for multiplicity corrections; they state that nonparametric tests, such as the Holm test, can be applied to most multiplicity problems involving a single source of multiplicity. In cases where the association between statistical tests is known, such as in clinical trials with several dose-control comparisons and patient populations, more specific parametric tests, such as the Dunnett test, may be applied. In an effort to better explain types of multiple analyses and multiple outcome variables, detailed examples are listed in eTable 1 in the Supplement.
This study has several limitations. First, we assessed the reporting quality of the methods used for multiplicity adjustments but not necessarily the quality of statistical practices used. Because the methods may have been prespecified but not stated in the articles,28,29 this report is subject to reporting bias. Second, studies were recorded as having adjusted for multiplicity in the primary analysis only if the authors adjusted for all instances of multiplicity. However, this approach does not consider the trials that tried to adjust for some but not all sources of multiple comparisons. Also, whether a study had multiple treatment groups, multiple outcome variables, or multiple analyses of the same outcome variable, all had the same weight in terms of adjusting for multiplicity and thus were considered equally. A third limitation is that we only evaluated the primary outcomes. It is important to note that the secondary end point in a sequence may often influence a conclusion. For instance, if a trial finds a statistically significant difference in major adverse cardiovascular events (myocardial infarction, stroke, and admission for heart failure) and the next end point in the sequence is the myocardial infarction rate, which is not statistically significant, it would be incorrect to conclude a nominally statistically significant stroke association. In addition, we were unable to differentiate our analysis by trial phase. Although the objective of this study was to evaluate the overall prevalence of multiplicity among cardiovascular RCTs, it must be remembered that phase 3 trials hold the most importance from a public health perspective, and multiplicity is of lesser concern in phase 2 trials.
This cross-sectional study found frequent inconsistencies associated with multiplicity in primary analysis reporting among cardiovascular RCTs published in medical journals with a high impact factor. These findings adversely reflect on the robustness of data published in journals that carry global reach and generate evidence that can transform clinical guidelines and practice. Our findings suggest that investigators should be encouraged to adjust for multiplicity when warranted. Practical guidelines for multiplicity adjustment in clinical trials (eg, recommendations by Proschan and Waclawiw30) can be consulted. We think that this information should ideally be prespecified in the Methods section of clinical trials before unblinding of the study data (eTable 2 in the Supplement). We believe that it should be the collective responsibility of journal editors, peer reviewers, and readers to pay close attention to the Methods and Statistical Analysis sections of articles reporting clinical trial results to ensure that multiplicity issues have been addressed.
Accepted for Publication: February 17, 2020.
Published: April 17, 2020. doi:10.1001/jamanetworkopen.2020.3082
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2020 Khan MS et al. JAMA Network Open.
Corresponding Author: Ankur Kalra, MD, Vascular and Thoracic Institute, Tomsich Family Department of Cardiovascular Medicine, Cleveland Clinic, Cleveland Clinic Akron General, 224 W Exchange St, Ste 225, Akron, OH 44302 (email@example.com).
Author Contributions: Drs Maaz Shah Khan and Ansari had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Muhammad Shahzeb Khan, Siddiqi, Riaz, Stone, Kalra.
Acquisition, analysis, or interpretation of data: Maaz Shah Khan, Ansari, Siddiqi, S.U. Khan, Asad, Mandrola, Wason, Warraich, Bhatt, Kapadia, Kalra.
Drafting of the manuscript: Muhammad Shahzeb Khan, Maaz Shah Khan, Ansari, Siddiqi, Asad, Mandrola, Kalra.
Critical revision of the manuscript for important intellectual content: Muhammad Shahzeb Khan, Ansari, Siddiqi, S.U. Khan, Riaz, Asad, Wason, Warraich, Stone, Bhatt, Kapadia, Kalra.
Statistical analysis: Muhammad Shahzeb Khan, Maaz Shah Khan, Siddiqi, Wason.
Obtained funding: Kalra.
Administrative, technical, or material support: Maaz Shah Khan, Ansari, Asad, Mandrola, Kalra.
Supervision: Muhammad Shahzeb Khan, Asad, Warraich, Stone, Kapadia, Kalra.
Conflict of Interest Disclosures: Dr Stone reported receiving personal fees or other support from Abiomed, Ablative Solutions, Ancora, Applied Therapeutics, ARIA, BioStar, Cagent Vascular, Cardiac Success, Cook, Gore, HeartFlow, MAIA Pharmaceuticals, Matrizyme, MedFocus, Miracor, Neovasc, Orchestra BioMed, Qool Therapeutics, REVA, Robocath, Shockwave, SpectraWAVE, Terumo, TherOx, Valfix, Vascular Dynamics, Vectorious, and V-Wave. Dr Bhatt reported receiving grants, travel support, or personal fees from or having other relationships (board or committee membership, chair, editor, or trustee) with Abbott, Afimmune, Amarin, American College of Cardiology, American Heart Association, Amgen, AstraZeneca, Baim Institute for Clinical Research, Bayer, Belvoir Publications, Biotronik, Boehringer Ingelheim, Boston Scientific, Boston VA Research Institute, Bristol-Myers Squibb, Cardax, Cereno Scientific, Chiesi, Cleveland Clinic, Clinical Cardiology, CSI, CSL Behring, Duke Clinical Research Institute, Eisai, Elsevier, Ethicon, Ferring Pharmaceuticals, FlowCo, Forest Laboratories/AstraZeneca, Fractyl, Harvard Clinical Research Institute (now Baim Institute for Clinical Research), HMP Global, Idorsia, Ironwood, Ischemix, Journal of the American College of Cardiology, Lexicon, Lilly, Mayo Clinic, Medscape Cardiology, Medtelligence/ReachMD, Medtronic, Merck, Mount Sinai School of Medicine, Novo Nordisk, Pfizer, PhaseBio, PLx Pharma, Population Health Research Institute, Regado Biosciences, Regeneron, Roche, Sanofi-Aventis, Slack Publications, Society of Cardiovascular Patient Care, St Jude Medical (now Abbott), Svelte, Synaptic, Takeda, The Medicines Company, TobeSoft, US Department of Veterans Affairs, and WebMD. No other disclosures were reported.