Application of the Reverse Fragility Index to Statistically Nonsignificant Randomized Clinical Trial Results | Medical Journals and Publishing | JAMA Network Open | JAMA Network
Figure 1.  Flowchart of Search Strategy

RCT indicates randomized clinical trial.

Figure 2.  Scatterplot of the Correlation Between Reverse Fragility Index and P Value, Sample Size, and Total Number of Events

The size of the blue circles is proportional to the sample sizes. The size of the gray shading inside the blue circles is proportional to the number of events.

Table 1.  Characteristics of the Included Trials
Table 2.  Comparison of RFI With Number of Participants Lost to Follow-up for Each RFI Range
Table 3.  Number of Trials in Each Category With Median RFI and RFQ
1. Alderson P. Absence of evidence is not evidence of absence. BMJ. 2004;328(7438):476-477. doi:10.1136/bmj.328.7438.476
2. Pocock SJ, McMurray JJV, Collier TJ. Making sense of statistics in clinical trial reports: part 1 of a 4-part series on statistics for clinical trials. J Am Coll Cardiol. 2015;66(22):2536-2549. doi:10.1016/j.jacc.2015.10.014
3. Wasserstein RL, Lazar NA. The ASA’s statement on P values: context, process, and purpose. Am Stat. 2016;70(2):129-133. doi:10.1080/00031305.2016.1154108
4. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial—survey of 71 negative trials. N Engl J Med. 1978;299(13):690-694. doi:10.1056/NEJM197809282991304
5. Akl EA, Briel M, You JJ, et al. Potential impact on estimated treatment effects of information lost to follow-up in randomised controlled trials (LOST-IT): systematic review. BMJ. 2012;344:e2809. doi:10.1136/bmj.e2809
6. Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. J Clin Epidemiol. 2014;67(6):622-628. doi:10.1016/j.jclinepi.2013.10.019
7. Docherty KF, Campbell RT, Jhund PS, Petrie MC, McMurray JJV. How robust are clinical trials in heart failure? Eur Heart J. 2017;38(5):338-345. doi:10.1093/eurheartj/ehw427
8. Tignanelli CJ, Napolitano LM. The fragility index in randomized clinical trials as a means of optimizing patient care. JAMA Surg. 2019;154(1):74-79. doi:10.1001/jamasurg.2018.4318
9. Del Paggio JC, Tannock IF. The fragility of phase 3 trials supporting FDA-approved anticancer medicines: a retrospective analysis. Lancet Oncol. 2019;20(8):1065-1069. doi:10.1016/S1470-2045(19)30338-9
10. Evaniew N, Files C, Smith C, et al. The fragility of statistically significant findings from randomized trials in spine surgery: a systematic survey. Spine J. 2015;15(10):2188-2197. doi:10.1016/j.spinee.2015.06.004
11. Mazzinari G, Ball L, Serpa Neto A, et al. The fragility of statistically significant findings in randomised controlled anaesthesiology trials: systematic review of the medical literature. Br J Anaesth. 2018;120(5):935-941. doi:10.1016/j.bja.2018.01.012
12. Matics TJ, Khan N, Jani P, Kane JM. The fragility index in a cohort of pediatric randomized controlled trials. J Clin Med. 2017;6(8):79. doi:10.3390/jcm6080079
13. Brown J, Lane A, Cooper C, Vassar M. The results of randomized controlled trials in emergency medicine are frequently fragile. Ann Emerg Med. 2019;73(6):565-576. doi:10.1016/j.annemergmed.2018.10.037
14. Wayant C, Meyer C, Gupton R, Som M, Baker D, Vassar M. The fragility index in a cohort of HIV/AIDS randomized controlled trials. J Gen Intern Med. 2019;34(7):1236-1243. doi:10.1007/s11606-019-04928-5
15. Chase Kruse B, Matt Vassar B. Unbreakable? an analysis of the fragility of randomized trials that support diabetes treatment guidelines. Diabetes Res Clin Pract. 2017;134:91-105. doi:10.1016/j.diabres.2017.10.007
16. Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018;319(14):1429-1430. doi:10.1001/jama.2018.1536
17. Wayant C, Scott J, Vassar M. Evaluation of lowering the P value threshold for statistical significance from .05 to .005 in previously published randomized clinical trials in major medical journals. JAMA. 2018;320(17):1813-1815. doi:10.1001/jama.2018.12288
18. Carter RE, McKie PM, Storlie CB. The fragility index: a P value in sheep’s clothing? Eur Heart J. 2017;38(5):346-348. doi:10.1093/eurheartj/ehw495
19. Walsh M, Devereaux PJ, Sackett DL. Clinician trialist rounds: 28: when RCT participants are lost to follow-up, part 1: why even a few can matter. Clin Trials. 2015;12(5):537-539. doi:10.1177/1740774515597702
20. Ahmed W, Fowler RA, McCredie VA. Does sample size matter when interpreting the fragility index? Crit Care Med. 2016;44(11):e1142-e1143. doi:10.1097/CCM.0000000000001976
21. Edwards E, Wayant C, Besas J, Chronister J, Vassar M. How fragile are clinical trial outcomes that support the CHEST clinical practice guidelines for VTE? Chest. 2018;154(3):512-520. doi:10.1016/j.chest.2018.01.031
22. Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: what do P values and confidence intervals really represent? Anesth Analg. 2018;126(3):1068-1072. doi:10.1213/ANE.0000000000002798
23. Velazquez EJ, Lee KL, Deja MA, et al; STICH Investigators. Coronary-artery bypass surgery in patients with left ventricular dysfunction. N Engl J Med. 2011;364(17):1607-1616. doi:10.1056/NEJMoa1100356
24. Velazquez EJ, Lee KL, Jones RH, et al; STICHES Investigators. Coronary-artery bypass surgery in patients with ischemic cardiomyopathy. N Engl J Med. 2016;374(16):1511-1520. doi:10.1056/NEJMoa1602001
25. Combes A, Hajage D, Capellier G, et al; EOLIA Trial Group, REVA, and ECMONet. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome. N Engl J Med. 2018;378(21):1965-1975. doi:10.1056/NEJMoa1800385
26. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p<0.05”. Am Stat. 2019;73(1):1-19. doi:10.1080/00031305.2019.1583913
27. Bauchner H, Golub RM, Fontanarosa PB. Reporting and interpretation of randomized clinical trials. JAMA. 2019;322(8):732-735. doi:10.1001/jama.2019.12056
28. Harrington D, D’Agostino RB Sr, Gatsonis C, et al. New guidelines for statistical reporting in the journal. N Engl J Med. 2019;381(3):285-286. doi:10.1056/NEJMe1906559
29. Schriger DL. Problems with current methods of data analysis and reporting, and suggestions for moving beyond incorrect ritual. Eur J Emerg Med. 2002;9(2):203-207. doi:10.1097/00063110-200206000-00021
30. Kruschke JK, Liddell TM. The bayesian new statistics: hypothesis testing, estimation, meta-analysis, and power analysis from a bayesian perspective. Psychon Bull Rev. 2018;25(1):178-206. doi:10.3758/s13423-016-1221-4
31. Kruschke JK. Bayesian estimation supersedes the t test. J Exp Psychol Gen. 2013;142(2):573-603. doi:10.1037/a0029146
32. Ryan EG, Harrison EM, Pearse RM, Gates S. Perioperative haemodynamic therapy for major gastrointestinal surgery: the effect of a bayesian approach to interpreting the findings of a randomised controlled trial. BMJ Open. 2019;9(3):e024256. doi:10.1136/bmjopen-2018-024256
33. Johnson KW, Rappaport E, Khader S, Glicksberg BS, Dudley JT. Fragility index: an R package for statistical fragility estimates in biomedicine. bioRxiv. Preprint posted online February 27, 2019. doi:10.1101/562264
    Original Investigation
    Statistics and Research Methods
    August 5, 2020

    Application of the Reverse Fragility Index to Statistically Nonsignificant Randomized Clinical Trial Results

    Author Affiliations
    • 1Department of Medicine, Cook County Health Sciences, Chicago, Illinois
    • 2Division of Cardiology, Ronald Reagan–UCLA (University of California, Los Angeles) Medical Center, Los Angeles
    • 3Department of Medical Statistics, University Medical Center Goettingen, Goettingen, Germany
    • 4Department of Medicine, Creighton University, Omaha, Nebraska
    • 5Department of Medicine, West Virginia University, Morgantown
    • 6Department of Cardiology, and Berlin Institute of Health Center for Regenerative Therapies, German Centre for Cardiovascular Research partner site Berlin; Charité Universitätsmedizin Berlin, Berlin, Germany
    • 7Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee
    • 8Department of Medicine, University of Mississippi Medical Center, Jackson
    JAMA Netw Open. 2020;3(8):e2012469. doi:10.1001/jamanetworkopen.2020.12469
    Key Points

    Question  In clinical trials with statistically nonsignificant primary end point results, what is the minimum number of events that must be changed to move the result from nonsignificant to statistically significant (ie, the reverse fragility index)?

    Findings  In this cross-sectional study of 167 randomized clinical trials with statistically nonsignificant results, the median reverse fragility index at a threshold of P = .05 was 8; that is, a median of 8 events had to change for a nonsignificant primary end point to become statistically significant.

    Meaning  Results of this cross-sectional study suggest that the reverse fragility index, along with effect sizes and associated 95% CIs, may provide a useful context for interpreting null clinical trial results.

    Abstract

    Importance  Interpreting randomized clinical trials (RCTs) and their clinical relevance is challenging when P values are either marginally above or below the P = .05 threshold.

    Objective  To use the concept of reverse fragility index (RFI) to provide a measure of confidence in the neutrality of RCT results when assessed from the clinical perspective.

    Design, Setting, and Participants  In this cross-sectional study, a MEDLINE search was conducted for RCTs published from January 1, 2013, to December 31, 2018, in JAMA, the New England Journal of Medicine (NEJM), and The Lancet. Eligible studies were phase 3 and 4 trials with 1:1 randomization and statistically nonsignificant binary primary end points. Data analysis was performed from August 1, 2019, to August 31, 2019.

    Exposures  Single vs multicenter enrollment, total number of events, private vs government funding, placebo vs active control, and time to event vs frequency data.

    Main Outcomes and Measures  The primary outcome was the median RFI with interquartile range (IQR) at the P = .05 threshold. Secondary outcomes were the number of RCTs in which the number of participants lost to follow-up was greater than the RFI; the median RFI with IQR at different P value thresholds; the median reverse fragility quotient with IQR; and the correlation between sample sizes, number of events, and P values of the RCT and RFI.

    Results  Of the 167 RCTs included, 76 (46%) were published in the NEJM, 50 (30%) in JAMA, and 41 (24%) in The Lancet. The median (IQR) sample size was 970 (470-3427) participants, and the median (IQR) number of events was 251 (105-570). The median (IQR) RFI at the P = .05 threshold was 8 (5-13). Fifty-seven RCTs (34%) had an RFI of 5 or lower, and in 68 RCTs (41%) the number of participants lost to follow-up was greater than the RFI. Trials with P values ranging from P = .06 to P = .10 had a median (IQR) RFI of 3 (2-4). Median (IQR) RFIs did not differ significantly for single-center vs multicenter enrollment (5 [4-13] vs 8 [5-13]; P = .41), private vs government-funded studies (9 [5-13] vs 8 [5-13]; P = .34), or time-to-event primary end points vs frequency data (9 [5-14] vs 7 [4-13]; P = .43). The median (IQR) RFI at the P = .01 threshold was 12 (7-19) and at the P = .005 threshold was 14 (9-21).

    Conclusions and Relevance  This cross-sectional study found that a relatively small number of events (median of 8) had to change to move the primary end point of an RCT from nonsignificant to statistically significant. These findings emphasize the nuance required when interpreting trial results that did not meet prespecified significance thresholds.

    Introduction

    Interpreting randomized clinical trial (RCT) results and their clinical relevance when P values are marginally above or below the threshold of P = .05 is challenging.1 Although the clinical relevance may not be different, a P value marginally below the P = .05 threshold is usually accepted as a favorable finding in a trial, and a P value above the P = .05 threshold is considered an unfavorable result.2 Efficacy of an intervention should be evaluated comprehensively on the basis of the effect size measures, such as relative risk reduction or number needed to treat accompanied by P values and 95% CIs, but clinical research continues to emphasize the prespecified threshold of P = .05 when interpreting results. Such reliance on P values invites the risk of a type II error (ie, nonrejection of a false null hypothesis, which is also known as a false-negative or β error), especially in the presence of fewer events, small sample sizes, and/or limited follow-up times.3-5 Thus, it is critical to also evaluate the robustness of null trial results in cases in which the clinical consequences of a type II error are more important than those of a type I error (ie, rejection of a true null hypothesis, which is also known as a false-positive or α error), such as in disease states with high mortality and limited therapeutic options and with an acceptable intervention safety profile.

    Robustness of statistically significant trials is often evaluated using the fragility index (FI), which is exclusively applied to trials that reach traditional statistical significance.6-15 Although a few studies have applied the FI to null clinical trial results of specific diseases,7,15 none of these previous studies have systematically assessed the robustness of a large number of statistically nonsignificant phase 3 to 4 trials with an emphasis on interpretability of null clinical trial results.

    In this cross-sectional study, we used the concept of a reverse fragility index (RFI) to calculate the minimum number of events needed to change trial results from statistically nonsignificant to statistically significant. Our intent was to provide a measure of confidence in the neutrality of results when assessed from the clinical perspective.

    Methods
    Data Sources and Study Selection

    In July 2019, 2 of us (N.L. and M.S.K.) conducted a MEDLINE search for RCTs published in peer-reviewed general medical journals between January 1, 2013, and December 31, 2018. We used the following search specification: (“Lancet (London, England)”[Journal] OR “The New England Journal of Medicine”[Journal] OR “Journal of the American Medical Association”[Journal]) AND (Randomized Controlled Trial [ptyp] AND (“2013/01/01”[PDAT]: “2018/12/31”[PDAT])). JAMA, the New England Journal of Medicine, and The Lancet were chosen because of their history of publishing landmark RCTs. No search restrictions were applied. Because this cross-sectional study used only publicly available data and did not involve patients, no institutional review board approval or informed consent was sought. We followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.

    Two of us (N.L. and M.S.K.) screened all of the RCT titles and abstracts based on the predefined eligibility criteria, which included (1) phase 3 or 4 RCTs, (2) 2-arm studies with 1:1 randomization, and (3) statistically nonsignificant binary primary end points. Letters, editorials, systematic reviews or meta-analyses, opinions, observational studies, economic or cost-effectiveness analyses of RCTs, nonrandomized cohort studies, quasi-randomized trials, and post hoc or secondary analyses of previously reported RCTs were excluded. Only RCTs were considered because the concept of fragility is not applicable to non-RCTs owing to the presence of confounders and selection bias.8 Moreover, only RCTs that reported dichotomous outcomes were considered because an RFI, defined as the number of converted cases needed to make a nonsignificant result significant, cannot be applied to continuous variables.8

    Data Extraction and Outcomes

    We used a prespecified data collection form to extract data from all RCTs. In cases of discrepancies, another independent reviewer (S.U.K) reviewed the data and adjudicated. Data abstracted included the study outcome, event rates, sample sizes of comparative groups, location of the trial, type of blinding, single-center vs multicenter enrollment, type of funding (government or private), number of participants lost to follow-up, follow-up duration, and acknowledgment of the potential for underpowering in the discussion or conclusion section. For time-to-event outcomes, the total number of events in each group over the entire follow-up period was included. For the number of participants lost to follow-up, only those who were truly lost to follow-up were considered, and other diminution factors of the denominator, such as deaths, were not accounted for because they were considered outcome events. In this study, the primary outcome was the median (interquartile range [IQR]) RFI at the P = .05 threshold. The secondary outcomes were the number of RCTs in which the number of participants lost to follow-up was greater than the RFI; the median RFI with IQR at different P value thresholds; the median reverse fragility quotient (RFQ) with IQR; and the correlation between sample sizes, number of events, and P values of the RCT and RFI.

    Statistical Analysis

    The RFI was calculated by subtracting events from the group with the lower number of events while simultaneously adding nonevents to the same group, keeping the number of participants constant, until the Fisher exact test 2-sided P value became less than .05. Because of the recent proposals to lower the P value threshold,16,17 we also calculated the RFI at the P = .01 and P = .005 thresholds, using the same procedure but continuing until the Fisher exact test 2-sided P value became less than .01 or .005, respectively. This calculation allowed us to see how the magnitude of the RFI changed when the P value threshold was made more rigorous. We calculated the RFI only for primary end points because RCTs are powered to detect the treatment effect for the primary end point; thus, the relevance of fragility might be limited for secondary outcomes.8
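
    The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code (their analyses were run in R and Excel): the function names are hypothetical, the Fisher exact test is implemented directly from the hypergeometric distribution using only the standard library, ties in event counts are resolved by modifying the first group, and the example counts are invented for illustration.

```python
from math import comb


def fisher_two_sided(events_a, n_a, events_b, n_b):
    """Two-sided Fisher exact P value for a 2x2 table of events vs nonevents."""
    total_events = events_a + events_b
    denom = comb(n_a + n_b, total_events)

    def prob(k):
        # Probability of k events in group A given fixed margins.
        return comb(n_a, k) * comb(n_b, total_events - k) / denom

    p_obs = prob(events_a)
    k_min = max(0, total_events - n_b)
    k_max = min(n_a, total_events)
    # Sum every table that is no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_obs * (1 + 1e-7)
    )


def reverse_fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of events converted to nonevents, in the group with fewer
    events, before the Fisher exact P value drops below alpha.

    Returns None if significance is never reached (assumed convention).
    """
    events = [events_a, events_b]
    sizes = [n_a, n_b]
    low = 0 if events_a <= events_b else 1  # tie: modify the first group
    rfi = 0
    while fisher_two_sided(events[0], sizes[0], events[1], sizes[1]) >= alpha:
        if events[low] == 0:
            return None  # no events left to convert
        events[low] -= 1  # one event becomes a nonevent; group size unchanged
        rfi += 1
    return rfi


# Hypothetical trial: 3 of 10 events vs 5 of 10 events.
print(reverse_fragility_index(3, 10, 5, 10))
```

    In this hypothetical trial, converting 3 events to nonevents in the smaller-event group brings the 2-sided P value below .05, so the RFI is 3; passing `alpha=0.01` or `alpha=0.005` applies the stricter thresholds.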

    A lower RFI indicates less statistical robustness and greater vulnerability to moving from statistical nonsignificance to significance on the basis of a few events. Currently, no cutoff is deemed acceptable for fragility. The number of participants lost to follow-up was compared with the RFI for each trial given that loss to follow-up was associated with both the number of study participants at risk and the number of recorded events. Trials in which the number of participants lost to follow-up is greater than the RFI are concerning because factoring in the unknown outcomes of participants who are lost to follow-up can easily alter the significance of the results.18,19 Moreover, we calculated the RFQ, which is the RFI divided by the sample size, because the RFI is an absolute measure and does not account for the sample size, making it difficult to compare the reverse fragility of different RCTs or to set a standard RFI value.20 Knowing the RFQ enables the assessment of the proportion of events that must be changed to move the results from nonsignificant to significant. For example, trial X has an RFI of 2 and a sample size of 100, whereas trial Y has an RFI of 2 and a sample size of 200. Although both trials have the same RFI, we can use the RFQ to gauge which trial is relatively more fragile. Trial X has an RFQ of 0.02, which means that approximately 2 events per 100 participants are needed to change the significance of the results. Trial Y has an RFQ of 0.01, which means that the nonsignificance of trial Y is contingent on approximately 1 event per 100 participants, suggesting that trial Y is relatively more fragile. A smaller RFQ indicates a less robust study. In addition, we calculated the proportion of RCTs with an RFI that was 1% or less of the total sample size.
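
    The trial X and trial Y comparison can be reproduced with a short Python sketch. The function names are hypothetical and chosen only for illustration; the RFQ definition (RFI divided by sample size) and the comparison of loss to follow-up with the RFI are taken from the text above.

```python
def reverse_fragility_quotient(rfi, sample_size):
    """RFQ: the RFI divided by the total sample size."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return rfi / sample_size


def lost_to_follow_up_exceeds_rfi(lost_to_follow_up, rfi):
    """True when unknown outcomes alone could outnumber the RFI."""
    return lost_to_follow_up > rfi


# Trial X and trial Y from the example above: same RFI, different sample sizes.
rfq_x = reverse_fragility_quotient(2, 100)
rfq_y = reverse_fragility_quotient(2, 200)

print(f"Trial X RFQ: {rfq_x:.2f}")  # about 2 events per 100 participants
print(f"Trial Y RFQ: {rfq_y:.2f}")  # about 1 event per 100 participants
print("More fragile:", "Y" if rfq_y < rfq_x else "X")
```

    Because the smaller RFQ marks the less robust result, trial Y is flagged as the more fragile of the two even though both trials share an RFI of 2.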

    We reported the overall RFIs as medians with IQRs. The Spearman rank correlation coefficient was used to assess the correlation between sample size, number of events, and P value of the RCT and RFI. The Kruskal-Wallis statistic was applied to detect the association between RFIs and nominal variables in more than 2 groups, and the Mann-Whitney rank sum test was used for the association in 2 groups (sensitivity analysis). A 2-tailed P < .05 indicated statistical significance for all assessments. All analyses were performed with R, version 3.5.1 (R Foundation for Statistical Computing) and Excel, version 14.1.3 (Microsoft Corp). Data analysis was conducted from August 1, 2019, to August 31, 2019.
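
    As an illustration of the rank-based correlation named above, the following Python sketch computes a Spearman coefficient as the Pearson correlation of average ranks. It is a simplified stand-in for the R routines the authors used, the function names are hypothetical, and the RFI and event counts are invented values, not data from the included trials.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical RFIs and total event counts for 5 trials.
rfi_values = [3, 5, 8, 13, 21]
event_counts = [40, 95, 250, 600, 570]
print(round(spearman_rho(rfi_values, event_counts), 2))
```

    The sketch returns only the coefficient; the confidence intervals and P values reported in the Results would require additional inference steps that are not shown here.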

    Results

    Of the 2521 potentially relevant studies identified, 167 RCTs (7%) met the eligibility criteria (eTable in the Supplement); Figure 1 shows the literature search strategy. The characteristics of the included RCTs are presented in Table 1. All trials used P ≤ .05 as the threshold of significance, and the range of sample sizes was from 48 to 31 999 participants. Of the 167 RCTs, 76 (46%) were published in the New England Journal of Medicine, 50 (30%) in JAMA, and 41 (24%) in The Lancet. The median (IQR) follow-up time was 6 (2-20) months, whereas the median (IQR) number of participants lost to follow-up was 38 (19-79). The median (IQR) total sample size was 970 (470-3427) patients, with 472 (226-1717) participants in the control groups and 459 (228-1707) participants in the intervention groups. The median (IQR) total number of events was 251 (105-570), with 127 (54-286) events in the control groups and 128 (50-279) events in the intervention groups. Eighteen RCTs (11%) had P values between P = .06 and P = .10. Mortality was assessed as a primary end point in 90 RCTs (54%). Sixty-four RCTs (38%) had time-to-event primary end points, and 49 trials (29%) acknowledged the potential for being underpowered in their discussion or conclusion section.

    Reverse Fragility Index at P = .05 Threshold

    The median (IQR) RFI of the 167 trials was 8 (5-13), indicating that a median of 8 events was required to change the results of the primary end point from nonsignificant to significant. Fifty-seven RCTs (34%) had an RFI of 5 or lower. In 68 RCTs (41%), the number of participants lost to follow-up was greater than the RFI (Table 2). In such trials, the median (IQR) RFI was 8.5 (5-13), and the median (IQR) number of participants lost to follow-up was 37.5 (18-78). Median (IQR) RFIs did not differ significantly for single-center vs multicenter enrollment (5 [4-13] vs 8 [5-13]; P = .41), private vs government funding (9 [5-13] vs 8 [5-13]; P = .34), or time-to-event primary end points vs frequency data (9 [5-14] vs 7 [4-13]; P = .43) (Table 3).

    Trials with P values ranging from P = .06 to P = .10 had a median (IQR) RFI of 3 (2-4). The RFIs and P values were statistically significantly correlated (Spearman correlation coefficient [r] = 0.54; 95% CI, 0.42-0.64; P < .001). Similarly, the RFI and total number of events demonstrated a statistically significant correlation (r = 0.65; 95% CI, 0.55-0.73; P < .001). Figure 2 shows the scatterplots of the correlation between RFI and P value, sample size, and total number of events.

    Reverse Fragility Quotient at P = .05 Threshold

    The median (IQR) RFQ was 0.008 (0.0025-0.0155), indicating that the nonsignificance of the results was contingent on only 0.8 events per 100 participants. The median (IQR) RFQ of trials with an RFI lower than the number of participants lost to follow-up was 0.004 (0.0018-0.0118), whereas the median (IQR) RFQ of the remaining trials was 0.01 (0.004-0.0180). Of the 167 RCTs, 124 (74%) had an RFI that was 1% or less of the total sample size, and 107 RCTs (64%) had an RFI that was 2% or less of the total number of events.

    Reverse Fragility Index at Different P Value Thresholds

    The median (IQR) RFI at the P = .01 threshold was 12 (7-19) and at the P = .005 threshold was 14 (9-21). Only 11 RCTs at the P = .01 threshold and 7 RCTs at the P = .005 threshold had an RFI of 5 or lower. eFigures 1 and 2 in the Supplement show the number of RCTs within each range of RFI at different P value thresholds. The RFI at both thresholds had a statistically significant correlation with sample size (r = 0.83, P < .001 at the P = .01 threshold; r = 0.69, P < .001 at the P = .005 threshold) and total number of events (r = 0.83, P < .001 at the P = .01 threshold; r = 0.86, P < .001 at the P = .005 threshold) (eFigures 3 to 6 in the Supplement).

    Reverse Fragility Quotient at P = .01 and P = .005 Thresholds

    At the P = .01 threshold, the median (IQR) RFQ was 0.0113 (0.0042-0.0217), suggesting that the nonsignificance of the results was contingent on approximately 1 event per 100 participants. The median (IQR) RFQ at the P = .005 threshold was 0.6937 (0.5960-0.7472), indicating that nonsignificance of the results was contingent on approximately 70 events per 100 participants. Of the 167 RCTs, 98 RCTs (59%) had an RFI that was 1% or less of the total sample size at the P = .01 threshold compared with no trials for the P = .005 threshold.

    Discussion

    This study found that approximately one-third of null RCTs published in journals with a high impact factor had an RFI of 5 or lower and that, in 41% of the trials, the number of participants lost to follow-up was greater than the RFI. The RFI was particularly low in trials with P values that ranged from P = .06 to P = .10, and a strong correlation between P values and RFIs was noted. The median RFQ at the P = .05 threshold was 0.008, indicating that nonsignificance of the results was contingent on only 0.8 events per 100 participants.

    The FI has been reported for many specialties, subspecialties, and clinical care guidelines.6-15,21 The median (IQR) FI score reported was 26 (0-118) in heart failure trials,7 16 (4-29) in diabetes treatment guidelines,15 and 5 (1-9) in antithrombotic therapy21 for venous thromboembolism guidelines. Similarly, a review of almost 400 medical and surgical trials published in the New England Journal of Medicine, JAMA, and The Lancet showed that the median (IQR) FI score was 8 (0-109) and that, in 53% of these trials, the number of participants lost to follow-up was greater than the FI score.6 These results are similar to our RFI findings.

    Statistical nonsignificance interpreted in the form of P values can misinform and may be disconnected from the real effect of the intervention.22 Clinicians who lack in-depth knowledge of statistical and trial design parameters, such as event rates, power to detect estimate differences, number of participants lost to follow-up, and follow-up duration to allow a sufficient time window for outcome differences to emerge, may have a tendency to base their conclusions solely on P values. This tendency may be relevant when interpreting the results of interventional or surgical trials in which a small initial excess of events in the intervention group could be switched if the intervention is advantageous. For example, the STICH (Surgical Treatment for Ischemic Heart Failure) trial23 was statistically nonsignificant at 5 years of follow-up with an RFI of 5, but the results became statistically significant in the 10-year extended follow-up trial.24 Another example is the EOLIA (Extracorporeal Membrane Oxygenation to Rescue Acute Lung Injury in Severe Acute Respiratory Distress Syndrome) trial.25 In the EOLIA study, although the primary outcome of 60-day mortality was not statistically significantly different between veno-venous extracorporeal membrane oxygenation and conventional therapy (35% vs 46%; relative risk, 0.76; 95% CI, 0.55-1.04; P = .09), the RFI was lower than 5 and clear advantages were seen in most secondary outcomes. In the present context, the RFI provides readers, participants, and clinicians alike with an alternative way to understand null trial results and to convey uncertainty numerically to guide clinical interpretation and future research.

    We propose carefully assessing trials with P values ranging from P = .05 to P = .10 to avoid potentially overlooking advantageous interventions given that a type II error might be more damaging than a type I error. No specific cutoff value of FI or RFI exists that can be used to define acceptable robustness. A high RFI value does not necessarily indicate a robust result, and a low RFI may not mean the result is not robust. For instance, if the number of participants lost to follow-up is greater than the RFI, the RCT results should still be viewed with skepticism even if the RFI value is high. Therefore, considering different biases when interpreting trial results is important. Moreover, because RFIs are strongly correlated with P values and sample sizes, RFIs may not be used in assessing the robustness of trials without a broader context. Nevertheless, RFI is an intuitive index for interpreting null trial results in addition to effect sizes and 95% CIs.

    Recently, some researchers have proposed changing the statistical significance threshold from P = .05 to P = .005 to guard against false-positive results.16,17 Thus, we calculated the RFIs at different P value thresholds to allow stakeholders to gauge how some of these proposals would play out using the concept of RFI. We found that the median RFI increased from 8 to 12 at the threshold of P = .01 and to 14 at the threshold of P = .005. Changing the statistical significance threshold does not solve the primary problem of P values being viewed in a dichotomized manner. Moreover, using a more rigorous threshold will make it challenging for investigators and sponsors to come up with breakthrough therapeutics and may make it easier to overlook many potential treatments. In line with the recommendation of the American Statistical Association, we suggest interpreting P values as continuous measures rather than applying a distinct demarcation based on the prespecified value of P = .05, especially when borderline results are obtained in trials.26 Furthermore, all P values should be accompanied by point estimates and margins of error as recommended by the statistical guidelines of JAMA and the New England Journal of Medicine.27,28 Regardless of which P value threshold is used, RFI may serve as an additional metric to show how statistical significance can be missed.

    The RFI does not solve the more complex statistical issues underlying clinical practice; it is an extension of a frequentist approach to trial analysis and shares the many inherent limitations of null hypothesis testing. Null hypothesis statistical testing assumes a single hypothesis and generally leads to a dichotomized approach of rejecting or not rejecting the null hypothesis on the basis of a single study.29-31 Thus, RFI and FI have been criticized for perpetuating the dichotomization of P values and have been described as just a restatement of a P value.18 Despite these criticisms, RFI is important because it provides an easy, quick way to see the gray in a typically black-or-white interpretation of results; it also helps illustrate the drawback of using P value thresholds to define the statistical significance of treatment effects by showing their relative fragility. Compared with null hypothesis statistical testing, a bayesian approach offers a natural framework for incorporating additional information, including data from other trials or elicited expert opinion, into the analyses. Formally, this is achieved through so-called prior distributions that reflect the additional information. In this way, bayesian approaches can facilitate the interpretation of trials.32 However, although bayesian approaches make sense theoretically, their application is sometimes hindered by the need to prespecify prior distributions. Moreover, although a bayesian approach is attractive, most clinical research is currently based on frequentist approaches; in such a case, RFI is a useful supplemental method. Therefore, large RCTs with a high number of events should be advocated for generating robust evidence.
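
    As an illustration of how a prior distribution can carry additional information into the analysis of a dichotomous end point, the sketch below uses a conjugate beta-binomial model. This is an assumption chosen for simplicity, not the method of this study or of the cited bayesian reanalysis. The function name, the prior parameters, and the trial counts are hypothetical.

```python
import random


def prob_treatment_better(events_t, n_t, events_c, n_c,
                          prior_a=1.0, prior_b=1.0, draws=20000, seed=0):
    """Monte Carlo estimate of P(event rate with treatment < event rate with control).

    Each arm gets an independent Beta(prior_a, prior_b) prior; with binomial
    data the posterior is Beta(prior_a + events, prior_b + nonevents).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_t = rng.betavariate(prior_a + events_t, prior_b + n_t - events_t)
        rate_c = rng.betavariate(prior_a + events_c, prior_b + n_c - events_c)
        wins += rate_t < rate_c
    return wins / draws


# Hypothetical trial (not from the study): 20/100 events vs 30/100 events.
flat = prob_treatment_better(20, 100, 30, 100)  # flat Beta(1, 1) prior
# Informative prior on both arms, worth about 100 patients at a 25% event rate.
informed = prob_treatment_better(20, 100, 30, 100, prior_a=25.0, prior_b=75.0)
print(round(flat, 3), round(informed, 3))
```

    The output is a posterior probability of benefit rather than a significant or nonsignificant verdict, and it moves with the choice of prior, which is why the need to prespecify priors matters in practice.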

    Limitations

    This study has several limitations. First, use of the RFI concept was limited to trials with null results, 1:1 randomization, and dichotomous primary end points. Clinically important continuous end points were excluded from the study. Second, the use of the Fisher exact test in calculating the RFI was crude because some of the included RCTs analyzed data with covariate-adjusted models or time-to-event techniques, for which the original numbers, if analyzed with the Fisher exact test, would not yield the same P value as the published study. Thus, the RFIs did not account for the contribution of time to the difference in treatment effects. However, Walsh et al6 found no difference in FI scores between time-to-event data and frequency data. This finding suggests that the RFI may also be unaffected because the results were more sensitive to the number of events in each group than to the timing of the events. Nevertheless, concerns remain that the RFI might give excessively fragile results in time-to-event data, especially when the number of events is similar in each group but the timing of events is different. Results of the sensitivity analyses, however, were not statistically significantly different between included trials that used time-to-event primary end points and those that used frequency data. Moreover, the concept of survival FI, defined as the number of participants with an event at the mean exposure time of all participants in the study whose addition would result in a loss of statistical significance, has been proposed as an alternative that accounts for events over time.33 Methods to calculate FI for logistic regression β coefficients have also been proposed.33 These concepts can be extended to RFI in future studies. Third, we analyzed only RCTs with null results that were published in select peer-reviewed journals.

    Conclusions

    In this cross-sectional study of the RFI of statistically nonsignificant results of published clinical trials, the median RFI appeared to be low, indicating that a relatively small number of events were required to turn the primary end point from nonsignificant to statistically significant. The findings in this study emphasize the nuance required when interpreting trial results that did not meet prespecified significance thresholds.

    Article Information

    Accepted for Publication: May 18, 2020.

    Published: August 5, 2020. doi:10.1001/jamanetworkopen.2020.12469

    Correction: This article was corrected on August 28, 2020, to fix an incorrect author degree in the byline.

    Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2020 Khan MS et al. JAMA Network Open.

    Corresponding Author: Javed Butler, MD, MPH, MBA, Department of Medicine, University of Mississippi Medical Center, 2500 N State St, Jackson, MS 39216 (jbutler4@umc.edu).

    Author Contributions: Drs M. S. Khan and Lateef had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

    Concept and design: M. S. Khan, Fonarow, Lateef, Anker, Butler.

    Acquisition, analysis, or interpretation of data: M. S. Khan, Fonarow, Friede, Lateef, S. U. Khan, Harrell, Butler.

    Drafting of the manuscript: M. S. Khan, Lateef, Butler.

    Critical revision of the manuscript for important intellectual content: Fonarow, Friede, S. U. Khan, Anker, Harrell, Butler.

    Statistical analysis: Friede, Lateef, Harrell.

    Administrative, technical, or material support: M. S. Khan, S. U. Khan, Butler.

    Supervision: M. S. Khan, Butler.

    Conflict of Interest Disclosures: Dr Fonarow reported receiving personal fees from Abbott, AstraZeneca, Amgen, Bayer, Edwards, Janssen, Merck, and Medtronic outside the submitted work as well as being associate editor of JAMA Cardiology. Dr Friede reported receiving personal fees from Bayer, Novartis, Vifor, Enanta, Daiichi Sankyo, Johnson & Johnson, Boehringer Ingelheim, Roche, Fresenius Kabi, LivaNova, Galapagos, Penumbra, and Relaxera outside the submitted work. Dr Anker reported receiving grants and personal fees from Vifor and AV-Pharma as well as personal fees from Bayer, Boehringer Ingelheim, Novartis, Impulse Dynamics, and Servier outside the submitted work. Dr Butler reported receiving personal fees as a consultant from Abbott, Adrenomed, Amgen, Array, AstraZeneca, Bayer, Berlin Cures, Boehringer Ingelheim, Bristol-Myers Squibb, CVRx, G3 Pharmaceutical, Innolife, Janssen, LivaNova, Luitpold, Medtronic, Merck, Novartis, Novo Nordisk, Relypsa, Roche, Sanofi, SC Pharma, V-Wave Limited, and Vifor outside the submitted work. No other disclosures were reported.

    References
    1.
    Alderson P. Absence of evidence is not evidence of absence. BMJ. 2004;328(7438):476-477. doi:10.1136/bmj.328.7438.476
    2.
    Pocock SJ, McMurray JJV, Collier TJ. Making sense of statistics in clinical trial reports: part 1 of a 4-part series on statistics for clinical trials. J Am Coll Cardiol. 2015;66(22):2536-2549. doi:10.1016/j.jacc.2015.10.014
    3.
    Wasserstein RL, Lazar NA. The ASA’s statement on P values: context, process, and purpose. Am Stat. 2016;70(2):129-133. doi:10.1080/00031305.2016.1154108
    4.
    Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial—survey of 71 negative trials. N Engl J Med. 1978;299(13):690-694. doi:10.1056/NEJM197809282991304
    5.
    Akl EA, Briel M, You JJ, et al. Potential impact on estimated treatment effects of information lost to follow-up in randomised controlled trials (LOST-IT): systematic review. BMJ. 2012;344:e2809. doi:10.1136/bmj.e2809
    6.
    Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. J Clin Epidemiol. 2014;67(6):622-628. doi:10.1016/j.jclinepi.2013.10.019
    7.
    Docherty KF, Campbell RT, Jhund PS, Petrie MC, McMurray JJV. How robust are clinical trials in heart failure? Eur Heart J. 2017;38(5):338-345. doi:10.1093/eurheartj/ehw427
    8.
    Tignanelli CJ, Napolitano LM. The fragility index in randomized clinical trials as a means of optimizing patient care. JAMA Surg. 2019;154(1):74-79. doi:10.1001/jamasurg.2018.4318
    9.
    Del Paggio JC, Tannock IF. The fragility of phase 3 trials supporting FDA-approved anticancer medicines: a retrospective analysis. Lancet Oncol. 2019;20(8):1065-1069. doi:10.1016/S1470-2045(19)30338-9
    10.
    Evaniew N, Files C, Smith C, et al. The fragility of statistically significant findings from randomized trials in spine surgery: a systematic survey. Spine J. 2015;15(10):2188-2197. doi:10.1016/j.spinee.2015.06.004
    11.
    Mazzinari G, Ball L, Serpa Neto A, et al. The fragility of statistically significant findings in randomised controlled anaesthesiology trials: systematic review of the medical literature. Br J Anaesth. 2018;120(5):935-941. doi:10.1016/j.bja.2018.01.012
    12.
    Matics TJ, Khan N, Jani P, Kane JM. The fragility index in a cohort of pediatric randomized controlled trials. J Clin Med. 2017;6(8):79. doi:10.3390/jcm6080079
    13.
    Brown J, Lane A, Cooper C, Vassar M. The results of randomized controlled trials in emergency medicine are frequently fragile. Ann Emerg Med. 2019;73(6):565-576. doi:10.1016/j.annemergmed.2018.10.037
    14.
    Wayant C, Meyer C, Gupton R, Som M, Baker D, Vassar M. The fragility index in a cohort of HIV/AIDS randomized controlled trials. J Gen Intern Med. 2019;34(7):1236-1243. doi:10.1007/s11606-019-04928-5
    15.
    Chase Kruse B, Matt Vassar B. Unbreakable? an analysis of the fragility of randomized trials that support diabetes treatment guidelines. Diabetes Res Clin Pract. 2017;134:91-105. doi:10.1016/j.diabres.2017.10.007
    16.
    Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018;319(14):1429-1430. doi:10.1001/jama.2018.1536
    17.
    Wayant C, Scott J, Vassar M. Evaluation of lowering the P value threshold for statistical significance from .05 to .005 in previously published randomized clinical trials in major medical journals. JAMA. 2018;320(17):1813-1815. doi:10.1001/jama.2018.12288
    18.
    Carter RE, McKie PM, Storlie CB. The fragility index: a P value in sheep’s clothing? Eur Heart J. 2017;38(5):346-348. doi:10.1093/eurheartj/ehw495
    19.
    Walsh M, Devereaux PJ, Sackett DL. Clinician trialist rounds: 28: when RCT participants are lost to follow-up, part 1: why even a few can matter. Clin Trials. 2015;12(5):537-539. doi:10.1177/1740774515597702
    20.
    Ahmed W, Fowler RA, McCredie VA. Does sample size matter when interpreting the fragility index? Crit Care Med. 2016;44(11):e1142-e1143. doi:10.1097/CCM.0000000000001976
    21.
    Edwards E, Wayant C, Besas J, Chronister J, Vassar M. How fragile are clinical trial outcomes that support the CHEST clinical practice guidelines for VTE? Chest. 2018;154(3):512-520. doi:10.1016/j.chest.2018.01.031
    22.
    Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: what do P values and confidence intervals really represent? Anesth Analg. 2018;126(3):1068-1072. doi:10.1213/ANE.0000000000002798
    23.
    Velazquez EJ, Lee KL, Deja MA, et al; STICH Investigators. Coronary-artery bypass surgery in patients with left ventricular dysfunction. N Engl J Med. 2011;364(17):1607-1616. doi:10.1056/NEJMoa1100356
    24.
    Velazquez EJ, Lee KL, Jones RH, et al; STICHES Investigators. Coronary-artery bypass surgery in patients with ischemic cardiomyopathy. N Engl J Med. 2016;374(16):1511-1520. doi:10.1056/NEJMoa1602001
    25.
    Combes A, Hajage D, Capellier G, et al; EOLIA Trial Group, REVA, and ECMONet. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome. N Engl J Med. 2018;378(21):1965-1975. doi:10.1056/NEJMoa1800385
    26.
    Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p<0.05”. Am Stat. 2019;73(1):1-19. doi:10.1080/00031305.2019.1583913
    27.
    Bauchner H, Golub RM, Fontanarosa PB. Reporting and interpretation of randomized clinical trials. JAMA. 2019;322(8):732-735. doi:10.1001/jama.2019.12056
    28.
    Harrington D, D’Agostino RB Sr, Gatsonis C, et al. New guidelines for statistical reporting in the journal. N Engl J Med. 2019;381(3):285-286. doi:10.1056/NEJMe1906559
    29.
    Schriger DL. Problems with current methods of data analysis and reporting, and suggestions for moving beyond incorrect ritual. Eur J Emerg Med. 2002;9(2):203-207. doi:10.1097/00063110-200206000-00021
    30.
    Kruschke JK, Liddell TM. The bayesian new statistics: hypothesis testing, estimation, meta-analysis, and power analysis from a bayesian perspective. Psychon Bull Rev. 2018;25(1):178-206. doi:10.3758/s13423-016-1221-4
    31.
    Kruschke JK. Bayesian estimation supersedes the t test. J Exp Psychol Gen. 2013;142(2):573-603. doi:10.1037/a0029146
    32.
    Ryan EG, Harrison EM, Pearse RM, Gates S. Perioperative haemodynamic therapy for major gastrointestinal surgery: the effect of a bayesian approach to interpreting the findings of a randomised controlled trial. BMJ Open. 2019;9(3):e024256. doi:10.1136/bmjopen-2018-024256
    33.
    Johnson KW, Rappaport E, Khader S, Glicksberg BS, Dudley JT. Fragility index: an R package for statistical fragility estimates in biomedicine. bioRxiv. Preprint posted online February 27, 2019. doi:10.1101/562264