Survival-Inferred Fragility Index of Phase 3 Clinical Trials Evaluating Immune Checkpoint Inhibitors

Key Points
Question: How stable are the conclusions of phase 3 randomized clinical trials of immune checkpoint inhibitors in oncology?
Findings: This cross-sectional study of 45 randomized clinical trials calculated the survival-inferred fragility index and found that many oncologic trials assessing immune checkpoint inhibitors have a low survival-inferred fragility index, often less than a small fraction of the sample size and less than the number of patients censored soon after randomization.
Meaning: These results challenge the robustness of many phase 3 randomized clinical trials of immune checkpoint inhibitors in oncology and address the uncertainty regarding their potential clinical benefit.


Introduction
Immune checkpoint inhibitors (ICIs) targeting cytotoxic T-lymphocyte-associated protein 4 (CTLA-4) or programmed cell death 1 (PD-1) and programmed cell death 1 ligand 1 (PD-L1) have revolutionized cancer treatment and led to their approval as first-line therapies, either alone or in combination with chemotherapy, for many solid tumors and hematologic malignant neoplasms. 1 However, the clinical benefit associated with ICIs cannot be generalized into a single category, as the therapeutic effectiveness varies widely across different cancer indications. 2-7 The number of active clinical trials of ICIs is growing rapidly, along with an increased pace of accelerated approvals by the US Food and Drug Administration (FDA). 8,9 The eligibility criteria for ICI therapy are dynamic, and results of postmarketing studies often lead to label revisions, with more changes expected to follow. 10 Despite the popularity of ICIs and the expanding eligibility for expensive and potentially toxic treatments, the percentage of eligible patients who benefit from ICIs is decreasing. 10,11 This gap between ICI eligibility and clinical benefit is concerning and is not fully understood.
Since the introduction of the P value almost a century ago, reliance on a fixed cutoff serving as the gatekeeper for establishing significance in clinical trials has caused controversy. 12,13 Statistically significant differences in outcomes using an arbitrary threshold (P < .05) may not be clinically relevant, especially when the estimated outcome does not offer substantial clinical benefit. 14,15 The fragility of statistical inference can be signified by the ease with which a significant P value (P < .05) crosses over the significance threshold (P > .05). 16, 17 Johnson et al 18 introduced a method to compute the fragility for survival analysis by iteratively adding artificial patients to the experimental group with events at the mean exposure time of all individuals until significance is lost. Using this method, one study has recently shown that the fragility index of time-to-event data can be used to estimate the level of confidence of positive results reported in randomized clinical trials (RCTs) leading to FDA approval of anticancer drugs. 19 However, this approach that simulates average "virtual" patients might inflate the fragility estimate as patients at the extreme, who contribute the most to the survival curves, are disregarded. Many possible ways could be formulated to estimate the fragility of survival data. Therefore, we aimed to define a simple and intuitive fragility measure for survival analysis, based on real-life conditions, that captures the vulnerability of the data. Hence, we define the survival-inferred fragility index (SIFI) as the minimum number of reassignments of the best survivors (defined as the patients with the longest follow-up time, regardless of having an event or being censored; the worst survivors were defined as the patients with the earliest events) from the experimental group to the control group resulting in loss of significance (Figure 1). 
The purpose of this study is to evaluate the fragility of phase 3 RCTs comparing ICIs with control or standard treatments in a time-aware context.

Study Design
This cross-sectional study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. 20 We searched PubMed from inception until January 1, 2020, for phase 3 RCTs of ICIs (anti-CTLA-4, anti-PD-1, and anti-PD-L1) compared with standard treatment in solid and hematologic malignant neoplasms. Key words for the literature search included randomised, randomized, phase 3, phase III, ipilimumab, nivolumab, pembrolizumab, cemiplimab, durvalumab, avelumab, and atezolizumab. For the fragility analysis, we included 2- or 3-group studies that reported overall survival as a primary or secondary outcome. We excluded retrospective studies, pooled studies, and post hoc subgroup analyses. When duplicate publications for the same trial were identified, we included the most updated publication. We abstracted information on trial design and the number of enrolled patients in the study. According to institutional review board policy, ethical approval was not required because no human data were included and publicly available information was used.

JAMA Network Open | Oncology
Survival-Inferred Fragility of Phase 3 Trials of Immune Checkpoint Inhibitors

Data Extraction
Overall survival data from 45 trials were extracted from Kaplan-Meier curves in the main text using DigitizeIt software (DigitizeIt) and the method by Wei and Royston 21 using Stata, version 13.0 (StataCorp). This reverse-engineering strategy enabled us to reproduce survival time and censoring status at the individual patient level with minor differences between reconstructed and published data. 19 We excluded publications of trials with raster images in which data extraction could not be performed directly. We separated the populations into 2 cohorts: the intention-to-treat (ITT) populations, which also included modified ITT populations, and subgroup populations.

Statistical Analysis
The SIFI was calculated from Kaplan-Meier curves by the iterative redesignation of the best survivors from the experimental group to the control group until positive significance (defined as P < .05 obtained with a 2-sided log-rank test) was lost. Negative SIFI was calculated similarly, but in the opposite direction: redesignation of the best survivors from the control group to the experimental group. In addition to the default SIFI application (flipping the best survivor from the experimental group to the control group), we defined 3 alternative approaches: flipping the worst survivor from the experimental group to the control group, cloning the best survivor in the experimental group into the control group, and cloning the worst survivor in the control group to the experimental group. P values were calculated with the 2-sided unstratified log-rank test. The follow-up time distribution was calculated using the prodlim package in R (R Foundation for Statistical Computing). All other analyses were performed in R, version 3.5.0. The code used to calculate SIFI is available online. 22
To provide a reference for the ranges of SIFI for various parameters of survival data, we generated synthetic survival data with the survsim package in R. 23 The "simple.surv.sim" function was used with the Weibull distribution for both the time to event and the time to censoring. The cohort size was set to range from 100 to 1200 individuals in intervals of 100 (with a 1:1 allocation). The ancillary parameter for the events was set to 1.5, and the ancillary parameter for the censoring was set to 2, 4, 6, 8, or 10. The covariate for the effect size was set to all values between −1 and 0.2 in increments of 0.05. The β0 parameter for the event distribution was set to 2.0, and the β0 for the censoring distribution was set to 2.01.
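As an illustration of the default SIFI procedure (flipping the best survivor from the experimental group to the control group until significance is lost), the following Python sketch reimplements the idea using the standard library only. It is not the code referenced above (reference 22), which was written in R. The function names `logrank`, `sifi`, and `simulate_arm`, the 0/1 group coding, and the Weibull scale values in the example are assumptions made for illustration, and the mapping from the survsim β0 and covariate parameters to Weibull scales is not reproduced here.

```python
import math
import random


def logrank(times, events, groups):
    """Return (2-sided unstratified log-rank P value, observed - expected events in group 1)."""
    rows = list(zip(times, events, groups))
    o_minus_e = 0.0
    variance = 0.0
    for t in sorted({x for x, e, _ in rows if e}):
        n = sum(1 for x, _, _ in rows if x >= t)              # at risk, both groups
        n1 = sum(1 for x, _, g in rows if x >= t and g == 1)  # at risk, group 1
        d = sum(1 for x, e, _ in rows if x == t and e)        # events at time t
        d1 = sum(1 for x, e, g in rows if x == t and e and g == 1)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0, o_minus_e
    chi2 = o_minus_e ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2)), o_minus_e


def sifi(times, events, groups, alpha=0.05):
    """Count best-survivor flips from experimental (1) to control (0) until P >= alpha.

    Returns None when the starting result is not significant in favor of group 1.
    """
    groups = list(groups)
    p, o_minus_e = logrank(times, events, groups)
    if p >= alpha or o_minus_e >= 0:
        return None
    flips = 0
    while p < alpha and o_minus_e < 0:
        experimental = [i for i, g in enumerate(groups) if g == 1]
        # Best survivor: longest follow-up time, whether an event or censored.
        best = max(experimental, key=lambda i: times[i])
        groups[best] = 0
        flips += 1
        p, o_minus_e = logrank(times, events, groups)
    return flips


def simulate_arm(n, shape_event, scale_event, shape_cens, scale_cens, rng):
    """Weibull time to event and Weibull time to censoring for one arm (illustrative)."""
    times, events = [], []
    for _ in range(n):
        t_event = rng.weibullvariate(scale_event, shape_event)
        t_cens = rng.weibullvariate(scale_cens, shape_cens)
        times.append(min(t_event, t_cens))
        events.append(t_event <= t_cens)
    return times, events


if __name__ == "__main__":
    rng = random.Random(2020)
    # Hypothetical scales; shape 1.5 for events and 4 for censoring echo the text.
    t0, e0 = simulate_arm(60, 1.5, 5.0, 4, 7.5, rng)   # control arm
    t1, e1 = simulate_arm(60, 1.5, 9.0, 4, 7.5, rng)   # experimental arm
    print(sifi(t0 + t1, e0 + e1, [0] * 60 + [1] * 60))
```

The sketch covers only the positive SIFI with the default flipping rule. The negative SIFI and the 3 alternative approaches would follow the same loop with the direction or the selected patient changed.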

Results
For the period until January 1, 2020, we identified 45 phase 3 RCTs

Discussion
In our study, we found that the statistical significance of a substantial number of phase 3 trials of ICIs could be lost or gained with a change in assignment of very few of the best surviving patients, often less than 1% of the respective trial sample size. Although this is an arbitrary number and does not reflect a random sampling of the patients, it represents a small fraction of the population that can overturn the statistical conclusions. Also, the change in the number of patients required for fragility is often smaller than the number of patients censored in the experimental group shortly after randomization, adding further uncertainties and raising concerns about the statistical outcomes had these and other patients been assessed to their end point. Eligibility for treatment with ICIs is assessed by concluding whether results of a trial are positive or negative. Our findings demonstrate how unstable these conclusions may be, and explain, in part, the widening gap between eligibility and benefit associated with ICIs.
[...] timing of events. 19 Although descriptions of time-to-event fragility exist, 18,19 to our knowledge, no previous peer-reviewed original investigations have estimated a time-aware fragility index for clinical trials, including oncology trials. Also, to our knowledge, no study has evaluated negative fragility measures for survival analysis.

In general, the P value serves as a measure of the compatibility of collected data with a defined statistical model. In a testing framework, smaller P values indicate greater evidence against the null hypothesis (a conjecture of no difference between outcomes of the intervention and control groups). 75 Undoubtedly, the P value plays a central role in the clinical testing of new drugs, and since the 1960s, the FDA has relied on significance testing to establish their effectiveness in the approval process. 76 As such, nowhere is this role more important than in clinical trials, where the smallest change in the P value can decisively influence the drug approval process and result in trial success or failure. Consequently, passing the statistical significance threshold has become the ultimate goal, and unless an analysis is adequately prespecified, most research designs allow enough leeway to manipulate the results to claim importance. 77-80 Therefore, reliance on P values falling to either side of the significance threshold can result in extreme conclusions and be misleading, especially for a low threshold such as P < .05. Recently, an influential commentary published in Nature 12 has even called for the abandonment of the conventional threshold for statistical significance, regardless of the level (eg, P < .05), owing to this imposed dichotomization. However, statistical inferences are unavoidably dichotomous in many scientific fields. Most decisions in medicine are dichotomous; for example, a new drug will either be approved or not and will either be prescribed or not. 77 This study introduces the SIFI as a novel measure that enables us to estimate the vulnerability of the statistical conclusions of clinical trials with time-to-event outcomes. This index transforms the dichotomous conclusion to a discrete variable that provides more perspective regarding the potential benefit associated with ICIs or any other intervention.
The SIFI provides context to the P value and statistical significance, which may not necessarily be intuitive and are often poorly understood. 77 Therefore, the SIFI translates uncertainty to a specified number that represents actual patients and events and places it on a linear scale that allows for assessment of the robustness of the results. For example, consider 2 comparable studies with similar P values. Although the SIFI is not a measure of effect, a trial with a high SIFI with an acceptable association with the sample size and censoring provides more robustness than a trial with a small SIFI representing a small fraction of the sample size and censoring. The latter relies on fragile evidence with higher uncertainty regarding the incompatibility with the null hypothesis. We did not define criteria for fragile vs nonfragile values, nor do we believe that a measure aimed to address the dichotomization of results by a threshold should be replaced by another. Perhaps trials involving the addition of a costly and toxic drug to the standard treatment with a small effect size would require a higher level of robustness than trials comparing 2 drugs with similar overall properties. In contrast, concluding that statistically significant results show no real association when the fragility measure is very low is discouraged; it is equally inaccurate to claim that nonsignificant results with very small negative fragility point to an important signal. However, the SIFI allows for putting these 2 scenarios in context, expressing uncertainty and suggesting that the interpretation of their importance should be similar or, de facto, the same. In both cases, and especially for negative fragility measures, small values indicate either that the true underlying effects are negligible or that the trial lacks statistical power.
Nevertheless, considerations such as study design, data quality, comprehension of the underlying mechanisms, and other factors may often have more importance than statistical findings 12 such as P values or fragility indices.
The default solution for improving the confidence level would be making the barrier more demanding; however, this is a suboptimal option because the chance for false-negative results increases accordingly, and it still fails to address the vulnerability of the statistics. Nevertheless, fragility corresponding to one threshold is not comparable with another, and it is reasonable to expect lower fragility measures for lower P value thresholds, as they are interrelated. Hence, the approach encourages using lower significance thresholds. A trial not meeting a low prespecified significance threshold (eg, P < .0001), with a small negative SIFI (eg, −2), may provide higher confidence in the validity of the results compared with a trial that meets a higher threshold (eg, P < .05) but has a low positive SIFI (eg, 2). The SIFI relative to sample size can be useful to estimate the robustness of the results, but it could be misleading for small sample sizes. Although a SIFI of less than 1% of the sample size in many RCTs could suggest extreme fragility, small trials with fewer than 100 patients cannot achieve a SIFI of less than 1%, even when the results are certainly less robust. Therefore, the SIFI relative to sample size, especially for small trials, should not be interpreted alone and must be accompanied by the absolute SIFI.
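The arithmetic behind the small-trial caveat can be sketched as follows. This is a minimal illustration, assuming that the relative SIFI is the absolute SIFI divided by the total trial sample size; the function names `relative_sifi` and `smallest_relative_sifi` are hypothetical. Because the smallest nonzero SIFI is a single reassignment, the relative SIFI cannot fall below 100/n percent, which exceeds 1% whenever n is below 100.

```python
def relative_sifi(sifi_value, sample_size):
    """SIFI as a percentage of the trial sample size (sign dropped)."""
    return 100 * abs(sifi_value) / sample_size


def smallest_relative_sifi(sample_size):
    """Lower bound on the relative SIFI: a single reassignment (|SIFI| = 1)."""
    return relative_sifi(1, sample_size)


if __name__ == "__main__":
    # Hypothetical trial sizes, chosen only to show the bound.
    for n in (80, 100, 800):
        print(n, smallest_relative_sifi(n))
```

Under this assumption, a trial of 80 patients with a SIFI of 1 and a trial of 800 patients with a SIFI of 10 report the same relative value, which is why the absolute SIFI needs to be read alongside it.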

Limitations
Several limitations of the study should be recognized. We did not address prespecified P value thresholds, which were allocated and controlled differently in every trial and are often much lower than .05. Instead, we used the standard α level of .05 as a common reference; therefore, some trials did not meet the prespecified threshold but resulted in a positive SIFI. Although not a strict rule by the FDA, the standard 2-trial α level is .05 but is smaller for approval based on a single trial. 76  Furthermore, we aimed to define a simple and intuitive method that can be recreated using existing routines, is quantifiable in all conditions, and is applicable to real-world practice in which patients are randomly assigned from a pool of eligible patients. Although random variations alone can lead to large disparities in P values, the calculation of the SIFI is not based on random variations in the assignment of patients but on the reassignment of patients at the extreme ends of the scale.
However, the random allocation of patients can lead to different proportions of the best (or worst) survivors in the groups, which may impact the outcomes. Therefore, the SIFI serves as a simple and conservative approach to reflect the fragility of the statistics. Alternatively, the mean or median survival time can be exploited in different ways to quantify the fragility 18,19 ; however, this approach can underestimate the fragility if the few patients who cause most of the difference are not captured.

Conclusions
The results of this study suggest that many phase 3 RCTs evaluating ICI therapies are fragile and challenge the confidence in rejecting or concluding superiority for these drugs compared with standard treatments. Low fragility levels express uncertainty when there is no appreciable difference between the interpretative significance of data. In contrast, high fragility levels can provide robustness and aid in binary decision-making, especially for treatments associated with high cost and toxic effects that require strong support. Interpretation of any outcome is far more complicated than just significance testing, and the SIFI as a statistical and communication tool may serve as a better starting point for discerning between science and fiction.