Use of Quantile Treatment Effects Analysis to Describe Antidepressant Response in Randomized Clinical Trials Submitted to the US Food and Drug Administration

Key Points Question What percentage of patients with severe depression experience improvement with antidepressant therapy and by how much? Findings In this pooled secondary analysis of data from the US Food and Drug Administration that included 57 313 participants with severe depression from 232 randomized clinical trials of antidepressants for major depressive disorder, all quantiles of depression response were more favorable among drug-assigned participants, by 3% to 14%. These findings depend heavily on statistical assumptions. Meaning These findings suggest that antidepressants may improve depression severity for a broad range of patients with severe depression, but for many patients, the magnitude of the reduction may be small.


Introduction
Major depressive disorder (MDD) is a leading cause of global distress and disability. 1,2 Antidepressants and psychotherapy are mainstays of MDD treatment. 3,4 Most meta-analyses of antidepressant therapy randomized clinical trials focus on average treatment effect and/or its association with baseline characteristics. 5-11 They generally find that patients with severe depression benefit from antidepressant therapy but only by a small amount on average, and this generates debate about whether use of antidepressants is worth the risk of adverse effects. 12-14 This debate also depends on how antidepressant efficacy is distributed in populations and in individuals, which is less commonly studied. Both population-level distributions and individual-level distributions matter, and the conceptual differences between them are subtle but important as explained in the eAppendix in Supplement 1.
Stone et al 15 recently estimated the distribution of antidepressant efficacy using a high-quality US Food and Drug Administration (FDA) data set and inferred that 15% of participants experience a robust response specific to active drug. That study used mixture models 16 to estimate the distribution of antidepressant response. The study by Stone et al 15 is an important contribution to the literature, but we do not share its statistical assumptions. Specifically, their model assumes that the 3 identified mixture components correspond to natural, distinct subtypes of MDD patient response, whereas we believe the identified mixtures may represent statistical artifact. 17 We wrote to Stone et al 15 for access to a shareable portion of their data to reestimate the population and individual antidepressant efficacy distributions using different assumptions.
We model the distribution of antidepressant response using a quantile treatment effect (QTE) framework. 18 When estimating the distribution of individual antidepressant response, we assume rank similarity. Rank similarity is a popular premise in QTE analysis that allows extrapolation from the population-level distribution of response to the individual-level distribution of response. 19 Under the rank similarity premise, the expected counterfactual placebo response at a given quantile of response among participants in the drug arm is modeled as the actual placebo response at that quantile. For example, it assumes that the 55th percentile of individuals who respond to drug therapy, had they instead been assigned to placebo, would have experienced the depression response of the 55th percentile of placebo-assigned participants and likewise for other quantiles.
Rank similarity is appropriate when the features that affect a participant's rank within one treatment arm exert a similar effect in both arms. This is clinically plausible in the depression context because many of the same factors are highly clinically significant in determining the course of a patient's depression in either arm of a trial. For instance, patients with worsening social circumstances tend to have worse depression responses compared with their same-arm peers in the trial whether assigned to drug or placebo. Similarly, patients whose depression before the trial had been long-standing tend to have worse depression responses compared with their same-arm peers whether assigned to drug or placebo. Rank similarity would be violated if factors that affect antidepressant responsiveness were more important for determining relative rank within an arm than features that affect the course of depression regardless of treatment assignment. A recent review stated that such factors have not been conclusively identified, 20 although we note that this is an area in which developments in precision medicine are still emerging. 21-23 If rank similarity is not met, then QTE analysis that assumes rank similarity will tend to underestimate the amount of heterogeneity in treatment effects. 24 For interested readers unfamiliar with QTE analysis or who seek more intuition regarding the rank similarity premise in the present context, we include a miniature, fully explained example in the eAppendix, eTable 1, and eTable 2 in Supplement 1.
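The quantile-by-quantile comparison at the heart of a QTE analysis can be sketched numerically. The following is a minimal illustration in Python with simulated percentage-response values (not the FDA data, and not the article's R-based implementation); under rank similarity, the placebo-arm quantile at rank τ doubles as the modeled counterfactual outcome for the drug-arm participant at that same rank.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated percentage depression responses (illustrative values only)
drug = rng.normal(50, 25, size=5000)
placebo = rng.normal(40, 25, size=5000)

taus = np.arange(0.05, 1.0, 0.05)       # 5th to 95th percentile, steps of 5%
q_drug = np.quantile(drug, taus)        # drug-arm response quantiles
q_placebo = np.quantile(placebo, taus)  # placebo-arm response quantiles

# Population-level quantile treatment effect at each quantile
qte = q_drug - q_placebo

# Under rank similarity, q_placebo[i] is also the modeled counterfactual
# placebo response of the drug-arm participant at rank taus[i], so qte[i]
# can be read as that individual's treatment effect.
print(dict(zip(np.round(taus, 2), np.round(qte, 1))))
```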

Data Acquisition
This quantile treatment effects study is a secondary analysis of pooled participant data. We obtained aggregate data through personal correspondence with the authors of Stone et al, 15 describing the number of participants at each combination of baseline and final depression severity in the drug arm of any study and the number in the placebo arm of any study. These aggregate data imply a certain core set of IPD, such that we could extract for each participant their baseline depression severity, final depression severity, and whether they were assigned to the drug or placebo arm. In the aggregate data we received, depression severity had been converted to 17-item Hamilton Rating Scale for Depression (HAMD-17) equivalents as reported by Stone et al, 15 rounded to the nearest integer. We did not receive any study-level information or any other data about participants. This was the minimum necessary information for completing our planned QTE analysis. Because we used aggregated, anonymized data, our study was determined exempt from institutional review board approval by Duke University Health System. The analysis plan was not preregistered and no analyses were prespecified. We followed the relevant portions of the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) reporting guideline.

Statistical Analysis

Data Quality Testing and Processing
To test data completeness, we compared the counts of treatment-assigned and placebo-assigned participants in the aggregate data we received against the values reported by Stone et al. 15 To test data consistency, we compared the range of baseline and final depression severity scores against the 0 to 52 range of the HAMD-17 scale. To test for risk of bias across studies introduced by the pooling process, we used a Wilcoxon rank sum test to compare baseline depression severity scores between treatment groups. When this test raised concern for a slight baseline imbalance, we used a literature-inspired filtering procedure of participants by baseline depression severity. A systematic review found that 20 is the most common HAMD-17 cutoff score for antidepressant trial inclusion among trials using the HAMD-17 scale; thus, we filtered out participants with baseline HAMD-17 scores less than 20. 25 Then we again tested for baseline balance on the postfiltration set.
From the data obtained, we calculated 2 candidate measures of depression response since both percentage and absolute depression response are commonly used in studies. 26 Percentage depression response was defined as 1 -F/B, expressed as a percentage, where F is the final depression severity and B is the baseline depression severity. Absolute depression response was defined as B -F. We used tests of rank similarity to guide which candidate measure of depression response to use in later parts of the study.
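The two candidate measures, together with the baseline-severity filter described above, can be sketched concretely. The scores below are made up for illustration and are not drawn from the FDA data.

```python
import numpy as np

# Hypothetical baseline (B) and final (F) HAMD-17 scores for 6 participants
baseline = np.array([24, 18, 30, 21, 19, 26])
final = np.array([10, 9, 22, 7, 15, 13])

# Keep only participants meeting the common HAMD-17 entry cutoff of >= 20
keep = baseline >= 20
b, f = baseline[keep], final[keep]

pct_response = 100 * (1 - f / b)   # percentage depression response, 1 - F/B
abs_response = b - f               # absolute depression response, B - F
```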

Testing Rank Similarity
We separately tested for rank similarity using our 2 candidate measures of depression response. We used the rank similarity test of Frandsen and Lefgren 19 in which an available baseline attribute is tested for a significant interaction with the treatment arm when predicting the response, with the response separately tested as percentage depression response and absolute depression response.
The available baseline attribute we used was baseline depression severity. Based on the results of these tests, we used percentage depression response as our chosen measure of depression response moving forward. A sensitivity analysis based on absolute depression response is presented in the eAppendix and eFigure in Supplement 1.
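The interaction test described above can be sketched as an ordinary least squares regression with an interaction term. This is an illustrative Python sketch on simulated data with a known interaction, not the article's actual implementation or data.

```python
import numpy as np

def interaction_test(baseline, arm, response):
    """Regress response on baseline, arm, and their interaction;
    return the interaction coefficient and its t statistic.
    A significant interaction argues against rank similarity."""
    X = np.column_stack([np.ones_like(baseline, dtype=float),
                         baseline.astype(float),
                         arm.astype(float),
                         (baseline * arm).astype(float)])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    resid = response - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # OLS covariance matrix
    return beta[3], beta[3] / np.sqrt(cov[3, 3])

# Simulated data with a built-in interaction of 0.2 (illustrative only)
i = np.arange(200)
baseline = i % 30 + 10
arm = i % 2
response = (2.0 + 0.5 * baseline + 1.0 * arm
            + 0.2 * baseline * arm + 0.1 * np.cos(i))
coef, t = interaction_test(baseline, arm, response)
```

In the article's analysis the logic runs in the opposite direction: a nonsignificant interaction term, as observed with the percentage response measure, leaves rank similarity unrejected.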

Quantile Treatment Effects
We calculated each quantile of percentage depression response separately in the treatment and placebo arms, at quantiles from the 5th percentile to 95th percentile, in increments of 5%. The QTEs were calculated as the difference between percentage depression response in the treatment vs placebo arms at a given quantile. The QTE 95% CIs were calculated by bootstrapping, using 10 000 iterations of the basic algorithm from the ci_quantile_diff function of the Hmisc R package, version 4.7-1. The QTEs were graphically plotted without covariates using the ci_qtet function of the qte R package, version 1.3.1 (R Foundation for Statistical Computing). Significance testing was performed via paired, 2-sided tests, with a significance threshold of P = .05.
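A simplified version of the bootstrapped quantile-difference CI can be sketched as follows. This uses the percentile bootstrap rather than the basic bootstrap algorithm used in the article, and illustrative data rather than the trial data.

```python
import numpy as np

def qte_bootstrap_ci(drug, placebo, tau, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for the QTE at quantile tau:
    resample each arm with replacement and recompute the
    between-arm quantile difference each time."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        d = rng.choice(drug, size=drug.size, replace=True)
        p = rng.choice(placebo, size=placebo.size, replace=True)
        diffs[b] = np.quantile(d, tau) - np.quantile(p, tau)
    return np.quantile(diffs, [0.025, 0.975])

# Example: drug-arm responses shifted up by 10 points at every rank,
# so the true QTE is 10 at every quantile
placebo = np.arange(5000) / 50.0
drug = placebo + 10
lo, hi = qte_bootstrap_ci(drug, placebo, tau=0.55)
```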

Data Quality
The aggregate data we received implied 71 393 participants, of whom 47 243 were assigned to drug and 24 150 to placebo. These values match the counts reported by Stone et al. 15 All participants had an integer score for baseline depression severity and for final depression severity, and their range from 0 to 50 was within the 0 to 52 range of the HAMD-17 scale.
There was a slight and statistically significant difference in mean baseline depression severity between drug and placebo arms equivalent to 0.15 points on the HAMD-17 scale (P = 3.6 × 10⁻⁵ by Wilcoxon rank sum test); this difference in baselines could have arisen from variable randomization ratios among the individual studies composing our pooled data. To address this difference, we noted that participants with very low HAMD-17 scores may not meet the criteria for MDD and that many published randomized clinical trials of antidepressants only include participants with a HAMD-17 score of 20 or greater. 25 Thus, we excluded 9115 drug-assigned participants and 4965 placebo-assigned participants with baseline HAMD-17 scores less than 20 from our pool, yielding a final analysis set of 57 313 participants. After this filtration, the difference in mean baseline depression severity between drug and placebo arms decreased to 0.037 HAMD-17 points and was no longer significant (P = .11 by Wilcoxon rank sum test). That is, the synthesis procedure used to produce our pooled data had some evidence of bias, which we were able to mitigate through filtering. Due to our study design and limitations in available data, some additional common tests were not applicable; the eAppendix in Supplement 1 provides details.

Testing Rank Similarity
Since QTE analysis is most richly interpretable when rank similarity can be assumed, we tested whether our data and intended formulation were consistent with rank similarity. Following the method of Frandsen and Lefgren, 19 we trained a linear model to predict percentage depression response from baseline depression severity, treatment arm, and the interaction between baseline depression severity and treatment arm. If there was a statistically significant interaction between baseline depression severity and treatment arm for prediction of percentage depression response in this model, then rank similarity would be rejected. When we applied this test, rank similarity was not rejected. Specifically, the interaction term between baseline depression severity and treatment arm in the model was not significant (P > .99). While this test cannot prove that rank similarity holds, the results we obtained from this test provide some statistical reassurance in the plausibility of the rank similarity assumption for our context.
Since there is not consensus in the literature about whether to define depression response as a percentage change from baseline as we have done vs an absolute change from baseline, we also tested for rank similarity using absolute change from baseline as the response variable. Under this alternative formulation, the interaction term between baseline depression severity and treatment arm in the model becomes significant (P = .003) (the eAppendix in Supplement 1 provides a discussion of evidence that the magnitude of this interaction is very small). At a minimum, this finding indicates that a QTE analysis in which depression response is defined as an absolute change from baseline must adjust for baseline depression severity. We instead chose to use percentage depression response and not adjust for baseline depression severity.

Quantile Treatment Effects
Next, we characterized the estimated distribution of antidepressant response. We calculated the depression response distribution separately in treatment and control conditions (Figure 1), then calculated QTEs as the difference between these distributions (Figure 2). We observed that depression responses were more favorable in the treatment arm than in the placebo arm at all reported quantiles. At the 55th quantile, treatment arm participants had a final depression score that was 52.0% improved from baseline, and the corresponding value for placebo was 38.5%, for a QTE of 13.5% (95% CI, 12.4%-14.4%), with values at other quantiles listed in the Table. The QTEs were greater in magnitude toward the center of the distribution and dissipated toward the tails. These results suggest that, if rank similarity holds, then participants at any quantile of depression response experienced at least some additional response from antidepressant treatment.

[Figure 2 caption: For each quantile (τ) of depression response, the difference between the treatment arm and placebo arm depression response at that quantile is shown, expressed as a percentage, along with its bootstrapped 95% CI.]

Discussion
We did not observe violations of rank similarity in our data when using the percentage definition of depression response. In contrast, when we defined depression response in terms of absolute improvement from baseline, there was a formal violation of rank similarity, although the magnitude of that observed violation was small. Our ability to test for rank similarity was limited. For instance, we only had the data to test for the potentially rank-distorting influence of a single potential moderator of treatment effect and not several other potential moderators that have been reported in the literature, such as brain network and perfusion patterns. 21-23 If the rank similarity premise is indeed true, then our results are compatible with the possibility that all individuals with MDD may experience at least slightly better depression responses while receiving antidepressant therapy compared with placebo. These findings are exploratory and would need to be confirmed through specialized placebo run-in trials with a prolonged run-in duration (eg, the 6 weeks of placebo tested herein) followed by randomization of all run-in period participants, in contrast to common placebo run-in practices. 27 The prediction is that partial responders in the run-in period would experience a greater benefit from active drug than would run-in nonresponders and run-in robust responders. If confirmed, future randomized clinical trials of antidepressants might increase their sensitivity by randomizing only the partial responders of an adequate run-in period rather than following the more common practice of randomizing the nonresponders of a potentially inadequate run-in period.
Our analysis does not directly contradict that of Stone et al. 15

Strengths and Limitations
A strength of this study is its use of FDA data, which include both published and unpublished high-quality randomized clinical trials and their pooled IPD. Another strength is that the population-level findings do not depend on any special assumptions.
One limitation of the study is that individual-level findings depend on the rank similarity assumption, which is unproven. Other limitations include lack of associated data and therefore omitted analyses concerning adverse effects, long-term effects, demographic characteristic covariates, and study-specific information. The lack of study-specific information in particular means that we cannot account for between-study heterogeneity, which might otherwise affect the study-level results. If our assumptions hold, a broad range of patients experience at least some improvement with antidepressant therapy, although the magnitude of the response is more clinically meaningful in some patients than in others. If our assumptions are not met, it is also possible that the same aggregate benefit is concentrated in substantially fewer patients. Regardless, estimating the percentage of patients who benefit from antidepressant therapy is a challenging task that depends on the statistical assumptions used.