Comparison of Dimethyl Fumarate vs Fingolimod and Rituximab vs Natalizumab for Treatment of Multiple Sclerosis

Key Points Question In the management of multiple sclerosis, is there a difference in relapse outcomes associated with commonly prescribed, standard-efficacy medications as well as with common higher-efficacy medications? Findings This comparative effectiveness study integrated electronic health records with research registry data and found no significant differences in relapse outcomes between dimethyl fumarate and fingolimod after correcting for confounding biases. Rituximab was associated with a lower relapse rate when compared with natalizumab after bias correction. Meaning The study illustrates the value of incorporating electronic health record data as high-dimensional covariates in real-world comparative effectiveness analysis of multiple sclerosis medications.


Multiple Testing Adjustment.
To adjust for the multiple testing (difference in 1-year relapse rate, difference in 2-year relapse rate and relative risks of 2-year relapse from time-to relapse), we bootstrapped the distribution of the maximal absolute value of the three Wald test statistics ( Table 2). In each of the 10000 bootstrap repeats, we used the same bootstrap sample to calculate the three bootstrap estimates for difference in 1-year relapse rate, difference in 2-year relapse rate and relative risks of 2-year relapse, which preserved the correlation of the estimates. We standardized each of the three bootstrap estimates, turning them into bootstrap Wald statistics. We took the maximum of the absolute value for each bootstrap Wald statistics and used it to empirically evaluate the distribution of the maximal absolute value of the three Wald test statistics.
After the multiple testing adjustment, the p-values were .02 for difference in 1-year relapse rate, .004 for difference in 2-year relapse rate and .01 for relative risks of 2-year relapse with RTX as the reference ( Table 2).

Adapted E-value Sensitivity Analysis to Assess Unmeasured Confounding.
We adapted the E-value sensitivity analysis 1 , which was developed for discrete feature space with small number of states, to high dimensional setting with many continuous elements ( Table 2). Specifically, we performed the following steps: (1) stratify the feature space by the propensity score (PS) and outcome regression (OR) model predictions; (2) calculate the E-value with covariate stratification for the DR estimation and its 95% confidence interval obtained via the bootstrap. A key quantity in the E-value analysis is the bounding factor, which quantifies the unmeasured confounding under a hypothetical true outcome model. The bounding factor can also quantify the reduction in unmeasured confounding between two nested models assuming the larger model is the true model. We adopted this bounding factor approach to illustrate the degree to which adjustment for additional EHR factors reduces the confounding biases compared to the analysis solely based on registry data. Given the limited number of observed relapse events, we stratified the feature space into 8 strata for difference in 1-year or 2-year relapse rate (the three-way interaction of high/low OR prediction for NTZ relapse, high/low OR prediction for RTX relapse, and high/low propensity for RTX) and 4 strata for relative risks of 2-year nonrelapse (the two-way interaction of high/low OR prediction for relapse risk and high/low propensity for RTX). Interactions were multiplicative. High propensity for RTX was the same as low propensity for NTZ.
We summarized the E-values in Table 2. For difference in 1-year relapse rate, the residual unmeasured confounding would negate the significance of the association if its bounding factor were greater than 1.13 and would change the direction of the association if its bounding factor were greater than 1.50. For difference in 2year relapse rate, the unmeasured confounding would negate the significance of the association if its bounding factor were greater than 1.31 and would change the direction of the association if its bounding factor were greater than 2.26. For the relative risk of 2-year non-relapse (from time-to-relapse analysis), the unmeasured confounding would negate the significance of the association if its bounding factor were greater than 1.06 and would change the direction of the association if its bounding factor were greater than 1.11. Given that the maximal relative risks among full EHR features associated with the time-to-relapse outcome was 1.03 (eTable 3), the E-values were relatively moderate to large, indicating that the conclusion had a moderate to large tolerance to unmeasured confounding.
To evaluate the unmeasured confounding in the registry-derived features that was explained by the highdimensional full EHR features that included expert-defined features, we calculated the bounding factor for the registry-derived analysis 1 , treating hypothetically the fitted OR and PS model from the full EHR feature analysis as the underlying true models (eFigure 1). The OR and PS model predictions from the full EHR feature analysis were the hypothetical unmeasured confounders in the calculation. Similar to E-value sensitivity analysis for the "full EHR" analysis, we discretized the feature space into 10 by 10 strata according to the OR and PS model predictions from the registry-derived analysis (see Supplementary Material 1.4). We also discretized each hypothetical unmeasured confounders to 10 levels. Within each stratum, we calculated the bounding factor using the OR and PS model predictions from the "full EHR" analysis as the true models for outcomes and exposures given the covariates and unmeasured confounders. We visualized the bounding factors as heat maps. The bounding factors for empty strata were set as one. The average bounding factor was calculated as the mean over the bounding factors of all strata proportional to their sizes. Higher bounding factor values indicated greater extent of confounding unexplained by registry features but explained by highdimensional EHR features. Here, the average bounding factors were 1.06 for 1-year relapse rate, 7.38 for 2year relapse rate, and 1.04 for time-to-relapse. These results indicated that unmeasured residual confounding in registry-derived analysis was explained by the "full EHR analysis", particularly for the 2-year relapse rate.

Expert-defined Analysis.
In the expert-defined analysis using registry-annotated treatment groups with RTX as the reference (eTable 6), after adjusting for covariates based on clinical knowledge, we found no significant difference in the 1-year relapse rate or 2-year relapse rate between the NTZ group and the RTX group after correction for multiple testing, while patients in the NTZ group had a significantly lower relative risk of 2-year non-relapse rate 0.926 (95% CI = 0.870-0.980, p-value .01 and adjusted p-value .04). The difference in results between the "full EHR analysis" and "expert-defined analysis" suggested that the high-dimensional full EHR feature set captured additional, important confounders not among the low-dimensional expert-defined covariates.
The analyses using the EHR RxNorm-identified treatment groups demonstrated results consistent with the primary analysis based on registry-annotated treatment groups (eTable 6). After adjusting for expert-defined set of covariates in the expert-defined analysis, the NTZ group had a marginally higher 1-year relapse rate than the RTX group, but there was no significant difference in 2-year relapse rate or time-to-relapse between the two groups after correction for multiple testing. As with the registry-annotated groups, in the full EHR analysis, RTX had lower relapse rate than NTZ for all three relapse outcomes after correction for multiple testing: 1-year relapse rate (DR estimate=0.091; 95% CI=0.046-0.138, p-value <.001 and adjusted p-value .001), 2-year relapse rate (DR estimate=0.138; 95% CI =0.058-0.188, p-value <.001 and adjusted p-value .001), and time-torelapse (DR estimate=0.910; 95% CI=0.856-0.984, p-value .01 and adjusted p-value .04).

Time-adjusted Analysis.
Patients in the NTZ and RTX groups initiated treatment during 2006-2016. To adjust for potential temporal effect during the 10-year span, we conducted a time-adjusted analysis (eTable 8). In the analysis, we included the following features: OR and PS model predictions from the original analysis, time indicators for DMT initiation in 2006-2008, 2009-2011 and 2012-2016, interaction between time indicators and the features selected in the OR and PS model by adaptive Lasso. As with the original analysis, we found that RTX was associated with lower relapse rate when compared to NTZ for the 1-year relapse rate (DR estimate=0.074; 95% CI=0.019-0.139, p-value .01 and adjusted p-value .05) and 2-year relapse rate (DR estimate=0.132; 95% CI=0.055-0.236, p-value <.001 and adjusted p-value .01), and nearly significant for the time-to-relapse (DR estimate=0.906; 95% CI=0.840-0.999, p-value .05 and adjusted p-value .07). Point estimates were similar to "Full EHR" analysis. The elevated adjusted p-values might be due to the additional variability from temporal adjustments.

Multiple Testing Adjustment.
Using the same multiple testing adjustment (see Supplementary Material 1.1), we calculated the adjusted pvalues after accounting for multiple testing ( Table 2). The adjusted p-values were .38 for difference in 1-year relapse rate, .08 for difference in 2-year relapse rate and .50 for relative risks of 2-year relapse. Thus, there was no significant difference in the 1-year relapse rate, 2-year relapse rate, or time-to-relapse between the DMF and the FGL group.

Adapted E-value Sensitivity Analysis to Assess Unmeasured Confounding.
To evaluate the unmeasured confounding in the registry-derived features that was explained by the highdimensional "full EHR" features that included expert-defined features, we calculated the bounding factor 1 , treating the OR and PS model predictions from the full EHR feature analysis as the hypothetical unmeasured confounders in the registry-derived analysis (eFigure 1). Here, the average bounding factors were 4.66 for 1year relapse rate, 7.35 for 2-year relapse rate, and 1.10 for time-to-relapse. Thus, unmeasured confounding in registry-derived analysis was explained by the "full EHR" analysis, particularly for 1-year and 2-year relapse rates.

Expert-defined Analysis.
In the expert-defined analysis using registry-annotated treatment groups with DMF as the reference (eTable 7), after adjusting for the covariates based on clinical knowledge, we found no significant difference in the 1-year relapse rate, 2-year relapse rate, or time-to-relapse between the DMF and the FGL group.
The analyses using the EHR RxNorm-identified treatment groups likewise did not find significant difference in the 1-year relapse rate, 2-year relapse rate, or time-to-relapse between the DMF and FGL group in either the expert-defined analysis or the high-dimensional full EHR analysis (eTable 7).

Benchmark Analysis.
For benchmark, we again performed: (1) the crude analysis without any adjustment; (2) the registry-derived analysis with feature exclusively from the registry (gender, race, age at MS diagnosis, follow-up duration, disease duration, prior relapse within 12 months and 24 months) (eTable 7). In the unadjusted / crude analysis, DMF was associated with the lower relapse when compared to FGL: 1-year relapse rate (DR estimate=0.062; 95% CI = 0.008-0.111, p-value .02 and adjusted p-value .05), 2-year relapse rate (DR estimate=0.099; 95% CI=0.036-0.158, p-value <.001 and adjusted p-value .006), and time-to-relapse (DR estimate=0.926; 95% CI=0.878-0.978, p-value <.001 and adjusted p-value .01). In the registry-derived analysis, we found no significant difference in the 1-year relapse rate, 2-year relapse rate, or time-to-relapse between the DMF and the FGL group after correction for multiple testing.

Time-adjusted Analysis.
Patients in the FGL and DMF groups initiated treatment during 2010-2016 and 2013-2016, respectively. To adjust for potential temporal effect during the nonoverlapping time span, we conducted a time-matched analysis (eTable 8). In this analysis, we excluded FGL patients who initiated DMT during 2010-2012 to match the observation window 2013-2016 for DMF since FGL received FDA approval in 2010 whereas DMF received FDA approval in 2013. We included the following features: OR and PS model predictions from the original analysis and the features selected in the OR and PS model by adaptive Lasso. As with the original analysis, we found no significant difference in the 1-year relapse rate, 2-year relapse rate, or time-to-relapse between the DMF and the FGL group.   coefficients between -0.01 and 0.01 were considered as zero. 3. Features affecting both the treatment assignment and the MS relapse were the confounders presented in eTable 1. These features had nonzero propensity score (PS) coefficient and at least one nonzero outcome regression (OR) coefficient. 4. Features affecting only the treatment assignment had nonzero PS coefficient but no nonzero OR coefficient. 5. Features affecting only relapse outcomes had at least one nonzero OR coefficient but no nonzero PS coefficient. We further divided these features (or risk factors for relapse) into three categories: those affect MS relapse in NTZ arm, those affect MS relapse in RTX arm and those affect MS relapse in both arms according to their selection in the specific OR models. 6. The relative risk (RR) of the feature in the time-to-relapse analysis was calculated as the exponential of the standardized coefficient. 7. Please refer to eTable 5 for full description of narrative features (i.e., concept unique identifiers, CUI) as well as PheCodes. Note: 1. The standardized coefficient is the product of the model coefficient from adaptive LASSO and the standard deviation of the feature. By standardizing the coefficients, we made the coefficient invariant to scaling of features. Thus, the coefficients for features of different frequencies became comparable. 2. The classification of the features was based on nonzero coefficients from adaptive LASSO. Standardized coefficients between -0.01 and 0.01 were considered as zero. 3. Features affecting both the treatment assignment and the MS relapse were the confounders presented in eTable 2. These features had nonzero propensity score (PS) coefficient and at least one nonzero outcome regression (OR) coefficient. 4. Features affecting only the treatment assignment had nonzero PS coefficient but no nonzero OR coefficient. 5. Features affecting only relapse outcomes had at least one nonzero OR coefficient but no nonzero PS coefficient. We further divided these features (or risk factors for relapse) into three categories: those affect MS relapse in DMF arm, those affect MS relapse in FGL arm and those affect MS relapse in both arms according to their selection in the specific OR models. 6. The relative risk (RR) of the feature in the time-to-relapse analysis was calculated as the exponential of the standardized coefficient. 7. Please refer to eTable 5 for description of narrative features (i.e., concept unique identifiers, CUI) as well as PheCodes. Note: 1. Patients were assigned to treatment groups either according to the CLIMB registry annotation (registry) or the RxNorm electronic prescription records in the MGB EHR system (EHR RxNorm). 2. In addition to the crude / unadjusted analysis, covariate-adjusted analyses included the following sets of features: (1) registry-derived, (2) expert-defined, (3) full EHR, which additionally included expertdefined features. 3. For each outcome, we applied two adjustments, outcome regression (OR) and propensity scores (PS), to derive the doubly robust (DR) estimation. 4. P-values were adjusted for multiple testing among the 3 analyses with the same treatment group and feature set. 5. With RTX as reference, a positive difference in the 1-year or 2-year relapse rate or a relative risk (ratio) of non-relapse rates <1 would indicate higher relapse probability of NTZ. 6. The primary analysis findings were highlighted in green and shown in Table 2