Less than a year has elapsed since coronavirus disease 2019 (COVID-19) emerged as a global pandemic. A search of PubMed undertaken on September 30, 2020, with SARS-CoV-2 as a search term yielded nearly 34 000 articles published on the topic in the scientific literature. Despite this proliferation of COVID-19 studies, randomized clinical trials (RCTs) of potential therapeutics remain scarce (only approximately 50 publications resulted from our PubMed search). Meanwhile, potential treatments with slim evidence of efficacy—or even safety—have been promoted for off-label use, hydroxychloroquine prominent among them. As such treatments have been administered to thousands of patients, “real-world data” have begun to accumulate and post hoc analyses are being performed. As scientists and clinicians, we must be smart consumers of the resultant published observational studies to ensure that evidence-based medicine remains the basis for treating our patients.
Randomization vs Observation
Randomized clinical trials are the criterion standard of evidence in medicine. A thoughtful and well-designed schema for the random assignment of study participants to a treatment or control arm ensures balanced arms at baseline, minimizing bias. A correctly chosen control group allows for valid conclusions about the effect of a treatment compared with alternative treatments or no treatment. Because the statistical inputs in an RCT are front-loaded to design the study to test a prespecified hypothesis while eliminating potential sources of bias, analysis of the data can be quite straightforward.
Real-world data, by contrast, are inherently biased at baseline (eg, sicker patients are more likely to receive treatment), and a reasonable control group is often difficult to identify. Therefore, in reviewing an observational study, we should question the reliability of its conclusions unless we see rigorous back-loaded statistical methods in the data analysis. Key techniques for back-loading statistics, with particular reference to COVID-19 studies,1,2 are described here.
Prespecified Statistical Analysis Plan
All published studies should include an explicit statement of the primary hypothesis, along with the clinical and/or biological rationale. Studies should also include a prespecified statistical analysis plan for testing this hypothesis. All secondary end points also must be prespecified with an appropriate analysis plan.
For RCTs, power analysis is part of the study design and determines the sample size needed to detect the anticipated treatment effect with adequate statistical power. When the data are analyzed, the P value then quantifies how incompatible the observed results are with the null hypothesis of no treatment effect, and its interpretation is anchored by the effect size prespecified in the study design.
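For illustration, a minimal power calculation of this kind might be sketched in Python with statsmodels; the 30% vs 20% event rates, 5% two-sided significance level, and 80% power below are hypothetical planning assumptions, not values from any cited study.

```python
# Sketch: prospective sample-size calculation for a two-arm trial with a
# binary end point (hypothetical planning values, not from a cited study).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_control, p_treated = 0.30, 0.20                      # anticipated event rates
effect = proportion_effectsize(p_treated, p_control)   # Cohen's h

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size per arm: {n_per_arm:.0f}")
```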
Effect size refers to how much better (or worse) patients fare with treatment. Precision analysis answers the question: How precise is our estimate of effect size, given the observed data? The confidence interval (CI) around the point estimate of effect size is particularly important. Any value within the CI is compatible with the observed data, regardless of the point estimate; thus, we should carefully consider the lower bound of the CI to assess the significance of the findings.
Note that the P value for treatment effect in an observational study provides little information. Because the study was not designed and powered to detect a prespecified effect size, even an extreme P value (eg, P < .001) may correspond to a clinically insignificant effect size.
An observational study of hydroxychloroquine treatment in COVID-19 conducted by Rivera et al2 provides an example of precision analysis. (The authors of this article were part of that study.) Computer simulation based on observed mortality rates was used to estimate the effect size and its CI.
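The general idea can be sketched with a simple bootstrap in Python; the event counts below are invented for illustration, and this code is not the simulation approach used in the published study.

```python
# Sketch: bootstrap precision analysis for a risk difference in mortality.
# The counts are hypothetical and do not come from the CCC19 cohort.
import numpy as np

rng = np.random.default_rng(0)

deaths_treated, n_treated = 40, 200     # hypothetical treated arm
deaths_control, n_control = 25, 200     # hypothetical control arm

treated = np.zeros(n_treated); treated[:deaths_treated] = 1
control = np.zeros(n_control); control[:deaths_control] = 1

# Resample each arm with replacement and recompute the risk difference
diffs = [rng.choice(treated, n_treated).mean() - rng.choice(control, n_control).mean()
         for _ in range(10_000)]
point = treated.mean() - control.mean()
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Risk difference {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

The lower bound of such an interval, not just the point estimate, indicates how small the true effect could plausibly be.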
Use of “no-drug” as the control arm in an observational study is highly problematic given the likelihood that these patients are less sick. A positive control arm (alternative drug) along with a negative control (no drug) is preferable. For example, the hydroxychloroquine study included these 2 groups, as well as a third control arm (hydroxychloroquine plus any other therapeutic).2 Many observational studies do not make a comparison with any control arm, rendering any findings highly suspect for potential confounding factors.
Multiple Imputation for Missing Data
In an observational study, data are likely drawn from the medical record, often requiring manual review or natural language processing methods to mine data from text entries. The likelihood of missing data is high. However, confining analysis to complete cases discards partial information and introduces another potential source of bias. Alternatively, including all cases, with multiple imputation for missing data,3 generates more reliable and less biased results. With imputation, missing data are filled in with values drawn from the distribution of known values. With multiple imputation, the imputation is performed more than once, yielding multiple analysis data sets that reflect the appropriate variation (uncertainty) in the imputed values.
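As a minimal sketch, multiple imputation by chained equations might be run with the statsmodels MICE implementation as below; the data frame, variable names, and missingness pattern are hypothetical, and a real study would prespecify both the imputation and analysis models.

```python
# Sketch: multiple imputation by chained equations (MICE) with statsmodels.
# The simulated data frame and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "crp": rng.normal(50, 20, n),                  # lab value, made partly missing below
    "treated": rng.integers(0, 2, n).astype(float),
})
df["died"] = (rng.random(n) < 0.2).astype(float)
df.loc[rng.random(n) < 0.3, "crp"] = np.nan        # ~30% missing values

imp = mice.MICEData(df)                            # chained-equation imputation model
model = mice.MICE("died ~ treated + age + crp", sm.Logit, imp)
results = model.fit(10, 20)                        # 10 burn-in cycles, 20 imputed data sets
print(results.summary())                           # estimates pooled across imputations
```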
With COVID-19 studies, imputation for missing data is especially important, as data are very unlikely to be missing at random. For example, sicker patients will probably have more extensive medical record annotation. Thus, confining analysis to complete cases can be a source of bias.
Propensity Score Matching
Propensity score matching (PSM)4 fits a model that predicts treatment assignment (drug vs no drug) from the analysis covariates, thus identifying predictors of drug treatment. Each patient is assigned an overall score that reflects his or her likelihood (propensity) of receiving the drug. Propensity outliers are discarded to yield an analysis data set in which the score distribution is matched between treated and untreated patients, with the overall goal of balancing baseline covariate distributions between study arms.
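One common implementation is a logistic propensity model followed by nearest-neighbor matching with a caliper; the helper function, column names, and 0.2-SD caliper below are illustrative assumptions, and a real analysis would also check covariate balance after matching.

```python
# Sketch: 1:1 nearest-neighbor propensity score matching (with replacement)
# on the logit of the propensity score, with a caliper to discard poor matches.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(df: pd.DataFrame, treatment: str, covariates: list[str],
                     caliper: float = 0.2) -> pd.DataFrame:
    """Return a matched data set; caliper is in SDs of the logit propensity score."""
    X, z = df[covariates].to_numpy(), df[treatment].to_numpy()
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    logit = np.log(ps / (1 - ps))
    treated, control = np.where(z == 1)[0], np.where(z == 0)[0]

    nn = NearestNeighbors(n_neighbors=1).fit(logit[control].reshape(-1, 1))
    dist, idx = nn.kneighbors(logit[treated].reshape(-1, 1))

    keep = dist.ravel() <= caliper * logit.std()      # drop propensity outliers
    matched = np.concatenate([treated[keep], control[idx.ravel()[keep]]])
    return df.iloc[matched]
```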
As with imputation, PSM is especially important for COVID-19 studies. Evidence suggests that certain comorbidities have a strong impact on the severity of disease and risk of death, yet our knowledge of potential confounders remains limited. Through baseline matching, PSM mitigates analysis misinterpretation due to potential confounders.
Multivariable Regression Modeling
For multivariable regression modeling, the number of variables to be included in the model should be dictated by the degrees of freedom in the data set, and which variables to include should be prespecified based on existing clinical and biological knowledge. Variable selection that asks the data set which variables to include in the model—in particular, univariate analysis of each potential covariate against the outcome, with selection based on univariate P value—should be avoided, as this biases (overfits) the analysis toward the particular sampling represented in that data set.
If variable selection must be performed, a shrinkage method such as the lasso,5 elastic net,6 or horseshoe7 should be used. Regression modeling should also include mediation analysis, which regresses on potential mediators of the hypothesized causal relationship, answering the question of whether treatment has a direct effect on the outcome of interest or whether the observed effect is fully or partially explained by other covariates associated with treatment.
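Penalized selection of this kind can be sketched with scikit-learn; the simulated design matrix below is purely illustrative, and a horseshoe prior would require a Bayesian modeling library and is not shown.

```python
# Sketch: variable selection with lasso and elastic net penalties (scikit-learn).
# X and y are simulated stand-ins for a prepared covariate matrix and outcome.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))                      # 20 hypothetical covariates
y = 1.5 * X[:, 0] - X[:, 1] + rng.normal(size=300)  # only 2 truly predictive

Xs = StandardScaler().fit_transform(X)              # penalties assume scaled inputs

lasso = LassoCV(cv=5).fit(Xs, y)
enet = ElasticNetCV(cv=5, l1_ratio=[0.2, 0.5, 0.8]).fit(Xs, y)

# Coefficients shrunk exactly to zero are dropped from the model
print("lasso keeps covariates:", np.flatnonzero(lasso.coef_))
print("elastic net keeps covariates:", np.flatnonzero(enet.coef_))
```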
Sensitivity Analysis
Sensitivity analysis assesses how sensitive (or robust) findings are to changes in input. For example, in PSM, the allowable window of variation to be considered a match may be widened or narrowed. When the resulting data sets are modeled, does the estimated effect size (precision analysis) change or remain stable? Similarly, models generated using the lasso, elastic net, or horseshoe method for shrinkage may be compared for consistency of variable coefficients from model to model.
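One such check might rerun the matching over several calipers and compare the estimates, as in the brief sketch below, which reuses the hypothetical propensity_match helper and data frame from the earlier sketches.

```python
# Sketch: sensitivity of the matched estimate to the matching caliper,
# reusing the hypothetical propensity_match helper and data frame df above.
for caliper in (0.1, 0.2, 0.4):
    m = propensity_match(df, "treated", ["age"], caliper=caliper)
    risk_diff = (m.loc[m["treated"] == 1, "died"].mean()
                 - m.loc[m["treated"] == 0, "died"].mean())
    print(f"caliper {caliper}: matched n = {len(m)}, risk difference = {risk_diff:.3f}")
```

Estimates that remain stable across calipers support the robustness of the finding.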
With COVID-19 studies, sensitivity analysis is critical, especially because data sets may include relatively few events (eg, need for mechanical ventilation, death). With few events, the data are less informative, and sensitivity analysis can help determine if we are asking too much of the data.
Randomized clinical trials front-load statistical inputs in the form of study design that simplifies subsequent data analysis. With observational studies, study design cannot be relied on to support causal inference8; thus, statistical methods must be back-loaded in the data analysis to address the bias inherent in real-world data and generate reliable, reproducible findings.
Corresponding Author: Yu Shyr, PhD, Department of Biostatistics, Vanderbilt University Medical Center, 2525 West End Ave, Ste 1100, Room 11132, Nashville, TN 37203 (yu.shyr@vumc.org).
Published Online: December 10, 2020. doi:10.1001/jamaoncol.2020.6639
Conflict of Interest Disclosures: Dr Shyr reported receiving grants from the National Institutes of Health during the conduct of the COVID-19 and Cancer Consortium study. No other disclosures were reported.
References
2. Rivera DR, Peters S, Panagiotou OA, et al; COVID-19 and Cancer Consortium. Utilization of COVID-19 treatments and clinical outcomes among patients with cancer: a COVID-19 and Cancer Consortium (CCC19) cohort study. Cancer Discov. 2020;10(10):1514-1527. doi:10.1158/2159-8290.CD-20-0941
4. Ho D, Imai K, King G, Stuart E. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Polit Anal. 2007;15(3):199-236. doi:10.1093/pan/mpl013