COVID-19: Beyond Tomorrow
December 10, 2020

Scientific Rigor in the Age of COVID-19

Author Affiliations
  • 1Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
  • 2Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, Tennessee
JAMA Oncol. 2021;7(2):171-172. doi:10.1001/jamaoncol.2020.6639

Less than a year has elapsed since coronavirus disease 2019 (COVID-19) emerged as a global pandemic. A search of PubMed undertaken on September 30, 2020, with sars-cov-2 as a search term yielded nearly 34 000 articles published on the topic in the scientific literature. Despite this proliferation of COVID-19 studies, randomized clinical trials (RCTs) of potential therapeutics remain scarce (only approximately 50 publications resulted from our PubMed search). Meanwhile, potential treatments with slim evidence of efficacy, or even safety, have been promoted for off-label use, hydroxychloroquine prominent among them. As such treatments have been administered to thousands of patients, "real-world data" have begun to accumulate and post hoc analyses are being performed. As scientists and clinicians, we must be smart consumers of the resultant published observational studies to ensure that evidence-based medicine remains the basis for treating our patients.

Randomization vs Observation

Randomized clinical trials are the criterion standard of evidence in medicine. A thoughtful and well-designed schema for the random assignment of study participants to a treatment or control arm ensures balanced arms at baseline, minimizing bias. A correctly chosen control group allows for valid conclusions about the effect of a treatment compared with alternative treatments or no treatment. Because the statistical inputs in an RCT are front-loaded to design the study to test a prespecified hypothesis while eliminating potential sources of bias, analysis of the data can be quite straightforward.

Real-world data, by contrast, are inherently biased at baseline (eg, sicker patients are more likely to receive treatment), and a reasonable control group is often difficult to identify. Therefore, in reviewing an observational study, we should question the reliability of its conclusions unless substantial back-loaded statistical work is evident in the data analysis. Key techniques for back-loading statistics, with particular reference to COVID-19 studies,1,2 are described here.

Prespecified Statistical Analysis Plan

All published studies should include an explicit statement of the primary hypothesis, along with its clinical and/or biological rationale, and a prespecified statistical analysis plan for testing that hypothesis. All secondary end points must also be prespecified, each with an appropriate analysis plan.

Precision Analysis

For RCTs, power analysis is part of the study design, determining the sample size needed to detect the anticipated treatment effect size with adequate statistical power. When the data are analyzed, the P value measures the probability of observing an effect at least as extreme as the one seen, assuming the null hypothesis of no treatment effect is true.
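To illustrate the front-loaded power calculation described above, the following sketch estimates power by simulation for a hypothetical two-arm trial comparing mortality proportions. The mortality rates, sample sizes, and significance threshold are all assumed values for illustration, not drawn from any study.

```python
import numpy as np

# Hypothetical sketch: simulation-based power analysis for a two-arm trial
# comparing mortality proportions with a two-proportion z test.
rng = np.random.default_rng(0)

def power(n_per_arm, p_control=0.30, p_treated=0.20, alpha_z=1.96, n_sim=2000):
    """Fraction of simulated trials in which a two-sided two-proportion
    z test at alpha = .05 rejects the null of equal mortality."""
    rejections = 0
    for _ in range(n_sim):
        x_c = rng.binomial(n_per_arm, p_control)  # deaths in control arm
        x_t = rng.binomial(n_per_arm, p_treated)  # deaths in treated arm
        p_pool = (x_c + x_t) / (2 * n_per_arm)    # pooled event rate
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
        if se == 0:
            continue
        z = abs(x_c / n_per_arm - x_t / n_per_arm) / se
        rejections += z > alpha_z
    return rejections / n_sim

for n in (100, 200, 400):
    print(n, round(power(n), 2))
```

As expected, estimated power rises with the number of patients per arm; a study sized without such a calculation may be unable to detect even a real effect.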

Effect size refers to how much better (or worse) patients fare with treatment. Precision analysis answers the question: How precise is our estimate of effect size, given the observed data? The confidence interval (CI) around the point estimate of effect size is particularly important. Every value within the CI is compatible with the observed data, not only the point estimate; thus, we should carefully consider the lower bound of the CI when assessing the clinical significance of the findings.

Note that the P value for treatment effect in an observational study provides little information. Because the study was not designed and powered to detect a prespecified effect size, even a highly significant P value (P < .001) may correspond to a clinically insignificant effect size.

An observational study of hydroxychloroquine treatment in COVID-19 conducted by Rivera et al2 provides an example of precision analysis. (The authors of this article were part of that study.) Computer simulation based on observed mortality rates was used to estimate effect size with CI.
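A minimal sketch of simulation-based precision analysis in the spirit described above (this is not the method of Rivera et al2): a percentile bootstrap places a CI around an observed risk difference. The outcome vectors and event rates below are synthetic.

```python
import numpy as np

# Illustrative sketch: percentile bootstrap CI for a risk difference.
# Binary outcomes (1 = death) are simulated; the rates are assumptions.
rng = np.random.default_rng(1)
treated = rng.binomial(1, 0.28, size=300)
control = rng.binomial(1, 0.20, size=300)

def boot_ci(a, b, n_boot=5000, alpha=0.05):
    """Percentile bootstrap CI for the risk difference mean(a) - mean(b)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each arm with replacement and recompute the difference.
        diffs[i] = (rng.choice(a, size=a.size).mean()
                    - rng.choice(b, size=b.size).mean())
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = boot_ci(treated, control)
print(f"risk difference {treated.mean() - control.mean():.3f}, "
      f"95% CI [{lo:.3f}, {hi:.3f}]")
```

The lower bound of the interval, not the point estimate alone, is what a smart consumer of the study should weigh.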

Selection of Control Arm

Use of “no-drug” as the control arm in an observational study is highly problematic given the likelihood that patients who receive no drug are less sick. A positive control arm (alternative drug) along with a negative control (no drug) is preferable. For example, the hydroxychloroquine study included these two groups, as well as a third control arm (hydroxychloroquine plus any other therapeutic).2 Many observational studies do not make a comparison to any control arm, rendering any findings highly suspect for potential confounding factors.

Multiple Imputation for Missing Data

In an observational study, data are likely drawn from the medical record, often requiring manual review or natural language processing methods to mine data from text entries. The likelihood of missing data is high. However, confining analysis to complete cases discards partial information and introduces another potential source of bias. Alternatively, including all cases, with multiple imputation for missing data,3 generates more reliable and less biased results. With imputation, missing data are filled in with values drawn from the distribution of known values. With multiple imputation, the imputation is performed more than once, yielding multiple analysis data sets that together reflect the uncertainty introduced by the missing values.

With COVID-19 studies, imputation for missing data is especially important, as data are very unlikely to be missing at random. For example, sicker patients will probably have more extensive medical record annotation. Thus, confining analysis to complete cases can be a source of bias.
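A minimal sketch of multiple imputation on synthetic data: missing values are drawn from the distribution of observed values, the analysis (here, simply a mean) is repeated on each completed data set, and the estimates are pooled. A real analysis would typically use a model-based imputer rather than this simple random draw.

```python
import numpy as np

# Sketch of multiple imputation with M = 10 imputations; data are synthetic.
rng = np.random.default_rng(2)
x = rng.normal(50, 10, size=200)
x[rng.choice(200, size=40, replace=False)] = np.nan  # 20% missing

observed = x[~np.isnan(x)]
m = 10
estimates = []
for _ in range(m):
    filled = x.copy()
    # Fill each missing entry with a random draw from the observed values.
    filled[np.isnan(filled)] = rng.choice(observed, size=np.isnan(x).sum())
    estimates.append(filled.mean())

pooled = np.mean(estimates)            # pooled point estimate across imputations
between = np.var(estimates, ddof=1)    # between-imputation variance
print(f"pooled mean {pooled:.2f}, between-imputation variance {between:.4f}")
```

The between-imputation variance is what distinguishes multiple imputation from single imputation: it carries the extra uncertainty due to the missing values into the final inference.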

Propensity Score Matching

Propensity score matching (PSM)4 fits a model that predicts treatment status (drug vs no drug) from the analysis covariates, thus identifying predictors of drug treatment. Each patient is assigned an overall score that reflects his or her likelihood (propensity) of receiving the drug. Propensity outliers are discarded to yield an analysis data set in which the score distribution is matched between treated and untreated patients, with the overall goal of balancing baseline covariate distributions between study arms.

As with imputation, PSM is especially important for COVID-19 studies. Evidence suggests that certain comorbidities have a strong impact on the severity of disease and risk of death, yet our knowledge of potential confounders remains limited. Through baseline matching, PSM mitigates analysis misinterpretation due to potential confounders.
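A hypothetical sketch of 1:1 nearest-neighbor PSM on synthetic data, in which sicker patients are constructed to be more likely to receive the drug; the covariates, caliper width, and greedy matching rule are illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic cohort: higher severity makes drug treatment more likely.
rng = np.random.default_rng(3)
n = 500
severity = rng.normal(0, 1, n)
age = rng.normal(60, 10, n)
p_treat = 1 / (1 + np.exp(-(0.8 * severity - 0.2)))
treated = rng.binomial(1, p_treat)

# Fit the propensity model: treatment status regressed on covariates.
X = np.column_stack([severity, age])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Greedy 1:1 matching on the propensity score within a caliper.
caliper = 0.05
t_idx = np.where(treated == 1)[0]
c_idx = list(np.where(treated == 0)[0])
pairs = []
for i in t_idx:
    dists = np.abs(ps[c_idx] - ps[i])
    j = int(np.argmin(dists))
    if dists[j] <= caliper:
        pairs.append((i, c_idx.pop(j)))  # consume the matched control

matched_t = [i for i, _ in pairs]
matched_c = [j for _, j in pairs]
print(f"{len(pairs)} matched pairs")
print(f"severity imbalance before: "
      f"{severity[treated == 1].mean() - severity[treated == 0].mean():.2f}, "
      f"after: {severity[matched_t].mean() - severity[matched_c].mean():.2f}")
```

Treated patients who find no control within the caliper are dropped, which is how propensity outliers leave the analysis data set; the baseline severity imbalance shrinks accordingly in the matched sample.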

Regression Modeling

For multivariable regression modeling, the number of variables to be included in the model should be dictated by the degrees of freedom in the data set, and which variables to include should be prespecified based on existing clinical and biological knowledge. Variable selection that asks the data set which variables to include in the model—in particular, univariate analysis of each potential covariate against the outcome, with selection based on univariate P value—should be avoided, as this biases (overfits) the analysis toward the particular sampling represented in that data set.

If variable selection must be performed, the Lasso,5 elastic net,6 or horseshoe7 method should be used. Regression modeling should include mediation analysis, which regresses on potential mediators of the hypothesized causal relationship, answering the question: Does treatment have a direct effect on the outcome of interest? Or is the observed effect fully or partially explained by other covariates associated with treatment?
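A brief sketch of lasso-based variable selection on synthetic data, in which only two of ten covariates truly affect the outcome; the L1 penalty shrinks most irrelevant coefficients exactly to zero, avoiding the univariate-screening approach cautioned against above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only covariates 0 and 1 carry signal.
rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# LassoCV picks the penalty strength by cross-validation, so the
# shrinkage level is not hand-tuned to this particular sample.
model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("nonzero coefficients at indices:", selected)
```

The elastic net and horseshoe methods cited above play the same role with different penalty shapes; the point is that selection is governed by a principled penalty rather than by per-variable P values.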

Sensitivity Analysis

Sensitivity analysis assesses how sensitive (or robust) findings are to changes in input. For example, in PSM, the allowable window of variation to be considered a match may be widened or narrowed. When the resulting data sets are modeled, does the estimated effect size (precision analysis) change or remain stable? Similarly, models generated using the Lasso, elastic net, or horseshoe method for shrinkage may be compared for consistency of variable coefficients from model to model.

With COVID-19 studies, sensitivity analysis is critical, especially because data sets may include relatively few events (eg, need for mechanical ventilation, death). With few events, the data are less informative, and sensitivity analysis can help determine if we are asking too much of the data.
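A minimal sketch of the coefficient-stability check described above: a lasso model is refit across a range of penalty strengths, and the coefficient of interest is compared from model to model. The data and penalty values are illustrative; column 0 plays the role of a treatment indicator.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data with a genuine effect for covariate 0.
rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=(n, 6))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Refit under several penalty strengths and inspect the coefficient.
coefs = {a: Lasso(alpha=a).fit(X, y).coef_[0] for a in (0.01, 0.05, 0.1)}
for a, c in coefs.items():
    print(f"alpha={a}: treatment coefficient {c:.3f}")
```

If the estimate drifted substantially, or vanished, as the analysis choice varied, that would be a warning that the finding is fragile; stability across reasonable inputs is the reassurance sensitivity analysis provides.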


Randomized clinical trials front-load statistical inputs in the form of study design that simplifies subsequent data analysis. With observational studies, study design cannot be relied on to support causal inference8; thus, statistical methods must be back-loaded in the data analysis to address the bias inherent in real-world data and generate reliable, reproducible findings.

Article Information

Corresponding Author: Yu Shyr, PhD, Department of Biostatistics, Vanderbilt University Medical Center, 2525 West End Ave, Ste 1100, Room 11132, Nashville, TN 37203 (yu.shyr@vumc.org).

Published Online: December 10, 2020. doi:10.1001/jamaoncol.2020.6639

Conflict of Interest Disclosures: Dr Shyr reported receiving grants from the National Institutes of Health during the conduct of the COVID-19 and Cancer Consortium study. No other disclosures were reported.

References

1. Kuderer NM, Choueiri TK, Shah DP, et al; COVID-19 and Cancer Consortium. Clinical impact of COVID-19 on patients with cancer (CCC19): a cohort study. Lancet. 2020;395(10241):1907-1918. doi:10.1016/S0140-6736(20)31187-9
2. Rivera DR, Peters S, Panagiotou OA, et al; COVID-19 and Cancer Consortium. Utilization of COVID-19 treatments and clinical outcomes among patients with cancer: a COVID-19 and Cancer Consortium (CCC19) cohort study. Cancer Discov. 2020;10(10):1514-1527. doi:10.1158/2159-8290.CD-20-0941
3. Rubin DB. Multiple Imputation for Nonresponse in Surveys. Wiley; 1987. doi:10.1002/9780470316696
4. Ho D, Imai K, King G, Stuart E. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Polit Anal. 2007;15(3):199-236. doi:10.1093/pan/mpl013
5. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267-288. doi:10.1111/j.2517-6161.1996.tb02080.x
6. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005;67(2):301-320. doi:10.1111/j.1467-9868.2005.00503.x
7. Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010;97(2):465-480. doi:10.1093/biomet/asq017
8. Collins R, Bowman L, Landray M, Peto R. The magic of randomization versus the myth of real-world evidence. N Engl J Med. 2020;382(7):674-678. doi:10.1056/NEJMsb1901642