February 2017

Leveraging Statistical Methods to Improve Validity and Reproducibility of Research Findings

Author Affiliations
  • Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, Tennessee

Copyright 2016 American Medical Association. All Rights Reserved.

JAMA Psychiatry. 2017;74(2):119-120. doi:10.1001/jamapsychiatry.2016.3730

Scientific discoveries have the profound opportunity to impact the lives of patients. They can lead to advances in medical decision making when the findings are correct, or mislead when not. We owe it to our peers, funding sources, and patients to take every precaution against false conclusions, and to communicate our discoveries with accuracy, precision, and clarity. With the National Institutes of Health’s new focus on rigor and reproducibility, scientists are returning attention to the ideas of validity and reliability. At JAMA Psychiatry, we seek to publish science that leverages the power of statistics and contributes discoveries that are reproducible and valid. Toward that end, I provide guidelines for using statistical methods: the essentials, good practices, and advanced methods.

The Essentials

Choosing the appropriate statistical approach is an essential first step. Traditional parametric statistics are based on assumptions of normally distributed data and linear associations. While many commonly used statistical tests and estimation procedures are fairly robust to violations of normality, both validity and statistical power can be substantially reduced when data are not normally distributed.1,2 This is especially important in psychiatry, where many phenomena are not normally distributed or linearly associated, for example, symptom ratings, duration of untreated illness, number of medications, and even education level. Early adoption of nonparametric statistics was thwarted by the requirement of substantial computing power; however, this is no longer an issue. Permutation tests, nonparametric tests and estimates of effect sizes, and bootstrap approaches to generate confidence intervals are now readily available in most statistical packages.
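
To illustrate how little machinery these methods now require, the following is a minimal Python sketch (standard library only) of a two-sided permutation test for a difference in means and a percentile bootstrap confidence interval for a difference in medians. The function names, the number of resamples, and the skewed symptom ratings are hypothetical and chosen only for illustration; they are not drawn from the studies cited here.

```python
import random
import statistics


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimated p value above zero.
    return (extreme + 1) / (n_perm + 1)


def bootstrap_median_diff_ci(x, y, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(x) - median(y)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_x = rng.choices(x, k=len(x))  # sample with replacement
        resampled_y = rng.choices(y, k=len(y))
        diffs.append(statistics.median(resampled_x) - statistics.median(resampled_y))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical, right-skewed symptom ratings for two groups.
group_a = [1, 2, 2, 3, 3, 4, 5, 9]
group_b = [6, 8, 9, 10, 11, 12, 15, 22]

print("permutation p value:", permutation_p_value(group_a, group_b))
print("95% bootstrap CI for median difference:",
      bootstrap_median_diff_ci(group_a, group_b))
```

Neither function assumes normality. The permutation test assumes only that group labels are exchangeable under the null hypothesis, and the bootstrap interval assumes that the sample is a reasonable stand-in for the population.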

Scientists often create categories from continuous data. For example, categories of depressed/not depressed are created from continuous measures of depression, like the Patient Health Questionnaire-9 (PHQ-9) or Beck Depression Inventory (BDI). While categories can be helpful for diagnosis and treatment selection, imposing artificial boundaries weakens statistical power and increases the likelihood of a spurious finding, especially when effect sizes are modest. See MacCallum et al3 for an excellent discussion of the dangers of the median split. If the continuous measure is both valid and reliable, but the category is critical for clinical decision making, then hypothesis testing should still be based on the continuous measure with descriptive statistics and effect sizes presented for both the categorical and continuous data.
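
The cost of dichotomizing can be seen in a small simulation. The sketch below, which assumes bivariate normal data with a hypothetical true correlation of 0.3, compares the correlation obtained from a continuous predictor with the correlation obtained after a median split of that same predictor. All names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation computed from first principles."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_median_split(true_r=0.3, n=20000, seed=0):
    """Return (r with continuous x, r after a median split of x)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Construct y so that corr(x, y) equals true_r in the population.
        y = true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    cutoff = statistics.median(xs)
    split = [1.0 if x > cutoff else 0.0 for x in xs]  # "high" vs "low" group
    return pearson_r(xs, ys), pearson_r(split, ys)


r_continuous, r_split = simulate_median_split()
print("r with continuous predictor:", round(r_continuous, 3))
print("r after median split:       ", round(r_split, 3))
print("ratio:                      ", round(r_split / r_continuous, 3))
```

Under these assumptions the dichotomized correlation is expected to be roughly 80% of the continuous one, so a study analyzing the split variable would need a larger sample to reach the same power.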

Appropriate sample sizes are critical for reliable, reproducible, and valid results. Evidence generated from small sample sizes is especially prone to error, both false negatives (type II errors) due to inadequate power and false positives (type I errors) due to biased samples. Especially hazardous are unexpected significant findings based on small samples, for example pre-post treatment comparisons in small pilot studies. Spurious findings can generate substantial interest and may lead researchers to invest significant resources to pursue them further. The minimum necessary sample size for any statistical test is based on the number of parameters to be tested (or the number of tests run). For example, Maxwell’s simulations4 suggest a minimum of 20:1 (participants:predictors) for regression models, although Maxwell himself cautions against using rules of thumb. A simulation study of event-related functional magnetic resonance imaging data shows that the reliability of cluster-level findings across the whole brain requires at least 20 participants (but ideally at least 27) per group.5 The minimal sample size for reproducibility is often much too small for adequate statistical power or precise estimates of effect size.
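
A brief simulation can make the instability of small samples concrete. The sketch below assumes a hypothetical true standardized effect of 0.3 and normally distributed outcomes, then repeats a 2-group study many times at two sample sizes to show how widely the estimated effect scatters. The function names and parameter values are assumptions made for illustration, not recommendations.

```python
import math
import random
import statistics


def estimated_d(n_per_group, true_d, rng):
    """Simulate one 2-group study and return its estimated Cohen d."""
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    # Equal group sizes, so the pooled variance is the simple average.
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


def spread_of_estimates(n_per_group, true_d=0.3, n_studies=2000, seed=0):
    """Mean and SD of estimated d across many simulated studies."""
    rng = random.Random(seed)
    estimates = [estimated_d(n_per_group, true_d, rng) for _ in range(n_studies)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


for n in (10, 100):
    mean_d, sd_d = spread_of_estimates(n)
    print(f"n = {n:3d} per group: mean d = {mean_d:.2f}, SD of d = {sd_d:.2f}")
```

With 10 participants per group, individual studies can easily report an effect several times larger than the true value or one in the opposite direction, which is why an unexpected significant result from a small pilot deserves caution.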

Good Practices

In the era of huge samples (combined data sets from collaborative projects, epidemiological studies that report on the populations of entire countries, and genome-wide association studies of tens of thousands of individuals), it is crucial to report effect sizes. Effect sizes, estimates of population parameters that are independent of sample size and other design decisions, provide a tool for determining whether a finding is not only statistically significant but also clinically significant. Despite the prominence of the Cohen d (or Hedges g), there are many different effect sizes. For 2-group comparisons, alternatives to d or g include the success rate difference and its reciprocal, the number needed to treat (or take). For correlation, there are the Pearson r, Spearman r, and Kendall τ. Of note, some fields, like neuroimaging, are still considering the most appropriate way to compute effect sizes.
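
The 2-group effect sizes named above can be computed in a few lines. The sketch below is one possible implementation under stated assumptions: higher scores indicate a better outcome, the success rate difference is taken as the proportion of treated-control pairs in which the treated participant does better minus the proportion in which the control participant does better, and the Hedges g uses the common approximate small-sample correction. The function names and data are hypothetical.

```python
import math
import statistics


def cohens_d(treated, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * statistics.variance(treated)
        + (n2 - 1) * statistics.variance(control)
    ) / (n1 + n2 - 2)
    return (statistics.fmean(treated) - statistics.fmean(control)) / math.sqrt(pooled_var)


def hedges_g(treated, control):
    """Cohen d with the approximate small-sample bias correction."""
    df = len(treated) + len(control) - 2
    return cohens_d(treated, control) * (1 - 3 / (4 * df - 1))


def success_rate_difference(treated, control):
    """P(treated > control) - P(treated < control) over all pairs; ties count for neither."""
    wins = sum(t > c for t in treated for c in control)
    losses = sum(t < c for t in treated for c in control)
    return (wins - losses) / (len(treated) * len(control))


def number_needed_to_treat(srd):
    """Reciprocal of the success rate difference (negative values indicate harm)."""
    return math.inf if srd == 0 else 1 / srd


# Hypothetical outcome scores (higher = better).
treated = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]

srd = success_rate_difference(treated, control)
print("Cohen d: ", round(cohens_d(treated, control), 3))
print("Hedges g:", round(hedges_g(treated, control), 3))
print("SRD:     ", srd)
print("NNT:     ", number_needed_to_treat(srd))
```

Because the success rate difference depends only on the ordering of scores, it remains interpretable when outcomes are skewed or ordinal, where d and g can mislead.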

As lucidly described in a recent JAMA Psychiatry editorial by Statistical Editor Helena Chmura Kraemer, PhD,6 covariates must be justified and selected a priori. Including covariates to adjust for baseline differences between groups is inherently post hoc and will increase the likelihood of false-positive results. When using covariates, both the unadjusted and adjusted models must be presented so that the impact of the covariates on the final model can be fully evaluated.

Advanced Methods

The most unbiased and valid estimates of effect size are generated with independent replication across multiple samples. While external validity is the ultimate goal, an important first step, internal validity, can be established within a single sample using cross-validation methods. Cross-validation is commonly used in model prediction analyses, but these methods have yet to proliferate to other fields. In the simplest form of cross-validation, data are randomly split into 2 subsets; the model is generated from the first set and then validated with the second set. Another variation, leave-one-out (LOO) cross-validation, uses the remaining N − 1 observations to generate the estimate and tests the model on the left-out observation, repeating this for each observation in turn. However, 5-fold and 10-fold cross-validations tend to be more stable and are therefore preferred to LOO. Cross-validation methods provide final parameter estimates that are less biased and more likely to replicate in independent future studies and therefore should be used whenever possible.
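
The k-fold procedure described above can be sketched as follows. This minimal example assumes a simple linear regression with one predictor and uses out-of-fold mean squared error as the validation metric; both choices, along with the function names and simulated data, are illustrative assumptions rather than a prescribed method.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Mean squared prediction error on held-out folds."""
    squared_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training folds only, then predict the held-out fold.
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        squared_errors.extend(
            (ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx
        )
    return statistics.fmean(squared_errors)


# Hypothetical data: a linear trend plus noise.
rng = random.Random(1)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 3) for x in xs]

print("5-fold CV MSE: ", round(cross_validated_mse(xs, ys, k=5), 2))
print("10-fold CV MSE:", round(cross_validated_mse(xs, ys, k=10), 2))
```

Setting k equal to the number of observations yields LOO. The essential point is that every prediction is made for an observation the model did not see during fitting.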

The proliferation of affordable and readily available statistical software has supported the exponential increase in sophisticated data analysis methods that have led to novel discoveries. However, with the ease of these sophisticated “black-box” programs comes the need for extra caution, as illustrated by recent reports of bugs in 2 widely used neuroimaging software packages.7,8 Russ Poldrack recently shared his experience in “Reproducible Analyses in the MyConnectome Project,”9 in which he describes how a bug in the Linux version of an R package produced different results despite the same data and analytic pipeline. He concludes, “the experience showed just how fragile our results can be when we rely upon complex black-box analysis software.” Thus, when using black-box programs, let the user beware.

Selective reporting creates bias. The registration of clinical trials before enrollment of participants is an important step forward. However, other types of studies also suffer from bias. Prospective registration of meta-analyses (eg, Cochrane Prospective Meta-Analysis Methods group or PROSPERO) and observational studies will further enhance transparency. Reporting the results for all variables examined, either in the manuscript or in the Supplement, allows an evaluation of the significant findings within the context of all findings. Increasing transparency will provide a growing collection of all findings, ultimately leading to more accurate meta-analyses and valid conclusions.


In summary, it is our collective responsibility to ensure that we attend to the essentials, use good practices, and strive to use methods that will enhance both reproducibility and validity. Over the next several months, we will publish several Viewpoint articles that address critical issues related to statistics and research methods. With proper attention to data and statistical methods, we will advance psychiatry.

Article Information

Corresponding Author: Jennifer Urbano Blackford, PhD, Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, 1601 23rd Ave S, Ste 3057J, Nashville, TN 37212 (jennifer.blackford@vanderbilt.edu).

Published Online: December 28, 2016. doi:10.1001/jamapsychiatry.2016.3730

Conflict of Interest Disclosures: None reported.

Additional Contributions: Special thanks to Stephan Heckers, MD, and Helena Kraemer, PhD, for feedback on the manuscript.

References

1. Sawilowsky SS, Blair RC. A more realistic look at the robustness and type II error properties of the t test to departures from population normality. Psychol Bull. 1992;111(2):352-360. doi:10.1037/0033-2909.111.2.352
2. Tanizaki H. Power comparison of non-parametric tests: small-sample properties from Monte Carlo experiments. J Appl Stat. 1997;24(5):603-632. doi:10.1080/02664769723576
3. MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychol Methods. 2002;7(1):19-40.
4. Maxwell SE. Sample size and multiple regression analysis. Psychol Methods. 2000;5(4):434-458.
5. Thirion B, Pinel P, Mériaux S, Roche A, Dehaene S, Poline JB. Analysis of a large fMRI cohort: statistical and methodological issues for group analyses. Neuroimage. 2007;35(1):105-120.
6. Kraemer HC. A source of false findings in published research studies: adjusting for covariates. JAMA Psychiatry. 2015;72(10):961-962.
7. Eickhoff SB, Laird AR, Fox PM, Lancaster JL, Fox PT. Implementation errors in the GingerALE software: description and recommendations. Hum Brain Mapp. 2016;00(July).
8. Eklund A, Nichols TE, Knutsson H. Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proc Natl Acad Sci U S A. 2016;113(28):7900-7905.
9. Poldrack R. Reproducible analyses in the MyConnectome project. http://www.russpoldrack.org/2015/12/reproducible-analysis-in-myconnectome.html. Published 2015. Accessed September 29, 2016.