Statistics is not merely about distributions or probabilities, although these are part of the discipline. In the broadest sense, statistics is the use of numbers to quantify relationships in data and thereby answer questions. Statistical methods allow the researcher to reduce a spreadsheet of data to counts, means, proportions, rates, risk ratios, rate differences, and other quantities that convey information. We believe that the presentation of numerical information will be enhanced if authors keep in mind that their goal is to clarify and explain. We offer suggestions here for the presentation of statistical information to the readers of general medical journals.
Numbers that can be omitted
Most statistical software packages offer a cornucopia of output. Authors need to be judicious in selecting what should be presented. A chi-square test will typically produce the chi-square statistic, the degrees of freedom in the data, and the P value for the test. In general, chi-square statistics, t statistics, F statistics, and similar values should be omitted. Degrees of freedom are not needed.
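For illustration, here is a minimal sketch in Python (with hypothetical counts) of the output a chi-square test produces; following the advice above, only the quantities that inform the reader, such as the proportions themselves, would appear in the article.

```python
# A minimal sketch, with hypothetical counts, of typical chi-square output.
from scipy.stats import chi2_contingency

table = [[30, 70],   # group 1: events, non-events (hypothetical)
         [45, 55]]   # group 2: events, non-events (hypothetical)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, P = {p:.2f}")
# Report only what informs the reader: e.g., 30% vs 45% had the event;
# the chi-square statistic and degrees of freedom can be omitted.
```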
Even the P value can usually be omitted. In a study that compares groups, it is customary to present a table that allows the reader to compare the groups with regard to variables such as age, sex, or health status. A case-control study typically compares the cases and controls, whereas a cohort study typically compares those exposed and not exposed. Sometimes authors use P values to compare these study groups. We suggest that these P values should be omitted. In a case-control or cohort study, there is no hypothesis that the 2 groups are similar. We are interested in a comparison because differences between groups may confound estimates of association. If the study sample is large, small differences that have little confounding influence may be statistically significant. If the study sample is small, large differences that are not statistically significant may be important confounders.1 The bias due to confounding cannot be judged by statistical significance; we usually judge this based on whether adjustment for the confounding variable changes the estimate of association.2-6
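To make the last point concrete, here is a minimal sketch (with hypothetical stratified counts) of judging confounding by comparing a crude estimate with a Mantel-Haenszel adjusted estimate, rather than by a P value for the baseline comparison.

```python
# Strata: (exposed cases, exposed total, unexposed cases, unexposed total)
strata = [(40, 100, 10, 50),    # stratum 1 (hypothetical)
          (5, 50, 20, 100)]     # stratum 2 (hypothetical)

exp_cases = sum(a for a, n1, b, n0 in strata)
exp_total = sum(n1 for a, n1, b, n0 in strata)
unexp_cases = sum(b for a, n1, b, n0 in strata)
unexp_total = sum(n0 for a, n1, b, n0 in strata)
crude_rr = (exp_cases / exp_total) / (unexp_cases / unexp_total)

# Mantel-Haenszel risk ratio, adjusted for the stratifying variable
num = sum(a * n0 / (n1 + n0) for a, n1, b, n0 in strata)
den = sum(b * n1 / (n1 + n0) for a, n1, b, n0 in strata)
adjusted_rr = num / den

print(f"crude RR = {crude_rr:.2f}, adjusted RR = {adjusted_rr:.2f}")
# A material change between the two suggests confounding, regardless of
# whether the confounder differs "significantly" between the groups.
```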
Even in a randomized trial, in which it is hoped that the compared groups will be similar as a result of randomization, the use of P values for baseline comparisons is not appropriate. If randomization was done properly, then the only reason for any differences must be chance.7 It is the magnitude of any difference, not its statistical significance, that may bias study results and that may need to be accounted for in the analysis.8
Much regression output serves little purpose in medical research publication; this usually includes the intercept coefficient, R2, log likelihood, standard errors, and P values. Estimates of variance explained (such as R2), correlation coefficients, and standardized regression coefficients (sometimes called effect size) are not useful measures of causal associations or agreement and should not be presented as the main results of an analysis.9-12 These measures depend not only on the size of any biological effect of an exposure, but also on the distribution of the exposure in the population. Because this distribution can be influenced by choice of study population and study design, it makes little sense to standardize on a measure that can be made smaller or larger by the investigator. Useful measures for causal associations, such as risk ratios and rate differences, are discussed below. Useful measures of agreement, such as kappa statistics, the intraclass correlation coefficient, the concordance coefficient, and other measures, are discussed in many textbooks and articles.11-20
Global tests of regression model fit are not helpful in most articles. Investigators can use these tests to check that a model does not have major conflicts with the data, but they should be aware that these tests have low power to detect problems.21 If the test yields a small P value, which suggests a problem with the model, investigators need to consider what this means in the context of their study. But a large P value cannot reassure authors or readers that the model presented is correct.5
Complex formulas or mathematical notation, such as log likelihood expressions or symbolic expressions for regression models, are not useful for general medical readers.
Numbers that should be included
Several authors, including the International Committee of Medical Journal Editors, have urged that research articles present measures of association, such as risk ratios, risk differences, rate ratios, or differences in means, along with an estimate of the precision for these measures, such as a 95% confidence interval.1,22-29
Imagine that we compared the outcomes of patients who received treatment A with the outcomes of other patients. If we find that the 2-sided P value for this comparison is .02, we conclude that the probability of obtaining the observed difference (or a greater difference) is 1 in 50 if, in the population from which we selected our study subjects, the treatment actually had no effect on the outcomes. But the P value does not tell readers if those who received treatment A did better or worse compared with those who did not. Nor does it tell readers how much better or worse one group did compared with the other. However, if we report that the risk ratio for a bad outcome was 0.5 among those who received treatment A, compared with others, readers can see both the direction (beneficial) and the size (50% reduction in the risk of a bad outcome) of treatment A's association with bad outcomes. It is also useful, when possible, to show the proportion of each group that had a bad outcome or to use something similar, such as a Kaplan-Meier survival curve. If we report that the 95% confidence interval around the risk ratio of 0.5 was 0.3 to 0.7, readers can see that the null hypothesis of no association (a risk ratio of 1.0) is unlikely and that risk ratios of 0.9 or 0.1 are also unlikely. If we report that the risk ratio was 0.5 (95% confidence interval, 0.2 to 1.3), a reader can see that the estimate of 0.5 is imprecise and the data are compatible with no association between treatment and outcome (a risk ratio of 1.0) and are even compatible with a harmful association (a risk ratio greater than 1.0). A point estimate and confidence interval convey more information than the P value for a test of the hypothesis of no association. Similarly, means can be compared by presenting their differences with a 95% confidence interval for the difference.
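As an illustration, here is a minimal sketch in Python (with hypothetical counts) of reporting a point estimate and 95% confidence interval, using the standard large-sample interval for the log risk ratio.

```python
# A minimal sketch, with hypothetical counts, of a risk ratio with its
# 95% confidence interval, reported instead of a bare P value.
import math

a, n1 = 15, 100   # bad outcomes / total, treatment A (hypothetical)
b, n0 = 30, 100   # bad outcomes / total, comparison group (hypothetical)

rr = (a / n1) / (b / n0)
se_log_rr = math.sqrt(1/a - 1/n1 + 1/b - 1/n0)
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(f"risk ratio = {rr:.1f} (95% CI, {lo:.1f} to {hi:.1f})")
# Readers see the direction and size of the association and its precision.
```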
We acknowledge that sometimes P values may serve a useful purpose,30 but we recommend that point estimates and confidence intervals be used in preference to P values in most instances. If P values are given, please use 2 digits of precision (eg, P = .82). Give 3 digits for values between .01 and .001 and report smaller values as P<.001. Do not reduce P values to "not significant" or "NS."
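A minimal sketch of these formatting rules as a small helper function (the function name is ours; note the journal-style omission of the leading zero):

```python
def format_p(p: float) -> str:
    """Format a P value: 2 digits, 3 digits between .01 and .001,
    and P<.001 for anything smaller."""
    if p < 0.001:
        return "P<.001"
    if p < 0.01:
        return f"P = {p:.3f}".replace("0.", ".")
    return f"P = {p:.2f}".replace("0.", ".")

print(format_p(0.8231))   # P = .82
print(format_p(0.0042))   # P = .004
print(format_p(0.00009))  # P<.001
```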
In tables that compare study groups, it is usually helpful to include both counts (of patients or events) and column percentages (Table 1). In a case-control study, there are usually column headings for the cases and controls. For clinical trials or cohort studies, the column headings are typically the trial arms or the exposure categories. Listing column percentages allows the reader to easily compare the distribution of data between groups. Do not give row percentages.
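For illustration, a minimal sketch (with a hypothetical data frame) of producing counts and column percentages so that each column sums to 100% and the groups can be compared directly:

```python
import pandas as pd

df = pd.DataFrame({"group": ["case"] * 6 + ["control"] * 6,
                   "sex":   ["F", "F", "M", "M", "M", "F",
                             "F", "M", "F", "F", "F", "M"]})

counts = pd.crosstab(df["sex"], df["group"])
col_pct = pd.crosstab(df["sex"], df["group"], normalize="columns") * 100
print(counts)
print(col_pct.round(0))  # column percentages, not row percentages
```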
In tables of column percentages, do not include a row for counts and percentages of missing data. Such a row distorts the other percentages in the same column, making it difficult for readers to compare known information across columns. Records with missing data are best omitted for each variable, in the hope that the distribution of the variable among those with missing data is similar to its distribution among those with known data. The amount of missing data should be described in the methods section. If much data is missing for a variable, say more than 5%, a table footnote can point this out (Table 1).
Odds ratios vs risk ratios
In a case-control study, associations are commonly estimated using odds ratios. Because case-control studies are typically done when the study outcome is uncommon in the population from which the cases and controls arose, odds ratios will approximate risk ratios.31,32 Logistic regression is typically used to adjust odds ratios to control for potential confounding by other variables.
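The approximation can be checked numerically. A minimal sketch with hypothetical risks; the common-outcome case previews the problem discussed next:

```python
def odds(p):
    return p / (1 - p)

# Rare outcome: risks of 2% vs 1%
print(0.02 / 0.01, odds(0.02) / odds(0.01))   # RR = 2.00, OR ~ 2.02

# Common outcome: risks of 40% vs 20%
print(0.40 / 0.20, odds(0.40) / odds(0.20))   # RR = 2.00, OR ~ 2.67
```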
In clinical trials or cohort studies, however, the outcome may be common. If more than 10% of the study subjects have the outcome, or if the baseline risk of disease is high in a subgroup that contributes a substantial portion of subjects with the outcome, then the odds ratio may be considerably further from 1.0 than the risk ratio.6,33 This may result in a misinterpretation of the study results by authors, editors, or readers.34-40 One option is to do the analysis using logistic regression and convert the odds ratios to risk ratios.41-43 Another option is to estimate a risk ratio using Poisson regression, negative binomial regression, or a generalized linear model with a log link and binomial error distribution.44-55 Whatever choice is made, we urge authors not to interpret odds ratios as if they were risk ratios in studies where this interpretation is not warranted.
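As one illustration of the latter option, here is a minimal sketch (with simulated data) of a log-binomial model; it assumes the statsmodels package, and harder problems may need starting values for the model to converge.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
exposed = rng.integers(0, 2, n)
risk = np.where(exposed == 1, 0.30, 0.15)   # true risk ratio = 2.0
outcome = rng.binomial(1, risk)

# Generalized linear model with a log link and binomial error distribution
X = sm.add_constant(exposed)
fit = sm.GLM(outcome, X, family=sm.families.Binomial(
    link=sm.families.links.Log())).fit()
rr = np.exp(fit.params[1])            # exponentiated coefficient = risk ratio
lo, hi = np.exp(fit.conf_int()[1])
print(f"risk ratio = {rr:.2f} (95% CI, {lo:.2f} to {hi:.2f})")
```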
Power calculations after the results are known
Reporting of power calculations makes little sense once the study has been done.56,57 We think that reviewers who request such calculations are misguided. We can never know what a study would have found if it had been larger. If a study reported an association with a 95% confidence interval that excludes 1.0, then the study was not underpowered to reject the null hypothesis using a 2-sided significance level of .05. If the study reported an association with a 95% confidence interval that includes 1.0, then by that standard the data are compatible with the range of associations that fall within the confidence interval, including the possibility of no association. Point estimates and confidence intervals tell us more than any power calculation about the range of results that are compatible with the data.58 In a review of this topic, Goodman and Berlin wrote that " . . . we cannot cross back over the divide and use pre-experiment numbers to interpret the result. That would be like trying to convince someone that buying a lottery ticket was foolish (the before-experiment perspective) after they hit the lottery jackpot (the after-experiment perspective)".59(p201)
Citations for methods sections
In the methods section, authors should provide sufficient information so that a knowledgeable reader can understand how the quantitative information in the results section was generated. For common methods, such as the chi-square test, Fisher exact test, the 2-sample t test, linear regression, and logistic regression, we see no need for a citation. For proportional hazard models, Poisson regression, and other less common methods, we recommend that a textbook be cited so that an interested person could read further.
Authors sometimes state that their analytic method was a specific command in a software package. This is not helpful to persons without that software. Tell readers the method using statistical nomenclature and give appropriate citations to statistical textbooks and articles, so that they could implement the analysis in the software of their choice.
We see no reason to mention or cite the software used for common statistical methods. For uncommon methods, citing the software may be helpful because the reader may want to acquire software that implements the described method and, for some newer methods, results may be somewhat different depending on the software used. If you are in doubt, we suggest citing your software.
Clarity vs statistical terms
Because the readers of general medical journals are not usually statisticians, we urge that technical statistical terms be replaced with simpler terms whenever possible. Words such as "stochastic" or "Hessian matrix" are rarely appropriate in an article and are never appropriate in the results section.
As an example, imagine that we have done a randomized trial to estimate the risk ratio for pneumonia among those who received a vaccine compared with others. Study subjects ranged in age from 40 to 79 years. We used regression to estimate that the risk ratio for pneumonia was 0.5 among vaccine recipients compared with controls. As part of our analysis, we wanted to know if this association was different among those who were younger (40 to 59 years) compared with those who were older (60 to 79 years). To do this, we introduced what statisticians call an interaction term between treatment group and age group. It is fine to say this in the methods. But in the results we can avoid both the statisticians' term (interaction) and the epidemiologists' term (effect modification) and simply say, "There was no evidence that the association between being in the vaccine group and pneumonia varied by age group (P = .62)."
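A minimal sketch of the methods-section step (with simulated data and hypothetical variable names); logistic regression is used here purely for illustration, and the formula syntax expands "vaccine * older" into the main effects plus the interaction term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 3000
df = pd.DataFrame({"vaccine": rng.integers(0, 2, n),
                   "older": rng.integers(0, 2, n)})
risk = 0.20 * np.where(df["vaccine"] == 1, 0.5, 1.0)  # same effect in both ages
df["pneumonia"] = rng.binomial(1, risk)

fit = smf.logit("pneumonia ~ vaccine * older", data=df).fit(disp=0)
print(f"interaction P = {fit.pvalues['vaccine:older']:.2f}")
# In the results, report this in plain words: "there was no evidence that
# the association between vaccination and pneumonia varied by age group."
```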
Accurate statements about statistical methods will sometimes require words that will be unfamiliar to some readers. We are not asking for clarity at the expense of accuracy, and we appreciate that sometimes part of the methods section will be beyond the general reader. The results section, however, must be written so that the average reader can understand the study findings.
Avoid use of the word "significant" unless you mean "statistically significant"; in that case, it is best to use both those words.
Do not confuse lack of a statistically significant difference with no difference.60 Imagine that the mean age is 38.3 years in group A and 37.9 years in group B, with a mean difference of 0.4 years (95% confidence interval, −1.6 to 2.4). Do not say that the 2 groups did not differ in regard to age; they clearly do differ, with a mean difference of 0.4 years. It might be reasonable to say that the 2 groups were similar with regard to age or that differences in mean age were not statistically significant.
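A minimal sketch of presenting such a comparison; the standard deviations and group sizes are hypothetical values chosen to reproduce the interval in the example above.

```python
import math

mean_a, sd_a, n_a = 38.3, 7.2, 100   # group A (hypothetical SD and size)
mean_b, sd_b, n_b = 37.9, 7.2, 100   # group B (hypothetical SD and size)

diff = mean_a - mean_b
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"mean difference = {diff:.1f} years (95% CI, {lo:.1f} to {hi:.1f})")
```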
Biostatistics, like the rest of medicine, is a changing field. Nothing we have said here is fixed in stone. Today, for example, we recommend confidence intervals as estimates of precision, but we would be quite willing to accept a manuscript with likelihood intervals instead.5,61-63 If authors think they have a good reason to ignore some of our recommendations, we encourage them to write their manuscript as they see fit and be prepared to persuade and educate reviewers and editors. If authors keep in mind the goals of clarity and accuracy, readers will be well served.
References
2. Mickey RM, Greenland S. The impact of confounder selection criteria on effect estimation. Am J Epidemiol. 1989;129:125-137.
3. Maldonado G, Greenland S. Simulation study of confounder selection strategies. Am J Epidemiol. 1993;138:923-936.
4. Kleinbaum DG. Logistic Regression: A Self-Learning Text. New York, NY: Springer-Verlag; 1992:168.
5. Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia, Pa: Lippincott-Raven; 1998:195-199, 255-259, 410-411.
6. Greenland S, Morgenstern H. Confounding in health research. Annu Rev Public Health. 2001;22:189-212.
7. Matthews JNS. An Introduction to Randomized Controlled Clinical Trials. London, England: Arnold; 2000:64-65.
9. Greenland S, Schlesselman JJ, Criqui MH. The fallacy of employing standardized regression coefficients and correlations as measures of effect. Am J Epidemiol. 1986;123:203-208.
10. Greenland S, Maclure M, Schlesselman JJ, Poole C, Morgenstern H. Standardized regression coefficients: a further critique and review of some alternatives. Epidemiology. 1991;2:387-392.
11. Altman DG. Practical Statistics for Medical Research. New York, NY: Chapman & Hall; 1991:396-419.
12. van Belle G. Statistical Rules of Thumb. New York, NY: John Wiley & Sons; 2002:56-68.
13. Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley & Sons; 1981:212-236.
14. Nelson JC, Pepe MS. Statistical description of interrater variability in ordinal ratings. Stat Methods Med Res. 2000;9:475-496.
16. Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research. Oxford, England: Blackwell Science; 2002:704-707.
17. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255-268.
19. Bland JM, Altman DG. Statistical methods for assessing agreement between 2 methods of clinical measurement. Lancet. 1986;1:307-310.
21. Hosmer DW, Hjort NL. Goodness-of-fit processes for logistic regression: simulation results. Stat Med. 2002;21:2723-2738.
22. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ (Clin Res Ed). 1986;292:746-750.
25. Savitz DA. Is statistical significance testing useful in interpreting data? Reprod Toxicol. 1993;7:95-100.
26. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. N Engl J Med. 1991;324:424-428.
27. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. JAMA. 1997;277:927-934.
29. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12:291-294.
31. Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in Observational Epidemiology. New York, NY: Oxford University Press; 1996:36.
32. MacMahon B, Trichopoulos D. Epidemiology: Principles and Methods. Boston, Mass: Little, Brown & Co; 1996:169-170.
33. Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol. 1987;125:761-768.
35. Welch HG, Koepsell TD. Insurance and the risk of ruptured appendix. N Engl J Med. 1995;332:396-397.
36. Altman DG, Deeks JJ, Sackett DL. Odds ratios should be avoided when events are common [letter]. BMJ. 1998;317:1318.
37. Altman D, Deeks J, Sackett D. Odds ratios revisited [letter]. Evid Based Med. 1998;3:71-72.
38. Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evid Based Med. 1996;1:164-166.
39. Schwartz LM, Woloshin S, Welch HG. Misunderstanding about the effects of race and sex on physicians' referrals for cardiac catheterization. N Engl J Med. 1999;341:279-283.
40. Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Stat Med. 2002;21:1575-1600.
41. Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998;280:1690-1691.
42. McNutt LA, Hafner JP, Xue X. Correcting the odds ratio in cohort studies of common outcomes [letter]. JAMA. 1999;282:529.
44. Breslow NE, Day NE. The Design and Analysis of Cohort Studies. Lyon, France: International Agency for Research on Cancer; 1987:120-176. Statistical Methods in Cancer Research; vol 2.
45. McCullagh P, Nelder JA. Generalized Linear Models. New York, NY: Chapman & Hall; 1989.
46. Gourieroux C, Monfort A, Trognon A. Pseudo-maximum likelihood methods: theory. Econometrica. 1984;52:681-700.
47. Gourieroux C, Monfort A, Trognon A. Pseudo-maximum likelihood methods: applications to Poisson models. Econometrica. 1984;52:701-720.
48. Wacholder S. Binomial regression in GLIM: estimating risk ratios and risk differences. Am J Epidemiol. 1986;123:174-184.
49. Gardner W, Mulvey EP, Shaw EC. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychol Bull. 1995;118:392-404.
50. Long JS. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, Calif: SAGE Publications; 1997:217-250.
51. Cameron AC, Trivedi PK. Regression Analysis of Count Data. New York, NY: Cambridge University Press; 1998.
52. Lloyd CJ. Statistical Analysis of Categorical Data. New York, NY: John Wiley & Sons; 1999:84-87.
53. Cummings P, Norton R, Koepsell TD, Rivara FP. Rates, rate denominators, and rate comparisons. In: Cummings P, Koepsell TD, Grossman DC, Maier RV, eds. Injury Control: A Guide to Research and Program Evaluation. New York, NY: Cambridge University Press; 2001:64-74.
54. Hardin J, Hilbe J. Generalized Linear Models and Extensions. College Station, Tex: Stata Press; 2001.
55. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. Cambridge, Mass: MIT Press; 2002:646-649.
58. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology. 1992;3:449-452.
59. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200-206.
61. Royall R. Statistical Evidence: A Likelihood Paradigm. Boca Raton, Fla: CRC Press; 1997.
62. Goodman SN. Toward evidence-based medical statistics, 1: the P value fallacy. Ann Intern Med. 1999;130:995-1004.
63. Goodman SN. Toward evidence-based medical statistics, 2: the Bayes factor. Ann Intern Med. 1999;130:1005-1013.