April 2003

Reporting Statistical Information in Medical Journal Articles

Arch Pediatr Adolesc Med. 2003;157(4):321-324. doi:10.1001/archpedi.157.4.321

Statistics is not merely about distributions or probabilities, although these are part of the discipline. In the broadest sense, statistics is the use of numbers to quantify relationships in data and thereby answer questions. Statistical methods allow the researcher to reduce a spreadsheet of data to counts, means, proportions, rates, risk ratios, rate differences, and other quantities that convey information. We believe that the presentation of numerical information will be enhanced if authors keep in mind that their goal is to clarify and explain. We offer suggestions here for the presentation of statistical information to the readers of general medical journals.

Numbers that can be omitted

Most statistical software packages offer a cornucopia of output. Authors need to be judicious in selecting what should be presented. A chi-square test will typically produce the chi-square statistic, the degrees of freedom in the data, and the P value for the test. In general, chi-square statistics, t statistics, F statistics, and similar values should be omitted. Degrees of freedom are not needed.

Even the P value can usually be omitted. In a study that compares groups, it is customary to present a table that allows the reader to compare the groups with regard to variables such as age, sex, or health status. A case-control study typically compares the cases and controls, whereas a cohort study typically compares those exposed and not exposed. Sometimes authors use P values to compare these study groups. We suggest that these P values should be omitted. In a case-control or cohort study, there is no hypothesis that the 2 groups are similar. We are interested in a comparison because differences between groups may confound estimates of association. If the study sample is large, small differences that have little confounding influence may be statistically significant. If the study sample is small, large differences that are not statistically significant may be important confounders.1 The bias due to confounding cannot be judged by statistical significance; we usually judge this based on whether adjustment for the confounding variable changes the estimate of association.2-6

Even in a randomized trial, in which it is hoped that the compared groups will be similar as a result of randomization, the use of P values for baseline comparisons is not appropriate. If randomization was done properly, then the only reason for any differences must be chance.7 It is the magnitude of any difference, not its statistical significance, that may bias study results and that may need to be accounted for in the analysis.8

Much regression output serves little purpose in medical research publication; this usually includes the intercept coefficient, R², log likelihood, standard errors, and P values. Estimates of variance explained (such as R²), correlation coefficients, and standardized regression coefficients (sometimes called effect size) are not useful measures of causal associations or agreement and should not be presented as the main results of an analysis.9-12 These measures depend not only on the size of any biological effect of an exposure, but also on the distribution of the exposure in the population. Because this distribution can be influenced by choice of study population and study design, it makes little sense to standardize on a measure that can be made smaller or larger by the investigator. Useful measures for causal associations, such as risk ratios and rate differences, are discussed below. Useful measures of agreement, such as kappa statistics, the intraclass correlation coefficient, the concordance coefficient, and other measures, are discussed in many textbooks and articles.11-20

Global tests of regression model fit are not helpful in most articles. Investigators can use these tests to check that a model does not have major conflicts with the data, but they should be aware that these tests have low power to detect problems.21 If the test yields a small P value, which suggests a problem with the model, investigators need to consider what this means in the context of their study. But a large P value cannot reassure authors or readers that the model presented is correct.5

Complex formulas or mathematical notation, such as log likelihood expressions or symbolic expressions for regression models, are not useful for general medical readers.

Numbers that should be included

Several authors, including the International Committee of Medical Journal Editors, have urged that research articles present measures of association, such as risk ratios, risk differences, rate ratios, or differences in means, along with an estimate of the precision for these measures, such as a 95% confidence interval.1,22-29

Imagine that we compared the outcomes of patients who received treatment A with the outcomes of other patients. If we find that the 2-sided P value for this comparison is .02, we conclude that the probability of obtaining the observed difference (or a greater difference) is 1 in 50 if, in the population from which we selected our study subjects, the treatment actually had no effect on the outcomes. But the P value does not tell readers if those who received treatment A did better or worse compared with those who did not. Nor does it tell readers how much better or worse one group did compared with the other. However, if we report that the risk ratio for a bad outcome was 0.5 among those who received treatment A, compared with others, readers can see both the direction (beneficial) and the size (50% reduction in the risk of a bad outcome) of treatment A's association with bad outcomes. It is also useful, when possible, to show the proportion of each group that had a bad outcome or to use something similar, such as a Kaplan-Meier survival curve. If we report that the 95% confidence interval around the risk ratio of 0.5 was 0.3 to 0.7, readers can see that the null hypothesis of no association (a risk ratio of 1.0) is unlikely and that risk ratios of 0.9 or 0.1 are also unlikely. If we report that the risk ratio was 0.5 (95% confidence interval, 0.2 to 1.3), a reader can see that the estimate of 0.5 is imprecise and the data are compatible with no association between treatment and outcome (a risk ratio of 1.0) and are even compatible with a harmful association (a risk ratio greater than 1.0). A point estimate and confidence interval convey more information than the P value for a test of the hypothesis of no association. Similarly, means can be compared by presenting their differences with a 95% confidence interval for the difference.
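To make the point estimate and confidence interval concrete, the following is a minimal sketch of how a risk ratio and an approximate 95% confidence interval could be computed from the counts in a 2 × 2 table. The counts are hypothetical, the function name `risk_ratio_ci` is ours, and the interval uses the common large-sample approximation on the log scale; other interval methods exist and may be preferable when counts are small.

```python
import math

def risk_ratio_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Risk ratio with a large-sample confidence interval built on the log scale."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    rr = risk_treated / risk_control
    # Approximate standard error of log(RR) for 2 independent proportions.
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts (an assumption for illustration): 20 of 200 patients who
# received treatment A and 40 of 200 other patients had a bad outcome.
rr, lower, upper = risk_ratio_ci(20, 200, 40, 200)
print(f"Risk ratio {rr:.2f} (95% confidence interval, {lower:.2f} to {upper:.2f})")
```

With these hypothetical counts, the output shows the direction and size of the association and the range of risk ratios compatible with the data, none of which a P value alone would convey.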

We acknowledge that sometimes P values may serve a useful purpose,30 but we recommend that point estimates and confidence intervals be used in preference to P values in most instances. If P values are given, please use 2 digits of precision (eg, P = .82). Give 3 digits for values between .01 and .001 and report smaller values as P<.001. Do not reduce P values to "not significant" or "NS."
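The rounding rules above can be sketched as a small helper. This is only an illustration; the function name `format_p` is hypothetical, and omitting the leading zero follows the style of the examples in this article.

```python
def format_p(p):
    """Format a P value: 2 digits in general, 3 digits between .001 and .01,
    and "P<.001" for anything smaller."""
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    if p < 0.001:
        return "P<.001"
    digits = 3 if p < 0.01 else 2
    text = f"{p:.{digits}f}"
    # Drop the leading zero, as in "P = .82".
    return "P = " + text.lstrip("0")

for value in (0.82, 0.02, 0.004, 0.0002):
    print(format_p(value))
```

Note that the helper never returns "not significant" or "NS"; it always reports a number.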

Descriptive tables

In tables that compare study groups, it is usually helpful to include both counts (of patients or events) and column percentages (Table 1). In a case-control study, there are usually column headings for the cases and controls. For clinical trials or cohort studies, the column headings are typically the trial arms or the exposure categories. Listing column percentages allows the reader to easily compare the distribution of data between groups. Do not give row percentages.

Data From a Hypothetical Randomized Controlled Trial of Treatment A Compared With Standard Care

In tables of column percentages, do not include a row for counts and percentages of missing data. Doing this will distort the other percentages in the same column, making it difficult for readers to compare known information in different columns. The records with missing data are best omitted for each variable. The investigator hopes that the distribution of information about those with missing data was similar to those with known data. The amount of missing data should be described in the methods section. If there is a lot of missing data for a variable, say more than 5%, a table footnote can point this out (Table 1).
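As a rough sketch of the advice on column percentages and missing data, the following computes counts and column percentages for one variable in one study group, leaving records with missing values out of the denominator and reporting their number separately. The data and the function name `column_percentages` are hypothetical.

```python
from collections import Counter

def column_percentages(values):
    """Summarize one table column as counts and column percentages.

    Missing values (None) are left out of the denominator and counted
    separately so they can be described in the methods or a footnote.
    """
    known = [value for value in values if value is not None]
    if not known:
        raise ValueError("no known values in this column")
    counts = Counter(known)
    summary = {category: (count, 100 * count / len(known))
               for category, count in counts.items()}
    return summary, len(values) - len(known)

# Hypothetical data on sex for one trial arm; None marks a missing value.
treatment_a_sex = ["female"] * 48 + ["male"] * 52 + [None] * 4
summary, n_missing = column_percentages(treatment_a_sex)
for category, (count, percent) in summary.items():
    print(f"{category}: {count} ({percent:.0f}%)")
print(f"Missing: {n_missing}")
```

Because the 4 records with missing data are excluded from the denominator, the percentages for known values add to 100% and can be compared directly with the other column.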

Odds ratios vs risk ratios

In a case-control study, associations are commonly estimated using odds ratios. Because case-control studies are typically done when the study outcome is uncommon in the population from which the cases and controls arose, odds ratios will approximate risk ratios.31,32 Logistic regression is typically used to adjust odds ratios to control for potential confounding by other variables.

In clinical trials or cohort studies, however, the outcome may be common. If more than 10% of the study subjects have the outcome, or if the baseline hazard of disease is common in a subgroup that contributes a substantial portion of subjects with the outcome, then the odds ratio may be considerably further from 1.0 than the risk ratio.6,33 This may result in a misinterpretation of the study results by authors, editors, or readers.34-40 One option is to do the analysis using logistic regression and convert the odds ratios to risk ratios.41-43 Another option is to estimate a risk ratio using Poisson regression, negative binomial regression, or a generalized linear model with a log link and binomial error distribution.44-55 Whatever choice is made, we urge authors not to interpret odds ratios as if they were risk ratios in studies where this interpretation is not warranted.
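A short numerical sketch may help show why the distinction matters. The counts below are hypothetical. The conversion function uses the formula risk ratio = odds ratio / (1 − P0 + P0 × odds ratio), where P0 is the risk in the unexposed group; this is the correction commonly attributed to Zhang and Yu.41 It is exact for a crude 2 × 2 table, as here, but should be regarded as an approximation when applied to adjusted odds ratios.

```python
def risk_and_odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Return the crude risk ratio and odds ratio from 2 x 2 table counts."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    risk_ratio = risk_exposed / risk_unexposed
    odds_ratio = (risk_exposed / (1 - risk_exposed)) / (
        risk_unexposed / (1 - risk_unexposed)
    )
    return risk_ratio, odds_ratio

def odds_ratio_to_risk_ratio(odds_ratio, baseline_risk):
    """Convert an odds ratio to a risk ratio given the risk in the unexposed."""
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)

# Hypothetical rare outcome: 2 of 1000 exposed vs 1 of 1000 unexposed.
rr_rare, or_rare = risk_and_odds_ratio(2, 1000, 1, 1000)
# Hypothetical common outcome: 600 of 1000 exposed vs 300 of 1000 unexposed.
rr_common, or_common = risk_and_odds_ratio(600, 1000, 300, 1000)

print(f"Rare outcome:   risk ratio {rr_rare:.2f}, odds ratio {or_rare:.2f}")
print(f"Common outcome: risk ratio {rr_common:.2f}, odds ratio {or_common:.2f}")
print(f"Converted odds ratio: {odds_ratio_to_risk_ratio(or_common, 300 / 1000):.2f}")
```

In the rare-outcome example the 2 measures nearly coincide, whereas in the common-outcome example the odds ratio is much further from 1.0 than the risk ratio.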

Power calculations after the results are known

Reporting of power calculations makes little sense once the study has been done.56,57 We think that reviewers who request such calculations are misguided. We can never know what a study would have found if it had been larger. If a study reported an association with a 95% confidence interval that excludes 1.0, then the study was not underpowered to reject the null hypothesis using a 2-sided significance level of .05. If the study reported an association with a 95% confidence interval that includes 1.0, then by that standard the data are compatible with the range of associations that fall within the confidence interval, including the possibility of no association. Point estimates and confidence intervals tell us more than any power calculation about the range of results that are compatible with the data.58 In a review of this topic, Goodman and Berlin wrote that " . . . we cannot cross back over the divide and use pre-experiment numbers to interpret the result. That would be like trying to convince someone that buying a lottery ticket was foolish (the before-experiment perspective) after they hit the lottery jackpot (the after-experiment perspective)".59(p201)

Citations for methods sections

In the methods section, authors should provide sufficient information so that a knowledgeable reader can understand how the quantitative information in the results section was generated. For common methods, such as the chi-square test, Fisher exact test, the 2-sample t test, linear regression, and logistic regression, we see no need for a citation. For proportional hazard models, Poisson regression, and other less common methods, we recommend that a textbook be cited so that an interested person could read further.

Authors sometimes state that their analytic method was a specific command in a software package. This is not helpful to persons without that software. Tell readers the method using statistical nomenclature and give appropriate citations to statistical textbooks and articles, so that they could implement the analysis in the software of their choice.

We see no reason to mention or cite the software used for common statistical methods. For uncommon methods, citing the software may be helpful because the reader may want to acquire software that implements the described method and, for some newer methods, results may be somewhat different depending on the software used. If you are in doubt, we suggest citing your software.

Clarity vs statistical terms

Because the readers of general medical journals are not usually statisticians, we urge that technical statistical terms be replaced with simpler terms whenever possible. Words such as "stochastic" or "Hessian matrix" are rarely appropriate in an article and are never appropriate in the results section.

As an example, imagine that we have done a randomized trial to estimate the risk ratio for pneumonia among those who received a vaccine compared with others. Study subjects ranged in age from 40 to 79 years. We used regression to estimate that the risk ratio for pneumonia was 0.5 among vaccine recipients compared with controls. As part of our analysis, we wanted to know if this association was different among those who were younger (40 to 59 years) compared with those who were older (60 to 79 years). To do this, we introduced what statisticians call an interaction term between treatment group and age group. It is fine to say this in the methods. But in the results we can avoid both the statisticians' term (interaction) and the epidemiologists' term (effect modification) and simply say, "There was no evidence that the association between being in the vaccine group and pneumonia varied by age group (P = .62)."

Accurate statements about statistical methods will sometimes require words that will be unfamiliar to some readers. We are not asking for clarity at the expense of accuracy, and we appreciate that sometimes part of the methods section will be beyond the general reader. The results section, however, must be written so that the average reader can understand the study findings.

Common language pitfalls

Avoid use of the word "significant" unless you mean "statistically significant"; in that case, it is best to use both those words.

Do not confuse lack of a statistically significant difference with no difference.60 Imagine that the mean age is 38.3 years in group A and 37.9 years in group B, with a mean difference of 0.4 years (95% confidence interval, −1.6 to 2.4). Do not say that the 2 groups did not differ in regard to age; they clearly do differ, with a mean difference of 0.4 years. It might be reasonable to say that the 2 groups were similar with regard to age or that differences in mean age were not statistically significant.
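The age comparison above can be sketched numerically. The means come from the example; the standard deviations (7.2 years) and group sizes (100 each) are assumptions chosen only so that the interval resembles the one in the text, and the interval uses a large-sample normal approximation rather than a t distribution.

```python
import math

def mean_difference_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, z=1.96):
    """Difference in means (A minus B) with a large-sample confidence interval."""
    difference = mean_a - mean_b
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return difference, difference - z * standard_error, difference + z * standard_error

# Means from the example; SDs and group sizes are hypothetical.
diff, lower, upper = mean_difference_ci(38.3, 7.2, 100, 37.9, 7.2, 100)
print(f"Mean difference {diff:.1f} years "
      f"(95% confidence interval, {lower:.1f} to {upper:.1f})")
```

The interval includes 0, so the difference is not statistically significant, yet the estimated difference is still 0.4 years rather than none.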

Dogma vs flexibility

Biostatistics, like the rest of medicine, is a changing field. Nothing we have said here is fixed in stone. Today, for example, we recommend confidence intervals as estimates of precision, but we would be quite willing to accept a manuscript with likelihood intervals instead.5,61-63 If authors think they have a good reason to ignore some of our recommendations, we encourage them to write their manuscript as they see fit and be prepared to persuade and educate reviewers and editors. If authors keep in mind the goals of clarity and accuracy, readers will be well served.

References

1. Lang JM, Rothman KJ, Cann CI. That confounded P value. Epidemiology. 1998;9:7-8.
2. Mickey RM, Greenland S. The impact of confounder selection criteria on effect estimation. Am J Epidemiol. 1989;129:125-137.
3. Maldonado G, Greenland S. Simulation study of confounder selection strategies. Am J Epidemiol. 1993;138:923-936.
4. Kleinbaum DG. Logistic Regression: A Self-Learning Text. New York, NY: Springer-Verlag; 1992:168.
5. Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia, Pa: Lippincott-Raven; 1998:195-199, 255-259, 410-411.
6. Greenland S, Morgenstern H. Confounding in health research. Annu Rev Public Health. 2001;22:189-212.
7. Matthews JNS. An Introduction to Randomized Controlled Clinical Trials. London, England: Arnold; 2000:64-65.
8. Rothman KJ. Statistics in nonrandomized studies. Epidemiology. 1990;1:417-418.
9. Greenland S, Schlesselman JJ, Criqui MH. The fallacy of employing standardized regression coefficients and correlations as measures of effect. Am J Epidemiol. 1986;123:203-208.
10. Greenland S, Maclure M, Schlesselman JJ, Poole C, Morgenstern H. Standardized regression coefficients: a further critique and review of some alternatives. Epidemiology. 1991;2:387-392.
11. Altman DG. Practical Statistics for Medical Research. New York, NY: Chapman & Hall; 1991:396-419.
12. van Belle G. Statistical Rules of Thumb. New York, NY: John Wiley & Sons; 2002:56-68.
13. Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley & Sons; 1981:212-236.
14. Nelson JC, Pepe MS. Statistical description of interrater variability in ordinal ratings. Stat Methods Med Res. 2000;9:475-496.
15. Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Stat Med. 2002;21:2109-2129.
16. Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research. Oxford, England: Blackwell Science; 2002:704-707.
17. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255-268.
18. Lin LI. A note on the concordance correlation coefficient. Biometrics. 2000;56:324-325.
19. Bland JM, Altman DG. Statistical methods for assessing agreement between 2 methods of clinical measurement. Lancet. 1986;1:307-310.
20. Bland JM, Altman DG. Measurement error. BMJ. 1996;313:744.
21. Hosmer DW, Hjort NL. Goodness-of-fit processes for logistic regression: simulation results. Stat Med. 2002;21:2723-2738.
22. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ (Clin Res Ed). 1986;292:746-750.
23. Rothman KJ. Significance questing. Ann Intern Med. 1986;105:445-447.
24. Walker AM. Reporting the results of epidemiologic studies. Am J Public Health. 1986;76:556-558.
25. Savitz DA. Is statistical significance testing useful in interpreting data? Reprod Toxicol. 1993;7:95-100.
26. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. N Engl J Med. 1991;324:424-428.
27. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. JAMA. 1997;277:927-934.
28. The value of P. Epidemiology. 2001;12:286.
29. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12:291-294.
30. Weinberg CR. It's time to rehabilitate the P-value. Epidemiology. 2001;12:288-290.
31. Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in Observational Epidemiology. New York, NY: Oxford University Press; 1996:36.
32. MacMahon B, Trichopoulos D. Epidemiology: Principles and Methods. Boston, Mass: Little, Brown & Co; 1996:169-170.
33. Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol. 1987;125:761-768.
34. Relman AS. Medical insurance and health: what about managed care? N Engl J Med. 1994;331:471-472.
35. Welch HG, Koepsell TD. Insurance and the risk of ruptured appendix. N Engl J Med. 1995;332:396-397.
36. Altman DG, Deeks JJ, Sackett DL. Odds ratios should be avoided when events are common [letter]. BMJ. 1998;317:1318.
37. Altman D, Deeks J, Sackett D. Odds ratios revisited [letter]. Evid Based Med. 1998;3:71-72.
38. Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evid Based Med. 1996;1:164-166.
39. Schwartz LM, Woloshin S, Welch HG. Misunderstanding about the effects of race and sex on physicians' referrals for cardiac catheterization. N Engl J Med. 1999;341:279-283.
40. Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Stat Med. 2002;21:1575-1600.
41. Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998;280:1690-1691.
42. McNutt LA, Hafner JP, Xue X. Correcting the odds ratio in cohort studies of common outcomes [letter]. JAMA. 1999;282:529.
43. Nelder JA. Statistics in medical journals: some recent trends [letter]. Stat Med. 2001;20:2205.
44. Breslow NE, Day NE. The Design and Analysis of Cohort Studies. Lyon, France: International Agency for Research on Cancer; 1987:120-176. Statistical Methods in Cancer Research; vol 2.
45. McCullagh P, Nelder JA. Generalized Linear Models. New York, NY: Chapman & Hall; 1989.
46. Gourieroux C, Monfort A, Tognon C. Pseudo-maximum likelihood methods: theory. Econometrica. 1984;52:681-700.
47. Gourieroux C, Monfort A, Tognon C. Pseudo-maximum likelihood methods: applications to Poisson models. Econometrica. 1984;52:701-720.
48. Wacholder S. Binomial regression in GLIM: estimating risk ratios and risk differences. Am J Epidemiol. 1986;123:174-184.
49. Gardner W, Mulvey EP, Shaw EC. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychol Bull. 1995;118:392-404.
50. Long JS. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, Calif: SAGE Publications; 1997:217-250.
51. Cameron AC, Trivedi PK. Regression Analysis of Count Data. New York, NY: Cambridge University Press; 1998.
52. Lloyd CJ. Statistical Analysis of Categorical Data. New York, NY: John Wiley & Sons; 1999:84-87.
53. Cummings P, Norton R, Koepsell TD. Rates, rate denominators, and rate comparisons. In: Rivara FP, Cummings P, Koepsell TD, Grossman DC, Maier RV, eds. Injury Control: A Guide to Research and Program Evaluation. New York, NY: Cambridge University Press; 2001:64-74.
54. Hardin J, Hilbe J. Generalized Linear Models and Extensions. College Station, Tex: Stata Press; 2001.
55. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. Cambridge, Mass: MIT Press; 2002:646-649.
56. Bacchetti P. Author's thoughts on power calculations [letter]. BMJ. 2002;325:491.
57. Senn SJ. Power is indeed irrelevant in interpreting completed studies [letter]. BMJ. 2002;325:1304.
58. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology. 1992;3:449-452.
59. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200-206.
60. Altman DG, Bland MJ. Absence of evidence is not evidence of absence. BMJ. 1995;311:485.
61. Royall R. Statistical Evidence: A Likelihood Paradigm. Boca Raton, Fla: CRC Press; 1997.
62. Goodman SN. Toward evidence-based medical statistics, 1: the P value fallacy. Ann Intern Med. 1999;130:995-1004.
63. Goodman SN. Toward evidence-based medical statistics, 2: the Bayes factor. Ann Intern Med. 1999;130:1005-1013.