Undue Influence: The P Value in Scientific Publishing and Health Policy | Health Policy | JAMA Health Forum | JAMA Network
JAMA Forum

Undue Influence: The P Value in Scientific Publishing and Health Policy

One of the unfortunate adverse effects of the recent focus on the reproducibility of findings in research is that some people have developed less faith in the accuracy of science in general. This isn’t to say that a healthy skepticism isn’t warranted. In many ways, we’ve brought this problem on ourselves.

Aaron Carroll, MD, MS

This is especially true when we discuss statistical significance. It’s bad enough that so few in the lay media and the public seem to understand it. It’s even worse that so many in science and science publishing fail to grasp its meaning.

Earlier this year, the American Statistical Association issued a statement on statistical significance and P values. The fact that the group felt the need to do this is concerning. After all, it’s not as if any huge advances or changes have occurred in basic statistics. It’s because we need to be reminded about what we’ve forgotten or never really grasped.

Too often, readers of research believe that the P value represents the probability that the null hypothesis is true (that is, the chance that the observations are the result of random chance). They think that a P value of less than .05 means that people can be more than 95% confident that the finding is correct. But that isn’t the case.
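
A back-of-the-envelope calculation shows why the “95% confident” reading fails. The sketch below is illustrative only: the function name, the assumption that 90% of tested hypotheses are true nulls, and the 80% power are all made up for the example and do not come from any study discussed here. It also simplifies “significant” to mean crossing the .05 cutoff.

```python
def share_of_significant_results_with_true_null(prior_null, power, alpha=0.05):
    """Among results that cross the cutoff, the fraction where the null is true.

    prior_null: assumed share of tested hypotheses for which the null is true
    power:      assumed chance a real effect produces a significant result
    alpha:      significance cutoff
    """
    false_positives = alpha * prior_null        # true nulls that cross the cutoff
    true_positives = power * (1 - prior_null)   # real effects that cross the cutoff
    return false_positives / (false_positives + true_positives)


# Hypothetical inputs: 90% of tested ideas are wrong, studies have 80% power.
share = share_of_significant_results_with_true_null(prior_null=0.9, power=0.8)
print(f"{share:.0%} of significant findings come from true nulls")
```

Under these assumed inputs, roughly a third of the “significant” findings would come from true null hypotheses, which is a long way from 95% confidence. Different assumptions give different answers, and that is the point: the P value alone can’t supply this number.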

As Steven N. Goodman, MD, PhD, MHS, of Stanford University School of Medicine, wrote recently in Science, “The formal definition of P value is the probability of an observed data summary (e.g., an average) and its more extreme values, given a specified mathematical model and hypothesis (usually the “null”).” It tells us something about how likely data are to fit a certain hypothesis, as specified by a (likely imperfect) mathematical formula.
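
To make that definition concrete, here is a minimal sketch using a coin-flip model. The scenario (60 heads in 100 flips, with a null hypothesis that the coin is fair) and the function name are assumptions chosen for illustration. The code sums the probability of the observed count and every more extreme count, given the null model.

```python
from math import comb


def one_sided_binomial_p_value(successes, trials, null_prob=0.5):
    """P(X >= successes) when X ~ Binomial(trials, null_prob).

    This is the probability of the observed summary (the count) and all
    more extreme values, computed under the null model.
    """
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 60 heads in 100 flips, null hypothesis "the coin is fair".
p = one_sided_binomial_p_value(60, 100)
print(f"P value: {p:.4f}")
```

For these made-up data the result is about .028. Note what the number is conditioned on: it assumes the null model is true and asks how surprising the data are. It says nothing directly about how likely the null is given the data.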

The P value doesn’t measure the probability that a hypothesis is true. It doesn’t tell us anything about how likely it is that a result occurred by chance alone (another misinterpretation). It also doesn’t set a bar for whether we should take something seriously.

Unfortunately, too often, that’s how journals have considered a P value, and it’s a view authors who wish to publish in journals seem to share. If your statistics yield a P value of .06, then it’s a negative study. If it’s .04, then it’s much more likely to be considered important and much more likely that people will care. In reality, there’s little difference between those findings.
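
One way to see how close those two results are is to convert each P value back to the test statistic that would produce it. The sketch below assumes a two-sided test with a normally distributed test statistic, which is an assumption for illustration and not a detail of any particular study; the function name is likewise hypothetical.

```python
from statistics import NormalDist


def z_score_for_two_sided_p(p_value):
    """Absolute z statistic that yields the given two-sided P value,
    assuming a standard normal test statistic."""
    return NormalDist().inv_cdf(1 - p_value / 2)


for p in (0.04, 0.05, 0.06):
    print(f"P = {p:.2f} corresponds to |z| of {z_score_for_two_sided_p(p):.2f}")
```

Under this assumption, the “positive” study at .04 and the “negative” study at .06 differ by less than two tenths of a standard error in their test statistics, yet they land on opposite sides of the cutoff.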

P Values and the News

The news media follow this assessment as well. A “negative finding” often can be ignored or buried. But a new “positive finding” often will not only be touted and proclaimed loudly, it will be discussed as if it’s the last word, regardless of what earlier research contradicts it. This is how we can have headlines proclaiming that cell phones cause cancer because of a new small study, regardless of how much data and evidence we already have that don’t fit with those findings.

It’s also how we can have study after study in certain fields that seemingly contradict each other. We can’t discuss each in a vacuum. They all need to be considered in the context of all other research that has come before and that might come after.

We also have to acknowledge that P values can be “hacked.” Repeated analyses can be conducted until one finally crosses that magical barrier of .05. At that point, some will consider the findings “truth” and adjust their practice accordingly.
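
A small simulation gives a sense of how quickly repeated analyses inflate the odds of a spurious “finding.” This is a sketch under simplifying assumptions: every null hypothesis is true, the analyses are independent, and each P value is therefore uniformly distributed between 0 and 1. Real reanalyses of the same data are correlated, so the actual inflation would differ. The function names and the choice of 20 analyses are hypothetical.

```python
import random


def chance_of_any_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of several independent tests of a
    true null hypothesis yields P < alpha."""
    return 1 - (1 - alpha) ** num_analyses


def simulate_any_false_positive(num_analyses, alpha=0.05, num_studies=20000, seed=0):
    """Monte Carlo check: draw uniform P values (true null) for each analysis
    and count the studies in which any analysis crosses the cutoff."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(num_analyses))
        for _ in range(num_studies)
    )
    return hits / num_studies


for k in (1, 5, 20):
    exact = chance_of_any_false_positive(k)
    simulated = simulate_any_false_positive(k)
    print(f"{k:>2} analyses: exact {exact:.2f}, simulated {simulated:.2f}")
```

With a single analysis, the chance of a false positive is the advertised 5%. With 20 independent tries under these assumptions, it rises to roughly 64%, even though nothing real is there to find.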

Goodman further relates how English statistician and biologist R. A. Fisher established much of the framework we still use for statistical thinking in science today. But Fisher considered “significance,” as characterized by a P value, to mean that a finding was worthy of further research, and that only if further research had similar findings could the work truly be considered to refute the null hypothesis.

But most of us don’t always adhere to Fisher’s standard for significance. I know that at times I’ve been guilty of this neglect in my writing. I can discuss one study as “significant” and another as “inconclusive,” as if just knowing the P value tells us all that we need to know about whether to believe the findings or not.

Real-world Implications

This may sound like an esoteric argument for scientists and statisticians to have, but unfortunately it has real-world implications. When physicians read studies in journals like JAMA, they often are deciding whether to change the ways they treat patients in order to achieve better outcomes. Relying on simple numbers for significance, and strict cut points as to whether to believe a result and put it into practice, can lead us down a path from which it’s hard to recover.

This has policy implications as well. Consider the Oregon Health Insurance Experiment, which examined the effects after 2 years of expanding Medicaid coverage in Oregon in 2008. Saying that it didn’t significantly improve health outcomes (because differences in the outcomes measured failed to achieve a P value of less than .05) might miss nuances of power and what changes in health outcomes or other factors might be reasonably expected from such a study. Misunderstanding “significance” when it comes to the effects of massive screening policies, like those for breast or prostate cancer, can have enormous consequences.

Relying solely on P values without context is how you wind up believing that all foods both cause and prevent cancer and that one of the most “sure” things in the world (with a collective P value <10^−42) is that skipping breakfast is linked with obesity. (That association is far from proven, by the way.)

There are many reasons that science has a reproducibility problem. There are also many reasons why policy fails to achieve the results we want and why published studies seem to contradict each other too often. But a misunderstanding of significance, especially P values, is a contributing cause to all of these issues. As scientists, researchers, and editors, we need to change that.

About the author: Aaron E. Carroll, MD, MS, is a health services researcher and the Vice Chair for Health Policy and Outcomes Research in the Department of Pediatrics at Indiana University School of Medicine. He blogs about health policy at The Incidental Economist and tweets at @aaronecarroll.