Statistics, Science, Random Ramblings

A blog mostly about data and R

On p-values

Posted at — Oct 2, 2019

Earlier this year, the American Statistical Association published another set of position papers on p-values in the journal The American Statistician. They are worth the time it takes to read them, but the tl;dr is that they essentially say to stop using p-values as the sole, binary indicator of how meaningful a scientific result is. Instead of relying on a single, artificially dichotomised value, researchers should use a variety of measures and also consider the context of the data.

I find it rather important that this set of papers appeared, and that papers like these keep appearing until the over-focus on p < .05 in many fields of science disappears. Currently, even if you try to communicate results with a broad set of measures, essentially one of two things happens: if p-values are present, people focus on them; if they are not present, people ask for them. This creates a false and dangerous over-focus on a single value that lowers the overall quality of the scientific literature and the conclusions drawn from it.

The issues with p-values

There are numerous issues attached to the use of p < .05 that make alternatives necessary. Below are some of them; the list is probably not exhaustive:

  • Using p < .05 as a threshold for statistical significance suggests a false dichotomy. The difference between data leading to p = .051 and data leading to p = .049 is negligible, yet under current practice the former is bad news and the latter is good news.
  • If your hypotheses are not sensible, p < .05 will not be worth much. Given the noisiness of real data alone, a null hypothesis of exactly zero difference will rarely be plausible.
  • p-values scale with sample size. It is folklore in day-to-day science that at a large enough n everything is significant (see the sketch after this list). The measure was developed with small-ish samples in mind and it shows.
  • Differences failing p < .05 might nonetheless have large practical consequences. If you run a drug trial and people show an improvement in symptoms, but you would need 50 more subjects to reach the magical threshold, focusing on p < .05 might do more harm than good.
  • Focusing solely on p < .05 shifts attention too far away from everything where p >= .05. Say you build a generalised linear model with six predictors and two have p < .05. If you then only report those two predictors in your paper, the model you report is incomplete. Unfortunately, reporting only the significant predictors happens regularly in the literature.
  • If an artificial threshold determines whether a result is worth publishing, people will do much more than they should to reach that threshold. This practice is known as p-hacking and is probably more widespread than people like to admit.
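To make the sample-size point from the list above concrete, here is a minimal R sketch (my own illustration, not taken from the ASA papers). It simulates two groups whose means differ by a practically negligible 0.05 standard deviations and runs a two-sample t-test at increasing sample sizes. The exact p-values depend on the random draws, but the pattern is that the same tiny difference eventually crosses p < .05 once n is large enough.

    ## Simulate a negligible group difference (0.05 SD) at increasing n
    ## and record the p-value of a two-sample t-test for each sample size.
    set.seed(42)

    p_for_n <- function(n, delta = 0.05) {
      x <- rnorm(n, mean = 0)      # "control" group
      y <- rnorm(n, mean = delta)  # "treatment" group, shifted by a trivial amount
      t.test(x, y)$p.value
    }

    sizes <- c(100, 1000, 10000, 100000)
    data.frame(n = sizes, p = sapply(sizes, p_for_n))

Nothing about the difference itself changes across the rows; only the sample size does, and with it the verdict of the p < .05 rule.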

Change is uncomfortable

The issues with p-values have been communicated widely and many people in science are aware of them. Why, then, is it still common practice to focus on p < .05 in so many disciplines?

I believe this is because the change towards a more complex and nuanced approach is genuinely uncomfortable. As of now, p < .05 suggests a comfortable clarity: something is either significant or not – there is a clear cut-off and a clear interpretation (or so it seems). Asking researchers to move away from that and to use approaches with less clear-cut interpretations is of course met with resistance and a feeling of helplessness.

Journals should start encouraging submitters to use a variety of measures (and to lay out in the methods section of their paper why the chosen measures are appropriate), and reviewers should equally be made aware of the editorial shift away from p < .05. If the upstream requirements change quickly and in a large enough number of important journals, researchers will no longer be tempted to write that a difference between groups was statistically significant (p < .05). However, it should also be kept in mind that many researchers are not statisticians, may have had a poor statistical education and might feel lost if they can no longer use the familiar threshold. Thus, to encourage the move away from p < .05 it is essential that a good number of practical primers appear in journals, offering guidance on alternatives.

When moving away from p-values as the dominant, and sometimes only, measure considered relevant towards a more holistic approach, everyone involved should take care to avoid recreating the same situation with another measure. To some extent this is already the case for effect sizes, where people often use cut-off values to characterise effects as mild, moderate or pronounced, without taking the context of the data or the origin of these cut-off values into consideration (a small sketch follows below).
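As a small illustration of that point (again my own sketch, not from the papers): Cohen's d is simply the standardised mean difference between two groups, and the conventional cut-offs of roughly 0.2, 0.5 and 0.8 get attached to that number regardless of what the outcome actually measures or what a shift of that size means in practice.

    ## Standardised mean difference (Cohen's d) between two simulated groups.
    ## Conventional labels (~0.2 "small", ~0.5 "medium", ~0.8 "large") are often
    ## applied to this number with no regard for the context of the data.
    set.seed(1)
    x <- rnorm(200, mean = 0)
    y <- rnorm(200, mean = 0.3)

    pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
    d <- (mean(y) - mean(x)) / pooled_sd
    d  # around 0.3: "small" by convention, whatever the practical stakes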

Don’t blame the player, blame the game

Now, it seems easy to put the blame on researchers: those people in psychology, neuroscience and medicine (and probably many more fields) who don't know anything about statistics. And why is psychology its own field anyway? (This paragraph is obviously not to be taken seriously; I am a psychologist myself.)

One real problem that should not be left unmentioned is that it is close to impossible to publish a paper without something reaching p < .05, which obviously drives practices like p-hacking, overstating the impact of results and not reporting more ambiguous findings. It is even worse if you try to publish null findings. You can have a reasonable hypothesis, well executed methods and a proper data analysis, but if at the end you say you did not find any effects, then good luck getting it published in a respectable journal.

It appears only natural that people try to get some p < .05 somewhere before years of work go to waste, most likely harming their careers as well. There is a lot of focus on the number of papers someone has published, and if publishing is much easier when you can report something with p < .05, the resulting vicious cycle is obvious. People should not start playing the blame game now, as everyone involved is more or less subject to these long-grown conventions. Instead, people in science should focus on making things better, starting today.

Concluding remarks

Science should move away from using p < .05 as a falsely dichotomous indicator of whether results are meaningful, as there are numerous issues with this approach. A top-down approach, with journals nudging researchers towards alternatives, will probably be the quickest and most effective route to the adoption of different ways of analysing data.