“Most patients using the new analgesic reported significantly reduced pain.”
Such research findings sound exciting because the word “significant” suggests “important” and “large”. But researchers often use the word in a narrow statistical sense that has nothing to do with importance.
Consider this statement: a change is statistically significant if we would be unlikely to get results at least as extreme as those observed, assuming the treatment under study actually has no effect.
If you find that difficult to understand, you’re in good company. Statistical significance testing relies on weird backward logic, and there’s clear evidence that most students and many researchers don’t understand it.
Another problem is that statistical significance is very sensitive to how many people we observe. A small experiment studying only a few patients probably won’t identify even a large effect as statistically significant. On the other hand, a very large experiment is likely to label even a tiny, worthless effect as statistically significant.
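To make the sample-size problem concrete, here is a minimal sketch in Python. It is a simplification, not a full analysis: it assumes a two-group comparison, a known standard deviation, and an observed difference exactly equal to the stated effect. The function name and the chosen effect sizes and group sizes are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_sd_units: float, n_per_group: int) -> float:
    """Two-sided p value from a z test comparing two groups of equal size.

    Assumes the observed difference is `effect_sd_units` standard deviations
    and that the standard deviation is known (a deliberate simplification).
    """
    standard_error = sqrt(2 / n_per_group)  # SE of a difference in means, SD = 1
    z = effect_sd_units / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A large effect (0.8 SD) observed in a tiny study of 5 patients per group.
large_effect_small_study = two_sided_p(effect_sd_units=0.8, n_per_group=5)

# A tiny effect (0.02 SD) observed in a huge study of 50,000 patients per group.
tiny_effect_huge_study = two_sided_p(effect_sd_units=0.02, n_per_group=50_000)

print(f"Large effect, 5 per group:      p = {large_effect_small_study:.3f}")
print(f"Tiny effect, 50,000 per group:  p = {tiny_effect_huge_study:.4f}")
```

Under these assumptions the large effect in the small study comes out well above the usual 0.05 threshold, while the tiny effect in the huge study comes out well below it.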
For this and other reasons, it’s far better to avoid statistical significance as a measure and instead use estimation, an alternative statistical approach that’s well known but, sadly, little used.
The joys of estimation
Estimation tells us things such as “the average reduction in pain was 1.2 ± 0.5 points on the 10-point pain scale” (1.2 plus or minus 0.5). That’s far more informative than any statement about significance. And we can interpret the 1.2 (the average improvement) in clinical terms — in terms of how patients actually felt.
The “± 0.5” tells us the precision of our estimate. Instead of 1.2 ± 0.5 we could write 0.7 to 1.7. Such a range is called a confidence interval. The usual convention is to report 95% confidence intervals, which mean we can be 95% confident the interval includes the true average reduction in pain. That’s a highly informative summary of the findings.
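The arithmetic behind such an interval can be sketched in a few lines of Python. This is only an illustration: the pain-reduction scores are hypothetical, the function names are my own, and the interval uses a normal approximation (a t-based interval would be slightly wider for a sample this small).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def interval_from_margin(estimate, margin):
    """Turn 'estimate ± margin' into a (low, high) range."""
    return (estimate - margin, estimate + margin)


def approx_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    margin = z * stdev(values) / sqrt(len(values))
    return interval_from_margin(mean(values), margin)


# The example from the text: 1.2 ± 0.5 is the same as 0.7 to 1.7.
low, high = interval_from_margin(1.2, 0.5)
print(f"{low:.1f} to {high:.1f}")

# Hypothetical pain-reduction scores for ten patients (invented for illustration).
pain_reductions = [2, 1, 0, 3, 1, 2, 0, 1, 2, 0]
ci_low, ci_high = approx_confidence_interval(pain_reductions)
print(f"mean {mean(pain_reductions):.1f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

With only ten patients the interval is wide, which is exactly the uncertainty the interval is meant to report.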
We have published evidence that confidence intervals prompt better interpretation of research results than significance testing.
So why has statistical significance testing become entrenched in many disciplines, and why is it widely used in medicine and biosciences? One reason may be that saying something is significant strongly suggests importance, or even truth — even though statistical significance doesn’t tell us either.
Another possible reason is that confidence intervals are often embarrassingly wide. It’s hardly reassuring to report that the average improvement was 12 ± 9, or even 12 ± 15. But such wide intervals accurately report the large amount of uncertainty in research data.
Damning critiques of significance testing and its pernicious effects have been published for more than half a century, and Rex Kline provides an excellent review of them.
My own “Dance of the p values” simulation illustrates some of the problems of significance testing.
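The simulation itself is not reproduced here, but the idea can be sketched in Python. This is a rough sketch in the same spirit, not the original simulation: it repeats the same hypothetical two-group experiment many times and records the p value each time. The true effect (0.5 standard deviations), the group size (32) and the random seed are assumptions chosen for illustration, and the p values use a normal approximation rather than a t test.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def replicate_p_value(rng, true_effect=0.5, n_per_group=32):
    """Run one simulated experiment and return its two-sided p value."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    difference = mean(treated) - mean(control)
    standard_error = sqrt(
        stdev(control) ** 2 / n_per_group + stdev(treated) ** 2 / n_per_group
    )
    z = difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(2024)  # fixed seed so the run is repeatable
p_values = [replicate_p_value(rng) for _ in range(200)]

# Show the first few replications, then the overall spread.
for p in p_values[:10]:
    print(f"p = {p:.3f}")
print(f"smallest p = {min(p_values):.4f}, largest p = {max(p_values):.3f}")
```

Every replication studies exactly the same true effect, yet the p values range from very small to quite large, so a single p value says little about what a repeat of the study would find.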
A break in the clouds
Now, however, there is hope. The new edition of the American Psychological Association Publication Manual states that researchers should “wherever possible, base discussion and interpretation of results on point and interval estimates.”
The manual is used by more than 1,000 journals across numerous disciplines, so its advice is influential. I hope it will prove a tipping point, and lead to a major shift from statistical significance testing to estimation.
I refer to estimation as the new statistics, not because the techniques are new, but because switching from significance testing to estimation would be new for most researchers and would require a major change in thinking. At the heart of the new statistics are confidence intervals and meta-analyses, which apply estimation to multiple studies.
The power of meta-analyses
Meta-analyses integrate evidence from a number of studies on a single issue, so they can overcome the wide confidence intervals usually given by individual studies. This tool is becoming widely used, and forms the basis of reviews in the Cochrane Library, a wonderful online resource that integrates research in the medical and health sciences, and makes the results available to practitioners.
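A minimal sketch in Python shows how combining studies narrows the interval. It uses the simplest form of meta-analysis, a fixed-effect inverse-variance average, which assumes every study estimates the same true effect. The four studies, their estimates and their standard errors are hypothetical, and the function name is my own.

```python
from math import sqrt

Z_95 = 1.96  # normal critical value for a 95% interval


def pooled_estimate(studies):
    """Fixed-effect inverse-variance pooling.

    `studies` is a list of (estimate, standard_error) pairs.
    Returns (pooled estimate, pooled standard error).
    """
    weights = [1 / se**2 for _, se in studies]  # precise studies count for more
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / total_weight
    return pooled, sqrt(1 / total_weight)


# Hypothetical studies of the same treatment: (average improvement, standard error).
studies = [(1.0, 0.6), (1.5, 0.5), (0.9, 0.7), (1.3, 0.4)]

for est, se in studies:
    print(f"study:  {est:.2f} ± {Z_95 * se:.2f}")

pooled, pooled_se = pooled_estimate(studies)
print(f"pooled: {pooled:.2f} ± {Z_95 * pooled_se:.2f}")
```

The pooled interval is narrower than that of any single study. When studies are likely to differ in their true effects, a random-effects model is the more usual choice.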
Statistical significance is virtually irrelevant to meta-analysis. In fact, it can damage meta-analysis because journals have often published results only if they were statistically significant, while non-significant results were ignored. Published results were thus a biased selection of all research, and meta-analyses based on published articles would give a biased result.
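The size of this bias can be illustrated with a small simulation in Python. It is a sketch under stated assumptions: a small true effect of 0.2 standard deviations, 20 people per group, a known standard deviation, and journals that publish only results with p below 0.05. All of these numbers, and the function name, are invented for illustration.

```python
import random
from math import sqrt
from statistics import mean


def simulate_study(rng, true_effect=0.2, n_per_group=20):
    """Return (observed effect, is_significant) for one two-group study, SD = 1."""
    standard_error = sqrt(2 / n_per_group)
    # The observed difference in means varies around the true effect.
    observed = rng.gauss(true_effect, standard_error)
    significant = abs(observed / standard_error) > 1.96  # two-sided p < 0.05
    return observed, significant


rng = random.Random(7)  # fixed seed so the run is repeatable
results = [simulate_study(rng) for _ in range(5000)]

all_effects = [effect for effect, _ in results]
published = [effect for effect, significant in results if significant]

print(f"studies run: {len(all_effects)}, 'published': {len(published)}")
print(f"average effect, all studies:       {mean(all_effects):.2f}")
print(f"average effect, published studies: {mean(published):.2f}")
```

Averaging all the studies recovers something close to the true effect, while averaging only the statistically significant ones overstates it considerably, because only the studies that happened to observe a large difference pass the filter.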
Meta-analyses can make sense of messy and disputed research literature. They have clearly established, for instance, that phonics is essential for an effective beginner reading program.
Why statistics matter
In the late 1970s, my wife and I followed the best advice on how to reduce the risk of SIDS, or cot death, by putting our young kids to sleep face down on a sheepskin. A recent review applied meta-analysis to the evidence available at various times, and found that, by 1970, there was reasonably clear evidence that back sleeping is safer.
The evidence strengthened over the years, although some parenting books still recommended front sleeping as late as 1988. The authors of the meta-analysis estimated that, if an analysis such as theirs had been available and used in 1970, and the recommendation for back sleeping had been widely adopted, as many as 50,000 infant deaths might have been avoided across the Western world.
Who says the choice of statistical technique doesn’t make a difference?