Retire statistical significance (Amrhein et al., 2019)

Summary

The authors argue strongly against dichotomous classification of results by statistical significance. They refer to earlier comments, in particular those by Wasserstein et al. (2019) in a special issue of The American Statistician, and show that many scientists agree with them.

The authors present an example, drawn from the literature (ref. 2), in which they compare two studies, one less precise than the other. In the less precise study the P value is above the 5% threshold and the 95% confidence interval spans from 0.97 to 1.48. The more precise study has a narrower confidence interval around the same point estimate.

They claim that in these situations the studies would often be mischaracterized as “contradicting”.
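
The comparison above can be made concrete with a minimal Python sketch. It assumes that the reported interval belongs to a ratio measure and was computed with a normal approximation on the log scale; the notes do not state either. The second, more precise interval is hypothetical: it is built by halving the standard error around the same point estimate and is not taken from the paper. The function name `summarize_ratio_interval` is likewise only illustrative.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def summarize_ratio_interval(lower, upper, z_crit=Z_95):
    """Recover point estimate, log-scale standard error and two-sided P value
    from a 95% interval for a ratio (assumes a Wald interval on the log scale)."""
    log_lower, log_upper = math.log(lower), math.log(upper)
    log_estimate = (log_lower + log_upper) / 2       # interval center on log scale
    se = (log_upper - log_lower) / (2 * z_crit)      # half-width divided by z
    z = log_estimate / se                            # test against "no effect" (ratio = 1)
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal P value
    return math.exp(log_estimate), se, p_value


# Less precise study: interval as given in the summary.
estimate, se, p_imprecise = summarize_ratio_interval(0.97, 1.48)

# Hypothetical more precise study: same point estimate, half the standard error.
se_precise = se / 2
lower_precise = math.exp(math.log(estimate) - Z_95 * se_precise)
upper_precise = math.exp(math.log(estimate) + Z_95 * se_precise)
estimate_precise, _, p_precise = summarize_ratio_interval(lower_precise, upper_precise)

print(f"less precise: estimate={estimate:.2f}, P={p_imprecise:.3f}")
print(f"more precise: estimate={estimate_precise:.2f}, P={p_precise:.4f}")
```

Under these assumptions both studies share the same point estimate, yet only one P value falls below 0.05. The difference lies in precision, not in the estimated effect, which is why calling the studies “contradicting” is misleading.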

They call for abandoning the term “statistical significance” because it leads to a dichotomous classification of results and to a large, unjustifiable bias in favor of “statistically significant” results. Furthermore, it encourages so-called P-hacking.

Later on they mention that this kind of categorization is bad practice not only in frequentist but also in Bayesian statistics, e.g. when thresholds are applied to Bayes factors.

The authors propose using the term “compatibility interval” instead of “confidence interval” and stating that values inside this range are compatible with the data, given the statistical assumptions, while values outside are “less compatible”. The point estimate is to be described as the “most compatible” value.
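
The proposed wording can be illustrated with a small Python sketch. The helper `describe_compatibility` is hypothetical and only maps a candidate value to the phrases used above; the point estimate is taken as the geometric midpoint of the interval from the summary, which assumes a ratio measure with an interval that is symmetric on the log scale.

```python
import math


def describe_compatibility(value, estimate, lower, upper):
    """Return the compatibility wording for a candidate value
    (hypothetical helper, not from the paper)."""
    if math.isclose(value, estimate):
        return "most compatible"
    if lower <= value <= upper:
        return "compatible"
    return "less compatible"


lower, upper = 0.97, 1.48
estimate = math.sqrt(lower * upper)  # assumed: geometric midpoint of the interval

for candidate in (1.0, estimate, 1.4, 1.6):
    label = describe_compatibility(candidate, estimate, lower, upper)
    print(f"{candidate:.2f}: {label}")
```

In this sketch a ratio of 1.0 (no effect) is compatible with the data, but so is a ratio of 1.4, so the interval does not support a claim of “no difference”. The three labels are only wording aids: the authors stress that values near the point estimate are more compatible with the data than values near the limits, so the interval limits should not become a new threshold.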

Towards the end they advocate a detailed description and discussion of data, models and findings, and an honest review of the study's weaknesses and strengths. While they do not want to eradicate hypothesis testing or P values, they want statistical significance gone.

Key Highlights

“A statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome). Nor do statistically significant results ‘prove’ some other hypothesis.”

“Eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’.”

“Whether a P value is small or large, caution is warranted. We must learn to embrace uncertainty.”

“We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications.”

“Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval.”

As stated in Statistical Rethinking (McElreath), our methods are tools, or “golems”, that can yield unwanted results when used in the wrong situation or in the wrong way.

“Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.”

“Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones.”

“P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.”

Comment

I fully agree. Dichotomous decision making is taught in many statistics courses, and students often ask for guidelines to classify results. We should avoid promoting this and instead encourage thinking about the results. I think that the demand for classifying test results often stems from insufficient knowledge of how the tests work.

References

  1. Fisher, R. A. Nature 136, 474 (1935).
  2. Schmidt, M. & Rothman, K. J. Int. J. Cardiol. 177, 1089–1090 (2014).
  3. Wasserstein, R. L., Schirm, A. & Lazar, N. A. Am. Stat. https://doi.org/10.1080/00031305.2019.1583913 (2019).
  4. Hurlbert, S. H., Levine, R. A. & Utts, J. Am. Stat. https://doi.org/10.1080/00031305.2018.1543616 (2019).
  5. Lehmann, E. L. Testing Statistical Hypotheses 2nd edn 70–71 (Springer, 1986).
  6. Gigerenzer, G. Adv. Meth. Pract. Psychol. Sci. 1, 198–218 (2018).
  7. Greenland, S. Am. J. Epidemiol. 186, 639–645 (2017).
  8. McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Am. Stat. https://doi.org/10.1080/00031305.2018.1527253 (2019).
  9. Gelman, A. & Loken, E. Am. Sci. 102, 460–465 (2014).
  10. Amrhein, V., Trafimow, D. & Greenland, S. Am. Stat. https://doi.org/10.1080/00031305.2018.1543137 (2019).