Redefining significance and reproducibility for medical research: A plea for higher P-value thresholds for diagnostic and prognostic models.
Ewout Willem Steyerberg, Ben Van Calster. Published in: European Journal of Clinical Investigation (2020)
The role of P-values in null hypothesis testing is under debate. We aim to explore the impact of the significance threshold on estimates of the strength of associations ("effects") and the implications for different types of epidemiological research. We consider situations in which the estimate of a true effect is normally distributed, while varying the true effect size. We confirm the occurrence of "testimation bias": estimating the effect size only if the test was statistically significant leads to exaggerated results. The absolute bias is largest for true effects around 0.7 times the size of the standard error: +220% bias if effects are selected after testing with P < .05, and +335% if tested with P < .005. Less bias was found for testing with P < .20 (+130%) and for larger true effect sizes. We conclude that a lower P-value threshold for declaring statistical significance implies more exaggeration in an estimated effect. This implies that if a low threshold is used, effect size estimation should not be attempted, for example in the context of selecting promising discoveries that need further validation. Confirmatory studies, such as randomized controlled trials, might stick to the 0.05 threshold if adequately powered, while prediction modelling studies should use an even higher threshold, such as 0.2, to avoid strongly biased effect estimates.
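The testimation bias described above can be illustrated with a short calculation. The following is a minimal sketch, not the authors' own method: it assumes the estimate's z-score is normally distributed around the true effect (expressed in standard errors) with unit variance, that selection uses a two-sided test, and that "bias" means the conditional mean of significant estimates relative to the true effect. The function name `relative_testimation_bias` is hypothetical. Under these assumptions the results should land close to, though not necessarily exactly on, the percentages quoted in the abstract.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for pdf, cdf and quantiles


def relative_testimation_bias(true_effect_in_se, alpha):
    """Relative bias of an estimate that is reported only if two-sided P < alpha.

    Assumption (not from the source): the z-score of the estimate is
    Normal(true_effect_in_se, 1). The true effect must be non-zero.
    """
    d = true_effect_in_se
    c = STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical |z| for this threshold

    # Probability of a significant result in the upper and lower tail
    p_upper = 1 - STD_NORMAL.cdf(c - d)
    p_lower = STD_NORMAL.cdf(-c - d)

    # Partial expectations E[z; z > c] and E[z; z < -c]
    e_upper = d * p_upper + STD_NORMAL.pdf(c - d)
    e_lower = d * p_lower - STD_NORMAL.pdf(-c - d)

    # Mean z-score among significant results, then bias relative to the truth
    conditional_mean = (e_upper + e_lower) / (p_upper + p_lower)
    return (conditional_mean - d) / d


# True effect of 0.7 standard errors, as highlighted in the abstract
for alpha in (0.20, 0.05, 0.005):
    bias = relative_testimation_bias(0.7, alpha)
    print(f"P < {alpha}: relative bias {bias:+.0%}")
```

The closed-form partial expectations avoid simulation noise; a Monte Carlo version that draws estimates and keeps only the significant ones should give similar values.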