Call for “détente cordiale” between null hypothesis significance testing (NHST) and Bayesian a posteriori approaches
Ruberg SJ. Détente: A Practical Understanding of P values and Bayesian Posterior Probabilities. Clin Pharmacol Ther. 2021 Jun;109(6):1489-1498. {ABS-Link} Click for Details
» Null hypothesis significance testing (NHST) with its benchmark P value < 0.05 has long been a stalwart of scientific reporting and such statistically significant findings have been used to imply scientifically or clinically significant findings. Challenges to this approach have arisen over the past 6 decades, but they have largely been unheeded. There is a growing movement for using Bayesian statistical inference to quantify the probability that a scientific finding is credible. There have been differences of opinion between the frequentist (i.e., NHST) and Bayesian schools of inference, and warnings about the use or misuse of P values have come from both schools of thought spanning many decades. Controversies in this arena have been heightened by the American Statistical Association statement on P values and the further denouncement of the term “statistical significance” by others. My experience has been that many scientists, including many statisticians, do not have a sound conceptual grasp of the fundamental differences in these approaches, thereby creating even greater confusion and acrimony. If we let A represent the observed data, and B represent the hypothesis of interest, then the fundamental distinction between these two approaches can be described as the frequentist approach using the conditional probability pr(A | B) (i.e., the P value), and the Bayesian approach using pr(B | A) (the posterior probability). This paper will further explain the fundamental differences in NHST and Bayesian approaches and demonstrate how they can co-exist harmoniously to guide clinical trial design and inference. «
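The pr(A | B) versus pr(B | A) distinction is easiest to see on a toy calculation. The following minimal Python sketch (hypothetical numbers, not taken from Ruberg's paper) computes both quantities for a single-arm binomial trial with a flat prior:

```python
# A minimal sketch contrasting pr(A | B) with pr(B | A) for a single-arm
# binomial trial: 14 responders out of 20, null response rate 0.5.
# Numbers are hypothetical, chosen only for illustration.
from scipy import stats

n, x, p0 = 20, 14, 0.5

# Frequentist: pr(A | B) -- probability of data at least this extreme,
# assuming the null hypothesis (response rate = 0.5) is true.
p_value = stats.binom.sf(x - 1, n, p0)        # P(X >= 14 | rate = 0.5)

# Bayesian: pr(B | A) -- posterior probability that the true response rate
# exceeds 0.5, given the observed data and a flat Beta(1, 1) prior.
posterior = stats.beta(1 + x, 1 + n - x)      # Beta(15, 7)
post_prob = posterior.sf(p0)                  # P(rate > 0.5 | data)

print(f"one-sided P value     P(data at least this extreme | H0) = {p_value:.4f}")
print(f"posterior probability P(rate > 0.5 | data)               = {post_prob:.4f}")
```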
On the need for caution with P-values and SST
Stang A, Poole C, Kuss O. The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol. 2010 Apr;25(4):225-30. {ABS-Link} Click for Details
» Since its introduction into the biomedical literature, statistical significance testing (abbreviated as SST) caused much debate. The aim of this perspective article is to review frequent fallacies and misuses of SST in the biomedical field and to review a potential way out of the fallacies and misuses associated with SSTs. Two frequentist schools of statistical inference merged to form SST as it is practised nowadays: the Fisher and the Neyman-Pearson school. The P-value is both reported quantitatively and checked against the alpha-level to produce a qualitative dichotomous measure (significant/nonsignificant). However, a P-value mixes the estimated effect size with its estimated precision. Obviously, it is not possible to measure these two things with one single number. For the valid interpretation of SSTs, a variety of presumptions and requirements have to be met. We point here to four of them: study size, correct statistical model, correct causal model, and absence of bias and confounding. It has been stated that the P-value is perhaps the most misunderstood statistical concept in clinical research. As in the social sciences, the tyranny of SST is still highly prevalent in the biomedical literature even after decades of warnings against SST. The ubiquitous misuse and tyranny of SST threatens scientific discoveries and may even impede scientific progress. In the worst case, misuse of significance testing may even harm patients who eventually are incorrectly treated because of improper handling of P-values. For a proper interpretation of study results, both estimated effect size and estimated precision are necessary ingredients «
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016; 31: 337–350 {TXT-Link} Click for Details
» Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting «
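To make the point about "selecting analyses for presentation based on the P values they produce" concrete, here is a simulation sketch (my own, with hypothetical parameters, not from the guide): on pure-noise data, reporting only the smallest of ten p-values yields a "significant" finding in roughly 40% of studies.

```python
# Hedged illustration of the unstated-analysis-protocol problem: on null data,
# picking the smallest of several p-values for presentation produces
# "significant" results far more often than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n_per_group, n_outcomes = 5000, 30, 10
selected_significant = 0

for _ in range(n_sim):
    # Ten unrelated outcomes, no true group difference anywhere.
    a = rng.normal(size=(n_outcomes, n_per_group))
    b = rng.normal(size=(n_outcomes, n_per_group))
    p_values = stats.ttest_ind(a, b, axis=1).pvalue
    if p_values.min() < 0.05:          # report only the "best" analysis
        selected_significant += 1

print("nominal alpha: 0.05")
print(f"share of null 'studies' with at least one p < 0.05: "
      f"{selected_significant / n_sim:.2f}")   # roughly 0.40 (1 - 0.95**10)
```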
Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of Reporting P Values in the Biomedical Literature, 1990-2015. JAMA. 2016 Mar 15;315(11):1141-8 {TXT-Link} Click for Details
» In this analysis of P values reported in MEDLINE abstracts and in PMC articles from 1990-2015, more MEDLINE abstracts and articles reported P values over time, almost all abstracts and articles with P values reported statistically significant results, and, in a subgroup analysis, few articles included confidence intervals, Bayes factors, or effect sizes. Rather than reporting isolated P values, articles should include effect sizes and uncertainty metrics «
Vidgen B, Yasseri T. P-Values: Misunderstood and Misused. Frontiers Physics 2016; 4: Article 8 {TXT-Link} Click for Details
» P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms, a dimension that is often underplayed or ignored. We conclude by identifying practical steps to help remediate some of the concerns identified. We recommend that (i) far lower significance levels are used, such as 0.01 or 0.001, and (ii) p-values are interpreted contextually, and situated within both the findings of the individual study and the broader field of inquiry (through, for example, meta-analyses) «
» The p-value quantifies the probability of observing results at least as extreme as the ones observed given that the null hypothesis is true. It is then compared against a pre-determined significance level (α). If the reported p-value is smaller than α the result is considered statistically significant. «
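As a companion to the False Discovery Rate point above, the following simulation sketch (hypothetical parameters, not from the paper) shows that with a 0.05 threshold the share of false positives among "significant" findings can be far larger than 5% when most tested hypotheses are truly null:

```python
# Rough sketch of why the false discovery rate differs from the p-value
# threshold: with alpha = 0.05, the share of false positives among
# "significant" results depends on power and on how many tested
# hypotheses are truly null. All parameters below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, prop_true_effects, n_per_group, true_delta = 20000, 0.10, 30, 0.6

has_effect = rng.random(n_tests) < prop_true_effects      # 10% real effects
means = np.where(has_effect, true_delta, 0.0)

a = rng.normal(0.0, 1.0, size=(n_tests, n_per_group))
b = rng.normal(means[:, None], 1.0, size=(n_tests, n_per_group))
p = stats.ttest_ind(b, a, axis=1).pvalue

significant = p < 0.05
false_discoveries = significant & ~has_effect
fdr = false_discoveries.sum() / significant.sum()

print(f"alpha = 0.05, but false discovery rate among significant results = {fdr:.2f}")
```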
Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 2016;70: 129-133 {TXT-Link} Click for Details
» Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g. the sample mean difference between two compared groups) would be equal to or more extreme than its observed value «
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Kang J, Hong J, Esie P, Bernstein KT, Aral S. An Illustration of Errors in Using the P Value to Indicate Clinical Significance or Epidemiological Importance of a Study Finding. Sex Transm Dis. 2017 Aug; 44(8): 495–497 {TXT-Link} Click for Details
» A simulation study was carried out to illustrate that P values can suggest but not confirm statistical significance; and they may not indicate epidemiological significance (importance). We recommend that researchers consider reporting effect sizes and P values in conjunction with confidence intervals or point estimates with standard errors to indicate precision (uncertainty) «
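In the same spirit (my own numbers, not reproduced from Kang et al.), the sketch below shows that a clinically trivial difference can cross the 0.05 threshold in a very large sample while a much larger difference in a small sample does not:

```python
# Hypothetical example: the p-value reflects sample size as much as effect size.
from scipy import stats

# Same observed standard deviation (1.0); only the mean difference and n change.
trivial = stats.ttest_ind_from_stats(mean1=0.02, std1=1.0, nobs1=100_000,
                                     mean2=0.00, std2=1.0, nobs2=100_000)
large   = stats.ttest_ind_from_stats(mean1=0.80, std1=1.0, nobs1=12,
                                     mean2=0.00, std2=1.0, nobs2=12)

print(f"difference 0.02 SD, n = 100000 per arm: p = {trivial.pvalue:.4f}")  # ~1e-5
print(f"difference 0.80 SD, n = 12 per arm:     p = {large.pvalue:.4f}")    # ~0.06
```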
Amrhein V, Korner-Nievergelt F, Roth T. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. Peer J. 2017; 5: e3544 {TXT-Link} Click for Details
» The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment «
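The "conflicting studies" figure quoted above follows from simple arithmetic; a one-line check (assuming a true effect and exactly 80% power in both studies):

```python
# Probability that two independent, adequately powered studies "conflict":
# one is significant and the other is not.
power = 0.80
p_conflict = 2 * power * (1 - power)
print(f"P(one significant, one not) = {p_conflict:.2f}")   # 0.32, about one third
```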
Vetter TR. Descriptive Statistics: Reporting the Answers to the 5 Basic Questions of Who, What, Why, When, Where, and a Sixth, So What? Anesth Analg. 2017 Nov;125(5):1797-1802 {TXT-Link} Click for Details
» Descriptive statistics are specific methods basically used to calculate, describe, and summarize collected research data in a logical, meaningful, and efficient way. Descriptive statistics are reported numerically in the manuscript text and/or in its tables, or graphically in its figures. This basic statistical tutorial discusses a series of fundamental concepts about descriptive statistics and their reporting. The mean, median, and mode are 3 measures of the center or central tendency of a set of data. In addition to a measure of its central tendency (mean, median, or mode), another important characteristic of a research data set is its variability or dispersion (ie, spread). In simplest terms, variability is how much the individual recorded scores or observed values differ from one another. The range, standard deviation, and interquartile range are 3 measures of variability or dispersion. The standard deviation is typically reported for a mean, and the interquartile range for a median. Testing for statistical significance, along with calculating the observed treatment effect (or the strength of the association between an exposure and an outcome), and generating a corresponding confidence interval are 3 tools commonly used by researchers (and their collaborating biostatistician or epidemiologist) to validly make inferences and more generalized conclusions from their collected data and descriptive statistics. A number of journals, including Anesthesia & Analgesia, strongly encourage or require the reporting of pertinent confidence intervals. A confidence interval can be calculated for virtually any variable or outcome measure in an experimental, quasi-experimental, or observational research study design. Generally speaking, in a clinical trial, the confidence interval is the range of values within which the true treatment effect in the population likely resides. In an observational study, the confidence interval is the range of values within which the true strength of the association between the exposure and the outcome (eg, the risk ratio or odds ratio) in the population likely resides. There are many possible ways to graphically display or illustrate different types of data. While there is often latitude as to the choice of format, ultimately, the simplest and most comprehensible format is preferred. Common examples include a histogram, bar chart, line chart or line graph, pie chart, scatterplot, and box-and-whisker plot. Valid and reliable descriptive statistics can answer basic yet important questions about a research data set, namely: “Who, What, Why, When, Where, How, How Much?” «
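A small worked sketch of the descriptive statistics discussed above (hypothetical data): central tendency, dispersion, and a 95% confidence interval for the mean.

```python
# Hypothetical data set; values chosen only for illustration.
import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 6.0, 4.8, 7.2, 5.5, 6.8, 5.1, 4.9, 6.3])

mean, median = x.mean(), np.median(x)
sd = x.std(ddof=1)                        # standard deviation, reported with the mean
q1, q3 = np.percentile(x, [25, 75])       # interquartile range, reported with the median
sem = stats.sem(x)
ci_low, ci_high = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)

print(f"mean {mean:.2f} (SD {sd:.2f}), median {median:.2f} (IQR {q1:.2f}-{q3:.2f})")
print(f"95% CI for the mean: {ci_low:.2f} to {ci_high:.2f}")
```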
Kennedy-Shaffer L. When the Alpha is the Omega: P-Values, “Substantial Evidence,” and the 0.05 Standard at FDA. Food Drug Law J. 2017; 72(4): 595–635 {TXT-Link} Click for Details
» A test provides ranges of that test statistic for which the null hypothesis will be accepted or rejected. Within this hypothesis testing framework, the p-value, a number between 0 and 1, can be defined in several equivalent ways. The formulation most commonly used in the medical literature defines the p-value as the “probability, under the assumption of no effect or no difference, (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.” If an event is only of interest if it is more extreme in the same direction as the observed results (compared to the null hypothesis), then we use only that one-sided probability. More commonly, however, a two-sided probability is calculated that is agnostic to whether the more extreme event is in the same or opposite direction as the observed results. A (one-sided or two-sided) p-value is generally then compared to some pre-specified alpha level or significance level. If it is below the alpha level, the null hypothesis is rejected; if it is above the alpha level, the null hypothesis is not rejected. One can equivalently define the p-value, then, as the value of alpha for which the data would be on the border between rejecting and not rejecting the null hypothesis «
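A brief sketch (hypothetical data, not from the article) of the one-sided versus two-sided p-values described above, using a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(0.4, 1.0, 40)

# Two-sided: results at least as extreme in either direction count.
two_sided = stats.ttest_ind(treated, control).pvalue
# One-sided: only results at least as extreme in the pre-specified
# direction (here: treated mean greater than control mean) count.
one_sided = stats.ttest_ind(treated, control, alternative='greater').pvalue

print(f"two-sided p = {two_sided:.4f}")
print(f"one-sided p = {one_sided:.4f}")
```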
Leek J, McShane BB, Gelman A, Colquhoun D, Nuijten MB, Goodman SN. Five ways to fix statistics. Nature 2017;551:557-559 {TXT-Link}
van Rijn MHC, Bech A, Bouyer J, van den Brand JAJG. Statistical significance versus clinical relevance. Nephrol Dial Transplant. 2017 Apr 1;32(suppl-2):ii6-ii12 {TXT-Link} Click for Details
» In March this year, the American Statistical Association (ASA) posted a statement on the correct use of P-values, in response to a growing concern that the P-value is commonly misused and misinterpreted. We aim to translate these warnings given by the ASA into a language more easily understood by clinicians and researchers without a deep background in statistics. Moreover, we intend to illustrate the limitations of P-values, even when used and interpreted correctly, and bring more attention to the clinical relevance of study findings using two recently reported studies as examples. We argue that P-values are often misinterpreted. A common mistake is saying that P < 0.05 means that the null hypothesis is false, and P ≥0.05 means that the null hypothesis is true. The correct interpretation of a P-value of 0.05 is that if the null hypothesis were indeed true, a similar or more extreme result would occur 5% of the times upon repeating the study in a similar sample. In other words, the P-value informs about the likelihood of the data given the null hypothesis and not the other way around. A possible alternative related to the P-value is the confidence interval (CI). It provides more information on the magnitude of an effect and the imprecision with which that effect was estimated. However, there is no magic bullet to replace P-values and stop erroneous interpretation of scientific results. Scientists and readers alike should make themselves familiar with the correct, nuanced interpretation of statistical tests, P-values and CIs. «
Wellek S. A critical evaluation of the current “p-value controversy”. Biom J. 2017 Sep;59(5):854-872 {ABS-Link} Click for Details
» This article has been triggered by the initiative launched in March 2016 by the Board of Directors of the American Statistical Association (ASA) to counteract the current p‐value focus of statistical research practices that allegedly “have contributed to a reproducibility crisis in science.” It is pointed out that in the very wide field of statistics applied to medicine, many of the problems raised in the ASA statement are not as severe as in the areas the authors may have primarily in mind, although several of them are well‐known experts in biostatistics and epidemiology. This is mainly due to the fact that a large proportion of medical research falls under the realm of a well developed body of regulatory rules banning the most frequently occurring misuses of p‐values. Furthermore, it is argued that reducing the statistical hypotheses tests nowadays available to the class of procedures based on p‐values calculated under a traditional one‐point null hypothesis amounts to ignoring important developments having taken place and going on within the statistical sciences. Although hypotheses testing is still an indispensable part of the statistical methodology required in medical and other areas of empirical research, there is a large repertoire of methods based on different paradigms of inference that provide ample options for supplementing and enhancing the methods of data analysis blamed in the ASA statement for causing a crisis «
Gagnier JJ, Morgenstern H. Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing. J Bone Joint Surg Am. 2017 Sep 20;99(18):1598-1603 {ABS-Link} Click for Details
» The interpretation and reporting of p values and significance testing in biomedical research are fraught with misconceptions and inaccuracies. Publications of peer-reviewed research in orthopaedics are not immune to such problems. The American Statistical Association (ASA) recently published an official statement on the use, misuse, and misinterpretation of statistical testing and p values in applied research. The ASA statement discussed 6 principles: (1) “P-values can indicate how incompatible the data are with a specified statistical model.” (2) “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.” (3) “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” (4) “Proper inference requires full reporting and transparency.” (5) “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.” (6) “By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” The purpose of this article was to discuss these principles. We make several recommendations for moving forward: (1) Authors should avoid statements such as “statistically significant” or “statistically nonsignificant.” (2) Investigators should report the magnitude of effect of all outcomes together with the appropriate measure of precision or variation. (3) Orthopaedic residents and surgeons must be educated in biostatistics, the ASA principles, and clinical epidemiology. (4) Journal editors and reviewers need to be familiar with and enforce the ASA principles «
Katz JN, Losina E. Uses and Misuses of the P Value in Reporting Results of Orthopaedic Research Studies. J Bone Joint Surg Am. 2017 Sep 20;99(18):1507-1508 {TXT-Link} Click for Details
» We provide context for the ASA report and the article by Gagnier and Morgenstern in the form of 2 (fictitious) research abstracts that base their conclusions solely on a p value «
Blume JD, D’Agostino McGowan L, Dupont WD, Greevy, Jr RA. Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLoS One. 2018; 13(3): e0188299 {TXT-Link} Click for Details
» Verifying that a statistically significant result is scientifically meaningful is not only good scientific practice, it is a natural way to control the Type I error rate. Here we introduce a novel extension of the p-value—a second-generation p-value (pδ)–that formally accounts for scientific relevance and leverages this natural Type I Error control. The approach relies on a pre-specified interval null hypothesis that represents the collection of effect sizes that are scientifically uninteresting or are practically null. The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses. As such, second-generation p-values indicate when the data are compatible with null hypotheses (pδ = 1), or with alternative hypotheses (pδ = 0), or when the data are inconclusive (0<pδ<1). Moreover, second-generation p-values provide a proper scientific adjustment for multiple comparisons and reduce false discovery rates. This is an advance for environments rich in data, where traditional p-value adjustments are needlessly punitive. Second-generation p-values promote transparency, rigor and reproducibility of scientific results by a priori specifying which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with alternative or null hypotheses «
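A hedged sketch of the second-generation p-value idea as summarized above: the proportion of the interval estimate that overlaps a pre-specified interval null, with the correction for very wide (inconclusive) interval estimates described by Blume et al. The interval endpoints below are hypothetical.

```python
# Sketch of a second-generation p-value from interval endpoints.
def second_generation_p(est_lo, est_hi, null_lo, null_hi):
    est_len = est_hi - est_lo
    null_len = null_hi - null_lo
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))
    if est_len <= 2 * null_len:
        return overlap / est_len
    # Interval estimate more than twice as wide as the null zone:
    # the result is treated as inconclusive and capped at 1/2.
    return 0.5 * overlap / null_len

# Example with an interval null of "practically no effect" = (-0.1, 0.1):
print(second_generation_p(-0.05, 0.08, -0.1, 0.1))   # 1.0 -> compatible with the null
print(second_generation_p(0.15, 0.60, -0.1, 0.1))    # 0.0 -> compatible with alternatives
print(second_generation_p(-0.05, 0.30, -0.1, 0.1))   # between 0 and 1 -> inconclusive
```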
Schober P, Bossers SM, Schwarte LA. Statistical Significance Versus Clinical Importance of Observed Effect Sizes: What Do P Values and Confidence Intervals Really Represent? Anesth Analg. 2018 Mar;126(3):1068-1072 {TXT-Link} Click for Details
» Effect size measures are used to quantify treatment effects or associations between variables. Such measures, of which >70 have been described in the literature, include unstandardized and standardized differences in means, risk differences, risk ratios, odds ratios, or correlations. While null hypothesis significance testing is the predominant approach to statistical inference on effect sizes, results of such tests are often misinterpreted, provide no information on the magnitude of the estimate, and tell us nothing about the clinical importance of an effect. Hence, researchers should not merely focus on statistical significance but should also report the observed effect size. However, all samples are to some degree affected by randomness, such that there is a certain uncertainty on how well the observed effect size represents the actual magnitude and direction of the effect in the population. Therefore, point estimates of effect sizes should be accompanied by the entire range of plausible values to quantify this uncertainty. This facilitates assessment of how large or small the observed effect could actually be in the population of interest, and hence how clinically important it could be. This tutorial reviews different effect size measures and describes how confidence intervals can be used to address not only the statistical significance but also the clinical significance of the observed effect or association. Moreover, we discuss what P values actually represent, and how they provide supplemental information about the significant versus nonsignificant dichotomy. This tutorial intentionally focuses on an intuitive explanation of concepts and interpretation of results, rather than on the underlying mathematical theory or concepts «
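As a companion to the recommendation to report effect sizes with their uncertainty, here is a sketch (hypothetical data) that reports a mean difference with its 95% confidence interval alongside the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(50.0, 10.0, 45)
treated = rng.normal(44.0, 10.0, 45)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
df = len(treated) + len(control) - 2          # simple approximation; Welch df also possible
ci_lo, ci_hi = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se
p = stats.ttest_ind(treated, control).pvalue

print(f"mean difference {diff:.1f} (95% CI {ci_lo:.1f} to {ci_hi:.1f}), p = {p:.4f}")
```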
Gates S, Ealing E. Reporting and interpretation of results from clinical trials that did not claim a treatment difference: survey of four general medical journals. BMJ Open. 2019; 9(9): e024785 {TXT-Link} Click for Details
» The majority of trials (54.2%) inappropriately interpreted a result that was not statistically significant as indicating no treatment benefit. Very few studies interpreted the result as indicating a lack of evidence against the null hypothesis of zero difference between the trial arms «
Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p<0.05”. The American Statistician, 2019; 73(sup1): 1-19 {TXT-Link} Click for Details
Don’ts
- Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant” (i.e., the p-value passed some arbitrary threshold such as p < 0.05).
- Don’t believe that an association or effect exists just because it was statistically significant.
- Don’t believe that an association or effect is absent just because it was not statistically significant.
- Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
- Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).
Do’s
- Accept Uncertainty » In the real world, data provide a noisy signal. Variation, one of the causes of uncertainty, is everywhere. Exact replication is difficult to achieve. So it is time to get the right (statistical) gear and “move toward a greater acceptance of uncertainty and embracing of variation” (Gelman, 2016) «
Amrhein V, Greenland S, McShane B. Retire statistical significance. Nature 2019;567:305-307 {TXT-Link} Click for Details
» Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go «
Dunkler D, Haller M, Oberbauer R, Heinze G. To test or to estimate? P-values versus effect sizes. Transpl Int. 2020 Jan;33(1):50-55 {TXT-Link} Click for Details
» The P‐value measures the compatibility of the observed data with the null hypothesis. Technically, it expresses the probability with which, given the null hypothesis was true, data with an effect size as extreme as the observed one or more extreme than the observed one can be obtained. The P‐value cannot separate implausibility of the null hypothesis from implausibility of any of the assumptions: A small P‐value gives evidence that the data are not compatible with the specified model – encompassing the null hypothesis and all assumptions. Hence, the P‐value should be viewed as a continuous measure of compatibility of the data to the model ranging from 0 (complete incompatibility) to 1 (complete compatibility). Consequently, precise P‐values should be presented (e.g., P = 0.07 and not P = NS or P > 0.05). «
Calin-Jageman RJ, Cumming G. The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else is Known. Am Stat. 2019 ; 73(Suppl 1): 271–280 {TXT-Link} Click for Details
» The “New Statistics” emphasizes effect sizes, confidence intervals, meta-analysis, and the use of Open Science practices. We present 3 specific ways in which a New Statistics approach can help improve scientific practice: by reducing over-confidence in small samples, by reducing confirmation bias, and by fostering more cautious judgments of consistency. We illustrate these points through consideration of the literature on oxytocin and human trust, a research area that typifies some of the endemic problems that arise with poor statistical practice «
{18.Mai.2020 U: 06.Sep.2023 | ACPS-CdM}