"The results of t shows that P is between .02 and .05. The result must be judged significant, though barely so [...] we find... t=1.844 [with 13 df, P = 0.088]. The difference between the regression coefficients, though relatively large, cannot be regarded as significant." (Ronald A Fisher, "Statistical Methods for Research Workers", 1925)
"The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. [...] If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. Belief in the hypothesis as an accurate representation of the population sampled is confronted by the logical disjunction: Either the hypothesis is untrue, or the value of χ2 has attained by chance an exceptionally high value. The actual value of P obtainable from the table by interpolation indicates the strength of the evidence against the hypothesis. A value of χ2 exceeding the 5 per cent. point is seldom to be disregarded." (Ronald A Fisher, "Statistical Methods for Research Workers", 1925)
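Fisher's "1.96 or nearly 2" can be checked directly from the standard normal distribution; a minimal sketch using only Python's standard library (the variable names here are our own):

```python
from statistics import NormalDist

# Two-sided critical value at P = 0.05: the deviation z beyond which
# only 5% of the standard normal mass lies (2.5% in each tail).
z = NormalDist().inv_cdf(1 - 0.05 / 2)
print(round(z, 2))  # 1.96, "or nearly 2" as Fisher puts it
```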
"If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance." (Ronald A Fisher, 1926)
"In the examples we have given [...] our judgment whether P was small enough to justify us in suspecting a significant difference [...] has been more or less intuitive. Most people would agree [...] that a probability of .0001 is so small that the evidence is very much in favour [...] Suppose we had obtained P = 0.1. [...] Where, if anywhere, can we draw the line? The odds against the observed event which influence a decision one way or the other depend to some extent on the caution of the investigator. Some people (not necessarily statisticians) would regard odds of ten to one as sufficient. Others would be more conservative and reserve judgment until the odds were much greater. It is a matter of personal taste." (G U Yule & M G Kendall, "An Introduction to the Theory of Statistics" 14th ed., 1950)
"The attempts that have been made to explain the cogency of tests of significance in scientific research, by reference to hypothetical frequencies of possible statements, based on them, being right or wrong, thus seem to miss the essential nature of such tests. A man who 'rejects' a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection. This inequality statement can therefore be made. However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. Further, the calculation is based solely on a hypothesis, which, in the light of the evidence, is often not believed to be true at all, so that the actual probability of erroneous decision, supposing such a phrase to have any meaning, may be much less than the frequency specifying the level of significance." (Ronald A Fisher, "Statistical Methods and Scientific Inference", 1956)
"[...] blind adherence to the .05 level denies any consideration of alternative strategies, and it is a serious impediment to the interpretation of data." (James K Skipper Jr. et al, "The sacredness of .05: A note concerning the uses of statistical levels of significance in social science", The American Sociologist 2, 1967)
"The current obsession with .05 [...] has the consequence of differentiating significant research findings and those best forgotten, published studies from unpublished ones, and renewal of grants from termination. It would not be difficult to document the joy experienced by a social scientist when his F ratio or t value yields significance at .05, nor his horror when the table reads 'only' .10 or .06. One comes to internalize the difference between .05 and .06 as 'right' vs. 'wrong', 'creditable' vs. 'embarrassing', 'success' vs. 'failure'." (James K Skipper Jr. et al, "The sacredness of .05: A note concerning the uses of statistical levels of significance in social science", The American Sociologist 2, 1967)
"Rejection of a true null hypothesis at the 0.05 level will occur only one in 20 times. The overwhelming majority of these false rejections will be based on test statistics close to the borderline value. If the null hypothesis is false, the inter-ocular traumatic test ['hit between the eyes'] will often suffice to reject it; calculation will serve only to verify clear intuition." (Ward Edwards et al, "Bayesian Statistical Inference for Psychological Research", 1992)
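The "one in 20" false-rejection rate described above is easy to see by simulation; a sketch under a true null hypothesis, using a simple z-test with known variance (the setup and names are illustrative, not from any of the quoted sources):

```python
import random
from statistics import NormalDist, mean

random.seed(0)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 0.05 cutoff, ~1.96
n, trials = 30, 10_000

# Under a true null (samples drawn from mean 0, sd 1), a level-0.05
# test should reject in roughly 1 of every 20 trials.
rejections = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) * n ** 0.5  # z-statistic with known sd = 1
    if abs(z) > z_crit:
        rejections += 1
print(rejections / trials)  # close to 0.05
```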
"[...] they [confidence limits] are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!" (Jacob Cohen, "The earth is round (p < .05)", American Psychologist 49, 1994)
"After four decades of severe criticism, the ritual of null hypothesis significance testing—mechanical dichotomous decisions around a sacred .05 criterion—still persists. This article reviews the problems with this practice [...] 'What's wrong with [null hypothesis significance testing]? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!'" (Jacob Cohen, "The earth is round (p < .05)", American Psychologist 49, 1994)
"It’s a commonplace among statisticians that a chi-squared test (and, really, any p-value) can be viewed as a crude measure of sample size: When sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values." (Andrew Gelman, "The sample size is huge, so a p-value of 0.007 is not that impressive", 2009)
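Gelman's point, that the p-value tracks sample size as much as effect size, can be sketched by holding a tiny true effect fixed and letting n grow (again a z-test with known variance; the effect size and sample sizes are illustrative choices, not from the quoted post):

```python
import random
from statistics import NormalDist, mean

random.seed(1)

def z_test_p(sample):
    """Two-sided p-value of a z-test against mean 0, known sd = 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same tiny true effect (mean 0.05, sd 1) at three sample sizes:
# small n rarely rejects; huge n almost certainly does.
for n in (50, 5_000, 500_000):
    p = z_test_p([random.gauss(0.05, 1) for _ in range(n)])
    print(n, round(p, 4))
```

With n = 50 the signal is buried in noise; by n = 500,000 the p-value is effectively zero, even though the underlying effect never changed.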
"There is a growing realization that reported 'statistically significant' claims in statistical publications are routinely mistaken. Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation. The value of p (for 'probability') is a way of measuring the extent to which a data set provides evidence against a so-called null hypothesis. By convention, a p-value below 0.05 is considered a meaningful refutation of the null hypothesis; however, such conclusions are less solid than they appear." (Andrew Gelman & Eric Loken, "The Statistical Crisis in Science", American Scientist Vol. 102(6), 2014)