Abstract
Fisher (1945a, 1945b, 1955, 1956, 1960) criticised the Neyman-Pearson approach to hypothesis testing by arguing that it relies on the assumption of “repeated sampling from the same population.” The present article considers the responses to this criticism provided by Pearson (1947) and Neyman (1977). Pearson interpreted alpha levels in relation to imaginary replications of the original test. This interpretation is appropriate when test users are sure that their replications will be equivalent to one another. However, by definition, scientific researchers do not possess sufficient knowledge about the relevant and irrelevant aspects of their tests and populations to be sure that their replications will be equivalent to one another. Pearson also interpreted the alpha level as a personal rule that guides researchers’ behavior during hypothesis testing. However, this interpretation fails to acknowledge that the same researcher may use different alpha levels in different testing situations. Addressing this problem, Neyman proposed that the average alpha level adopted by a particular researcher can be viewed as an indicator of that researcher’s typical Type I error rate. Researchers’ average alpha levels may be informative from a metascientific perspective. However, they are not useful from a scientific perspective. Scientists are more concerned with the error rates of specific tests of specific hypotheses than with the error rates of their colleagues. It is concluded that neither Neyman nor Pearson adequately rebutted Fisher’s “repeated sampling” criticism. Fisher’s significance testing approach is briefly considered as an alternative to the Neyman-Pearson approach.
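Neyman's proposal, as summarised above, can be illustrated with a small simulation. The following is a minimal sketch rather than anything taken from Neyman (1977) or from the present article. It assumes that every tested null hypothesis is true, that p values are therefore uniformly distributed on [0, 1], and that the researcher picks an alpha level at random for each test. The function name `simulate_type_i_rate` and the list of alpha levels are hypothetical and chosen only for illustration.

```python
import random


def simulate_type_i_rate(alphas, n_tests=100_000, seed=0):
    """Return the long-run proportion of false rejections for one researcher.

    Assumption: every null hypothesis is true and the test statistic is
    continuous, so each p value is a draw from Uniform(0, 1).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Hypothetical behaviour: a different alpha level may be used per test.
        alpha = rng.choice(alphas)
        p_value = rng.random()
        if p_value < alpha:
            rejections += 1  # a Type I error, because the null is true
    return rejections / n_tests


# Illustrative alpha levels only; they are not taken from the article.
alphas = [0.10, 0.05, 0.01, 0.005]
mean_alpha = sum(alphas) / len(alphas)
rate = simulate_type_i_rate(alphas)

print(f"average alpha level:        {mean_alpha:.5f}")
print(f"simulated Type I error rate: {rate:.5f}")
```

Under these assumptions the simulated rate should lie close to the average alpha level. The sketch also makes the article's objection concrete: the resulting figure describes the researcher's behaviour across many unrelated tests, not the error rate of any specific test of a specific hypothesis.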
Notes
The concept of an exact replication can be defined as requiring the duplication of either (a) all possible testing conditions or (b) only those testing conditions that could potentially affect the results of the study. For example, Rubin (2019) defined exact replications in the second way, as requiring “the duplication of all of the aspects of an original study that could potentially affect the results of that study.” This second definition implies that researchers are sure about which aspects of their study are relevant (i.e., “could potentially affect the results”) and which are irrelevant. Hence, it is similar to the concept of an equivalent replication that I discuss later. In the present article, I adopt the first, more common, definition of an exact replication that requires the duplication of “all possible testing conditions,” including both relevant and irrelevant conditions.
Following Spanos (2006), we can distinguish between statistical and substantive adequacy. Statistical adequacy occurs when a statistical model’s assumptions (e.g., normal, independent, and identically distributed data for a simple normal model) are sufficiently consistent with the observed data. Substantive adequacy occurs when the characteristics of the statistical model, sample, and testing methodology (e.g., sampling procedure, measures, and testing environment) are sufficiently consistent with a theoretical data generating process or “chance mechanism” (Neyman 1977, p. 99).
References
Barrett, L. F. (2015). Psychology is not in crisis. The New York Times, A23. https://www.nytimes.com/2015/09/01/opinion/psychology-is-not-in-crisis.html
Box, G. E. P., Hunter, J. S., & Hunter, W. G. (2005). Statistics for experimenters: Design, innovation and discovery (2nd ed.). Wiley.
Dennis, B., Ponciano, J. M., Taper, M. L., & Lele, S. R. (2019). Errors in statistical inference under model misspecification: Evidence, hypothesis testing, and AIC. Frontiers in Ecology and Evolution, 7, 372. https://doi.org/10.3389/fevo.2019.00372.
Fisher, R. A. (1945a). The logical inversion of the notion of the random variable. Sankhyā: The Indian Journal of Statistics, 7(2), 129–132. https://www.jstor.org/stable/25047836.
Fisher, R. A. (1945b). A new test for 2 × 2 tables. Nature, 156(3961), 388. https://doi.org/10.1038/156388a0.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society: Series B: Methodological, 17(1), 69–78. https://doi.org/10.1111/j.2517-6161.1955.tb00180.x.
Fisher, R. A. (1956). Statistical methods and scientific inference. Oliver & Boyd.
Fisher, R. A. (1958). The nature of probability. The Centennial Review, 2, 261–274. https://www.jstor.org/stable/23737535.
Fisher, R. A. (1960). Scientific thought and the refinement of human reasoning. Journal of the Operations Research Society of Japan, 3, 1–10. http://hdl.handle.net/2440/15278.
Hoijtink, H., Mulder, J., van Lissa, C., & Gu, X. (2019). A tutorial on testing hypotheses using the Bayes factor. Psychological Methods, 24(5), 539–556. https://doi.org/10.1037/met0000201.
Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory & Psychology, 14(3), 295–327. https://doi.org/10.1177/0959354304043638.
Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349. https://doi.org/10.5735/086.046.0501.
Johnstone, D. J. (1987). Tests of significance following R A Fisher. The British Journal for the Philosophy of Science, 38(4), 481–499. https://doi.org/10.1093/bjps/38.4.481.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A., Argamon, S. E., et al. (2018). Justify your alpha. Nature Human Behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x.
Lehmann, E. L. (2008). Reminiscences of a statistician: The company I kept. Springer Science & Business Media.
Machery, E. (2019). What is a replication? https://doi.org/10.31234/osf.io/8x7yn.
Neyman, J. (1937). X—Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380. https://doi.org/10.1098/rsta.1937.0005.
Neyman, J. (1952). Lectures and conferences on mathematical statistics and probability. U.S. Department of Agriculture. http://hdl.handle.net/2027/mdp.39015007297982
Neyman, J. (1955). The problem of inductive inference. Communications on Pure and Applied Mathematics, 8, 13–46.
Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36, 97–131. https://doi.org/10.1007/BF00485695.
Neyman, J., & Pearson, E. S. (1933). IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009.
Nosek, B. A., & Errington, T. M. (2020). What is replication? PLoS Biology, 18(3), e3000691. https://doi.org/10.1371/journal.pbio.3000691.
Pearson, E. S. (1947). The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table. Biometrika, 34(1/2), 139–167. https://doi.org/10.2307/2332518.
Perezgonzalez, J. D. (2015). Confidence intervals and tests are two sides of the same research question. Frontiers in Psychology, 6, 34. https://doi.org/10.3389/fpsyg.2015.00034.
Redish, D. A., Kummerfeld, E., Morris, R. L., & Love, A. C. (2018). Reproducibility failures are essential to scientific inquiry. Proceedings of the National Academy of Sciences, 115(20), 5042–5046. https://doi.org/10.1073/pnas.1806370115.
Rubin, M. (2017). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman-Pearson approach. Review of General Psychology, 21, 321–329. https://doi.org/10.1037/gpr0000135.
Rubin, M. (2019). What type of Type I error? Contrasting the Neyman-Pearson and Fisherian approaches in the context of exact and direct replications. Synthese. https://doi.org/10.1007/s11229-019-02433-0.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13(2), 90–100. https://doi.org/10.1037/a0015108.
Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510. https://doi.org/10.1146/annurev-psych-122216-011845.
Spanos, A. (2006). Where do statistical models come from? Revisiting the problem of specification. Optimality, 49, 98–119. https://doi.org/10.1214/074921706000000419.
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59–71. https://doi.org/10.1177/1745691613514450.
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41, e120. https://doi.org/10.1017/s0140525x17001972.
Additional information
This article belongs to the Topical Collection: Philosophical Perspectives on the Replicability Crisis
Guest Editors: Mattia Andreoletti, Jan Sprenger
About this article
Cite this article
Rubin, M. “Repeated sampling from the same population?” A critique of Neyman and Pearson’s responses to Fisher. Euro Jnl Phil Sci 10, 42 (2020). https://doi.org/10.1007/s13194-020-00309-6
DOI: https://doi.org/10.1007/s13194-020-00309-6