Swan Song Editorial.
Psychological Science ( IF 10.172 ) Pub Date : 2019-12-03 , DOI: 10.1177/0956797619893653
D. Stephen Lindsay

Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent. For example, a chapter by Lindsay and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large. For example, if Cohen’s d equals 0.50 for a between-subjects comparison and there are 20 subjects in each group, then about two thirds of the time, p will be greater than .05, a Type II error. Even if the sample size is 50 per condition, nearly one in three between-subjects experiments with an effect size (d) of 0.50 would yield nonsignificant results. Experiments in my lab were trying to detect a small mixed-model interaction with two or three dozen subjects per group. No wonder our results were inconsistent.
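
The two power figures in this paragraph can be checked with a short calculation. The sketch below is only an illustration under stated assumptions: it uses a normal (z) approximation rather than the exact t test, which slightly overstates power at small sample sizes, and it assumes equal group sizes and a two-sided test at alpha = .05. The function name `approx_power` is hypothetical and does not come from the editorial.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Assumption: a normal (z) approximation stands in for the t test,
    so the result is slightly optimistic when n_per_group is small.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)    # about 1.96 when alpha = .05
    shift = d * sqrt(n_per_group / 2)    # expected value of the test statistic
    # Probability that the statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (20, 50):
    miss_rate = 1 - approx_power(0.5, n)
    print(f"d = 0.50, n = {n} per group: P(p > .05) is roughly {miss_rate:.2f}")
```

Under this approximation the miss rate comes out near 0.65 with 20 subjects per group and near 0.29 with 50 per group, in line with the "about two thirds" and "nearly one in three" figures. An exact calculation based on the noncentral t distribution would give slightly higher miss rates.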


