Statistical Significance Testing for Natural Language Processing
Computational Linguistics (IF 3.7). Pub Date: 2020-10-20. DOI: 10.1162/coli_r_00388
Edwin D. Simpson

Like any other science, research in natural language processing (NLP) depends on the ability to draw correct conclusions from experiments. A key tool for this is statistical significance testing: we use it to judge whether a result provides meaningful, generalizable findings or should be taken with a pinch of salt. When comparing new methods against others, performance metrics often differ by only small amounts, so researchers turn to significance tests to show that improved models are genuinely better. Unfortunately, this reasoning often fails, either because we choose inappropriate significance tests or carry them out incorrectly, making their outcomes meaningless, or because the test we use fails to indicate a significant result where a more appropriate test would find one. NLP researchers must avoid these pitfalls to ensure that their evaluations are sound, and ultimately to avoid wasting time and money on incorrect conclusions.

This book guides NLP researchers through the whole process of significance testing, making it easy to select the right kind of test by matching canonical NLP tasks to specific significance testing procedures. As well as being a handbook for researchers, the book provides theoretical background on significance testing, includes new methods that address the problems significance tests face in the world of deep learning and multi-dataset benchmarks, and describes the open research problems of significance testing for NLP.

The book focuses on the task of comparing one algorithm with another. At the core of this comparison is the p-value: the probability that a difference at least as extreme as the one we observed could occur by chance. If the p-value falls below a predetermined threshold, the result is declared significant. Leaving aside the fundamental limitation of turning the validity of results into a binary question with an arbitrary threshold, the p-value must be computed in the right way for the significance test to be valid.

The book describes the two crucial properties of an appropriate significance test: it must be both valid and powerful. Validity refers to the avoidance of Type I errors, in which a result is incorrectly declared significant; common mistakes that lead to Type I errors include deploying tests that make incorrect assumptions, such as independence between data points. The power of a test refers to its ability to detect a significant result, and therefore to avoid Type II errors; here, knowledge of the data and the experiment must be used to choose a test that makes the correct assumptions. There is a trade-off between validity and power, but for the most common NLP tasks (language modeling, sequence labeling, translation, etc.), there are clear choices of tests that strike a good balance.
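To make the p-value concrete for the kind of comparison the book discusses, the following is a minimal sketch, not taken from the book itself, of the paired bootstrap test: a sampling-based procedure widely used in NLP to compare two systems scored on the same test set. The function name, the resampling budget, and the simulated scores are illustrative assumptions, and the one-sided "shift" criterion assumes system A is the apparently better one.

import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap test for the difference between two systems.

    scores_a, scores_b: per-example scores (e.g., 0/1 correctness) of
    system A and system B on the *same* test set, so the pairing between
    examples is preserved. Assumes A is the apparently better system.
    """
    rng = np.random.default_rng(seed)
    deltas = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = deltas.mean()  # observed mean score difference
    n = len(deltas)

    # Resample test examples with replacement. The bootstrap distribution of
    # the mean difference is centred near the observed value; shifting it by
    # that value approximates the null distribution of "no real difference",
    # so we count how often a resampled difference reaches twice the observed one.
    boot_means = np.empty(n_resamples)
    for b in range(n_resamples):
        sample = rng.integers(0, n, size=n)
        boot_means[b] = deltas[sample].mean()
    p_value = float(np.mean(boot_means >= 2.0 * observed))
    return observed, p_value

# Illustrative usage on simulated correctness scores (not real results):
rng = np.random.default_rng(1)
baseline = (rng.random(500) < 0.70).astype(float)  # ~70% accurate baseline
improved = (rng.random(500) < 0.74).astype(float)  # ~74% accurate new system
delta, p = paired_bootstrap_pvalue(improved, baseline)
print(f"mean difference = {delta:.3f}, p = {p:.4f}")

Note that even this test rests on an assumption, namely that test examples are sampled independently; checking such assumptions before trusting a p-value is precisely the kind of discipline the book advocates.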
