Open Research in Artificial Intelligence and the Search for Common Ground in Reproducibility: A Commentary on “(Why) Are Open Research Practices the Future for the Study of Language Learning?”
Language Learning (IF 5.240), Pub Date: 2023-05-22, DOI: 10.1111/lang.12582
Odd Erik Gundersen, Kevin Coakley

Open research has a long tradition in the field of artificial intelligence (AI), which is our primary area of expertise. Richard Stallman, who has been affiliated with the AI laboratory at the Massachusetts Institute of Technology since the early 1970s, launched the GNU project in 1983 and the Free Software Foundation in 1985. The goal of the free software movement has been to secure the freedoms of software users to run, study, modify, and share software. GNU software grants these rights through licenses that allow anyone to read the code but prevent anyone from distributing modified versions of the software without sharing those changes. The open data movement in AI was spearheaded by the Machine Learning Repository, created in 1987 by David Aha and fellow graduate students at the University of California, Irvine. This repository still hosts a collection of datasets that can be used for machine learning. One of the first digital-first scientific journals was the Journal of Artificial Intelligence Research (JAIR), established in 1993 on the initiative of Steven Minton. The journal is an open access, peer-reviewed scientific publication and has been community driven since its inception. It has no publishing fees, and all expenses have been covered by donations. Since it is hosted online, it supports publishing digital source material, such as code and data.

AI research is a young science that is continuously seeking to improve research methodology and the quality of the published research. Although there currently is a movement towards publishing research in journals, a substantial number of scientific articles in AI are still published through conference proceedings. The conferences with the highest impact, such as those of the Association for the Advancement of Artificial Intelligence, Neural Information Processing Systems, International Conference on Machine Learning, and International Joint Conference on Artificial Intelligence, are community driven, and the articles presented and published in these venues are open access. Some of the proceedings are published by the Journal of Machine Learning Research, established as an open access alternative to the journal Machine Learning in 2001 to allow authors to publish for free and retain copyright. All these venues also promote and facilitate public sharing of research artifacts.

Among many open research practices in our field of expertise, some of the most impactful have targeted research reproducibility. In this commentary, we have therefore focused on reproducibility, in the hopes that researchers in language sciences might benefit from the experience of AI scholars. One recent initiative in AI research involved reproducibility checklists introduced at all the most impactful AI conferences to improve the rigor of the research presented and published there. These checklists must be completed by all authors when submitting articles to conferences, and they cover various aspects of research methodology, including whether data and code are shared. The checklists have been introduced as a response to the reproducibility crisis and in recognition of the field's challenges with methodological rigor. Reproducibility badges have also been introduced at several conferences and journals, and soon in JAIR as well (Gundersen, Helmert, & Hoos, 2023). The badges indicate whether the research artifacts, such as data and code, that are required for reproducing the research have been shared. In some cases, reviewers evaluate the artifacts as well, which could earn the authors another badge if the reviewers are able to reproduce the research. However, this is a considerable task, recognized by many as too much to ask of reviewers. Instead, AI scholars now organize reproducibility challenges, with the idea of designating a separate track at a conference or a workshop where the goal is to attempt to reproduce a scientific article of choice and write a report on this effort. Some of these reports have been published in the community-driven open access journal ReScience C. One issue with these initiatives is that the results of the replication efforts are not linked to the original scientific article. To address this shortcoming, a new procedure is currently being introduced at JAIR, where reports documenting the effort by third parties to reproduce research are published in the journal alongside the article that is being reproduced. This closes the gap between the reproducibility effort and the original work, in the sense that high-quality research that is easy to reproduce will get credit, and readers will be made aware of research that is not easily reproducible. JAIR ensures that the original authors get to provide feedback on reproducibility reports and that any mistakes or misunderstandings by the third-party researchers are corrected.

One challenge to reproducibility is conceptual. The term reproducibility has been described as confused by Plesser (2018), and we agree. We believe this confusion arises because the term has been defined without any attempt to operationalize it at the same time. Hence, we have tried to define reproducibility in such a way that the concept becomes operationalizable. We use machine learning to illustrate our reasoning because it is our domain, which we know well, and because machine learning is a branch of computer science, so most of its experiments can be fully described in code and automated. We believe this is a strength, as it allows us to be specific about what an experiment is and what reproducibility should then mean. However, we think that this definition of reproducibility is generalizable to all sciences.

In Gundersen (2021), reproducibility is defined as “the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators” (p. 10). The documentation used to conduct a reproducibility experiment determines which reproducibility type the experiment belongs to, and the way the conclusion is reached determines the degree to which the experiment reproduces the conclusion.

An experiment can be documented in many ways. Traditionally, experiments have been documented only as text, and this is still the case for a large portion of all published studies, because this is the only way to document research in many settings. However, experiments do not have to be documented only as text; for example, if data collection is carried out automatically or if data analysis is performed computationally, both the data and the analysis can be shared. In computer science and AI research, most experiments can be fully automated and executed by a computer, which means that the complete experiments can be shared. The reproducibility type of an experiment is defined by which of these artifacts are shared with independent investigators replicating the initial study.

Based on which documentation is shared with independent investigators, we propose four different types of reproducibility experiments (a minimal code sketch follows the list):
  • R1 description: when only the textual description of the experiment is made available;
  • R2 code: when the textual description and the code used to conduct the experiment are shared;
  • R3 data: when the textual description and the data used in the experiment are published; and finally
  • R4 experiment: when the textual description, code, and data are all made available by the original investigators.
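
The classification can be made concrete in a few lines of code. The sketch below is our own minimal illustration, assuming an experiment is documented by some subset of three artifacts (textual description, code, and data); the class and function names are hypothetical and not part of the original framework.

```python
# Minimal sketch (our illustration, not from the original article) of the four
# reproducibility types, assuming an experiment is documented by up to three artifacts.
from dataclasses import dataclass

@dataclass
class SharedDocumentation:
    description: bool  # textual description of the experiment
    code: bool         # code used to conduct the experiment
    data: bool         # data used in the experiment

def reproducibility_type(doc: SharedDocumentation) -> str:
    """Classify a reproducibility experiment by the documentation shared."""
    if not doc.description:
        # All four types assume a textual description; without it, only limited
        # verification (not validation) is possible, as discussed below.
        raise ValueError("A textual description is required to validate the experiment.")
    if doc.code and doc.data:
        return "R4 experiment"
    if doc.data:
        return "R3 data"
    if doc.code:
        return "R2 code"
    return "R1 description"

# Example: only the description and code are shared -> R2.
print(reproducibility_type(SharedDocumentation(description=True, code=True, data=False)))
```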

We will emphasize two important points. First, what sharing code entails differs between a machine learning experiment, where the complete research protocol can be reproduced (from data collection to analysis) and the study's conclusion can be reached if all statistical and analytical criteria are satisfied, and a medical experiment, where often only the digitized data and analysis can be shared. Second, the textual description is important. Although the code and data could be shared without sharing a study's textual description, this is not enough. To validate whether the right experiment is carried out, independent investigators need the textual description. Validation includes but is not limited to evaluating whether the experiment tests the hypothesis and whether the results are analyzed in an appropriate way. If the textual description is lacking, only some verification can be done, such as checking whether the code produces the expected result given the data.
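
To illustrate what it means for a machine learning experiment to be fully described in code, consider the following self-contained sketch. It is our own toy example, assuming a standard scikit-learn setup (the dataset, model, and metric are arbitrary stand-ins), not an experiment from the commentary. Sharing such a script together with its textual description would correspond, roughly, to an R4 experiment, since the code also retrieves the data.

```python
# Toy example (ours): a machine learning experiment in which every step, from
# data loading to analysis, is captured in code and can therefore be shared verbatim.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42  # fixing the seed makes the run deterministic up to library/hardware variation

def run_experiment(seed: int = SEED) -> float:
    X, y = load_iris(return_X_y=True)                     # "data collection"
    X_train, X_test, y_train, y_test = train_test_split(  # a study design decision
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000, random_state=seed)
    model.fit(X_train, y_train)                           # the experiment itself
    return accuracy_score(y_test, model.predict(X_test))  # the analysis

if __name__ == "__main__":
    print(f"accuracy = {run_experiment():.3f}")
```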

We propose three different degrees of reproducibility experiments depending on how the conclusion is reached: (a) outcome reproducible (OR), when a reproducibility experiment produces the exact same outcome (finding) as the initial experiment; (b) analysis reproducible (AR), when the same analysis is carried out as in the initial experiment to reach the same conclusion but with a different (nonidentical) finding; and (c) interpretation reproducible (IR), when the interpretation of the analysis is the same, so that the same conclusion is reached even though the analysis differs from the initial one. Here, too, let us emphasize three essential points. First, we do not distinguish between the terms reproducibility and replicability but instead cover the same concepts through introducing degrees of reproducibility. Second, in many scientific areas, a study's products or outcomes are described as data. However, in other areas, especially in machine learning, data are often an input to a study. To avoid ambiguity, we use the term outcome to refer to the finding of a study, and we use data to describe a study's input or stimuli, such as the images containing objects (data) that are classified in a visual perception task. This use of the term “data” appears to map loosely onto “materials” in language-focused studies. Third, outcome reproducibility is basically impossible in noncomputational experiments, and if it is achieved, this is only spurious. The same often holds for highly complex computational experiments. However, the distinction is important for properly understanding the concept of reproducibility.
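
A minimal sketch, again our own and purely illustrative, of how the three degrees could be operationalized when comparing a reproducibility experiment with the initial experiment it targets:

```python
# Hypothetical encoding (ours) of the three degrees of reproducibility.
def reproducibility_degree(same_outcome: bool, same_analysis: bool,
                           same_conclusion: bool) -> str:
    """Return OR, AR, or IR following the definitions in the text."""
    if not same_conclusion:
        return "not reproduced"                   # the conclusion was not confirmed
    if same_outcome:
        return "OR (outcome reproducible)"        # identical finding
    if same_analysis:
        return "AR (analysis reproducible)"       # same analysis, nonidentical finding
    return "IR (interpretation reproducible)"     # different analysis, same conclusion

# Example: same conclusion reached with the same analysis but a slightly
# different numerical result -> analysis reproducible.
print(reproducibility_degree(same_outcome=False, same_analysis=True, same_conclusion=True))
```

In practice, judging whether the analysis and the conclusion are "the same" requires human interpretation; the boolean flags here are a deliberate simplification.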

Marsden and Morgan-Short raise the issue of replicating highly influential older studies that used methods and analytical practices that do not reflect the current field-specific standards. The degrees of reproducibility, as described here, illustrate this situation. Let us explain. When trying to reproduce a highly influential older study, one could choose to use the out-of-date methods or analytical practices to validate the conclusions of the initial experiment, or one could opt for new methods and analytical practices. The experiment would be analysis reproducible (AR) if researchers reach the same conclusions by relying on old analytical practices. In contrast, the experiment would be interpretation reproducible (IR) if researchers reach the same conclusion by modernizing their analytical practices.

Marsden and Morgan-Short also remark on the difficulty of reproducing the initial study's finding when the full materials are not provided by the original author. The full materials may not be required to reproduce an experiment. This is captured through the various types of reproducibility. If only a textual description is made available by the original investigators (i.e., in an R1 description reproducibility experiment), but the independent investigators use the same analytical methods to reach the same conclusion, then this reproducibility experiment is analysis reproducible (R1AR). However, if new analytical practices are used in the same situation, then the reproducibility experiment will be classified as interpretation reproducible (R1IR).
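
For completeness, a tiny, hypothetical helper (ours, not from the commentary) that combines a reproducibility type with a degree to produce the compact labels used above, such as R1AR and R1IR:

```python
# Hypothetical helper (ours) composing the compact labels used in the text.
def reproducibility_label(r_type: str, degree: str) -> str:
    """r_type in {"R1", "R2", "R3", "R4"}; degree in {"OR", "AR", "IR"}."""
    assert r_type in {"R1", "R2", "R3", "R4"} and degree in {"OR", "AR", "IR"}
    return f"{r_type}{degree}"

# Only a textual description was shared: same analysis, same conclusion -> R1AR;
# modernized analysis, same conclusion -> R1IR.
print(reproducibility_label("R1", "AR"), reproducibility_label("R1", "IR"))
```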

Marsden and Morgan-Short explained that, in studies on language learning, replication results were less likely to support the initial finding when materials were not provided. This could mean that the conclusions of the initial experiment do not generalize well. The type of reproducibility study also reflects the generalizability of an experiment, going from R4 (least generalizable) to R1 (most generalizable). For instance, the two situations described above, namely, analysis and interpretation reproducible experiments based on a textual description only, would be classified as R1 (most generalizable). In contrast, when an experiment can only be reproduced with full materials, its conclusions might not be as generalizable as those from an experiment whose findings can be reproduced from a textual description alone. In AI research, the original investigators are in fact incentivized to share fewer study materials, because this increases the effort other researchers must invest to reproduce those findings, while any successful reproduction then carries the highest degree of generalization possible. Although this strategy might be attractive to individual researchers, it ultimately represents an antisocial practice with respect to the research community: it makes third parties less likely to reproduce a given finding, so it is a net loss for the community (for more detail, see Gundersen, 2019).

To further improve the understanding of reproducibility, we have not only surveyed the existing literature for variables that can lead to a lack of reproducibility but also analyzed how these variables affect the various degrees of reproducibility (Gundersen, Coakley, et al., 2023). Among the various sources of irreproducibility, we have identified study design variables, algorithmic variables, implementation variables, observation variables, evaluation variables, and documentation variables. Understanding these sources of irreproducibility will help researchers to operationalize reproducibility research by highlighting links between a given study's degree of reproducibility and the various design decisions that allow the study to achieve that reproducibility. For example, if researchers try to reproduce an experiment and cannot achieve the degree of analysis reproducibility when evaluating a study's outcomes, they could identify the various potential sources of irreproducibility affecting their analysis. We believe that it could be very useful for scholars in other sciences, including the language sciences, to identify the variables that can cause experiments to be irreproducible. This will not only help increase researchers' methodological rigor but also enhance their understanding of why reproducibility experiments sometimes fail.
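
As a sketch of how such a catalogue of variables might be used in practice, the snippet below organizes the six categories named above into a simple checklist. The category names come from the text; the example variables listed under each category are our own hypothetical illustrations, not items taken from Gundersen, Coakley, et al. (2023).

```python
# Illustrative checklist (ours): category names from the text, example variables hypothetical.
IRREPRODUCIBILITY_SOURCES = {
    "study design variables":   ["choice of dataset split", "sample size"],
    "algorithmic variables":    ["random seed", "nondeterministic optimization"],
    "implementation variables": ["library version", "floating-point precision"],
    "observation variables":    ["data collection conditions", "measurement noise"],
    "evaluation variables":     ["metric choice", "statistical test used"],
    "documentation variables":  ["unreported hyperparameters", "missing protocol details"],
}

def report_missing_controls(controlled: set[str]) -> dict[str, list[str]]:
    """List, per category, the example variables a replication has not yet controlled."""
    return {category: [v for v in variables if v not in controlled]
            for category, variables in IRREPRODUCIBILITY_SOURCES.items()}

# Example: a replication that has only fixed the random seed and the library version.
print(report_missing_controls({"random seed", "library version"}))
```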


