A Benchmark Study of the Contemporary Toxicity Detectors on Software Engineering Interactions
arXiv - CS - Software Engineering Pub Date : 2020-09-20 , DOI: arxiv-2009.09331
Jaydeb Sarker, Asif Kamal Turzo, Amiangshu Bosu

Automated filtering of toxic conversations may help an open-source software (OSS) community maintain healthy interactions among project participants. Although several general-purpose tools exist to identify toxic content, they may incorrectly flag words commonly used in the Software Engineering (SE) context as toxic (e.g., 'junk', 'kill', and 'dump'), and vice versa. To address this challenge, the CMU Strudel Lab proposed an SE-specific tool (referred to as `STRUDEL' hereinafter) that combines the output of the Perspective API with the output of a customized version of Stanford's Politeness detector. However, since STRUDEL's evaluation was limited to only 654 SE texts, its practical applicability is unclear. Therefore, this study aims to empirically evaluate the STRUDEL tool as well as four state-of-the-art general-purpose toxicity detectors on a large-scale SE dataset. To this end, we empirically developed a rubric to manually label toxic SE interactions. Using this rubric, we manually labeled a dataset of 6,533 code review comments and 4,140 Gitter messages. The results of our analyses suggest significant degradation of all tools' performance on our datasets. The degradation was significantly higher on our dataset of formal SE communication, such as code review, than on our dataset of informal communication, such as Gitter messages. Two of the models from our study showed significant performance improvements during 10-fold cross-validation after we retrained them on our SE datasets. Based on our manual investigation of incorrectly classified texts, we identified several recommendations for developing an SE-specific toxicity detector.
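The abstract describes STRUDEL as combining a toxicity score (from the Perspective API) with a politeness score (from Stanford's Politeness detector), while handling SE jargon such as 'junk', 'kill', and 'dump'. The following is a minimal, hypothetical sketch of such a score-combining rule; the thresholds, the allow-list, and the `combine_scores` function are illustrative assumptions, not STRUDEL's actual implementation.

```python
# Hypothetical sketch (NOT STRUDEL's code): combine a toxicity score, as a
# Perspective-API-style probability, with a politeness score, flagging text
# as toxic only when toxicity is high AND politeness is low, and discounting
# SE jargon that general-purpose detectors often misflag.
SE_BENIGN = {"junk", "kill", "dump"}  # words commonly benign in SE contexts

def combine_scores(text, toxicity, politeness,
                   tox_threshold=0.5, polite_threshold=0.3):
    """Return True when the text should be flagged as toxic.

    toxicity and politeness are assumed to be scores in [0, 1] produced
    by upstream detectors; the thresholds are illustrative.
    """
    words = set(text.lower().split())
    # If the text contains SE jargon, require a stronger toxicity signal.
    if words & SE_BENIGN:
        tox_threshold += 0.2
    return toxicity >= tox_threshold and politeness <= polite_threshold

# Usage: SE jargon with moderate toxicity and high politeness is not flagged,
# while a clearly rude message with low politeness is.
print(combine_scores("kill the zombie process", 0.6, 0.8))  # False
print(combine_scores("you are an idiot", 0.9, 0.1))         # True
```

The design choice here, raising the toxicity threshold when domain jargon is present, is one simple way to reduce the false positives the abstract attributes to general-purpose detectors on SE text.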

Updated: 2020-09-22