Hate speech detection is not as easy as you may think: A closer look at model validation (extended version), Information Systems



Information Systems (IF 3.0), Pub Date: 2020-06-30, DOI: 10.1016/j.is.2020.101584
Aymé Arango, Jorge Pérez, Barbara Poblete

Hate speech is an important problem that seriously affects the dynamics and usefulness of online social communities. Large-scale social platforms are currently investing significant resources into automatically detecting and classifying hateful content, without much success. On the other hand, the results reported by state-of-the-art systems indicate that supervised approaches achieve almost perfect performance, but only within specific datasets, most of them in English. In this work, we analyze this apparent contradiction between the existing literature and actual applications. We closely study the experimental methodology used in prior work and its generalizability to other datasets. Our findings reveal methodological issues, as well as an important dataset bias. As a consequence, the performance claims of the current state of the art have been significantly overestimated. The problems that we have found are mostly related to data overfitting and sampling issues. We discuss the implications for current research and re-conduct experiments to give a more accurate picture of the current state-of-the-art methods. Moreover, we design some baseline approaches to perform cross-lingual experiments, using English and Spanish datasets.
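The abstract does not spell out what the sampling issues are. As one generic illustration of how a sampling step can inflate reported scores, the sketch below assumes that the minority class is oversampled by plain duplication and compares two protocols on a toy corpus: oversampling before the train/test split versus splitting first and oversampling only the training part. The corpus, the labels, and the helper names `split`, `oversample`, and `leaked_ids` are all hypothetical; this is not the authors' code or experimental setup.

```python
import random


def split(items, test_fraction, rng):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample(items, minority_label, factor):
    """Duplicate every minority-class item so that it appears `factor` times."""
    out = []
    for item in items:
        copies = factor if item[1] == minority_label else 1
        out.extend([item] * copies)
    return out


def leaked_ids(train, test):
    """Ids of documents that occur in both the training and the test set."""
    return {doc_id for doc_id, _ in train} & {doc_id for doc_id, _ in test}


rng = random.Random(0)

# Toy corpus (an assumption): (document id, label), 20 "hateful" vs 180 "normal".
corpus = [(i, "hateful" if i < 20 else "normal") for i in range(200)]

# Flawed protocol: oversample first, then split.
# Copies of the same document can end up on both sides of the split.
train_a, test_a = split(oversample(corpus, "hateful", 3), 0.2, rng)

# Sounder protocol: split first, then oversample only the training part.
train_b, test_b = split(corpus, 0.2, rng)
train_b = oversample(train_b, "hateful", 3)

print("leaked ids (oversample, then split):", len(leaked_ids(train_a, test_a)))
print("leaked ids (split, then oversample):", len(leaked_ids(train_b, test_b)))
```

In the second protocol no document can appear on both sides, because duplication happens only after the test set has been set aside. In the first, any overlap consists of minority-class documents, which is exactly the class whose score the evaluation is meant to measure.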


Updated: 2020-07-01
