Roman Urdu toxic comment classification, Language Resources and Evaluation

Roman Urdu toxic comment classification
Language Resources and Evaluation (IF 1.7), Pub Date: 2021-01-29, DOI: 10.1007/s10579-021-09530-y
Hafiz Hassaan Saeed, Muhammad Haseeb Ashraf, Faisal Kamiran, Asim Karim, Toon Calders

With the increasing popularity of user-generated content on social media, the number of toxic texts is also on the rise. Such texts have adverse effects on users and on society at large, so identifying toxic comments is a growing need. While toxic comment classification has been studied for resource-rich languages like English, no work has been done for Roman Urdu, despite it being a widely used language on social media in South Asia. This paper addresses the challenge of Roman Urdu toxic comment detection by developing the first large labeled corpus of toxic and non-toxic comments. The corpus, called RUT (Roman Urdu Toxic), contains over 72 thousand comments collected from popular social media platforms and has been labeled manually with strong inter-annotator agreement. With this dataset, we train several classification models to detect Roman Urdu toxic comments, including classical machine learning models with the bag-of-words representation and some recent deep models based on word embeddings. Despite the success of the latter in classifying toxic comments in English, the absence of pre-trained word embeddings for Roman Urdu prompted us to generate different word embeddings using the GloVe, Word2Vec and FastText techniques and to compare them with task-specific word embeddings learned inside the classification task. Finally, we propose an ensemble approach that reaches our best F1-score of 86.35%, setting the first benchmark for toxic comment classification in Roman Urdu.
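As a rough illustration of three ideas named in the abstract (the bag-of-words representation, an ensemble of classifiers, and the F1-score), here is a minimal Python sketch using only the standard library. The abstract does not say how the authors combine their models or how they average F1, so the majority-vote rule, the toxic-class F1, the function names, and all labels and predictions below are assumptions made for illustration. They are not the paper's method or data.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a toy tokenizer)."""
    return Counter(text.lower().split())


def majority_vote(predictions):
    """Combine per-model label lists (0 = non-toxic, 1 = toxic) by strict majority.

    Assumption: simple voting; the paper's actual ensemble may differ.
    """
    combined = []
    for votes in zip(*predictions):
        combined.append(1 if sum(votes) * 2 > len(votes) else 0)
    return combined


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (toxic) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions from three hypothetical models.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 1]
model_b = [1, 0, 1, 0, 0, 0]
model_c = [0, 0, 1, 1, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])

if __name__ == "__main__":
    print(bag_of_words("yeh ek ek misaal hai"))
    print("model_a F1:", f1_score(y_true, model_a))
    print("ensemble F1:", f1_score(y_true, ensemble))
```

In this toy setup each model makes different mistakes, so the vote recovers labels that no single model gets entirely right. That is the usual motivation for ensembling, although nothing here reproduces the reported 86.35%.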

Updated: 2021-01-31
