

Fine-grained analysis of language varieties and demographics
Natural Language Engineering (IF 2.5), Pub Date: 2020-03-10, DOI: 10.1017/s1351324920000108
Francisco Rangel, Paolo Rosso, Wajdi Zaghouani, Anis Charfi

The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous helps users avoid censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users, and this knowledge could also be useful in domains beyond security and forensics, such as marketing. In this paper, we focus on a fine-grained analysis of language varieties while also considering the authors’ demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with that of the best performing teams in the Author Profiling task at PAN 2017, obtaining an average accuracy of 92.08% versus 91.84% for the best performing team. We also analyse the relationship between language variety identification and the authors’ gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We also investigated the effect of the authors’ age and gender on the identification of the different Arabic varieties, as well as the effect of corpus size on the performance of our method.


Updated: 2020-03-10
