A real-time deep-learning approach for filtering Arabic low-quality content and accounts on Twitter
Information Systems (IF 3.0), Pub Date: 2021-02-12, DOI: 10.1016/j.is.2021.101740
Reem Alharthi, Areej Alhothali, Kawthar Moria

Social networks have generated immense amounts of data that have been successfully used for research and business purposes. The approachability and immediacy of social media have also allowed ill-intentioned users to carry out harmful activities, including spamming, promoting, and phishing. These activities generate massive amounts of low-quality content, often duplicate, automated, inappropriate, or irrelevant, that reduces users' satisfaction and poses a significant challenge for other social media-based systems. Several real-time systems have been developed to tackle this problem, each focusing on filtering a specific kind of low-quality content. In this paper, we present a fine-grained real-time classification approach to identify several types of low-quality tweets (i.e., phishing, promoting, and spam tweets) written in Arabic. The system automatically extracts textual features using deep learning techniques, without relying on hand-crafted features that are often time-consuming to obtain and tailored to a single type of low-quality content. This paper also proposes a lightweight model that uses a subset of the textual features to identify spamming Twitter accounts in a real-time setting. The proposed methods are evaluated on a real-world dataset (40,000 tweets and 1,000 accounts), showing superior performance for both models, with accuracy and F1-scores of 0.98. The proposed system classifies a tweet in less than five milliseconds and an account in less than a second.




Updated: 2021-02-19