Hybrid ensemble approaches to online harassment detection in highly imbalanced data
Expert Systems with Applications (IF 8.5), Pub Date: 2021-02-25, DOI: 10.1016/j.eswa.2021.114751
Marwa Tolba , Salima Ouadfel , Souham Meshoul

Online harassment is a major threat to users of social media platforms, especially young adults and women. It can cause mental illness, and it deeply harms economic institutions that experience cyberbullying attacks, which lose credibility and business as a result. This makes automatic detection of online harassment extremely important. Most current studies in this context apply machine-learning algorithms that assume a balanced class distribution. However, this assumption does not hold for most real datasets. This research provides a comprehensive investigation of approaches that combine diverse techniques along three dimensions: feature representation, imbalanced data handling, and supervised learning. For the first dimension, three word-embedding models are considered: word2vec, GloVe, and SSWE. For the other two dimensions, nine techniques for balancing skewed class distributions are employed to feed several learning models. In particular, resampling methods, cost-sensitive learning, and Weight-Selection strategy-based methods are used with deep neural networks. The ultimate goal of this study is to evaluate the potential of such hybrid approaches for handling the online harassment detection task efficiently on highly imbalanced Twitter data, and to select the best combination for the intended purpose. An extensive comparative study is conducted, and the results are discussed in terms of three evaluation metrics widely used for imbalanced classification. The main findings are that GloVe is the best feature representation and that certain combinations perform best, most notably LSTM and BLSTM with cost-sensitive learning and the VL strategy.
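To make the idea of cost-sensitive learning on skewed classes concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how misclassification costs are chosen, so the inverse-frequency ("balanced") weighting used here is an assumption, as are the function names `balanced_class_weights` and `weighted_log_loss` and the 95/5 toy label split.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed heuristic for illustration; the paper's cost scheme is not
    given in the abstract.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Weighted binary cross-entropy; probs[i] is the predicted P(label == 1)."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum


# Hypothetical toy data: 95 non-harassment (0) and 5 harassment (1) examples.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# A classifier that always predicts a low harassment probability.
probs = [0.05] * len(labels)
plain = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
costed = weighted_log_loss(labels, probs, weights)
```

With these weights each class contributes equally to the loss, so a model that ignores the minority class is penalised far more heavily (`costed` is larger than `plain`). Deep-learning frameworks typically accept such per-class weights directly in their loss functions.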




Updated: 2021-03-23