Improving classifier training efficiency for automatic cyberbullying detection with Feature Density
Information Processing & Management ( IF 8.6 ) Pub Date : 2021-05-13 , DOI: 10.1016/j.ipm.2021.102616
Juuso Eronen , Michal Ptaszynski , Fumito Masui , Aleksander Smywiński-Pohl , Gniewosz Leliwa , Michal Wroczynski

We study the effectiveness of Feature Density (FD) under different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. We hypothesize that estimating dataset complexity allows for a reduction in the number of required experiment iterations. This way we can optimize the resource-intensive training of ML models, which is becoming a serious issue due to the increase in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The constantly increasing need for more powerful computational resources also affects the environment, due to the alarming growth in CO2 emissions caused by the training of large-scale ML models. The research was conducted on multiple datasets, including popular ones such as the Yelp business review dataset used for training typical sentiment analysis models, as well as more recent datasets addressing the problem of cyberbullying, which, besides being a serious social problem, is also a much more sophisticated problem from the point of view of linguistic representation. We use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity of the datasets additionally allows us to discuss the efficacy of linguistically-backed word preprocessing.
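The abstract does not reproduce the paper's exact formulation of Feature Density. As a rough, hypothetical sketch, one common formulation computes FD as the ratio of unique features (e.g. word n-grams) to the total number of feature occurrences in a dataset; the `feature_density` function and the toy documents below are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def feature_density(documents, ngram=1):
    """Hypothetical sketch: Feature Density as the ratio of unique
    features (here, word n-grams) to total feature occurrences.
    A lower FD indicates more repetition, suggesting a dataset that
    is simpler for a classifier to model."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i in range(len(tokens) - ngram + 1):
            counts[tuple(tokens[i:i + ngram])] += 1
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

# Toy example: 5 unique unigrams out of 10 occurrences -> FD = 0.5
docs = ["you are great", "you are so mean", "you are great"]
print(round(feature_density(docs), 3))  # → 0.5
```

Under this reading, comparing FD across candidate preprocessing schemes (tokenization, lemmatization, and other linguistically-backed transforms) gives a cheap proxy for dataset complexity before committing to full classifier training runs.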


Updated: 2021-05-13