CS-BPSO: Hybrid feature selection based on chi-square and binary PSO algorithm for Arabic email authorship analysis
Knowledge-Based Systems (IF 8.8), Pub Date: 2021-06-12, DOI: 10.1016/j.knosys.2021.107224
Wojdan BinSaeedan, Salwa Alramlawi

Email authorship analysis is a challenging task that involves detecting an author’s style to help determine their identity. Emails represent a widespread application of big data, and email authorship analysis is widely performed in the forensic linguistics field. However, the high-dimensional feature space encountered in authorship analysis degrades classification performance. Moreover, the Arabic language is highly inflected and has certain unique characteristics, which pose critical challenges in identifying the context. Therefore, the selection of prominent features is a critical step in authorship analysis. Swarm intelligence (SI) algorithms are widely adopted to address such feature selection problems. In this study, an efficient hybrid feature selection algorithm that combines the chi-square statistic with binary particle swarm optimization (BPSO), termed CS-BPSO, was developed to enhance the performance of Arabic email authorship analysis. Static and dynamic features were considered. Experiments were conducted on Arabic email messages collected from a sample population to test the algorithm's performance using three popular classifiers: support vector machine (SVM), K-nearest neighbour (KNN), and naïve Bayes (NB). Accuracy, precision, recall, and F1-score were used as performance measures. The results showed that the CS-BPSO method achieves impressive results using dynamic features. The findings were quite satisfactory in handling several types of difficulty, e.g., imbalanced datasets, small datasets, and short texts.
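The abstract does not say how the chi-square statistic and BPSO are combined, so the following is only a minimal sketch under one common assumption: chi-square acts as a filter that keeps the `top_k` highest-scoring features, and BPSO then searches binary masks over those features. The function names, the leave-one-out 1-NN fitness, the swarm parameters, and the toy data are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random

def chi2_scores(X, y):
    """Chi-square statistic of each non-negative count feature against the labels."""
    classes, n = sorted(set(y)), len(X)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        stat = 0.0
        for c in classes:
            observed = sum(row[j] for row, lab in zip(X, y) if lab == c)
            expected = total * sum(1 for lab in y if lab == c) / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores

def loo_knn_accuracy(X, y, mask):
    """Fitness (assumed): leave-one-out 1-NN accuracy on the masked features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    correct = 0
    for i, row in enumerate(X):
        best, best_label = None, None
        for k, other in enumerate(X):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in cols)
            if best is None or dist < best:
                best, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def bpso(X, y, n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: velocities are real-valued, positions are resampled bits."""
    rng = random.Random(seed)
    d = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_particles)]
    vel = [[0.0] * d for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [loo_knn_accuracy(X, y, p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(d):
                v = (w * vel[i][j]
                     + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                     + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, v))  # clamp to keep exp() stable
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))  # sigmoid transfer
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = loo_knn_accuracy(X, y, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

def cs_bpso(X, y, top_k, **bpso_kwargs):
    """Assumed hybrid: chi-square filter first, then BPSO on the survivors."""
    scores = chi2_scores(X, y)
    keep = sorted(sorted(range(len(scores)), key=lambda j: scores[j],
                         reverse=True)[:top_k])
    X_reduced = [[row[j] for j in keep] for row in X]
    mask, fit = bpso(X_reduced, y, **bpso_kwargs)
    return [keep[j] for j, bit in enumerate(mask) if bit], fit

def make_toy(seed=1):
    """Toy data: features 0-1 depend on the label, features 2-7 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(30):
        label = i % 2
        informative = [8 * label + rng.randint(0, 2) for _ in range(2)]
        X.append(informative + [rng.randint(0, 5) for _ in range(6)])
        y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy()
    selected, fitness = cs_bpso(X, y, top_k=4, n_iter=10)
    print("selected feature indices:", selected, "fitness:", fitness)
```

In this sketch the filter stage shrinks the search space before the wrapper stage runs, which is the usual motivation for pairing a cheap statistical filter with a swarm-based search. Other couplings (for example, using chi-square scores to bias the initial swarm) are equally plausible readings of the abstract.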




Updated: 2021-06-16