Textual analysis of traitor-based dataset through semi supervised machine learning
Future Generation Computer Systems ( IF 7.5 ) Pub Date : 2021-06-25 , DOI: 10.1016/j.future.2021.06.036
Faisal Janjua , Asif Masood , Haider Abbas , Imran Rashid , Malik M. Zaki Murtaza Khan

Insider threats are among the most challenging and fastest-growing security threats faced by government agencies, organizations, and institutions. In such scenarios, malicious (red) activities are performed by authorized individuals within the company, which makes insider threats harder to identify than other attacks. Alongside other monitoring parameters, email logs play a vital role in many research areas, such as tracking insider threats involving collaborating traitors, textual analysis, and social media exploration. This paper presents a semi-supervised machine learning framework that combines pre-processing and classification techniques for an unlabeled dataset, namely emails. The Enron Corporation dataset is used for experiments and TWOS for evaluation of the proposed framework. First, the dataset is transformed into vector form using Term Frequency-Inverse Document Frequency (TF-IDF). Next, K-Means is used to group the emails based on message content. Finally, the Decision Tree (DT) machine learning algorithm is applied to classify the malicious activities. The proposed framework was also tested with other algorithms: Logistic Regression (LR), Naive Bayes (NB), KNN, Support Vector Machine (SVM), Random Forest (RF), and Neural Network (NN). However, Decision Tree (DT) combined with the pre-processing steps gave the desired results, with 99.96% accuracy and 0.994 AUC for identification of malicious content.
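The three stages named in the abstract (TF-IDF vectorization, K-Means grouping, then a decision tree trained on the resulting groups) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The six emails are invented, the smoothed IDF formula is one common variant, the K-Means starting points are fixed by hand (`init_idx`) to keep the run deterministic, and a depth-1 tree (a decision stump) stands in for a full decision tree. All function names are hypothetical.

```python
import math
from collections import Counter

# Invented toy emails: three routine ones, then three suspicious ones.
emails = [
    "quarterly budget meeting agenda",
    "budget meeting moved to friday",
    "agenda for friday meeting",
    "wire funds offshore account tonight",
    "delete offshore account records tonight",
    "wire funds offshore before audit",
]

def tfidf(docs):
    """Return (vocab, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumption: one common smoothed IDF variant.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(rows, init_idx, iters=10):
    """Plain Lloyd iterations; initial centers are the rows at init_idx."""
    centers = [rows[i][:] for i in init_idx]
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: dist2(r, centers[c]))
                  for r in rows]
        for c in range(len(centers)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fit_stump(rows, labels):
    """Depth-1 decision tree: the (feature, threshold) split with fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[j] <= thr]
            right = [l for r, l in zip(rows, labels) if r[j] > thr]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, j, thr, left_label, right_label)
    return best[1:]

def predict(stump, row):
    j, thr, left_label, right_label = stump
    return left_label if row[j] <= thr else right_label

vocab, X = tfidf(emails)                      # stage 1: vectorize
pseudo_labels = kmeans(X, init_idx=(0, 3))    # stage 2: cluster, giving pseudo-labels
stump = fit_stump(X, pseudo_labels)           # stage 3: supervised model on pseudo-labels
predictions = [predict(stump, row) for row in X]
print(pseudo_labels, vocab[stump[0]], predictions)
```

On this toy corpus the two groups share no vocabulary, so the clustering separates the first three emails from the last three and a single word is enough for the stump to reproduce the cluster labels. Real email data would need proper tokenization, a principled choice of the number of clusters, and evaluation on held-out data, none of which is shown here.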




Updated: 2021-07-23