Towards benchmark datasets for machine learning based website phishing detection: An experimental study
Engineering Applications of Artificial Intelligence (IF 8), Pub Date: 2021-06-16, DOI: 10.1016/j.engappai.2021.104347
Abdelhakim Hannousse, Salima Yahiouche

The increasing popularity of the Internet has led to substantial growth in e-commerce. However, such activities face major security challenges, primarily caused by cyberfraud and identity theft. Checking the legitimacy of visited web pages is therefore a crucial task for securing customers' identities and preventing phishing attacks. The use of machine learning is widely recognized as a promising solution, and the literature is rich with studies that use machine learning techniques for website phishing detection. However, their findings are dataset dependent and far from generalizable. Two main reasons for this unfortunate state are the impracticality of replication and the absence of appropriate benchmark datasets for the fair evaluation of systems. Moreover, phishing tactics are continuously evolving, and proposed systems do not keep up with those rapid changes. In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable comparison of systems adopting different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we start by adopting a refined categorization of website phishing features, systematically select a total of 87 commonly recognized ones, categorize them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset in light of the proposed scheme. Thereafter, we use a conceptual replication approach to check whether former findings generalize to the built dataset. Specifically, we evaluate the performance of classifiers on individual and combined categories of features, investigate different combinations of models, and explore the effects of filter and wrapper methods on the selection of discriminative features. The results show that Random Forest is the most predictive classifier. Features gathered from external services are the most discriminative, whereas features extracted from web page contents are less distinguishing. In addition to external-service-based features, some web page content features are found to be unsuitable for runtime detection. The use of hybrid features provided the best accuracy score of 96.61%. Among the feature selection methods investigated, filter-based ranking with incremental removal of less important features improved performance to 96.83%, outperforming wrapper methods.
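The filter-based ranking with incremental removal mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the data is synthetic rather than the authors' dataset, absolute Pearson correlation stands in for whichever filter score the authors used, and a nearest-centroid classifier stands in for Random Forest so that the example runs on the Python standard library alone. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def make_toy_data(n=400, seed=0):
    """Synthetic stand-in for a phishing dataset: 2 informative + 3 noise features."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([
            rng.gauss(2.0 * label, 1.0),  # strongly informative
            rng.gauss(1.0 * label, 1.0),  # weakly informative
            rng.gauss(0.0, 1.0),          # noise
            rng.gauss(0.0, 1.0),          # noise
            rng.gauss(0.0, 1.0),          # noise
        ])
        y.append(label)
    return X, y


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between one feature column and the labels."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((a - mx) * (b - my) for a, b in zip(xs, ys))
    return abs(cov / (sx * sy))


def rank_features(X, y):
    """Filter step: order feature indices from most to least relevant."""
    scores = [abs_correlation([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def centroid_accuracy(X_train, y_train, X_test, y_test, cols):
    """Held-out accuracy of a nearest-centroid classifier restricted to cols."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(r[j] for r in rows) for j in cols]
    correct = 0
    for row, target in zip(X_test, y_test):
        dist = {
            label: sum((row[j] - c) ** 2 for j, c in zip(cols, centroid))
            for label, centroid in centroids.items()
        }
        if min(dist, key=dist.get) == target:
            correct += 1
    return correct / len(y_test)


def incremental_removal(X, y, split=0.7):
    """Drop the lowest-ranked feature one at a time, recording held-out accuracy."""
    cut = int(len(X) * split)
    X_train, y_train, X_test, y_test = X[:cut], y[:cut], X[cut:], y[cut:]
    ranking = rank_features(X_train, y_train)  # ranking uses training data only
    results = []
    for k in range(len(ranking), 0, -1):
        kept = ranking[:k]
        acc = centroid_accuracy(X_train, y_train, X_test, y_test, kept)
        results.append((kept, acc))
    return results


X, y = make_toy_data()
results = incremental_removal(X, y)
best_features, best_accuracy = max(results, key=lambda r: r[1])
print(best_features, round(best_accuracy, 4))
```

The ranking is computed once, independently of the classifier, which is what distinguishes a filter method from a wrapper method; a wrapper would instead retrain the classifier to decide which feature to drop at each step.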




Updated: 2021-06-17