Cross-dataset email classification, Journal of Intelligent & Fuzzy Systems

Cross-dataset email classification
Journal of Intelligent & Fuzzy Systems (IF 1.7) Pub Date: 2020-06-12, DOI: 10.3233/jifs-179890
Valentin Morales, Juan Carlos Gomez, Saskia Van Amerongen

Email is one of the most popular means of communication. However, it is also a vehicle for deceiving users and flooding them with unwanted advertising, which reduces productivity. A common way to mitigate this problem is to build machine learning models that use the content of emails to separate them automatically (spam vs. ham). This work presents a study of a set of machine learning models and content-based features for the problem of cross-dataset email classification. In this problem, models are trained and tested on different datasets that were collected under different, independent setups. The purpose is to simulate the variable or unpredictable shifts in email content distribution that can occur in a real setting, where models are trained on emails from a certain period of time, group of users, or set of accounts but tested on emails from other users or accounts. Experiments were conducted with the models and features on different datasets under two setups, same-dataset and cross-dataset, to show the complexity of the latter. Performance was evaluated using the Area Under the ROC Curve (AUC), a common metric in email classification. The results offer insights into the problem.
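
The two evaluation setups described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the classifier (a small multinomial Naive Bayes over whitespace tokens), the function names, and the toy emails in TRAIN_A, TEST_A, and TEST_B. The abstract does not say which models, features, or datasets the authors used. The sketch only shows the protocol: fit on one dataset, then score either a held-out part of the same dataset or a dataset collected elsewhere, and compare AUC.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical content-based feature: lowercase whitespace tokens.
    return text.lower().split()

def train_nb(emails):
    """Fit word counts per class; emails is a list of (text, label), 1 = spam, 0 = ham."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in emails:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def spam_score(model, text):
    """Log-odds of spam vs. ham under multinomial Naive Bayes with add-one smoothing."""
    counts, docs, vocab = model
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    score = math.log(docs[1] / docs[0])  # class prior
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # tokens never seen in training contribute no evidence
        p_spam = (counts[1][tok] + 1) / (totals[1] + len(vocab))
        p_ham = (counts[0][tok] + 1) / (totals[0] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

def auc(scores, labels):
    """AUC as the fraction of (spam, ham) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(train, test):
    """Train on one set of emails and report AUC on another."""
    model = train_nb(train)
    scores = [spam_score(model, text) for text, _ in test]
    return auc(scores, [label for _, label in test])

# Invented toy data. Dataset A is split into train and test parts;
# dataset B stands in for emails collected under a different setup.
TRAIN_A = [
    ("win free prize now", 1),
    ("free money offer click now", 1),
    ("meeting agenda for monday", 0),
    ("please review the project report", 0),
]
TEST_A = [
    ("claim your free prize", 1),
    ("project meeting on monday", 0),
]
TEST_B = [
    ("verify your account password now", 1),
    ("urgent invoice attached review the report", 1),
    ("free lunch for the team now", 0),
    ("monday meeting agenda for project", 0),
]

same_dataset_auc = evaluate(TRAIN_A, TEST_A)   # train on A, test on held-out A
cross_dataset_auc = evaluate(TRAIN_A, TEST_B)  # train on A, test on B
print(f"same-dataset AUC:  {same_dataset_auc:.2f}")
print(f"cross-dataset AUC: {cross_dataset_auc:.2f}")
```

On this deliberately constructed toy data the script prints 1.00 for the same-dataset setup and 0.75 for the cross-dataset setup, because dataset B contains spam phrased with words that dataset A only saw in ham. The numbers say nothing about the paper's actual results; they only show how a shift in content distribution can lower AUC when the test emails come from a different source.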


Updated: 2020-06-19
