Email Spam Filtering: A Systematic Review
Foundations and Trends in Information Retrieval (IF 10.4), Pub Date: 2008-06-22, DOI: 10.1561/1500000006
Gordon V. Cormack

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state of the art in information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam?

We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities to and differences from spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.




Updated: 2008-06-22