PathMarker: protecting web contents against inside crawlers, Cybersecurity

PathMarker: protecting web contents against inside crawlers
Cybersecurity, Pub Date: 2019-02-20, DOI: 10.1186/s42400-019-0023-1
Shengye Wan, Yue Li, Kun Sun

Web crawlers have been misused for several malicious purposes, such as downloading server data without permission from the website administrator. Moreover, armoured crawlers keep evolving to defeat new anti-crawler mechanisms in the arms race between crawler developers and crawler defenders. In this paper, based on the observation that normal users and malicious crawlers have different short-term and long-term download behaviours, we develop a new anti-crawler mechanism called PathMarker to detect and constrain persistent distributed crawlers. By adding a marker to each Uniform Resource Locator (URL), we can trace both the page that led to the access of this URL and the identity of the user who accesses it. With this supporting information, we can not only perform more accurate heuristic detection using path-related features, but also develop a Support Vector Machine (SVM) based machine learning detection model that distinguishes malicious crawlers from normal users by inspecting their different patterns of URL visiting paths and URL visiting timings. In addition to effectively detecting crawlers at the earliest stage, PathMarker can dramatically suppress the scraping efficiency of crawlers before they are detected. We deploy our approach on an online forum website, and the evaluation results show that PathMarker can quickly capture all 6 open-source and in-house crawlers, plus two external crawlers (i.e., Googlebots and Yahoo Slurp).
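
The URL marker idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the marker is encoded, so the query parameter name `pm`, the server key, the JSON payload, and the use of an HMAC signature to make the marker tamper-evident are all assumptions chosen for illustration.

```python
import base64
import hashlib
import hmac
import json
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

SECRET_KEY = b"demo-secret"  # hypothetical server-side key (assumption)
MARKER_PARAM = "pm"          # hypothetical query parameter name (assumption)


def _sign(payload: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the marker payload."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def add_marker(url: str, parent_page: str, user_id: str) -> str:
    """Append a marker recording the parent page and the requesting user."""
    payload = json.dumps(
        {"parent": parent_page, "user": user_id}, sort_keys=True
    ).encode()
    token = base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)
    parts = urlsplit(url)
    extra = urlencode({MARKER_PARAM: token})
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit(parts._replace(query=query))


def read_marker(url: str):
    """Return the marker dict, or None if it is missing or was altered."""
    values = parse_qs(urlsplit(url).query).get(MARKER_PARAM)
    if not values:
        return None
    encoded, _, signature = values[0].rpartition(".")
    try:
        payload = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:
        return None
    # Compare as bytes so a non-ASCII signature cannot raise TypeError.
    if not hmac.compare_digest(_sign(payload).encode(), signature.encode()):
        return None
    return json.loads(payload)
```

In this sketch the server would call `add_marker` on every link it renders for a logged-in user and `read_marker` on every incoming request. The recovered parent page and user identity are the kind of per-request information that path-based heuristics or a classifier could then consume.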

Updated: 2019-02-20
