Unsupervised Domain Ranking in Large-Scale Web Crawls, ACM Transactions on the Web

Unsupervised Domain Ranking in Large-Scale Web Crawls
ACM Transactions on the Web (IF 3.5), Pub Date: 2018-09-28, DOI: 10.1145/3182180
Yi Cui, Clint Sparkman, Hsin-Tsang Lee, Dmitri Loguinov

With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.
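The budgeting scheme described in the abstract (rank domains, give the top of the list an aggressive allowance, and visit everything else at a small default rate) can be illustrated with a minimal Python sketch. This is not the paper's TSE estimator or its BFS-based technique; it assumes plain domain-level in-degree as the ranking signal and a link list small enough to fit in memory. The function names, the hostname-as-domain shortcut, and the budget values are all hypothetical choices made for illustration.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Crude domain key: the lowercased hostname.

    Assumption: no public-suffix handling, so 'a.example.com' and
    'b.example.com' count as different domains here.
    """
    return (urlsplit(url).hostname or "").lower()


def domain_in_degree(page_links):
    """Map each domain to the number of distinct OTHER domains linking to it."""
    sources = {}
    for src_url, dst_url in page_links:
        src, dst = domain_of(src_url), domain_of(dst_url)
        if not src or not dst:
            continue
        # Make sure domains that only appear as sources get in-degree 0.
        sources.setdefault(src, set())
        if src != dst:  # intra-domain links do not count
            sources.setdefault(dst, set()).add(src)
    return {domain: len(srcs) for domain, srcs in sources.items()}


def assign_budgets(in_degree, top_k, high_budget, default_budget):
    """Give the top_k domains by in-degree the high budget, all others the default."""
    # Sort by descending in-degree; break ties by name for determinism.
    ranked = sorted(in_degree, key=lambda d: (-in_degree[d], d))
    top = set(ranked[:top_k])
    return {d: (high_budget if d in top else default_budget) for d in in_degree}


if __name__ == "__main__":
    # Hypothetical page-level links (source URL, destination URL).
    links = [
        ("http://a.com/1", "http://b.com/"),
        ("http://c.com/x", "http://b.com/y"),
        ("http://a.com/2", "http://c.com/"),
        ("http://b.com/", "http://b.com/z"),
    ]
    degrees = domain_in_degree(links)
    print(assign_budgets(degrees, top_k=1, high_budget=100_000, default_budget=10))
```

Counting distinct linking domains rather than raw links keeps a single site from inflating its own or a partner's rank with many pages. A crawl at the scale the abstract describes would need an external-memory or streaming version of this computation, since the sketch holds the full set of linking domains in RAM.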

Updated: 2018-09-28
