Online Algorithms for Estimating Change Rates of Web Pages
arXiv - CS - Information Retrieval Pub Date : 2020-09-17 , DOI: arxiv-2009.08142
Konstantin Avrachenkov, Kishor Patil, Gugan Thoppe

To provide quick and accurate search results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler that tracks changes across web pages. Ideally, the crawler would update the local snapshot as soon as a page changed on the web. However, finite bandwidth and server restrictions bound how frequently the different pages can be crawled. This leads to the following optimisation problem: maximise the freshness of the local cache subject to the crawling frequency staying within the prescribed bounds. Recently, tractable algorithms have been proposed to solve this optimisation problem under different cost criteria. However, these assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide three novel schemes for online estimation of page change rates. All of these schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since the last crawl instance. Our first scheme is based on the law of large numbers, the second on the theory of stochastic approximation, and the third is an extension of the second that involves an additional momentum term. For all of these schemes, we prove convergence and provide their convergence rates. As far as we know, the results concerning the third estimator are novel. Specifically, this is the first convergence-type result for a stochastic approximation algorithm with momentum. Finally, we provide numerical experiments (on real as well as synthetic data) to compare the performance of our proposed estimators with existing ones (e.g., MLE).
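The abstract does not state the estimators' formulas, so the following is only a minimal sketch of how a change rate might be estimated from binary "changed or not" flags, under assumptions that are not taken from the text above: page changes follow a Poisson process with rate `delta`, crawl gaps are exponential with a known rate `p`, and the names `simulate_change_flags`, `lln_estimate`, and `sa_estimate` are hypothetical. Under these assumptions a crawl sees a change with probability `delta / (delta + p)`, which yields one averaging-based estimate and one stochastic-approximation-style recursion; neither should be read as the paper's exact scheme, and the momentum variant is not sketched.

```python
import random


def simulate_change_flags(delta, p, n, seed=0):
    """Return n binary flags: did the page change between consecutive crawls?

    Assumption (not from the abstract): crawl gaps are exponential with rate p
    and the page changes as a Poisson process with rate delta.
    """
    rng = random.Random(seed)
    flags = []
    for _ in range(n):
        gap = rng.expovariate(p)               # time since the previous crawl
        first_change = rng.expovariate(delta)  # time until the next page change
        flags.append(1 if first_change <= gap else 0)
    return flags


def lln_estimate(flags, p):
    """Averaging-based estimate: invert P(changed) = delta / (delta + p)."""
    changed = sum(flags)
    unchanged = len(flags) - changed
    if unchanged == 0:
        return float("inf")  # every crawl saw a change; the rate is unbounded
    return p * changed / unchanged


def sa_estimate(flags, p, x0=1.0):
    """Stochastic-approximation-style recursion with step size 1/k.

    The expected update, q * (x + p) - x with q = delta / (delta + p),
    is zero exactly at x = delta.
    """
    x = x0
    for k, flag in enumerate(flags, start=1):
        step = 1.0 / k
        x += step * (flag * (x + p) - x)
    return x


if __name__ == "__main__":
    flags = simulate_change_flags(delta=0.5, p=1.0, n=50_000)
    print("averaging estimate:", lln_estimate(flags, p=1.0))
    print("recursive estimate:", sa_estimate(flags, p=1.0))
```

Both functions consume only the flags and the crawl rate, matching the partial-information setting described in the abstract; the recursive version updates one flag at a time and so never needs to store the history.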

Updated: 2020-09-18