Website replica detection with distant supervision
Information Retrieval Journal (IF 1.7). Pub Date: 2017-11-29. DOI: 10.1007/s10791-017-9320-z
Cristiano Carvalho, Edleno Silva de Moura, Adriano Veloso, Nivio Ziviani

Duplicate content on the Web occurs within the same website or across multiple websites. The latter is mainly associated with the existence of website replicas: sites that are perceptibly similar. Replication may be accidental, intentional, or malicious, but whatever the reason, search engines suffer either from unnecessarily storing and moving duplicate data or from serving search results that offer no real value to users. In this paper, we model the detection of website replicas as a pairwise classification problem with distant supervision. That is, (heuristically) finding obvious replica and non-replica cases is trivial, but learning effective classifiers requires a representative set of non-obvious labeled examples, which are hard to obtain. We employ efficient Expectation-Maximization (EM) algorithms to find non-obvious examples from obvious ones, enlarging the training set and improving the classifiers iteratively. Our classifiers are based on association rules and can thus be updated incrementally as the EM process iterates, keeping our algorithms time-efficient. Experiments show that: (1) replicas are fully eliminated at a false-positive rate below 0.005, yielding a 19% reduction in the number of duplicate URLs; (2) the reduction grows to 21% when our site-level algorithms are used in conjunction with existing URL-level algorithms; and (3) our classifiers are more than two orders of magnitude faster than semi-supervised alternative solutions.
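The distant-supervision loop the abstract describes can be illustrated in a few lines. Below is a minimal sketch, not the authors' implementation: it substitutes scikit-learn's LogisticRegression for the paper's association-rule classifier, uses synthetic pairwise similarity features, and all variable names and heuristic thresholds are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pairwise features for site pairs (stand-ins for signals such as
# content-shingle overlap or URL-path similarity). Hidden ground truth:
# pairs drawn from the shifted cluster are replica pairs.
is_replica = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 2)) + is_replica[:, None] * 2.0

# Distant supervision: a simple heuristic labels only the obvious extremes;
# everything in between is left unlabeled.
score = X.sum(axis=1)
obvious = (score > 4.0) | (score < 0.0)
train_idx = np.where(obvious)[0]
unlab_idx = np.where(~obvious)[0]
y_train = (score[train_idx] > 4.0).astype(int)

clf = LogisticRegression()
for _ in range(10):                                # EM-style self-training
    clf.fit(X[train_idx], y_train)                 # M-step: refit classifier
    if unlab_idx.size == 0:
        break
    proba = clf.predict_proba(X[unlab_idx])[:, 1]  # E-step: score unlabeled
    confident = (proba > 0.95) | (proba < 0.05)    # keep high-confidence calls
    if not confident.any():
        break
    # Move confidently classified non-obvious pairs into the training set.
    train_idx = np.concatenate([train_idx, unlab_idx[confident]])
    y_train = np.concatenate([y_train, (proba[confident] > 0.5).astype(int)])
    unlab_idx = unlab_idx[~confident]

print("labeled pairs after EM:", train_idx.size)
print("agreement with hidden truth:", (clf.predict(X) == is_replica).mean())
```

Note that the sketch refits the classifier from scratch at each iteration for brevity; in the paper, the association-rule classifiers are updated incrementally as the training set grows, which is what keeps each EM iteration cheap.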
