Combining URL and HTML Features for Entity Discovery in the Web
ACM Transactions on the Web (IF 2.6), Pub Date: 2019-12-05, DOI: 10.1145/3365574
Edimar Manica, Carina Friedrich Dorneles, Renata Galante

The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
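
The idea of giving URL terms different weights according to how well they separate entity-pages from other pages can be illustrated with a small sketch. This is not the SSUP algorithm, whose details the abstract does not give. It assumes a handful of URLs already labelled as entity-pages or other pages (in practice such a signal might come from HTML features), and the function names `url_terms`, `term_weights`, and `url_score`, the weighting formula, and the `example.com` URLs are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_terms(url):
    """Return the set of lowercase alphabetic terms in the URL path and query."""
    parsed = urlparse(url)
    text = (parsed.path + " " + parsed.query).lower()
    return set(re.findall(r"[a-z]+", text))


def term_weights(entity_urls, other_urls):
    """Assumed weighting: share of entity-page URLs containing the term
    minus share of other URLs containing it, floored at zero."""
    entity_df = Counter(t for u in entity_urls for t in url_terms(u))
    other_df = Counter(t for u in other_urls for t in url_terms(u))
    weights = {}
    for term in set(entity_df) | set(other_df):
        p_entity = entity_df[term] / len(entity_urls)
        p_other = other_df[term] / len(other_urls)
        weights[term] = max(0.0, p_entity - p_other)
    return weights


def url_score(url, weights):
    """Average weight of the URL's terms; unseen terms count as zero."""
    terms = url_terms(url)
    if not terms:
        return 0.0
    return sum(weights.get(t, 0.0) for t in terms) / len(terms)


# Hypothetical labelled URLs from a car racing website.
entity_urls = [
    "http://example.com/drivers/alice-smith",
    "http://example.com/drivers/bob-jones",
]
other_urls = [
    "http://example.com/news/race-report",
    "http://example.com/teams",
]

weights = term_weights(entity_urls, other_urls)
candidate_score = url_score("http://example.com/drivers/carol-white", weights)
news_score = url_score("http://example.com/news/today", weights)
```

In this toy setup the term `drivers` appears in every entity-page URL and in no other URL, so it receives the highest weight, while terms such as `news` receive zero. A new URL under `/drivers/` therefore scores higher than one under `/news/`, even though the driver's name was never seen before.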

Updated: 2019-12-05