Closed sequential pattern mining for sitemap generation
World Wide Web (IF 2.7), Pub Date: 2020-09-27, DOI: 10.1007/s11280-020-00839-2
Michelangelo Ceci, Pasqua Fabiana Lanotte

A sitemap is an explicit specification of the design concept and knowledge organization of a website and is therefore considered the website's basic ontology. It not only presents the main usage flows for users, but also organizes the website's concepts hierarchically. Typically, sitemaps are defined by webmasters in the very early stages of website design. Over their lifetime, however, websites significantly change their structure, their content and their possible navigation paths. Even when this is not the case, webmasters may define sitemaps that do not reflect the actual website content or, conversely, organize pages and links in a way that does not reflect the intended organization of the content coded in the sitemaps. In this paper we propose an approach which automatically generates sitemaps. Contrary to other approaches proposed in the literature, which mainly generate sitemaps from the textual content of the pages, in this work sitemaps are generated by analyzing the Web graph of a website. This allows us to: i) automatically generate a sitemap on the basis of possible navigation paths, ii) compare the generated sitemap with either the sitemap provided by the Web designer or the intended sitemap of the website and, consequently, iii) plan possible website re-organization. The solution we propose is based on closed frequent sequence extraction and concentrates only on hyperlinks organized in "Web lists", which are logical lists embedded in the pages. These "Web lists" are typically used to support users in website navigation, and they include menus, navbars and content tables. Experiments performed on three real datasets show that the extracted sitemaps are much more similar to those defined by website curators than those obtained by competitor algorithms.
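To make the notion of closed frequent sequence extraction concrete, here is a minimal brute-force sketch in Python. It is not the authors' algorithm: the function names, the toy navigation paths and the `min_support` threshold are all assumptions made for illustration, and the exhaustive enumeration is only practical for very short sequences. A sequence is frequent if it occurs (in order, gaps allowed) in at least `min_support` paths, and closed if no longer sequence containing it has the same support.

```python
from collections import defaultdict
from itertools import combinations


def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def closed_frequent_sequences(paths, min_support):
    """Brute-force closed frequent sequence extraction (illustrative only)."""
    counts = defaultdict(int)
    for path in paths:
        # Collect each distinct subsequence once per path, so that
        # support counts paths rather than occurrences.
        seen = set()
        for k in range(1, len(path) + 1):
            for idx in combinations(range(len(path)), k):
                seen.add(tuple(path[i] for i in idx))
        for pattern in seen:
            counts[pattern] += 1

    frequent = {p: s for p, s in counts.items() if s >= min_support}

    # A pattern is closed if no longer frequent pattern that contains it
    # has exactly the same support.
    closed = {}
    for p, s in frequent.items():
        absorbed = any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
        if not absorbed:
            closed[p] = s
    return closed


# Hypothetical navigation paths, e.g. obtained by following links in Web lists.
paths = [
    ["/", "/research", "/research/projects"],
    ["/", "/research", "/research/people"],
    ["/", "/teaching", "/teaching/courses"],
    ["/", "/research", "/research/projects"],
]

result = closed_frequent_sequences(paths, min_support=2)
for pattern, support in sorted(result.items(), key=lambda kv: len(kv[0])):
    print(" > ".join(pattern), support)
```

On this toy input the closed frequent sequences form a chain from the home page down to `/research/projects`, which hints at how such patterns could be read as branches of a sitemap hierarchy. How the paper actually builds sequences from Web lists and assembles them into a sitemap is not specified in the abstract.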




Updated: 2020-09-28