Page-Level Main Content Extraction From Heterogeneous Webpages
ACM Transactions on Knowledge Discovery from Data (IF 4.0), Pub Date: 2021-06-28, DOI: 10.1145/3451168
Julián Alarte, Josep Silva

The main content of a webpage is often surrounded by boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. In addition, detecting and extracting the main content is useful in areas such as data mining, web summarization, and content adaptation for low-resolution displays. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a page-level technique based on the Document Object Model (DOM), so it needs to load only a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique on a suite of real, heterogeneous benchmarks, where it produced very good results compared with other well-known content extraction techniques.
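
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of what a DOM-based, page-level extractor can look like. It is not the authors' technique. It assumes a simple stand-in heuristic: starting at the root, descend into a child while that child holds at least a given share of the non-link text. The names `SKIPPED`, `DomBuilder`, `content_weight`, `find_main_node`, `extract_main_content`, and the `share` threshold of 0.6 are all hypothetical choices made for illustration. Only one page is parsed, and both text and image sources are returned.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as boilerplate and never counted.
SKIPPED = {"script", "style", "nav", "header", "footer", "aside"}
# Void elements have no closing tag, so they are never made the open element.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = attrs or {}
        self.children = []  # Node objects and text strings, in document order


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from a single HTML page."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, dict(attrs))
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def content_weight(node, in_link=False):
    """Text characters in the subtree, ignoring skipped tags and link text."""
    if node.tag in SKIPPED:
        return 0
    in_link = in_link or node.tag == "a"
    total = 0
    for child in node.children:
        if isinstance(child, Node):
            total += content_weight(child, in_link)
        elif not in_link:
            total += len(child)
    return total


def find_main_node(root, share=0.6):
    """Descend while a single child holds at least `share` of the text."""
    node = root
    while True:
        elements = [c for c in node.children if isinstance(c, Node)]
        total = content_weight(node)
        best = max(elements, key=content_weight, default=None)
        if best is None or total == 0 or content_weight(best) < share * total:
            return node
        node = best


def collect(node, texts, images):
    """Gather text and image sources from the chosen subtree."""
    if node.tag in SKIPPED:
        return
    if node.tag == "img" and node.attrs.get("src"):
        images.append(node.attrs["src"])
    for child in node.children:
        if isinstance(child, Node):
            collect(child, texts, images)
        else:
            texts.append(child)


def extract_main_content(html):
    """Return (attributes of the main node, its text, its image sources)."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    main = find_main_node(builder.root)
    texts, images = [], []
    collect(main, texts, images)
    return main.attrs, " ".join(texts), images
```

Calling `extract_main_content(page_source)` on one page's HTML returns the attributes of the selected node together with its text and image URLs. A text-share heuristic like this is easily fooled by long comment threads or link-heavy articles, which is one reason published techniques rely on richer DOM features.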

Updated: 2021-06-28