Extraction of Relevant Images for Boilerplate Removal in Web Browsers
arXiv - CS - Information Retrieval, Pub Date: 2019-12-17, DOI: arxiv-2001.04338
Joy Bose

Boilerplate refers to unwanted and repeated parts of a webpage (such as ads or a table of contents) that distract the user from reading the core content of the webpage, such as a news article. Accurate detection and removal of boilerplate content can give users a clutter-free view of the webpage or news article. This is useful in features such as reader mode in web browsers. Current implementations of reader mode in web browsers such as Firefox, Chrome and Edge perform reasonably well for textual content in webpages. However, they are mostly heuristic-based and are not flexible when the webpage content is dynamic. They also often perform poorly at removing boilerplate content in the form of images and multimedia. Detecting boilerplate images requires knowledge of the actual layout of the images in the webpage, which is available only once the webpage is rendered. In this paper we discuss some of the issues in relevant image extraction. We also present the design of a testing framework to measure accuracy, and of a classifier that extracts relevant images by leveraging a headless browser solution that provides the rendering information for images.
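
To make the role of rendering information concrete, the following is a minimal sketch of how rendered image geometry could feed a relevance decision. It is not the classifier described in the paper: the `RenderedImage` type, the feature names, and every threshold are hypothetical, and the geometry is assumed to have been collected already from a headless browser (for example, from each image's rendered bounding box).

```python
from dataclasses import dataclass


@dataclass
class RenderedImage:
    """Geometry of one image after rendering (hypothetical record type)."""
    src: str
    x: float       # left edge in CSS pixels
    y: float       # top edge in CSS pixels
    width: float   # rendered width in CSS pixels
    height: float  # rendered height in CSS pixels


def image_features(img, content_left, content_right):
    """Derive simple layout features from the rendered bounding box."""
    area = img.width * img.height
    aspect = img.width / img.height if img.height else 0.0
    # Horizontal overlap between the image and the main content column.
    overlap = max(0.0, min(img.x + img.width, content_right) - max(img.x, content_left))
    overlap_frac = overlap / img.width if img.width else 0.0
    return {"area": area, "aspect": aspect, "overlap_frac": overlap_frac}


def is_relevant(img, content_left, content_right,
                min_area=40_000, max_aspect=4.0, min_overlap=0.8):
    """Rule-based stand-in for a classifier; all thresholds are illustrative."""
    f = image_features(img, content_left, content_right)
    return (
        f["area"] >= min_area                 # drops icons and tracking pixels
        and f["aspect"] <= max_aspect         # drops wide banner-shaped images
        and f["overlap_frac"] >= min_overlap  # keeps images inside the content column
    )


if __name__ == "__main__":
    # Assumed content column spanning x = 80 to x = 720.
    images = [
        RenderedImage("article-photo.jpg", 100, 300, 600, 400),
        RenderedImage("banner-ad.png", 0, 0, 728, 90),
        RenderedImage("share-icon.png", 900, 120, 32, 32),
    ]
    for img in images:
        print(img.src, is_relevant(img, 80, 720))
```

None of these features can be read from the HTML source alone, which is why the rendered layout matters. A learned classifier could replace the fixed thresholds while consuming the same features.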

Updated: 2020-01-15