


Identifying Documents In-Scope of a Collection from Web Archives
arXiv - CS - Digital Libraries. Pub Date: 2020-09-02, DOI: arxiv-2009.00611
Krutarth Patel, Cornelia Caragea, Mark Phillips, Nathaniel Fox

Web archive data usually contains high-quality documents that are very useful for creating specialized collections of documents, e.g., scientific digital libraries and repositories of technical reports. In doing so, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection out of the huge number of documents collected by web archiving institutions. In this paper, we explore different learning models and feature representations to determine the best performing ones for identifying the documents of interest from the web archived data. Specifically, we study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. We focus our evaluation on three datasets that we created from three different Web archives. Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
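To make the central idea concrete, here is a minimal sketch of a bag-of-words classifier that looks only at one portion of each document. The abstract does not say which portions or which learning models performed best, so everything specific here is an assumption made for illustration: the "portion" is taken to be the first `FIRST_N` tokens, the classifier is a small hand-written multinomial Naive Bayes, and the training texts and the `in_scope` / `out_of_scope` labels are invented toy data, not the authors' datasets.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bow_from_portion(text, first_n=None):
    """Bag-of-words counts from the first `first_n` tokens (whole text if None).

    Using the leading tokens as the "portion" is an assumption for this sketch.
    """
    tokens = tokenize(text)
    if first_n is not None:
        tokens = tokens[:first_n]
    return Counter(tokens)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over BoW counts.

    A stand-in model chosen for brevity, not the classifier from the paper.
    """

    def fit(self, bags, labels):
        self.classes = sorted(set(labels))
        self.vocab = set()
        self.word_counts = {c: Counter() for c in self.classes}
        self.doc_counts = Counter(labels)
        for bag, label in zip(bags, labels):
            self.word_counts[label].update(bag)
            self.vocab.update(bag)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, bag):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[c] / n_docs)
            for word, count in bag.items():
                if word in self.vocab:  # words never seen in training are skipped
                    p = (self.word_counts[c][word] + 1) / (self.totals[c] + vocab_size)
                    score += count * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best


# Invented toy examples; the labels are hypothetical names for this sketch.
train = [
    ("Technical Report Abstract We study retrieval methods for digital libraries", "in_scope"),
    ("Abstract This paper presents experimental results on classification", "in_scope"),
    ("Meeting agenda and parking information for visitors", "out_of_scope"),
    ("Invoice number payment due please remit total amount", "out_of_scope"),
]
FIRST_N = 8  # assumed portion size

model = NaiveBayes().fit(
    [bow_from_portion(text, FIRST_N) for text, _ in train],
    [label for _, label in train],
)
prediction = model.predict(
    bow_from_portion("Abstract We present a technical report on web archives", FIRST_N)
)
print(prediction)
```

Setting `first_n=None` in `bow_from_portion` gives the full-text baseline, so the same code can be used to compare portion-based and whole-document features on real labelled data.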


Updated: 2020-09-03
