Towards extracting event-centric collections from Web archives
International Journal on Digital Libraries (IF 1.6), Pub Date: 2018-10-27, DOI: 10.1007/s00799-018-0258-6
Gerhard Gossen, Thomas Risse, Elena Demidova

Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, there are currently no access methods that go beyond the retrieval of individual, disconnected documents and help Web archive users efficiently access event-centric information in large-scale archives. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate efficient and intuitive access to information about past events. We address this problem by: (1) enabling users to define event-centric document collections intuitively through a Collection Specification; (2) developing a specialised extraction method that adapts focused crawling techniques to the Web archive setting; and (3) defining a function that judges the relevance of archived documents with respect to the Collection Specification, taking into account both the topical and the temporal relevance of the documents. Our extended experiments on the German Web archive (covering a period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
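The three components named in the abstract (a Collection Specification, a focused-crawl style extraction over archived pages, and a relevance function combining topical and temporal evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the actual formulas, so the class names `ArchivedDoc` and `CollectionSpec`, the keyword-overlap topical score, the exponential temporal decay with `half_life_days`, the linear weighting `w_topic`, and the `threshold` are all assumptions chosen for illustration. The archive is modelled as a plain in-memory dictionary from URL to captured document.

```python
import heapq
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ArchivedDoc:
    url: str
    captured: date                      # capture date recorded by the archive
    text: str
    links: list = field(default_factory=list)  # outgoing links found in the capture


@dataclass
class CollectionSpec:
    # Hypothetical shape of a Collection Specification (an assumption).
    keywords: set                       # lowercase terms describing the event
    start: date                         # first day of the event
    end: date                           # last day of the event
    seeds: list                         # seed URLs to start the extraction from


def topical_relevance(doc, spec):
    """Fraction of the specification's keywords that occur in the document."""
    tokens = set(doc.text.lower().split())
    return len(tokens & spec.keywords) / len(spec.keywords)


def temporal_relevance(doc, spec, half_life_days=30.0):
    """1.0 inside the event interval, halving every half_life_days outside it."""
    if spec.start <= doc.captured <= spec.end:
        return 1.0
    gap = min(abs((doc.captured - spec.start).days),
              abs((doc.captured - spec.end).days))
    return 0.5 ** (gap / half_life_days)


def relevance(doc, spec, w_topic=0.5):
    """Weighted sum of topical and temporal relevance (assumed combination)."""
    return (w_topic * topical_relevance(doc, spec)
            + (1.0 - w_topic) * temporal_relevance(doc, spec))


def extract_collection(archive, spec, threshold=0.5, max_docs=100):
    """Best-first traversal of the archived link graph, starting at the seeds.

    Links are prioritised by the relevance of the page they were found on,
    and pages scoring below the threshold are neither kept nor expanded.
    """
    frontier = [(-1.0, url) for url in spec.seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(spec.seeds)
    collection = []
    while frontier and len(collection) < max_docs:
        _, url = heapq.heappop(frontier)
        doc = archive.get(url)
        if doc is None:                 # URL was never captured by the archive
            continue
        score = relevance(doc, spec)
        if score < threshold:
            continue
        collection.append((url, round(score, 3)))
        for link in doc.links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collection
```

Calling `extract_collection(archive, spec)` with a dictionary of `ArchivedDoc` objects returns a list of `(url, score)` pairs in the order the pages were visited. Unlike a live focused crawl, every lookup here is a read from already captured data, so a page missing from the archive is simply skipped.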

Updated: 2018-10-27