Keeping it under lock and keywords: exploring new ways to open up the web archives with notebooks
Archival Science, Pub Date: 2022-07-04, DOI: 10.1007/s10502-022-09391-6
Leontien Talboom 1, 2 , Mark Bell 2

The UK Government Web Archive (UKGWA) has been archiving government websites since 1996 and now holds regular snapshots of over 5000 sites. Currently, this material can be accessed by browsing or through a simple keyword search interface on its website, and it has also been catalogued in The National Archives’ online catalogue, Discovery. However, the scale of the UKGWA exposes the limits of the current search interface, and there is no facility for understanding the archive in aggregate. This article seeks to go beyond simple keyword search by exploring the data sources available, from APIs to web crawling, for computational analysis of the UKGWA. The article is accompanied by two Python Notebooks, which present examples of analysis using each data source. Notebooks lower the technical barriers for readers to explore and interpret the UKGWA as data, while surfacing the challenges of making web material computationally accessible.




Updated: 2022-07-05