The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle
arXiv - CS - Digital Libraries, Pub Date: 2020-03-31, DOI: arxiv-2003.14046
Xinyue Wang, Zhiwu Xie

The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
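
The data-structure point in the abstract can be illustrated with a small sketch. It is not the authors' benchmark and does not use real WARC, Parquet, or Avro libraries: `build_warc_like`, `scan_mime_types`, and `to_columns` are hypothetical helpers, and the record layout is a deliberately simplified stand-in for the WARC grammar (no gzip, only three header fields). It only shows why a record-by-record stream forces a full sequential walk to read one field, while a column-oriented layout exposes that field directly.

```python
import io


def build_warc_like(records):
    """Serialize (url, mime, payload) tuples into a simplified WARC-style stream.

    Assumption: this mimics the header/Content-Length/payload shape only;
    it is not a valid WARC file.
    """
    out = io.BytesIO()
    for url, mime, payload in records:
        header = (
            "WARC/1.0\r\n"
            f"WARC-Target-URI: {url}\r\n"
            f"Content-Type: {mime}\r\n"
            f"Content-Length: {len(payload)}\r\n"
            "\r\n"
        ).encode("utf-8")
        out.write(header + payload + b"\r\n\r\n")
    return out.getvalue()


def scan_mime_types(blob):
    """Collect every Content-Type by walking the stream from the first byte."""
    stream = io.BytesIO(blob)
    mimes = []
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if line.strip() != b"WARC/1.0":
            continue  # blank separator lines between records
        length, mime = 0, None
        while True:
            line = stream.readline().strip()
            if not line:
                break  # blank line ends the header block
            name, _, value = line.partition(b": ")
            if name == b"Content-Type":
                mime = value.decode("utf-8")
            elif name == b"Content-Length":
                length = int(value)
        # The payload size is known only after this record's header is parsed,
        # so record N cannot be located without visiting records 0..N-1.
        stream.seek(length, io.SEEK_CUR)
        mimes.append(mime)
    return mimes


def to_columns(records):
    """Column-oriented layout: one list per field, in the spirit of Parquet."""
    return {
        "url": [r[0] for r in records],
        "mime": [r[1] for r in records],
        "payload": [r[2] for r in records],
    }


records = [
    ("http://example.org/", "text/html", b"<html>hi</html>"),
    ("http://example.org/logo.png", "image/png", b"\x89PNG"),
]
blob = build_warc_like(records)
columns = to_columns(records)

# Same answer either way, but the columnar read never touches URLs or payloads.
assert scan_mime_types(blob) == columns["mime"]
```

In this toy version the sequential scan still skips payload bytes with a seek; with per-record compression, which is common in practice, even that shortcut may not be available without an external index.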

Updated: 2020-05-28