A study of the performance of general compressors on log files
Empirical Software Engineering (IF 4.1), Pub Date: 2020-08-12, DOI: 10.1007/s10664-020-09822-x
Kundi Yao, Heng Li, Weiyi Shang, Ahmed E. Hassan

Large-scale software systems and cloud services continue to produce a large amount of log data. Such log data is usually preserved for a long time (e.g., for auditing purposes). General compressors, like the LZ77 compressor used in gzip, are usually used in practice to compress log data to reduce the cost of long-term storage. However, such general compressors do not consider the unique nature of log data. In this paper, we study the performance of general compressors on compressing log data relative to their performance on compressing natural language data. We used 12 widely used general compressors to compress nine log files that are collected based on surveying prior literature on text compression, log compression and log analysis. We observe that log data is more repetitive than natural language data, and that log data can be compressed and decompressed faster with higher compression ratios. Besides, the compressor with the highest compression ratio for natural language data is rarely the one with the highest compression ratio for log data. Nevertheless, the compressors with the highest compression ratio for log data are rarely adopted in practice by current logging libraries and log management tools. We also observe that the peak compression and decompression speeds of general compressors on log data are often achieved with a small data size, while such a size may not be used by log management tools. Finally, we observe that the optimal compression performance (measured by a combined compression performance score) of log data usually requires the compression level to be configured higher than the default level. Our findings call for careful consideration when choosing general compressors and their associated compression levels for log data in practice. In addition, our findings shed light on the opportunities for future research on compressors that better suit the characteristics of log data.
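
The kind of measurement the abstract describes (compression ratio plus compression and decompression time, across compressors and compression levels) can be sketched with Python's standard library. This is a minimal illustration, not the study's benchmark: the synthetic log lines and the helper names `make_log_lines`, `measure`, `compare_levels` and `compare_compressors` are assumptions made for the example, only three stdlib compressors stand in for the 12 studied, and the paper's combined compression performance score is not reproduced because the abstract does not define it.

```python
import bz2
import lzma
import time
import zlib


def make_log_lines(n):
    """Build synthetic, hypothetical log text (not data from the study)."""
    lines = [
        f"2020-08-12 10:{i // 60 % 60:02d}:{i % 60:02d} INFO worker-{i % 8} "
        f"handled request id={i} status=200\n"
        for i in range(n)
    ]
    return "".join(lines).encode("utf-8")


def measure(data, compress, decompress):
    """Return (compression ratio, compress seconds, decompress seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    mid = time.perf_counter()
    restored = decompress(packed)
    end = time.perf_counter()
    if restored != data:
        raise ValueError("round trip did not reproduce the input")
    # Ratio is original size over compressed size, so higher is better.
    return len(data) / len(packed), mid - start, end - mid


def compare_levels(data, levels=(1, 6, 9)):
    """Measure zlib (the DEFLATE format used by gzip) at several levels."""
    return {
        level: measure(data, lambda d, lv=level: zlib.compress(d, lv), zlib.decompress)
        for level in levels
    }


def compare_compressors(data):
    """Measure three stdlib compressors at their default settings."""
    return {
        "zlib": measure(data, zlib.compress, zlib.decompress),
        "bz2": measure(data, bz2.compress, bz2.decompress),
        "lzma": measure(data, lzma.compress, lzma.decompress),
    }


if __name__ == "__main__":
    sample = make_log_lines(20000)
    for level, (ratio, c_time, d_time) in compare_levels(sample).items():
        print(f"zlib level {level}: ratio={ratio:.2f} "
              f"compress={c_time:.4f}s decompress={d_time:.4f}s")
    for name, (ratio, c_time, d_time) in compare_compressors(sample).items():
        print(f"{name}: ratio={ratio:.2f} "
              f"compress={c_time:.4f}s decompress={d_time:.4f}s")
```

Timings from a single run like this are noisy and depend on the machine, so any real comparison would repeat each measurement and use actual log files of varying sizes rather than generated text.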

Updated: 2020-08-12