Model of Pseudo-Random Sequences Generated by Encryption and Compression Algorithms
Programming and Computer Software (IF 0.7), Pub Date: 2021-07-30, DOI: 10.1134/s0361768821040058
A. V. Kozachok, A. A. Spirin

Classification of high-entropy data sources is one of the key problems in the field of information security. Many methods exist for classifying encrypted and compressed sequences; however, most of them rely on digital signatures or on service information found in the headers of the containers used to store or transfer the data. This paper reviews the state of research on the classification of encrypted and compressed data and develops a model of encrypted and compressed sequences. Our experiments demonstrate the high accuracy of the proposed approach, from which we conclude that it improves on the classification methods for encrypted and compressed data considered in our study. The approach can be implemented in data leak prevention systems or corporate email systems to analyze attachments sent outside the controlled perimeter of a government agency or enterprise.

Purpose of the research – to develop a model of pseudo-random sequences generated by data encryption and compression algorithms that reflects the statistical properties of these sequences as accurately as possible.

Methods of the research – statistical data analysis, mathematical statistics, and machine learning.

Result of the research – Existing studies on the classification of encrypted and compressed sequences in the field of information security were analyzed. A model of pseudo-random sequences generated by encryption and compression algorithms was developed; it takes into account their statistical features, namely the distribution of bytes and the distribution of subsequences of limited length, which together constitute a new probability space. The choice of the statistical features used in the model is justified. Experiments were carried out to determine the hyperparameters of the classifier on a dataset generated from encrypted and compressed files with their headers excluded. The constraint used in the model, namely the length of the pseudo-random sequences (approximately 600 Kb), is defined. Further experiments determined the effect of the statistical features used in the model on classification accuracy. The proposed approach classifies encrypted and compressed data with an accuracy of 0.97.
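
The two statistical features named above, the distribution of bytes and the distribution of limited-length subsequences, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of overlapping byte-level subsequences of length 2, and the use of `os.urandom` output as a stand-in for ciphertext are all assumptions made for illustration; the abstract does not specify the subsequence unit, the subsequence length, or the classifier.

```python
import math
import os
import zlib
from collections import Counter


def byte_distribution(data: bytes) -> list:
    """Relative frequency of each of the 256 possible byte values."""
    if not data:
        raise ValueError("data must be non-empty")
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def subsequence_distribution(data: bytes, length: int = 2) -> dict:
    """Relative frequency of each overlapping byte subsequence of `length`.

    Assumption: subsequences are taken at byte granularity with stride 1.
    """
    if length < 1 or len(data) < length:
        raise ValueError("length must be between 1 and len(data)")
    windows = len(data) - length + 1
    counts = Counter(data[i:i + length] for i in range(windows))
    return {seq: count / windows for seq, count in counts.items()}


def shannon_entropy(probabilities) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    # Hypothetical inputs: random bytes stand in for encrypted data,
    # and a zlib stream of repetitive text stands in for compressed data.
    pseudo_encrypted = os.urandom(64 * 1024)
    plain = b" ".join(str(i).encode() for i in range(200_000))
    compressed = zlib.compress(plain)

    for name, sample in (("random", pseudo_encrypted), ("zlib", compressed)):
        byte_entropy = shannon_entropy(byte_distribution(sample))
        pair_entropy = shannon_entropy(subsequence_distribution(sample, 2).values())
        print(f"{name}: byte entropy={byte_entropy:.4f} bits, "
              f"pair entropy={pair_entropy:.4f} bits")
```

Feature vectors of this kind (the 256 byte frequencies plus the subsequence frequencies) could then be passed to a classifier. Which classifier and which hyperparameters the paper uses is not stated in the abstract.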




Updated: 2021-07-30