Abstract
Classification of high-entropy data sources is one of the key problems in the field of information security. Many methods currently exist for classifying encrypted and compressed sequences; however, most of them rely on digital signatures or on service information found in the headers of the containers used to store or transfer data. This paper analyzes the state of research on the classification of encrypted and compressed data and develops a model of encrypted and compressed sequences. Our experiments demonstrate the high accuracy of the proposed approach, which allows us to conclude that it improves the methods for classifying encrypted and compressed data used in our study. The approach can be implemented in data leak prevention systems or corporate email systems to analyze attachments sent outside the controlled perimeter of a government agency or enterprise.
Purpose of the research – to develop a model of pseudo-random sequences generated by data encryption and compression algorithms that most accurately reflects the statistical properties of these sequences.
Methods of the research – statistical data analysis, mathematical statistics, and machine learning.
Result of the research – An analysis of the studies aimed at solving the problem of classification for encrypted and compressed sequences in the field of information security is carried out. A model of pseudo-random sequences generated by encryption and compression algorithms is developed taking into account their statistical features: distribution of bytes and distribution of subsequences of limited length, which constitute a new probabilistic space. The choice of the statistical features used in the pseudo-random sequence model is justified. Experiments for determining the hyperparameters of the classifier on a dataset generated from encrypted and compressed files without taking their headers into account are carried out. The constraints used in the pseudo-random sequence model, namely, the length of pseudo-random sequences (approximately 600 Kb), are defined. Experiments for determining the effect of the statistical features used in the model on classification accuracy are conducted. The proposed approach allows encrypted and compressed data to be classified with an accuracy of 0.97.
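The two groups of statistical features named above (the distribution of bytes and the distribution of subsequences of limited length) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of overlapping byte-level subsequences, and the default length k = 2 are assumptions made only for illustration, since the abstract does not specify how the subsequences are formed or which lengths are used.

```python
from collections import Counter
from typing import Dict, List


def byte_distribution(data: bytes) -> List[float]:
    """Relative frequency of each of the 256 possible byte values.

    Hypothetical helper: illustrates the "distribution of bytes" feature.
    """
    n = len(data)
    if n == 0:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return [counts.get(value, 0) / n for value in range(256)]


def subsequence_distribution(data: bytes, k: int = 2) -> Dict[bytes, float]:
    """Relative frequency of overlapping subsequences of k bytes.

    Assumption: subsequences are taken at the byte level with a sliding
    window of step 1; the paper may define them differently.
    """
    total = len(data) - k + 1
    if k <= 0 or total <= 0:
        return {}
    counts = Counter(data[i:i + k] for i in range(total))
    return {subseq: count / total for subseq, count in counts.items()}


if __name__ == "__main__":
    sample = bytes(range(256)) * 4  # placeholder input, not real ciphertext
    bytes_hist = byte_distribution(sample)
    pairs_hist = subsequence_distribution(sample, k=2)
    print(len(bytes_hist), len(pairs_hist))
```

In a pipeline of the kind described here, such frequency vectors would be computed on the file contents with the headers stripped and then passed to a classifier; the choice of classifier and its hyperparameters is outside the scope of this sketch.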
Notes
https://www.openssl.org. Accessed February 8, 2021.
https://www.win-rar.com. Accessed February 8, 2021.
https://www.7-zip.org. Accessed February 8, 2021.
A statistical test suite for random and pseudorandom number generators for cryptographic applications, National Institute of Standards and Technology (NIST), 2010. https://csrc.nist.gov/publications/detail/sp/800-22/rev-1a/final. Accessed February 8, 2021.
Funding
This work was supported by the Ministry of Education and Science of the Russian Federation, project no. 18/2020.
Additional information
Translated by Yu. Kornienko
About this article
Cite this article
Kozachok, A.V., Spirin, A.A. Model of Pseudo-Random Sequences Generated by Encryption and Compression Algorithms. Program Comput Soft 47, 249–260 (2021). https://doi.org/10.1134/S0361768821040058
DOI: https://doi.org/10.1134/S0361768821040058