A Multi-Perspective malware detection approach through behavioral fusion of API call sequence
Introduction
The widespread development of malicious software (malware) shows that the malware industry is more vigorous than ever. Malware is not only one of the most immediate threats on the Internet today; it has also become a profitable commodity in the underground economy of cybercrime (Calleja et al., 2018).
According to the 2018 Symantec Internet security report (Symantec, n.d.), the total number of new malware variants in 2017 was 669,947,865, an increase of 87.7% over the preceding year. Such proliferation means that malware authors generate about 1.8 million new malware samples per day. Unfortunately, classical security techniques such as antivirus software cannot cope with this rapidly increasing malware diversity, which raises doubts about the efficacy and trustworthiness of currently used approaches.
Furthermore, the rapid development of the Internet of Things (IoT) poses additional security challenges. Although IoT enables the wide diffusion of connected devices across different platforms, the IoT environment is also vulnerable to many malware attacks launched through regular computers and smartphones (Alasmary et al., 2019). Infected computers and smartphones can in turn infect other connected devices in the IoT environment.
For example, Trojan.Mirai.1 is a variant of Mirai that can attack and infect the Windows platform. The infected hosts are then used to infect other devices and to steal confidential information from them (Guo et al., 2020). Moreover, the infected devices are turned into a botnet to launch variants of Distributed Denial of Service (DDoS) attacks (Jia, Zhong, Alrawais, Gong, Cheng, 2020, Moustafa, Turnbull, Choo, 2018, Ravi, Shalinie, 2020). Consequently, malware that targets conventional computers can extend its attacks to IoT devices. Accordingly, we need reliable tools that detect these threats and protect conventional devices across various platforms.
Research communities have presented different heuristic approaches to detect and analyze malware (Maiorca, Biggio, Giacinto, 2019, Shalaginov, Banin, Dehghantanha, Franke, 2018, Singh, Dutta, Saha, 2019). However, designing effective and efficient malware detection approaches remains challenging (Azmoodeh, Dehghantanha, Conti, Choo, 2018, Nguyen-Vu, Ahn, Jung, 2019, Ye, Hou, Chen, Lei, Wan, Wang, Xiong, Shao, 2019). The reason is that malware authors and cybercriminals are continually improving their tools and skills to circumvent detection and are devising new attack strategies to breach targets (Cohen, Hendler, 2018, Friedrichs, Huger, O’donnell). For example, malware authors use obfuscation techniques to modify or mutate malware samples so that they bypass detection systems (Hopkins, Dehghantanha, 2015, Or-Meir, Nissim, Elovici, Rokach, 2019, Skolka, Staicu, Pradel, 2019, Yakura, Shinozaki, Nishimura, Oyama, Sakuma, 2019, You, Yim, 2010). Specifically, obfuscation techniques change the structure of the malicious code or the run-time behavior of the malware. For instance, a slight change to the opcode sequence structure can make a malware variant unrecognizable or complicate its detection (Burnap, French, Turner, Jones, 2018, Martín, Rodríguez-Fernández, Camacho, 2018, Ye, Li, Adjeroh, Iyengar, 2017, Zelinka, Amer, 2019). Unfortunately, traditional security strategies are not equipped with smart heuristics to deal with obfuscated samples. Hence, they are incapable of coping with newly modified samples.
Current malware analysis techniques are generally classified based on the feature set into two major categories: static and dynamic (Cui, Zhou, Wang, Li, Ren, 2018, Darabian, Dehghantanha, Hashemi, Homayoun, Choo, 2020, Galal, Mahdy, Atiea, 2016, Pektaş, Acarman, 2017). Static analysis examines the portable executable (PE) file content without running it on the system. Generally, static analyzers rely on disassembling tools to extract low-level static features such as byte sequences, string patterns, and opcode sequences. The extracted features are used to understand the behavioral characteristics and structure of malicious PE. On the other hand, dynamic analysis approaches derive behavioral data by executing the sample in a secured virtual environment called a sandbox. Dynamic features such as files, registry, network, and process activities can capture additional properties that reflect the behavior’s intentions. The extracted dynamic features are the outcome of integrated scenarios provided to the malware sample to perform its malicious task (Choi et al., 2019).
Although static analysis approaches are relatively simple, they are vulnerable to obfuscation techniques. Moreover, pattern matching approaches are considered ineffective in recognizing zero-day or polymorphic malware (Kumar et al., 2019). Consequently, static approaches are usually considered a weak basis for malware classification (Martín, Rodríguez-Fernández, Camacho, 2018, Ucci, Aniello, Baldoni, 2019). In contrast to static analysis, dynamic analysis does not need reverse engineering methods such as decompilation and decryption (Zhao et al., 2019). Despite its substantial consumption of time and storage space, dynamic analysis is more flexible and more robust to obfuscation methods than static analysis (Amer, El-Sappagh, Hu, 2020, Salehi, Sami, Ghiasi, 2017, Vinayakumar, Alazab, Soman, Poornachandran, Venkatraman, 2019).
Behavioral features, especially API call sequences, have attracted considerable attention for malware detection and classification. Researchers have mainly focused on extracting patterns from API call sequences. Those patterns are used with machine learning (ML) classification algorithms to recognize malicious sequences (Cohen, Hendler, Rubin, 2018, Fan, Liu, Luo, Chen, Tian, Zheng, Liu, 2018, Gibert, Mateu, Planes, 2020, Salehi, Sami, Ghiasi, 2017, Wadkar, Di Troia, Stamp, 2020). The detection of new malware variants is therefore performed by comparing their behavior to predefined stored models. However, dynamic approaches are vulnerable to false positives and false negatives (Amer and Zelinka, 2020). Moreover, because feature handling requires professional expertise, it is hard to obtain meaningful behavioral features that improve malware detection (Zhao et al., 2019). As a result, traditional ML classification algorithms become unconvincing for malware detection.
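To make the idea of extracting patterns from API call sequences concrete, the following is a minimal sketch that counts contiguous n-grams in a single trace. It is not the feature extractor of any cited work; the function name `api_ngrams` and the example trace are assumptions introduced only for illustration.

```python
from collections import Counter

def api_ngrams(sequence, n=2):
    """Count contiguous n-grams (as tuples) in one API call sequence."""
    return Counter(
        tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)
    )

# Hypothetical trace; the API names are illustrative, not taken from a dataset.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
bigrams = api_ngrams(trace, n=2)

# Each distinct n-gram can serve as one feature for an ML classifier.
for gram, count in bigrams.most_common():
    print(gram, count)
```

Counts of this kind are typically turned into fixed-length feature vectors before training, which is where the expert feature handling mentioned above comes in.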
Many research works have combined multiple ML classifiers into ensemble classification models for malware detection. Works such as those by Menahem et al. (2009), Yan et al. (2018), and Khasawneh et al. (2015) have tried to overcome the performance limitations of a single ML classifier. However, although they improved malware detection accuracy, they are still vulnerable to false positives and false negatives. Moreover, all previous methods have ignored the contextual relations encoded among the individual API functions within the whole API calling sequence.
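As a generic illustration of ensemble classification, the sketch below combines the labels produced by several classifiers through simple majority voting. This is one common combination rule and is an assumption here; it is not claimed to be the scheme used in the cited works. The function name `majority_vote` and the labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    With a tie, the label that appears first in `predictions` wins.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three independent classifiers for one sample.
votes = ["malware", "goodware", "malware"]
print(majority_vote(votes))
```

A vote of this kind operates only on the final labels, so it cannot recover contextual relations between API calls that every member classifier has already discarded.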
In general, the most efficient malware model is one that can tell whether new API call sequences are malicious or not. However, the ambiguity of API features and their numerous interconnections make developing an accurate and generic malware model extremely difficult. Perhaps the most significant impediment to developing successful malware models is the lack of understanding of the malware language. Furthermore, the large number of API functions contained within a single calling sequence causes the malware model to have a high level of perplexity.
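The perplexity, defined by Eq. (1), is a measure used to evaluate how well a probability model predicts a sample (Jurafsky and Martin, 2009):

$$\mathrm{Perplexity}(S) = 2^{-\frac{1}{N}\sum_{i=1}^{m}\log_2 P(s_i)} \qquad (1)$$

where $S$ denotes the entire set of calling sequences, $N$ refers to the count of all APIs in $S$, $s_i$ is the $i$-th calling sequence, and $m$ is the number of calling sequences in $S$.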
Consequently, the ideal malware model is one that assigns a high probability (low perplexity) to unseen API call sequences that turn out to be malicious. Assigning a high probability to new samples indicates that the malware model is not perplexed by unknown samples, which means that it has a clear perception of how malware behaves.
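The relationship between probability and perplexity can be illustrated with a small sketch. It assumes a bigram model with add-one smoothing, which is a standard textbook choice and not necessarily the probability model used in this paper; the function names `train_bigram` and `perplexity` and the toy API names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Collect context and bigram counts from training API call sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # "<s>" marks the start of a sequence
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab

def perplexity(sequences, model):
    """Perplexity of `sequences`: 2 ** (-(1/N) * sum of log2 probabilities)."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen APIs (assumption)
    log_prob, n_calls = 0.0, 0
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            # Add-one smoothed conditional probability P(cur | prev).
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            log_prob += math.log2(p)
            n_calls += 1  # N: total number of API calls scored
    return 2 ** (-log_prob / n_calls)

# Toy training data with hypothetical API names.
model = train_bigram([["A", "B", "C"]] * 3)
familiar = perplexity([["A", "B", "C"]], model)
unfamiliar = perplexity([["C", "A", "B"]], model)
print(familiar, unfamiliar)
```

In this toy setting the sequence that follows the training order receives the lower perplexity, which matches the intuition that a well-fitted model is less perplexed by behavior it has effectively seen before.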
Correspondingly, in this paper, we propose a model that effectively distinguishes between malicious and non-malicious API calling sequence behaviors. The proposed model is an evolution of our preliminary work published in Amer and Zelinka (2020); here, however, we thoroughly explore additional behavioral features derived from the API calling sequence. We also demonstrate the novelty of our model through experiments on malware datasets for Windows and Android, which are considered the operating systems most favored for malware attacks.
Accordingly, we developed malware detection models by integrating statistical, contextual, and graph mining features over the API calling sequence. We use these heterogeneous features to generate comprehensive multi-perspective opinions regarding the calling sequence behavior. In this way, we provide promising solutions that remedy some of the drawbacks of dynamic detection approaches discussed in Guo and Zhu (2018) and Narayanan et al. (2018).
Our contributions in this paper are:
- Introducing effective relational multi-perspective behavioral recognition models for the Windows and Android operating systems.
- Modelling the relationships between individual API functions in malicious and non-malicious calling sequences.
- Reducing the perplexity of the generated malware model.
- Proposing a contextual indexing mechanism for APIs.
- Suggesting a new visual representation of the API calling sequence that reveals the sequence behavior.
- Identifying malicious API call sequences that are deceptive or mimic benign behavior.
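One simple way to picture relationships between individual API functions is a weighted transition graph built from a calling sequence. The sketch below is only an assumption about what such a structure could look like; it is not the representation proposed in this paper, and the function name `transition_graph` and the example trace are hypothetical.

```python
from collections import defaultdict

def transition_graph(sequence):
    """Map each API to the relative frequency of the APIs that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(sequence, sequence[1:]):
        counts[src][dst] += 1
    graph = {}
    for src, targets in counts.items():
        total = sum(targets.values())
        # Normalise outgoing edge weights so they sum to 1 per source node.
        graph[src] = {dst: c / total for dst, c in targets.items()}
    return graph

# Hypothetical trace; the API names are illustrative only.
trace = [
    "OpenProcess", "VirtualAllocEx", "WriteProcessMemory",
    "OpenProcess", "CreateRemoteThread",
]
graph = transition_graph(trace)
for src, targets in graph.items():
    print(src, "->", targets)
```

Edge weights of this kind could feed either a graph mining step or a visual rendering of the sequence, depending on which perspective is being modelled.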
This paper is structured as follows: background and related work are presented in Section 2. Our proposed model is discussed in detail in Section 3. The datasets and the empirical evaluation of the proposed model are described in Section 4. The conclusion and future work are given in Section 5.
Related Work
This section will outline and discuss the most relevant literature that uses dynamic analysis features, specifically the API calls for malware analysis and detection.
Tracking the API calling sequence is an excellent tactic to monitor any application (Alaeiyan, Parsa, Conti, 2019, Xiao, Lin, Sun, Ma, 2019, Zhao, Bo, Feng, Xu, Yu, 2019). The calling API sequence between processes and the operating system can be considered the most substantial behavioral difference between malicious and
Proposed model
Previous research works mainly concentrated on obtaining useful patterns from API call sequences. The derived patterns were used as training features for machine learning algorithms to detect malicious sequences in unseen or new samples. However, malware authors are also smart; they know that their malware will be analyzed to extract fingerprints in the form of patterns. Therefore, any new misplacement in the API sequences can render those machine learning detection models ineffective.
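The fragility of fixed patterns can be shown with a minimal sketch. It assumes a detector that looks for an exact contiguous pattern in the trace, which is a deliberately simplified stand-in for pattern-based detection in general; the function name `contains_pattern`, the signature, and the traces are all hypothetical.

```python
def contains_pattern(sequence, pattern):
    """Return True if `pattern` occurs as a contiguous run in `sequence`."""
    n = len(pattern)
    return any(
        tuple(sequence[i:i + n]) == tuple(pattern)
        for i in range(len(sequence) - n + 1)
    )

# Hypothetical signature and traces; the API names are illustrative only.
signature = ("VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread")
original = ["OpenProcess", *signature, "CloseHandle"]

# The same behavior with one unrelated call inserted inside the pattern.
mutated = original[:3] + ["GetTickCount"] + original[3:]

print(contains_pattern(original, signature))
print(contains_pattern(mutated, signature))
```

A single inserted call is enough to break the exact match even though the injected behavior is unchanged, which is the weakness that motivates looking beyond isolated patterns.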
Results and discussion
In this section, we evaluate our model using different datasets and standard evaluation metrics. We demonstrate our model's ability to detect unknown sequences. Finally, we show how our model performs on malicious sequences that appear to be goodware, commonly known as fake goodware.
Conclusion
This paper introduced multi-perspective malware detection models that relied on the fusion of statistical, contextual, and graph mining features. We also showed that relying on multi-perspective detection models could still provide reliable performance even with the changing nature of the API calling sequence. We have experimentally proved that there are significant contrasts between the contextual behavior of malware and goodware. Upon such a distinction, we built different models that
CRediT authorship contribution statement
Eslam Amer: Conceptualization, Methodology, Formal analysis, Writing – original draft. Ivan Zelinka: Conceptualization, Formal analysis, Writing – original draft. Shaker El-Sappagh: Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The following grants are acknowledged for the financial support of this research: SGS grants SP2020/108 and SP2020/78, VSB Technical University of Ostrava.
References (112)
- et al. (2019). Analysis and classification of context-based malware behavior. Computer Communications.
- et al. (2019). Enhancing unsupervised neural networks based text summarization with word embedding and ensemble learning. Expert Systems with Applications.
- et al. (2020). DL-Droid: Deep learning based Android malware detection using real devices. Computers & Security.
- et al. (2020). A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence. Computers & Security.
- et al. (2006). Maximum likelihood estimation for all-pass time series models. Journal of Multivariate Analysis.
- et al. (2018). Malware classification using self organising feature maps and machine activity data. Computers & Security.
- et al. (2019). Metamorphic malicious code behavior detection using probabilistic inference methods. Cognitive Systems Research.
- et al. (2018). Scalable detection of server-side polymorphic malware. Knowledge-Based Systems.
- et al. (2018). Detection of malicious webmail attachments based on propagation patterns. Knowledge-Based Systems.
- et al. (2018). A malware detection method based on family behavior graph. Computers & Security.
- Malware detection in mobile environments based on autoencoders and API-images. Journal of Parallel and Distributed Computing.
- The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. Journal of Network and Computer Applications.
- MalDozer: Automatic framework for Android malware detection using deep learning. Digital Investigation.
- A learning model to detect maliciousness of portable executable using integrated feature set. Journal of King Saud University - Computer and Information Sciences.
- Android malware detection through hybrid features fusion and ensemble classifiers: the AndroPyTool framework and the OmniDroid dataset. Information Fusion.
- CANDYMAN: Classifying Android malware families by modelling dynamic traces with Markov chains. Engineering Applications of Artificial Intelligence.
- Improving malware detection by applying multi-inducer ensemble. Computational Statistics & Data Analysis.
- Android fragmentation in malware detection. Computers & Security.
- End-to-end malware detection for Android IoT devices using deep learning. Ad Hoc Networks.
- Sentiment analysis based on improved pre-trained word embeddings. Expert Systems with Applications.
- Early-stage malware prediction using recurrent neural networks. Computers & Security.
- MAAR: Robust features to detect malicious activity based on API calls, their arguments and return values. Engineering Applications of Artificial Intelligence.
- Integrated static and dynamic analysis for malware detection. Procedia Computer Science.
- Survey of machine learning techniques for malware analysis. Computers & Security.
- Detecting malware evolution using support vector machines. Expert Systems with Applications.
- Neural malware analysis with attention mechanism. Computers & Security.
- Using spatio-temporal information in API calls with machine learning algorithms for malware detection. Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence.
- Analyzing and detecting emerging Internet of Things malware: A graph-based approach. IEEE Internet of Things Journal.
- Zero-day malware detection based on supervised learning algorithms of API call signatures. Proceedings of the Ninth Australasian Data Mining Conference - Volume 121.
- Enhancing efficiency of web search engines through ontology learning from unstructured information sources. 2015 IEEE International Conference on Information Reuse and Integration.
- Contextual identification of Windows malware through semantic interpretation of API call sequence. Applied Sciences.
- AKEA: an Arabic keyphrase extraction algorithm. International Conference on Advanced Intelligent Systems and Informatics.
- Keyphrase extraction methodology from short abstracts of medical documents. 2016 8th Cairo International Biomedical Engineering Conference (CIBEC).
- Enhancing semantic Arabic information retrieval via Arabic Wikipedia assisted search expansion layer. International Conference on Advanced Intelligent Systems and Informatics.
- Detecting crypto-ransomware in IoT networks based on energy consumption footprint. Journal of Ambient Intelligence and Humanized Computing.
- Integration of multi-modal features for Android malware detection using linear SVM. 2016 11th Asia Joint Conference on Information Security (AsiaJCIS).
- Evasion attacks against machine learning at test time. Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
- Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering.
- The MalSource dataset: Quantifying complexity and code reuse in malware development. IEEE Transactions on Information Forensics and Security.
- Deep learning based sequential model for malware analysis using Windows exe API calls. PeerJ Computer Science.
- The need for speed: An analysis of Brazilian malware classifers. IEEE Security & Privacy.
- Towards privacy-preserving malware detection systems for Android. 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS).
- An opcode-based technique for polymorphic Internet of Things malware detection. Concurrency and Computation: Practice and Experience.
- Android malware familial classification and representative sample selection via frequent subgraph analysis. IEEE Transactions on Information Forensics and Security.
- Comparing API call sequence algorithms for malware detection. Workshops of the International Conference on Advanced Information Networking and Applications.
- Bio-inspired computational paradigm for feature investigation and malware detection: interactive analytics. Multimedia Tools and Applications.
- A sense of self for Unix processes. Proceedings 1996 IEEE Symposium on Security and Privacy.
- Behavior-based features model for malware detection. Journal of Computer Virology and Hacking Techniques.
Eslam Amer is currently working as an associate professor of computer science. He earned his Ph.D. in 2012. His main research interests are focused on natural language processing and machine learning, along with their applications. He had two Post-Doctoral positions in top universities in Europe. Eslam is highly interested in embedding natural language processing to find new solutions to current challenges in different domains. Currently, he is working on proposing new paradigms for malware analysis using NLP and deep learning.
Ivan Zelinka is currently working as a professor at the Technical University of Ostrava (VŠB-TU), Faculty of Electrical Engineering and Computer Science, and at the national supercomputing center IT4 Innovations. Dr. Zelinka has also participated in numerous grants and in two EU projects, as a team member (FP5 - RESTORM) and as supervisor of the Czech team (FP7 - PROMOEVO). Currently, he is the head of the Department of Applied Informatics, and throughout his career he has supervised numerous MSc. and Bc. diploma theses in addition to supervising doctoral students, including students from abroad. He was awarded the Siemens Award for his Ph.D. thesis and received an award from the journal Software news for his book about artificial intelligence. Ivan Zelinka is a member of the British Computer Society and IEEE (a committee of the Czech section of Computational Intelligence), and serves on the international program committees of various conferences and three international journals (Soft Computing, SWEVO, Editorial Council of Security Revue). He is the author of numerous journal articles as well as books in Czech and English.
Shaker El-Sappagh received the bachelor’s degree in computer science from the Information Systems Department, Faculty of Computers and Information, Cairo University, Egypt, in 1997, and the master’s degree from the same university in 2007. He received the Ph.D. degree in computer science from the Information Systems Department, Faculty of Computers and Information, Mansura University, Mansura, Egypt, in 2015. In 2003, he joined the Department of Information Systems, Faculty of Computers and Information, Minia University, Egypt, as a teaching assistant. Since June 2016, he has been with the Department of Information Systems, Faculty of Computers and Information, Benha University, as an assistant professor. Currently, he is a Post-Doctoral Fellow at Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain. He has publications in clinical decision support systems and semantic intelligence. His current research interests include machine learning, medical informatics, (fuzzy) ontology engineering, distributed and hybrid clinical decision support systems, semantic data modeling, fuzzy expert systems, and cloud computing. He is a reviewer for many journals and is very interested in research on disease diagnosis and treatment.