Elsevier

Computers & Security

Volume 110, November 2021, 102449

A Multi-Perspective malware detection approach through behavioral fusion of API call sequence

https://doi.org/10.1016/j.cose.2021.102449

Abstract

The widespread development of the malware industry is considered the main threat to our e-society. Therefore, malware analysis should be enriched with smart heuristic tools that recognize malicious behaviors effectively. Although the API calling graph generated for a malicious process encodes worthwhile information about its behavior, it is impractical to generate a behavior graph for each process. Therefore, we experimented with creating generic behavioral graph models that describe malicious and non-malicious processes. These behavioral models rely on the fusion of statistical, contextual, and graph mining features that capture explicit and implicit relationships between API functions in the calling sequence. The generated models demonstrated the behavioral contrast between malicious and non-malicious calling sequences. Based on that distinction, we built different relational perspective models that characterize processes’ behaviors. To demonstrate the novelty of our approach, we evaluated it on the Windows and Android platforms. Our experiments showed that the proposed system identified unseen malicious samples with high accuracy and a low false-positive rate. In terms of detection accuracy, our model achieves an average accuracy of 0.997 and 0.977 on unseen Windows and Android malware testing samples, respectively. Moreover, we proposed a new indexing method for APIs based on their contextual similarities. We also suggested a new expressive, visualized form that renders the API calling sequence. In addition, we introduced a confidence metric for our model's classification decision. Furthermore, we developed a behavioral heuristic that effectively identified malicious API call sequences that were deceptive or mimicked benign behavior.

Introduction

The widespread development of malicious software (malware) shows that the malware industry is more vigorous than ever. Malware is not only one of the most immediate threats on the Internet today; it has also become a profitable commodity in the underground economy of cybercrime (Calleja et al., 2018).

According to the 2018 Symantec Internet security report (Symantec, 0000), the total number of new malware variants in 2017 was 669,947,865, an increase of 87.7% over the preceding year. At that rate, malware authors generated roughly 1.8 million new malware samples per day on average (669,947,865 / 365 ≈ 1.84 million). Unfortunately, classical security techniques such as antivirus software cannot cope with this rapidly increasing malware diversity, which raises doubts about the efficacy and trustworthiness of currently used approaches.

Furthermore, the rapid development of the Internet of Things (IoT) is posing other security challenges. Although IoT allows the vast diffusion of connected devices across different platforms, the IoT environment is also vulnerable to plenty of malware attacks through regular computers and smartphones (Alasmary et al., 2019). The infected computers and smartphones can infect other connected devices in the IoT environment.

For example, Trojan.Mirai.1 is a variant of Mirai that can attack and infect the Windows platform. The infected hosts are then used to infect other devices and steal confidential information from them (Guo et al., 2020). Moreover, the infected devices are turned into a botnet that launches variants of Distributed Denial of Service (DDoS) attacks (Jia, Zhong, Alrawais, Gong, Cheng, 2020, Moustafa, Turnbull, Choo, 2018, Ravi, Shalinie, 2020). Consequently, malware that targets conventional computers can extend its attacks to IoT devices. Accordingly, we need reliable tools to detect these threats and protect conventional devices on various platforms.

Research communities have presented different heuristic approaches to detect and analyze malware (Maiorca, Biggio, Giacinto, 2019, Shalaginov, Banin, Dehghantanha, Franke, 2018, Singh, Dutta, Saha, 2019). However, designing effective and efficient malware detection approaches remains challenging (Azmoodeh, Dehghantanha, Conti, Choo, 2018, Nguyen-Vu, Ahn, Jung, 2019, Ye, Hou, Chen, Lei, Wan, Wang, Xiong, Shao, 2019). The reason is that malware authors and cybercriminals continually improve their tools and skills to circumvent detection and devise new attack strategies to breach targets (Cohen, Hendler, 2018, Friedrichs, Huger, O’donnell). For example, malware authors use obfuscation techniques to modify or mutate malware samples so that they bypass detection systems (Hopkins, Dehghantanha, 2015, Or-Meir, Nissim, Elovici, Rokach, 2019, Skolka, Staicu, Pradel, 2019, Yakura, Shinozaki, Nishimura, Oyama, Sakuma, 2019, You, Yim, 2010). Specifically, obfuscation techniques change the structure of the malicious code or the malware's run-time behavior. For instance, a slight change to the opcode sequence can make a malware variant unrecognizable or complicate its detection (Burnap, French, Turner, Jones, 2018, Martín, Rodríguez-Fernández, Camacho, 2018, Ye, Li, Adjeroh, Iyengar, 2017, Zelinka, Amer, 2019). Unfortunately, traditional security strategies are not equipped with smart heuristics to deal with obfuscated samples. Hence, they cannot cope with newly modified samples.

Current malware analysis techniques are generally classified based on the feature set into two major categories: static and dynamic (Cui, Zhou, Wang, Li, Ren, 2018, Darabian, Dehghantanha, Hashemi, Homayoun, Choo, 2020, Galal, Mahdy, Atiea, 2016, Pektaş, Acarman, 2017). Static analysis examines the portable executable (PE) file content without running it on the system. Generally, static analyzers rely on disassembling tools to extract low-level static features such as byte sequences, string patterns, and opcode sequences. The extracted features are used to understand the behavioral characteristics and structure of malicious PE. On the other hand, dynamic analysis approaches derive behavioral data by executing the sample in a secured virtual environment called a sandbox. Dynamic features such as files, registry, network, and process activities can capture additional properties that reflect the behavior’s intentions. The extracted dynamic features are the outcome of integrated scenarios provided to the malware sample to perform its malicious task (Choi et al., 2019).

Although static analysis approaches are simple to apply, they are vulnerable to obfuscation techniques. Moreover, pattern matching approaches are considered ineffective in recognizing zero-day or polymorphic malware (Kumar et al., 2019). Consequently, static approaches are usually considered an insufficient method for malware classification (Martín, Rodríguez-Fernández, Camacho, 2018, Ucci, Aniello, Baldoni, 2019). In comparison with static analysis, dynamic analysis does not need reverse engineering methods such as decompilation and decryption (Zhao et al., 2019). Despite its substantial consumption of time and storage space, dynamic analysis is more flexible and more robust to obfuscation methods than static analysis (Amer, El-Sappagh, Hu, 2020, Salehi, Sami, Ghiasi, 2017, Vinayakumar, Alazab, Soman, Poornachandran, Venkatraman, 2019).

Behavioral features, especially API call sequences, have attracted considerable attention in malware detection and classification. Researchers have mainly focused on extracting patterns from API call sequences. Those patterns are used with machine learning (ML) classification algorithms to recognize malicious sequences (Cohen, Hendler, Rubin, 2018, Fan, Liu, Luo, Chen, Tian, Zheng, Liu, 2018, Gibert, Mateu, Planes, 2020, Salehi, Sami, Ghiasi, 2017, Wadkar, Di Troia, Stamp, 2020). New malware variants are thus detected by comparing their behavior to predefined stored models. However, dynamic approaches are prone to false positives and false negatives (Amer and Zelinka, 2020). Moreover, because feature handling requires professional expertise, it is hard to obtain meaningful behavioral features that improve malware detection (Zhao et al., 2019). Therefore, traditional ML classification algorithms become unconvincing for malware detection.

Many research works have formulated multiple ML classifiers to form ensemble classification models for malware detection. Those works, such as the works done by Menahem et al. (2009), Yan et al. (2018), and Khasawneh et al. (2015), have tried to overcome the performance limitations of a single ML classifier. However, although they improved malware detection accuracy, they are still vulnerable to false positives and false negatives. Moreover, all previous methods have ignored the contextual relations that are encoded among the individual API functions inside the whole API calling sequence.

In general, the most efficient malware model is the one that can tell whether new API call sequences are malicious or not. However, the ambiguity of API features and their numerous connections make developing an accurate and generic malware model extremely difficult. Perhaps the most significant impediment to developing successful malware models is the lack of understanding of malware language. Furthermore, the large number of API functions contained within a single calling sequence causes the malware model to have a high level of perplexity.

The perplexity, defined by Eq. (1), is a measure that is used to evaluate how well a probability model predicts a sample (Jurafsky and Martin, 2009).

$$PP(S) = \sqrt[N]{\frac{1}{P(s_1 s_2 s_3 \ldots s_N)}} \qquad (1)$$

where $S$ denotes the entire calling sequences, $N$ refers to the count of all APIs in $S$, and $(s_1 s_2 s_3 \ldots s_N)$ is the calling sequence.

Consequently, the ideal malware model is the one that assigns a high probability (low perplexity) to unseen API call sequences that turn out to be malicious. Assigning a high probability to new samples indicates that the malware model is not perplexed by unknown samples, which means that it has a clear perception of how malware behaves.
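To make Eq. (1) concrete, the following minimal sketch computes the perplexity of an API call sequence. The probability model used here, a bigram model with add-one smoothing, is an assumption chosen only for illustration and is not the model proposed in this paper; the function names and the API names in the usage note are likewise hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count API unigrams and bigrams over the training call sequences."""
    unigrams = Counter()   # how often each API appears as the previous call
    bigrams = Counter()    # how often each (previous, current) pair appears
    vocab = set()
    for seq in sequences:
        padded = ["<s>"] + list(seq)   # "<s>" marks the start of a sequence
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, vocab


def perplexity(seq, model):
    """PP(S) = P(s_1 ... s_N) ** (-1 / N), as in Eq. (1)."""
    if not seq:
        raise ValueError("perplexity is undefined for an empty sequence")
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1   # assumption: one extra slot for unseen API names
    padded = ["<s>"] + list(seq)
    log_p = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothed estimate of P(cur | prev)
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)
        log_p += math.log(p)
    # Work in log space to avoid underflow on long sequences
    return math.exp(-log_p / len(seq))
```

For instance, after training on a few sequences such as `["CreateFile", "WriteFile", "CloseHandle"]`, the same sequence receives a lower perplexity than its reversed order, because the reversed order contains only transitions the model has never seen.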

Correspondingly, in this paper, we proposed a model that effectively distinguishes between malicious and non-malicious API calling sequence behaviors. Our proposed model is an evolution of our preliminary work, which was published in Amer and Zelinka (2020). However, other behavioral features derived from the API calling series were thoroughly explored in this paper. We also demonstrated the novelty of our model by experimenting with malware datasets for Windows and Android, which are considered the most favored operating systems for malware attacks.

Accordingly, in this paper, we developed malware detection models by integrating statistical, contextual, and graph mining features over the API calling sequence. We used those heterogeneous features to generate comprehensive multi-perspective opinions regarding the calling sequence behavior. In this way, we provide promising solutions that remedy some of the drawbacks of dynamic detection approaches discussed in Guo and Zhu (2018); Narayanan et al. (2018).
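As a rough illustration of what combining statistical and graph-derived features over a calling sequence could look like, the sketch below builds a directed transition graph from consecutive API calls and merges relative call frequencies with edge weights and out-degrees into a single feature dictionary. The feature names and the fusion by simple concatenation are assumptions made for this example; the contextual features are omitted, and the paper's actual feature definitions are given in Section 3.

```python
from collections import Counter, defaultdict


def transition_graph(sequence):
    """Directed graph {src: Counter({dst: count})} over consecutive API calls."""
    graph = defaultdict(Counter)
    for src, dst in zip(sequence, sequence[1:]):
        graph[src][dst] += 1
    return graph


def fused_features(sequence):
    """Merge statistical and graph features into one flat dictionary."""
    features = {}
    total = len(sequence)
    # Statistical view: relative frequency of each API in the sequence
    for api, count in Counter(sequence).items():
        features[f"freq:{api}"] = count / total
    # Graph view: weighted edges and the number of distinct successors
    for src, targets in transition_graph(sequence).items():
        features[f"outdeg:{src}"] = len(targets)
        for dst, count in targets.items():
            features[f"edge:{src}->{dst}"] = count
    return features
```

A dictionary of this shape can be passed to any vectorizer that accepts sparse named features, which keeps the individual perspectives separable by their key prefix.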

Our contributions in this paper are:

  • Introducing effective relational multi-perspective behavioral recognition models for Windows and Android operating systems.

  • Modelling the relationships between individual API functions in malicious and non-malicious calling sequences.

  • Constricting the perplexity measure for the generated malware model.

  • Proposing a contextual indexing mechanism for APIs.

  • Suggesting a new visualized representational form for API calling sequence that reveals the sequence behavior.

  • Identifying malicious API call sequences that are deceptive or mimic benign behavior.

This paper is structured as follows: background and related work are presented in Section 2. Our proposed model is discussed in detail in Section 3. The description of the datasets and the empirical evaluation of the proposed model are discussed in Section 4. The conclusion and future work are given in Section 5.

Section snippets

Related Work

This section will outline and discuss the most relevant literature that uses dynamic analysis features, specifically the API calls for malware analysis and detection.

Tracking the API calling sequence is an excellent tactic to monitor any application (Alaeiyan, Parsa, Conti, 2019, Xiao, Lin, Sun, Ma, 2019, Zhao, Bo, Feng, Xu, Yu, 2019). The calling API sequence between processes and the operating system can be considered the most substantial behavioral difference between malicious and

Proposed model

Previous research mainly concentrated on obtaining useful patterns from API call sequences. The derived patterns were used as training features for machine learning algorithms to detect malicious sequences in unseen or new samples. However, malware authors are also smart; they know that their malware will be analyzed to extract fingerprints as patterns. Therefore, with any new misplacement in the API sequences, those machine learning detection models will be ineffective.
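The brittleness described above can be sketched with a toy fingerprint matcher. Here a fingerprint is assumed to be the set of API trigrams extracted from a known malicious sequence; the function names and the example API calls are hypothetical and are not taken from the evaluated datasets.

```python
def ngrams(sequence, n=3):
    """Return the set of contiguous n-grams of API names in the sequence."""
    return {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}


def matches_signature(sequence, signature, n=3):
    """True if the sequence shares at least one n-gram with the signature."""
    return bool(ngrams(sequence, n) & signature)


# Fingerprint extracted from a known sample
known = ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"]
signature = ngrams(known)

# A variant that pads the same behavior with one unrelated call
variant = ["VirtualAlloc", "GetTickCount",
           "WriteProcessMemory", "CreateRemoteThread"]

detected_known = matches_signature(known, signature)      # pattern found
detected_variant = matches_signature(variant, signature)  # pattern broken
```

A single inserted call is enough to break every stored trigram, so the variant is no longer matched even though its behavior is unchanged.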

Results and discussion

In this section, we evaluated our model using different datasets and standard evaluation metrics. We demonstrated our model's ability to detect unknown sequences. Finally, we showed how our model performed with malicious sequences that appeared as goodware, commonly known as fake goodware.

Conclusion

This paper introduced multi-perspective malware detection models that relied on the fusion of statistical, contextual, and graph mining features. We also showed that relying on multi-perspective detection models could still provide reliable performance even with the changing nature of the API calling sequence. We have experimentally proved that there are significant contrasts between the contextual behavior of malware and goodware. Upon such a distinction, we built different models that

CRediT authorship contribution statement

Eslam Amer: Conceptualization, Methodology, Formal analysis, Writing – original draft. Ivan Zelinka: Conceptualization, Formal analysis, Writing – original draft. Shaker El-Sappagh: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The following grants are acknowledged for the financial support of this research: SGS grants SP2020/108 and SP2020/78, VSB Technical University of Ostrava.

References (112)

  • G. D’Angelo et al.

    Malware detection in mobile environments based on autoencoders and api-images

    Journal of Parallel and Distributed Computing

    (2020)
  • D. Gibert et al.

    The rise of machine learning for detection and classification of malware: Research developments, trends and challenges

    Journal of Network and Computer Applications

    (2020)
  • E.B. Karbab et al.

    Maldozer: Automatic framework for android malware detection using deep learning

    Digital Investigation

    (2018)
  • A. Kumar et al.

    A learning model to detect maliciousness of portable executable using integrated feature set

    Journal of King Saud University-Computer and Information Sciences

    (2019)
  • A. Martín et al.

    Android malware detection through hybrid features fusion and ensemble classifiers: the andropytool framework and the omnidroid dataset

    Information Fusion

    (2019)
  • A. Martín et al.

    Candyman: Classifying android malware families by modelling dynamic traces with markov chains

    Engineering Applications of Artificial Intelligence

    (2018)
  • E. Menahem et al.

    Improving malware detection by applying multi-inducer ensemble

    Computational Statistics & Data Analysis

    (2009)
  • L. Nguyen-Vu et al.

    Android fragmentation in malware detection

    Computers & Security

    (2019)
  • Z. Ren et al.

    End-to-end malware detection for android iot devices using deep learning

    Ad Hoc Networks

    (2020)
  • S.M. Rezaeinia et al.

    Sentiment analysis based on improved pre-trained word embeddings

    Expert Systems with Applications

    (2019)
  • M. Rhode et al.

    Early-stage malware prediction using recurrent neural networks

    Computers & Security

    (2018)
  • Z. Salehi et al.

    Maar: Robust features to detect malicious activity based on api calls, their arguments and return values

    Engineering Applications of Artificial Intelligence

    (2017)
  • P. Shijo et al.

    Integrated static and dynamic analysis for malware detection

    Procedia Computer Science

    (2015)
  • D. Ucci et al.

    Survey of machine learning techniques for malware analysis

    Computers & Security

    (2019)
  • M. Wadkar et al.

    Detecting malware evolution using support vector machines

    Expert Systems with Applications

    (2020)
  • H. Yakura et al.

    Neural malware analysis with attention mechanism

    Computers & Security

    (2019)
  • F. Ahmed et al.

    Using spatio-temporal information in api calls with machine learning algorithms for malware detection

    Proceedings of the 2nd ACM workshop on Security and artificial intelligence

    (2009)
  • H. Alasmary et al.

    Analyzing and detecting emerging internet of things malware: A graph-based approach

    IEEE Internet of Things Journal

    (2019)
  • M. Alazab et al.

    Zero-day malware detection based on supervised learning algorithms of api call signatures

    Proceedings of the Ninth Australasian Data Mining Conference - Volume 121

    (2011)
  • E. Amer

    Enhancing efficiency of web search engines through ontology learning from unstructured information sources

    2015 IEEE international conference on information reuse and integration

    (2015)
  • E. Amer et al.

    Contextual identification of windows malware through semantic interpretation of api call sequence

    Applied Sciences

    (2020)
  • E. Amer et al.

    Akea: an arabic keyphrase extraction algorithm

    International Conference on Advanced Intelligent Systems and Informatics

    (2016)
  • E. Amer et al.

    Keyphrase extraction methodology from short abstracts of medical documents

    2016 8th Cairo International Biomedical Engineering Conference (CIBEC)

    (2016)
  • E. Amer et al.

    Enhancing semantic arabic information retrieval via arabic wikipedia assisted search expansion layer

    International Conference on Advanced Intelligent Systems and Informatics

    (2017)
  • A. Azmoodeh et al.

    Detecting crypto-ransomware in iot networks based on energy consumption footprint

    Journal of Ambient Intelligence and Humanized Computing

    (2018)
  • AZSecure-data.org, 2010. Intelligence and Security Informatics Data Sets....
  • T. Ban et al.

    Integration of multi-modal features for android malware detection using linear svm

    2016 11th Asia Joint Conference on Information Security (AsiaJCIS)

    (2016)
  • B. Biggio et al.

    Evasion attacks against machine learning at test time

    Joint European conference on machine learning and knowledge discovery in databases

    (2013)
  • B. Biggio et al.

    Security evaluation of pattern classifiers under attack

    IEEE transactions on knowledge and data engineering

    (2013)
  • A. Calleja et al.

    The malsource dataset: Quantifying complexity and code reuse in malware development

    IEEE Transactions on Information Forensics and Security

    (2018)
  • F.O. Catak et al.

    Deep learning based sequential model for malware analysis using windows exe api calls

    PeerJ Computer Science

    (2020)
  • F. Ceschin et al.

    The need for speed: An analysis of brazilian malware classifers

    IEEE Security & Privacy

    (2018)
  • H. Cui et al.

    Towards privacy-preserving malware detection systems for android

    2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS)

    (2018)
  • H. Darabian et al.

    An opcode-based technique for polymorphic internet of things malware detection

    Concurrency and Computation: Practice and Experience

    (2020)
  • M. Fan et al.

    Android malware familial classification and representative sample selection via frequent subgraph analysis

    IEEE Transactions on Information Forensics and Security

    (2018)
  • M. Ficco

    Comparing api call sequence algorithms for malware detection

    Workshops of the International Conference on Advanced Information Networking and Applications

    (2020)
  • A. Firdaus et al.

    Bio-inspired computational paradigm for feature investigation and malware detection: interactive analytics

    Multimedia Tools and Applications

    (2018)
  • S. Forrest et al.

    A sense of self for unix processes

    Proceedings 1996 IEEE Symposium on Security and Privacy

    (1996)
  • Friedrichs, O., Huger, A., O’donnell, A. J., 2015. Method and apparatus for detecting malicious software through...
  • H.S. Galal et al.

    Behavior-based features model for malware detection

    Journal of Computer Virology and Hacking Techniques

    (2016)

    Eslam Amer is currently working as an associate professor of computer science. He earned his Ph.D. in 2012. His main research interests are focused on natural language processing and machine learning, along with their applications. He had two Post-Doctoral positions in top universities in Europe. Eslam is highly interested in embedding natural language processing to find new solutions to current challenges in different domains. Currently, he is working on proposing new paradigms for malware analysis using NLP and deep learning.

    Ivan Zelinka is currently working as a professor at the Technical University of Ostrava (VŠB-TU), Faculty of Electrical Engineering and Computer Science and national supercomputing center IT4 Innovations. Dr. Zelinka has also participated in numerous grants and two EU projects, as a team member (FP5 - RESTORM) and as supervisor of the Czech team (FP7 - PROMOEVO). Currently, he is the head of the Department of Applied Informatics, and throughout his career he has supervised numerous MSc. and Bc. diploma theses in addition to supervising doctoral students, including students from abroad. He was awarded the Siemens Award for his Ph.D. thesis and received an award from the journal Software news for his book about artificial intelligence. Ivan Zelinka is a member of the British Computer Society and IEEE (a committee of the Czech section of Computational Intelligence), and serves on international program committees of various conferences and three international journals (Soft Computing, SWEVO, Editorial Council of Security Revue). He is the author of numerous journal articles as well as books in Czech and English.

    Shaker El-Sappagh received the bachelor’s degree in computer science from the Information Systems Department, Faculty of Computers and Information, Cairo University, Egypt, in 1997, and the master’s degree from the same university in 2007. He received the Ph.D. degree in computer science from the Information Systems Department, Faculty of Computers and Information, Mansura University, Mansura, Egypt, in 2015. In 2003, he joined the Department of Information Systems, Faculty of Computers and Information, Minia University, Egypt, as a teaching assistant. Since June 2016, he has been with the Department of Information Systems, Faculty of Computers and Information, Benha University, as an assistant professor. Currently, he is a Post-Doctoral Fellow at Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain. He has publications in clinical decision support systems and semantic intelligence. His current research interests include machine learning, medical informatics, (fuzzy) ontology engineering, distributed and hybrid clinical decision support systems, semantic data modeling, fuzzy expert systems, and cloud computing. He is a reviewer for many journals and is very interested in research on disease diagnosis and treatment.
