A TAN based hybrid model for android malware detection

https://doi.org/10.1016/j.jisa.2020.102483

Abstract

Android devices are popular because they are available at reasonable prices. However, malware targeting the Android platform has risen rapidly in recent years because of the platform’s security vulnerabilities. Existing static malware detection mechanisms can locate malicious components in the source code of an application, while dynamic analysis can identify exploits in the runtime environment. Hence, the advantages of static and dynamic mechanisms need to be combined into a hybrid analysis mechanism to achieve better accuracy in malware detection. Existing machine learning based hybrid malware analysis mechanisms do not check the interdependency of the static and dynamic features used in their classifiers. This interdependency can lead to a multicollinearity problem, which can affect the classifier’s performance. In this paper we therefore propose a novel TAN (Tree Augmented Naive Bayes) based hybrid malware detection mechanism that employs the conditional dependencies among the relevant static and dynamic features (API calls, permissions and system calls) required for the functionality of an application. We trained three ridge regularized logistic regression classifiers, corresponding to the API calls, permissions and system calls of an application, and modeled the relationships among their outputs as a TAN to identify whether the application is malicious. The experimental results show that the proposed mechanism can detect malicious applications over a long period with an accuracy of 0.97.

Introduction

The Android operating system has dominated the smartphone industry for the past 10 years. The Android API framework contains functions for accessing sensitive resources in the system. This has enabled cyber attackers to create malicious applications and distribute them through third-party app stores or through advertisements on social networks. Further, an adversary can inject malicious payloads into existing applications. Malicious apps enable an attacker to perform operations such as stealing information, sending SMS messages and remotely controlling the device [19], [21], [43]. Hence, it is necessary to protect smartphones from these malicious applications.

Existing malware detection mechanisms are mainly classified into static, dynamic and hybrid analysis. Static analysis can capture malicious behavior from an application’s source code without executing it [46]. Dynamic analysis can identify the malicious behavior of an application from runtime information such as the system calls produced during its execution [32]. The advantage of static analysis lies in locating the malicious component in the source code (high code coverage) [23], and that of dynamic analysis lies in identifying exploits in the runtime environment [36]. Hence, the advantages of both can be combined into a hybrid analysis mechanism to achieve better accuracy in malware detection [56], [57]. However, existing hybrid mechanisms do not check the interdependency between the static and dynamic features used in their machine learning classifiers. This interdependency leads to a multicollinearity problem [2]. Multicollinearity occurs when the correlation between two or more features in a machine learning model is high, and it can affect the performance of a machine learning classifier.
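To make the multicollinearity concern concrete, the following minimal Python sketch measures the Pearson correlation between one static and one static-adjacent feature column. The feature names (`uses_sms_api`, `has_sms_perm`) and the eight-app toy data are hypothetical and chosen only for illustration; they are not taken from the paper's dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one entry per app, 1 = feature present.
uses_sms_api = [1, 1, 0, 0, 1, 0, 1, 0]  # app calls an SMS-sending API
has_sms_perm = [1, 1, 0, 0, 1, 0, 1, 1]  # app declares the SMS permission

r = pearson(uses_sms_api, has_sms_perm)
print(f"correlation between API call and permission: {r:.3f}")
```

A coefficient close to 1 for such a pair suggests that the two columns carry largely the same information, which is the situation the paragraph above describes as problematic for a single classifier fed with all features at once.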

According to Zhang et al. [55], source code level API calls can determine the underlying semantics of an application. These APIs are protected by permissions that need to be declared in the manifest file [20]. A developer can declare permissions without the corresponding API calls, or vice versa. Hence, using either permissions or API calls alone as features is not enough to detect the malicious behavior of an application, and both need to be combined to detect malware applications accurately. Unlike goodware, malware applications do not require user triggers to invoke sensitive API calls [18]. This automated invocation of API calls is reflected in the system call sequence [49]. It is known that an application generates system calls in accordance with the execution of API calls at runtime [41]. It is difficult to infer malicious behavior from the system call sequence alone in a short time because of the limited code coverage an application achieves during its execution. Hence, we can conclude that static features such as API calls and permissions and dynamic features such as system calls are relevant for detecting malicious applications, and that conditional dependencies exist among these features. However, using conditionally dependent static and dynamic features together as a feature vector in a machine learning classifier for hybrid analysis can lead to a multicollinearity problem [2]. A special kind of Bayesian network called TAN (Tree Augmented Naive Bayes) models the conditional dependencies between random variables in the form of a tree. In this paper, we employ a TAN model to combine the classifier output variables corresponding to the static features (API calls and permissions) and the dynamic features (system calls) based on their conditional dependencies. This TAN based model can capture the interdependence between static and dynamic features for predicting malicious behavior. The experimental results show that the proposed mechanism can detect malicious applications over a long period with an accuracy of 0.97.
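As a rough illustration of how a TAN arranges dependent variables into a tree, the sketch below follows the usual textbook construction: weight each pair of variables by their conditional mutual information given the class, then grow a maximum-weight spanning tree from a chosen root. The three variables stand for binarized outputs of the API call, permission and system call classifiers; the 16 toy rows, the variable names and the choice of root are assumptions made for this example, and the resulting tree is not claimed to be the structure used in the paper.

```python
import math
from collections import Counter
from itertools import combinations

def conditional_mutual_information(rows, i, j, labels):
    """Estimate I(X_i; X_j | C) from empirical counts (natural log)."""
    n = len(rows)
    n_c = Counter(labels)
    n_ic = Counter((r[i], c) for r, c in zip(rows, labels))
    n_jc = Counter((r[j], c) for r, c in zip(rows, labels))
    n_ijc = Counter((r[i], r[j], c) for r, c in zip(rows, labels))
    cmi = 0.0
    for (xi, xj, c), count in n_ijc.items():
        # P(xi, xj | c) / (P(xi | c) * P(xj | c)) expressed with counts.
        ratio = count * n_c[c] / (n_ic[(xi, c)] * n_jc[(xj, c)])
        cmi += (count / n) * math.log(ratio)
    return cmi

def tan_tree(rows, labels, names, root=0):
    """Directed (parent, child) edges of a maximum-weight spanning tree (Prim)."""
    k = len(names)
    weight = {}
    for i, j in combinations(range(k), 2):
        w = conditional_mutual_information(rows, i, j, labels)
        weight[(i, j)] = weight[(j, i)] = w
    in_tree = {root}
    edges = []
    while len(in_tree) < k:
        # Pick the heaviest edge that connects the tree to a new variable.
        parent, child = max(
            ((p, c) for p in in_tree for c in range(k) if c not in in_tree),
            key=lambda e: weight[e],
        )
        edges.append((names[parent], names[child]))
        in_tree.add(child)
    return edges

# Hypothetical binarized classifier outputs: (api, permission, syscall).
names = ["api", "permission", "syscall"]
rows = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (1, 1, 1),   # labeled malware
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1),
    (1, 1, 1), (1, 1, 0), (0, 0, 0), (1, 0, 0),   # labeled goodware
]
labels = [1] * 8 + [0] * 8

edges = tan_tree(rows, labels, names)
print(edges)
```

In a full TAN classifier every variable in the tree additionally has the class as a parent, and the conditional probability tables are estimated along these edges; that step is omitted here to keep the sketch short.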

The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 gives a short description of the ridge regularized LR classifier. Section 4 presents the proposed detection mechanism. The experimental results are given in Section 5. Section 6 covers classifier retraining for detecting evolving malware. Conclusions and future directions for research are given in Section 7.

Section snippets

Static analysis

Talha et al. [44] suggested a permission based malware identification system called Apkauditor. It analyzes the permissions requested by an application to identify whether the application is malicious. Arp et al. [6] suggested a static malware detection mechanism called Drebin. In Drebin, static features such as hardware components, requested permissions, app components and intent filters are given as input to an SVM classifier that identifies whether the application is malicious. In [12], Cen

Ridge regularized logistic regression classifier

Let $D=\{(X_i,Y_i): i=1,2,\ldots,m\}$ be a labeled dataset, where $X_i=(x_{i1},\ldots,x_{in})$ is the $n$-dimensional feature vector corresponding to the $i$th element and $Y_i \in \{0,1\}$ denotes its label. Let $X=(x_1,\ldots,x_n)$ be any data element and let $Y$ denote its label. Let $\beta=(\beta_1,\beta_2,\ldots,\beta_n)$ be the regression parameters. Then, the probability $P(Y=1|X)$ can be estimated using ridge regularized logistic regression [33] as given below.

Let $h_\beta(X)=\frac{1}{1+\exp(-\beta^T X)}$. Then, $P(Y_i|X_i)=h_\beta(X_i)^{Y_i}\,(1-h_\beta(X_i))^{(1-Y_i)}$. $\beta$ is estimated as $\arg\max_{\beta}\sum_{i=1}^{m}\log(P($
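The estimation step can be sketched in a few lines of Python. Because the estimation formula is cut off in this excerpt, the penalty used below (a squared L2 norm of the parameters scaled by `lam`) is the standard ridge form and is an assumption, as are the learning rate, the epoch count, the leading constant 1 used as an intercept column, and the toy data. This is a minimal gradient ascent sketch, not the authors' implementation.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(beta, x):
    """Estimated P(Y = 1 | X = x) for parameter vector beta."""
    return sigmoid(sum(b * v for b, v in zip(beta, x)))

def fit_ridge_logistic(X, y, lam=0.1, lr=0.1, epochs=2000):
    """Maximise sum_i log P(y_i | x_i) - lam * ||beta||^2 by gradient ascent.

    Assumption: every parameter, including the intercept, is penalised.
    """
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        # Gradient of the ridge penalty term.
        grad = [-2.0 * lam * b for b in beta]
        # Gradient of the log-likelihood: sum_i (y_i - h(x_i)) * x_i.
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)
            for j, v in enumerate(xi):
                grad[j] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: [intercept, feature_1, feature_2] per app.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

beta = fit_ridge_logistic(X, y)
print("P(malicious | [1, 1, 1]) =", round(predict_proba(beta, [1, 1, 1]), 3))
print("P(malicious | [1, 0, 0]) =", round(predict_proba(beta, [1, 0, 0]), 3))
```

The penalty keeps the parameters finite even though `feature_1` separates the two toy classes perfectly; a larger `lam` shrinks the parameters further toward zero.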

Methodology

There is a set of API calls defined in the source code of an application. These API calls need some permissions which are declared in the manifest file [20]. Further, there is a set of system call sequences indirectly specified for every application [22]. In the case of malware applications, this set will be very large. The application will generate a system call sequence from these predefined sequences in accordance with the execution of API calls in the source code. Therefore, there exist

Results and discussions

We took 1650 malware applications from Drebin [6], AMD [48], AndroZoo (AZ) [3] and external repositories (Github) [26], and 1650 goodware applications from AndroZoo (AZ) [3] and Google Play (GP), to evaluate the performance of our approach. The Drebin dataset contains malware samples from 2010 to 2012, the AMD dataset contains malware samples from 2010 to 2016 and the AndroZoo dataset contains malware/goodware samples from 2010 to 2019. AndroZoo is a

Retraining the classifiers for detecting evolving malwares

It is known that Android revises its API calls from time to time. Further, Google has created many new API calls in recent years. Some of these new API calls have functionality that overlaps with others. Hence, evolving malware applications can use these new API calls to perform malicious activities, and machine learning models trained on the API calls and permissions of older apps may fail to detect new malware. In malware detection, concept drift

Conclusion

In this paper, we proposed a novel mechanism for detecting Android malware applications that combines the static and dynamic features influencing malicious activity by exploring their conditional dependencies. The proposed mechanism can capture malicious behavior more accurately than existing static and dynamic analysis mechanisms. However, a few malware applications can escape the detection mechanism by employing adversarial techniques [54]. Therefore, a future direction for the research is

Declaration of Competing Interest

All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.

This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.

The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Acknowledgement

This work was done as part of the Center for Research and Innovation in Cyber Threat Resilience project (CRICTR 2018-19), which is funded by the Kerala State Planning Board.

References (57)

  • T. Blasing et al. An android application sandbox system for suspicious software detection. 2010 5th International Conference on Malicious and Unwanted Software (MALWARE 2010) (2010)
  • I. Burguera et al. Crowdroid: behavior-based malware detection system for android. Proc. 1st ACM Work. Secur. Priv. smartphones Mob. devices (2011)
  • G. Canfora et al. Detection of malicious web pages using system calls sequences. International Conference on Availability, Reliability, and Security (2014)
  • G. Canfora et al. Detecting android malware using sequences of system calls. Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile (2015)
  • L. Cen et al. A probabilistic discriminative model for android malware detection with decompiled source code. IEEE Trans Dependable Secur Comput (2015)
  • S. Chen et al. Stormdroid: a streaminglized machine learning-based system for detecting android malware. Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (2016)
  • C. Chow et al. Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory (1968)
  • J. Crussell et al. Andarwin: Scalable detection of semantically similar android applications. European Symposium on Research in Computer Security (2013)
  • S.K. Dash et al. Droidscribe: classifying android malware based on runtime behavior. 2016 IEEE Security and Privacy Workshops (SPW) (2016)
  • M. Dimjašević et al. Evaluation of android malware detection based on system calls. Proc. 2016 ACM Int. Work. Secur. Priv. Anal. IWSPA ’16 (2016)
  • K.O. Elish et al. Profiling user-trigger dependence for android malware detection. Comput Secur (2015)
  • P. Faruki et al. Android security: a survey of issues, malware penetration, and defenses. IEEE Commun Surv Tutor (2015)
  • A.P. Felt et al. A survey of mobile malware in the wild. Proc. 1st ACM Work. Secur. Priv. smartphones Mob. devices (2011)
  • S. Forrest et al. A sense of self for unix processes. Secur. Privacy, 1996. Proceedings., 1996 IEEE Symp. (1996)
  • G. Fraser et al. Automated test generation for java generics. Int. Conf. Softw. Qual. (2014)
  • N. Friedman et al. Bayesian network classifiers. Mach Learn (1997)
  • J. Gao et al. Should you consider adware as malware in your study? 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (2019)
  • github.com. Android malware 2019. [Online] Available: https://github.com/sk3ptre/AndroidMalware_2019;...