FSDroid:- A feature selection technique to detect malware from Android using Machine Learning Techniques

Mahindru, Arvind; Sangal, A.L.

doi:10.1007/s11042-020-10367-w

Published: 14 January 2021

Volume 80, pages 13271–13323, (2021)
Multimedia Tools and Applications

Abstract

With the popularity of free apps, Android has become the most widely used smartphone operating system, which has naturally invited cyber-criminals to build malware-infected apps that can steal vital information from these devices. The most critical problem is to detect malware-infected apps and keep them out of the Google Play store. The vulnerability lies in the underlying permission model of Android apps. Consequently, it has become the responsibility of app developers to specify precisely the permissions that their apps will demand at installation and execution time. In this study, we examine the permission-induced risk that arises from granting unnecessary permissions to Android apps. The experimental work includes the development of an effective malware detection system, which helps to determine and investigate the detection capability of numerous well-known and broadly used feature sets for malware detection. To select the best features from our collected feature data set, we implement ten distinct feature selection approaches. Further, we develop the malware detection model using the LSSVM (Least Squares Support Vector Machine) learning approach with three distinct kernel functions, i.e., linear, radial basis, and polynomial. Experiments were performed on 2,00,000 distinct Android apps. Empirical results reveal that the model built using LSSVM with the RBF (radial basis function) kernel, named FSDroid, is able to detect 98.8% of malware when compared to distinct anti-virus scanners, and achieves a 3% higher detection rate when compared to different frameworks or approaches proposed in the literature.

1 Introduction

Today, a smartphone is not only a cellular telephone; it also integrates a computer-like operating system and can perform various tasks with the help of apps. Symbian was the first modern mobile operating system for smartphones, entering the market in the year 2000. After that, a limited number of companies, such as Nokia, Microsoft, Apple, and Google, followed and launched their own mobile operating systems. Among these, the Android operating system,^{Footnote 1} launched by Google in the year 2008, is quite popular because it is freely available, open source, and has a wide range of free apps in its play store. According to Stat Counter,^{Footnote 2} Android holds a 74.92% share of the market to date. The success of Android in the market is mainly due to its apps. Currently, about 2.6 million apps are present in the official play store of Android,^{Footnote 3} which users can download and install for various purposes.

Android is a privilege-separated operating system in which every app has its own individual system identity, i.e., Group-ID and Linux user-ID.^{Footnote 4} Each Android app runs in a process sandbox and requests permissions to use resources that are not available within its sandbox. Depending upon the sensitivity of the permissions, the system either grants them automatically or prompts the user to approve or reject the request. By taking advantage of these permissions, cyber-criminals target user privacy. As stated in,^{Footnote 5} G-Data Security experts counted 3,246,284 malware apps until the end of the year 2018 and discovered over 7,50,000 new malware apps by the end of 2019. To defend the Google Play store^{Footnote 6} from malware apps, Google introduced Google Bouncer in the year 2012, which scans new apps at the time of their launch. However, it has limitations, e.g., Bouncer can easily be fingerprinted.^{Footnote 7} It is not hard to bypass Google’s security check, so malicious Android apps can make their way to the Google Play store^{Footnote 8} and ultimately to users’ devices. Cyber-criminals build such malware apps on a daily basis and invite users to install them. More than two billion active Android devices are present in the market.^{Footnote 9} To overcome the drawbacks of Bouncer and to protect Android devices, Google introduced Google Play Protect.^{Footnote 10} It has the capability to scan apps in real time, but it also has limitations, as stated in [28].

Android apps work on the principle of the permission model [11]. In addition, Android provides protection at four levels, which categorize permissions as^{Footnote 11} “signature”, “signature or system”, “normal” and “dangerous”. In our study, we do not consider “signature” and “signature or system” permissions because they are system granted. We consider only “normal” and “dangerous” permissions, which are granted by the user. Normal permissions do not pose any risk to the user’s privacy. If a normal permission is listed in the app’s manifest, it is granted by the system automatically. Dangerous permissions, on the other hand, give access to the user’s confidential data. It is entirely up to the user to grant or revoke a dangerous permission or set of permissions.
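The restriction to “normal” and “dangerous” permissions can be illustrated with a simple filter. The snippet below is a minimal sketch, assuming a hypothetical lookup table `PROTECTION_LEVEL` and hypothetical helper names; the few entries shown are illustrative examples and not the permission list used in this study.

```python
# Hypothetical lookup table: a handful of Android permissions and their
# protection levels. The entries are illustrative, not the study's data set.
PROTECTION_LEVEL = {
    "android.permission.INTERNET": "normal",
    "android.permission.VIBRATE": "normal",
    "android.permission.READ_CONTACTS": "dangerous",
    "android.permission.SEND_SMS": "dangerous",
    "android.permission.BIND_DEVICE_ADMIN": "signature",
    "android.permission.INSTALL_PACKAGES": "signature or system",
}

# Only these two levels are kept as candidate features.
CONSIDERED_LEVELS = {"normal", "dangerous"}


def considered_permissions(requested):
    """Keep requested permissions whose level is normal or dangerous."""
    return [p for p in requested if PROTECTION_LEVEL.get(p) in CONSIDERED_LEVELS]


def dangerous_permissions(requested):
    """Keep only the requested permissions that need explicit user approval."""
    return [p for p in requested if PROTECTION_LEVEL.get(p) == "dangerous"]


requested = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.BIND_DEVICE_ADMIN",
]
print(considered_permissions(requested))
print(dangerous_permissions(requested))
```

Permissions that are missing from the table are dropped by both helpers, which is a design choice of this sketch rather than something stated in the text.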

The performance of malware detection relies on selecting the right set of features. The features selected as input have a great effect on detection performance. To select appropriate features or feature sets, in this study we use distinct feature selection approaches. Feature selection approaches are divided into two classes: feature ranking approaches and feature subset selection approaches. Feature ranking approaches use a decisive criterion to score each feature and arrange the features by rank; the high-ranking features are then selected for the task at hand. Feature subset selection approaches, on the other hand, search for a subset of features that collectively improves detection capability. In this work, six distinct feature ranking approaches and four distinct feature subset selection approaches are used to select the right feature sets. The selected feature sets help to minimize misclassification errors, since irrelevant features are eliminated and only those features with excellent discriminative power are retained.
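As an illustration of feature ranking, the sketch below scores binary features by information gain, one of the ranking criteria named later in the paper, and sorts them from most to least informative. It is a toy example under stated assumptions: the function names, the three feature names, and the four-app data set are all hypothetical and are not taken from the study.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def information_gain(column, labels):
    """Entropy reduction obtained by splitting the labels on one feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((name, information_gain(column, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data: rows are apps, columns are binary features (1 = present).
names = ["SEND_SMS", "INTERNET", "READ_CONTACTS"]
rows = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
labels = [1, 1, 0, 0]  # 1 = malware, 0 = benign (made-up labels)

ranking = rank_features(rows, labels, names)
top_k = [name for name, _ in ranking[:1]]  # keep the single best feature
print(ranking, top_k)
```

A ranking approach scores each feature on its own, as here; a subset selection approach would instead evaluate candidate groups of features together.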

In the literature, a number of researchers have applied different machine learning algorithms to detect malware. Some of the broadly used algorithms are decision tree learning algorithms [60], neural networks [54, 59], clustering [14], regression and classification. Constructing an appropriate malware detection model that can detect apps that are really infected with malware is still a challenging task in the field of cyber-security. In this work, we therefore implement LSSVM with three distinct kernel functions, viz., polynomial, linear, and RBF, to build a model for malware detection. LSSVM is a variant of the SVM that is founded on statistical learning theory.
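To make the idea of an LSSVM with interchangeable kernels concrete, the sketch below trains a small classifier using the standard LSSVM formulation, in which training reduces to solving one linear system instead of a quadratic program. It is not the authors' implementation: the hyperparameters (`gamma`, `sigma`, `degree`, `coef0`), the function names, and the four-sample toy data are assumptions chosen for illustration.

```python
import math


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def train_lssvm(xs, ys, kernel, gamma=10.0):
    """Return (alphas, bias) for training labels ys in {-1, +1}.

    Solves [[0, y^T], [y, Omega + I/gamma]] [bias; alphas] = [0; 1],
    where Omega[i][j] = y_i * y_j * K(x_i, x_j).
    """
    n = len(xs)
    a = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        a[0][i + 1] = ys[i]
        a[i + 1][0] = ys[i]
        for j in range(n):
            a[i + 1][j + 1] = ys[i] * ys[j] * kernel(xs[i], xs[j])
        a[i + 1][i + 1] += 1.0 / gamma
    solution = solve_linear_system(a, [0.0] + [1.0] * n)
    return solution[1:], solution[0]


def predict(x, xs, ys, alphas, bias, kernel):
    """Classify x as +1 (malware) or -1 (benign) from the kernel expansion."""
    score = sum(al * y * kernel(x, xi) for al, y, xi in zip(alphas, ys, xs))
    return 1 if score + bias >= 0 else -1


# Toy binary feature vectors (made up): +1 = malware, -1 = benign.
xs = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
ys = [1, 1, -1, -1]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel):
    alphas, bias = train_lssvm(xs, ys, kernel)
    print(kernel.__name__, [predict(x, xs, ys, alphas, bias, kernel) for x in xs])
```

Swapping the kernel function is the only change needed to move between the linear, polynomial, and RBF variants; the dense solver used here is only practical for small data sets.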

The phases we followed in building an effective Android malware detection model are shown in Fig. 1. To conduct an empirical study on a large data set, we collect Android application package (.apk) files from several distinct repositories. We then divide the collected .apk files into two classes, i.e., benign and malware, based on the results of different anti-virus scanners (Microsoft Windows Defender^{Footnote 12} and VirusTotal^{Footnote 13}). In the next phase, we extract permissions and API calls (considered as features in our work) using distinct publicly available tools. The right set of features is then selected by applying feature selection approaches to our collected data set. The selected feature sets (i.e., permissions and API calls) are used as input to build a model using LSSVM with three distinct kernel functions. Finally, to validate whether our proposed model is capable of detecting malware, we compare it with existing frameworks and approaches developed in the literature and with different anti-virus scanners available in the market.
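The step from extracted permissions and API calls to classifier input can be sketched as follows. This is a minimal illustration that assumes a binary presence/absence encoding, which this section does not state explicitly; the app names, the feature strings, and both helper functions are hypothetical.

```python
def build_vocabulary(apps):
    """Collect every feature name seen in any app, in sorted order."""
    vocab = set()
    for features in apps.values():
        vocab.update(features)
    return sorted(vocab)


def to_binary_vectors(apps, vocabulary):
    """Map each app to a 0/1 vector: 1 if the app has the feature, else 0."""
    return {
        name: [1 if feat in features else 0 for feat in vocabulary]
        for name, features in apps.items()
    }


# Hypothetical extraction output: app name -> set of permissions and API calls.
apps = {
    "flashlight.apk": {"android.permission.CAMERA"},
    "sms_trojan.apk": {
        "android.permission.SEND_SMS",
        "SmsManager.sendTextMessage",
    },
}

vocabulary = build_vocabulary(apps)
vectors = to_binary_vectors(apps, vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Fixing one shared vocabulary before encoding keeps the column order identical across apps, so the resulting vectors can be stacked into a single feature matrix.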

The novel contributions of this research work are as follows:

To the best of our knowledge, this is the first work in which 2,00,000 unique apps, belonging to thirty different categories of Android apps, are collected. To build an effective and efficient malware detection model, we extract permissions and API calls and consider them as features in this research paper.
In this research paper, we propose a new framework that works on the principle of machine learning by selecting relevant features using feature selection approaches. Empirical results reveal that our proposed framework is able to detect 98.75% of unknown malware from real-world apps.
Our proposed framework is capable of detecting malware from real-world apps in less time than distinct anti-virus scanners available in the market.
Our proposed framework achieves a 3% higher detection rate when compared to different frameworks or approaches proposed in the literature.
In this study, we apply t-test analysis to investigate whether the features selected by the feature selection approaches differ significantly.
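As a rough illustration of the t-test idea, the sketch below computes a two-sample (Welch) t statistic and its approximate degrees of freedom for two sets of scores. This section does not say which t-test variant was used, so the Welch form is an assumption, and the score values are synthetic placeholders rather than results from the study.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b  # squared standard errors
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Synthetic scores for two hypothetical feature selection approaches.
scores_a = [0.95, 0.97, 0.96, 0.98, 0.94]
scores_b = [0.90, 0.92, 0.91, 0.93, 0.89]

t_stat, dof = welch_t(scores_a, scores_b)
print(t_stat, dof)
```

The statistic would then be compared against a critical value of the t distribution at the chosen significance level to decide whether the difference is significant.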

The rest of the paper is organized as follows. Section 2 discusses previously developed approaches and frameworks for malware detection, along with the gaps present in the literature, and gives a brief overview of our proposed model based on these gaps. Section 3 explains the different feature ranking approaches. Section 4 explains the different feature subset selection approaches used in this paper. Section 5 explains LSSVM with different kernels for detecting malware. In Section 6, we discuss the methods used to compare the proposed framework with existing techniques available in the literature. The performance parameters used for evaluation in this study are given in Section 7. Sections 8 and 9 present the experimental setup of our proposed framework and the outcomes. Section 10 discusses threats to validity, and Section 11 summarizes our work and its future scope.

2 Related work and overview of proposed framework

This section presents approaches and frameworks developed by previous researchers to detect malware in Android apps. To find and overcome the gaps in the existing approaches, we divide this section into two subsections. In the first subsection, we discuss the frameworks and approaches developed in the literature. In the second subsection, we first discuss the data sets used in earlier studies, and then describe the collection of Android apps, the formulation of our data set, the extraction and formulation of feature sets, the capability of features, and the feature selection approaches implemented in the literature. The research questions answered in this study are also formulated in this section.

2.1 Related work

In this subsection, we discuss the types of analysis used for Android malware. We then discuss the detection techniques suggested by previous researchers and academicians.

2.1.1 Analysis of Android apps

There are three different ways to carry out the analysis of Android apps, i.e., static [29, 50, 72], dynamic [29] and hybrid [9, 34]. Static analysis examines an app without executing it. Dynamic analysis examines an app during its execution. The hybrid approach combines static and dynamic analysis. Petsas et al. [68] showed that malicious apps targeting the Android platform can evade dynamic analysis. They implemented and tested heuristics of varying sophistication by integrating them into existing malware samples, which then attempt to conceal their presence when examined in an emulated environment. Bläsing et al. [12] suggested the Android Application Sandbox (AASandbox), which performs both static and dynamic analysis to automatically identify suspicious Android apps. In this study, we perform dynamic analysis of Android apps to build a malware detection model.

2.1.2 Android Malware detection

Chen et al. [18] suggested Pegasus, which uses app permissions to detect malware apps. They formed a Permission Event Graph (PEG) using the fundamentals of static analysis and applied models of the APIs. Peiravian and Zhu [67] employed machine learning methods to detect malicious Android apps. They performed experiments on 1200 real malware-infected apps and validated their performance. Chakradeo et al. [17] introduced Mobile Application Security Triage (MAST), a framework that helps to direct malware analysis resources toward the apps with the greatest potential to exhibit malicious behavior. MAST is a statistical method that measures the correlation between multiple categorical data using Multiple Correspondence Analysis (MCA).

Wang et al. [84] studied the risk of single permissions and of groups of permissions. They employed feature ranking methods to rank individual Android permissions based on the risk involved. Enck et al. [25] built a framework named Kirin, which uses light-weight certification of apps to detect malware at the time of installation. Ongtang et al. [65] presented Secure Application INTeraction (Saint), which governs installation and run-time permissions. Grace et al. [35] proposed Woodpecker, which examines the Android permission-based security model as applied to pre-installed apps. Bugiel et al. [13] developed a model for policy-driven and system-centric runtime monitoring of communication channels among apps at multiple layers. Zhou et al. [98] presented a systematic characterization of existing Android malware, such as malware that incurs charges on devices by subscribing to services and that abuses SMS-related Android permissions. Barrera et al. [8] proposed permission-based security models that help to control access to different system resources. They presented a methodology, based on an empirical analysis of 1,100 Android apps, for studying permission-based security models that makes novel use of self-organizing maps.

In a recent study [52], Papilio was introduced, a new approach for visualizing the permissions of real-world Android apps. The authors built a new layout approach that includes node-link diagrams, matrix layouts and aspects of set membership. Matsudo et al. [61] presented a system model for supporting users’ approval decisions when an app is installed. They first introduced a reputation-based security evaluation, which analyzes permissions to judge whether an app is malicious or not. Arp et al. [4] proposed DREBIN, a lightweight approach for the detection of Android malware. They combined machine learning and static analysis to improve malware detection. DREBIN can scan a number of apps and can protect users who install apps from untrusted sources. Jeon et al. [40] addressed the issue of finer-grained permissions in Android. Their proposed framework was based on four major groups of Android permissions, and experiments were performed on top Android apps to differentiate between benign and malware-infected apps. PUMA, presented in [74], is a framework for detecting malware-infected Android apps by applying machine learning algorithms to the permissions extracted from the apps themselves. Grace et al. [34] developed RiskRanker, a proactive approach to accurately and scalably sift through a large number of apps in existing app stores to spot zero-day malware. They report that, among 118,318 apps, 322 zero-day specimens from 11 distinct families were successfully discovered. TaintDroid [26] is an information flow tracking tool that can concurrently track multiple sources of sensitive data. A new model to protect smartphones was discussed in [70]. This model performs attack detection on a remote server, where the execution of the app on the smartphone is mirrored in a virtual machine. Schmidt et al. [5] presented anomaly detection using machine learning, monitoring system-level information and behavior that is then processed by a remote system. Zheng et al. [95] focused on the demanding task of triggering a particular behavior through automated UI interactions. They proposed a hybrid analysis approach to reveal UI-based trigger conditions in Android apps. To discover malware at the kernel level and the user level, a technique named MADAM was developed in [24], which is capable of distinguishing malware from benign apps. A fine-grained dynamic binary instrumentation tool for Android named DroidScope, presented in [91], reconstructs two levels of semantic information, i.e., Java and operating system. A framework to monitor system calls named Crowdroid was introduced in [14]. Crowdroid tracks information flows and analyzes API usage, which greatly helps in finding malware activity in the network.

A root privilege management scheme called Root Privilege Manager (RPM) was proposed in [80]. It protects Android apps from the risks raised by permissions, i.e., normal or dangerous. Wang et al. [82] analyzed used permissions and proposed a support-based permission candidate method to detect Android malware. A hybrid feature selection approach based on the Rough Set Quick Reduct algorithm was proposed in [10] to detect malware. Wang et al. [85] extracted 11 kinds of static features from each app to characterize its behavior, and then applied classification algorithms to categorize malware and benign apps. Kirubavathi et al. [46] proposed a structural analysis-based learning approach that uses machine learning algorithms to distinguish malware from benign apps. They adopted botnet-linked patterns of requested permissions as features to evaluate benign and malware apps. Jerlin et al. [41] suggested a new approach to detect malware using its Application Programming Interfaces (APIs). They adopted upper and lower boundaries as one of the features to detect Android malware. Mahindru and Singh [60] applied supervised machine learning algorithms to 172 permissions extracted from Android apps at installation and start-up time.

Xiao et al. [88] proposed a deep learning-based approach to distinguish between benign and malware apps. In their approach, they considered system calls as features and trained Long Short-Term Memory (LSTM) models on system call sequences from malware and benign apps. Experiments were performed on 3567 malware-infected and 3536 benign apps and achieved a recall of 96.6%. Mahindru and Sangal [54] proposed a framework, DeepDroid, that works on the principle of deep learning. They extracted permissions and API calls as features from collected Android application packages (.apk). To select significant features for the malware detection model, six distinct feature ranking approaches were applied to the extracted features. Experiments were performed on 1,00,000 benign apps and 20,000 malware-infected apps. The framework developed using Principal Component Analysis (PCA) as the feature ranking approach achieved a detection rate of 94%. Letteri et al. [49] proposed a botnet detection methodology for the Internet of Things (IoT) based on deep learning techniques, tested on a new, SDN-specific data set with a high (up to 97%) classification accuracy. Devpriya and Lingamgunta [23] proposed a novel hash-based multifactor secure mutual authentication scheme that includes hashing properties, certificates, nonce values, traditional user IDs, and password mechanisms, and that resists MITM attacks, replay attacks, and forgery attacks.

Ma et al. [53] presented an Android malware detection model based on API information. Using API information, they constructed three distinct data sets, based on boolean, frequency and time-series representations. Three distinct detection models were developed from these three data sets. Experiments were performed using 10010 benign and 10683 malware apps and achieved an accuracy of 98.98% with an ensemble approach. Mahindru and Sangal [59] proposed PerbDroid, which was developed using features selected by feature ranking approaches and deep learning as the classifier. Experiments were performed on 2,00,000 distinct Android apps and achieved a detection rate of 97.8%. Wang et al. [86] proposed a hybrid model based on a convolutional neural network (CNN) and a deep autoencoder (DAE). To improve the accuracy of the malware detection model, they employed multiple CNNs to select features from the high-dimensional features of Android apps. Experiments were performed on 10,000 benign and 13,000 malware-infected apps, and the model was trained using a serial convolutional neural network architecture (CNN-S). Mahindru and Sangal [56] proposed a malware detection model based on semi-supervised machine learning techniques. They applied the LLGC algorithm to 2,00,000 distinct Android apps and achieved an accuracy of 97.8%.

Yamaguchi and Gupta [90] discussed the properties of IoT devices that make them more vulnerable to malware attacks, i.e., large volume and pervasiveness. In their study, they proposed a method to mitigate attacks on IoT-based devices. Gupta et al. [36] authored a book on the security measures and challenges faced by different communication devices. They also discussed different methods to mitigate attacks. In [32], it was seen that the feature selection approach had a great effect on developing the model. The authors implemented Principal Component Analysis (PCA) to reduce the complexity of the model. Advances in machine learning algorithms such as SVM [20, 94] and deep learning models [32, 38, 96] have helped not only in intrusion detection [45] and the detection of cyber attacks, but also in the health sector and in wireless routing. Distinct researchers applied deep learning models and hybrid methods [32, 38, 96] in their studies and achieved remarkable results.

Arora et al. [3] proposed PermPair, in which they construct and compare graphs built from the permissions extracted from benign and malware-infected apps. Empirical results reveal that the proposed malware detection model achieved an accuracy of 95.44% when compared to other similar approaches and popular mobile anti-malware apps. Mahindru and Sangal [55] proposed the DLDroid malware detection model, which is based on feature selection approaches and a Deep Neural Network (DNN). In their study, they collected Android apps developed during COVID-19. Experiments were performed on 11,000 distinct Android apps, and the model developed using DNN and rough set analysis achieved a detection rate of 97.9% when compared to distinct anti-virus scanners available in the market.

Table 1 gives brief details of some existing Android malware detection techniques present in the literature. It also includes the type of monitoring and the type of analysis used by these techniques. The conclusions drawn from these techniques are presented in the last column of the table.

Table 1 Brief description of some existing Android malware detection frameworks or approaches
