CSForest: an approach for imbalanced family classification of android malicious applications

Dhalaria, Meghna; Gandotra, Ekta

doi:10.1007/s41870-021-00661-7

Original Research
Published: 15 April 2021

Volume 13, pages 1059–1071, (2021)

International Journal of Information Technology

Abstract

Recently, a variety of mobile security threats have emerged alongside the exponential growth in mobile technologies. Various techniques have been developed to address the risks associated with malware. The most popular approach to detecting Android malware is signature-based, but it cannot detect unknown malware. This limitation motivated the use of machine learning for detecting and classifying malicious applications. Conventional machine learning algorithms focus on optimizing classification accuracy; however, imbalanced real-life datasets cause traditional classification algorithms to perform poorly when classifying malicious apps. To handle imbalanced family classification of malicious applications, we propose a Cost-Sensitive Forest (CSForest) method, which consists of a group of decision trees and uses a cost-sensitive voting technique for prediction. The proposed approach is evaluated on a dataset that includes features extracted from both static and dynamic malware analysis and consists of 13 imbalanced families of Android malware. The results of the proposed technique are compared with those of C4.5, Random Forest and CSTree to determine its effectiveness in classifying the families of malicious applications when considering only static features, only dynamic features, and their hybrid. The experimental results show that CSForest performs better than the other algorithms in handling imbalanced family classification of Android malicious applications when the hybrid set of features is considered. It achieves the highest F-measure, 0.919, with a minimum total cost of 180.
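The abstract does not spell out the voting rule, so the following is only a minimal sketch of one common form of cost-sensitive voting, under stated assumptions: each tree returns a class-probability estimate, the forest sums the expected misclassification cost of every candidate label across trees, and the label with the lowest summed cost wins. The function names, the three-family cost matrix and the probabilities are all hypothetical illustrations, not values or code from the paper.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def expected_costs(probs: Sequence[float], cost: Matrix) -> List[float]:
    """Expected cost of predicting each class j.

    probs[i] is the estimated probability that the true class is i;
    cost[i][j] is the (assumed) cost of predicting j when the truth is i.
    """
    n = len(cost)
    return [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]


def cost_sensitive_vote(tree_probs: Sequence[Sequence[float]], cost: Matrix) -> int:
    """Sum expected costs over all trees and return the cheapest class index."""
    n = len(cost)
    totals = [0.0] * n
    for probs in tree_probs:
        for j, c in enumerate(expected_costs(probs, cost)):
            totals[j] += c
    return min(range(n), key=lambda j: totals[j])


def total_cost(y_true: Sequence[int], y_pred: Sequence[int], cost: Matrix) -> float:
    """Total misclassification cost of a set of predictions."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))


# Hypothetical setup: three malware families, family 2 is the rare one,
# so missing it is assumed to cost five times more than other errors.
COST = [
    [0, 1, 1],
    [1, 0, 1],
    [5, 5, 0],
]

# Hypothetical per-tree probability estimates for a single app.
TREE_PROBS = [
    [0.5, 0.3, 0.2],
    [0.6, 0.2, 0.2],
    [0.4, 0.3, 0.3],
]

predicted_family = cost_sensitive_vote(TREE_PROBS, COST)
```

In this toy case a plain probability vote would pick family 0, while the cost-sensitive vote picks the rare family 2 because misclassifying it is assumed to be expensive. A total cost such as the one reported in the abstract could be computed in the spirit of `total_cost`, although the paper's actual cost matrix is not given here.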

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Solan, HP, India
Meghna Dhalaria & Ekta Gandotra

Corresponding author

Correspondence to Ekta Gandotra.

About this article

Cite this article

Dhalaria, M., Gandotra, E. CSForest: an approach for imbalanced family classification of android malicious applications. Int. j. inf. tecnol. 13, 1059–1071 (2021). https://doi.org/10.1007/s41870-021-00661-7

Received: 14 July 2020
Accepted: 27 March 2021
Published: 15 April 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s41870-021-00661-7
