CSForest: an approach for imbalanced family classification of android malicious applications

  • Original Research
International Journal of Information Technology

Abstract

Recently, a variety of mobile security threats have emerged due to the exponential growth in mobile technologies. Various techniques have been developed to address the risks associated with malware. The most popular approach to detecting Android malware is signature-based. Its drawback is that it cannot detect unknown malware. To address this problem, machine learning has been adopted for detecting and classifying malicious applications. Conventional machine learning algorithms focus on optimizing classification accuracy. However, imbalanced real-life datasets cause traditional classification algorithms to perform poorly in classifying malicious apps. To handle the problem of imbalanced family classification of malicious applications, we propose a Cost-Sensitive Forest (CSForest) method, which consists of a group of decision trees. A cost-sensitive voting technique is used for prediction. The proposed approach is evaluated on a dataset that includes features extracted from both static and dynamic malware analysis and consists of 13 imbalanced families of Android malware. Furthermore, the results of the proposed technique are compared with those of C4.5, Random Forest and CSTree to determine its effectiveness in classifying the families of malicious applications while considering only static features, only dynamic features, and their hybrid. The experimental results show that CSForest performs better than the other algorithms in handling the imbalanced family classification of Android malicious applications when the hybrid set of features is considered. It achieves the highest F-measure, 0.919, with a minimum total cost of 180.
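
The abstract does not spell out the voting rule, so the following is only a generic sketch of what cost-sensitive voting over a group of decision trees can look like, not the authors' CSForest implementation. It assumes each tree outputs a class probability vector and that a cost matrix `cost[actual][predicted]` is given; the function names `cost_sensitive_vote` and `total_cost`, the two-class setup, and the numbers are all hypothetical illustrations.

```python
from typing import List, Sequence


def cost_sensitive_vote(tree_probs: Sequence[Sequence[float]],
                        cost: Sequence[Sequence[float]]) -> int:
    """Pick the class with the lowest expected cost summed over all trees.

    tree_probs[t][c] is tree t's estimated probability of class c
    (assumed input; the abstract does not specify the tree outputs).
    cost[a][p] is the cost of predicting class p when the true class is a.
    """
    n_classes = len(cost)
    expected: List[float] = [0.0] * n_classes
    for probs in tree_probs:
        for predicted in range(n_classes):
            # Expected cost of this prediction under this tree's estimate.
            expected[predicted] += sum(
                probs[actual] * cost[actual][predicted]
                for actual in range(n_classes)
            )
    return min(range(n_classes), key=lambda p: expected[p])


def total_cost(y_true: Sequence[int], y_pred: Sequence[int],
               cost: Sequence[Sequence[float]]) -> float:
    """Sum the misclassification costs over a set of predictions."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))


# Hypothetical two-class example: class 1 is the rare family, and missing it
# (predicting 0 when the truth is 1) costs five times more than a false alarm.
cost = [[0, 1],
        [5, 0]]
tree_probs = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]

vote = cost_sensitive_vote(tree_probs, cost)
```

In this toy case every tree leans toward class 0, so a plain majority vote would predict class 0, while the cost-weighted vote chooses class 1 because missing the rare class is the more expensive error. A summed cost such as `total_cost(y_true, y_pred, cost)` is one plausible reading of the "total cost" figure reported alongside the F-measure.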

Author information

Correspondence to Ekta Gandotra.

About this article

Cite this article

Dhalaria, M., Gandotra, E. CSForest: an approach for imbalanced family classification of android malicious applications. Int. j. inf. tecnol. 13, 1059–1071 (2021). https://doi.org/10.1007/s41870-021-00661-7

  • DOI: https://doi.org/10.1007/s41870-021-00661-7
