
usfAD: a robust anomaly detector based on unsupervised stochastic forest

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

In real-world applications, the same data can be expressed in different units or scales: for example, weight in kilograms or pounds, and fuel efficiency in km/l or l/100 km. One unit can be a linear or non-linear scaling of another. The variation in metrics caused by non-linear scaling makes Anomaly Detection (AD) challenging. Most existing AD algorithms rely on distance- or density-based functions, which makes them sensitive to how the data are expressed; in other words, they are representation dependent. To avoid this problem, we introduce a new anomaly detection method, which we call ‘usfAD: Unsupervised Stochastic Forest-based Anomaly Detector’. Our empirical evaluation on synthetic and real-world cybersecurity datasets (spam detection, malicious URL detection and intrusion detection) shows that our approach is more robust to variation in the units/scales used to express data. It produces more consistent and better results than five state-of-the-art AD methods: local outlier factor, one-class support vector machine, isolation forest, nearest neighbor in a random subsample of data, and a simple histogram-based probabilistic method.
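The representation dependence described in the abstract can be illustrated with a small sketch. It is not the usfAD algorithm. It only scores four made-up fuel-efficiency values with a plain nearest-neighbor distance, first in km/l and then in l/100 km, and shows that the top-ranked anomaly changes with the unit. The data values and the helper names `nn_distance_scores` and `most_anomalous` are assumptions introduced for illustration.

```python
def nn_distance_scores(values):
    """Score each value by the distance to its nearest other value (1-NN distance)."""
    scores = []
    for i, v in enumerate(values):
        scores.append(min(abs(v - w) for j, w in enumerate(values) if j != i))
    return scores


def most_anomalous(values):
    """Return the index of the value with the largest 1-NN distance."""
    scores = nn_distance_scores(values)
    return max(range(len(values)), key=scores.__getitem__)


# Hypothetical fuel-efficiency readings for four cars, in km/l.
km_per_l = [5.0, 10.0, 20.0, 40.0]
# The same four cars expressed in l/100 km (a non-linear, reciprocal rescaling).
l_per_100km = [100.0 / v for v in km_per_l]

top_km_per_l = most_anomalous(km_per_l)
top_l_per_100km = most_anomalous(l_per_100km)

print("Top anomaly in km/l:     car", top_km_per_l)      # car 3 (40 km/l)
print("Top anomaly in l/100 km: car", top_l_per_100km)   # car 0 (5 km/l)
```

The ordering of the cars is preserved (reversed) by the change of unit, but the distances between them are not, so a purely distance-based score flags a different car in each representation.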

Notes

  1. https://archive.ics.uci.edu/ml/datasets/spambase

  2. https://www.unb.ca/cic/datasets/nsl.html

  3. https://www.unb.ca/cic/datasets/url-2016.html

  4. https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/

References

  1. Aggarwal CC (2017) Outlier analysis. Springer, Berlin

  2. Aryal S (2018) Anomaly detection technique robust to units and scales of measurement. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp 589–601

  3. Aryal S, Baniya AA, Santosh K (2019) Improved histogram-based anomaly detector with the extended principal component features. arXiv. https://arxiv.org/abs/1909.12702

  4. Aryal S, Ting KM, Haffari G (2016) Revisiting attribute independence assumption in probabilistic unsupervised anomaly detection. In: Proceedings of the 11th Pacific Asia Workshop on Intelligence and Security Informatics, pp 73–86

  5. Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506

  6. Aryal S, Ting KM, Washio T, Haffari G (2020) A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min Knowl Disc 34(1):124–162. https://doi.org/10.1007/s10618-019-00660-0

  7. Aryal S, Ting KM, Wells JR, Washio T (2014) Improving iForest with Relative Mass. In: Proceedings of the 18th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp 510–521

  8. Bakshi BR (1999) Multiscale analysis and modelling using wavelets. J Chemom 13(1):415–434

  9. Bandaragoda T, Ting KM, Albrecht D, Liu F, Wells J (2014) Efficient anomaly detection by isolation using nearest neighbour ensemble. In: Proceedings of the IEEE international conference on data mining workshops, pp 698–705

  10. Baniya AA, Aryal S, Santosh KC (2019) A novel data pre-processing technique: making data mining robust to different units and scales of measurement. In: Proceedings of the 26th international conference on neural information processing (ICONIP) of the Asia-Pacific Neural Network Society (accepted)

  11. Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the ninth ACM SIGKDD conference on knowledge discovery and data mining, pp 29–38

  12. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

  13. Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the eighth SIAM international conference on data mining, pp 243–254

  14. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  15. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD conference on management of data, pp 93–104

  16. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58

  17. Cheng T, Li Z (2006) A multiscale approach for spatio-temporal outlier detection. Trans GIS 10(2):253–263

  18. Conover WJ, Iman RL (1981) Rank transformations as a bridge between parametric and nonparametric statistics. Am Statist 35(3):124–129

  19. Fernando TL, Webb GI (2017) SimUSF: An efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31(1):264–286

  20. Gao Z, Guo L, Ma C, Ma X, Sun K, Xiang H, Liu X et al (2019) AMAD: adversarial multiscale anomaly detection on high-dimensional and time-evolving categorical data. In: Proceedings of the 1st international workshop on deep learning practice for high-dimensional sparse data (DLP-KDD ’19), pp 1–8

  21. Goldstein M, Dengel A (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In: Proceedings of the 35th German Conference on Artificial Intelligence, pp 59–63

  22. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186

  23. Hawkins DM (1980) Identification of outliers. Chapman and Hall, London

  24. Jiang H, Wang H, Hu W, Kakde D, Chaudhuri A (2017) Fast incremental SVDD learning algorithm with the Gaussian Kernel. In: Proceedings of the Thirty-Third AAAI conference on artificial intelligence (AAAI), pp 3991–3998

  25. Joiner BL (1981) Lurking variables: some examples. Am Statist 35(4):227–233

  26. Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the Eighth IEEE international conference on data mining, pp 413–422

  27. Liu Q, Klucik R, Chen C, Grant G, Gallaher D, Lv Q, Shang L (2017) Unsupervised detection of contextual anomaly in remotely sensed data. Remote Sens Environ 202(1):75–87

  28. Lord FM (1953) On the statistical treatment of football numbers. Am Psychol 8(12):750–751

  29. Mamun MS, Rathore MA, Lashkari AH, Stakhanova N (2016) Detecting malicious URLs using lexical analysis. In: Proceedings of the international conference on network and system security (NSS 2016), pp 467–482

  30. Pang G, Cao L, Chen L, Liu H (2018) Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2041–2050

  31. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Duchesnay E et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  32. Rekha AG (2015) A fast support vector data description system for anomaly detection using big data. In: Proceedings of the 30th Annual ACM symposium on applied computing (SAC), pp 931–932

  33. Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471

  34. Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Graph Stat 15(1):118–138

  35. Siddiqui S, Khan MS, Ferens K (2017) Multiscale Hebbian neural network for cyber threat detection. In: Proceedings of the international joint conference on neural networks (IJCNN), pp 1427–1434

  36. Stevens SS (1946) On the theory of scales of measurement. Science 103(2684):677–680

  37. Sugiyama M, Borgwardt KM (2013) Rapid distance-based outlier detection via sampling. In: Proceedings of the 27th annual conference on neural information processing systems, pp 467–475

  38. Tax D, Duin R (2004) Support vector data description. Mach Learn 54(1):45–66

  39. Ting KM, Washio T, Wells JR, Aryal S (2017) Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach Learn 106(1):55–91

  40. Townsend JT, Ashby FG (1984) Measurement scales and statistics: the misconception misconceived. Psychol Bull 96(2):394–401

  41. Velleman PF, Wilkinson L (1993) Nominal, ordinal, interval, and ratio typologies are misleading. Am Stat 47(1):65–72

  42. Weinan E (2011) Principles of multiscale modeling (Vol 6). Cambridge University Press, Cambridge

  43. Zhong G, Wang L-N, Ling X, Dong J (2016) An overview on data representation learning: from traditional feature learning to recent deep learning. J Financ Data Sci 2(4):265–278

Acknowledgements

This paper is an extension of a conference paper published in the Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2018 [2]. The authors would like to thank Mr Arbind Agrahari Baniya for his help in running some experiments for this extended version of the paper. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA2386-20-1-4005.

Author information

Corresponding author

Correspondence to K.C. Santosh.

Ethics declarations

Conflicts of interest

The authors declare no conflict of interest.

About this article

Cite this article

Aryal, S., Santosh, K. & Dazeley, R. usfAD: a robust anomaly detector based on unsupervised stochastic forest. Int. J. Mach. Learn. & Cyber. 12, 1137–1150 (2021). https://doi.org/10.1007/s13042-020-01225-0

  • DOI: https://doi.org/10.1007/s13042-020-01225-0
