usfAD: a robust anomaly detector based on unsupervised stochastic forest

Aryal, Sunil; Santosh, K.C.; Dazeley, Richard

doi:10.1007/s13042-020-01225-0

Original Article
Published: 02 November 2020

Volume 12, pages 1137–1150 (2021)

International Journal of Machine Learning and Cybernetics

Abstract

In real-world applications, data can be represented using different units or scales: for example, weight in kilograms or pounds, and fuel efficiency in km/l or l/100 km. One unit can be a linear or non-linear scaling of another. The variation in metrics due to non-linear scaling makes Anomaly Detection (AD) challenging. Most existing AD algorithms rely on distance- or density-based functions, which makes them sensitive to how the data are expressed; that is, they are representation dependent. To avoid this problem, we introduce a new anomaly detection method, which we call ‘usfAD: Unsupervised Stochastic Forest-based Anomaly Detector’. Our empirical evaluation on synthetic and real-world cybersecurity datasets (spam detection, malicious URL detection and intrusion detection) shows that our approach is more robust to variation in the units/scales used to express data. It produces more consistent and better results than five state-of-the-art AD methods: local outlier factor, one-class support vector machine, isolation forest, nearest neighbor in a random subsample of data, and a simple histogram-based probabilistic method.
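The representation dependence described in the abstract can be illustrated with a small sketch. It is not the usfAD algorithm; it only shows how a simple distance-based score (distance to the nearest neighbour) may flag a different point once fuel efficiency is converted from km/l to l/100 km. The data values and the function names `nn_distance_scores` and `top_anomaly_index` are made up for illustration.

```python
def nn_distance_scores(values):
    """Score each value by the distance to its nearest neighbour.

    A larger score means the value is more isolated, i.e. more anomalous
    under this simple distance-based view.
    """
    return [
        min(abs(v - w) for j, w in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]


def top_anomaly_index(values):
    """Return the index of the value with the largest nearest-neighbour distance."""
    scores = nn_distance_scores(values)
    return max(range(len(values)), key=scores.__getitem__)


# Hypothetical fuel-efficiency readings in km/l (illustrative values only).
km_per_l = [5.0, 10.0, 12.0, 14.0, 16.0, 50.0]

# The same readings in l/100 km: a non-linear (reciprocal) rescaling.
l_per_100km = [100.0 / v for v in km_per_l]

top_km = top_anomaly_index(km_per_l)
top_l = top_anomaly_index(l_per_100km)

print("Most anomalous car (km/l view):     index", top_km)
print("Most anomalous car (l/100 km view): index", top_l)
```

With these values, the most isolated reading in km/l is the 50 km/l car, whereas in l/100 km it is the 5 km/l car, even though both lists describe the same six vehicles. A detector whose output depends only on the ordering of values would not be affected by such a monotonic rescaling.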

References

Aggarwal CC (2017) Outlier analysis. Springer, Berlin
Aryal S (2018) Anomaly detection technique robust to units and scales of measurement. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp 589–601
Aryal S, Baniya AA, Santosh K (2019) Improved histogram-based anomaly detector with the extended principal component features. arXiv. https://arxiv.org/abs/1909.12702
Aryal S, Ting KM, Haffari G (2016) Revisiting attribute independence assumption in probabilistic unsupervised anomaly detection. In: Proceedings of the 11th Pacific Asia Workshop on Intelligence and Security Informatics, pp 73–86
Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506
Aryal S, Ting KM, Washio T, Haffari G (2020) A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min Knowl Disc 34(1):124–162. https://doi.org/10.1007/s10618-019-00660-0
Aryal S, Ting KM, Wells JR, Washio T (2014) Improving iForest with Relative Mass. In: Proceedings of the 18th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp 510–521
Bakshi BR (1999) Multiscale analysis and modelling using wavelets. J Chemom 13(1):415–434
Bandaragoda T, Ting KM, Albrecht D, Liu F, Wells J (2014) Efficient anomaly detection by isolation using nearest neighbour ensemble. In: Proceedings of the IEEE international conference on data mining workshops, pp 698–705
Baniya AA, Aryal S, Santosh KC (2019) A novel data pre-processing technique: making data mining robust to different units and scales of measurement. In: Proceedings of the 26th international conference on neural information processing (ICONIP) of the Asia-Pacific Neural Network Society (accepted)
Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the ninth ACM SIGKDD conference on knowledge discovery and data mining, pp 29–38
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the eighth SIAM international conference on data mining, pp 243–254
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD conference on management of data, pp 93–104
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15:1–15:58
Cheng T, Li Z (2006) A multiscale approach for spatio-temporal outlier detection. Trans GIS 10(2):253–263
Conover WJ, Iman RL (1981) Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat 35(3):124–129
Fernando TL, Webb GI (2017) SimUSF: An efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31(1):264–286
Gao Z, Guo L, Ma C, Ma X, Sun K, Xiang H, Liu X et al (2019) AMAD: adversarial multiscale anomaly detection on high-dimensional and time-evolving categorical data. In: Proceedings of the 1st international workshop on deep learning practice for high-dimensional sparse data (DLP-KDD ’19), pp 1–8
Goldstein M, Dengel A (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In: Proceedings of the 35th German Conference on Artificial Intelligence, pp 59–63
Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
Hawkins DM (1980) Identification of outliers. Chapman and Hall, London
Jiang H, Wang H, Hu W, Kakde D, Chaudhuri A (2017) Fast incremental SVDD learning algorithm with the Gaussian Kernel. In: Proceedings of the Thirty-Third AAAI conference on artificial intelligence (AAAI), pp 3991–3998
Joiner BL (1981) Lurking variables: some examples. Am Stat 35(4):227–233
Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the Eighth IEEE international conference on data mining, pp 413–422
Liu Q, Klucik R, Chen C, Grant G, Gallaher D, Lv Q, Shang L (2017) Unsupervised detection of contextual anomaly in remotely sensed data. Remote Sens Environ 202(1):75–87
Lord FM (1953) On the statistical treatment of football numbers. Am Psychol 8(12):750–751
Mamun MS, Rathore MA, Lashkari AH, Stakhanova N (2016) Detecting malicious URLs using lexical analysis. In: Proceedings of the international conference on network and system security (NSS 2016), pp 467–482
Pang G, Cao L, Chen L, Liu H (2018) Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2041–2050
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Duchesnay E et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Rekha AG (2015) A fast support vector data description system for anomaly detection using big data. In: Proceedings of the 30th Annual ACM symposium on applied computing (SAC), pp 931–932
Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Graph Stat 15(1):118–138
Siddiqui S, Khan MS, Ferens K (2017) Multiscale Hebbian neural network for cyber threat detection. In: Proceedings of the international joint conference on neural networks (IJCNN), pp 1427–1434
Stevens SS (1946) On the theory of scales of measurement. Science 103(2684):677–680
Sugiyama M, Borgwardt KM (2013) Rapid distance-based outlier detection via sampling. In: Proceedings of the 27th annual conference on neural information processing systems, pp 467–475
Tax D, Duin R (2004) Support vector data description. Mach Learn 54(1):45–66
Ting KM, Washio T, Wells JR, Aryal S (2017) Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach Learn 106(1):55–91
Townsend JT, Ashby FG (1984) Measurement scales and statistics: the misconception misconceived. Psychol Bull 96(2):394–401
Velleman PF, Wilkinson L (1993) Nominal, ordinal, interval, and ratio typologies are misleading. Am Stat 47(1):65–72
Weinan E (2011) Principles of multiscale modeling (Vol 6). Cambridge University Press, Cambridge
Zhong G, Wang L-N, Ling X, Dong J (2016) An overview on data representation learning: from traditional feature learning to recent deep learning. J Financ Data Sci 2(4):265–278

Acknowledgements

This paper is an extension of a conference paper published in the Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2018 [2]. The authors thank Mr Arbind Agrahari Baniya for his help in running some of the experiments in this extended version of the paper. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA2386-20-1-4005.

Author information

Authors and Affiliations

School of Information Technology, Deakin University, 75 Pigdons Rd, Waurn Ponds, VIC, 3216, Australia
Sunil Aryal & Richard Dazeley
Department of Computer Science, University of South Dakota, 414 E Clark St, Vermillion, SD, 57069, USA
K.C. Santosh

Corresponding author

Correspondence to K.C. Santosh.

Ethics declarations

Conflicts of interest

The authors declare no conflict of interest.

About this article

Cite this article

Aryal, S., Santosh, K. & Dazeley, R. usfAD: a robust anomaly detector based on unsupervised stochastic forest. Int. J. Mach. Learn. & Cyber. 12, 1137–1150 (2021). https://doi.org/10.1007/s13042-020-01225-0

Received: 11 April 2020
Accepted: 16 October 2020
Published: 02 November 2020
Issue Date: April 2021
DOI: https://doi.org/10.1007/s13042-020-01225-0
