Abstract
Big data is usually massive, diverse, time-varying, and high-dimensional. This paper focuses on the domain description of big data, which is the basis for handling data with these characteristics. The paper makes three main contributions. First, a single hyperellipsoid model is proposed for the domain description of big data. Its parameters are adjusted adaptively according to the proposed objective function rather than selected manually, which broadens the range of applications of the model. Second, an improved FDPC algorithm is proposed to generate multiple hyperellipsoid models that approximate the spatial distribution of big data, improving the accuracy of the domain description. Multiple hyperellipsoid models not only remove much of the spatial redundancy of a domain description based on a single hyperellipsoid model, but also provide a feasible way to describe complex spatial distributions. Third, an online domain description algorithm based on hyperellipsoid models is proposed, which improves the robustness of the models on time-varying data; its parallel processing flow is also given. In the experiments, the hyperellipsoid models were tested on synthetic instances and real-world datasets. Compared with LOF, one-class SVM, SVDD, and isolation forest, the proposed method is competitive and promising.
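To make the idea of a hyperellipsoid domain description concrete, the following is a minimal sketch and not the method of the paper. It assumes the hyperellipsoid is centered at the sample mean, shaped by the sample covariance, and sized by a quantile of the squared Mahalanobis distances; the paper instead adapts the parameters through its own objective function, which is not reproduced here. The names fit_hyperellipsoid, inside, and quantile are hypothetical.

```python
import numpy as np

def fit_hyperellipsoid(X, quantile=0.95):
    """Fit one hyperellipsoid (center, shape matrix, squared radius) to the rows of X.

    Assumption: mean/covariance fit with a quantile-based radius, used here as a
    stand-in for the adaptive objective function described in the abstract.
    """
    center = X.mean(axis=0)
    # A small ridge term keeps the covariance invertible (illustrative choice).
    cov = np.cov(X, rowvar=False) + 1e-9 * np.eye(X.shape[1])
    precision = np.linalg.inv(cov)
    diff = X - center
    # Squared Mahalanobis distance of every training point to the center.
    d2 = np.einsum("ij,jk,ik->i", diff, precision, diff)
    radius2 = np.quantile(d2, quantile)
    return center, precision, radius2

def inside(points, center, precision, radius2):
    """Return a boolean array: True where a point lies inside the hyperellipsoid."""
    diff = np.atleast_2d(points) - center
    d2 = np.einsum("ij,jk,ik->i", diff, precision, diff)
    return d2 <= radius2

# Synthetic data stretched differently along each axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])
center, precision, radius2 = fit_hyperellipsoid(X)

print(inside(np.zeros(3), center, precision, radius2))
print(inside(np.array([50.0, 50.0, 50.0]), center, precision, radius2))
```

Points that fall outside the fitted region would be treated as outside the described domain. Fitting one such region per cluster, rather than one for the whole dataset, is the intuition behind using multiple hyperellipsoid models to reduce spatial redundancy.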
About this article
Cite this article
Qiu, Z. Online domain description of big data based on hyperellipsoid models. Int. J. Mach. Learn. & Cyber. 12, 2185–2197 (2021). https://doi.org/10.1007/s13042-021-01300-0
DOI: https://doi.org/10.1007/s13042-021-01300-0