Abstract
Many methods for detecting outliers in multivariate data exist in the literature, but most of them require estimating the covariance matrix. The higher the dimension, the harder this estimation becomes, and in high dimensions it is impossible. To avoid estimating this matrix, we propose a novel random projection-based procedure to detect outliers in Gaussian multivariate data. It consists of projecting the data onto several one-dimensional subspaces, in each of which an appropriate univariate outlier detection method is applied; this method is similar to Tukey’s method, but with a threshold that depends on the initial dimension and the sample size. The required number of projections is determined using sequential analysis. Simulated and real datasets illustrate the performance of the proposed method.
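The general idea of the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `random_projection_outliers`, the fixed number of projections `n_projections`, and the constant fence multiplier `fence` are assumptions made for illustration only. The paper instead uses a threshold that depends on the dimension and the sample size, and chooses the number of projections by sequential analysis; neither rule is given in the abstract, so neither is reproduced here.

```python
import numpy as np


def random_projection_outliers(X, n_projections=100, fence=3.0, seed=0):
    """Flag rows of X lying outside Tukey-style fences in any random projection.

    `n_projections` and `fence` are placeholder values (assumptions), not the
    dimension- and sample-size-dependent choices described in the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Draw directions uniformly on the unit sphere by normalising Gaussian vectors.
    directions = rng.standard_normal((n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Column j holds the whole sample projected onto direction j.
    projected = X @ directions.T  # shape (n, n_projections)

    # Tukey-style fences computed separately for each projection.
    q1, q3 = np.percentile(projected, [25, 75], axis=0)
    iqr = q3 - q1
    lower = q1 - fence * iqr
    upper = q3 + fence * iqr

    # A point is flagged if it falls outside the fences in at least one projection.
    outside = (projected < lower) | (projected > upper)
    return outside.any(axis=1)
```

No covariance matrix is estimated anywhere in the sketch; only univariate quartiles of each projected sample are needed, which is the property the abstract emphasises.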
Acknowledgements
We thank the referees for their detailed and useful comments, which led to improvements on an earlier draft of the paper.
Research partially supported by the Spanish Ministerio de Economía y Competitividad, Grant MTM2017-86061-C2-2-P.
Cite this article
Navarro-Esteban, P., Cuesta-Albertos, J.A. High-dimensional outlier detection using random projections. TEST 30, 908–934 (2021). https://doi.org/10.1007/s11749-020-00750-y
DOI: https://doi.org/10.1007/s11749-020-00750-y