High-dimensional outlier detection using random projections

  • Original Paper
TEST

Abstract

The literature offers multiple methods to detect outliers in multivariate data, but most of them require estimating the covariance matrix. The higher the dimension, the more complex this estimation becomes, to the point of being impossible in high dimensions. To avoid estimating this matrix, we propose a novel random projection-based procedure to detect outliers in Gaussian multivariate data. It consists of projecting the data onto several one-dimensional subspaces, in each of which an appropriate univariate outlier detection method is applied; this method is similar to Tukey’s method but uses a threshold that depends on the initial dimension and the sample size. The required number of projections is determined using sequential analysis. Simulated and real datasets illustrate the performance of the proposed method.
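
The general idea of the abstract can be illustrated with a minimal sketch. It is not the procedure of the paper: the function name `random_projection_outliers`, the fixed fence constant `c`, and the fixed number of projections `n_projections` are assumptions made for illustration. The paper instead uses a threshold that depends on the dimension and the sample size, and chooses the number of projections by sequential analysis; neither rule is given in the abstract, so neither is reproduced here.

```python
import numpy as np


def random_projection_outliers(X, n_projections=100, c=3.0, seed=0):
    """Flag rows of X lying outside a Tukey-style fence in at least one
    random one-dimensional projection.

    Assumptions (not taken from the paper): `c` is a fixed fence constant
    and `n_projections` is fixed in advance.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Directions drawn uniformly on the unit sphere: normalise Gaussian vectors.
    directions = rng.standard_normal((n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Column j holds the sample projected onto direction j.
    projected = X @ directions.T  # shape (n, n_projections)

    # Tukey-style fences computed separately for each projection.
    q1, q3 = np.percentile(projected, [25, 75], axis=0)
    iqr = q3 - q1
    lower = q1 - c * iqr
    upper = q3 + c * iqr

    outside = (projected < lower) | (projected > upper)
    # A point is flagged if it is outside the fence in any projection.
    return outside.any(axis=1)
```

No covariance matrix is estimated at any step; only univariate quartiles of the projected sample are needed. Calling `random_projection_outliers(X)` on an `(n, d)` array returns a boolean vector of length `n`. With a fixed `c`, the rate of false detections grows with the number of projections, which is one reason a threshold tied to the dimension and the sample size is needed.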

Acknowledgements

We thank the referees for their detailed and useful comments, which led to improvements on an earlier draft of the paper.

Author information

Corresponding author

Correspondence to P. Navarro-Esteban.

Additional information

Research partially supported by the Spanish Ministerio de Economía y Competitividad, Grant MTM2017-86061-C2-2-P.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 406 KB)

About this article

Cite this article

Navarro-Esteban, P., Cuesta-Albertos, J.A. High-dimensional outlier detection using random projections. TEST 30, 908–934 (2021). https://doi.org/10.1007/s11749-020-00750-y

  • DOI: https://doi.org/10.1007/s11749-020-00750-y
