Fast robust correlation for high dimensional data
Technometrics (IF 2.3), Pub Date: 2019-11-01, DOI: 10.1080/00401706.2019.1677270
Jakob Raymaekers, Peter J. Rousseeuw

The product moment covariance is a cornerstone of multivariate data analysis, from which one can derive correlations, principal components, linear regression, sparse modeling, variable screening, and many other methods. Unfortunately, the product moment covariance and the corresponding Pearson correlation are very susceptible to outliers (anomalies) in the data. Several robust measures of covariance have been developed, but few are suitable for the ultrahigh dimensional data that are becoming more prevalent. Such data call for methods whose computation scales well with the dimension, that are guaranteed to yield a positive semidefinite covariance matrix, and that are sufficiently robust to outliers as well as sufficiently accurate in the statistical sense of low variability. We construct such methods using data transformations. The resulting approach is simple, fast and widely applicable. We study its robustness by deriving influence functions and breakdown values, and by computing the mean squared error on contaminated data. Using these results we select a method that performs well overall and can be used in ultrahigh dimensional settings. The method is illustrated on a genomic data set.
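
To make the idea of a transformation-based robust correlation concrete, here is a minimal Python sketch. The abstract does not say which transformation the authors select, so the choices below are assumptions made purely for illustration: each column is standardized with its median and MAD, the resulting scores are clipped to [-c, c], and the ordinary Pearson correlation is computed on the clipped data. The function name `transformed_correlation` and the cutoff `c=1.5` are hypothetical and not taken from the paper.

```python
import numpy as np

def transformed_correlation(X, c=1.5):
    """Pearson correlation of columnwise transformed data (illustrative only).

    Assumed transformation, not the paper's selected method:
    robust z-scores (median / MAD) clipped to the interval [-c, c].
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent with the standard deviation for
    # Gaussian data; a constant column would give mad == 0 (not handled).
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    Z = (X - med) / mad
    G = np.clip(Z, -c, c)  # bounded transformation limits outlier influence
    # A Pearson correlation matrix of any data matrix is positive semidefinite.
    return np.corrcoef(G, rowvar=False)

# Usage: correlated Gaussian data with 5% gross outliers.
rng = np.random.default_rng(0)
n, rho = 500, 0.8
clean = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
contaminated = clean.copy()
contaminated[:25] = [10.0, -10.0]  # outliers that oppose the true correlation

pearson = np.corrcoef(contaminated, rowvar=False)[0, 1]
robust = transformed_correlation(contaminated)[0, 1]
print(f"Pearson: {pearson:.2f}, transformed: {robust:.2f}")
```

Because each column is transformed separately and the final step is an ordinary correlation, the cost grows linearly with the number of variables before the matrix product, and the result is positive semidefinite by construction. Clipping also shrinks the estimate somewhat on clean data, so the sketch shows the mechanism only and says nothing about the accuracy of the method the paper selects.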

Updated: 2019-11-01