Permutation methods for factor analysis and PCA
Annals of Statistics (IF 4.5), Pub Date: 2020-10-01, DOI: 10.1214/19-aos1907
Edgar Dobriban

Researchers often have datasets measuring features $x_{ij}$ of samples, such as test scores of students. In factor analysis and PCA, these features are thought to be influenced by unobserved factors, such as skills. Can we determine how many components affect the data? This is an important problem, because it has a large impact on all downstream data analysis. Consequently, many approaches have been developed to address it. Parallel Analysis is a popular permutation method. It works by independently permuting the entries of each feature (column) of the data, and it selects components whose singular values are larger than those of the permuted data. Despite widespread use in leading textbooks and scientific publications, as well as empirical evidence for its accuracy, it currently has no theoretical justification. In this paper, we show that the parallel analysis permutation method consistently selects the large components in certain high-dimensional factor models. However, it does not select the smaller components. The intuition is that permutations keep the noise invariant, while "destroying" the low-rank signal. This provides justification for permutation methods in PCA and factor models under some conditions. Our work uncovers drawbacks of permutation methods, and paves the way to improvements.
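To make the procedure concrete, here is a minimal sketch of Parallel Analysis in NumPy. It is an illustration under simple assumptions (the function name, the number of permutations, and the use of a per-component quantile threshold are choices for this sketch; published variants differ in such details, e.g. mean vs. quantile thresholds), not the paper's exact algorithm.

```python
import numpy as np

def parallel_analysis(X, n_perms=20, quantile=0.95, seed=None):
    """Estimate the number of components by Parallel Analysis (sketch).

    Each column of X is permuted independently: this preserves the
    marginal (noise) distribution of every feature while destroying the
    low-rank cross-column signal. Components are kept while their
    singular values exceed the corresponding quantile of the permuted
    singular values.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)  # observed singular values
    perm_sv = np.empty((n_perms, min(n, p)))
    for b in range(n_perms):
        # Independently shuffle the entries within each column.
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        perm_sv[b] = np.linalg.svd(Xp, compute_uv=False)
    thresh = np.quantile(perm_sv, quantile, axis=0)
    # Select leading components while they beat the permutation threshold.
    k = 0
    while k < len(sv) and sv[k] > thresh[k]:
        k += 1
    return k
```

On data with one strong factor, e.g. `X = 5 * u @ v + noise` with Gaussian `u`, `v`, and noise, the estimate is 1: the large signal singular value survives permutation, while the remaining singular values fall inside the permuted (noise) bulk, matching the paper's message that large components are selected and small ones are not.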

Updated: 2020-10-01