An empirical comparison of two approaches for CDPCA in high-dimensional data, Statistical Methods & Applications

An empirical comparison of two approaches for CDPCA in high-dimensional data
Statistical Methods & Applications (IF 1.1), Pub Date: 2020-08-18, DOI: 10.1007/s10260-020-00546-2
Adelaide Freitas, Eloísa Macedo, Maurizio Vichi

Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive because they are easier to interpret, particularly in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. In particular, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components while simultaneously identifying clusters of objects. Based on simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA, namely the ALS and two-step-SDP algorithms. To avoid possible effects of different variances among the original variables, all data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish their performance, both related to the two-step-SDP algorithm: it provides faster results than ALS and, since it employs a clustering procedure (k-means) on the variables, it outperforms the ALS algorithm in recovering the true variable partitioning underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms the ALS approach for data sets with smaller sample size and greater structural complexity (i.e., error level in the CDPCA model). The proportion of variance explained by the components estimated by both algorithms is affected by the complexity of the data structure (the higher the error level, the lower the explained variance) and is similar for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance. Moreover, experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm yields better-quality object clustering, with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.
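To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of three of them: standardizing the data, partitioning the variables with k-means, and building a loadings matrix whose columns have disjoint supports. It is an illustration under stated assumptions, not the ALS or two-step-SDP algorithm from the paper. All function names are hypothetical, the k-means is a bare-bones NumPy version, the object-clustering part of CDPCA is omitted entirely, and each component is simply taken as the first principal axis of its block of variables.

```python
import numpy as np

def standardize(X):
    """Center each variable and scale it to unit variance (assumes no constant columns)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(points, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's k-means; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def disjoint_loadings(Z, labels, k):
    """Loadings matrix in which each variable loads on exactly one component.

    Simplifying assumption: column j is the first principal axis of the
    block of variables assigned to cluster j, and zero elsewhere.
    """
    A = np.zeros((Z.shape[1], k))
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue  # an empty variable cluster leaves a zero column
        _, _, Vt = np.linalg.svd(Z[:, idx], full_matrices=False)
        A[idx, j] = Vt[0]
    return A

def explained_variance_ratio(Z, A):
    """Share of the total variance of Z captured by the component scores Z @ A."""
    scores = Z @ A
    return scores.var(axis=0).sum() / Z.var(axis=0).sum()

# Synthetic example with more variables (60) than objects (20).
rng = np.random.default_rng(0)
n, p, k = 20, 60, 3
F = rng.normal(size=(n, k))                    # latent factors
groups = np.repeat(np.arange(k), p // k)       # true variable partition
X = F[:, groups] + 0.3 * rng.normal(size=(n, p))

Z = standardize(X)
labels = kmeans(Z.T, k)                        # cluster the variables, not the objects
A = disjoint_loadings(Z, labels, k)
print("explained variance ratio:", explained_variance_ratio(Z, A))
```

Because the columns of `A` have non-overlapping supports, they are orthogonal by construction, which is what makes each component interpretable as a summary of one group of variables. Raising the noise factor (0.3 here, an arbitrary choice) is a simple way to see the explained variance fall as the error level grows.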


Updated: 2020-08-19