Chunk-wise regularised PCA-based imputation of missing data
Statistical Methods & Applications (IF 1.1), Pub Date: 2021-06-25, DOI: 10.1007/s10260-021-00575-5
A. Iodice D’Enza, A. Markos, F. Palumbo

Standard multivariate techniques such as Principal Component Analysis (PCA) are based on the eigendecomposition of a matrix and therefore require complete data sets. Recent comparative reviews of PCA algorithms for missing data showed the regularised iterative PCA algorithm (RPCA) to be effective. This paper presents two chunk-wise implementations of RPCA suitable for the imputation of “tall” data sets, that is, data sets with many observations. A “chunk” is a subset of the whole set of available observations. One implementation is suitable for distributed computation, as it imputes each chunk independently. The other is suitable for incremental computation: the imputation of each new chunk is based on all the chunks analysed so far. The proposed procedures were compared to batch RPCA on different data sets and under different missing data mechanisms. Experimental results showed that the distributed approach performed similarly to batch RPCA for data with entries missing completely at random. The incremental approach showed appreciable performance when the data are not missing completely at random, provided that the first analysed chunks contain sufficient information on the data structure.
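As a rough illustration of the distributed variant described in the abstract, the following Python sketch imputes each chunk independently with a simplified regularised iterative PCA. It is not the authors' implementation: the function names `rpca_impute` and `chunkwise_impute`, the default parameters, and the noise estimate (the mean of the discarded squared singular values) are assumptions made for this example, and every column is assumed to have at least one observed value in each chunk. The incremental variant is not covered here.

```python
import numpy as np


def rpca_impute(X, n_components=2, max_iter=200, tol=1e-6):
    """Simplified regularised iterative PCA imputation (illustrative only).

    Missing entries are marked with NaN. Assumes every column has at
    least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        U, d, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        # Assumed noise estimate: mean of the discarded squared singular values.
        tail = d[n_components:] ** 2
        sigma2 = tail.mean() if tail.size else 0.0
        # Shrink the retained singular values: d -> (d^2 - sigma2) / d.
        d_top = d[:n_components]
        safe = np.where(d_top > 0, d_top, 1.0)
        shrunk = np.maximum(d_top ** 2 - sigma2, 0.0) / safe
        fitted = (U[:, :n_components] * shrunk) @ Vt[:n_components] + mean
        # Only the missing cells are overwritten; observed cells stay fixed.
        new_filled = np.where(missing, fitted, X)
        change = np.sum((new_filled - filled) ** 2)
        filled = new_filled
        if change < tol:
            break
    return filled


def chunkwise_impute(X, n_chunks=4, **kwargs):
    """Split the rows into chunks and impute each chunk independently."""
    chunks = np.array_split(np.asarray(X, dtype=float), n_chunks, axis=0)
    return np.vstack([rpca_impute(chunk, **kwargs) for chunk in chunks])
```

Because the chunks do not share any state, the list comprehension in `chunkwise_impute` could be replaced by a parallel map over workers, which is what makes this variant a natural fit for distributed computation.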




Updated: 2021-06-28