Chunk-wise regularised PCA-based imputation of missing data, Statistical Methods & Applications



Chunk-wise regularised PCA-based imputation of missing data
Statistical Methods & Applications (IF 1.1), Pub Date: 2021-06-25, DOI: 10.1007/s10260-021-00575-5
A. Iodice D’Enza, A. Markos, F. Palumbo

Standard multivariate techniques like Principal Component Analysis (PCA) are based on the eigendecomposition of a matrix and therefore require complete data sets. Recent comparative reviews of PCA algorithms for missing data showed the regularised iterative PCA algorithm (RPCA) to be effective. This paper presents two chunk-wise implementations of RPCA suitable for the imputation of “tall” data sets, that is, data sets with many observations. A “chunk” is a subset of the whole set of available observations. One implementation is suitable for distributed computation, as it imputes each chunk independently. The other is suitable for incremental computation, where the imputation of each new chunk is based on all the chunks analysed thus far. The proposed procedures were compared to batch RPCA on different data sets and missing data mechanisms. Experimental results showed that the distributed approach performed similarly to batch RPCA for data with entries missing completely at random. The incremental approach showed appreciable performance when the data are not missing completely at random and the first analysed chunks contain sufficient information on the data structure.
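To make the distributed idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes missing entries are coded as NaN, that every column of every chunk has at least one observed value, and it uses a simplified shrinkage rule (the mean of the discarded squared singular values as a noise estimate) that is an assumption of this sketch. The function names `rpca_impute` and `impute_chunkwise` are hypothetical.

```python
import numpy as np


def rpca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    """Sketch of regularised iterative PCA imputation (NaN marks missing)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialise missing cells with the column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    k = n_components
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Assumed noise estimate: mean of the discarded squared singular values.
        sigma2 = np.mean(s[k:] ** 2) if k < s.size else 0.0
        # Shrink the retained singular values towards zero (regularisation).
        shrunk = np.maximum(s[:k] ** 2 - sigma2, 0.0) / np.maximum(s[:k], 1e-12)
        recon = (U[:, :k] * shrunk) @ Vt[:k] + mu
        # Only missing cells are overwritten; observed values stay untouched.
        new = np.where(missing, recon, X)
        change = np.sum((new - filled) ** 2) / max(np.sum(filled ** 2), 1e-12)
        filled = new
        if change < tol:
            break
    return filled


def impute_chunkwise(X, n_chunks=4, **kwargs):
    """Split the rows into chunks and impute each chunk independently."""
    chunks = np.array_split(np.asarray(X, dtype=float), n_chunks, axis=0)
    return np.vstack([rpca_impute(chunk, **kwargs) for chunk in chunks])
```

Because each chunk is processed without reference to the others, the list comprehension in `impute_chunkwise` could in principle be dispatched to separate workers. The incremental variant described in the abstract would instead carry information from earlier chunks forward, which this sketch does not attempt.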


Updated: 2021-06-28
