Fast, scalable and geo-distributed PCA for big data analytics
Information Systems (IF 3.0) Pub Date: 2021-01-06, DOI: 10.1016/j.is.2020.101710
T. M. Tariq Adnan, Md. Mehrab Tanjim, Muhammad Abdullah Adnan

Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of a dataset. When the number of dimensions grows too large, however, existing state-of-the-art PCA methods face scalability issues due to the explosion of intermediate data. Moreover, in a geographically distributed environment, where most of today's data are originally generated, these methods apply centralized PCA algorithms that require unnecessary data transmissions, and thus prove inefficient. To solve these problems, we take advantage of the zero-noise-limit Probabilistic PCA model, which provably outputs the correct principal components, and introduce a block-division method for it in order to suppress the explosion of intermediate data efficiently. We employ several optimization ideas, such as mean propagation to preserve sparsity and dynamic tuning of the number of blocks to adjust automatically to large dimensions. Additionally, for the geo-distributed environment, we propose a communication-efficient solution that reduces idle time, passes only the required parameters, and chooses a geographically ideal central datacenter for faster accumulation. We refer to our algorithm as TallnWide. Our empirical evaluation with real datasets shows that TallnWide can successfully handle significantly higher-dimensional data ($10\times$) than existing methods, and offers up to $2.9\times$ improvement in running time in the geo-distributed environment compared to conventional approaches. For reproducibility and extensibility of our work, we make the source code of TallnWide publicly available at https://github.com/tmadnan10/TallnWide.
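
To make two ideas from the abstract concrete, the following is a minimal single-machine sketch, not the TallnWide implementation. It assumes the zero-noise-limit Probabilistic PCA model is fitted with the standard EM iteration for PCA, and it illustrates mean propagation by folding the column means into the matrix products algebraically instead of subtracting them from the data (which would turn a sparse matrix dense). The function name `em_pca`, its parameters, and the use of NumPy are illustrative assumptions; block division and the geo-distributed parts are not shown.

```python
import numpy as np

def em_pca(Y, k, iters=50, seed=0):
    """Sketch: orthonormal d x k basis of the top-k principal subspace of Y (n x d).

    Hypothetical helper for illustration only; Y is never centered explicitly.
    """
    n, d = Y.shape
    mean = Y.mean(axis=0)                      # column means, shape (d,)
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((d, k))            # random initial loading matrix

    for _ in range(iters):
        # E-step: X = Yc C (C^T C)^{-1}, where Yc = Y - 1 mean^T.
        # Mean propagation: Yc C = Y C - 1 (mean^T C).
        YcC = Y @ C - mean @ C                 # (n, k) minus broadcast (k,)
        X = np.linalg.solve(C.T @ C, YcC.T).T  # C^T C is symmetric

        # M-step: C = Yc^T X (X^T X)^{-1}.
        # Mean propagation: Yc^T X = Y^T X - mean (1^T X).
        YctX = Y.T @ X - np.outer(mean, X.sum(axis=0))
        C = np.linalg.solve(X.T @ X, YctX.T).T

    # C spans the principal subspace; orthonormalize it for convenience.
    Q, _ = np.linalg.qr(C)
    return Q
```

Calling `em_pca(Y, k=2)` on an `n x d` array returns a `d x 2` basis whose span should approach that of the top two right singular vectors of the centered data; the columns themselves are not guaranteed to be the individual principal components, only a basis of the same subspace. The intermediate matrices here are `n x k` and `d x k`, which gives a sense of why intermediate data grows with the number of dimensions.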

Updated: 2021-01-16
