Least-Square Approximation for a Distributed System
Journal of Computational and Graphical Statistics (IF 2.4), Pub Date: 2021-06-21, DOI: 10.1080/10618600.2021.1923517
Xuening Zhu, Feng Li, Hansheng Wang

Abstract

In this work, we develop a distributed least-square approximation (DLSA) method that is able to solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox’s model) on a distributed system. By approximating the local objective function using a local quadratic form, we are able to obtain a combined estimator by taking a weighted average of local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator. Moreover, it requires only one round of communication. We further conduct a shrinkage estimation based on the DLSA estimation using an adaptive Lasso approach. The solution can be easily obtained by using the LARS algorithm on the master node. It is theoretically shown that the resulting estimator possesses the oracle property and is selection consistent by using a newly designed distributed Bayesian information criterion. The finite sample performance and computational efficiency are further illustrated by an extensive numerical study and an airline dataset. The airline dataset is 52 GB in size. The entire methodology has been implemented in Python for a de-facto standard Spark system. The proposed DLSA algorithm on the Spark system takes 26 min to obtain a logistic regression estimator, which is more efficient and memory friendly than conventional methods. Supplementary materials for this article are available online.
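The one-round combination step described in the abstract can be illustrated with a small sketch. This is not the authors' Spark implementation; it is a minimal NumPy example for the linear regression case only, under the assumption that each worker's weight matrix is its local information matrix X_k'X_k / n_k scaled by its sample share n_k / N (a common noise variance is assumed and therefore omitted). The function names `local_fit` and `combine` and the simulated data are hypothetical.

```python
import numpy as np

def local_fit(X, y):
    """Worker side: local OLS estimate plus local information matrix."""
    n = X.shape[0]
    info = X.T @ X / n  # proportional to the inverse covariance of the local estimate
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return n, theta, info

def combine(local_results):
    """Master side: matrix-weighted average of the local estimates."""
    total = sum(n for n, _, _ in local_results)
    p = local_results[0][1].shape[0]
    weight_sum = np.zeros((p, p))
    weighted_theta = np.zeros(p)
    for n, theta, info in local_results:
        w = (n / total) * info  # weight = sample share times local information
        weight_sum += w
        weighted_theta += w @ theta
    return np.linalg.solve(weight_sum, weighted_theta)

# Simulated data (hypothetical), split across K workers.
rng = np.random.default_rng(0)
N, p, K = 6000, 5, 4
X = rng.normal(size=(N, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=N)

shards = zip(np.array_split(X, K), np.array_split(y, K))
results = [local_fit(Xk, yk) for Xk, yk in shards]  # one message per worker

theta_dlsa = combine(results)
theta_global = np.linalg.lstsq(X, y, rcond=None)[0]
```

In the linear case the local objectives are exactly quadratic, so the combined estimate coincides with the global least-squares fit up to floating-point error. For logistic regression or Cox's model the quadratic form is only an approximation, and the abstract's efficiency claim is an asymptotic one.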




Updated: 2021-06-21