Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-01-06, DOI: 10.1109/tpds.2020.3048836
Daning Cheng, Shigang Li, Hanping Zhang, Fen Xia, Yunquan Zhang

As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralized optimization algorithm, and (4) the dual coordinate optimization algorithm (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms.
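As a rough illustration of the kind of dataset statistics the abstract refers to, the sketch below computes per-feature variance and per-sample sparsity for a design matrix with NumPy. The function names and the concrete definitions used here (e.g., sparsity as the fraction of zero entries) are illustrative assumptions, not the paper's exact measures.

```python
import numpy as np

def feature_variance(X):
    """Per-feature variance across samples; larger variance means larger
    sample differences along that feature (one of the properties the
    abstract mentions)."""
    return X.var(axis=0)

def sample_sparsity(X):
    """Fraction of zero entries in each sample (illustrative definition
    of sample sparsity)."""
    return (X == 0).mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy dataset: 1000 samples, 50 features, mostly zeros to mimic sparse data.
    X = rng.normal(size=(1000, 50)) * (rng.random((1000, 50)) < 0.2)
    print("mean feature variance:", feature_variance(X).mean())
    print("mean sample sparsity:", sample_sparsity(X).mean())
```

In the paper's framing, statistics like these bound how many workers can be used productively by algorithms such as Hogwild! or minibatch SGD before parallel updates interfere with convergence.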

Updated: 2021-02-19