Straggler Mitigation through Unequal Error Protection for Distributed Approximate Matrix Multiplication
arXiv - CS - Information Theory Pub Date : 2021-03-04 , DOI: arxiv-2103.02928
Busra Tegin, Eduin E. Hernandez, Stefano Rini, Tolga M. Duman

Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources, giving rise to the "straggler problem". As a remedy to this problem, linear coding of the matrix sub-blocks can be used, i.e., the Parameter Server (PS) utilizes a channel code to encode the matrix sub-blocks and distributes these coded blocks to the workers for multiplication. In this paper, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting in the presence of stragglers. The resiliency level of each sub-block is chosen according to its norm, as blocks with larger norms have a greater effect on the result of the matrix multiplication. In particular, we consider two approaches for distributing the matrix computation: (i) a row-times-column paradigm, and (ii) a column-times-row paradigm. For both paradigms, we characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the back-propagation step in the training of a Deep Neural Network (DNN) for an image classification task, i.e., to the evaluation of the gradients during back-propagation. Our numerical experiments show that significant improvements in the overall time required to reach DNN training convergence can indeed be obtained by producing matrix product approximations using UEP codes.
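
To make the column-times-row paradigm concrete, the sketch below (not from the paper; the function names such as `approximate_product` and the norm-based selection rule are illustrative assumptions) partitions the inner dimension of C = A @ B into blocks so that each outer-product term A_k @ B_k can be assigned to a separate worker, and then forms an approximation of C from only the terms with the largest norms, mimicking the case where the least-protected terms are lost to stragglers.

```python
import numpy as np

# Minimal sketch of the column-times-row decomposition C = sum_k A_k @ B_k,
# where A_k are column blocks of A and B_k are the matching row blocks of B.
# Terms with larger norms contribute more to C, so under a UEP code they would
# receive stronger protection; here we simply drop the smallest-norm terms to
# emulate missing results from stragglers.

def column_times_row_blocks(A, B, num_blocks):
    """Split the inner dimension into `num_blocks` contiguous index blocks."""
    splits = np.array_split(np.arange(A.shape[1]), num_blocks)
    return [(A[:, idx], B[idx, :]) for idx in splits]

def approximate_product(A, B, num_blocks, num_received):
    """Sum only the `num_received` outer-product terms with the largest norms."""
    blocks = column_times_row_blocks(A, B, num_blocks)
    norms = [np.linalg.norm(Ak @ Bk) for Ak, Bk in blocks]  # term importance
    keep = np.argsort(norms)[::-1][:num_received]
    C_hat = np.zeros((A.shape[0], B.shape[1]))
    for k in keep:
        Ak, Bk = blocks[k]
        C_hat += Ak @ Bk
    return C_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128))
B = rng.standard_normal((128, 32))
C_hat = approximate_product(A, B, num_blocks=8, num_received=6)  # 2 terms "straggle"
print("relative error:", np.linalg.norm(A @ B - C_hat) / np.linalg.norm(A @ B))
```

For matrices with uncorrelated entries, each dropped term removes a roughly equal share of the product's energy, which is the regime in which the paper's bounds on the expected reconstruction error apply.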

Updated: 2021-03-05