SoftMemoryBox II: A Scalable, Shared Memory Buffer Framework for Accelerating Distributed Training of Large-scale Deep Neural Networks
IEEE Access ( IF 3.9 ) Pub Date : 2020-01-01 , DOI: 10.1109/access.2020.3038112
Shinyoung Ahn , Eunji Lim

Distributed processing using high-performance computing resources is essential for developers to train large-scale deep neural networks (DNNs). The major impediment to distributed DNN training is the communication bottleneck during parameter exchange among the distributed training workers: it increases training time and decreases the utilization of computational resources. Our previous study, SoftMemoryBox (SMB1), demonstrated considerably superior performance compared to the message passing interface (MPI) for parameter communication in distributed DNN training. However, SMB1 had several shortcomings: the scalability of distributed training was limited by the communication bandwidth of a single memory server, it provided no synchronization function for the shared memory buffer, and its kernel-level implementation resulted in low portability and usability. This paper proposes a scalable, shared memory buffer framework, called SoftMemoryBox II (SMB2), which overcomes these shortcomings. With SMB2, distributed training processes can easily share a virtually unified shared memory buffer composed of memory segments provided by remote memory servers, and can exchange DNN parameters at high speed through this buffer. The scalable communication bandwidth of the SMB2 framework reduces distributed DNN training times compared to SMB1. According to intensive evaluation results, the communication bandwidth of SMB2 is 6.3 times greater than that of SMB1 when the framework is scaled out to eight memory servers. Moreover, SMB2-based asynchronous distributed training of five DNN models is up to 2.4 times faster than SMB1-based training.
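The abstract does not detail SMB2's programming interface. As a rough illustration of the pattern it describes (training workers attaching to a shared parameter buffer and updating it in place), the following minimal Python sketch uses the local multiprocessing.shared_memory module as a stand-in for the virtually unified buffer that SMB2 assembles from remote memory-server segments. The buffer name, parameter count, and update rule are hypothetical and are not taken from the paper.

# Conceptual sketch only; not the SMB2 API. Local shared memory stands in
# for the remote memory segments that SMB2 aggregates into one buffer.
import numpy as np
from multiprocessing import shared_memory

PARAM_COUNT = 1_000_000  # hypothetical number of float32 model parameters

# A coordinating process creates the parameter buffer once...
shm = shared_memory.SharedMemory(create=True, size=PARAM_COUNT * 4,
                                 name="dnn_params")
params = np.ndarray((PARAM_COUNT,), dtype=np.float32, buffer=shm.buf)
params[:] = 0.0  # initial parameter values

# ...and each training worker attaches to the same buffer by name,
# reads the current parameters, and writes back its update in place
# (mimicking asynchronous parameter exchange through shared memory).
worker_shm = shared_memory.SharedMemory(name="dnn_params")
worker_view = np.ndarray((PARAM_COUNT,), dtype=np.float32,
                         buffer=worker_shm.buf)
local_gradient = np.random.randn(PARAM_COUNT).astype(np.float32)
worker_view[:] -= 0.01 * local_gradient  # hypothetical SGD-style update

worker_shm.close()
shm.close()
shm.unlink()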

Updated: 2020-01-01