Exploring GPU acceleration of Deep Neural Networks using Block Circulant Matrices
Parallel Computing (IF 1.4), Pub Date: 2020-10-16, DOI: 10.1016/j.parco.2020.102701
Shi Dong, Pu Zhao, Xue Lin, David Kaeli

Training a Deep Neural Network (DNN) is a significant computing task, since it places high demands on computing resources and memory bandwidth. Many approaches have been proposed to compress the network while maintaining high model accuracy, reducing the computational demands associated with large-scale DNN training. One attractive approach is to leverage Block Circulant Matrices (BCMs), compressing the linear transformation layers, e.g., convolutional and fully-connected layers, that rely heavily on performing General Matrix Multiplications (GEMMs). By using BCMs, we can reduce the weight storage for a linear transformation layer from O(N²) to O(N). BCMs are also more efficient in terms of computational complexity, improving algorithmic complexity from O(N²) to O(N log N).
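
To make the complexity argument concrete, the sketch below shows how a circulant matrix-vector product reduces to an FFT-based circular convolution: only the length-N defining vector is stored (O(N) weights) and the product costs O(N log N). This is a minimal NumPy illustration of the underlying identity, not the paper's GPU implementation; the column-first circulant convention is an assumption.

```python
import numpy as np

def circulant_matvec(c, x):
    # C[i, j] = c[(i - j) % N], so C @ x is the circular convolution of c and x.
    # Computing it through the FFT costs O(N log N) instead of O(N^2).
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Sanity check against the explicit dense multiply.
N = 8
rng = np.random.default_rng(0)
c, x = rng.standard_normal(N), rng.standard_normal(N)
C = np.stack([np.roll(c, j) for j in range(N)], axis=1)  # full N x N circulant: O(N^2) storage
assert np.allclose(C @ x, circulant_matvec(c, x))        # FFT path needs only the length-N vector c
```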

Previous work has only evaluated DNNs using BCMs targeting FPGAs for inference. There has been little prior work that considers the potential benefits of using BCMs for accelerating DNN training on GPUs. In this paper, we explore acceleration of DNNs using BCMs on a state-of-the-art GPU. First, we identify the challenges posed by using BCMs. Next, we perform both general and GPU-specific optimizations that impact: (i) the decomposition and interaction of individual operations, and (ii) the overall GPU kernel design. We modify the algorithmic steps to remove redundant computations, while maintaining mathematical integrity. We also leverage multiple GPU kernel optimizations, considering performance factors such as occupancy, data sharing/reuse patterns, and memory coalescing. We evaluate the performance of DNN training on an NVIDIA Tesla V100, providing insights into the benefits of our proposed kernel optimizations on a state-of-the-art GPU. Based on our results, we achieve average speedups of 1.31× and 2.79× for the convolutional and fully-connected layers, respectively, for AlexNet. We also achieve average speedups of 1.33× and 3.66× for the convolutional and fully-connected layers, respectively, for VGGNet-16.
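
As a reference point for what the optimized kernels compute, here is a hedged NumPy sketch of a block-circulant fully-connected layer forward pass, assuming the common formulation in which the weight matrix is partitioned into b×b circulant blocks, each defined by a length-b vector and applied in the frequency domain. The function name, shapes, and block convention are illustrative assumptions, not the authors' kernel design.

```python
import numpy as np

def block_circulant_fc(x, c):
    # x: input vector of length q * b; c: (p, q, b) array holding the first column
    # of each b x b circulant block of the (p*b) x (q*b) weight matrix.
    p, q, b = c.shape
    xf = np.fft.fft(x.reshape(q, b), axis=1)       # FFT of each input block
    cf = np.fft.fft(c, axis=2)                     # FFT of each block's defining vector
    yf = np.einsum('pqb,qb->pb', cf, xf)           # accumulate block products in the frequency domain
    return np.real(np.fft.ifft(yf, axis=1)).reshape(p * b)

# Example: a 6-output, 9-input layer with 3 x 3 circulant blocks (p=2, q=3, b=3).
rng = np.random.default_rng(1)
c = rng.standard_normal((2, 3, 3))
x = rng.standard_normal(9)
y = block_circulant_fc(x, c)   # shape (6,), using only 18 stored weights instead of 54
```

On a GPU, the per-block FFTs and the frequency-domain accumulation map naturally onto batched kernels, which is the level at which occupancy, data sharing/reuse, and memory-coalescing considerations such as those discussed above would apply.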




Updated: 2020-10-17