Efficient convolution pooling on the GPU
Journal of Parallel and Distributed Computing (IF 3.8), Pub Date: 2020-01-07, DOI: 10.1016/j.jpdc.2019.12.006
Shunsuke Suita, Takahiro Nishimura, Hiroki Tokura, Koji Nakano, Yasuaki Ito, Akihiko Kasagi, Tsuguchika Tabaru

The main contribution of this paper is to show efficient GPU implementations of the convolution-pooling, in which pooling follows the multiple convolution. Since the multiple convolution and the pooling operations are performed alternately in the earlier stages of many Convolutional Neural Networks (CNNs), it is very important to accelerate the convolution-pooling. Our new GPU implementation uses two techniques: (1) convolution interchange with a direct sum, and (2) conversion to matrix multiplication. These techniques reduce both the computational cost and the memory-access cost. Furthermore, the interchanged convolution is converted to matrix multiplication, which cuBLAS computes very efficiently. Experimental results on a Tesla V100 GPU show that our new cuDNN-compatible GPU implementation of the convolution-pooling is 2.90 times and 1.43 times faster for fp32 and fp16, respectively, than the multiple convolution followed by the pooling with cuDNN, the most popular library of primitives for implementing CNNs on the GPU.
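The two techniques are easiest to see on a single channel with 2x2 average pooling. The pooling sum can be pulled inside the convolution: pre-averaging every 2x2 window of the input (the direct sum) and then convolving with stride 2 gives the same result as convolving first and pooling afterwards, while producing only a quarter of the convolution outputs. The strided convolution is then flattened into one matrix product (im2col + GEMM), which is the form handed to cuBLAS. The sketch below is a minimal single-channel NumPy illustration of this identity, not the paper's CUDA implementation; all function names are our own.

```python
# Minimal sketch (assumed, not the paper's code) of the two techniques:
# (1) convolution interchange with a direct sum over the input, and
# (2) conversion of the strided convolution to matrix multiplication.
import numpy as np

def conv2d(x, w):
    """Plain 'valid' cross-correlation of a 2-D input x with kernel w."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return y

def avg_pool2x2(y):
    """2x2 average pooling with stride 2."""
    return (y[0::2, 0::2] + y[0::2, 1::2] + y[1::2, 0::2] + y[1::2, 1::2]) / 4.0

def direct_sum2x2(x):
    """Stride-1 2x2 average of the input (the 'direct sum', scaled by 1/4)."""
    return (x[:-1, :-1] + x[:-1, 1:] + x[1:, :-1] + x[1:, 1:]) / 4.0

def im2col_gemm_conv(s, w, stride=2):
    """Strided convolution expressed as a single matrix product
    (im2col + GEMM), the form a BLAS library consumes."""
    kh, kw = w.shape
    oh = (s.shape[0] - kh) // stride + 1
    ow = (s.shape[1] - kw) // stride + 1
    cols = np.stack([s[i*stride:i*stride+kh, j*stride:j*stride+kw].ravel()
                     for i in range(oh) for j in range(ow)], axis=1)
    return (w.ravel() @ cols).reshape(oh, ow)

rng = np.random.default_rng(0)
x = rng.standard_normal((18, 18))   # conv output (16x16) pools evenly to 8x8
w = rng.standard_normal((3, 3))

baseline = avg_pool2x2(conv2d(x, w))           # convolution, then pooling
fused = im2col_gemm_conv(direct_sum2x2(x), w)  # direct sum, then strided GEMM
assert np.allclose(baseline, fused)
```

Because the GEMM runs at stride 2, its im2col matrix has a quarter of the columns of the unfused convolution, which is where the computational and memory-access savings come from; in the paper's implementation the resulting matrix product is dispatched to cuBLAS.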




Updated: 2020-01-07