Optimizing Depthwise Separable Convolution Operations on GPUs
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2021-05-28, DOI: 10.1109/tpds.2021.3084813
Gangzhao Lu, Weizhe Zhang, Zheng Wang

The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs) and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch model training and for the typical inference scenario, where the model takes in only a few samples at a time. This article aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve the column and row reuse of the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads, improving GPU utilization and hiding memory access latency. We apply our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over $2\times$ (up to $3\times$) performance improvement over cuDNN. We show that, with a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent on average, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
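For context on the operation being optimized, the sketch below shows how a depthwise separable convolution factors a standard multi-channel 2D convolution into a per-channel (depthwise) spatial pass followed by a 1x1 (pointwise) channel-mixing pass. This is a minimal NumPy reference, not the paper's GPU kernels; the function names, tensor shapes, and the multiply-accumulate counts in the comments are illustrative assumptions.

```python
import numpy as np

def standard_conv2d(x, w):
    """Standard multi-channel 2D convolution, stride 1, no padding.
    x: (C_in, H, W), w: (C_out, C_in, K, K) -> (C_out, H-K+1, W-K+1)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

def depthwise_separable_conv2d(x, w_dw, w_pw):
    """Depthwise separable convolution: a per-channel spatial (depthwise)
    convolution followed by a 1x1 (pointwise) convolution across channels.
    x: (C_in, H, W), w_dw: (C_in, K, K), w_pw: (C_out, C_in)."""
    c_in, h, wd = x.shape
    k = w_dw.shape[-1]
    dw = np.zeros((c_in, h - k + 1, wd - k + 1))
    for c in range(c_in):  # depthwise: each input channel convolved independently
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * w_dw[c])
    # pointwise: 1x1 convolution that mixes channels at every spatial position
    return np.einsum('oc,chw->ohw', w_pw, dw)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8, 8))
    y = depthwise_separable_conv2d(x,
                                   rng.standard_normal((4, 3, 3)),
                                   rng.standard_normal((8, 4)))
    print(y.shape)  # (8, 6, 6)

# Rough multiply-accumulate counts for C_in=32, C_out=64, K=3 on a 56x56 output:
#   standard:            C_out*C_in*K*K*H*W          = 64*32*9*56*56      ~= 57.8M MACs
#   depthwise separable: (C_in*K*K + C_in*C_out)*H*W = (32*9+32*64)*56*56 ~=  7.3M MACs
```

For a 3x3 kernel and a reasonably large output channel count, the factorization cuts the arithmetic by roughly 8-9x, which is why depthwise separable layers dominate networks such as MobileNet and EfficientNet; the paper's contribution is making these much smaller, memory-bound kernels run efficiently on GPUs.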

Updated: 2021-05-28