High Performance and Portable Convolution Operators for ARM-based Multicore Processors
arXiv - CS - Performance. Pub Date: 2020-05-13, DOI: arXiv-2005.06410
Pablo San Juan, Adrián Castelló, Manuel F. Dolz, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí

The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrix generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.
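To make the baseline approach (and the workspace the paper eliminates) concrete, the following is a minimal NumPy sketch of the conventional IM2COL-plus-GEMM convolution, not the BLIS-fused algorithm the paper proposes. All function names are illustrative; a stride-1 "valid" convolution over a (C, H, W) input is assumed.

```python
import numpy as np

def im2col(x, kh, kw):
    """Flatten every kh x kw patch of x (shape C, H, W) into one column.
    This (C*kh*kw) x (oh*ow) matrix is the intermediate workspace that
    the paper's fused algorithm avoids materializing."""
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1          # output spatial dims
    cols = np.empty((C * kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols

def conv_im2col(x, w):
    """Convolution expressed as a single GEMM:
    weights (K, C, kh, kw) reshaped to (K, C*kh*kw) times the im2col matrix."""
    K, C, kh, kw = w.shape
    cols = im2col(x, kh, kw)                 # explicit transform (cost 2)
    out = w.reshape(K, -1) @ cols            # the highly optimized GEMM call
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(K, oh, ow)
```

Note that the `cols` workspace grows by a factor of kh*kw relative to the input, which is exactly problem 1) above; the two nested loops that build it are problem 2).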

Updated: 2020-05-14