FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2020-03-01, DOI: 10.1109/tpds.2019.2939785
Haidong Lan, Jintao Meng, Christian Hundt, Bertil Schmidt, Minwen Deng, Xiaoning Wang, Weiguo Liu, Yu Qiao, Shengzhong Feng

Deep Learning is ubiquitous in a wide range of applications, from research to industry. In contrast to the time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation, making it amenable to execution on mobile devices. Nevertheless, low latency and high computational efficiency are crucial to allow for complex models and prolonged battery life. Addressing these challenges, we propose FeatherCNN, a fast inference library for ARM CPUs that targets the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) a highly efficient TensorGEMM (general matrix multiplication) routine accelerates Winograd convolution on ARM CPUs; 2) general layer optimization based on custom high-performance kernels improves both the computational efficiency and the locality of memory access patterns for non-Winograd layers; 3) the framework design emphasizes joint layer-wise optimization, using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries. A forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 Plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 Plus and Eigen on a Samsung Galaxy S8, respectively. The source code of the FeatherCNN library is publicly available at https://github.com/tencent/feathercnn.
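To illustrate the first technique: Winograd convolution lowers the multiplication count by transforming inputs and filters before an element-wise product. The sketch below shows the underlying 1-D F(2,3) algorithm, which computes two outputs of a 3-tap filter with 4 multiplications instead of the naive 6. This is illustrative only; FeatherCNN's actual TensorGEMM routine batches such transforms into NEON-optimized matrix multiplications.

    #include <array>
    #include <cstdio>

    // Minimal 1-D Winograd F(2,3): two outputs of a 3-tap filter from
    // four inputs, using 4 multiplications instead of the naive 6.
    // In practice the filter-side factors g[0], (g[0]+g[1]+g[2])/2,
    // (g[0]-g[1]+g[2])/2, g[2] are precomputed once per filter.
    std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                      const std::array<float, 3>& g) {
        float m0 = (d[0] - d[2]) * g[0];
        float m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) * 0.5f;
        float m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) * 0.5f;
        float m3 = (d[1] - d[3]) * g[2];
        return { m0 + m1 + m2, m1 - m2 - m3 };
    }

    int main() {
        std::array<float, 4> d = {1, 2, 3, 4};
        std::array<float, 3> g = {1, 0, -1};
        auto y = winograd_f23(d, g);
        // Direct convolution agrees: y0 = 1*1 + 2*0 + 3*(-1) = -2,
        //                            y1 = 2*1 + 3*0 + 4*(-1) = -2.
        std::printf("%.1f %.1f\n", y[0], y[1]);
        return 0;
    }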
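The second technique concerns kernel-level locality. The following generic cache-blocked GEMM loop nest sketches the idea; the block sizes, row-major layout, and scalar inner loop are assumptions for illustration, whereas production kernels such as FeatherCNN's additionally pack panels and use NEON intrinsics.

    #include <algorithm>

    // Illustrative cache-blocked GEMM (C += A * B, row-major), showing
    // how blocking keeps a working set resident in cache. Block sizes
    // are placeholder values, not FeatherCNN's tuning. The caller must
    // zero-initialize C to compute C = A * B.
    constexpr int BM = 64, BN = 64, BK = 64;

    void gemm_blocked(int M, int N, int K,
                      const float* A, const float* B, float* C) {
        for (int i0 = 0; i0 < M; i0 += BM)
            for (int k0 = 0; k0 < K; k0 += BK)
                for (int j0 = 0; j0 < N; j0 += BN)
                    // The BM x BK tile of A and BK x BN tile of B are
                    // reused across the inner iterations.
                    for (int i = i0; i < std::min(i0 + BM, M); ++i)
                        for (int k = k0; k < std::min(k0 + BK, K); ++k) {
                            float a = A[i * K + k];
                            for (int j = j0; j < std::min(j0 + BN, N); ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }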
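For the third technique, a common instance of layer fusion (an assumed example, not a verbatim excerpt of FeatherCNN) folds an inference-time batch normalization into the preceding convolution's weights and bias, eliminating one full pass over the feature map; names and layout below are illustrative.

    #include <cmath>
    #include <vector>

    // Fold a per-output-channel batch norm into the preceding
    // convolution: BN(conv(x) + bias) becomes a plain convolution with
    // rescaled weights and a shifted bias, since at inference time
    // BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta is affine.
    void fuse_conv_bn(std::vector<float>& weights,   // [out_c * per_c]
                      std::vector<float>& bias,      // [out_c]
                      const std::vector<float>& gamma,
                      const std::vector<float>& beta,
                      const std::vector<float>& mean,
                      const std::vector<float>& var,
                      float eps = 1e-5f) {
        const size_t out_c = bias.size();
        const size_t per_c = weights.size() / out_c;
        for (size_t c = 0; c < out_c; ++c) {
            float scale = gamma[c] / std::sqrt(var[c] + eps);
            for (size_t w = 0; w < per_c; ++w)
                weights[c * per_c + w] *= scale;
            bias[c] = (bias[c] - mean[c]) * scale + beta[c];
        }
    }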

Updated: 2020-03-01