High performance and energy efficient inference for deep learning on ARM processors
arXiv - CS - Performance Pub Date: 2021-05-19, DOI: arxiv-2105.09187
Adrián Castelló, Sergio Barrachina, Manuel F. Dolz, Enrique S. Quintana-Ortí, Pau San Juan

We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework: the development and integration of Cython routines to exploit thread-level parallelism; the design and development of micro-kernels for matrix multiplication, vectorized with ARM's NEON intrinsics, that can accommodate layer fusion; and the appropriate selection of several cache configuration parameters tailored to the memory hierarchy of the target ARM processors. Our experiments evaluate inference throughput (measured in processed images/s), inference latency (i.e., time-to-response), and energy consumption per image while varying the level of thread parallelism and the processor power modes. The experiments with the new inference engine are reported for the ResNet50 v1.5 model on the ImageNet dataset from the MLPerf suite, using the ARM v8.2 cores in the NVIDIA Jetson AGX Xavier board. These results show superior performance compared with the widely used TFLite from Google, and slightly inferior results compared with ArmNN, ARM's native library for DNN inference.
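
The abstract gives no code, but two of its central ideas (a NEON-vectorized matrix-multiplication micro-kernel and layer fusion) can be illustrated with a rough sketch in C. The function below is a hypothetical 4x4 micro-kernel that accumulates a register tile with AArch64 NEON fused multiply-adds and applies the next layer's ReLU before writing the tile back; the function name, packing layout, and tile size are illustrative assumptions, not the paper's actual PyDTNN code.

    #include <arm_neon.h>

    /* Hypothetical 4x4 GEMM micro-kernel: C += A * B, with the following
     * layer's ReLU fused into the epilogue. A is a packed 4 x kc panel
     * (stored column by column), B is a packed kc x 4 panel (stored row by
     * row), and C is row-major with leading dimension ldc. Requires AArch64
     * (vfmaq_laneq_f32 is not available on 32-bit ARM). */
    static void ukernel_4x4_relu(int kc,
                                 const float *A, const float *B,
                                 float *C, int ldc)
    {
        /* Load the 4x4 output tile into registers. */
        float32x4_t c0 = vld1q_f32(C + 0 * ldc);
        float32x4_t c1 = vld1q_f32(C + 1 * ldc);
        float32x4_t c2 = vld1q_f32(C + 2 * ldc);
        float32x4_t c3 = vld1q_f32(C + 3 * ldc);

        for (int p = 0; p < kc; ++p) {
            float32x4_t a = vld1q_f32(A + 4 * p);  /* column p of the A panel */
            float32x4_t b = vld1q_f32(B + 4 * p);  /* row p of the B panel    */
            c0 = vfmaq_laneq_f32(c0, b, a, 0);     /* c0 += a[0] * b */
            c1 = vfmaq_laneq_f32(c1, b, a, 1);
            c2 = vfmaq_laneq_f32(c2, b, a, 2);
            c3 = vfmaq_laneq_f32(c3, b, a, 3);
        }

        /* Fused activation: apply ReLU while the tile is still in registers,
         * saving a separate pass over C in memory. */
        float32x4_t zero = vdupq_n_f32(0.0f);
        vst1q_f32(C + 0 * ldc, vmaxq_f32(c0, zero));
        vst1q_f32(C + 1 * ldc, vmaxq_f32(c1, zero));
        vst1q_f32(C + 2 * ldc, vmaxq_f32(c2, zero));
        vst1q_f32(C + 3 * ldc, vmaxq_f32(c3, zero));
    }

In the GotoBLAS-style blocking scheme that such micro-kernels usually sit inside, the "cache configuration parameters" mentioned in the abstract would correspond to block sizes (commonly called mc, nc, kc) chosen so that the packed A and B panels fit the L1/L2 caches of the target cores; the concrete values selected in the paper are not given in this abstract.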

Updated: 2021-05-20