Benchmarking vision kernels and neural network inference accelerators on embedded platforms
Journal of Systems Architecture (IF 4.5), Pub Date: 2020-09-25, DOI: 10.1016/j.sysarc.2020.101896
Murad Qasaimeh, Kristof Denolf, Alireza Khodamoradi, Michaela Blott, Jack Lo, Lisa Halder, Kees Vissers, Joseph Zambreno, Phillip H. Jones

Developing efficient embedded vision applications requires exploring various algorithmic optimization trade-offs and a broad spectrum of hardware architecture choices. This makes navigating the solution space and finding design points with optimal performance trade-offs a challenge for developers. To help provide a fair baseline comparison, we conducted comprehensive benchmarks of accuracy, run-time, and energy efficiency for a wide range of vision kernels and neural networks on multiple embedded platforms: an ARM Cortex-A57 CPU, an Nvidia Jetson TX2 GPU, and a Xilinx ZCU102 FPGA. Each platform uses its optimized libraries for vision kernels (OpenCV, VisionWorks, and xfOpenCV) and neural networks (OpenCV DNN, TensorRT, and Xilinx DPU). For vision kernels, our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× compared to the other platforms for simple kernels. However, for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. For neural networks [Inception-v2, ResNet-50, ResNet-18, Mobilenet-v2, and SqueezeNet], our results show that the FPGA achieves speedups of [2.5, 2.1, 2.6, 2.9, and 2.5]× and EDP reduction ratios of [1.5, 1.1, 1.4, 2.4, and 1.7]× compared to the GPU FP16 implementations, respectively.
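The abstract's two comparison metrics, energy per frame and EDP (energy-delay product), can be derived from measured power and per-frame runtime. The sketch below is illustrative only: the platform names, power figures, and runtimes are made-up placeholders, not measurements from the paper.

```python
def energy_per_frame(power_watts: float, runtime_s: float) -> float:
    """Energy consumed to process one frame, in joules (E = P * t)."""
    return power_watts * runtime_s


def edp(power_watts: float, runtime_s: float) -> float:
    """Energy-delay product: energy/frame further weighted by latency,
    so it penalizes slow-but-frugal designs (EDP = E * t)."""
    return energy_per_frame(power_watts, runtime_s) * runtime_s


# Hypothetical numbers for two platforms (NOT from the paper):
gpu_energy = energy_per_frame(power_watts=10.0, runtime_s=0.020)   # 0.2 J/frame
fpga_energy = energy_per_frame(power_watts=5.0, runtime_s=0.010)   # 0.05 J/frame

# "Energy/frame reduction ratio" as used in the abstract:
energy_reduction = gpu_energy / fpga_energy   # 4.0x in this toy example

# "EDP reduction ratio":
edp_reduction = edp(10.0, 0.020) / edp(5.0, 0.010)   # 8.0x in this toy example
```

Because EDP multiplies energy by latency, a platform can win on energy/frame yet lose on EDP if it is much slower, which is why the paper reports both.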




Updated: 2020-09-25