Understanding Cache Boundness of ML Operators on ARM Processors
arXiv - CS - Performance. Pub Date: 2021-02-01. arXiv:2102.00932
Bernhard Klein, Christoph Gratl, Manfred Mücke, Holger Fröning

Machine learning compilers like TVM allow fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, designing a proper solution requires understanding the limitations of the typical compute-intensive operators in ML workloads. This work presents the first in-depth analysis of dense and convolution operators, generated with TVM, against the fundamental hardware limits of embedded ARM processors. It explains the gap between computational peak performance, both theoretical and measured, and real-world state-of-the-art results obtained with TVM and OpenBLAS. Rather than being compute-bound, single-precision general matrix multiply (GEMM) and convolutions turn out to be bound by L1-cache-read bandwidth. Explorations of 8-bit and bit-serial quantized operators show that quantization can achieve relevant speedups over cache-bound floating-point operators. However, the performance of quantized operators depends strongly on the interaction between data layout and bit packing.
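The L1-cache-read-bandwidth bound can be made concrete with a simple roofline-style estimate. The sketch below illustrates that style of analysis only; it is not the paper's model, and the clock rate, FLOP rate, L1 load width, and register-tile sizes are all assumed values for a generic ARM Cortex-A class core.

```python
# Back-of-envelope roofline check: is an FP32 GEMM micro-kernel
# compute-bound or L1-read-bandwidth-bound? All hardware numbers are
# illustrative assumptions, not measurements from the paper.

CLOCK_GHZ = 1.5                # assumed core clock
FLOPS_PER_CYCLE = 8            # assumed: 128-bit NEON FMA = 4 mul + 4 add per cycle
L1_READ_BYTES_PER_CYCLE = 16   # assumed: one 128-bit L1 load per cycle

peak_gflops = CLOCK_GHZ * FLOPS_PER_CYCLE
l1_read_gbps = CLOCK_GHZ * L1_READ_BYTES_PER_CYCLE

# Machine balance: FLOPs the core can execute per byte read from L1.
machine_balance = peak_gflops / l1_read_gbps  # FLOP per byte

def gemm_intensity(m_r: int, n_r: int) -> float:
    """Arithmetic intensity of a register-blocked FP32 GEMM micro-kernel.

    An m_r x n_r accumulator tile reads m_r + n_r floats per k-iteration
    and performs 2 * m_r * n_r FLOPs (one multiply and one add each).
    """
    flops = 2.0 * m_r * n_r
    bytes_read = 4.0 * (m_r + n_r)  # FP32 loads from L1 per k-step
    return flops / bytes_read

for m_r, n_r in [(1, 4), (4, 4)]:
    ai = gemm_intensity(m_r, n_r)
    attainable = min(peak_gflops, ai * l1_read_gbps)
    bound = "compute" if ai >= machine_balance else "L1-read bandwidth"
    print(f"{m_r}x{n_r} tile: AI = {ai:.2f} FLOP/B, "
          f"attainable {attainable:.1f} GFLOP/s ({bound}-bound)")
```

Under these assumed numbers, a small 1x4 tile falls below the machine balance of 0.5 FLOP/byte and is capped by L1 read bandwidth, while a 4x4 tile would be compute-bound; this is the kind of reasoning that explains why measured kernels can sit on the L1-bandwidth roof rather than the compute roof.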

Last updated: 2021-02-02