Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors
Nature Machine Intelligence (IF 23.8) | Pub Date: 2021-06-21 | DOI: 10.1038/s42256-021-00356-5
Claudionor N. Coelho, Aki Kuusela, Shan Li, Hao Zhuang, Jennifer Ngadiuba, Thea Klaeboe Aarrestad, Vladimir Loncar, Maurizio Pierini, Adrian Alan Pol, Sioni Summers

Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton–proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of \({\mathcal{O}}(1)\,\upmu{\rm{s}}\) is required. Nanosecond inference and a resource consumption reduced by a factor of 50 when implemented on field-programmable gate array hardware are achieved.
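The core idea of the abstract, representing each layer's weights with fewer bits and choosing the bit width independently per layer, can be sketched in a few lines of NumPy. This is an illustrative uniform quantizer only, not the paper's automated per-layer, per-parameter-type search; the layer names and bit widths below are made-up examples.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantizer: snap weights onto a grid with
    2**(bits - 1) - 1 positive levels (plus zero and the negatives),
    mimicking a fixed-point representation with `bits` total bits."""
    levels = 2 ** (bits - 1) - 1           # e.g. 7 positive levels for 4 bits
    scale = np.max(np.abs(w)) / levels     # step size of the quantization grid
    return np.round(w / scale) * scale

# Heterogeneous quantization: each layer gets its own bit width,
# trading accuracy against model size and energy per layer.
rng = np.random.default_rng(42)
layers = {"dense_1": rng.normal(0.0, 0.5, 256),
          "dense_2": rng.normal(0.0, 0.5, 64)}
bit_widths = {"dense_1": 6, "dense_2": 3}  # illustrative per-layer choices

for name, w in layers.items():
    wq = quantize(w, bit_widths[name])
    err = np.max(np.abs(w - wq))
    print(f"{name}: {bit_widths[name]} bits, max quantization error {err:.4f}")
```

A coarser bit width yields a larger worst-case error but a proportionally smaller weight memory, which is the size/latency/accuracy trade-off the paper's automated procedure optimizes across layers.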




Updated: 2021-06-21