EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP
arXiv - CS - Hardware Architecture Pub Date : 2020-11-28 , DOI: arxiv-2011.14203
Thierry Tambe, Coleman Hooper, Lillian Pentecost, En-Yu Yang, Marco Donato, Victor Sanh, Alexander M. Rush, David Brooks, Gu-Yeon Wei

Transformer-based language models such as BERT provide significant accuracy improvements on a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth and principled algorithm and hardware design methodology for achieving minimal latency and energy consumption in multi-task NLP inference. Compared to the ALBERT baseline, we achieve up to 2.4x and 13.4x inference latency and memory savings, respectively, with less than a 1%-pt drop in accuracy on several GLUE benchmarks, by employing a calibrated combination of 1) entropy-based early stopping, 2) adaptive attention span, 3) movement and magnitude pruning, and 4) floating-point quantization. Furthermore, to maximize the benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a scalable hardware architecture wherein floating-point bit encodings of the shareable multi-task embedding parameters are stored in high-density non-volatile memory. Altogether, EdgeBERT enables fully on-chip inference acceleration of NLP workloads with 5.2x and 157x lower energy than an unoptimized accelerator and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
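To illustrate the first of the four techniques, the sketch below shows entropy-based early stopping in minimal Python: inference exits at the first transformer layer whose softmax prediction entropy falls below a confidence threshold. The function names and the threshold value are illustrative assumptions, not the paper's implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_layer(per_layer_probs, threshold=0.4):
    """Return the index of the first layer whose prediction entropy
    drops below the threshold, i.e. the layer where inference can
    stop early. Falls back to the final layer if no layer is
    confident enough."""
    for i, probs in enumerate(per_layer_probs):
        if entropy(probs) < threshold:
            return i
    return len(per_layer_probs) - 1

# A uniform distribution (max uncertainty) does not trigger an exit,
# but a peaked distribution at the next layer does.
layer_outputs = [[0.5, 0.5], [0.9, 0.1]]
print(early_exit_layer(layer_outputs))  # → 1
```

The latency saving comes from skipping all layers after the exit point, which is why the paper pairs this with a calibrated threshold per task: a looser threshold exits earlier at some cost in accuracy.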

Updated: 2020-12-01