Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing
arXiv - CS - Computation and Language. Pub Date: 2021-03-04, DOI: arxiv-2103.02800
Zejian Liu, Gang Li, Jian Cheng

BERT is the most recent Transformer-based model that achieves state-of-the-art performance on a wide range of NLP tasks. In this paper, we investigate hardware acceleration of BERT on FPGA for edge computing. To tackle its huge computational complexity and memory footprint, we propose a fully quantized BERT (FQ-BERT), in which the weights, activations, softmax, layer normalization, and all intermediate results are quantized. Experiments demonstrate that FQ-BERT achieves 7.94x compression of the weights with negligible performance loss. We then propose an accelerator tailored to FQ-BERT and evaluate it on Xilinx ZCU102 and ZCU111 FPGAs. It achieves a performance-per-watt of 3.18 fps/W, which is 28.91x and 12.72x higher than an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively.
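The abstract does not spell out the quantization scheme. As a rough illustration only, the sketch below shows uniform symmetric fixed-point quantization of a weight tensor, a common choice for fully quantized Transformer designs; the bit widths, per-tensor scaling, and function names are assumptions for illustration, not taken from the paper. Note that storing weights in roughly 4 bits instead of 32-bit floats gives an ideal 32/4 = 8x compression, in the same ballpark as the reported 7.94x once scale-factor and metadata overhead are counted.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Uniform symmetric quantization of a float tensor to signed integers.

    Per-tensor scale and bit width are illustrative assumptions; the
    paper's actual FQ-BERT scheme may differ (e.g. per-channel scales,
    mixed-precision weights, integer softmax/LayerNorm).
    """
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8-bit, 7 for 4-bit
    scale = float(np.max(np.abs(x))) / qmax     # map the largest magnitude to qmax
    if scale == 0.0:                            # guard against an all-zero tensor
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from integers and their scale."""
    return q.astype(np.float32) * scale

# Example: quantize a BERT-sized weight matrix and estimate the compression ratio.
w = np.random.randn(768, 768).astype(np.float32)
q, s = quantize(w, num_bits=4)                  # hypothetical 4-bit weights
ratio = (w.size * 32) / (q.size * 4)            # 32-bit floats -> 4-bit integers
print(f"ideal compression ~{ratio:.2f}x")       # ~8x before scale-factor overhead
```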

Updated: 2021-03-05