Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer,arXiv - CS - Hardware Architecture

Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer
arXiv - CS - Hardware Architecture, Pub Date: 2020-09-18, DOI: arxiv-2009.08605
Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang

Hardware accelerators for deep neural networks (DNNs) are in high demand. Nonetheless, most existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model has been replacing the RNN in natural language processing (NLP). However, because it involves intensive matrix computations and a complicated data flow, no hardware design for the Transformer model has been reported. In this paper, we propose the first hardware accelerator for two key components, the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. First, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Second, the computation flow is designed to ensure high hardware utilization of the systolic array, which is the largest module in our design. Third, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and the latency of the entire system. Our design is coded in a hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with a GPU implementation under the same settings, the proposed design demonstrates a speed-up of 14.6x in the MHA ResBlock and 3.4x in the FFN ResBlock. This work therefore lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.
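The abstract does not spell out the partitioning scheme, so the following is only a generic sketch of the underlying idea, not the authors' method: a large matrix product can be split into fixed-size weight tiles, and the same tiled routine can then serve both linear layers of a residual FFN. The names `tiled_matmul`, `ffn_resblock`, and the `tile` parameter are hypothetical, and the layer normalization that usually closes a Transformer ResBlock is left out for brevity.

```python
def matmul(a, b):
    """Reference product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def tiled_matmul(a, b, tile):
    """Compute a @ b one (tile x tile) block of b at a time.

    Each block is the unit of work a fixed-size compute array could handle
    (an assumption for illustration); partial sums accumulate into `out`.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):          # tile rows of b
        for c0 in range(0, cols, tile):       # tile columns of b
            for i in range(rows):
                for c in range(c0, min(c0 + tile, cols)):
                    acc = 0
                    for k in range(k0, min(k0 + tile, inner)):
                        acc += a[i][k] * b[k][c]
                    out[i][c] += acc
    return out


def ffn_resblock(x, w1, b1, w2, b2, mm):
    """Residual FFN: x + (relu(x @ w1 + b1) @ w2 + b2), using product `mm`.

    Layer normalization is intentionally omitted in this sketch.
    """
    hidden = [[max(0, v + bias) for v, bias in zip(row, b1)]
              for row in mm(x, w1)]
    proj = mm(hidden, w2)
    return [[xi + p + bias for xi, p, bias in zip(x_row, p_row, b2)]
            for x_row, p_row in zip(x, proj)]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]                       # 2 tokens, model width 2
    w1 = [[1, -1, 0, 2], [0, 1, -2, 1]]        # 2 x 4
    b1 = [0, 1, 0, -1]
    w2 = [[1, 0], [0, 1], [1, 1], [-1, 2]]     # 4 x 2
    b2 = [1, 0]
    tiled = lambda a, b: tiled_matmul(a, b, tile=2)
    # Both products feed the same residual block and agree exactly.
    print(ffn_resblock(x, w1, b1, w2, b2, matmul)
          == ffn_resblock(x, w1, b1, w2, b2, tiled))
```

Because both linear layers go through the same `mm` routine, the sketch mirrors, at a very coarse level, the claim that partitioning lets different blocks reuse one compute resource; the tile size here is arbitrary and says nothing about the array dimensions used in the paper.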

Updated: 2020-09-21
