Efficient Hardware Architectures for 1D- and MD-LSTM Networks
Journal of Signal Processing Systems (IF 1.8), Pub Date: 2020-07-02, DOI: 10.1007/s11265-020-01554-x
Vladimir Rybalkin, Chirag Sudarshan, Christian Weis, Jan Lappas, Norbert Wehn, Li Cheng

Recurrent Neural Networks, in particular One-dimensional and Multidimensional Long Short-Term Memory (1D-LSTM and MD-LSTM), have achieved state-of-the-art classification accuracy in many applications such as machine translation, image caption generation, handwritten text recognition, and medical imaging. However, high classification accuracy comes at the cost of high compute, storage, and memory-bandwidth requirements, which make deployment challenging, especially on energy-constrained platforms such as portable devices. Compared to CNNs, few investigations exist on efficient hardware implementations of 1D-LSTM, especially under energy constraints, and there is no research publication on a hardware architecture for MD-LSTM. In this article, we present two novel architectures for LSTM inference: a hardware architecture for MD-LSTM, and a DRAM-based Processing-in-Memory (DRAM-PIM) hardware architecture for 1D-LSTM. We present, for the first time, a hardware architecture for MD-LSTM and provide a trade-off analysis of accuracy versus hardware cost for various precisions. We implement the new architecture as an FPGA-based accelerator that outperforms an NVIDIA K80 GPU implementation by up to 84× in runtime and up to 1238× in energy efficiency on a challenging dataset for historical document image binarization from the DIBCO 2017 contest and on the well-known MNIST dataset for handwritten digit recognition. Our accelerator achieves the highest accuracy and comparable throughput relative to state-of-the-art FPGA-based implementations of multilayer perceptrons on the MNIST dataset. Furthermore, we present a new DRAM-PIM architecture for 1D-LSTM targeting energy-efficient compute platforms such as portable devices. The DRAM-PIM architecture integrates the computation units in close proximity to the DRAM cells in order to maximize data parallelism and energy efficiency. The proposed DRAM-PIM design is 16.19× more energy efficient than the FPGA implementation, with a total chip area overhead of 18% compared to a commodity 8 Gb DRAM chip. Our experiments show that the DRAM-PIM implementation delivers a throughput of 1309.16 GOp/s for an optical character recognition application.
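For readers unfamiliar with the cell computations that these architectures accelerate, the following is a minimal NumPy sketch of a standard 1D-LSTM cell and of a 2D MD-LSTM pass in raster-scan order. The variable names, the per-direction forget gates, and the scan order are common textbook conventions assumed here for illustration only; they are not details of the paper's hardware design.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """Standard 1D-LSTM cell with gates i, f, o and candidate g.
    W: (4H, X) input weights, U: (4H, H) recurrent weights, b: (4H,) bias."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def md_lstm_2d(X, W, U_x, U_y, b):
    """2D MD-LSTM over an image in raster-scan order: each pixel receives
    recurrent state from its left and top neighbours, with one forget gate
    per direction (a common formulation; details vary between papers).
    X: (rows, cols, D), W: (5H, D), U_x, U_y: (5H, H), b: (5H,)."""
    rows, cols, _ = X.shape
    H = b.shape[0] // 5          # gates: i, f_x, f_y, o, g
    h = np.zeros((rows, cols, H))
    c = np.zeros((rows, cols, H))
    for y in range(rows):
        for x in range(cols):
            h_left = h[y, x-1] if x > 0 else np.zeros(H)
            c_left = c[y, x-1] if x > 0 else np.zeros(H)
            h_top  = h[y-1, x] if y > 0 else np.zeros(H)
            c_top  = c[y-1, x] if y > 0 else np.zeros(H)
            z = W @ X[y, x] + U_x @ h_left + U_y @ h_top + b
            i   = sigmoid(z[0:H])
            f_x = sigmoid(z[H:2*H])      # forget gate for the horizontal direction
            f_y = sigmoid(z[2*H:3*H])    # forget gate for the vertical direction
            o   = sigmoid(z[3*H:4*H])
            g   = np.tanh(z[4*H:5*H])
            c[y, x] = f_x * c_left + f_y * c_top + i * g
            h[y, x] = o * np.tanh(c[y, x])
    return h

The sequential dependency of each pixel on its left and top neighbours is what makes MD-LSTM acceleration nontrivial and motivates the dedicated hardware architecture discussed above.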




Updated: 2020-07-02