An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning
IEEE Open Journal of Circuits and Systems ( IF 2.4 ) Pub Date : 2021-01-25 , DOI: 10.1109/ojcas.2020.3043737
Nitheesh Kumar Manjunath , Aidin Shiri , Morteza Hosseini , Bharat Prakash , Nicholas R. Waytowich , Tinoosh Mohsenin

In EdgeAI embedded devices that exploit reinforcement learning (RL), it is essential to reduce the number of actions the agent takes in the real world and to minimize the compute-intensive policy learning process. Convolutional autoencoders (AEs) have been shown to substantially speed up policy learning when attached to an RL agent, by compressing the high-dimensional input data into a small latent representation that is fed to the agent. Despite reducing policy learning time, an AE adds significant computational and memory complexity to the model, increasing both the total computation and the model size. In this article, we propose a model that speeds up the policy learning of an RL agent using AE neural networks with binary and ternary precision, addressing the high complexity overhead without degrading the policy the agent learns. Binary Neural Networks (BNNs) and Ternary Neural Networks (TNNs) compress weights into 1- and 2-bit representations, which significantly reduces model size and memory footprint and simplifies multiply-accumulate (MAC) operations. We evaluate our model in three RL environments, DonkeyCar, Miniworld Sidewalk, and Miniworld Object Pickup, which emulate real-world applications at different levels of complexity. With proper hyperparameter optimization and architecture exploration, TNN models achieve nearly the same average reward, Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE) as the full-precision model, while reducing model size by 10x compared to full precision and 3x compared to BNNs. In BNN models, however, the average reward drops by 12%-25% relative to full precision, even after increasing the model size by 4x.
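The 2-bit weight compression described above can be illustrated with a minimal sketch of threshold-based ternary quantization, in the style of Ternary Weight Networks. The threshold factor, the per-tensor scale, and the function name are illustrative assumptions; the paper's exact quantizer may differ.

```python
import numpy as np

def ternarize(w, delta_scale=0.7):
    """Quantize a float weight tensor to {-alpha, 0, +alpha} (2-bit codes).

    Threshold-based sketch: weights with |w| below a threshold delta are
    zeroed; the rest keep their sign, scaled by a shared factor alpha.
    Illustrative only -- not the authors' exact scheme.
    """
    delta = delta_scale * np.mean(np.abs(w))     # sparsity threshold
    mask = np.abs(w) > delta                     # positions that stay nonzero
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0  # per-tensor scale
    return alpha * np.sign(w) * mask

w = np.array([0.9, -0.05, 0.4, -0.8, 0.02])
print(ternarize(w))   # small weights -> 0, large weights -> +/- alpha
```

Because every nonzero weight shares one magnitude, each MAC reduces to an add, a subtract, or a skip, followed by a single multiplication by alpha per output, which is the simplification the hardware accelerator exploits.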
We designed and implemented a scalable hardware accelerator that is configurable in the number of processing elements (PEs) and the memory data width to achieve the best power, performance, and energy-efficiency trade-off for EdgeAI embedded devices. The proposed hardware, implemented on an Artix-7 FPGA, dissipates 250 μJ of energy while meeting the 30 frames per second (FPS) throughput requirement, and can be configured to reach an efficiency of over 1 TOP/J. Synthesized and placed-and-routed in 14 nm FinFET ASIC technology, the accelerator brings the energy dissipation down to 3.9 μJ with a maximum throughput of 1,250 FPS. Compared to state-of-the-art TNN implementations on the same target platforms, our hardware is 5x and 4.4x (2.2x if technology-scaled) more energy efficient on FPGA and ASIC, respectively.
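A back-of-envelope check relates the reported energy and throughput figures, under the assumption (not stated explicitly in the abstract) that the 250 μJ and 3.9 μJ figures are per-frame energies:

```python
# Sanity check on the abstract's figures, assuming energy is per frame.
fpga_energy_per_frame = 250e-6        # J, Artix-7 FPGA implementation
fpga_fps = 30                          # required throughput
fpga_power = fpga_energy_per_frame * fpga_fps          # average power, W
print(f"FPGA average power at 30 FPS: {fpga_power * 1e3:.1f} mW")

asic_energy_per_frame = 3.9e-6        # J, 14 nm FinFET ASIC
asic_fps = 1250                        # maximum throughput
asic_power = asic_energy_per_frame * asic_fps
print(f"ASIC average power at 1250 FPS: {asic_power * 1e3:.2f} mW")
```

Under this assumption the FPGA averages 7.5 mW at the target frame rate, and the ASIC stays under 5 mW even at its peak 1,250 FPS, consistent with the milliwatt-scale budgets typical of EdgeAI devices.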
