Reconfigurable co-processor architecture with limited numerical precision to accelerate deep convolutional neural networks
arXiv - CS - Hardware Architecture Pub Date : 2021-08-21 , DOI: arxiv-2109.03040
Sasindu Wijeratne, Sandaruwan Jayaweera, Mahesh Dananjaya, Ajith Pasqual

Convolutional Neural Networks (CNNs) are widely used in deep learning applications, e.g. visual systems and robotics. However, existing software solutions are not efficient. Therefore, many hardware accelerators have been proposed to optimize the performance, power, and resource utilization of the implementation. Among existing solutions, Field Programmable Gate Array (FPGA) based architectures provide better cost-energy-performance trade-offs as well as scalability and reduced development time. In this paper, we present a model-independent reconfigurable co-processing architecture to accelerate CNNs. Our architecture consists of parallel Multiply and Accumulate (MAC) units with caching techniques and interconnection networks to exploit maximum data parallelism. In contrast to existing solutions, we introduce limited-precision 32-bit Q-format fixed-point quantization for arithmetic representations and operations. As a result, our architecture achieves a significant reduction in resource utilization with competitive accuracy. Furthermore, we developed assembly-type microinstructions to access the co-processing fabric and manage layer-wise parallelism, thereby reusing limited resources. Finally, we tested our architecture with kernel sizes up to 9x9 on a Xilinx Virtex 7 FPGA, achieving a throughput of up to 226.2 GOp/s for a 3x3 kernel.
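The core operation the architecture parallelizes is the fixed-point multiply-accumulate. The abstract specifies a 32-bit Q-format but not the integer/fraction split, so the sketch below assumes Q16.16 (16 integer bits, 16 fractional bits) purely for illustration; it shows quantization, a single MAC step with the rescaling shift, and a 3x3 convolution window (the kernel size used for the reported throughput figure). Overflow of the 32-bit accumulator is not modeled.

```python
# Minimal sketch of a Q-format fixed-point MAC, assuming a Q16.16 split
# (the paper states 32-bit Q-format but not the exact split).

FRAC_BITS = 16          # fractional bits in the assumed Q16.16 format
SCALE = 1 << FRAC_BITS  # 2^16

def to_q(x: float) -> int:
    """Quantize a float to 32-bit Q16.16 fixed point."""
    return int(round(x * SCALE))

def from_q(q: int) -> float:
    """Convert a Q16.16 value back to float."""
    return q / SCALE

def q_mac(acc: int, a: int, b: int) -> int:
    """One MAC step: acc += a * b, all operands in Q16.16.
    The product of two Q16.16 values is in Q32.32, so shift right
    by FRAC_BITS to return to Q16.16 before accumulating."""
    return acc + ((a * b) >> FRAC_BITS)

# A 3x3 convolution at one output position is 9 MAC steps
# (hypothetical kernel/window values, row-major order):
kernel = [0.5, -0.25, 0.125, 1.0, 0.75, -0.5, 0.25, 0.0, -1.0]
window = [1.0, 2.0, 0.5, -1.0, 0.25, 4.0, 2.0, 3.0, 1.0]

acc = 0
for k, w in zip(kernel, window):
    acc = q_mac(acc, to_q(k), to_q(w))
print(from_q(acc))  # -3.25 (exact here, since all values fit Q16.16)
```

In hardware, each of these MAC steps maps to one of the parallel MAC units; the right-shift rescaling is what keeps the 32-bit representation fixed across layers, which is where the resource savings over floating point come from.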

Updated: 2021-09-08