A configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on single DMA device
Microprocessors and Microsystems (IF 2.6), Pub Date: 2020-06-08, DOI: 10.1016/j.micpro.2020.103174
Zhangqin Huang, Shuo Zhang, Han Gao, Xiaobo Zhang, Shengqi Yang

To reduce DMA utilization for multiple algorithm IPs on an FPGA, a channel-configurable and multiplexed DMA device (CMDMA) is proposed for asynchronous and heterogeneous algorithm IPs. First, we abstract the entities and data flow in the CMDMA system with a formal description for function definition and workflow analysis. Then, based on these functions and the workflow, we design and implement a CMDMA prototype, which includes the CMDMA software driver (SW) and the hardware circuits (HW) of one DMA IP, a configurable input switch (CISwitch), the algorithm IPs, and an asynchronous output switch (AOSwitch). The configurable function of CMDMA is implemented at the HW level by the CISwitch through a configuration port, and a configurable Round-Robin (CRR) algorithm is proposed to schedule channels and input data at the SW level. For output, a channel-distinguishable output buffer (ChnDistBuf) is proposed, which delivers the channel ID and data size to the SW before an algorithm IP finishes. With a double-interrupt coordination method involving both ChnDistBuf and the algorithm IPs, CMDMA is able to successively store complete output data from different algorithm IPs. Experiments with 4 heterogeneous matrix multiplication algorithm IPs on a Xilinx Zynq platform show that CMDMA improves the average acceleration rate of a single algorithm IP by about 8%–29% compared with the exclusive method, in which one DMA serves only one algorithm IP, and that it increases DMA input and output data throughput by about 10–40 MB/s and 5–15 MB/s, respectively, when multiple algorithm IPs run in parallel. Moreover, the additional LUT and FF resources required by CMDMA are 756 and 1219, respectively, each about 1% of the Zynq platform's resources. In addition, in a test with two CNN algorithm IPs on an MNIST application, an enhanced data broadcasting function in CMDMA is 4 s faster than a system with 4 exclusive DMAs running in parallel, while using 3 fewer DMA devices and consuming 0.03 W less power.
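The abstract does not give the details of the CRR algorithm, so the following is only a minimal sketch of the general idea of a configurable round-robin schedule over input channels, not the authors' implementation. The class name `ConfigurableRoundRobin`, its methods, and the rule that a disabled channel is simply skipped are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ConfigurableRoundRobin:
    """Hypothetical round-robin scheduler over the channels currently enabled."""

    def __init__(self, num_channels: int) -> None:
        self.num_channels = num_channels
        # Assumption: every channel starts disabled until it is configured.
        self.enabled: List[bool] = [False] * num_channels
        # One FIFO of pending input blocks per channel.
        self.queues: List[Deque[bytes]] = [deque() for _ in range(num_channels)]
        # Channel examined first on the next scheduling decision.
        self._next = 0

    def configure(self, channel: int, enabled: bool) -> None:
        """Enable or disable a channel (stand-in for a configuration port)."""
        self.enabled[channel] = enabled

    def submit(self, channel: int, block: bytes) -> None:
        """Queue one input block for a channel."""
        self.queues[channel].append(block)

    def next_transfer(self) -> Optional[Tuple[int, bytes]]:
        """Return (channel, block) for the next transfer, or None if idle."""
        for offset in range(self.num_channels):
            ch = (self._next + offset) % self.num_channels
            if self.enabled[ch] and self.queues[ch]:
                # Resume after the served channel so no channel is starved.
                self._next = (ch + 1) % self.num_channels
                return ch, self.queues[ch].popleft()
        return None


# Usage: four channels, with channel 1 left disabled.
sched = ConfigurableRoundRobin(4)
for ch in (0, 2, 3):
    sched.configure(ch, True)
sched.submit(0, b"a0")
sched.submit(0, b"a1")
sched.submit(1, b"b0")  # stays queued because channel 1 is disabled
sched.submit(2, b"c0")

order = []
while (item := sched.next_transfer()) is not None:
    order.append(item)
print(order)
```

In this sketch the single shared DMA would serve channel 0, then channel 2, then channel 0 again, while the block queued on the disabled channel waits until that channel is enabled.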




Updated: 2020-06-08