RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms
arXiv - CS - Hardware Architecture Pub Date : 2020-02-17 , DOI: arxiv-2002.06998
Niansong Zhang, Xiang Chen, Nachiket Kapre

Evolutionary algorithms can outperform conventional placement algorithms such as simulated annealing and analytical placement, as well as manual placement, on metrics such as runtime, wirelength, pipelining cost, and clock frequency when mapping hard-block-intensive FPGA designs such as systolic arrays onto Xilinx UltraScale+ FPGAs. For certain hard-block-intensive systolic array accelerator designs, the commercial-grade Xilinx Vivado CAD tool is unable to provide a legal routing solution without tedious manual placement constraints. Instead, we formulate automatic FPGA placement of these hard blocks as a multi-objective optimization problem that targets the wirelength-squared and maximum bounding box size metrics. We build an end-to-end placement and routing flow called RapidLayout using the Xilinx RapidWright framework. RapidLayout runs 5-6$\times$ faster than Vivado with manual constraints and eliminates the weeks-long effort of generating placement constraints manually for the hard blocks. We also perform automated post-placement pipelining of the long wires inside each convolution block to target 650MHz URAM-limited operation. RapidLayout (1) outperforms the simulated annealer in VPR by 33% in runtime, 1.9-2.4$\times$ in wirelength, and 3-4$\times$ in bounding box size, and (2) beats the analytical placer UTPlaceF by 9.3$\times$ in runtime, 1.8-2.2$\times$ in wirelength, and 2-2.7$\times$ in bounding box size. We employ transfer learning from a base FPGA device to speed up placement optimization for similar FPGA devices in the UltraScale+ family by 11-14$\times$ compared with learning the placements from scratch.
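
To make the formulation in the abstract concrete, the following is a minimal sketch of the two objectives (squared wirelength and maximum bounding box size) driven by a simple elitist evolutionary loop. It is an illustration only, not the RapidLayout implementation: the function names, the Manhattan-distance wirelength, the half-perimeter bounding box, the weighted-sum scalarization, and the swap/move mutation are all assumptions, and the sketch ignores the fact that real hard blocks (DSP, BRAM, URAM) can only be placed on sites of their own type.

```python
import random

def objectives(placement, nets, groups):
    """Return (sum of squared wirelength, max bounding box size).

    placement: dict mapping block name -> (x, y) site (hypothetical model)
    nets:      list of (block_a, block_b) two-pin connections
    groups:    list of block-name lists, e.g. one list per convolution block
    """
    # Assumption: wirelength of a net is the Manhattan distance of its pins.
    wl2 = sum((abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1])) ** 2
              for a, b in nets)
    # Assumption: bounding box size is width + height of the group's rectangle.
    max_bbox = 0
    for group in groups:
        xs = [placement[b][0] for b in group]
        ys = [placement[b][1] for b in group]
        max_bbox = max(max_bbox, (max(xs) - min(xs)) + (max(ys) - min(ys)))
    return wl2, max_bbox

def mutate(placement, sites, rng):
    """Move one block to a free site, or swap the sites of two blocks."""
    child = dict(placement)
    block = rng.choice(sorted(child))
    used = set(child.values())
    free = [s for s in sites if s not in used]
    if free and rng.random() < 0.5:
        child[block] = rng.choice(free)
    else:
        other = rng.choice(sorted(child))
        child[block], child[other] = child[other], child[block]
    return child

def evolve(blocks, sites, nets, groups,
           generations=200, pop_size=8, bbox_weight=1.0, seed=0):
    """Elitist (mu + lambda) loop over legal placements (illustrative only)."""
    rng = random.Random(seed)

    def cost(p):
        # Assumption: a weighted sum stands in for true multi-objective search.
        wl2, bbox = objectives(p, nets, groups)
        return wl2 + bbox_weight * bbox

    population = [dict(zip(blocks, rng.sample(sites, len(blocks))))
                  for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(p, sites, rng) for p in population]
        # Keep the best pop_size individuals among parents and children.
        population = sorted(population + children, key=cost)[:pop_size]
    return population[0]
```

Because parents compete with their children for survival, the best cost in the population never gets worse from one generation to the next. A Pareto-based selection scheme could replace the weighted sum if both objectives need to be traded off explicitly.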

Updated: 2020-07-21