A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs
ACM Transactions on Architecture and Code Optimization (IF 1.5), Pub Date: 2020-05-30, DOI: 10.1145/3390523
Qinggang Wang 1 , Long Zheng 1 , Jieshan Zhao 1 , Xiaofei Liao 1 , Hai Jin 1 , Jingling Xue 2

FPGA-based graph processing accelerators are nowadays equipped with multiple pipelines for hardware acceleration of graph computations. However, their multi-pipeline efficiency can suffer greatly from the considerable overheads caused by read/write conflicts in on-chip BRAM between different pipelines, leading to significant performance degradation and poor scalability. In this article, we investigate the underlying causes of such inter-pipeline read/write conflicts by focusing on multi-pipeline FPGAs for accelerating Sparse Matrix Vector Multiplication (SpMV) arising in graph processing. We exploit our key insight that the problem of eliminating inter-pipeline read/write conflicts for SpMV can be formulated as one of solving a row- and column-wise tiling problem for its associated adjacency matrix. However, partitioning a sparse adjacency matrix obtained from an arbitrary graph across a set of pipelines, so that all inter-pipeline read/write conflicts are eliminated and all pipelines remain reasonably load-balanced, is challenging. We present a conflict-free scheduler, WaveScheduler, that can dispatch different sub-matrix tiles to different pipelines without any read/write conflict. We also introduce two optimizations that are specifically tailored for graph processing and apply to all the pipelines: “degree-aware vertex index renaming” for improving load balancing and “data re-organization” for enabling sequential off-chip memory access. Our evaluation on a Xilinx® Alveo™ U250 accelerator card with 16 pipelines shows that WaveScheduler can achieve up to 3.57 GTEPS and, on average, runs 6.48× faster than native scheduling and 2.54× and 2.11× faster than two state-of-the-art FPGA-based graph accelerators, HEGP and ForeGraph, respectively. In particular, these performance gains also scale up significantly as the number of pipelines increases.
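
The row- and column-wise tiling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's WaveScheduler algorithm: the diagonal assignment below, the unit edge weights, and the names `tile_edges`, `diagonal_waves`, and `spmv_by_waves` are all assumptions made for illustration. The adjacency matrix is cut into a P × P grid of tiles for P pipelines; in each wave, every pipeline receives a tile whose row block (destination vertices, written) and column block (source vertices, read) are not shared with any other pipeline in that wave.

```python
from collections import defaultdict


def tile_edges(edges, num_vertices, num_pipelines):
    """Bucket (src, dst) edges into a P x P grid of tiles keyed by (row, col) block."""
    block = -(-num_vertices // num_pipelines)  # ceiling division
    tiles = defaultdict(list)
    for src, dst in edges:
        # Row block = destination (written), column block = source (read).
        tiles[(dst // block, src // block)].append((src, dst))
    return tiles


def diagonal_waves(num_pipelines):
    """In wave w, pipeline p is assigned tile (p, (p + w) mod P)."""
    p_count = num_pipelines
    return [[(p, (p + w) % p_count) for p in range(p_count)]
            for w in range(p_count)]


def spmv_by_waves(edges, x, num_pipelines):
    """Compute y[dst] += x[src] over all edges, one wave of tiles at a time."""
    tiles = tile_edges(edges, len(x), num_pipelines)
    y = [0] * len(x)
    for wave in diagonal_waves(num_pipelines):
        # Tiles within a wave share no row or column block, so on hardware
        # they could run concurrently; this sketch runs them sequentially.
        for tile in wave:
            for src, dst in tiles.get(tile, []):
                y[dst] += x[src]
    return y


# Hypothetical 4-vertex graph, 2 pipelines, unit edge weights.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (3, 1)]
x = [1, 2, 3, 4]
y = spmv_by_waves(edges, x, num_pipelines=2)
```

This sketch addresses only conflict freedom. It does nothing about load balance, since tiles of a skewed graph can differ greatly in edge count, which is the problem the abstract's degree-aware vertex index renaming is meant to address.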

Updated: 2020-05-30