Efficient FPGA-based graph processing with hybrid pull-push computational model

Yang, Chengbo; Zheng, Long; Gui, Chuangyi; Jin, Hai

doi:10.1007/s11704-019-9020-5

Research Article
Published: 03 January 2020

Volume 14, article number 144102, (2020)
Frontiers of Computer Science

Abstract

A hybrid pull-push computational model can deliver better results than either model alone when processing real-world graphs. The programmability and pipeline parallelism of FPGAs make them well suited to processing the different stages of graph iterations. Nevertheless, given the limited on-chip resources and streamlined pipeline computation, the efficiency of the hybrid model on FPGAs often suffers from the well-known random-access behavior of graph processing. In this paper, we present a hybrid graph processing system on FPGAs that achieves the best of both worlds. Our approach is novel in two respects. First, we propose the edge block (a set of edges that share the same destination vertex set), which allows edges to be accessed sequentially at block granularity for locality while still preserving precision. Because blocks are independent, in the sense that all edges in an inactive block are associated with inactive vertices, such blocks can also be skipped to reduce redundant computation. Second, we consider a large number of vertices and their associated edge blocks to maintain a predictable execution history, and we use their state statistics to switch models in advance with few stalls. Our evaluation on a wide variety of graph algorithms over many real-world graphs shows that our approach achieves up to 3.69x speedup over state-of-the-art FPGA-based graph processing systems.
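The two ideas in the abstract, edge blocks and pull-push switching, can be illustrated with a small software sketch. It is not the authors' FPGA design. It assumes breadth-first search as the algorithm, contiguous destination-vertex ranges as the "destination vertex set" of a block, and a simple frontier-size threshold in place of the state statistics the paper uses for switching. The names `build_edge_blocks`, `hybrid_bfs`, `block_size`, and `pull_threshold` are hypothetical.

```python
from collections import defaultdict


def build_edge_blocks(edges, num_vertices, block_size):
    """Group edges by destination range (assumed block layout).

    Block b holds every edge whose destination lies in
    [b * block_size, (b + 1) * block_size).
    """
    num_blocks = (num_vertices + block_size - 1) // block_size
    blocks = [[] for _ in range(num_blocks)]
    for src, dst in edges:
        blocks[dst // block_size].append((src, dst))
    return blocks


def hybrid_bfs(edges, num_vertices, root, block_size=2, pull_threshold=0.25):
    """BFS that switches between push and pull on each iteration.

    Returns (level, skipped_blocks). The switching rule is an assumption:
    pull when the frontier exceeds pull_threshold * num_vertices.
    """
    blocks = build_edge_blocks(edges, num_vertices, block_size)
    out_adj = defaultdict(list)
    for src, dst in edges:
        out_adj[src].append(dst)

    level = [-1] * num_vertices
    level[root] = 0
    frontier = {root}
    depth = 0
    skipped_blocks = 0

    while frontier:
        next_frontier = set()
        if len(frontier) > pull_threshold * num_vertices:
            # Pull: scan edge blocks sequentially, one block at a time.
            for b, block in enumerate(blocks):
                lo = b * block_size
                hi = min(lo + block_size, num_vertices)
                # A block is inactive once all its destinations are visited,
                # so none of its edges can produce an update: skip it.
                if all(level[v] != -1 for v in range(lo, hi)):
                    skipped_blocks += 1
                    continue
                for src, dst in block:
                    if level[dst] == -1 and src in frontier:
                        next_frontier.add(dst)
        else:
            # Push: only the out-edges of frontier vertices are touched.
            for src in frontier:
                for dst in out_adj[src]:
                    if level[dst] == -1:
                        next_frontier.add(dst)
        depth += 1
        for v in next_frontier:
            level[v] = depth
        frontier = next_frontier

    return level, skipped_blocks


if __name__ == "__main__":
    demo_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    levels, skipped = hybrid_bfs(demo_edges, num_vertices=6, root=0)
    print(levels, skipped)
```

Setting `pull_threshold=0.0` forces pull on every iteration and `pull_threshold=1.0` forces push, which makes it easy to check that both directions produce the same levels and to see how many blocks the pull direction skips.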

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018YFB1003502) and the National Natural Science Foundation of China (Grant Nos. 61825202, 61832006, and 61702201).

Author information

Authors and Affiliations

National Engineering Research Center for Big Data Technology and System/Service Computing Technology and System Lab/Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Chengbo Yang, Long Zheng, Chuangyi Gui & Hai Jin

Corresponding author

Correspondence to Long Zheng.

Additional information

Chengbo Yang is now pursuing his master's degree in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. His research interests include graph processing and reinforcement learning.

Long Zheng is now a postdoctoral researcher in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. He received his PhD degree at HUST in 2016. His current research interests include program analysis, runtime systems, and configurable computer architecture with a particular focus on graph processing.

Chuangyi Gui is currently a PhD candidate in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. His current research interests include graph processing and reconfigurable computing.

Hai Jin is a Cheung Kong Scholars Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST), China. He received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. He worked at The University of Hong Kong, China between 1998 and 2000, and as a visiting scholar at the University of Southern California, USA between 1999 and 2000. He was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. He is the chief scientist of ChinaGrid, the largest grid computing project in China, and the chief scientist of the National 973 Basic Research Program Project of Virtualization Technology of Computing System, and Cloud Security. He is an IEEE Fellow and a member of the ACM. He has co-authored 15 books and published over 600 research papers. His research interests include computer architecture, virtualization technology, cluster computing, cloud computing, peer-to-peer computing, network storage, and network security.

Electronic supplementary material