Abstract
Data processing systems based on FPGAs offer high performance and energy efficiency for a variety of applications. However, these advantages are achieved through highly specialized designs. The high degree of specialization leads to accelerators with narrow functionality and designs adhering to a rigid execution flow. For multi-tenant systems this limits the scope of applicability of FPGA-based accelerators, because, first, supporting a single operation is unlikely to have any significant impact on the overall performance of the system, and, second, serving multiple users satisfactorily is difficult due to simplistic scheduling policies enforced when using the accelerator. Standard operating system and database management system features that would help address these limitations, such as context-switching, preemptive scheduling, and thread migration, are practically non-existent in current FPGA accelerator efforts.
In this work, we propose PipeArch, an open-source project for developing FPGA-based accelerators that combine the high efficiency of specialized hardware designs with the generality and functionality known from conventional CPU threads. PipeArch provides programmability and extensibility in the accelerator without losing the advantages of SIMD-parallelism and deep pipelining. PipeArch supports context-switching and thread migration, thereby enabling for the first time new capabilities such as preemptive scheduling in FPGA accelerators within a high-performance data processing setting. We have used PipeArch to implement a variety of machine learning methods for generalized linear model training and recommender systems, showing empirically their advantages over a high-end CPU and even over fully specialized FPGA designs.