Floating Point CGRA based Ultra-Low Power DSP Accelerator

Prasad, Rohit; Das, Satyajit; Martin, Kevin J. M.; Coussy, Philippe

doi:10.1007/s11265-020-01630-2

Published: 22 January 2021

Volume 93, pages 1159–1171, (2021)

Journal of Signal Processing Systems

Abstract

Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging in both academia and industry as energy-efficient accelerators that provide a high degree of flexibility. However, given recent advances in algorithms and the performance requirements of applications, supporting only integer and logical operations limits the appeal of traditional CGRAs. In this paper, we propose a novel CGRA architecture and an associated compilation flow that support both integer and floating-point computation for energy-efficient acceleration of DSP applications. Experimental results show that the proposed accelerator achieves up to a 4.61× speedup over a DSP-optimized, ultra-low-power RISC-V based CPU, with an area overhead of 1.9×, while executing seizure detection, a representative of a wide range of EEG signal processing applications. The proposed CGRA is up to 6.5× more energy efficient than the single-core CPU. Compared with execution on the 8-core multi-core CPU, the proposed CGRA achieves up to a 4.4× energy gain.

Notes

To maintain consistency, a single PULP-cluster template with 8 RI5CY cores was used for all experiments in this paper. The PULP-cluster automatically disables cores that are not in use.
The PULP-cluster includes a shared FPU cluster consisting of 4 FPUs, and it automatically disables the FPUs that are not in use.


Author information

Authors and Affiliations

Lab-STICC, UMR 6285, University of Bretagne-Sud, Lorient, 56100, France
Rohit Prasad, Satyajit Das, Kevin J. M. Martin & Philippe Coussy
Indian Institute of Technology (IIT) Palakkad, Palakkad, India
Satyajit Das

Corresponding author

Correspondence to Rohit Prasad.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Prasad, R., Das, S., Martin, K.J.M. et al. Floating Point CGRA based Ultra-Low Power DSP Accelerator. J Sign Process Syst 93, 1159–1171 (2021). https://doi.org/10.1007/s11265-020-01630-2

Received: 17 July 2020
Revised: 04 November 2020
Accepted: 16 December 2020
Published: 22 January 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s11265-020-01630-2
