AxSA: On the Design of High-Performance and Power-Efficient Approximate Systolic Arrays for Matrix Multiplication

Journal of Signal Processing Systems

Abstract

Compute-bound problems like matrix-matrix multiplication can be accelerated using special-purpose hardware schemes such as Systolic Arrays (SAs). However, the processing elements in SAs have a long critical path delay, which limits the performance benefits of SAs. This paper presents a scheme to achieve high-performance matrix multiplication using SAs. Two approximate matrix multiplier designs (Ax1 and Ax2) with variable accuracy/power are proposed. The proposed designs (8-bit) achieve an improvement of 32% in critical path delay, and for scaled-up variants (32-bit) the improvements in delay and energy reach up to 64% and 51%, respectively. Moreover, Ax1 and Ax2 have a reduced power-delay product compared to previous approximate matrix multiplier designs. This improves the resolution of the prior accuracy-energy Pareto front; therefore, we define a new Pareto front for approximate matrix multipliers. As a case study, the discrete cosine transform is evaluated: Ax2 achieves the best quality-power trade-off, exhibiting a 5% degradation in the structural similarity index (SSIM) with a power saving of 28%.
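To make the underlying dataflow concrete, the following is a minimal functional sketch (Python/NumPy) of an output-stationary systolic-array matrix multiply, the kind of structure the proposed designs target. The skewed operand feed, register naming, and use of exact multiplication inside each processing element are illustrative assumptions for this sketch; the paper's approximate multipliers Ax1 and Ax2 and its actual hardware design are not reproduced here.

# Minimal functional model of an n x n output-stationary systolic array
# computing C = A @ B. Illustrative only: each processing element (PE)
# uses an exact multiply; the Ax1/Ax2 approximate multipliers are not modelled.
import numpy as np

def skew(M):
    """Delay row i of M by i cycles so operands reach the array edge in
    the staggered (systolic) order."""
    n = M.shape[0]
    feed = np.zeros((3 * n - 2, n), dtype=float)
    for i in range(n):
        feed[i:i + n, i] = M[i, :]
    return feed

def systolic_matmul(A, B):
    """Each PE(i, j) accumulates c[i, j] += a * b every cycle while
    forwarding its A operand to the right and its B operand downward."""
    n = A.shape[0]
    a_feed = skew(A)            # rows of A enter from the left edge
    b_feed = skew(B.T)          # columns of B enter from the top edge
    a_reg = np.zeros((n, n))    # horizontal pipeline registers
    b_reg = np.zeros((n, n))    # vertical pipeline registers
    C = np.zeros((n, n))
    for cycle in range(3 * n - 2):
        a_reg = np.roll(a_reg, 1, axis=1)   # shift A operands right by one PE
        b_reg = np.roll(b_reg, 1, axis=0)   # shift B operands down by one PE
        a_reg[:, 0] = a_feed[cycle]         # inject fresh A operands at column 0
        b_reg[0, :] = b_feed[cycle]         # inject fresh B operands at row 0
        C += a_reg * b_reg                  # one multiply-accumulate per PE per cycle
    return C

A = np.random.randint(0, 16, (4, 4)).astype(float)
B = np.random.randint(0, 16, (4, 4)).astype(float)
assert np.allclose(systolic_matmul(A, B), A @ B)

The per-cycle multiply-accumulate inside each PE is the critical path the paper targets; Ax1 and Ax2 shorten it by approximating that multiplication.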




Acknowledgements

This work is supported by grants from the National Natural Science Foundation of China (61871216), the Six Talent Peaks Project in Jiangsu Province (2018-XYDXX-009) and the Fundamental Research Funds for the Central Universities China (No. NE2019102).

Author information

Corresponding author

Correspondence to Weiqiang Liu.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Waris, H., Wang, C., Liu, W. et al. AxSA: On the Design of High-Performance and Power-Efficient Approximate Systolic Arrays for Matrix Multiplication. J Sign Process Syst 93, 605–615 (2021). https://doi.org/10.1007/s11265-020-01582-7

