Abstract
Compute-bound problems such as matrix-matrix multiplication can be accelerated using special-purpose hardware schemes such as Systolic Arrays (SAs). However, the processing elements in SAs have a long critical path delay, which limits the performance benefits of SAs. This paper presents a scheme to achieve high-performance matrix multiplication using SAs. Two approximate matrix multiplier designs (Ax1 and Ax2) of variable accuracy/power are proposed. The proposed 8-bit designs achieve a 32% improvement in critical path delay, and for the scaled-up 32-bit variants the improvements in delay and energy reach up to 64% and 51%, respectively. Moreover, Ax1 and Ax2 have a lower power-delay product than previous approximate matrix multiplier designs. This improves the resolution of the prior accuracy-energy Pareto front; therefore, we define a new Pareto front for approximate matrix multipliers. As a case study, the discrete cosine transform is evaluated. Ax2 achieves the best quality-power trade-off: it exhibits a 5% degradation in structural similarity index (SSIM) with a power saving of 28%.
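To make the systolic-array setting concrete, the following is a minimal Python sketch of a cycle-by-cycle, output-stationary systolic array for square matrices, with a pluggable multiplier in each processing element. It is an illustration under stated assumptions, not the paper's design: the function names `systolic_matmul` and `truncated_multiply` are hypothetical, and the operand-truncation multiplier is only a generic stand-in for an approximate multiplier. It does not reproduce Ax1 or Ax2, whose internals are not described in the abstract.

```python
def truncated_multiply(a, b, dropped_bits=0):
    """Hypothetical approximate multiplier (NOT Ax1/Ax2): zero the lowest
    `dropped_bits` bits of each non-negative operand, then multiply exactly."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)


def systolic_matmul(A, B, multiply=lambda a, b: a * b):
    """Simulate an n x n output-stationary systolic array computing A @ B.

    Rows of A enter from the left and columns of B enter from the top, each
    skewed by one cycle per row/column. Every processing element (PE)
    multiplies the two operands it currently holds and accumulates locally.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # accumulator in each PE
    a_reg = [[0] * n for _ in range(n)]  # operands flowing left to right
    b_reg = [[0] * n for _ in range(n)]  # operands flowing top to bottom

    for t in range(3 * n - 2):           # cycles until the last PE finishes
        # Shift operands one PE to the right / one PE down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the array boundary (zeros when idle).
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Multiply-accumulate in every PE.
        for i in range(n):
            for j in range(n):
                acc[i][j] += multiply(a_reg[i][j], b_reg[i][j])
    return acc


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(systolic_matmul(A, B))  # exact PEs
    print(systolic_matmul(A, B, lambda a, b: truncated_multiply(a, b, 1)))
```

Passing a different `multiply` callable swaps the arithmetic inside every processing element without touching the dataflow, which mirrors the general idea of trading accuracy for cost at the PE level.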
Acknowledgements
This work is supported by grants from the National Natural Science Foundation of China (61871216), the Six Talent Peaks Project in Jiangsu Province (2018-XYDXX-009), and the Fundamental Research Funds for the Central Universities, China (NE2019102).
Cite this article
Waris, H., Wang, C., Liu, W. et al. AxSA: On the Design of High-Performance and Power-Efficient Approximate Systolic Arrays for Matrix Multiplication. J Sign Process Syst 93, 605–615 (2021). https://doi.org/10.1007/s11265-020-01582-7
DOI: https://doi.org/10.1007/s11265-020-01582-7