
Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference

Published in the Journal of Signal Processing Systems.

Abstract

Efficient implementation of deep neural networks (DNNs) on CPU-based systems is critical owing to the proliferation of applications in embedded and Internet of Things systems. Nowadays, most CPUs are equipped with single instruction multiple data (SIMD) instructions, which are used to implement an efficient general matrix multiply (GEMM) library for accelerating DNN inference. Quantized neural networks are actively investigated to reduce DNN computation and memory requirements; however, current CPU libraries do not efficiently support arithmetic operations below eight bits. Hence, we developed TernGEMM, a GEMM library composed of SIMD instructions for DNNs with ternary weights and sub-8-bit activations. TernGEMM is implemented using simple logical operations that replace the long-latency multiply–add operation. Instead of fixing the accumulation bit precision at 32 bits, TernGEMM accumulates the partial sums in a bit-incremental manner to exploit the greater parallelism of 8-bit and 16-bit SIMD instructions. Furthermore, we propose different tile sizes for TernGEMM to better support the diverse layer dimensions of DNNs. Compared with a state-of-the-art reduced-precision DNN GEMM library, i.e., GEMMLowp, TernGEMM achieves a \(1.785\times\) to \(4.147\times\) speedup for ResNet50, MobileNet-V2, and EfficientNet-B0, as evaluated on both Intel and ARM CPUs.
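
To make the two key mechanisms above concrete, the following is a minimal scalar C sketch written for this page, not taken from the TernGEMM source. It assumes a hypothetical encoding in which ternary weights are stored as two bit-planes (a +1 mask and a −1 mask) and activations are unsigned 4-bit values stored one per byte; the function name tern_dot and the chunk sizes (8 and 16) are illustrative choices, not the library's.

#include <stdint.h>
#include <stddef.h>

/* One dot product between sub-8-bit activations and ternary weights.
 * The multiply-add is replaced by mask/add/subtract, and partial sums
 * are widened incrementally (int8 -> int16 -> int32) instead of being
 * accumulated directly at 32-bit precision. */
int32_t tern_dot(const uint8_t *act,   /* activations in [0, 15]       */
                 const uint8_t *w_pos, /* bit i set => weight[i] = +1  */
                 const uint8_t *w_neg, /* bit i set => weight[i] = -1  */
                 size_t len)
{
    int32_t acc32 = 0;
    size_t i = 0;
    while (i < len) {
        int16_t acc16 = 0;
        /* Sixteen int8 partial sums fit safely in one int16 (16 * 120 = 1920). */
        for (int c = 0; c < 16 && i < len; ++c) {
            int8_t acc8 = 0;
            /* Eight terms with |a| <= 15 stay within int8 range (8 * 15 = 120). */
            for (int t = 0; t < 8 && i < len; ++t, ++i) {
                uint8_t bit = (uint8_t)(1u << (i % 8));
                int8_t a = (int8_t)act[i];
                if (w_pos[i / 8] & bit) acc8 += a;  /* weight = +1 */
                if (w_neg[i / 8] & bit) acc8 -= a;  /* weight = -1 */
                /* weight = 0 contributes nothing, so no multiply is needed */
            }
            acc16 += acc8;
        }
        acc32 += acc16;
    }
    return acc32;
}

In the library itself these loops would map onto SIMD lanes (e.g., AVX2 on Intel, NEON on ARM), and the bit-incremental accumulation is what buys the speedup: a register of a given width holds four times as many 8-bit lanes, and twice as many 16-bit lanes, as 32-bit ones, so the narrower partial sums keep more elements in flight per instruction.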


References

  1. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510–4520)

  2. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2820–2828)

  3. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR

  4. Reed, R. (1993). Pruning algorithms: A survey. IEEE Transactions on Neural Networks, 4, 740–747

  5. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18, 6869–6898

  6. Sung, W., Shin, S., & Hwang, K. (2015). Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488

  7. Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., & Gopalakrishnan, K. (2018). PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085

  8. Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv preprint arXiv:1902.08153

  9. Jacob, B., & Warden, P. (2017). gemmlowp: A small self-contained low-precision GEMM library

  10. Han, Q., Hu, Y., Yu, F., Yang, H., Liu, B., Hu, P., Gong, R., Wang, Y., Wang, R., Luan, Z., et al. (2020). Extremely low-bit convolution optimization for quantized neural network on modern computer architectures. In 49th International Conference on Parallel Processing (ICPP) (pp. 1–12)

  11. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778)

  12. Hwang, K., & Sung, W. (2014). Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS) (pp. 1–6). IEEE

  13. Shin, S., Boo, Y., & Sung, W. (2017). Fixed-point optimization of deep neural networks with adaptive step size retraining. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1203–1207). IEEE

  14. Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv preprint arXiv:1605.04711

  15. Zhu, C., Han, S., Mao, H., & Dally, W. J. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064

  16. Mellempudi, N., Kundu, A., Mudigere, D., Das, D., Kaul, B., & Dubey, P. (2017). Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462

  17. Mishra, A., Nurvitadhi, E., Cook, J. J., & Marr, D. (2017). WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134

  18. Shin, S., Park, J., Boo, Y., & Sung, W. (2020). HLHLP: Quantized neural networks training for reaching flat minima in loss surface. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 5784–5791)

  19. Boo, Y., Shin, S., Choi, J., & Sung, W. (2021). Stochastic precision ensemble: Self-knowledge distillation for quantized deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 6794–6802)

  20. Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., & Kwak, N. (2020). LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 696–697)

  21. Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft

  22. Dukhan, M., Wu, Y., Lu, H., & Maher, B. (2018). QNNPACK

  23. Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A., & Grundmann, M. (2019). On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989

  24. Umuroglu, Y., & Jahre, M. (2017). Streamlined deployment for quantized neural networks. arXiv preprint arXiv:1709.04060

  25. Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (pp. 525–542). Springer

  26. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830

  27. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE

  28. Jeon, Y., Park, B., Kwon, S. J., Kim, B., Yun, J., & Lee, D. (2020). BiQGEMM: Matrix multiplication with lookup table for binary-coding-based quantized DNNs. arXiv preprint arXiv:2005.09904

  29. Bao, W., Chang, L.-W., Chen, Y., Deng, K., Agarwal, A., Barsoum, E., & Taha, A. (2019). NGEMM: Optimizing GEMM for deep learning via compiler-based techniques. arXiv preprint arXiv:1910.00178

  30. Blackford, L. S., Petitet, A., Pozo, R., Remington, K., Whaley, R. C., Demmel, J., et al. (2002). An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, 28, 135–151

  31. Jia, Y. (2014). Learning semantic image representations at a large scale. Ph.D. thesis, UC Berkeley

  32. Intel. Intel intrinsics guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Online, Accessed: 2021-04-22

  33. Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149


Acknowledgements

This work was supported in part by the Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.; in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01297, Development of Ultra-Low Power Deep Learning Processor Technology using Advanced Data Reuse for Edge Applications); and in part by the NRF grant funded by the Korea government (MSIT) (2020R1A2C2102198).

Author information

Contributions

All authors contributed to the study’s conception and design. The main code design, experiments, and analysis were performed by Seokhyeon Choi and Kyuhong Shim. The first draft of the manuscript was written by Seokhyeon Choi, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Seokhyeon Choi.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Choi, S., Shim, K., Choi, J. et al. Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference. J Sign Process Syst 94, 929–943 (2022). https://doi.org/10.1007/s11265-022-01782-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11265-022-01782-3

Keywords

Navigation