
Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference

Published in the Journal of Signal Processing Systems.

Abstract

Efficient implementation of deep neural networks (DNNs) on CPU-based systems is critical owing to the proliferation of applications in embedded and Internet of Things systems. Nowadays, most CPUs are equipped with single instruction multiple data (SIMD) instructions, which are used to implement an efficient general matrix multiply (GEMM) library for accelerating DNN inference. Quantized neural networks are actively investigated to reduce DNN computation and memory requirements; however, current CPU libraries do not efficiently support arithmetic operations below eight bits. Hence, we developed TernGEMM, a GEMM library composed of SIMD instructions for DNNs with ternary weights and sub-8-bit activations. TernGEMM is implemented using simple logical operations that replace the long-latency multiply–add operation. Instead of fixing the accumulation bit precision at 32 bits, TernGEMM accumulates the partial sums in a bit-incremental manner to exploit the greater parallelism of 8-bit and 16-bit SIMD instructions. Furthermore, we propose different tile sizes for TernGEMM to better support the diverse layer dimensions of DNNs. Compared with a state-of-the-art reduced-precision DNN GEMM library, i.e., GEMMLowp, TernGEMM achieves a \(1.785\times\) to \(4.147\times\) speedup for ResNet50, MobileNet-V2, and EfficientNet-B0, as evaluated on both Intel and ARM CPUs.
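
To make the two key mechanisms above concrete, the following is a minimal scalar C sketch written for this page, not taken from the TernGEMM source. It assumes a hypothetical encoding in which ternary weights are stored as two bit-planes (a +1 mask and a −1 mask) and activations are unsigned 4-bit values stored one per byte; the function name tern_dot and the chunk sizes (8 and 16) are illustrative choices, not the library's.

#include <stdint.h>
#include <stddef.h>

/* One dot product between sub-8-bit activations and ternary weights.
 * The multiply-add is replaced by mask/add/subtract, and partial sums
 * are widened incrementally (int8 -> int16 -> int32) instead of being
 * accumulated directly at 32-bit precision. */
int32_t tern_dot(const uint8_t *act,   /* activations in [0, 15]       */
                 const uint8_t *w_pos, /* bit i set => weight[i] = +1  */
                 const uint8_t *w_neg, /* bit i set => weight[i] = -1  */
                 size_t len)
{
    int32_t acc32 = 0;
    size_t i = 0;
    while (i < len) {
        int16_t acc16 = 0;
        /* Sixteen int8 partial sums fit safely in one int16 (16 * 120 = 1920). */
        for (int c = 0; c < 16 && i < len; ++c) {
            int8_t acc8 = 0;
            /* Eight terms with |a| <= 15 stay within int8 range (8 * 15 = 120). */
            for (int t = 0; t < 8 && i < len; ++t, ++i) {
                uint8_t bit = (uint8_t)(1u << (i % 8));
                int8_t a = (int8_t)act[i];
                if (w_pos[i / 8] & bit) acc8 += a;  /* weight = +1 */
                if (w_neg[i / 8] & bit) acc8 -= a;  /* weight = -1 */
                /* weight = 0 contributes nothing, so no multiply is needed */
            }
            acc16 += acc8;
        }
        acc32 += acc16;
    }
    return acc32;
}

In the library itself these loops would map onto SIMD lanes (e.g., AVX2 on Intel, NEON on ARM), and the bit-incremental accumulation is what buys the speedup: a register of a given width holds four times as many 8-bit lanes, and twice as many 16-bit lanes, as 32-bit ones, so the narrower partial sums keep more elements in flight per instruction.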


References

  1. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510–4520)

  2. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2820–2828)

  3. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR

  4. Reed, R. (1993). Pruning algorithms: A survey. IEEE Transactions on Neural Networks, 4, 740–747

  5. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18, 6869–6898

  6. Sung, W., Shin, S., & Hwang, K. (2015). Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488

  7. Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., & Gopalakrishnan, K. (2018). PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085

  8. Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv preprint arXiv:1902.08153

  9. Jacob, B., & Warden, P. (2017). gemmlowp: A small self-contained low-precision GEMM library

  10. Han, Q., Hu, Y., Yu, F., Yang, H., Liu, B., Hu, P., Gong, R., Wang, Y., Wang, R., Luan, Z., et al. (2020). Extremely low-bit convolution optimization for quantized neural network on modern computer architectures. In 49th International Conference on Parallel Processing (ICPP) (pp. 1–12)

  11. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778)

  12. Hwang, K., & Sung, W. (2014). Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS) (pp. 1–6). IEEE

  13. Shin, S., Boo, Y., & Sung, W. (2017). Fixed-point optimization of deep neural networks with adaptive step size retraining. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1203–1207). IEEE

  14. Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv preprint arXiv:1605.04711

  15. Zhu, C., Han, S., Mao, H., & Dally, W. J. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064

  16. Mellempudi, N., Kundu, A., Mudigere, D., Das, D., Kaul, B., & Dubey, P. (2017). Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462

  17. Mishra, A., Nurvitadhi, E., Cook, J. J., & Marr, D. (2017). WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134

  18. Shin, S., Park, J., Boo, Y., & Sung, W. (2020). HLHLP: Quantized neural networks training for reaching flat minima in loss surface. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 5784–5791)

  19. Boo, Y., Shin, S., Choi, J., & Sung, W. (2021). Stochastic precision ensemble: Self-knowledge distillation for quantized deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 6794–6802)

  20. Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., & Kwak, N. (2020). LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 696–697)

  21. Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft

  22. Dukhan, M., Wu, Y., Lu, H., & Maher, B. (2018). QNNPACK

  23. Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A., & Grundmann, M. (2019). On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989

  24. Umuroglu, Y., & Jahre, M. (2017). Streamlined deployment for quantized neural networks. arXiv preprint arXiv:1709.04060

  25. Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (pp. 525–542). Springer

  26. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830

  27. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE

  28. Jeon, Y., Park, B., Kwon, S. J., Kim, B., Yun, J., & Lee, D. (2020). BiQGEMM: Matrix multiplication with lookup table for binary-coding-based quantized DNNs. arXiv preprint arXiv:2005.09904

  29. Bao, W., Chang, L.-W., Chen, Y., Deng, K., Agarwal, A., Barsoum, E., & Taha, A. (2019). NGEMM: Optimizing GEMM for deep learning via compiler-based techniques. arXiv preprint arXiv:1910.00178

  30. Blackford, L. S., Petitet, A., Pozo, R., Remington, K., Whaley, R. C., Demmel, J., et al. (2002). An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, 28, 135–151

  31. Jia, Y. (2014). Learning semantic image representations at a large scale. Ph.D. thesis, UC Berkeley

  32. Intel. Intel intrinsics guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Online, Accessed: 2021-04-22

  33. Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149


Acknowledgements

This work was supported in part by the Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.; in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01297, Development of Ultra-Low Power Deep Learning Processor Technology using Advanced Data Reuse for Edge Applications); and in part by the NRF grant funded by the Korea government (MSIT) (2020R1A2C2102198).

Author information

Contributions

All authors contributed to the study’s conception and design. The main code design, experiments, and analysis were performed by Seokhyeon Choi and Kyuhong Shim. The first draft of the manuscript was written by Seokhyeon Choi, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Seokhyeon Choi.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Choi, S., Shim, K., Choi, J. et al. Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference. J Sign Process Syst 94, 929–943 (2022). https://doi.org/10.1007/s11265-022-01782-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11265-022-01782-3

Keywords

Navigation