Abstract
Efficient implementation of deep neural networks (DNNs) on CPU-based systems is critical owing to the proliferation of applications in embedded and Internet of Things systems. Nowadays, most CPUs are equipped with single instruction multiple data (SIMD) instructions, which are used to implement efficient general matrix multiply (GEMM) libraries for accelerating DNN inference. Quantized neural networks are actively investigated to reduce DNN computation and memory requirements; however, current CPU libraries do not efficiently support arithmetic operations below eight bits. Hence, we developed TernGEMM, a GEMM library composed of SIMD instructions for DNNs with ternary weights and sub-8-bit activations. TernGEMM is implemented using simple logical operations that replace the long-latency multiply–add operation. Instead of fixing the accumulation precision at 32 bits, TernGEMM accumulates the partial sums in a bit-incremental manner to exploit the parallelism of 8-bit and 16-bit SIMD instructions. Furthermore, we propose different tile sizes for TernGEMM to better support the diverse layer dimensions of DNNs. Compared with GEMMLowp, a state-of-the-art reduced-precision DNN GEMM library, TernGEMM achieves \(\times\)1.785 to \(\times\)4.147 speedup for ResNet50, MobileNet-V2, and EfficientNet-B0, as evaluated on both Intel and ARM CPUs.
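To give a rough intuition for the two ideas named in the abstract (multiply–add replaced by logical select operations, and bit-incremental accumulation of partial sums), the following NumPy sketch emulates them with scalar/vector code. It is an illustration under stated assumptions, not the paper's actual SIMD kernels: `ternary_dot`, the two-bitmask weight encoding (`w_pos`, `w_neg`), the 4-bit activation range, and the chunk size are all hypothetical choices made here for clarity.

```python
import numpy as np

def ternary_dot(acts, w_pos, w_neg, chunk=64):
    """Dot product of sub-8-bit activations with ternary weights {-1, 0, +1}.

    Weights are encoded as two boolean masks (w_pos: weight == +1,
    w_neg: weight == -1), so the multiply is replaced by a masked
    select followed by add/subtract, loosely mimicking SIMD logical
    instructions. Partial sums are kept in 16-bit lanes and widened
    to 32-bit only once per chunk (bit-incremental accumulation).
    """
    total = np.int32(0)
    for i in range(0, len(acts), chunk):
        a = acts[i:i + chunk].astype(np.int16)
        p = w_pos[i:i + chunk]
        n = w_neg[i:i + chunk]
        # Select-and-subtract instead of multiply:
        # contributes +a where w = +1, -a where w = -1, 0 where w = 0.
        partial = np.where(p, a, 0) - np.where(n, a, 0)
        # With 4-bit activations (|a| <= 15) and chunk = 64 lanes,
        # the per-chunk sum is bounded by 960, safely inside int16.
        total += np.int32(partial.sum(dtype=np.int16))
    return int(total)
```

A plain multiply-and-accumulate version would compute the identical result; the point of the reformulation is that, on real hardware, the select/add path maps onto cheap 8- and 16-bit SIMD operations instead of long-latency multiplies.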
References
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510–4520)
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2820–2828)
Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105–6114). PMLR
Reed, R. (1993). Pruning algorithms: A survey. IEEE Transactions on Neural Networks, 4, 740–747
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18, 6869–6898
Sung, W., Shin, S., & Hwang, K. (2015). Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488
Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., & Gopalakrishnan, K. (2018). Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv preprint arXiv:1902.08153
Jacob, B., & Warden, P. (2017). gemmlowp: A small self-contained low-precision gemm library
Han, Q., Hu, Y., Yu, F., Yang, H., Liu, B., Hu, P., Gong, R., Wang, Y., Wang, R., Luan, Z., et al. (2020). Extremely low-bit convolution optimization for quantized neural network on modern computer architectures. In 49th International Conference on Parallel Processing-ICPP (pp. 1–12)
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778)
Hwang, K., & Sung, W. (2014). Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS) (pp. 1–6). IEEE
Shin, S., Boo, Y., & Sung, W. (2017). Fixed-point optimization of deep neural networks with adaptive step size retraining. In 2017 IEEE International conference on acoustics, speech and signal processing (ICASSP) (pp. 1203–1207). IEEE
Li, F., Zhang, B., & Liu, B. (2016). Ternary weight networks. arXiv preprint arXiv:1605.04711
Zhu, C., Han, S., Mao, H., & Dally, W. J. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064
Mellempudi, N., Kundu, A., Mudigere, D., Das, D., Kaul, B., & Dubey, P. (2017). Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462
Mishra, A., Nurvitadhi, E., Cook, J. J., & Marr, D. (2017). WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134
Shin, S., Park, J., Boo, Y., & Sung, W. (2020). Hlhlp: Quantized neural networks training for reaching flat minima in loss surface. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 5784–5791)
Boo, Y., Shin, S., Choi, J., & Sung, W. (2021). Stochastic precision ensemble: Self-knowledge distillation for quantized deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, pp. 6794–6802)
Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., & Kwak, N. (2020). Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 696–697)
Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft
Dukhan, M., Wu, Y., Lu, H., & Maher, B. (2018). Qnnpack
Lee, J., Chirkov, N., Ignasheva, E., Pisarchyk, Y., Shieh, M., Riccardi, F., Sarokin, R., Kulik, A., & Grundmann, M. (2019). On-device neural net inference with mobile gpus. arXiv preprint arXiv:1907.01989
Umuroglu, Y., & Jahre, M. (2017). Streamlined deployment for quantized neural networks. arXiv preprint arXiv:1709.04060
Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision (pp. 525–542). Springer
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE
Jeon, Y., Park, B., Kwon, S. J., Kim, B., Yun, J., & Lee, D. (2020). BiQGEMM: matrix multiplication with lookup table for binary-coding-based quantized dnns. arXiv preprint arXiv:2005.09904
Bao, W., Chang, L.-W., Chen, Y., Deng, K., Agarwal, A., Barsoum, E., & Taha, A. (2019). NGEMM: Optimizing gemm for deep learning via compiler-based techniques. arXiv preprint arXiv:1910.00178
Blackford, L. S., Petitet, A., Pozo, R., Remington, K., Whaley, R. C., Demmel, J., et al. (2002). An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software, 28, 135–151
Jia, Y. (2014). Learning semantic image representations at a large scale. Ph.D. thesis UC Berkeley
Intel. Intel intrinsics guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Online, Accessed: 2021-04-22
Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
Acknowledgements
This work was supported in part by Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd., in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01297, Development of Ultra-Low Power Deep Learning Processor Technology using Advanced Data Reuse for Edge Applications), and in part by the NRF grant funded by the Korea government (MSIT) (2020R1A2C2102198).
Author information
Contributions
All authors contributed to the study’s conception and design. Main code design, experiment, and analysis were performed by Seokhyeon Choi and Kyuhong Shim. The first draft of the manuscript was written by Seokhyeon Choi and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Choi, S., Shim, K., Choi, J. et al. Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference. J Sign Process Syst 94, 929–943 (2022). https://doi.org/10.1007/s11265-022-01782-3