research-article
Open Access

Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC

Published: 29 May 2020

Abstract

Recent trends in deep convolutional neural networks (DCNNs) have established hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks with DSPs and on-chip memory resources to enable energy-efficient DCNN designs. The main advantage of the Orlando platform is its runtime-configurable convolutional accelerators, which can adapt to different DCNN workloads. This opens new challenges in mapping the computation onto the accelerators and in managing the on-chip resources efficiently. In this work, we propose a runtime design space exploration and mapping methodology for managing on-chip memory, convolutional accelerators, and external bandwidth at runtime. Experimental results are reported in terms of power/performance scalability, Pareto analysis, mapping adaptivity, and accelerator utilization for the Orlando architecture mapping the VGG-16, Tiny-Yolo (v2), and MobileNet topologies.
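To make the abstract's notion of runtime design space exploration concrete, the short Python sketch below enumerates candidate layer mappings, discards those that exceed an on-chip memory or external-bandwidth budget, and keeps the Pareto-optimal points over latency and power. This is only an illustration of the general idea, not the paper's algorithm: the Mapping fields, the budgets, and the cost formulas are all hypothetical placeholders.

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mapping:
    accelerators: int     # convolutional accelerators assigned to a layer (hypothetical knob)
    tile_kb: int          # on-chip buffer tile size in KiB (hypothetical knob)
    latency_ms: float     # estimated latency from a toy cost model
    power_mw: float       # estimated power from a toy cost model
    bandwidth_mbs: float  # estimated external bandwidth demand in MB/s


def feasible(m: Mapping, mem_budget_kb: int, bw_budget_mbs: float) -> bool:
    # A mapping is usable only if it fits both the on-chip memory and bandwidth budgets.
    return m.tile_kb <= mem_budget_kb and m.bandwidth_mbs <= bw_budget_mbs


def dominates(a: Mapping, b: Mapping) -> bool:
    # a dominates b if it is no worse in both objectives and strictly better in at least one.
    return (a.latency_ms <= b.latency_ms and a.power_mw <= b.power_mw
            and (a.latency_ms < b.latency_ms or a.power_mw < b.power_mw))


def pareto_front(mappings):
    # Keep only mappings that are not dominated in (latency, power).
    return [m for m in mappings if not any(dominates(o, m) for o in mappings)]


def explore(layer_ops: float, mem_budget_kb: int = 256, bw_budget_mbs: float = 400.0):
    # Enumerate (accelerators, tile size) candidates with purely illustrative cost formulas.
    candidates = []
    for acc, tile in product([1, 2, 4, 8], [64, 128, 256, 512]):
        latency = layer_ops / (acc * 1e6)     # toy model: more accelerators, lower latency
        power = 50.0 + 30.0 * acc             # toy model: power grows with active accelerators
        bandwidth = layer_ops / (tile * 1e5)  # toy model: larger tiles, less external traffic
        candidates.append(Mapping(acc, tile, latency, power, bandwidth))
    usable = [m for m in candidates if feasible(m, mem_budget_kb, bw_budget_mbs)]
    return pareto_front(usable)


if __name__ == "__main__":
    for m in explore(layer_ops=3.7e9):  # roughly the MAC count of a large convolutional layer
        print(m)

In an actual runtime flow, the latency, power, and bandwidth estimates would come from the accelerator's performance model rather than these toy formulas; the structure of filter-then-Pareto selection is what the sketch is meant to convey.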




          • Published in

            ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 2
            June 2020
            169 pages
            ISSN: 1544-3566
            EISSN: 1544-3973
            DOI: 10.1145/3403597

            Copyright © 2020 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 29 May 2020
            • Online AM: 7 May 2020
            • Accepted: 1 January 2020
            • Revised: 1 November 2019
            • Received: 1 March 2019
            Published in TACO Volume 17, Issue 2


            Qualifiers

            • research-article
            • Research
            • Refereed
