Abstract
Recent trends in deep convolutional neural networks (DCNNs) have established hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks with DSPs and on-chip memory resources to enable energy-efficient DCNN designs. The main advantage of the Orlando platform is its runtime-configurable convolutional accelerators, which can adapt to different DCNN workloads. This flexibility opens new challenges in mapping the computation onto the accelerators and in managing the on-chip resources efficiently. In this work, we propose a runtime design space exploration and mapping methodology for runtime management of on-chip memory, convolutional accelerators, and external memory bandwidth. Experimental results are reported in terms of power/performance scalability, Pareto analysis, mapping adaptivity, and accelerator utilization for the Orlando architecture when mapping the VGG-16, Tiny-YOLO (v2), and MobileNet topologies.
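To make the resource-management idea concrete, the sketch below illustrates one way a runtime mapper could choose among pre-characterized layer mappings on a power/performance Pareto front, subject to the accelerators, on-chip memory, and external bandwidth currently available. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `Mapping` fields, helper names, and numbers are hypothetical.

```python
# Hypothetical sketch of Pareto-based runtime mapping selection.
# Not the Orlando methodology; all names and numbers are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    accelerators: int      # convolutional accelerators assigned to the layer
    buffer_kb: int         # on-chip memory reserved for tiles/weights
    bandwidth_mbps: int    # external-memory bandwidth the mapping consumes
    latency_ms: float      # estimated execution time for the layer
    power_mw: float        # estimated power draw

def pareto_front(candidates):
    """Keep mappings not dominated in (latency, power) by any other."""
    front = []
    for c in candidates:
        dominated = any(
            o.latency_ms <= c.latency_ms and o.power_mw <= c.power_mw
            and (o.latency_ms, o.power_mw) != (c.latency_ms, c.power_mw)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

def pick_mapping(front, free_accels, free_kb, free_mbps):
    """At runtime, choose the fastest Pareto point that fits the
    currently available accelerators, memory, and bandwidth."""
    feasible = [m for m in front
                if m.accelerators <= free_accels
                and m.buffer_kb <= free_kb
                and m.bandwidth_mbps <= free_mbps]
    return min(feasible, key=lambda m: m.latency_ms) if feasible else None

if __name__ == "__main__":
    # Illustrative design points for one convolutional layer.
    candidates = [
        Mapping(8, 512, 800, 1.2, 310.0),
        Mapping(4, 256, 400, 2.1, 170.0),
        Mapping(2, 128, 200, 3.9, 95.0),
        Mapping(4, 512, 800, 2.0, 220.0),
    ]
    front = pareto_front(candidates)
    print(pick_mapping(front, free_accels=4, free_kb=300, free_mbps=500))
```

With the resource budget above, the selector skips the 8-accelerator point (too many accelerators and too much bandwidth) and returns the fastest mapping that fits, mirroring how a runtime scheme can trade power for performance as resource availability changes.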