Abstract
Recent trends in deep convolutional neural networks (DCNNs) have established hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks with DSPs and on-chip memory resources to enable energy-efficient DCNN designs. The main advantage of the Orlando platform is its runtime-configurable convolutional accelerators, which can adapt to different DCNN workloads. This flexibility opens new challenges in mapping the computation onto the accelerators and in managing the on-chip resources efficiently. In this work, we propose a runtime design space exploration and mapping methodology for runtime management of on-chip memory, convolutional accelerators, and external memory bandwidth. Experimental results are reported in terms of power/performance scalability, Pareto analysis, mapping adaptivity, and accelerator utilization for the Orlando architecture when mapping the VGG-16, Tiny-YOLO (v2), and MobileNet topologies.
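To make the resource-management idea concrete, the sketch below illustrates one way a runtime mapper could choose among pre-characterized layer mappings on a power/performance Pareto front, subject to the accelerators, on-chip memory, and external bandwidth currently available. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `Mapping` fields, helper names, and numbers are hypothetical.

```python
# Hypothetical sketch of Pareto-based runtime mapping selection.
# Not the Orlando methodology; all names and numbers are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    accelerators: int      # convolutional accelerators assigned to the layer
    buffer_kb: int         # on-chip memory reserved for tiles/weights
    bandwidth_mbps: int    # external-memory bandwidth the mapping consumes
    latency_ms: float      # estimated execution time for the layer
    power_mw: float        # estimated power draw

def pareto_front(candidates):
    """Keep mappings not dominated in (latency, power) by any other."""
    front = []
    for c in candidates:
        dominated = any(
            o.latency_ms <= c.latency_ms and o.power_mw <= c.power_mw
            and (o.latency_ms, o.power_mw) != (c.latency_ms, c.power_mw)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

def pick_mapping(front, free_accels, free_kb, free_mbps):
    """At runtime, choose the fastest Pareto point that fits the
    currently available accelerators, memory, and bandwidth."""
    feasible = [m for m in front
                if m.accelerators <= free_accels
                and m.buffer_kb <= free_kb
                and m.bandwidth_mbps <= free_mbps]
    return min(feasible, key=lambda m: m.latency_ms) if feasible else None

if __name__ == "__main__":
    # Illustrative design points for one convolutional layer.
    candidates = [
        Mapping(8, 512, 800, 1.2, 310.0),
        Mapping(4, 256, 400, 2.1, 170.0),
        Mapping(2, 128, 200, 3.9, 95.0),
        Mapping(4, 512, 800, 2.0, 220.0),
    ]
    front = pareto_front(candidates)
    print(pick_mapping(front, free_accels=4, free_kb=300, free_mbps=500))
```

With the resource budget above, the selector skips the 8-accelerator point (too many accelerators and too much bandwidth) and returns the fastest mapping that fits, mirroring how a runtime scheme can trade power for performance as resource availability changes.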