Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2020-06-18 , DOI: arxiv-2006.12274
Andreas Bytyn, René Ahlsdorf, Rainer Leupers, Gerd Ascheid

Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research over the past years. Increasingly sophisticated hardware accelerators have been proposed that exploit, for example, the sparsity in computations and make use of reduced-precision arithmetic to scale down energy consumption. However, future platforms require more than just energy efficiency: scalability is becoming an increasingly important factor. The effort required for physical implementation grows with the size of the accelerator, making it more difficult to meet target constraints. Many-core platforms consisting of several homogeneous cores can alleviate the aforementioned limitations with regard to physical implementation, at the expense of an increased dataflow-mapping effort. While the dataflow in CNNs is deterministic and can therefore be optimized offline, finding a suitable scheme that minimizes both runtime and off-chip memory accesses is a challenging task, which becomes even more complex when an interconnect system is involved. This work presents an automated mapping strategy, starting at the single-core level, with different optimization targets for minimal runtime and minimal off-chip memory accesses. The strategy is then extended to a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect. Design space exploration is performed by mapping the well-known CNNs AlexNet and VGG-16 to platforms with different core counts and different computational power per core in order to investigate the trade-offs. Our mapping strategy and system setup are scaled from a single core up to 128 cores, thereby showing the limits of the selected approach.
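The runtime-versus-off-chip-traffic trade-off described in the abstract can be made concrete with a toy cost model. The sketch below is purely illustrative and is not the paper's actual mapping algorithm: it assumes a simple partitioning of one convolutional layer's output channels across homogeneous cores, 16-bit values, and a single local buffer per core. All parameter names (`macs_per_cycle`, `local_buf_bytes`, the re-fetch heuristic) are hypothetical assumptions chosen to show why splitting work across more cores can shorten runtime while increasing total off-chip memory accesses.

```python
# Illustrative sketch only -- NOT the mapping strategy from the paper.
# Models output-channel partitioning of one conv layer across N cores.
from dataclasses import dataclass
import math


@dataclass
class ConvLayer:
    h: int      # output feature-map height
    w: int      # output feature-map width
    c_in: int   # input channels
    c_out: int  # output channels
    k: int      # square kernel size (k x k)


def map_layer(layer: ConvLayer, cores: int,
              macs_per_cycle: int, local_buf_bytes: int):
    """Estimate cycles and total off-chip traffic (bytes) when the
    layer's output channels are split evenly across `cores` cores."""
    ch_per_core = math.ceil(layer.c_out / cores)

    # MAC operations each core performs for its slice of output channels.
    macs = layer.h * layer.w * ch_per_core * layer.c_in * layer.k * layer.k
    cycles = math.ceil(macs / macs_per_cycle)  # cores run in parallel

    # Every core must fetch the full input feature map plus its own
    # weight slice from off-chip memory (2 bytes per 16-bit value).
    ifmap_bytes = layer.h * layer.w * layer.c_in * 2
    weight_bytes = ch_per_core * layer.c_in * layer.k * layer.k * 2

    # Crude heuristic: if the working set overflows the local buffer,
    # assume the input feature map has to be fetched twice.
    refetch = 1 if ifmap_bytes + weight_bytes <= local_buf_bytes else 2
    offchip_bytes = cores * (ifmap_bytes * refetch + weight_bytes)
    return cycles, offchip_bytes


if __name__ == "__main__":
    # AlexNet-like conv layer: 13x13 output, 256 -> 384 channels, 3x3 kernel.
    layer = ConvLayer(h=13, w=13, c_in=256, c_out=384, k=3)
    for n in (1, 4, 16):
        cyc, traffic = map_layer(layer, n, macs_per_cycle=16,
                                 local_buf_bytes=128 * 1024)
        print(f"{n:3d} cores: {cyc:>10d} cycles, {traffic:>10d} B off-chip")
```

Running the toy model shows the tension the paper's design space exploration investigates: the parallel runtime shrinks roughly linearly with the core count, but because every core re-fetches the shared input feature map, the aggregate off-chip traffic grows, which is exactly why a mapping must be optimized for both objectives jointly.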

Last updated: 2020-06-23