On the Difficulty of Designing Processor Arrays for Deep Neural Networks
arXiv - CS - Hardware Architecture. Pub Date: 2020-06-24, DOI: arxiv-2006.14008
Kevin Stehle, Günther Schindler, and Holger Fröning

Systolic arrays are a promising computing concept that is particularly well aligned with CMOS technology trends and with the linear algebra operations found in the processing of artificial neural networks. The recent success of such deep learning methods across a wide range of applications has led to a variety of models which, although conceptually similar in that they are based on convolutions and fully-connected layers, show a huge diversity of operations in detail due to a large design space: an operand's dimensions vary substantially, as they depend on design choices such as receptive field size, number of features, striding, dilation, and grouping of features. Furthermore, recent networks extend previously plain feedforward models with various forms of connectivity, as in ResNet or DenseNet. The problem of choosing an optimal systolic array configuration cannot be solved analytically; instead, methods and tools are required that enable fast and accurate reasoning about optimality in terms of total cycles, utilization, and amount of data movement. In this work we introduce Camuy, a lightweight model of a weight-stationary systolic array for linear algebra operations that allows quick exploration of different configurations, such as systolic array dimensions and input/output bitwidths. Camuy aids accelerator designers either in finding optimal configurations for a particular network architecture or in achieving robust performance across a variety of network architectures. It integrates easily into existing machine learning tool stacks (e.g., TensorFlow) through custom operators. We present an analysis of popular DNN models to illustrate how Camuy can estimate required cycles, data movement costs, and systolic array utilization, and show how progress in network architecture design affects the efficiency of inference on accelerators based on systolic arrays.
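To make the kind of reasoning described above concrete, the sketch below shows a minimal, first-order cycle and utilization estimate for a matrix multiplication mapped onto a weight-stationary systolic array. This is not the Camuy model itself; all function names, the tiling scheme, and the per-tile latency terms are simplifying assumptions chosen for illustration.

```python
import math

def estimate_gemm_on_systolic_array(M, K, N, array_rows, array_cols):
    """First-order estimate for a (M,K) x (K,N) matmul on a
    weight-stationary systolic array of array_rows x array_cols PEs.

    Assumption: weights are tiled along K (rows) and N (cols); each
    weight tile is loaded once and activations are streamed through it.
    """
    k_tiles = math.ceil(K / array_rows)
    n_tiles = math.ceil(N / array_cols)
    num_tiles = k_tiles * n_tiles

    # Per tile: load the weights (one row of the tile per cycle), then
    # stream M activation vectors, plus pipeline fill/drain across the
    # array diagonal.
    load_cycles = array_rows
    stream_cycles = M + array_rows + array_cols - 2
    total_cycles = num_tiles * (load_cycles + stream_cycles)

    # Utilization: useful MACs relative to the peak of the array.
    macs = M * K * N
    utilization = macs / (total_cycles * array_rows * array_cols)

    # Data movement (element counts): weights loaded once per tile
    # (assuming dimensions divide evenly), activations re-read for every
    # column tile, partial sums written back per row tile.
    weight_moves = K * N
    activation_moves = M * K * n_tiles
    output_moves = M * N * k_tiles

    return {
        "cycles": total_cycles,
        "utilization": utilization,
        "weight_moves": weight_moves,
        "activation_moves": activation_moves,
        "output_moves": output_moves,
    }

if __name__ == "__main__":
    # Example: a 3x3 convolution with 256 input/output channels on a
    # 14x14 feature map, lowered to GEMM (M = 14*14, K = 3*3*256, N = 256),
    # compared on two candidate array sizes.
    for rows, cols in [(32, 32), (128, 128)]:
        stats = estimate_gemm_on_systolic_array(196, 2304, 256, rows, cols)
        print(rows, cols, stats["cycles"], round(stats["utilization"], 3))
```

Even this crude model reproduces the qualitative trade-off discussed in the abstract: a larger array finishes a layer in fewer cycles but may be poorly utilized when the operand dimensions do not match the array dimensions, which is why configuration exploration across real network architectures is needed.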

Updated: 2020-06-26