FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters
IEEE Transactions on Computers (IF 3.7), Pub Date: 2020-01-01, DOI: 10.1109/tc.2020.3000118
Tianqi Wang, Tong Geng, Ang Li, Xi Jin, Martin Herbordt

Deep Convolutional Neural Networks (CNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling CNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that, to make the distributed cluster work with high utilization, the workload distributed to each node must be large; this implies nontrivial growth in the SGD mini-batch size. In this article we propose a framework, called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train CNNs. This approach has numerous benefits. First, the design does not suffer from performance loss due to batch-size growth. Second, work and storage are balanced among nodes through novel workload and weight partitioning schemes. Part of the mechanism is the surprising finding that it is preferable to store excess weights in neighboring devices rather than in local off-chip memory. Third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time that features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. Fourth, we find that the simplest topology, a 1D array, is preferred for interconnecting the FPGAs, thus enabling widespread applicability. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. Results show that FPDeep scales well to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. With 250 Gb/s of bidirectional bandwidth per FPGA, which is easily supported by current-generation FPGAs, FPDeep performance scales linearly up to 100 FPGAs. Energy efficiency is evaluated with respect to GOPs/J; FPDeep provides, on average, 6.4× higher energy efficiency than comparable GPU servers.
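The hybrid layer/model parallelism and workload-balancing idea can be made concrete with a small sketch. The Python snippet below is not from the paper; it is a minimal illustration, with hypothetical layer FLOP counts and function names, of how a CNN's convolution layers might be laid out along a 1-D chain of FPGAs so that every device receives a roughly equal slice of the per-image compute, with a layer allowed to straddle two neighboring devices.

```python
def partition_layers(layer_flops, num_devices):
    """Split per-layer FLOP counts into contiguous, roughly equal shares of
    total work along a 1-D device chain; a layer may be split across two
    adjacent devices (fine-grained, intra-layer model parallelism).

    Returns, per device, a list of (layer_index, fraction_of_layer) pairs.
    """
    total = sum(layer_flops)
    share = total / num_devices          # ideal work per device
    assignments = [[] for _ in range(num_devices)]
    dev, remaining = 0, share
    for i, flops in enumerate(layer_flops):
        left = flops
        while left > 1e-9:
            if dev == num_devices - 1:
                take = left              # last device absorbs any residue
            else:
                take = min(left, remaining)
            assignments[dev].append((i, take / flops))
            left -= take
            remaining -= take
            if remaining <= 1e-9 and dev < num_devices - 1:
                dev += 1
                remaining = share
    return assignments

if __name__ == "__main__":
    # Hypothetical per-layer convolution costs (GFLOPs per image).
    conv_gflops = [0.2, 0.9, 1.5, 1.5, 1.0]
    for d, slots in enumerate(partition_layers(conv_gflops, num_devices=3)):
        work = sum(conv_gflops[i] * frac for i, frac in slots)
        print(f"FPGA {d}: {slots}  (~{work:.2f} GFLOPs)")
```

Unlike batch-mode data parallelism, adding devices under a scheme like this shrinks each device's share of the model and deepens the pipeline, rather than enlarging the global mini-batch.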

Updated: 2020-01-01