Deep Partitioned Training From Near-Storage Computing to DNN Accelerators
IEEE Computer Architecture Letters (IF 1.4), Pub Date: 2021-05-19, DOI: 10.1109/lca.2021.3081752
Yongjoo Jang, Sejin Kim, Daehoon Kim, Sungjin Lee, Jaeha Kung

In this letter, we present deep partitioned training to accelerate the computations involved in training DNN models. This is the first work that partitions a DNN model across storage devices, an NPU, and a host CPU, forming a unified compute node for training workloads. To validate the benefit of the proposed system during DNN training, a trace-based simulator or an FPGA prototype is used to estimate the overall performance and to find the partition layer index that yields the minimum latency. As a case study, we select two benchmarks: vision-related tasks and a recommendation system. As a result, the training time is reduced by 12.2~31.0 percent with four near-storage computing devices in the vision-related tasks (mini-batch size of 512), and by 40.6~44.7 percent with one near-storage computing device in the selected recommendation system (mini-batch size of 64).
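The abstract does not spell out the cost model behind the partition-point search, so the sketch below is only an illustration under an assumed additive model: per-layer latencies on the near-storage computing (NSC) side and on the NPU side, plus an activation-transfer cost at the split boundary. All function and parameter names (best_partition, nsc_ms, npu_ms, transfer_ms) and the numbers in the usage example are hypothetical, not from the paper.

```python
# Hypothetical sketch: pick the layer index that minimizes estimated step latency
# when layers 0..k-1 run on near-storage compute and layers k.. run on the NPU.
from typing import List, Tuple

def best_partition(nsc_ms: List[float],       # assumed per-layer latency near storage (ms)
                   npu_ms: List[float],       # assumed per-layer latency on the NPU (ms)
                   transfer_ms: List[float]   # assumed activation hand-off cost after layer i (ms)
                   ) -> Tuple[int, float]:
    """Return (k, latency): the split index with the minimum estimated latency."""
    n = len(nsc_ms)
    best_k, best_t = 0, float("inf")
    for k in range(n + 1):                       # k = 0 means everything runs on the NPU
        t = sum(nsc_ms[:k]) + sum(npu_ms[k:])    # compute time on each side of the split
        if 0 < k < n:
            t += transfer_ms[k - 1]              # cost of moving activations at the boundary
        if t < best_t:
            best_k, best_t = k, t
    return best_k, best_t

if __name__ == "__main__":
    # Made-up per-layer numbers where offloading the first layer near storage wins.
    k, t = best_partition(nsc_ms=[1.0, 5.0, 6.0],
                          npu_ms=[3.0, 2.0, 2.5],
                          transfer_ms=[0.5, 1.0, 0.0])
    print(f"split after layer {k}, estimated step latency {t:.1f} ms")
```

In the paper, the corresponding latency estimates come from the trace-based simulator or the FPGA prototype rather than from fixed per-layer constants; the search over candidate split indices is the part this sketch mirrors.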

Updated: 2021-05-19