Analyzing and Mitigating Data Stalls in DNN Training
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2020-07-14, DOI: arxiv-2007.06775
Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time spanning different layers of the system stack, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used Deep Neural Networks (DNNs). We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are a part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and pre-processed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and to perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5x on a single server).
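To make the notion of "data stall time" concrete, below is a minimal sketch of how one might measure per-iteration data-wait time versus compute time in a PyTorch training loop. The model, synthetic dataset, and timing approach are illustrative assumptions only; this is not the paper's DS-Analyzer tool, which uses a differential measurement technique rather than simple wall-clock splits.

```python
# Minimal sketch: splitting an epoch's wall-clock time into "data stall"
# (blocked waiting on the input pipeline) and "compute" (copy + train step).
# Dataset, model, and hyperparameters are placeholders for illustration.
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic stand-in for an image dataset and a small classifier.
data = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=64, num_workers=2, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

fetch_time, compute_time = 0.0, 0.0
t0 = time.perf_counter()
for x, y in loader:
    t1 = time.perf_counter()
    fetch_time += t1 - t0              # time blocked on fetch + pre-processing
    x, y = x.to(device), y.to(device)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if device == "cuda":
        torch.cuda.synchronize()       # make GPU work visible to host timers
    t0 = time.perf_counter()
    compute_time += t0 - t1            # copy + forward/backward/update time

total = fetch_time + compute_time
print(f"data stall: {fetch_time:.2f}s ({100 * fetch_time / total:.1f}% of epoch)")
print(f"compute:    {compute_time:.2f}s")
```

If the reported data-stall fraction is large, the GPU is idle waiting on storage or CPU pre-processing, which is exactly the regime the paper's CoorDL techniques target.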

Updated: 2020-09-02