Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination
arXiv - CS - Hardware Architecture Pub Date : 2020-11-08 , DOI: arxiv-2011.03897
Fuxun Yu, Zirui Xu, Tong Shen, Dimitrios Stamoulis, Longfei Shangguan, Di Wang, Rishi Madhok, Chunshui Zhao, Xin Li, Nikolaos Karianakis, Dimitrios Lymberopoulos, Ang Li, ChenChen Liu, Yiran Chen, Xiang Chen

Despite the superb performance of State-Of-The-Art (SOTA) DNNs, their increasing computational cost makes it very challenging to meet real-time latency and accuracy requirements. Although DNN runtime latency is dictated by model properties (e.g., architecture, operations), hardware properties (e.g., utilization, throughput), and, more importantly, the effective mapping between the two, many existing approaches focus only on optimizing model properties such as FLOPs reduction and overlook the mismatch between DNN model and hardware properties. In this work, we show that the mismatch between varied DNN computation workloads and GPU capacity can cause an idle-GPU tail effect, leading to GPU under-utilization and low throughput. As a result, FLOPs reduction cannot bring effective latency reduction, which leads to sub-optimal accuracy-versus-latency trade-offs. Motivated by this, we propose a GPU runtime-aware DNN optimization methodology that adaptively eliminates the GPU tail effect on GPU platforms. Our methodology can be applied on top of existing SOTA DNN optimization approaches to achieve better latency and accuracy trade-offs. Experiments show 11%-27% latency reduction and 2.5%-4.0% accuracy improvement over several SOTA DNN pruning and NAS methods, respectively.
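The tail effect described above can be illustrated with a minimal sketch. A GPU executes a kernel's thread blocks in "waves" whose size is bounded by concurrent capacity (roughly, SM count times resident blocks per SM); a partially filled last wave leaves SMs idle, so reducing FLOPs without crossing a wave boundary barely changes latency. The model below is an illustrative simplification, not the paper's actual method, and the GPU figures (80 SMs, 2 blocks per SM) are hypothetical:

```python
# Illustrative sketch of the GPU tail effect (assumed simplified model,
# not the paper's algorithm): latency is approximated as proportional to
# the number of waves of thread blocks a kernel launches.

import math

def waves(total_blocks: int, wave_capacity: int) -> int:
    """Number of waves needed to run `total_blocks` thread blocks on a
    GPU that can execute `wave_capacity` blocks concurrently."""
    return math.ceil(total_blocks / wave_capacity)

def tail_utilization(total_blocks: int, wave_capacity: int) -> float:
    """Fraction of the last wave's capacity doing useful work; the idle
    remainder is the tail effect."""
    rem = total_blocks % wave_capacity
    return 1.0 if rem == 0 else rem / wave_capacity

# Hypothetical GPU: 80 SMs x 2 resident blocks per SM = 160 blocks/wave.
WAVE = 160

# A layer launching 200 blocks needs 2 waves, but the second wave is
# only 25% full -- most of the GPU idles during the tail wave.
print(waves(200, WAVE), tail_utilization(200, WAVE))  # 2 0.25

# Pruning the layer down to 161 blocks cuts the workload ~20%, yet it
# still needs 2 waves: FLOPs drop, latency barely moves.
print(waves(161, WAVE))  # 2

# Pruning one block further (160) eliminates the tail wave entirely,
# which is the kind of boundary a runtime-aware optimizer targets.
print(waves(160, WAVE))  # 1
```

In this model, a FLOPs-only objective treats 161 and 160 blocks as nearly identical, while a latency-aware objective sees a full wave of difference.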

Updated: 2020-11-10