A Full-stack Accelerator Search Technique for Vision Applications
arXiv - CS - Performance. Pub Date: 2021-05-26, arXiv:2105.12842
Dan Zhang, Safeen Huda, Ebrahim Songhori, Quoc Le, Anna Goldie, Azalia Mirhoseini

The rapidly-changing ML model landscape presents a unique opportunity for building hardware accelerators optimized for specific datacenter-scale workloads. We propose the Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including the hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. Although FAST can be used on any number and type of deep learning workloads, in this paper we focus on optimizing for a single vision model or a small set of vision models, resulting in significantly faster and more power-efficient designs than a general-purpose ML accelerator. When evaluated on EfficientNet, ResNet50v2, and OCR inference relative to a TPU-v3, designs generated by FAST and optimized for single workloads improve Perf/TDP (performance per watt of peak power) by over 6x in the best case and 4x on average. On a limited workload subset, FAST improves Perf/TDP by 2.85x on average, which drops to 2.35x for a single design optimized over the whole set of workloads. In addition, we demonstrate a potential 1.8x speedup opportunity for the TPU-v3 with improved scheduling.
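
The abstract frames FAST as a black-box search over hardware-software co-design decisions with Perf/TDP as the figure of merit. The Python sketch below illustrates that general idea with a toy random search over a hypothetical design space; the parameter names, the analytical cost model, and the random-search strategy are all illustrative assumptions, not the paper's actual search space, simulator, or search algorithm.

import random

# Hypothetical, simplified design space loosely inspired by the decisions the
# abstract lists (hardware datapath, scheduling, operation fusion, tensor
# padding). Every name and value here is an illustrative assumption.
DESIGN_SPACE = {
    "pe_rows":     [64, 128, 256],   # systolic-array height (datapath)
    "pe_cols":     [64, 128, 256],   # systolic-array width (datapath)
    "sram_mib":    [8, 16, 32, 64],  # on-chip buffer size
    "fuse_ops":    [False, True],    # compiler pass: operation fusion
    "pad_tensors": [False, True],    # compiler pass: tensor padding
}

def sample_design(rng):
    """Draw one random configuration from the design space."""
    return {name: rng.choice(opts) for name, opts in DESIGN_SPACE.items()}

def estimate_perf_per_tdp(design):
    """Toy analytical stand-in for Perf/TDP (higher is better).

    A real framework would evaluate each candidate with a detailed
    performance/power model or simulator; this made-up model only captures
    the rough shape of the trade-off: more compute raises both throughput
    and peak power, while fusion and padding improve utilization.
    """
    macs = design["pe_rows"] * design["pe_cols"]  # MAC units in the array
    utilization = 0.45
    if design["fuse_ops"]:
        utilization += 0.15  # fewer off-chip round-trips between fused ops
    if design["pad_tensors"]:
        utilization += 0.10  # tensor dims padded to fill the PE array
    perf = macs * utilization                            # effective MACs/cycle
    tdp_watts = 50 + 0.002 * macs + 0.5 * design["sram_mib"]
    return perf / tdp_watts

def random_search(n_trials=1000, seed=0):
    """Black-box search: keep the sampled design with the best Perf/TDP."""
    rng = random.Random(seed)
    best = max((sample_design(rng) for _ in range(n_trials)),
               key=estimate_perf_per_tdp)
    return best, estimate_perf_per_tdp(best)

if __name__ == "__main__":
    best, score = random_search()
    print("best design:", best)
    print("estimated Perf/TDP: %.2f" % score)

A production framework would replace estimate_perf_per_tdp with a faithful performance and power model of the candidate datapath plus a scheduler, and would typically use a smarter search strategy than uniform random sampling.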

Updated: 2021-05-28