PolyDL
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2021-01-08, DOI: 10.1145/3433103
Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Bharat Kaul, Gagandeep Goyal, Ramakrishna Upadrasta

Deep Neural Networks (DNNs) have revolutionized many aspects of our lives. The use of DNNs is becoming ubiquitous, in software for image recognition, speech recognition, speech synthesis, and language translation, to name a few. Training DNN architectures, however, is computationally expensive, and once a model is created, its use in the intended application (the inference task) is computationally heavy as well; inference must be fast for real-time use. To obtain high performance today, the norm is code for Deep Learning (DL) primitives optimized for specific architectures by expert programmers and exposed via libraries. However, given the constant emergence of new DNN architectures, creating hand-optimized code is expensive, slow, and not scalable. To address this performance-productivity challenge, in this article we present compiler algorithms that automatically generate high-performance implementations of DL primitives closely matching the performance of hand-optimized libraries. We develop novel data reuse analysis algorithms based on the polyhedral model to derive efficient execution schedules automatically. In addition, because most DL primitives use some variant of matrix multiplication at their core, we develop a flexible framework in which library implementations of matrix multiplication can be plugged in in lieu of a subset of the loops. We show that such a hybrid compiler-plus-minimal-library approach yields state-of-the-art performance. We also develop compiler algorithms that perform operator fusions to reduce data movement through the memory hierarchy of the computer system. Using Convolutional Neural Network (CNN) models and matrix multiplication operations, we demonstrate that our approach automatically creates high-performing DNN building blocks whose performance matches that of the hand-crafted kernels of Intel's oneDNN library on high-end CPUs. At the same time, our techniques take only a fraction of the time (1/20 or less) of AutoTVM, a deep learning auto-tuner, to create optimized implementations.
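To illustrate the hybrid compiler-plus-library idea described in the abstract — this is a minimal sketch with hypothetical names, not the paper's actual implementation — a compiler can tile the outer loops of a matrix multiplication and delegate each tile's computation to an optimized library GEMM (here `numpy`'s `@` operator stands in for a vendor microkernel):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Tile the outer i/j/k loops; a library GEMM call (np.dot / '@' here,
    standing in for an optimized microkernel) replaces the innermost loops."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each tile product is handed to the library GEMM.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

The point of the tiling is that the compiler chooses tile sizes from its data reuse analysis so each tile fits in cache, while the library supplies the highly tuned inner kernel.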
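The operator fusion mentioned in the abstract can likewise be sketched (again a hypothetical illustration, not the paper's code): instead of writing a matmul result to memory and then reading it back to apply an activation, the activation is applied per tile while the tile is still cache-resident, eliminating one full pass over memory:

```python
import numpy as np

def matmul_relu_fused(A, B, tile=64):
    """Fuse a ReLU into the matmul's tile loop: each output tile is
    activated while hot in cache, avoiding a second pass over C."""
    M, K = A.shape
    _, N = B.shape
    C = np.empty((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = A[i:i+tile, :] @ B[:, j:j+tile]          # tile GEMM
            C[i:i+tile, j:j+tile] = np.maximum(acc, 0.0)   # fused ReLU
    return C
```

In a real DL pipeline the same pattern fuses convolutions with bias-add and activation; the benefit is purely reduced data movement, since the arithmetic is unchanged.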

Updated: 2021-01-08