A Programmable Approach to Neural Network Compression
IEEE Micro (IF 3.6) Pub Date: 2020-09-01, DOI: 10.1109/mm.2020.3012391
Vinu Joseph, Ganesh L. Gopalakrishnan, Saurav Muralidharan, Michael Garland, Animesh Garg

Deep neural networks (DNNs) frequently contain far more weights, represented at higher precision, than are required for the specific task they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization, which reduce both model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this article, we introduce a programmable system for model compression called CONDENSA. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and a user-provided objective (such as minimization of running time), CONDENSA uses a novel Bayesian optimization-based algorithm to automatically infer desirable sparsities. Our experiments on four real-world DNNs demonstrate memory footprint and hardware runtime throughput improvements of 188× and 2.59×, respectively, using at most ten samples per search.
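The abstract's two key ideas, composing simple compression operators into a strategy and searching for a good target sparsity, can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the names (prune_magnitude, quantize_fp16, compose, search_sparsity, and the user-supplied evaluate function) are hypothetical stand-ins rather than CONDENSA's actual API, and the plain linear scan over candidate sparsities substitutes for the paper's Bayesian optimization.

    import copy
    import torch
    import torch.nn as nn

    def prune_magnitude(model, sparsity):
        # Zero out the smallest-magnitude weights so that roughly
        # `sparsity` of each prunable layer's weights become zero.
        for m in model.modules():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                w = m.weight.data
                k = int(sparsity * w.numel())
                if k > 0:
                    threshold = w.abs().flatten().kthvalue(k).values
                    w.mul_((w.abs() > threshold).to(w.dtype))
        return model

    def quantize_fp16(model, sparsity):
        # Represent all parameters at half precision; `sparsity` is
        # unused here but kept so every operator shares one signature.
        return model.half()

    def compose(*ops):
        # A strategy is the left-to-right composition of operators.
        def strategy(model, sparsity):
            for op in ops:
                model = op(model, sparsity)
            return model
        return strategy

    def search_sparsity(strategy, model, evaluate, baseline_acc,
                        tol=0.01, candidates=(0.5, 0.7, 0.8, 0.9, 0.95)):
        # CONDENSA infers sparsities via Bayesian optimization; this
        # linear scan is only a simple stand-in. It keeps the
        # highest-sparsity model whose accuracy stays within `tol`
        # of the uncompressed baseline.
        best = None
        for s in candidates:
            compressed = strategy(copy.deepcopy(model), s)
            if baseline_acc - evaluate(compressed) <= tol:
                best = (s, compressed)
        return best

    # Example usage: a prune-then-quantize strategy, in the spirit of
    # composing simple operators into a more complex one.
    # strategy = compose(prune_magnitude, quantize_fp16)
    # best = search_sparsity(strategy, model, evaluate, baseline_acc)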
