Accelerating Deep Learning Inference via Learned Caches
arXiv - CS - Performance. Pub Date: 2021-01-18, DOI: arxiv-2101.07344
Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han Cao, Shivaram Venkataraman, Aditya Akella

Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems. However, this high accuracy has been achieved by building deeper networks, posing a fundamental challenge to the low-latency inference desired by user-facing applications. Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads. We observe that caching hidden layer outputs of the DNN can introduce a form of late binding where inference requests consume only the amount of computation needed. This enables a mechanism for achieving low latencies, coupled with an ability to exploit temporal locality. However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches: caches that consist of simple ML models that are continuously updated. We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference. Results show that GATI can reduce inference latency by up to 7.69X on realistic workloads.
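The abstract does not give implementation details, but the core idea can be sketched. The snippet below is a minimal, hypothetical PyTorch illustration of a learned cache: a small model attached to one hidden layer predicts the final output, and a request exits early when that prediction is confident, otherwise the full network runs. The class names, the single cache position, and the confidence threshold are illustrative assumptions, not GATI's actual design.

```python
# Illustrative sketch of a "learned cache" for DNN inference.
# NOT GATI's implementation: all names and the fixed confidence
# threshold are assumptions made for this example.
import torch
import torch.nn as nn


class LearnedCache(nn.Module):
    """A simple ML model mapping one hidden-layer activation to the
    final prediction, standing in for a traditional key-value cache."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h)


class CachedInferenceModel(nn.Module):
    """Runs the backbone layer by layer; after the cache layer, a
    confident learned-cache prediction short-circuits the request."""

    def __init__(self, backbone_layers, cache_layer_idx: int,
                 hidden_dim: int, num_classes: int, threshold: float = 0.9):
        super().__init__()
        self.layers = nn.ModuleList(backbone_layers)
        self.cache_layer_idx = cache_layer_idx
        self.cache = LearnedCache(hidden_dim, num_classes)
        self.threshold = threshold  # confidence needed for a cache "hit"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.cache_layer_idx:
                logits = self.cache(x)
                conf, _ = logits.softmax(dim=-1).max(dim=-1)
                # Late binding: if every request in the batch is
                # confident, return now and skip the remaining layers.
                if bool((conf >= self.threshold).all()):
                    return logits
        return x  # cache miss: the full network ran; last layer emits logits


if __name__ == "__main__":
    # Toy backbone: four hidden blocks plus a final classifier.
    layers = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]
    layers.append(nn.Linear(64, 10))
    model = CachedInferenceModel(layers, cache_layer_idx=1,
                                 hidden_dim=64, num_classes=10)
    print(model(torch.randn(1, 64)).shape)
```

Per the abstract, the caches in GATI are continuously updated as the serving workload shifts, which is what lets them exploit temporal locality; this sketch omits that online training loop and handles only the inference path.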

Updated: 2021-01-20