E2bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services,IEEE Transactions on Parallel and Distributed Systems


IEEE Transactions on Parallel and Distributed Systems ( IF 5.3 ) Pub Date : 2021-06-01 , DOI: 10.1109/tpds.2020.3047638
Weihao Cui , Quan Chen , Han Zhao , Mengze Wei , Xiaoxin Tang , Minyi Guo

We aim to tackle existing problems in deep learning serving on GPUs from a systems perspective. GPUs have been widely adopted to serve online deep learning-based services that have stringent QoS (Quality-of-Service) requirements. However, emerging deep learning serving systems often result in poor responsiveness and low throughput of the inferences, which damage user experience and increase the number of GPUs required to host an online service. Our investigation shows that the poor batching operation and the lack of data transfer-computation overlap are the root causes of the poor responsiveness and low throughput. To this end, we propose E$^2$bird, a deep learning serving system that is comprised of a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates the unnecessary waiting of the batching operation and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving the GPU resource utilization. The batch scheduler organizes inferences elastically to guarantee the QoS. Our experimental results on an Nvidia Titan RTX GPU show that E$^2$bird reduces the response latency of inferences by up to 82.4 percent and improves the throughput by up to 62.8 percent while guaranteeing the QoS target compared with TensorFlow Serving.
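To make the idea of organizing inferences elastically under a QoS target more concrete, the following is a minimal Python sketch. It is not the E$^2$bird scheduler, whose algorithm the abstract does not describe. The linear latency model, the names `Request`, `estimated_latency_ms`, and `pick_batch`, and all numeric constants are assumptions chosen only for illustration. The sketch picks the largest batch from a FIFO queue whose estimated execution time still fits in the remaining QoS budget of the oldest waiting request.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    arrival_ms: float  # time at which the request entered the queue


def estimated_latency_ms(batch_size: int,
                         base_ms: float = 2.0,
                         per_item_ms: float = 0.5) -> float:
    """Hypothetical linear cost model: fixed overhead plus per-item cost."""
    return base_ms + per_item_ms * batch_size


def pick_batch(queue: List[Request], now_ms: float, qos_ms: float,
               max_batch: int = 32) -> List[Request]:
    """Return the largest queue prefix expected to finish within the QoS
    budget of the oldest request (assumed to be queue[0], FIFO order)."""
    if not queue:
        return []
    # Remaining time before the oldest request would violate its QoS target.
    slack_ms = qos_ms - (now_ms - queue[0].arrival_ms)
    size = min(len(queue), max_batch)
    # Shrink the batch until the estimate fits; always dispatch at least one.
    while size > 1 and estimated_latency_ms(size) > slack_ms:
        size -= 1
    return queue[:size]
```

With ten requests that all arrived at time 0 and a 5 ms target, `pick_batch(queue, 0.0, 5.0)` returns six requests under the assumed cost model, while a looser 100 ms target lets all ten go out together. A real serving system would replace the assumed linear model with measured per-model latency profiles.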


Updated: 2021-06-01
