Queueing analysis of GPU-based inference servers with dynamic batching: A closed-form characterization
Performance Evaluation (IF 2.2), Pub Date: 2021-05-01, DOI: 10.1016/j.peva.2020.102183
Yoshiaki Inoue

GPU-accelerated computing is a key technology for realizing high-speed inference servers based on deep neural networks (DNNs). An important characteristic of GPU-based inference is that its computational efficiency, in terms of both processing speed and energy consumption, increases drastically when multiple jobs are processed together in a batch. In this paper, we formulate GPU-based inference servers as a batch-service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server increases monotonically with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the inference server at as high a utilization level as possible while still meeting the latency requirement of inference jobs. We then derive a closed-form upper bound on the mean latency, which provides a simple characterization of the latency performance. Through simulations and numerical experiments, we show that the exact value of the mean latency is well approximated by this upper bound.
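The abstract does not reproduce the paper's formulas, but the intuition behind the monotonicity claim can be illustrated with a simple hypothetical service-time model: if a batch of b jobs occupies the GPU for s(b) = a + c*b time units (a fixed launch overhead a plus a per-job cost c), the per-job processing time s(b)/b = a/b + c shrinks as batches grow, and higher arrival rates naturally produce larger batches. The sketch below simulates such a batch-service queue with dynamic batching and estimates the mean latency that the paper's closed-form bound characterizes; the affine s(b), the maximum batch size B, and all parameter values are illustrative assumptions of ours, not taken from the paper.

    import random

    def simulate(lam, B, a, c, n_jobs=200_000, seed=1):
        """Estimate mean latency in a single-server batch-service queue.

        Jobs arrive as a Poisson process with rate lam. Whenever the
        server is free and jobs are waiting, it starts a batch with up
        to B of the oldest waiting jobs; a batch of size b occupies the
        server for a + c*b time units (illustrative affine model, an
        assumption for this sketch).
        """
        rng = random.Random(seed)
        # Generate Poisson arrival times.
        t = 0.0
        arrivals = []
        for _ in range(n_jobs):
            t += rng.expovariate(lam)
            arrivals.append(t)

        server_free = 0.0   # time at which the server next becomes idle
        latencies = []
        i = 0               # index of the oldest unserved job
        while i < n_jobs:
            # The next batch starts once the server is free and at
            # least one job has arrived.
            start = max(server_free, arrivals[i])
            # Dynamic batching: take every job that has arrived by
            # `start`, up to the maximum batch size B.
            b = 1
            while i + b < n_jobs and b < B and arrivals[i + b] <= start:
                b += 1
            finish = start + a + c * b
            latencies.extend(finish - arrivals[j] for j in range(i, i + b))
            server_free = finish
            i += b
        return sum(latencies) / len(latencies)

    if __name__ == "__main__":
        # Illustrative parameters only: setup cost a, per-job cost c.
        for lam in (0.5, 2.0, 4.0, 5.5):
            print(f"lambda={lam:>4}: mean latency ~ "
                  f"{simulate(lam, B=16, a=1.0, c=0.1):.3f}")

With these parameters the service capacity is B / s(B) = 16 / 2.6, roughly 6.15 jobs per time unit, so the arrival rates tested keep the queue stable; pushing lam toward that capacity shows latency rising while per-job efficiency improves, consistent with the efficiency-versus-latency trade-off the paper studies.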
