Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models
arXiv - CS - Performance. Pub Date: 2019-12-05, DOI: arxiv-1912.02322
Matthew LeMay and Shijian Li and Tian Guo

Deep learning models are increasingly used in end-user applications, supporting both novel features such as facial recognition and traditional features such as web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) on dedicated cloud servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers, and these resources can be under-utilized under dynamic workloads, inflating serving costs. One potential way to alleviate this problem is to allow hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. A key challenge is maximizing resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with distinct response-time Service Level Agreements (SLAs), and dynamic inference workloads. In this paper, we present Perseus, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented Perseus in Python atop a popular cloud inference server, Nvidia TensorRT Inference Server. Leveraging Perseus, we evaluated the inference throughput and cost of serving various models and demonstrated that multi-tenant model serving can reduce serving cost by up to 12%.
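The abstract describes measuring inference throughput and deriving serving cost from it, but no code accompanies this listing. The sketch below is a minimal, hypothetical illustration of that kind of measurement, not Perseus's actual client code: it drives concurrent requests against a stand-in infer() function, then converts the observed throughput into a cost per million inferences using an assumed hourly instance price. The infer() stub and the $3.06/hour rate (roughly an AWS p3.2xlarge on-demand price) are assumptions; a real harness would issue requests to the TensorRT Inference Server API instead of sleeping.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical hourly rate for the serving instance (assumption: roughly an
# AWS p3.2xlarge on-demand price; substitute the real instance price).
INSTANCE_PRICE_PER_HOUR = 3.06

def infer(payload):
    """Stand-in for one inference request; a real client would call the
    TensorRT Inference Server API over HTTP/gRPC instead of sleeping."""
    time.sleep(0.005)  # placeholder per-request latency
    return payload

def timed_infer(payload):
    """Issue one request and return its observed latency in seconds."""
    t0 = time.perf_counter()
    infer(payload)
    return time.perf_counter() - t0

def measure(num_requests=1000, concurrency=8):
    """Drive concurrent requests, then report throughput and unit cost."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_infer, range(num_requests)))
    elapsed = time.perf_counter() - start
    throughput = num_requests / elapsed  # inferences per second
    # Cost of one million inferences at the observed sustained throughput.
    cost_per_million = INSTANCE_PRICE_PER_HOUR / (throughput * 3600) * 1e6
    p99 = latencies[int(0.99 * len(latencies))]
    print(f"p99 latency: {p99 * 1e3:.1f} ms")
    print(f"throughput:  {throughput:.0f} inferences/s")
    print(f"cost:        ${cost_per_million:.2f} per million inferences")

if __name__ == "__main__":
    measure()
```

Under this framing, evaluating a multi-tenant placement reduces to repeating the measurement while other models share the same server and checking each model's p99 latency against its SLA.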

Updated: 2020-04-01