Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems
IEEE Transactions on Parallel and Distributed Systems (IF 5.6). Pub Date: 2021-05-11. DOI: 10.1109/tpds.2021.3079202
Gingfung Yeung, Damian Borowiec, Renyu Yang, Adrian Friday, Richard Harper, Peter Garraghan

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference that slows jobs down. In this article we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts the GPU utilization of heterogeneous DL jobs from the DL model's computation-graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric for making good placement decisions, in contrast to current approaches that reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5 percent in GPU resource utilization, 23.7-30.7 percent in makespan reduction, and 68.3 percent in job wait-time reduction.
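The core idea above — predict a job's GPU utilization from model features, then pick a co-location that avoids over-subscribing any device — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature-to-utilization predictor, the function names, and the utilization cap are all hypothetical assumptions (Horus trains its predictor on computation-graph features, which this toy linear proxy only stands in for).

```python
def predict_utilization(num_ops: int, num_params_m: float) -> float:
    """Toy stand-in for a learned predictor mapping computation-graph
    features (here: op count and parameter count in millions) to an
    expected GPU utilization percentage, clipped to 100."""
    return min(100.0, 0.01 * num_ops + 0.5 * num_params_m)


def place_job(job_util: float, gpu_loads: list[float], cap: float = 100.0):
    """Best-fit placement: choose the GPU whose predicted load, after
    adding the new job, stays under the cap with the least headroom
    (packing jobs tightly to keep utilization high). Returns the GPU
    index, or None if every placement would over-subscribe."""
    best, best_headroom = None, float("inf")
    for idx, load in enumerate(gpu_loads):
        headroom = cap - (load + job_util)
        if 0 <= headroom < best_headroom:
            best, best_headroom = idx, headroom
    return best
```

For example, a job predicted at 25 percent utilization placed against GPUs currently at 30, 70, and 90 percent would land on the second GPU (tightest fit without exceeding the cap), while a 50-percent job against GPUs at 80 and 90 percent would be queued rather than co-located.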

Updated: 2021-05-11