Learning-Driven Interference-Aware Workload Parallelization for Streaming Applications in Heterogeneous Cluster
IEEE Transactions on Parallel and Distributed Systems (IF 5.3) · Pub Date: 2021-01-01 · DOI: 10.1109/tpds.2020.3008725
Haitao Zhang, Xin Geng, Huadong Ma

In the past few years, with the rapid development of CPU-GPU heterogeneous computing, the problem of task scheduling in heterogeneous clusters has attracted a great deal of attention. The problem becomes more challenging with the need for efficient co-execution of tasks on GPUs. Moreover, the uncertainty of the heterogeneous cluster and the interference caused by resource contention among co-executing tasks can lead to unbalanced use of computing resources and further degrade the performance of the computing platform. In this article, we propose a two-stage task scheduling approach for streaming applications based on deep reinforcement learning and neural collaborative filtering, which considers fine-grained task division and task interference on the GPU. Specifically, the Learning-Driven Workload Parallelization (LDWP) method selects an appropriate execution node for mutually independent tasks. Using a deep Q-network, the cluster-level scheduling model is learned online to perform the currently optimal scheduling actions according to the runtime status of the cluster environment and the characteristics of the tasks. The Interference-Aware Workload Parallelization (IAWP) method assigns subtasks with dependencies to appropriate computing units, taking into account the interference among subtasks on the GPU by using neural collaborative filtering. To make the learning of the neural networks more efficient, we use pre-training in the two-stage scheduler. In addition, we use transfer learning to efficiently rebuild the task scheduling model from an existing model. We evaluate our learning-driven and interference-aware task scheduling approach on a prototype platform against other widely used methods. The experimental results show that the proposed strategy improves the throughput of the distributed computing system by 26.9 percent on average and improves GPU resource utilization by around 14.7 percent.
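To make the two-stage design described above more concrete, the following is a minimal sketch, assuming PyTorch is available; it is not the authors' code, and all module names, feature dimensions, and kernel-type ids (e.g. NUM_NODE_FEATURES, num_kernel_types) are illustrative assumptions rather than values from the paper. Stage one scores candidate execution nodes with a small Q-network over cluster runtime status and task features (the LDWP role); stage two scores subtask co-execution pairs on a GPU with a neural-collaborative-filtering-style interference predictor (the IAWP role).

```python
# Minimal sketch of the two-stage idea (illustrative assumptions, not the paper's model).
import torch
import torch.nn as nn

NUM_NODE_FEATURES = 8   # assumed per-node runtime status features (load, memory, ...)
NUM_TASK_FEATURES = 6   # assumed task characteristics (input size, kernel type, ...)

class QScheduler(nn.Module):
    """Stage 1 (LDWP-like): Q-values for placing a task on each candidate node."""
    def __init__(self, num_nodes: int):
        super().__init__()
        in_dim = num_nodes * NUM_NODE_FEATURES + NUM_TASK_FEATURES
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, num_nodes),   # one Q-value per execution node
        )

    def forward(self, cluster_state, task_feat):
        # Unbatched input for simplicity: flatten node features and append task features.
        x = torch.cat([cluster_state.reshape(-1), task_feat], dim=-1)
        return self.net(x)

class InterferencePredictor(nn.Module):
    """Stage 2 (IAWP-like): NCF-style score for co-running two subtask kernels on one GPU."""
    def __init__(self, num_kernel_types: int, emb_dim: int = 16):
        super().__init__()
        self.emb = nn.Embedding(num_kernel_types, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, kernel_a, kernel_b):
        pair = torch.cat([self.emb(kernel_a), self.emb(kernel_b)], dim=-1)
        return self.mlp(pair).squeeze(-1)   # predicted interference / slowdown score

# Usage sketch: greedy decisions for one task and one dependent subtask.
num_nodes = 4
scheduler = QScheduler(num_nodes)
cluster_state = torch.rand(num_nodes, NUM_NODE_FEATURES)
task_feat = torch.rand(NUM_TASK_FEATURES)
node = scheduler(cluster_state, task_feat).argmax().item()      # stage 1: pick a node

interf = InterferencePredictor(num_kernel_types=10)
running = torch.tensor([3])          # assumed kernel type already running on the chosen GPU
candidate = torch.tensor([7])        # assumed kernel type of the subtask to be placed
score = interf(running, candidate)   # stage 2: lower score -> better co-execution partner
print(node, score.item())
```

In the approach described in the abstract, these models are additionally pre-trained and adapted to new clusters via transfer learning; the sketch omits training entirely and only illustrates the greedy inference-time decisions of the two stages.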
