Deep and reinforcement learning for automated task scheduling in large-scale cloud computing systems
Concurrency and Computation: Practice and Experience (IF 2), Pub Date: 2020-07-27, DOI: 10.1002/cpe.5919
Gaith Rjoub, Jamal Bentahar, Omar Abdel Wahab, Ahmed Saleh Bataineh

Cloud computing is undeniably becoming the main computing and storage platform for today's major workloads. From Internet of Things and Industry 4.0 workloads to big data analytics and decision-making jobs, cloud systems receive a massive number of tasks every day that need to be simultaneously and efficiently mapped onto cloud resources. Deriving a task scheduling mechanism that minimizes both task execution delay and cloud resource utilization is therefore of prime importance. Recently, the concept of cloud automation has emerged to reduce manual intervention and improve resource management in large-scale cloud computing workloads. In this article, we capitalize on this concept and propose four deep and reinforcement learning-based scheduling approaches that automate the scheduling of large-scale workloads onto cloud computing resources while reducing both resource consumption and task waiting time. These approaches are: reinforcement learning (RL), deep Q-networks (DQN), recurrent neural network long short-term memory (RNN-LSTM), and deep reinforcement learning combined with LSTM (DRL-LSTM). Experiments conducted using real-world datasets from the Google Cloud Platform revealed that DRL-LSTM outperforms the other three approaches. The experiments also showed that DRL-LSTM reduces the CPU usage cost by up to 67% compared with shortest job first (SJF), and by up to 35% compared with both round robin (RR) and improved particle swarm optimization (PSO). Moreover, our DRL-LSTM solution reduces the RAM usage cost by up to 72% compared with SJF, by up to 65% compared with RR, and by up to 31.25% compared with the improved PSO.
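The abstract gives no implementation details, but the following minimal PyTorch sketch illustrates the general shape of a DRL-LSTM scheduler as described: a deep Q-network whose state encoder is an LSTM over a window of recent cluster observations, choosing which VM receives the next task. All names, dimensions, and the reward shape (a negative weighted sum of CPU cost, RAM cost, and waiting time) are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a DQN-style scheduler with an LSTM state encoder, in the
# spirit of the paper's DRL-LSTM approach. Names, sizes, and the reward
# shape are illustrative assumptions.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

class DRLLSTMScheduler(nn.Module):
    """Q-network: an LSTM encodes a window of recent cluster observations
    (e.g., per-VM CPU/RAM load, queue lengths); a linear head scores each
    candidate VM for the incoming task."""

    def __init__(self, obs_dim: int, hidden_dim: int, num_vms: int):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_vms)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, window, obs_dim) -> Q-values: (batch, num_vms)
        _, (h_n, _) = self.lstm(obs_seq)
        return self.q_head(h_n[-1])

# Hypothetical sizes; a real setup would derive these from cluster traces
# such as the Google cluster data the paper evaluates on.
OBS_DIM, HIDDEN, NUM_VMS, WINDOW = 8, 64, 4, 10
GAMMA, EPSILON = 0.99, 0.1

net = DRLLSTMScheduler(OBS_DIM, HIDDEN, NUM_VMS)
optimizer = optim.Adam(net.parameters(), lr=1e-3)
replay: deque = deque(maxlen=10_000)  # (s, a, r, s') transitions

def select_vm(obs_seq: torch.Tensor) -> int:
    """Epsilon-greedy placement: pick the VM with the highest Q-value."""
    if random.random() < EPSILON:
        return random.randrange(NUM_VMS)
    with torch.no_grad():
        return int(net(obs_seq.unsqueeze(0)).argmax().item())

def train_step(batch_size: int = 32) -> None:
    """One DQN update from replayed transitions."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example step: observe, place a task, record the transition, and learn.
s = torch.randn(WINDOW, OBS_DIM)
vm = select_vm(s)
s2 = torch.randn(WINDOW, OBS_DIM)
reward = torch.tensor(-1.0)  # assumed: -(w1*cpu + w2*ram + w3*wait)
replay.append((s, torch.tensor(vm), reward, s2))
train_step()
```

A plain RL or DQN variant would follow the same loop with a simpler state representation in place of the LSTM encoder, which matches how the abstract positions the four approaches relative to one another.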

Last updated: 2020-07-27