PowerCoord: Power capping coordination for multi-CPU/GPU servers using reinforcement learning
Sustainable Computing: Informatics and Systems (IF 3.8), Pub Date: 2020-07-03, DOI: 10.1016/j.suscom.2020.100412
Reza Azimi, Chao Jing, Sherief Reda

Modern supercomputers and cloud providers rely on server nodes equipped with multiple CPU sockets and general-purpose GPUs (GPGPUs) to handle the high demand for intensive computation. These nodes consume much more power than commodity servers, and integrating them with the power capping systems used in modern clusters presents new challenges. In this paper, we propose a new power capping controller, PowerCoord, that is specifically designed for servers with multiple CPU sockets and GPUs that run multiple jobs at a time. PowerCoord coordinates among the various power domains (e.g., CPU sockets and GPUs) inside a server node to meet target power caps while seeking to maximize throughput. Our approach also takes job deadlines and priorities into consideration. Because performance modeling for co-located jobs is error-prone, PowerCoord uses a learning method: it has a number of heuristic policies for allocating power among the various CPUs and GPUs, and it uses reinforcement learning to select among these policies at runtime. Based on the observed state of the system, PowerCoord shifts the distribution of selected policies. We implement our power cap controller on a real multi-CPU/GPU server with low overhead, and we demonstrate that it is able to meet target power caps while maximizing throughput and balancing other demands such as priorities and deadlines. Our results show that PowerCoord improves server throughput by 18% on average compared with the case where power is not coordinated among CPU/GPU domains, and by 11% on average compared with prior work that uses a heuristic approach to coordinate power among domains.
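The abstract describes selecting among heuristic power-allocation policies with reinforcement learning and shifting the selection distribution based on observed state. The following is a minimal sketch of that general idea only, not the authors' implementation: the two policies (`equal_share`, `proportional_share`), the softmax-over-value-estimates selector, the state label, and the `toy_throughput` reward are all assumptions introduced here for illustration.

```python
import math
import random

# Hypothetical heuristic policies (not from the paper): each maps per-domain
# power demand in watts to per-domain caps that sum to the node-level cap.
def equal_share(demand, cap):
    return [cap / len(demand)] * len(demand)

def proportional_share(demand, cap):
    total = sum(demand)
    if total == 0:
        return equal_share(demand, cap)
    return [cap * d / total for d in demand]

POLICIES = [equal_share, proportional_share]

def toy_throughput(caps, demand):
    # Assumed stand-in for measured throughput: power a domain can actually use.
    return sum(min(c, d) for c, d in zip(caps, demand))

class PolicySelector:
    """Tabular value estimate per (state, policy), sampled through a softmax."""

    def __init__(self, n_policies, lr=0.1, temperature=0.1, seed=0):
        self.n = n_policies
        self.lr = lr
        self.temperature = temperature
        self.values = {}  # state -> list of value estimates, one per policy
        self.rng = random.Random(seed)

    def distribution(self, state):
        q = self.values.setdefault(state, [0.0] * self.n)
        m = max(q)  # subtract the max for numerical stability
        w = [math.exp((v - m) / self.temperature) for v in q]
        s = sum(w)
        return [x / s for x in w]

    def select(self, state):
        weights = self.distribution(state)
        return self.rng.choices(range(self.n), weights=weights)[0]

    def update(self, state, policy, reward):
        # Move the estimate for the chosen policy toward the observed reward.
        q = self.values.setdefault(state, [0.0] * self.n)
        q[policy] += self.lr * (reward - q[policy])

if __name__ == "__main__":
    cap, demand, state = 200.0, [50.0, 250.0], "gpu_heavy"
    selector = PolicySelector(len(POLICIES))
    for _ in range(200):
        p = selector.select(state)
        caps = POLICIES[p](demand, cap)
        selector.update(state, p, toy_throughput(caps, demand) / cap)
    print(selector.distribution(state))
```

In this toy setting the selector gradually favors whichever policy yields more usable power under the cap for the observed state; a real controller would replace the reward with measured throughput and fold in deadlines and priorities, which the abstract mentions but does not specify.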




Updated: 2020-07-25