DiG: enabling out-of-band scalable high-resolution monitoring for data-center analytics, automation and control (extended)
Cluster Computing (IF 4.4), Pub Date: 2021-01-07, DOI: 10.1007/s10586-020-03219-7
Antonio Libri, Andrea Bartolini, Luca Benini

Data centers are increasing in size and complexity, and we need scalable approaches to support their automated analysis and control. Performance counters and power consumption are their key "vital signs". State-of-the-Art (SoA) monitoring systems provide built-in tools to collect performance measurements, and custom solutions to gain insight into their power consumption. However, as measurement resolution increases (in time and space), the amount of measurement data to handle grows with it, and new challenges arise: bottlenecks in network bandwidth and storage, and software overhead on the monitoring units. To face these challenges we propose a novel monitoring platform for data centers, which enables real-time high-resolution profiling and analytics, both at the edge (node-level analysis) and on a centralized unit (cluster-level analysis). High-resolution profiling here means all available performance counters and the entire signal bandwidth of the power consumption at the plug, sampled as fast as every 20 µs, with an error below 1%. The monitoring infrastructure is completely out-of-band, scalable, technology agnostic and low cost, and it is already installed in a SoA high-performance compute cluster (D.A.V.I.D.E., ranked 18th in the Green500 list of November 2017).
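
To give a rough sense of why high-resolution sampling strains network bandwidth and storage, the short Python sketch below converts the 20 µs sampling period quoted in the abstract into a sample rate and a raw data rate. Only the 20 µs figure comes from the abstract; the sample size (2 bytes) and the node count (100) are hypothetical values chosen for illustration, not figures from the paper, and the function names are likewise made up for this example.

```python
def samples_per_second(period_us: int) -> float:
    """Sample rate for a given sampling period expressed in microseconds."""
    return 1_000_000 / period_us


def raw_bytes_per_second(period_us: int, bytes_per_sample: int, nodes: int = 1) -> float:
    """Uncompressed power-trace data rate across `nodes` monitored nodes."""
    return samples_per_second(period_us) * bytes_per_sample * nodes


PERIOD_US = 20          # from the abstract: one power sample every 20 µs
BYTES_PER_SAMPLE = 2    # assumption: 16-bit samples (not stated in the abstract)
NODES = 100             # assumption: hypothetical cluster size

rate = samples_per_second(PERIOD_US)
per_node = raw_bytes_per_second(PERIOD_US, BYTES_PER_SAMPLE)
cluster = raw_bytes_per_second(PERIOD_US, BYTES_PER_SAMPLE, NODES)

print(f"sample rate:      {rate:.0f} samples/s")
print(f"per-node stream:  {per_node / 1e3:.0f} kB/s")
print(f"cluster stream:   {cluster / 1e6:.0f} MB/s ({cluster * 8 / 1e6:.0f} Mbit/s)")
```

Under these assumptions a single power channel already produces 50,000 samples per second per node, which is one way to see why the abstract places part of the analytics at the edge rather than shipping every raw sample to a central unit.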




Updated: 2021-01-08