Spatially Bursty I/O on Supercomputers: Causes, Impacts and Solutions, IEEE Transactions on Parallel and Distributed Systems


Spatially Bursty I/O on Supercomputers: Causes, Impacts and Solutions
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2020-12-01, DOI: 10.1109/tpds.2020.3005572
Jie Yu, Wenxiang Yang, Fang Wang, Dezun Dong, Jinghua Feng, Yuqi Li

Understanding the I/O characteristics of supercomputers is crucial for accurately characterizing I/O workloads and uncovering potential I/O inefficiencies. We collect and analyze I/O traces from two production supercomputers and find that I/O traffic peaks not only occur in short periods of time but also originate from a minority of adjacent compute nodes, a pattern we call spatially bursty I/O. Modern supercomputers widely adopt an I/O forwarding architecture, in which an I/O node performs I/O on behalf of a subset of nearby compute nodes, so spatially bursty I/O causes significant load imbalance and underutilization on the I/O nodes. To address these problems, we quantitatively analyze the two causes of spatially bursty I/O: the uneven distribution of I/O across a job's processes and the uneven distribution of a job's nodes across the system. We propose two solutions that mobilize more I/O nodes to participate in a job's I/O activity. (1) We change the I/O node mapping so that adjacent compute nodes use different I/O nodes instead of the same one. (2) Using the job's I/O characteristics extracted from historical I/O traces, we distribute the compute nodes of data-intensive jobs more sparsely so that they utilize more I/O nodes. Extensive evaluations of both solutions show that they can further exploit the potential of the I/O forwarding layer. We have deployed the proposed I/O node mapping on a production supercomputer for 11 months. Our experience shows that it can effectively improve I/O performance, balance loads, and alleviate I/O interference.
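The effect of solution (1) can be illustrated with a toy model. The sketch below is not the paper's implementation: the abstract does not give the actual mapping function, so `block_mapping`, `interleaved_mapping`, and the node counts are hypothetical names and values chosen only to contrast a contiguous mapping (adjacent compute nodes share one I/O node) with an interleaved one (adjacent compute nodes use different I/O nodes).

```python
from collections import Counter

# Hypothetical system size, chosen only for illustration.
NUM_COMPUTE = 64  # compute nodes, numbered 0..63
NUM_IO = 8        # I/O (forwarding) nodes, numbered 0..7


def block_mapping(compute_node: int) -> int:
    """Contiguous mapping: each run of adjacent compute nodes shares one I/O node."""
    group_size = NUM_COMPUTE // NUM_IO  # assumes NUM_COMPUTE is a multiple of NUM_IO
    return compute_node // group_size


def interleaved_mapping(compute_node: int) -> int:
    """Interleaved mapping (an assumed round-robin scheme): neighbors use different I/O nodes."""
    return compute_node % NUM_IO


def io_node_load(job_nodes, mapping) -> Counter:
    """Count how many of the job's compute nodes forward through each I/O node."""
    return Counter(mapping(node) for node in job_nodes)


# A job allocated on 8 adjacent compute nodes, as in the spatially bursty case.
job_nodes = range(16, 24)

block_load = io_node_load(job_nodes, block_mapping)
interleaved_load = io_node_load(job_nodes, interleaved_mapping)

print("block:      ", dict(block_load))
print("interleaved:", dict(interleaved_load))
```

In this toy setup the contiguous mapping sends all eight of the job's compute nodes through a single I/O node, while the interleaved mapping spreads them over all eight I/O nodes. Real systems would also need to account for network topology and locality, which this sketch ignores.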


Updated: 2020-12-01
