High-Quality Fault Resiliency in Fat-Trees
IEEE Micro (IF 3.6), Pub Date: 2020-01-01, DOI: 10.1109/mm.2019.2949978
John Gliksberg, Antoine Capra, Alexandre Louvet, Pedro Javier Garcia, Devan Sohier

Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this article, we present Dmodc, a fast deterministic routing algorithm for parallel generalized fat trees (PGFTs), which minimizes congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete rerouting of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralized fabric management react to faults with high-quality routing tables and has no impact on running applications in current and future very large scale high-performance computing clusters.
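
To give a feel for what computing forwarding entries "with a closed-form arithmetic formula" can look like, here is a minimal sketch of the classic destination-mod-k rule for a full k-ary n-tree. This is not the Dmodc formula, which the abstract does not state, and it ignores both PGFT parameters and faults. The function names, the level numbering (0 = leaf switches), and the uniform `arity` are assumptions made purely for illustration.

```python
def dmodk_up_port(dest: int, level: int, arity: int) -> int:
    """Up-port index a switch at `level` would pick for destination `dest`.

    Illustrative destination-mod-k rule for a fault-free k-ary n-tree
    (an assumption, not the Dmodc formula from the article):
        port = floor(dest / arity**level) mod arity
    """
    if dest < 0 or level < 0 or arity < 1:
        raise ValueError("dest and level must be >= 0 and arity >= 1")
    return (dest // arity ** level) % arity


def up_ports_along_path(dest: int, levels: int, arity: int) -> list[int]:
    """Up-port chosen at each level on the way up towards `dest`."""
    return [dmodk_up_port(dest, level, arity) for level in range(levels)]


if __name__ == "__main__":
    # Destination 13 in a 4-ary tree, looking at three switch levels.
    print(up_ports_along_path(13, 3, 4))  # [1, 3, 0]
```

Because the port depends only on the destination and the switch level, every table entry can be computed independently with a few integer operations, which is the kind of property that makes fast full rerouting plausible.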

Updated: 2020-01-01