Dart: Divide and Specialize for Fast Response to Congestion in RDMA-Based Datacenter Networks, IEEE/ACM Transactions on Networking

IEEE/ACM Transactions on Networking (IF 3.7), Pub Date: 2020-01-14, DOI: 10.1109/tnet.2019.2961671
Jiachen Xue, Muhammad Usama Chaudhry, Balajee Vamanan, T. N. Vijaykumar, Mithuna Thottethodi

Though Remote Direct Memory Access (RDMA) promises to reduce datacenter network latencies significantly compared to TCP (e.g., 10$\times$), end-to-end congestion control in the presence of incasts is a challenge. Targeting the full generality of the congestion problem, previous schemes rely on slow, iterative convergence to the appropriate sending rates (e.g., TIMELY takes 50 RTTs). Several papers have shown that even in oversubscribed datacenter networks most congestion occurs at the receiver. Accordingly, we propose a divide-and-specialize approach, called Dart, which isolates the common case of receiver congestion and further subdivides the remaining in-network congestion into the simpler spatially-localized and the harder spatially-dispersed cases. For receiver congestion, we propose direct apportioning of sending rates (DASR) in which a receiver for $n$ senders directs each sender to cut its rate by a factor of $n$, converging in only one RTT. For the spatially-localized case, Dart provides fast (under one RTT) response by adding novel switch hardware for in-order flow deflection (IOFD) because RDMA disallows packet reordering on which previous load balancing schemes rely. For the uncommon spatially-dispersed case, Dart falls back to DCQCN. Small-scale testbed measurements and at-scale simulations, respectively, show that Dart achieves 60% (2.5$\times$) and 79% (4.8$\times$) lower $99^{th}$-percentile latency, and similar and 58% higher throughput than InfiniBand, and TIMELY and DCQCN.
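The DASR idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apportion_rates` and the setup (each sender initially transmitting at the receiver's line rate, and the receiver knowing the number of active senders $n$) are assumptions made purely for illustration. Under those assumptions, a single message from the receiver telling each sender to divide its rate by $n$ brings the aggregate arrival rate back to the line rate in one step.

```python
from typing import List


def apportion_rates(current_rates_gbps: List[float]) -> List[float]:
    """Hypothetical receiver-side rate apportioning (illustrative only).

    The receiver counts its n active senders and directs each one to cut
    its current sending rate by a factor of n, as the abstract describes.
    """
    n = len(current_rates_gbps)
    if n == 0:
        return []
    # Each sender's new rate is its old rate divided by n.
    return [rate / n for rate in current_rates_gbps]


# Assumed scenario: a 4-to-1 incast in which every sender starts at the
# receiver's 100 Gbps line rate, so the receiver is oversubscribed 4x.
line_rate_gbps = 100.0
before = [line_rate_gbps] * 4
after = apportion_rates(before)

print("aggregate before:", sum(before), "Gbps")
print("per-sender after:", after)
print("aggregate after: ", sum(after), "Gbps")
```

In this assumed scenario the aggregate drops from four times the line rate to exactly the line rate after one round of feedback, which is the intuition behind the one-RTT convergence claim. How the actual scheme counts senders or handles senders that are not at line rate is not stated in the abstract.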

Updated: 2020-02-18
