Network Interface Architecture for Remote Indirect Memory Access (RIMA) in Datacenters
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2020-05-30, DOI: 10.1145/3374215
Jiachen Xue, T. N. Vijaykumar, Mithuna Thottethodi

Remote Direct Memory Access (RDMA) fabrics such as InfiniBand and Converged Ethernet report latencies roughly 50× shorter than TCP's. As such, RDMA is a potential replacement for TCP in datacenters (DCs) running low-latency applications such as Web search and memcached. InfiniBand's Shared Receive Queues (SRQs), which use two-sided send/recv verbs (i.e., channel semantics), reduce the amount of pre-allocated, pinned memory for message buffers (despite optimizations such as InfiniBand's on-demand paging (ODP)). However, SRQs are fundamentally limited to a single message size per queue, which incurs either memory wastage or significant programmer burden for typical DC traffic of an arbitrary number (level of burstiness) of messages of arbitrary size. We propose remote indirect memory access (RIMA), which avoids these pitfalls by providing (1) network interface card (NIC) microarchitecture support for novel queue semantics and (2) a new "verb" called append. To append a sender's message to a shared queue, the receiver NIC atomically increments the queue's tail pointer by the incoming message's size and places the message in the newly created space. As in traditional RDMA, the NIC is responsible for pointer lookup, address translation, and enforcing virtual memory protections. This indirection of specifying a queue (and not its tail pointer, which remains hidden from senders) handles the typical DC traffic of arbitrary senders sending an arbitrary number of messages of arbitrary size. Because RIMA's simple hardware adds only 1–2 ns to the multi-microsecond message latency, RIMA achieves the same message latency and throughput as InfiniBand SRQ with unlimited buffering. Running memcached traffic on a 30-node InfiniBand cluster, we show that at similar, low programmer effort, RIMA achieves a significantly smaller memory footprint than SRQ. While SRQ can be crafted to minimize memory footprint by expending significant programming effort, RIMA provides those benefits with little programmer effort. For memcached traffic, a high-performance key-value cache (FastKV) using RIMA achieves either 3× lower 96th-percentile latency or significantly better throughput or memory footprint than FastKV using RDMA.
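To make the append semantics concrete, the following is a minimal sketch in C (not from the paper; the struct layout and the names rima_queue and rima_append are hypothetical) of what the receiver-side handling of an append might look like: space in the shared queue is reserved by atomically advancing the tail pointer by the incoming message's size, and the message is then placed in the newly created region. Address translation, protection checks, wrap-around, and flow control are omitted.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical receive-queue state tracked per shared queue.
 * Senders name the queue; the tail pointer stays hidden from them. */
struct rima_queue {
    uint8_t        *base;      /* pinned buffer backing the queue        */
    size_t          capacity;  /* total bytes available in the buffer    */
    _Atomic size_t  tail;      /* byte offset of the next free location  */
};

/* Sketch of handling one incoming append: atomically reserve space equal
 * to the message size, then copy the message into the reserved region.
 * The paper describes an atomic increment of the tail; this sketch uses a
 * compare-and-swap loop so it can also check capacity. */
static int rima_append(struct rima_queue *q, const void *msg, size_t len)
{
    size_t old_tail = atomic_load(&q->tail);
    do {
        if (old_tail + len > q->capacity)
            return -1;         /* queue full: real hardware would back-pressure */
    } while (!atomic_compare_exchange_weak(&q->tail, &old_tail, old_tail + len));

    /* On the real NIC this copy is a DMA into translated, protected memory. */
    memcpy(q->base + old_tail, msg, len);
    return 0;
}

Under these assumptions, a sender would issue an append naming only the queue, never a tail offset, which is what lets many bursty senders with differently sized messages share a single receive buffer without per-size queues.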
