Data prefetch for fast NDN software routers based on hash table-based forwarding tables
Introduction
A software router, built on a hardware platform based on a commercial off-the-shelf (COTS) computer, has become feasible thanks to recent advances in multi-core CPUs and fast networking technologies for COTS computers. For notational simplicity, hereinafter, a COTS computer is simply referred to as a computer. Fast IP packet forwarding has been enabled by storing data structures for IP forwarding on a fast Static Random Access Memory (SRAM) device [1], [2]. Compact data structures have become a research issue due to the need to keep up with the increasing number of IP prefixes in the Internet. Most studies have addressed compact trie-based data structures, such as multi-bit tries [3], by replacing consecutive elements in a trie with a single element. Such efforts make it possible to store the growing number of IP prefixes, e.g., 7 × 10⁵ prefixes [4], on the latest SRAM devices.
Recently, studies on high-speed algorithms and compact data structures have been revisited due to the emergence of a new Internet architecture called Named Data Networking (NDN) [5], wherein rich data, including large video data and small sensor data, are delivered by a single architecture. Fast NDN packet forwarding is not trivial compared with IP packet forwarding in terms of memory space: name-based forwarding needs a larger forwarding table than an IP one in order to store about 2.1 × 10⁸ name prefixes [6], and per-packet caching requires additional memory space to store packets. Therefore, NDN data structures need to be stored on a slow Dynamic Random Access Memory (DRAM) device rather than an SRAM device even if compact trie-based data structures are used to store the forwarding table [7]. This implies that hiding the latency of accessing NDN data structures on the DRAM device, rather than compacting those data structures, is the key to fast NDN packet forwarding.
Following this implication, in our previous paper [8], we first identified the DRAM access latency of NDN data structures as the true bottleneck of a current state-of-the-art NDN software implementation [9] through an analysis conducted at the level of the CPU instruction pipeline. We then proposed a prefetch algorithm for NDN data structures to hide this latency. The key idea is to handle multiple consecutive packets in a batch so that prefetching these data structures from a DRAM device overlaps with the computation for other packets. We developed the prefetch algorithm for two consecutive packets and experimentally demonstrated that it successfully hides most of the DRAM access latency, so that the NDN packet forwarding rate increases linearly with the number of CPU cores.
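The key idea, overlapping the fetch for one packet with the computation for another, can be illustrated with a minimal sketch. The FIB layout, the FNV-1a hash, and the helper names below are hypothetical simplifications, not the paper's implementation:

```c
/* Minimal sketch of two-packet batch processing with software prefetch.
 * All data structures here are illustrative, not the paper's code. */
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024

struct fib_entry {
    char prefix[32];
    int  next_hop;
    int  valid;
};

static struct fib_entry fib[TABLE_SIZE];

/* Toy hash (FNV-1a); a real router would use a stronger keyed hash. */
static uint32_t hash_name(const char *name) {
    uint32_t h = 2166136261u;
    for (; *name; name++) { h ^= (uint8_t)*name; h *= 16777619u; }
    return h % TABLE_SIZE;
}

static int lookup(uint32_t idx, const char *name) {
    struct fib_entry *e = &fib[idx];
    return (e->valid && strcmp(e->prefix, name) == 0) ? e->next_hop : -1;
}

/* Process two packets in a batch: both bucket addresses are computed and
 * prefetched before either lookup touches the data, so the two DRAM
 * fetches overlap with each other and with the hash computation. */
void forward_batch(const char *names[2], int out[2]) {
    uint32_t idx0 = hash_name(names[0]);
    uint32_t idx1 = hash_name(names[1]);
    __builtin_prefetch(&fib[idx0], 0, 3);
    __builtin_prefetch(&fib[idx1], 0, 3);
    out[0] = lookup(idx0, names[0]);
    out[1] = lookup(idx1, names[1]);
}
```

Handled packet-by-packet, the bucket fetch of each packet would stall the pipeline; batching lets the prefetch of one packet proceed while the other is being hashed and compared.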
In this paper, we extend our previous studies [8], [10] in terms of algorithm design and performance analysis to evaluate the prefetch algorithm clearly and in depth. The main contributions of this paper can be summarized as follows:
- •
The design rationale of the prefetch algorithm is strengthened by carefully choosing a data structure appropriate for hiding DRAM access latency. We evaluate three representative data structures for name-based forwarding, a hash table [9], a trie [7] and a Bloom filter [11], from the perspective of how easily DRAM access latency can be hidden. The evaluation metric is the number of dependent DRAM accesses, i.e., accesses that must be performed sequentially. The quantitative comparison of the three data structures reveals that a hash table incurs the smallest number of dependent DRAM accesses and is thus the best of the three for hiding DRAM access latency.
- •
We design a prefetch algorithm for hash table-based forwarding tables that handles two consecutive packets in a batch. This algorithm hides the DRAM access latency of fetching a hash entry, which is difficult to hide with the prefetch instruction alone when packets are handled one by one.
- •
We design a sophisticated prefetch algorithm for Longest Prefix Matching (LPM) in the FIB, which is potentially time-consuming. The algorithm avoids calculating the memory addresses of FIB entries whose prefixes are shorter than the matching prefix.
- •
We evaluate representative cache eviction and admission algorithms from the perspectives of both hiding DRAM access latency and achieving high cache hit rates. The evaluation reveals that TinyLFU [12] with FIFO eviction is an appropriate cache algorithm because this combination simultaneously achieves a small number of dependent DRAM accesses and high cache hit rates.
- •
We experimentally show that processing two consecutive packets in a batch is sufficient for hiding DRAM access latency on modern computers. We also experimentally show that increasing the number of consecutive packets in a batch hides DRAM access latency even in cases where the latency becomes large.
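To make the LPM-related contributions above concrete, here is a minimal sketch of longest name prefix matching over a hash-table FIB in which the buckets of all candidate prefixes are hashed and prefetched up front, so their fetches overlap before the longest-first probing begins. The flat open-addressed table and all names are illustrative assumptions, not the paper's code:

```c
/* Sketch: LPM over a hash-table FIB with up-front bucket prefetching.
 * Illustrative only; real FIBs need collision handling and keyed hashes. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024
#define MAX_COMPONENTS 8

struct fib_entry { char prefix[64]; int next_hop; int valid; };
static struct fib_entry fib[TABLE_SIZE];

static uint32_t hash_name(const char *s, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) { h ^= (uint8_t)s[i]; h *= 16777619u; }
    return h % TABLE_SIZE;
}

void fib_insert(const char *prefix, int next_hop) {
    uint32_t idx = hash_name(prefix, strlen(prefix));
    strcpy(fib[idx].prefix, prefix);
    fib[idx].next_hop = next_hop;
    fib[idx].valid = 1;
}

int fib_lpm(const char *name) {
    size_t lens[MAX_COMPONENTS];
    uint32_t idxs[MAX_COMPONENTS];
    int n = 0;

    /* Boundaries of every '/'-delimited prefix, shortest to longest. */
    for (size_t i = 1; name[i] != '\0'; i++)
        if (name[i] == '/' && n < MAX_COMPONENTS)
            lens[n++] = i;
    if (n < MAX_COMPONENTS)
        lens[n++] = strlen(name);

    /* Hash all candidates and prefetch their buckets in one burst. */
    for (int k = 0; k < n; k++) {
        idxs[k] = hash_name(name, lens[k]);
        __builtin_prefetch(&fib[idxs[k]], 0, 3);
    }

    /* Probe longest prefix first; stop at the first match, so entries
     * for shorter prefixes are never examined further. */
    for (int k = n - 1; k >= 0; k--) {
        struct fib_entry *e = &fib[idxs[k]];
        if (e->valid && strncmp(e->prefix, name, lens[k]) == 0 &&
            e->prefix[lens[k]] == '\0')
            return e->next_hop;
    }
    return -1;   /* no matching prefix */
}
```

Because every candidate bucket address is computable from the name alone, all prefetches can be issued before any probe, which is exactly what makes hash tables friendlier to latency hiding than pointer-chasing structures.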
The rest of this paper is organized as follows. First, after explaining the state-of-the-art software NDN implementation used for the analysis in Section 2, we identify the true bottleneck for high-speed forwarding based on an instruction-level analysis in Section 3. In Sections 4 and 5, we choose data structures and algorithms for name-based packet forwarding and caching among existing algorithms from the viewpoint of the ease of hiding DRAM access latency. We then design a prefetch algorithm and evaluate it by implementing a proof-of-concept prototype in Section 6. We briefly introduce related work in Section 7 and conclude the paper in Section 8.
Section snippets
Hardware and software platforms
After describing a hardware platform, this section summarizes software design practices to exploit the parallel computation capabilities of CPU cores, which are essential to high-speed NDN packet forwarding.
Microarchitectural bottleneck analysis
To identify a true bottleneck of NDN packet forwarding on a computer, we conduct a microarchitectural analysis, which analyzes how the individual hardware components of the CPU spend time in the processing of NDN packets at the level of instructions and instruction pipelines.
Overview
In this section, we choose a FIB data structure appropriate for hiding DRAM access latency from among three representative FIB data structures, i.e., a hash table-based FIB [7], a Bloom filter-based FIB [11] and a trie-based one [28], by comparing their average numbers of dependent DRAM accesses for longest name prefix matching on a queried name. The word “dependent” means that the address of a dependent data piece is not determined until the data piece that contains the pointer to it has been fetched from DRAM.
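The notion of dependent accesses can be illustrated by contrasting a trie walk, where each node's address becomes known only after the previous node has been fetched, with a hash-table lookup, whose bucket address is computed directly from the key. The sketch below uses hypothetical structures and simply counts such accesses:

```c
/* Sketch contrasting dependent DRAM accesses in a binary trie walk
 * versus a direct hash-table lookup. Illustrative structures only. */
#include <stddef.h>
#include <stdint.h>

struct trie_node { struct trie_node *child[2]; int next_hop; };

/* Each iteration dereferences the node fetched in the previous one:
 * these accesses are dependent and cannot overlap with one another. */
int trie_lookup(struct trie_node *root, uint32_t key, int depth, int *accesses) {
    struct trie_node *n = root;
    int hop = -1;
    for (int i = 0; i < depth && n; i++) {
        (*accesses)++;                 /* one dependent access per level */
        if (n->next_hop >= 0)
            hop = n->next_hop;         /* remember longest match so far */
        n = n->child[(key >> (depth - 1 - i)) & 1];
    }
    return hop;
}

/* The bucket's address is h(key): a single dependent access suffices. */
struct bucket { uint32_t key; int next_hop; };
int hash_lookup(struct bucket *table, size_t size, uint32_t key, int *accesses) {
    struct bucket *b = &table[key % size];
    (*accesses)++;                     /* the only dependent access */
    return b->key == key ? b->next_hop : -1;
}
```

For a trie of depth d the walk costs d sequential fetches, while the hash table costs one, which is why the comparison in this section favors hash tables.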
Overview
In this section, we choose a cache algorithm appropriate for hiding DRAM access latency from among representative cache eviction and cache admission algorithms. Cache eviction determines a victim, i.e., a Data packet evicted from the CS, whereas cache admission decides whether a new Data packet is inserted into the CS or not. Usually, a cache admission algorithm is combined with a simple FIFO-based cache eviction algorithm. An important difference from the choice of FIB data structures ...
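As a rough illustration of how an admission algorithm pairs with FIFO eviction, the sketch below implements a TinyLFU-style policy [12]: a frequency sketch estimates each item's popularity, and a newcomer displaces the FIFO victim only if the newcomer is estimated to be more popular. The single-row counter array and the hash here are simplifying assumptions; TinyLFU itself uses a count-min sketch with a doorkeeper and periodic aging:

```c
/* Sketch of TinyLFU-style admission over a FIFO cache. Illustrative only. */
#include <stdint.h>

#define SKETCH_SIZE 256
#define CACHE_SIZE 4

/* One row of saturating counters stands in for the count-min sketch. */
static uint8_t freq[SKETCH_SIZE];

static uint32_t h(uint32_t key) { return (key * 2654435761u) % SKETCH_SIZE; }

void record_access(uint32_t key) {
    if (freq[h(key)] < 255)
        freq[h(key)]++;
}

/* FIFO cache: slot `head` is the next eviction candidate. */
static uint32_t cache[CACHE_SIZE];
static int used = 0, head = 0;

int cache_contains(uint32_t key) {
    for (int i = 0; i < used; i++)
        if (cache[i] == key)
            return 1;
    return 0;
}

/* Admission: replace the FIFO victim only if the newcomer's estimated
 * frequency exceeds the victim's. Returns 1 if the key ends up cached. */
int cache_admit(uint32_t key) {
    record_access(key);
    if (cache_contains(key))
        return 1;
    if (used < CACHE_SIZE) {            /* cache not yet full */
        cache[used++] = key;
        return 1;
    }
    uint32_t victim = cache[head];
    if (freq[h(key)] <= freq[h(victim)])
        return 0;                       /* newcomer denied admission */
    cache[head] = key;                  /* evict victim, admit newcomer */
    head = (head + 1) % CACHE_SIZE;
    return 1;
}
```

The appeal for latency hiding is that both the sketch counters and the FIFO victim slot are at addresses computable before any data is touched, keeping the number of dependent DRAM accesses small.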
Prefetch-friendly packet processing
In this section, we first identify data fetches causing instruction pipeline stalls which cannot be hidden by a conventional packet processing flow. Then, we design a prefetch-friendly packet processing flow to circumvent instruction pipeline stalls caused by the identified data fetches. Finally, we experimentally evaluate the performance gains obtained by using prefetch-friendly packet processing.
Related work
Prototypes of software NDN routers have been developed. Kirchner et al. [17] implemented their software NDN router, named Augustus, in two ways: a standalone monolithic forwarding engine based on the DPDK framework [15] and a modular one based on the Click framework. Though the forwarding speed of Augustus is high, it does not approach the potential forwarding speed achievable on computers. Hence, analyzing the bottlenecks of software NDN routers remains an open issue.
Conclusion
In this paper, we identified the ideal form of a software NDN router on computers via the following steps: 1) we conducted a detailed study of existing techniques for high-speed NDN packet forwarding and integrated them into a design rationale toward the realization of an ideal software NDN router; 2) we conducted microarchitectural and comprehensive bottleneck analyses of the software NDN router and revealed that hiding the DRAM access latency is vital to the realization of an ideal software NDN router.
Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
Junji Takemasa is currently an employee of KDDI Research, Inc. The submitted work was done while he was a student at the Graduate School of Information Science and Technology, Osaka University. Junji Takemasa has received the following research funding within the past five years.
- Grant-in-Aid for JSPS Fellows, No. 17J07276
  * Period: 2017–2019
  * Collaborators: None
CRediT authorship contribution statement
Junji Takemasa: Methodology, Software, Investigation, Writing - original draft. Yuki Koizumi: Conceptualization, Writing - original draft. Toru Hasegawa: Methodology, Writing - review & editing, Supervision.
Acknowledgement
This work has been supported by JSPS KAKENHI Grant Number 17H01733.
References (45)
- et al., High-speed data plane and network functions virtualization by vectorizing packet processing, Comp. Netw. (2019)
- et al., Exploiting parallelism in hierarchical content stores for high-speed ICN routers, Comp. Netw. (2017)
- et al., Achieving one billion key-value requests per second on a single server, IEEE Micro (2016)
- et al., A 50-Gb/s IP router, IEEE/ACM Trans. Network. (1998)
- et al., Towards a gigabit IP router, J. High Speed Netw. (1992)
- et al., Small forwarding tables for fast routing lookups, Proceedings of ACM SIGCOMM (1997)
- CIDR report, ...
- et al., Named data networking, ACM SIGCOMM Comp. Commun. Rev. (2014)
- December 2017 web server survey, ...
- et al., Scalable name-based packet forwarding: from millions to billions, Proceedings of ACM ICN (2015)
- Toward an ideal NDN router on a commercial off-the-shelf computer, Proceedings of ACM ICN
- Named data networking on a router: fast and DoS-resistant forwarding with hash tables, Proceedings of ACM/IEEE ANCS
- Poster: a method for designing high-speed software NDN routers, Proceedings of ACM ICN
- Caesar: a content router for high-speed forwarding on content names, Proceedings of ACM ANCS
- TinyLFU: a highly efficient cache admission policy, ACM Trans. Storage
- Augustus: a CCN router for programmable networks, Proceedings of ACM ICN
- Understanding sharded caching systems, Proceedings of IEEE INFOCOM
Cited by (8)
- LIGHT: A compatible, high-performance and scalable user-level network stack, Computer Networks (2023)
- Towards a Scalable Named Data Border Gateway Protocol, ICECCME 2022 (2022)
- Terabytes and Terabits/s Packet Caching in ICN Routers using Programmable Switches, IEEE CloudNet 2022 (2022)
- Dynamically Allocated Bloom Filter-Based PIT Architectures, IEEE Access (2022)
- A cache replacement strategy based on content features in named data networking, ACM International Conference Proceeding Series (2021)
- Vision: Toward 10 Tbps NDN forwarding with billion prefixes by programmable switches, ACM ICN 2021 (2021)
Junji Takemasa received his Bachelor and Master of Information Science degrees from Osaka University, Japan, in 2014 and 2016, respectively. He is pursuing the Ph.D. degree at the Graduate School of Information Science and Technology, Osaka University. His research interests include Information Centric Networking, high-speed network system and green networking. He is a member of IEEE, IEICE and IPSJ.
Yuki Koizumi is an associate professor of Graduate School of Information Science and Technology, Osaka University, Japan. He received his Master of Information Science and Ph.D. of Information Science degrees from Osaka University, Japan, in 2006 and 2009, respectively. His research interests include Information Centric Networking and mobile networking. He is a member of IEEE, ACM, and IEICE.
Toru Hasegawa is a professor of the Graduate School of Information Science and Technology, Osaka University. He received the B.E., the M.E. and Dr. Informatics degrees in information engineering from Kyoto University, Japan, in 1982, 1984 and 2000, respectively. After receiving the master degree, he worked as a research engineer at KDDI R&D labs. (former KDD R&D labs.) for 29 years and then moved to Osaka University. His current interests are the future Internet, Information Centric Networking, mobile computing and so on. He has published over 100 papers in peer-reviewed journals and international conference proceedings including MobiCom, ICNP, IEEE/ACM Transactions on Networking, and Computer Communications. He has served on the program or organization committees of several networking conferences such as ICNP, P2P, ICN, CloudNet, ICC, Globecom, etc., and as TPC co-chair of Testcom/Fates 2008, ICNP 2010, P2P 2011 and Global Internet Symposium 2014. He received the Meritorious Award on Radio of ARIB in 2003, the best tutorial paper award in 2014 from IEICE and the best paper award in 2015 from IEICE. He is a fellow of IPSJ and IEICE.