QBLKe: Host-side flash translation layer management for Open-Channel SSDs

https://doi.org/10.1016/j.sysarc.2021.102233

Abstract

Open-Channel SSD (OCSSD) shows great potential in high-performance storage systems. Existing applications or file systems rely on a host-based Flash Translation Layer (FTL) to use OCSSDs. However, the existing solution cannot fully exploit OCSSD performance under multi-thread workloads. On the read/write critical paths, we find three components (ring buffer, translation map, and DMA memory pool) that use global spinlocks to achieve atomicity. The spinlocks exhaust CPU time and thus hurt system scalability. Besides, as the number of OCSSD parallel units increases, the granularity of garbage collection (GC) of the existing solution also increases. This results in unnecessary page migrations during GC. In this article, we propose QBLKe as an OCSSD host-based FTL. QBLKe adopts three techniques to improve scalability and minimize software overhead: (1) per-CPU ring buffer, (2) lock-free translation map, and (3) per-CPU DMA pool. To further minimize GC page migrations and the impacts on I/O performance, QBLKe implements per-channel GC and a new scheme called score-based rate limiter. Experimental results show that QBLKe improves write bandwidth by up to 78.9% compared with the existing solution. It also increases the peak read IOPS by 139.31% and the GC efficiency by 49.11%.

Introduction

With the fast evolution of flash memory technology, more large-scale cloud service providers, personal computers, cellphones, and other smart devices are using flash SSDs to store data. Open-Channel SSD (OCSSD) is a new type of SSD. Different from traditional SSDs, OCSSD allows the host to access the storage via the physical address of NAND flash. In other words, the Flash Translation Layer (FTL) is implemented on the host side. OCSSD exposes physical details of the device to the host and gives the host software almost full control over physical data placement and I/O parallelism.

On one hand, the feasibility of OCSSDs provides system engineers with more design trade-offs in the host storage stack. Many existing works [1], [2], [3], [4], [5], [6], [7], [8], [9] focus on reducing software stack redundancies and providing predictable latencies. On the other hand, as hardware density and device bandwidth continue to increase, the demand for lower thread contention and better scalability has increased dramatically. For better performance, much software, including databases [10], [11], [12], file systems [13], [14], and desktop and mobile applications [15], [16], [17], tends to use multiple threads to issue I/Os. Besides, running several applications at the same time or sharing devices among multiple tenants also brings concurrent I/O requests.

Many existing applications, databases, and file systems use a block device interface to access the storage system. For OCSSDs, this can be achieved by implementing a host-based FTL that exposes a block I/O interface. It is challenging to provide such a host-based FTL with high performance, high scalability, and low software overhead because many components in the FTL (such as the buffer, mapping table, etc.) are shared globally by threads or processes. As the number of user I/O threads increases, the system synchronization overhead also increases unless the software architecture is designed properly. We find three components in the Physical Block Device (PBLK) [5], one of the most famous OCSSD host-based FTLs, that do not scale well under workloads with many I/O threads.

First, the data buffer. As the flash page size continues to grow [18], it is beneficial to use a write buffer to coalesce small writes that are smaller than a flash page, which minimizes write amplification. For example, PBLK uses a modified ring buffer to merge write data. The ring buffer is protected by a spinlock, which can become a performance bottleneck under multi-thread workloads. Our experiment shows that in a 32-thread write test, the ring buffer spinlock can consume more than 84% of the total CPU time, exhausting the CPUs and severely degrading the system performance.
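The coalescing role of the write buffer can be illustrated with a small single-threaded model. This is only a sketch, not PBLK's ring buffer: the class name `CoalescingBuffer`, the 4 KB sector size, and the 16 KB flash page size are assumptions chosen for the example, and no locking is modelled.

```python
from collections import deque

SECTOR_SIZE = 4096        # assumed host write unit in bytes
FLASH_PAGE_SIZE = 16384   # assumed flash page size in bytes
SECTORS_PER_PAGE = FLASH_PAGE_SIZE // SECTOR_SIZE


class CoalescingBuffer:
    """Collects sector-sized writes and emits them one full flash page at a time."""

    def __init__(self):
        self.pending = deque()     # buffered (lba, data) pairs, oldest first
        self.flushed_pages = []    # each item holds SECTORS_PER_PAGE pairs

    def write(self, lba, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("this sketch only accepts whole sectors")
        self.pending.append((lba, data))
        # Program the flash only when a whole page worth of sectors is ready.
        while len(self.pending) >= SECTORS_PER_PAGE:
            page = [self.pending.popleft() for _ in range(SECTORS_PER_PAGE)]
            self.flushed_pages.append(page)


buf = CoalescingBuffer()
for lba in range(6):
    buf.write(lba, bytes(SECTOR_SIZE))

# Six sectors arrive: four leave as one full page, two stay buffered.
print(len(buf.flushed_pages), len(buf.pending))
```

In this model the flash never sees a partial page; every thread that calls `write` touches the same `pending` queue, which is the shared state a real implementation would have to protect.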

Second, the translation map. Both read and write critical paths include access to the translation map. To avoid race conditions, the whole translation map is protected by a spinlock. However, this may cause unnecessary contention because different entries in the mapping table do not need to be accessed exclusively.
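The observation that distinct entries need no mutual exclusion can be illustrated with a per-entry compare-and-swap update. The sketch below is a single-threaded model and not QBLKe's code: Python exposes no hardware compare-and-swap, so `compare_and_swap` only mimics the semantics an atomic instruction would provide, and the class name, the `UNMAPPED` sentinel, and the "mover" scenario are assumptions for illustration.

```python
UNMAPPED = -1  # assumed sentinel meaning "no physical address yet"


class TranslationMap:
    """Logical-to-physical map in which each entry is updated independently."""

    def __init__(self, num_entries):
        self.entries = [UNMAPPED] * num_entries

    def lookup(self, lba):
        return self.entries[lba]

    def compare_and_swap(self, lba, expected, new):
        # Models one atomic CAS on a single entry: the update succeeds only
        # if the entry still holds the value the caller last observed.
        if self.entries[lba] != expected:
            return False
        self.entries[lba] = new
        return True

    def update(self, lba, new_ppa):
        # Retry loop: re-read and try again if another writer got in first.
        while True:
            old = self.lookup(lba)
            if self.compare_and_swap(lba, old, new_ppa):
                return old


tmap = TranslationMap(8)
first_old = tmap.update(3, 100)      # first write to logical block 3
stale = tmap.lookup(3)               # a mover observes physical address 100
tmap.update(3, 200)                  # a newer write lands in the meantime
moved = tmap.compare_and_swap(3, stale, 300)  # the stale update is rejected
```

Because every operation names exactly one entry, an update to logical block 3 never has to wait for an update to logical block 4, which is the property a single map-wide spinlock gives up.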

Third, the DMA memory pool. When preparing an I/O request to the OCSSD device, PBLK needs to allocate a piece of DMA coherent memory to temporarily hold the physical address list and the recovery information. PBLK uses a DMA memory pool that relies on a spinlock to make allocation atomic. When multiple threads send I/Os concurrently, the contention for this spinlock becomes very fierce.
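The per-CPU DMA pool named in the abstract avoids this contention by giving each CPU its own free list. A minimal user-space sketch of the idea follows; it is not QBLKe's allocator. Worker threads stand in for CPUs, plain `bytearray` objects stand in for DMA coherent memory, and `PerCpuPool` and all constants are hypothetical names and values.

```python
import threading

NUM_CPUS = 4           # assumed CPU count, modelled here by worker threads
BUFFERS_PER_CPU = 2    # assumed pool depth per CPU
BUFFER_SIZE = 4096     # assumed size of one scratch buffer in bytes


class PerCpuPool:
    """One private free list per CPU, so allocation touches no shared list."""

    def __init__(self, num_cpus, buffers_per_cpu, buffer_size):
        self.free_lists = [
            [bytearray(buffer_size) for _ in range(buffers_per_cpu)]
            for _ in range(num_cpus)
        ]

    def alloc(self, cpu):
        free_list = self.free_lists[cpu]
        return free_list.pop() if free_list else None  # None: local pool empty

    def free(self, cpu, buf):
        self.free_lists[cpu].append(buf)


def worker(pool, cpu, rounds, counts):
    done = 0
    for _ in range(rounds):
        buf = pool.alloc(cpu)      # only this worker uses free_lists[cpu]
        if buf is not None:
            done += 1
            pool.free(cpu, buf)
    counts[cpu] = done


pool = PerCpuPool(NUM_CPUS, BUFFERS_PER_CPU, BUFFER_SIZE)
counts = [0] * NUM_CPUS
threads = [
    threading.Thread(target=worker, args=(pool, cpu, 1000, counts))
    for cpu in range(NUM_CPUS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No lock appears anywhere in the sketch because each free list has exactly one user. A kernel implementation would additionally have to keep a task from migrating between CPUs while it holds a reference to its local list.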

Besides, PBLK uses line-based address management. As the number of flash memory parallel units increases, the garbage collection (GC) granularity also increases. This leads to more page migrations during GC.
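How the GC unit grows with parallelism can be shown with simple arithmetic. The sketch below assumes that a line takes one block from every parallel unit in the device, and that a per-channel unit takes one block from every parallel unit of a single channel; the second point is an assumption about how per-channel GC shrinks the unit, not a description of QBLKe's layout. The geometry (8 parallel units per channel, 256 pages per block, 16 KB pages) is hypothetical.

```python
def gc_unit_blocks(num_channels, pus_per_channel, per_channel_gc):
    """Blocks reclaimed together in one GC unit, one block per parallel unit."""
    if per_channel_gc:
        return pus_per_channel
    return num_channels * pus_per_channel


def gc_unit_bytes(blocks, pages_per_block, page_size):
    """Capacity covered by one GC unit."""
    return blocks * pages_per_block * page_size


PUS_PER_CHANNEL = 8      # assumed
PAGES_PER_BLOCK = 256    # assumed
PAGE_SIZE = 16 * 1024    # assumed, in bytes

for channels in (4, 8, 16):
    line = gc_unit_blocks(channels, PUS_PER_CHANNEL, per_channel_gc=False)
    chan = gc_unit_blocks(channels, PUS_PER_CHANNEL, per_channel_gc=True)
    line_mib = gc_unit_bytes(line, PAGES_PER_BLOCK, PAGE_SIZE) // 2**20
    chan_mib = gc_unit_bytes(chan, PAGES_PER_BLOCK, PAGE_SIZE) // 2**20
    print(f"{channels} channels: line = {line} blocks ({line_mib} MiB), "
          f"per-channel = {chan} blocks ({chan_mib} MiB)")
```

Under these assumptions the line-based unit scales with the channel count while the per-channel unit stays constant, so a larger device has more valid pages to move per reclaimed line-based unit.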

In this article, we propose an open-source [19] OCSSD host-based FTL called QBLK-Express (QBLKe). QBLKe aims at providing users with a highly scalable block I/O interface. To this end, QBLKe adopts three techniques to improve I/O path scalability and minimize software overhead: (1) per-CPU ring buffer, (2) lock-free translation map, and (3) per-CPU DMA pool. In addition to scalability, QBLKe also optimizes GC. QBLKe reduces GC page migrations by managing GC in a per-channel manner. Besides, QBLKe adopts a new scheme called score-based rate limiter to minimize the GC impact on I/O performance.

QBLKe is an extended version of a preliminary conference work QBLK [20]. Compared with the conference version, QBLKe has two main improvements as well as some minor optimizations. (1) QBLKe optimizes both read and write paths while QBLK mainly focuses on the write path optimizations. (2) QBLKe further optimizes the GC, and implements a new scheme called score-based rate limiter.

Experimental results show that QBLKe achieves up to 78.9% write bandwidth improvement compared with the PBLK scheme. It also increases the peak read IOPS by 139.31%. Besides, by managing GC in a per-channel manner, QBLKe decreases the page migration cost and increases the GC efficiency by 49.11%.

This article is organized as follows. Section 2 provides background and motivation. Section 3 introduces the design of QBLKe. Section 4 evaluates the effectiveness of QBLKe. Section 5 provides the related works. Section 6 concludes.

Background and motivation

Design of QBLKe

In this section, we introduce our host-based FTL named QBLK-Express (QBLKe). Similar to PBLK, QBLKe is a Linux kernel module that leverages the LightNVM subsystem and exposes a block I/O interface to users. Fig. 5 shows the software architecture comparison between PBLK and QBLKe. QBLKe aims at increasing the host-based FTL scalability. To this end, QBLKe adopts three techniques: (1) per-CPU ring buffer, (2) lock-free translation map, and (3) per-CPU DMA pool.

As well as the scalability, QBLKe

Evaluation

In this section, we evaluate the performance of QBLKe. We implement QBLKe on Linux kernel 4.16. The source code is available on GitHub [19].

Related works

To better exploit the performance of NAND flash SSDs, various layers of internal parallelism have been explored [23], [24]. While traditional SSDs embed their FTLs in the device side to work as a general block device, some researchers and vendors go for a different approach and try to manage the FTL functions at the host side.

Linux flash file systems such as JFFS [44] and YAFFS [45] are built upon raw flash devices. They use log-structured techniques to handle the NAND flash’s

Conclusion

With the increase of OCSSD parallelism, the host-side software overhead accounts for an increasing proportion of the overall system overhead. Implementing a host-based FTL with high performance, high scalability, and low software overhead is challenging because many components in the FTL (such as the buffer, mapping table, etc.) are shared globally by threads or processes. The existing OCSSD host-based FTL lacks scalability in both read and write paths. Specifically, on the write path, multiple threads

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the Nature Science Foundation of China under Grant No. 61821003, No. 61772222, No. U1705261, No. 61832007, the National Science and Technology Major Project, China No. 2017ZX01032-101, and the Fundamental Research Funds for the Central Universities, China No. 2019kfyXMBZ037.

Hongwei Qin received the BE degree in computer science and technology from the Huazhong University of Science and Technology (HUST), China, in 2013. He is currently a Ph.D candidate in HUST. His research interest includes computer system architecture, memory management, file systems, and SSDs. He has published several papers in major conferences including DATE, MSST etc.

References (57)

  • Huang J. et al. FlashBlox: Achieving both performance isolation and uniform lifetime for virtualized SSDs.
  • González J. et al. Multi-tenant I/O isolation with open-channel SSDs.
  • Yan S. et al. Tiny-tail flash: Near-perfect elimination of garbage collection tail latencies in NAND SSDs. ACM Trans. Stor. (TOS) (2017).
  • MongoDB (2009).
  • MariaDB (2015).
  • Facebook, RocksDB,...
  • Min C. et al. Understanding manycore scalability of file systems.
  • Zhang J. et al. ParaFS: A log-structured file system to exploit the internal parallelism of flash devices.
  • Jeong D. et al. Boosting quasi-asynchronous I/O for better responsiveness in mobile devices.
  • Harter T. et al. A file is not a file: Understanding the I/O behavior of Apple desktop applications. ACM Trans. Comput. Syst. (TOCS) (2012).
  • jingyu9575, Multithreaded Download Manager,...
  • Feng Y. et al. Multiple subpage writing FTL in MLC by exploiting dual mode operations. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2019).
  • H. Qin, QBLKe,...
  • Qin H. et al. QBLK: Towards fully exploiting the parallelism of open-channel SSDs.
  • Josephson W.K. et al. DFS: A file system for virtualized flash storage. ACM Trans. Stor. (TOS) (2010).
  • OpenChannelSSD Specifications, The Open-Channel SSD community,...
  • Chen F. et al. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing.
  • Hu Y. et al. Exploring and exploiting the multilevel parallelism inside SSDs for improved performance and endurance. IEEE Trans. Comput. (2013).
    Dan Feng received the BE, ME and Ph.D. degrees in computer science and technology from the Huazhong University of Science and Technology (HUST), China, in 1991, 1994 and 1997, respectively. She is a professor and vice dean of the School of Computer Science and Technology, HUST. Her research interests include computer architecture, massive storage systems and parallel file systems. She has more than 80 publications to her credit in journals and international conferences, including IEEE Transactions on Parallel and Distributed Systems (TPDS), JCST, USENIX ATC, FAST, ICDCS, HPDC, SC, ICS and ICPP. She is a member of the IEEE.

    Wei Tong received the BE, ME and Ph.D. degrees in computer science and technology from the Huazhong University of Science and Technology (HUST), China, in 1999, 2002 and 2011, respectively. She is a lecturer of the School of Computer Science and Technology, HUST. Her research interests include computer architecture, network storage system and solid state storage system. She has more than 10 publications in journals and international conferences including ACM TACO, MSST, NAS, FGCN.

    Yutong Zhao received the BE degree in communication engineering from the Xiamen University (XMU), China, in 2017. She is currently working towards the ME degree in computer architecture from HUST. Her research interest includes computer architecture and flash reliability. She has published several papers in major conferences including MSST, DATE etc.

    Mengye Peng received the B.E. degree in computer science and technology from the Shandong University, China, in 2016. She is currently pursuing the master’s degree in Huazhong University of Science and Technology, Wuhan, China. Her research interest includes computer system architecture, file systems and Open-Channel SSDs. She has published a paper in ICCD.

    Jingning Liu received the BE degree in computer science and technology from the Huazhong University of Science and Technology (HUST), China, in 1982. She is a professor in the HUST and engaged in researching and teaching of computer system architecture. Her research interests include computer storage network system, high-speed interface and channel technology, embedded system and FPGA design. She has over 20 publications in journals and international conferences including ACM TACO, NAS, MSST and ICA3PP.
