Computational storage: an efficient and scalable platform for big data and HPC applications
Journal of Big Data (IF 8.1), Pub Date: 2019-11-15, DOI: 10.1186/s40537-019-0265-5
Mahdi Torabzadehkashi, Siavash Rezaei, Ali HeydariGorji, Hosein Bobarshad, Vladimir Alves, Nader Bagherzadeh

In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost is directly related to the distance between the processing engines and the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move processing closer to the data. Computational storage devices (CSDs) push the “move process to data” paradigm to its limit by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for applications. Thus, a vast spectrum of applications can be ported to run on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modification to the underlying distributed processing framework. As a proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to a 2.2× improvement in performance and up to a 4.3× reduction in energy consumption when running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms improve by up to 5.4× and 8.9×, respectively.

Updated: 2019-11-15