POCLib: A High-Performance Framework for Enabling Near Orthogonal Processing on Compression, IEEE Transactions on Parallel and Distributed Systems


IEEE Transactions on Parallel and Distributed Systems ( IF 5.6 ) Pub Date : 2021-06-29 , DOI: 10.1109/tpds.2021.3093234
Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Xiaoyong Du


Parallel technology has boosted data processing in recent years, and parallel direct data processing on hierarchically compressed documents shows great promise. High-performance direct data processing saves both time and space by removing the need to decompress data. However, its benefits have been limited to data traversal operations; for random accesses, direct data processing is several times slower than state-of-the-art baselines. This article proposes a novel concept, orthogonal processing on compression (orthogonal POC): text analytics can be supported efficiently and directly on compressed data regardless of the type of data processing. In other words, the type of data processing is orthogonal to the capability of conducting POC. Previous proposals, such as TADOC, are not orthogonal POC. This article presents a set of techniques that eliminate this limitation and, for the first time, establish the feasibility of near orthogonal POC: handling both data traversal operations and random data accesses effectively on hierarchically compressed data. The work focuses on text data and yields a unified high-performance library called POCLib. On a ten-node distributed Spark cluster on Amazon EC2, POCLib achieves a 3.1× speedup over the state of the art on random data accesses to compressed data, while still supporting traversal operations efficiently and providing large (3.9×) space savings.
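The distinction the abstract draws between traversal operations and random accesses can be illustrated with a toy example. The sketch below is not POCLib's API or data layout, which the abstract does not describe; it assumes a simple grammar-style hierarchical compression in which rules reference other rules, and every name in it (`RULES`, `word_counts`, `rule_lengths`, `word_at`) is hypothetical. A word count visits each rule once without expanding the text, whereas fetching the word at a given position needs an extra index of expanded rule lengths and a descent through the hierarchy.

```python
from collections import Counter

# Hypothetical toy grammar: rule 0 is the root. Integers reference other
# rules, strings are words. Expanded text: "a c b and a c b or a c".
RULES = {
    0: [1, "and", 1, "or", 2],
    1: [2, "b"],
    2: ["a", "c"],
}


def word_counts(rules, root=0):
    """Traversal-style analytics: count words without expanding the text.

    Each rule's counts are computed once and reused wherever it is referenced.
    """
    memo = {}

    def visit(rid):
        if rid not in memo:
            total = Counter()
            for sym in rules[rid]:
                if isinstance(sym, int):
                    total.update(visit(sym))  # reuse the sub-rule's counts
                else:
                    total[sym] += 1
            memo[rid] = total
        return memo[rid]

    return visit(root)


def rule_lengths(rules):
    """Auxiliary index: number of words each rule expands to."""
    memo = {}

    def length(rid):
        if rid not in memo:
            memo[rid] = sum(
                length(s) if isinstance(s, int) else 1 for s in rules[rid]
            )
        return memo[rid]

    for rid in rules:
        length(rid)
    return memo


def word_at(rules, lengths, index, root=0):
    """Random access: find the word at `index` by descending the hierarchy."""
    if not 0 <= index < lengths[root]:
        raise IndexError(index)
    rid = root
    while True:
        for sym in rules[rid]:
            size = lengths[sym] if isinstance(sym, int) else 1
            if index < size:
                break  # the target position lies inside this symbol
            index -= size  # skip past this symbol
        if isinstance(sym, int):
            rid = sym  # descend into the referenced rule
        else:
            return sym


counts = word_counts(RULES)
lengths = rule_lengths(RULES)
```

In this toy setting, `word_at(RULES, lengths, 7)` walks from the root to the word at position 7 without materializing the full text, but only because `lengths` was built beforehand. That extra bookkeeping is one plausible reason random access is harder than traversal on hierarchical formats; the techniques POCLib actually uses are not given in the abstract.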


Updated: 2021-06-29
