Range partitioning within sublinear time: Algorithms and lower bounds, Theoretical Computer Science

Range partitioning within sublinear time: Algorithms and lower bounds
Theoretical Computer Science (IF 0.9), Pub Date: 2021-01-13, DOI: 10.1016/j.tcs.2021.01.017
Baoling Ning, Jianzhong Li, Shouxu Jiang

Range partitioning is a typical and widely used data partitioning method and has become a core operation in most big data computing platforms. Given an input L of N data items admitting a total order, the goal of range partitioning is to divide the whole input into k ranges containing the same number of data items. There is a trivial lower bound of $\Omega(N)$ for exact partitioning algorithms, since they must at least make a full scan of the data. In the context of big data computing, even algorithms with $O(N)$ time are not always considered efficient enough; the ultimate goal of designing algorithms on big data is usually to solve problems within sublinear time. It is therefore well motivated and important to study sublinear algorithms for the range partitioning problem.
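
To make the problem statement concrete, here is a minimal Python sketch of exact range partitioning. It is an illustration only, not the paper's method: the function names `exact_splitters` and `partition` are hypothetical, items are assumed to be distinct, and sorting is used for simplicity even though it costs $O(N \log N)$ rather than matching the $\Omega(N)$ bound.

```python
from bisect import bisect_right

def exact_splitters(items, k):
    """Return k-1 splitter values that cut the sorted input into k near-equal ranges."""
    ordered = sorted(items)  # reads every item, so this is at least a full scan
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def partition(items, splitters):
    """Place each item in the range bounded below by the largest splitter <= item."""
    ranges = [[] for _ in range(len(splitters) + 1)]
    for x in items:
        ranges[bisect_right(splitters, x)].append(x)
    return ranges

items = [7, 2, 11, 0, 5, 9, 3, 8, 1, 10, 6, 4]
splitters = exact_splitters(items, 3)
print(splitters)                                       # [4, 8]
print([len(r) for r in partition(items, splitters)])   # [4, 4, 4]
```

With duplicate values the ranges may no longer be exactly equal in size, because equal items always fall into the same range.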

The paper aims to answer three questions. For the internal memory (RAM) model, since a sophisticated sampling-based $(\epsilon, \delta)$-approximation partitioning algorithm with $O(\frac{k \log(N/\delta)}{\epsilon^{2}})$ time cost has already been proposed, the first question is what lower bound can be obtained for sublinear partitioning algorithms. For the external memory (I/O) model, as far as we know, no previous work gives external partitioning algorithms with performance guarantees within sublinear time, so the other two questions are what upper bound and what lower bound can be achieved for sublinear external partitioning algorithms. To answer these questions, the paper studies lower and upper bounds for the range partitioning problem in the RAM and I/O models. For the RAM model, a lower bound of $\Omega(\frac{k(1-\delta)}{\epsilon^{2}})$ on the cost of sampling-based partitioning algorithms is proved. For the I/O model, two lower bounds on the sampling cost required by sublinear external range partitioning algorithms are proved, which indicate that at least a full scan of the whole input is needed in the worst case and that a general sublinear external partitioning algorithm does not exist. Motivated by the hard instances used in the lower-bound proofs, a model describing the input distributions of the range partitioning problem in practical applications is proposed. Finally, for the special cases described by the model, a sublinear external partitioning algorithm with $O(\frac{k \log(N/\delta)}{wB\epsilon^{2}})$ I/O cost is designed.
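
The sampling-based idea behind the RAM-model upper bound can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the algorithm from the paper or from prior work. The function names are hypothetical, and the constant `c` hidden by the $O(\cdot)$ notation is unknown and set to 1 here. The abstract does not define an $(\epsilon, \delta)$-approximation; the sketch assumes it means that each range holds roughly $(1 \pm \epsilon) N/k$ items with probability at least $1 - \delta$.

```python
import math
import random
from bisect import bisect_right

def sample_size(n, k, eps, delta, c=1.0):
    """Sample count c * k * log(n / delta) / eps^2, capped at n (c is an assumed constant)."""
    return min(n, math.ceil(c * k * math.log(n / delta) / eps ** 2))

def approx_splitters(items, k, eps, delta, rng=None, c=1.0):
    """Estimate k-1 splitters from a uniform random sample drawn with replacement."""
    rng = rng or random.Random()
    m = sample_size(len(items), k, eps, delta, c)
    sample = sorted(rng.choice(items) for _ in range(m))
    return [sample[(i * m) // k] for i in range(1, k)]

def range_sizes(items, splitters):
    """Count how many items fall into each range (a full scan, used only for checking)."""
    sizes = [0] * (len(splitters) + 1)
    for x in items:
        sizes[bisect_right(splitters, x)] += 1
    return sizes

rng = random.Random(0)
data = list(range(100_000))
rng.shuffle(data)
splitters = approx_splitters(data, k=4, eps=0.2, delta=0.01, rng=rng)
print(sample_size(len(data), 4, 0.2, 0.01))  # 1612 items sampled out of 100000
print(range_sizes(data, splitters))
```

Only the sample is sorted, so the splitter computation touches a number of items that depends on $k$, $\epsilon$, $\delta$ and $\log N$ rather than on $N$ itself. The `range_sizes` check scans the whole input and is there only to inspect the quality of the result.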

Updated: 2021-01-22
