Theoretical Computer Science (IF 0.747), Pub Date: 2021-01-13, DOI: 10.1016/j.tcs.2021.01.017. Baoling Ning; Jianzhong Li; Shouxu Jiang
Range partitioning is a typical and widely used data partitioning method and has become a core operation in most big data computing platforms. Given an input L of N data items admitting a total order, the goal of range partitioning is to divide the whole input into k ranges containing the same number of data items. Exact partitioning algorithms have a trivial linear lower bound, since they must make at least one full scan of the data. In the context of big data computing, even linear-time algorithms are not always considered efficient enough; the ultimate goal of algorithm design on big data is usually to solve problems in sublinear time. It is therefore well motivated and important to study sublinear algorithms for the range partitioning problem.
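The problem statement above can be made concrete with a minimal sketch of exact range partitioning. It is an illustration only, not code from the paper: the function names `exact_splitters`, `range_of`, and `partition_sizes` are hypothetical, and the input is assumed to be an in-memory list of distinct, comparable items. Sorting touches every item, which is why an exact method cannot run in sublinear time.

```python
import bisect


def exact_splitters(items, k):
    """Return k-1 splitter values that cut the sorted input into k near-equal ranges."""
    ordered = sorted(items)  # touches every item: at least linear work
    n = len(ordered)
    # Splitter i is the element at rank (i * n) // k, the first item of range i.
    return [ordered[(i * n) // k] for i in range(1, k)]


def range_of(x, splitters):
    """Index (0 .. k-1) of the range that x falls into."""
    return bisect.bisect_right(splitters, x)


def partition_sizes(items, splitters):
    """Count how many items land in each range."""
    sizes = [0] * (len(splitters) + 1)
    for x in items:
        sizes[range_of(x, splitters)] += 1
    return sizes
```

With N divisible by k and distinct items, every range receives exactly N / k items; otherwise the sizes differ by at most one.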
The paper aims to answer three questions. For the internal memory (RAM) model, a sophisticated sampling-based approximation partitioning algorithm with known time cost has already been proposed, so the first question is what lower bound can be obtained for sublinear partitioning algorithms. For the external memory (I/O) model, as far as we know, no previous work gives external partitioning algorithms with performance guarantees within sublinear time, so the other two questions are what upper bound and what lower bound can be achieved for sublinear external partitioning algorithms. To answer these questions, the paper studies lower and upper bounds for the range partitioning problem in both the RAM and I/O models. For the RAM model, a lower bound on the cost of sampling-based partitioning algorithms is proved. For the I/O model, two lower bounds on the sampling cost required by sublinear external range partitioning algorithms are proved; they indicate that at least a full scan of the whole input is needed in the worst case, so a general sublinear external partitioning algorithm does not exist. Motivated by the hard instances used in the lower-bound proofs, a model is proposed for describing the input distributions of the range partitioning problem in practical applications. Finally, for the special cases described by the model, a sublinear external partitioning algorithm is designed and its I/O cost is bounded.
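For readers unfamiliar with sampling-based partitioning, the general idea can be sketched as follows. This is a generic textbook-style illustration under stated assumptions, not the algorithm analyzed in the paper: the names `sampled_splitters` and `max_relative_imbalance`, the `sample_size` parameter, and the choice of uniform sampling with replacement are all hypothetical. Only `sample_size` items are read to choose the splitters; the imbalance check scans the full input and is included purely to evaluate the result.

```python
import bisect
import random


def sampled_splitters(items, k, sample_size, seed=0):
    """Estimate k-1 splitters from a uniform random sample (drawn with replacement)."""
    rng = random.Random(seed)
    sample = sorted(rng.choice(items) for _ in range(sample_size))
    # Use the sample's own k-quantiles as approximate splitters for the full input.
    return [sample[(i * sample_size) // k] for i in range(1, k)]


def max_relative_imbalance(items, splitters):
    """Largest deviation of any range size from the ideal N / k, as a fraction of N / k.

    This is a full scan, used here only to measure partition quality.
    """
    k = len(splitters) + 1
    sizes = [0] * k
    for x in items:
        sizes[bisect.bisect_right(splitters, x)] += 1
    ideal = len(items) / k
    return max(abs(s - ideal) for s in sizes) / ideal
```

Larger samples tend to give more balanced ranges. The lower bounds discussed above concern how small such a sample can be while still guaranteeing a given balance in the worst case.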