Partition Selection for Large-Scale Data Management Using KNN Join Processing, Mathematical Problems in Engineering



Mathematical Problems in Engineering (IF 1.430), Pub Date: 2020-09-08, DOI: 10.1155/2020/7898230
Yue Hu (1,2), Ge Peng (1), Zehua Wang (1), Yanrong Cui (1,2), Hang Qin (1,2)


As data volumes grow explosively, the k nearest neighbors (KNN) algorithm becomes a particularly expensive operation for both classification and regression problems. To predict the value of a new data point, it computes the feature similarity between each object in the test dataset and each object in the training dataset. Because of this computational cost, a single computer cannot handle large-scale datasets. In this paper, we propose an adaptive vKNN algorithm, which is based on the Voronoi diagram under the MapReduce parallel framework and takes full advantage of parallel computing for processing large-scale data. For partition selection, we design a new predictive strategy that finds the optimal relevant partition for each sample point. This allows us to effectively filter out irrelevant data, reduce KNN join computation, and improve efficiency. Finally, we conduct extensive experiments on a cluster using large 54-dimensional datasets. The experimental results show that the proposed method is effective and scalable while maintaining accuracy.
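To make the cost argument in the abstract concrete, the following is a minimal single-machine sketch, not the authors' vKNN algorithm. It shows a brute-force KNN join, which compares every test object with every training object, and a pivot-based Voronoi partitioning step, which assigns each object to the cell of its nearest pivot. The function names `knn_join` and `voronoi_partition`, the use of Euclidean distance, and the hand-picked pivots are all assumptions made for illustration; the abstract does not specify the distance measure, the pivot selection, or the partition selection strategy.

```python
import math
from collections import defaultdict


def knn_join(test_points, train_points, k):
    """Brute-force KNN join (illustrative only).

    For every test point, rank all training points by Euclidean distance
    and keep the k closest. The cost grows with
    len(test_points) * len(train_points), which is the expense the
    abstract refers to.
    """
    result = []
    for q in test_points:
        ranked = sorted(train_points, key=lambda p: math.dist(q, p))
        result.append(ranked[:k])
    return result


def voronoi_partition(points, pivots):
    """Assign each point to the Voronoi cell of its nearest pivot.

    Returns a dict mapping pivot index -> list of points in that cell.
    Pivot choice is an assumption here; the abstract does not describe it.
    """
    cells = defaultdict(list)
    for p in points:
        nearest = min(range(len(pivots)), key=lambda i: math.dist(p, pivots[i]))
        cells[nearest].append(p)
    return dict(cells)
```

In a partitioned setting, each cell could be processed by a separate worker, and a query would only need to be compared against the cells that can contain its nearest neighbors. How those relevant cells are chosen is the partition selection problem the paper addresses, and it is not reproduced here.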


Updated: 2020-09-08
