Automatic Control and Computer Sciences (IF 0.6), Pub Date: 2021-01-14, DOI: 10.3103/s0146411620060097
Jiameng Wang, Yunfei Yin, Xiyu Deng
Abstract
Parallel optimization is an important research topic in data mining. Taking CART parallelization as an example, this paper proposes a parallel data mining algorithm based on segmentation and pruning optimization, called SSP-OGini-PCCP. To address the problem of choosing the best CART segmentation point, it designs an S-SP model without data association; to calculate the Gini index efficiently, it designs a parallel OGini calculation method; and to improve the efficiency of the pruning algorithm, it proposes a synchronous PCCP pruning strategy. Optimal segmentation calculation, Gini index calculation, and pruning, all important components of parallel data mining, are studied in depth. The SSP-OGini-PCCP method is tested on a distributed cluster simulation system built on SPARK. The experimental results show that the method significantly improves the efficiency of data classification and decision making, meeting the demands of contemporary large-scale data processing.
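For readers unfamiliar with the quantities named in the abstract, the following is a minimal sequential sketch of the standard CART Gini index and segmentation-point search on a single numeric feature. It is not the paper's S-SP model or OGini method, whose details the abstract does not give; the function names `gini` and `best_split` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the split `value <= threshold`
    that minimizes the size-weighted Gini index of the two partitions.

    Illustrative helper (an assumption, not from the paper). Returns
    (None, inf) if all feature values are identical.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold separates equal feature values
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

In this sketch each candidate threshold is scored independently of the others, which is the property that makes the segmentation-point search a natural target for parallelization.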
Title (translated from the Chinese): Parallel Data Mining Method Based on Segmentation and Pruning Optimization