Abstract
Parallel optimization is currently an important research topic in data mining. Taking the parallelization of CART (classification and regression trees) as an example, this paper proposes a parallel data mining algorithm based on segmentation and pruning optimization, called SSP-OGini-PCCP. To address the problem of choosing the best CART segmentation point, an S-SP model without data association is designed. To calculate the Gini index efficiently, a parallel OGini calculation method is designed. To improve the efficiency of pruning, a synchronous PCCP pruning strategy is proposed. These three components (optimal segmentation calculation, Gini index calculation, and pruning) are important parts of parallel data mining and are studied in depth. The SSP-OGini-PCCP method is tested on a simulated distributed cluster built on Spark. The experimental results show that the method significantly improves the efficiency of data classification and decision making, meeting the demands of contemporary mass data processing.
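For readers unfamiliar with the two quantities the abstract builds on, the following is a minimal serial sketch of the textbook CART computation: the Gini impurity of a label set and the choice of the best binary segmentation point on one numeric feature. It is not the paper's S-SP or OGini method, whose details are not given in the abstract; the function names `gini` and `best_split` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a label set: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best binary split on one feature.

    Textbook CART scan, not the paper's S-SP model: sort by feature value,
    try the midpoint between each pair of distinct neighbours, and keep the
    candidate with the lowest size-weighted impurity of the two children.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = float("nan"), float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_score = score
    return best_threshold, best_score


if __name__ == "__main__":
    threshold, score = best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
    print(threshold, score)
```

In this sketch the scan over one feature does not depend on any other feature, which is the kind of independence a parallel implementation could exploit by assigning features to different workers. How SSP-OGini-PCCP actually partitions the work on Spark is not stated in the abstract.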
Funding
This work was supported by the National Natural Science Foundation of China (no. 61702059), the Research Fund of the Guangxi Key Lab of Multi-source Information Mining and Security (MIMS18-03), and the Fundamental Research Funds for the Central Universities (2018CDGFCH0020).
Ethics declarations
The authors declare no conflict of interest.
About this article
Cite this article
Wang, J., Yin, Y. & Deng, X. A Parallel Data Mining Approach Based on Segmentation and Pruning Optimization. Aut. Control Comp. Sci. 54, 483–492 (2020). https://doi.org/10.3103/S0146411620060097
DOI: https://doi.org/10.3103/S0146411620060097