Abstract
The growing number of analytic platforms makes it increasingly difficult to select the platform that best fits a given data mining task, the dataset, and additional user-defined criteria. Analysts in particular, who focus primarily on the analytics domain, have difficulty keeping up with the latest developments. In this work, we introduce the ASAP-DM framework, which enables analysts to use several platforms seamlessly, while programmers can easily add further platforms to the framework. Furthermore, we investigate how to predict the platform that best meets specific criteria, such as lowest runtime or resource consumption during the execution of a data mining task. We formulate this task as an optimization problem that can be solved by today’s classification algorithms. We evaluate the proposed framework on several analytic platforms, such as Spark, Mahout, and WEKA, along with several data mining algorithms for classification, clustering, and association rule discovery. Our experiments show that the automatic selection process can save up to 99.71% of the execution time by automatically choosing a faster platform.
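The idea of casting platform selection as a classification task can be illustrated with a minimal sketch. It is not the ASAP-DM implementation: the runtime measurements are invented for illustration, the dataset features (row and column counts) are an assumption, and a hand-written 1-nearest-neighbour rule stands in for whichever classification algorithm is actually used. Each past run is labelled with the platform that minimizes the chosen criterion (here, runtime), and the classifier then predicts a platform for an unseen dataset.

```python
from math import dist, log10

# Hypothetical measurements (NOT from the paper): dataset shape as
# (rows, columns) mapped to observed runtime in seconds per platform.
MEASUREMENTS = [
    ((1_000, 10), {"WEKA": 0.4, "Spark": 6.0, "Mahout": 30.0}),
    ((50_000, 20), {"WEKA": 5.0, "Spark": 8.0, "Mahout": 45.0}),
    ((5_000_000, 20), {"WEKA": 900.0, "Spark": 60.0, "Mahout": 200.0}),
    ((20_000_000, 50), {"WEKA": 7200.0, "Spark": 300.0, "Mahout": 650.0}),
]


def best_platform(runtimes):
    """Optimization step: pick the platform with the lowest runtime."""
    return min(runtimes, key=runtimes.get)


def features(shape):
    """Assumed feature vector: log-scaled row and column counts."""
    rows, cols = shape
    return (log10(rows), log10(cols))


# Training set: one (features, label) pair per measured dataset.
TRAINING = [(features(shape), best_platform(rt)) for shape, rt in MEASUREMENTS]


def predict_platform(shape):
    """Stand-in classifier: label of the nearest training example (1-NN)."""
    x = features(shape)
    return min(TRAINING, key=lambda example: dist(example[0], x))[1]


def saving_percent(chosen_runtime, baseline_runtime):
    """Share of execution time saved relative to a slower baseline platform."""
    return 100.0 * (1.0 - chosen_runtime / baseline_runtime)


if __name__ == "__main__":
    print(predict_platform((2_000, 10)))
    print(predict_platform((10_000_000, 30)))
    print(round(saving_percent(60.0, 900.0), 2))
```

In this toy setting, small datasets are routed to the single-machine platform and large ones to the distributed platform, and `saving_percent` shows how a figure such as "up to 99.71%" would be computed from two runtimes. Other criteria, such as resource consumption, would only change the labelling function.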
Acknowledgements
This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program ‘Services Computing’. Some work presented in this paper was performed in the Project ‘INTERACT’ as part of the Software Campus program, which is funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS17051.
Cite this article
Fritz, M., Muazzen, O., Behringer, M. et al. ASAP-DM: a framework for automatic selection of analytic platforms for data mining. SICS Softw.-Inensiv. Cyber-Phys. Syst. 35, 17–29 (2020). https://doi.org/10.1007/s00450-019-00408-7
DOI: https://doi.org/10.1007/s00450-019-00408-7