ASAP-DM: a framework for automatic selection of analytic platforms for data mining

Fritz, Manuel; Muazzen, Osama; Behringer, Michael; Schwarz, Holger

doi:10.1007/s00450-019-00408-7

Special Issue Paper
Published: 17 August 2019

Volume 35, pages 17–29 (2020)

SICS Software-Intensive Cyber-Physical Systems

Abstract

The plethora of analytic platforms makes it increasingly difficult to select the platform that best fits the data mining task at hand, the dataset, and additional user-defined criteria. Analysts in particular, who focus on the analytics domain, have difficulty keeping up with the latest developments. In this work, we introduce the ASAP-DM framework, which enables analysts to seamlessly use several platforms, while programmers can easily add further platforms to the framework. Furthermore, we investigate how to predict a platform based on specific criteria, such as lowest runtime or resource consumption during the execution of a data mining task. We formulate this task as an optimization problem that can be solved by today’s classification algorithms. We evaluate the proposed framework on several analytic platforms, such as Spark, Mahout, and WEKA, along with several data mining algorithms for classification, clustering, and association rule discovery. Our experiments reveal that the automatic selection process can save up to 99.71% of the execution time by automatically choosing a faster platform.
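The idea of treating platform selection as a classification task can be illustrated with a minimal sketch. It is not the ASAP-DM implementation: the meta-features (number of rows and attributes), the runtimes, the 1-nearest-neighbour classifier, and all names such as `PAST_RUNS` and `predict_platform` are assumptions made for illustration only. Each past run is labelled with the platform that performed best under the chosen criterion (here, lowest runtime), and a classifier then predicts that label for an unseen dataset.

```python
import math

# Hypothetical past runs: (rows, attributes) -> runtime in seconds per platform.
# All numbers are invented for illustration; they are NOT results from the paper.
PAST_RUNS = [
    ((1_000, 10), {"WEKA": 0.5, "Spark": 8.0, "Mahout": 20.0}),
    ((50_000, 20), {"WEKA": 12.0, "Spark": 9.0, "Mahout": 25.0}),
    ((5_000_000, 20), {"WEKA": 900.0, "Spark": 60.0, "Mahout": 140.0}),
    ((20_000_000, 50), {"WEKA": 5000.0, "Spark": 200.0, "Mahout": 450.0}),
]


def best_platform(runtimes):
    """Class label: the platform with the lowest value of the criterion."""
    return min(runtimes, key=runtimes.get)


def features(meta):
    """Log-scale the dataset characteristics (an assumed feature choice)."""
    rows, attrs = meta
    return (math.log10(rows), math.log10(attrs))


# Labelled training set: feature vector -> best platform for that run.
TRAINING = [(features(meta), best_platform(rt)) for meta, rt in PAST_RUNS]


def predict_platform(meta):
    """1-nearest-neighbour classification over the labelled past runs."""
    x = features(meta)
    _, label = min((math.dist(x, f), lab) for f, lab in TRAINING)
    return label


if __name__ == "__main__":
    print(predict_platform((2_000, 8)))
    print(predict_platform((10_000_000, 30)))
```

Any classifier could take the place of the nearest-neighbour rule here; the point of the sketch is only that the optimization target (the best platform for a criterion) becomes the class label.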

Acknowledgements

This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program ‘Services Computing’. Some work presented in this paper was performed in the Project ‘INTERACT’ as part of the Software Campus program, which is funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS17051.

Author information

Authors and Affiliations

Institute for Parallel and Distributed Systems, University of Stuttgart, Universitätsstr. 38, 70569, Stuttgart, Germany
Manuel Fritz, Michael Behringer & Holger Schwarz
University of Stuttgart, Stuttgart, Germany
Osama Muazzen

Corresponding author

Correspondence to Manuel Fritz.

About this article

Cite this article

Fritz, M., Muazzen, O., Behringer, M. et al. ASAP-DM: a framework for automatic selection of analytic platforms for data mining. SICS Softw.-Intensiv. Cyber-Phys. Syst. 35, 17–29 (2020). https://doi.org/10.1007/s00450-019-00408-7

Published: 17 August 2019
Issue Date: August 2020
DOI: https://doi.org/10.1007/s00450-019-00408-7
