ASAP-DM: a framework for automatic selection of analytic platforms for data mining

  • Special Issue Paper
SICS Software-Intensive Cyber-Physical Systems

Abstract

The plethora of analytic platforms makes it increasingly difficult to select the platform that best fits the data mining task at hand, the dataset, and additional user-defined criteria. Analysts in particular, who are focused mainly on the analytics domain, have difficulty keeping up with the latest developments. In this work, we introduce the ASAP-DM framework, which enables analysts to seamlessly use several platforms, while programmers can easily add platforms to the framework. Furthermore, we investigate how to predict the best platform according to specific criteria, such as lowest runtime or lowest resource consumption during the execution of a data mining task. We formulate this task as an optimization problem, which can be solved by today’s classification algorithms. We evaluate the proposed framework on several analytic platforms such as Spark, Mahout, and WEKA, along with several data mining algorithms for classification, clustering, and association rule discovery. Our experiments show that the automatic selection process can save up to 99.71% of the execution time by automatically choosing a faster platform.
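The idea of treating platform selection as a classification task can be illustrated with a minimal sketch. It assumes that each past run is described by a few meta-features (here the number of rows and attributes of the dataset) and by the runtime measured on each platform; the platform with the lowest runtime becomes the class label. The function names, the choice of meta-features, and all runtime numbers are hypothetical placeholders and are not taken from the paper, and a 1-nearest-neighbour rule merely stands in for whichever classifier is actually used.

```python
from math import dist

# Hypothetical history: (meta-features, runtime per platform in seconds).
# Meta-features are (number of rows, number of attributes).
# All numbers are illustrative placeholders, NOT measurements from the paper.
HISTORY = [
    ((1_000, 10), {"WEKA": 2.0, "Mahout": 40.0, "Spark": 15.0}),
    ((5_000, 20), {"WEKA": 6.0, "Mahout": 45.0, "Spark": 16.0}),
    ((1_000_000, 10), {"WEKA": 900.0, "Mahout": 300.0, "Spark": 60.0}),
    ((2_000_000, 50), {"WEKA": 2500.0, "Mahout": 700.0, "Spark": 150.0}),
]


def best_platform(runtimes):
    """Class label of a record: the platform with the lowest runtime."""
    return min(runtimes, key=runtimes.get)


def predict_platform(features, history=HISTORY):
    """Predict a platform with a 1-nearest-neighbour rule over the history."""
    _, nearest_runtimes = min(
        history, key=lambda record: dist(record[0], features)
    )
    return best_platform(nearest_runtimes)


def relative_saving(selected_runtime, baseline_runtime):
    """Fraction of execution time saved compared with a baseline platform."""
    return 1.0 - selected_runtime / baseline_runtime


if __name__ == "__main__":
    print(predict_platform((2_000, 10)))
    print(predict_platform((1_200_000, 10)))
    print(f"{relative_saving(60.0, 900.0):.2%}")
```

The saving is computed as one minus the ratio of the selected platform's runtime to the baseline runtime, which is the kind of quantity a figure such as "up to 99.71%" expresses. In this sketch the raw row count dominates the distance; a practical variant would presumably scale the meta-features first.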

Notes

  1. https://beam.apache.org/.

  2. https://github.com/szilard/benchm-ml.

  3. http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/RandomTree.html.

  4. https://mahout.apache.org/docs/0.13.0/api/docs/mahout-mr/org/apache/mahout/classifier/df/builder/DefaultTreeBuilder.html.

  5. https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html.

Acknowledgements

This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program ‘Services Computing’. Some work presented in this paper was performed in the Project ‘INTERACT’ as part of the Software Campus program, which is funded by the German Federal Ministry of Education and Research (BMBF) under Grant No.: 01IS17051.

Author information

Corresponding author

Correspondence to Manuel Fritz.

About this article

Cite this article

Fritz, M., Muazzen, O., Behringer, M. et al. ASAP-DM: a framework for automatic selection of analytic platforms for data mining. SICS Softw.-Intensiv. Cyber-Phys. Syst. 35, 17–29 (2020). https://doi.org/10.1007/s00450-019-00408-7

  • DOI: https://doi.org/10.1007/s00450-019-00408-7
