Skip to main content
Log in

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Frequent itemset mining is considered a popular tool to discover knowledge from transactional datasets. It also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns in which the apriori algorithm is considered as the earliest proposed. Apriori has two significant bottlenecks associated with it: first, repeated scanning of input dataset and second, the requirement of generation of all the candidate itemsets before counting its support value. These bottlenecks reduce the effectiveness of apriori for large-scale datasets. Reasonable efforts have been made to diminish these bottlenecks so that efficiency can be improved. Especially, when the data size is larger, even distributed and parallel environments like MapReduce does not perform well due to the iterative nature of the algorithm that incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme named efficient apriori-based frequent itemset mining (EAFIM) presents two novel methods to improve the efficiency further. Unlike apriori, it generates the candidates ‘on-the-fly,’ i.e., candidate generation, and count of its support values go simultaneously when the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it calculates the updated input dataset by removing useless items and transactions. Reduction in size of the input dataset for higher iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

References

  1. Aggarwal CC (2015) Data mining: the textbook. Springer, Berlin

    MATH  Google Scholar 

  2. Aggarwal CC, Bhuiyan MA, Al Hasan M (2014) Frequent pattern mining algorithms: a survey. In: Aggarwal CC, Han J (eds) Frequent pattern mining. Springer, Berlin, pp 19–64

    Google Scholar 

  3. Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam

    MATH  Google Scholar 

  4. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases, VLDB. vol 1215

  5. Anshu C, Raghuvanshi CS (2010) An algorithm for frequent pattern mining based on apriori. (IJCSE) Int J Comput Sci Eng 2(04):942–947

    Google Scholar 

  6. Borgelt C (2003) Efficient implementations of apriori and eclat. In: FIMI’03: Proceedings of the IEEE ICDM workshop on frequent itemset mining implementations

  7. Lin X (2014) MR-apriori: association rules algorithm based on mapreduce. In: 2014 5th IEEE international conference on software engineering and service science (ICSESS). IEEE

  8. Woo J (2012) Apriori-map/reduce algorithm. In: Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA). The steering committee of the world congress in computer science, computer engineering and applied computing (WorldComp)

  9. Lin M-Y, Lee P-Y, Hsueh S-C (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. ACM

  10. Li N et al (2012) Parallel implementation of apriori algorithm based on mapreduce. In: 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing. IEEE

  11. Yahya O, Hegazy O, Ezat E (2012) An efficient implementation of Apriori algorithm based on Hadoop–Mapreduce model. Int J Rev Comput 12:1–9

    Google Scholar 

  12. Apache hadoop (2013) http://hadoop.apache.org/

  13. Apache Spark (2016) Lightning-fast cluster computing. The Apache Software Foundation, http://spark.apache.org/. Spark1.6.1, 9 March 2016, Web

  14. Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning spark: lightning-fast big data analysis. O’Reilly Media Inc., Sebastopol

    Google Scholar 

  15. Moens S, Aksehirli E, Goethals B (2013) Frequent itemset mining for big data. In: BigData conference

  16. Yu K-M et al (2010) A load-balanced distributed parallel mining algorithm. Expert Syst Appl 37(3):2459–2464

    Article  Google Scholar 

  17. Thabtah F, Hammoud S (2012) Mr-arm: a map-reduce association rule mining framework. Parallel Process Lett 23(03):1350012

    Article  MathSciNet  Google Scholar 

  18. Aouad LM, Le-Khac N-A, Kechadi TM (2010) Performance study of distributed Apriori-like frequent itemsets mining. Knowl Inf Syst 23(1):55–72

    Article  Google Scholar 

  19. Chen Z, Cai S, Song Q, Zhu C (2011) An improved Apriori algorithm based on pruning optimization and transaction reduction. In: 2nd international conference on artificial intelligence, Management Science and Electronic Commerce (AIMSEC), Dengleng, 2011, pp 1908–1911

  20. Zhang F et al (2015) A distributed frequent itemset mining algorithm using Spark for big data analytics. Clust Comput 18(4):1493–1501

    Article  Google Scholar 

  21. Qiu H et al (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In: Parallel and distributed processing symposium workshops (IPDPSW), 2014 IEEE international. IEEE

  22. Rathee S, Kaul M, Kashyap A (2015) R-Apriori: an efficient apriori based algorithm on spark. In: Proceedings of the 8th workshop on Ph.D. workshop in information and knowledge management. ACM

  23. Dharavath R, Raj S (2017) Quantitative analysis of frequent itemsets using Apriori algorithm on Apache Spark framework. In: Behera HS, Mohapatra DP (eds) Computational intelligence in data mining. Springer, Singapore, pp 261–272

    Chapter  Google Scholar 

  24. Sethi KK, Ramesh D (2017) HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing. J Supercomput 73(8):3652–3668

    Article  Google Scholar 

  25. Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proceedings 2001 IEEE international conference on data mining. IEEE

  26. Dataset (2018) http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php

  27. Dataset (2018) http://fimi.ua.ac.be/data/

Download references

Acknowledgements

The authors would like to express their sincere thanks to the editor and anonymous reviewers for their valuable suggestions, which helped us to improve the paper significantly. The authors would like to express their heartiest thanks to the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for providing their research support.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dharavath Ramesh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Raj, S., Ramesh, D., Sreenu, M. et al. EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowl Inf Syst 62, 3565–3583 (2020). https://doi.org/10.1007/s10115-020-01464-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-020-01464-1

Keywords

Navigation