EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Raj, Shashi; Ramesh, Dharavath; Sreenu, M.; Sethi, Krishan Kumar

doi:10.1007/s10115-020-01464-1

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Regular Paper
Published: 07 April 2020

Volume 62, pages 3565–3583, (2020)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Shashi Raj¹,
Dharavath Ramesh ORCID: orcid.org/0000-0003-3338-6520¹,
M. Sreenu² &
…
Krishan Kumar Sethi¹

917 Accesses
37 Citations
Explore all metrics

Abstract

Frequent itemset mining is considered a popular tool to discover knowledge from transactional datasets. It also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns in which the apriori algorithm is considered as the earliest proposed. Apriori has two significant bottlenecks associated with it: first, repeated scanning of input dataset and second, the requirement of generation of all the candidate itemsets before counting its support value. These bottlenecks reduce the effectiveness of apriori for large-scale datasets. Reasonable efforts have been made to diminish these bottlenecks so that efficiency can be improved. Especially, when the data size is larger, even distributed and parallel environments like MapReduce does not perform well due to the iterative nature of the algorithm that incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme named efficient apriori-based frequent itemset mining (EAFIM) presents two novel methods to improve the efficiency further. Unlike apriori, it generates the candidates ‘on-the-fly,’ i.e., candidate generation, and count of its support values go simultaneously when the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it calculates the updated input dataset by removing useless items and transactions. Reduction in size of the input dataset for higher iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Trends and Future Perspective Challenges in Big Data

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Article 12 April 2024

Rashmin Gajera, Suresh Patel, … Ayush Solanki

Big data analytics on Apache Spark

Article 13 October 2016

Salman Salloum, Ruslan Dautov, … Joshua Zhexue Huang

References

Aggarwal CC (2015) Data mining: the textbook. Springer, Berlin
MATH Google Scholar
Aggarwal CC, Bhuiyan MA, Al Hasan M (2014) Frequent pattern mining algorithms: a survey. In: Aggarwal CC, Han J (eds) Frequent pattern mining. Springer, Berlin, pp 19–64
Google Scholar
Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
MATH Google Scholar
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases, VLDB. vol 1215
Anshu C, Raghuvanshi CS (2010) An algorithm for frequent pattern mining based on apriori. (IJCSE) Int J Comput Sci Eng 2(04):942–947
Google Scholar
Borgelt C (2003) Efficient implementations of apriori and eclat. In: FIMI’03: Proceedings of the IEEE ICDM workshop on frequent itemset mining implementations
Lin X (2014) MR-apriori: association rules algorithm based on mapreduce. In: 2014 5th IEEE international conference on software engineering and service science (ICSESS). IEEE
Woo J (2012) Apriori-map/reduce algorithm. In: Proceedings of the international conference on parallel and distributed processing techniques and applications (PDPTA). The steering committee of the world congress in computer science, computer engineering and applied computing (WorldComp)
Lin M-Y, Lee P-Y, Hsueh S-C (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. ACM
Li N et al (2012) Parallel implementation of apriori algorithm based on mapreduce. In: 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing. IEEE
Yahya O, Hegazy O, Ezat E (2012) An efficient implementation of Apriori algorithm based on Hadoop–Mapreduce model. Int J Rev Comput 12:1–9
Google Scholar
Apache hadoop (2013) http://hadoop.apache.org/
Apache Spark (2016) Lightning-fast cluster computing. The Apache Software Foundation, http://spark.apache.org/. Spark1.6.1, 9 March 2016, Web
Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning spark: lightning-fast big data analysis. O’Reilly Media Inc., Sebastopol
Google Scholar
Moens S, Aksehirli E, Goethals B (2013) Frequent itemset mining for big data. In: BigData conference
Yu K-M et al (2010) A load-balanced distributed parallel mining algorithm. Expert Syst Appl 37(3):2459–2464
Article Google Scholar
Thabtah F, Hammoud S (2012) Mr-arm: a map-reduce association rule mining framework. Parallel Process Lett 23(03):1350012
Article MathSciNet Google Scholar
Aouad LM, Le-Khac N-A, Kechadi TM (2010) Performance study of distributed Apriori-like frequent itemsets mining. Knowl Inf Syst 23(1):55–72
Article Google Scholar
Chen Z, Cai S, Song Q, Zhu C (2011) An improved Apriori algorithm based on pruning optimization and transaction reduction. In: 2nd international conference on artificial intelligence, Management Science and Electronic Commerce (AIMSEC), Dengleng, 2011, pp 1908–1911
Zhang F et al (2015) A distributed frequent itemset mining algorithm using Spark for big data analytics. Clust Comput 18(4):1493–1501
Article Google Scholar
Qiu H et al (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In: Parallel and distributed processing symposium workshops (IPDPSW), 2014 IEEE international. IEEE
Rathee S, Kaul M, Kashyap A (2015) R-Apriori: an efficient apriori based algorithm on spark. In: Proceedings of the 8th workshop on Ph.D. workshop in information and knowledge management. ACM
Dharavath R, Raj S (2017) Quantitative analysis of frequent itemsets using Apriori algorithm on Apache Spark framework. In: Behera HS, Mohapatra DP (eds) Computational intelligence in data mining. Springer, Singapore, pp 261–272
Chapter Google Scholar
Sethi KK, Ramesh D (2017) HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing. J Supercomput 73(8):3652–3668
Article Google Scholar
Gouda K, Zaki MJ (2001) Efficiently mining maximal frequent itemsets. In: Proceedings 2001 IEEE international conference on data mining. IEEE
Dataset (2018) http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
Dataset (2018) http://fimi.ua.ac.be/data/

Download references

Acknowledgements

The authors would like to express their sincere thanks to the editor and anonymous reviewers for their valuable suggestions, which helped us to improve the paper significantly. The authors would like to express their heartiest thanks to the Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for providing their research support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, Jharkhand, 826004, India
Shashi Raj, Dharavath Ramesh & Krishan Kumar Sethi
Department of Computer Science and Engineering, National Institute of Technology, Hamirpur, Himachal Pradesh, India
M. Sreenu

Authors

Shashi Raj
View author publications
You can also search for this author in PubMed Google Scholar
Dharavath Ramesh
View author publications
You can also search for this author in PubMed Google Scholar
M. Sreenu
View author publications
You can also search for this author in PubMed Google Scholar
Krishan Kumar Sethi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dharavath Ramesh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Raj, S., Ramesh, D., Sreenu, M. et al. EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowl Inf Syst 62, 3565–3583 (2020). https://doi.org/10.1007/s10115-020-01464-1

Download citation

Received: 24 August 2018
Revised: 20 March 2020
Accepted: 21 March 2020
Published: 07 April 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s10115-020-01464-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Big data analytics on Apache Spark

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Big data analytics on Apache Spark

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation