CL-MAX: a clustering-based approximation algorithm for mining maximal frequent itemsets
International Journal of Machine Learning and Cybernetics (IF 3.1), Pub Date: 2020-08-10, DOI: 10.1007/s13042-020-01177-5
Seyed Mohsen Fatemi, Seyed Mohsen Hosseini, Ali Kamandi, Mahmood Shabankhah

Frequent itemset mining is one of the more important problems in data mining and has been applied extensively to related tasks such as market basket analysis in marketing and text analysis in text mining. Most of the deterministic frequent itemset mining algorithms proposed in recent years rely on some form of optimized data structure to reduce overall execution time. In this paper, we instead introduce an approximation algorithm that converts the problem into a clustering problem in which similar transactions are grouped together. Each cluster centroid represents an itemset that is assumed to be a candidate frequent itemset. This assumption is verified by calculating the support count of each such itemset. Those that meet the min-support condition are considered actual frequent itemsets. The remaining itemsets are passed to MAFIA, which extracts all maximal frequent itemsets from them. Experiments on several well-known and diverse datasets show that the proposed algorithm is almost always faster than existing deterministic algorithms, in some cases up to 10 times faster, while retaining up to 95% of their accuracy.
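The pipeline described in the abstract (cluster transactions, read each centroid as a candidate itemset, then check its support count) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the Jaccard similarity measure, the majority-vote centroid rule, and the caller-supplied seed itemsets are all hypothetical choices. The MAFIA step is not implemented; the sketch only returns the itemsets that would be handed to it.

```python
from collections import Counter


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def jaccard(a, b):
    """Jaccard similarity between two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def centroid_itemset(cluster, threshold=0.5):
    """Assumed centroid rule: keep items present in at least
    `threshold` of the cluster's transactions."""
    counts = Counter(item for t in cluster for item in t)
    return frozenset(i for i, c in counts.items() if c / len(cluster) >= threshold)


def cluster_transactions(transactions, seeds, rounds=5):
    """Simple k-means-style loop over sets, starting from given seed itemsets."""
    centroids = [frozenset(s) for s in seeds]
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for t in transactions:
            # Assign each transaction to its most similar centroid.
            best = max(range(len(centroids)), key=lambda k: jaccard(t, centroids[k]))
            clusters[best].append(t)
        updated = [
            centroid_itemset(c) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def split_candidates(transactions, seeds, min_support):
    """Verify each centroid by its support count.

    Returns (frequent, remaining): centroids that meet min_support with
    their counts, and those that do not. In the abstract's description the
    remaining itemsets would go to MAFIA, which is not implemented here.
    """
    frequent, remaining = [], []
    seen = set()
    for c in cluster_transactions(transactions, seeds):
        if c in seen:
            continue
        seen.add(c)
        count = support_count(c, transactions)
        if count >= min_support:
            frequent.append((c, count))
        else:
            remaining.append(c)
    return frequent, remaining


if __name__ == "__main__":
    data = [frozenset(s) for s in
            ["abc", "abc", "ab", "abc", "de", "de", "def", "de"]]
    print(split_candidates(data, seeds=["abc", "de"], min_support=4))
```

In this toy run, the centroid {d, e} is contained in four transactions and passes a min-support of 4, while {a, b, c} is contained in only three and lands in the remaining list.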




Updated: 2020-08-10