On Efficient Mining of Frequent Itemsets from Big Uncertain Databases
Journal of Grid Computing ( IF 5.5 ) Pub Date : 2018-08-06 , DOI: 10.1007/s10723-018-9456-0
Ahsan Shah , Zahid Halim

In the current era of information, communication, and technology, data is being generated at an exponential rate. This gives machine learning and data mining algorithms an opportunity to learn from huge data repositories. At the same time, however, big data poses many challenges, with data uncertainty being the key concern of modern data mining systems. This work addresses the problem of extracting frequent itemsets from such large uncertain databases to help decision makers understand non-trivial data trends. The usual technique for finding frequent itemsets in uncertain databases is known as Possible World Semantics (PWS). However, as the database size increases, PWS suffers from performance issues, so more efficient frequent pattern mining algorithms are needed. This work presents three techniques to address the issue: a 3D linked array-based strategy, a connected tree technique, and an average probability-based setup supported by a tree data structure. The objective is to minimize computational cost by traversing the database only once. The 3D linked array-based solution scans the database once and stores, within the 3D array, the support information of each item and its associations with other items. In the tree-based method, a 1D array is associated with each node of the tree; it comprises the support information of the database items and their associations with other items. The average probability-based approach computes an average probability factor and uses it to map the uncertain database to a tree. The proposal addresses attribute uncertainty as well as tuple uncertainty when mapping large uncertain databases to the proposed data structures. In addition to introducing the three data structures, this work also presents algorithms to extract frequent itemsets from them.
The proposal is compared with four recent works in this domain for uncertain data, namely the mining threshold-based (MB) technique, frequent itemsets using nodesets (FIN), PrePost+, and uncertain Apriori (UApriori). Experiments are performed on four benchmark datasets. The results suggest that the three techniques presented here perform better than these baselines, while consuming 60% less execution time.
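The single-scan idea in the abstract can be illustrated with a small sketch. It is not the paper's 3D linked array or tree structure, whose layouts the abstract does not specify. It assumes attribute-level uncertainty with independent item probabilities and uses the common expected-support measure. The names `one_pass_expected_support`, `pws_expected_support`, and `min_esup` are hypothetical. The brute-force function enumerates the possible worlds of each transaction only to cross-check the one-pass result, and it shows why enumeration becomes expensive as data grows.

```python
from collections import defaultdict
from itertools import combinations, product


def one_pass_expected_support(transactions):
    """Scan the database once, accumulating expected support.

    Each transaction is a dict mapping item -> existence probability.
    Assumption: item probabilities are independent, so the probability
    that a pair occurs in a transaction is the product of its members.
    """
    item_support = defaultdict(float)
    pair_support = defaultdict(float)
    for t in transactions:
        for item, p in t.items():
            item_support[item] += p
        # Record each item's association with the other items it co-occurs with.
        for a, b in combinations(sorted(t), 2):
            pair_support[(a, b)] += t[a] * t[b]
    return dict(item_support), dict(pair_support)


def pws_expected_support(transactions, itemset):
    """Brute-force cross-check: enumerate every possible world of each
    transaction (2^n worlds for n items) and sum the probability of the
    worlds that contain the itemset. By linearity of expectation, summing
    over transactions gives the expected support over the whole database.
    """
    wanted = set(itemset)
    total = 0.0
    for t in transactions:
        items = list(t)
        for mask in product([False, True], repeat=len(items)):
            prob = 1.0
            world = set()
            for item, present in zip(items, mask):
                prob *= t[item] if present else 1.0 - t[item]
                if present:
                    world.add(item)
            if wanted <= world:
                total += prob
    return total


# Toy uncertain database (illustrative values, not from the paper).
db = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "c": 0.8},
    {"a": 1.0, "b": 0.5, "c": 0.2},
]

items, pairs = one_pass_expected_support(db)
min_esup = 0.9  # hypothetical minimum expected support threshold
frequent_pairs = sorted(p for p, s in pairs.items() if s >= min_esup)
print(items, pairs, frequent_pairs)
```

The one-pass scan does work proportional to the number of item pairs per transaction, whereas the enumeration grows exponentially with transaction length. Extending the sketch beyond pairs would need a candidate-generation or tree-based step, which is where the structures described in the abstract would come in.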

Updated: 2018-08-06