Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications
Data Mining and Knowledge Discovery (IF 2.8), Pub Date: 2019-07-20, DOI: 10.1007/s10618-019-00643-1
Domenico Saccá, Edoardo Serra, Antonino Rullo

The development of novel platforms and techniques for emerging “Big Data” applications requires real-life datasets for data-driven experiments, which in most cases are not accessible for reasons such as confidentiality, privacy, or simply insufficient availability. An interesting way to ensure high-quality experimental findings is to synthesize datasets that reflect the patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z (latent variables); then, these patterns are used to reconstruct a new dataset \(X'\) that is similar to X but not identical to it. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (\(\texttt{IFM}\)), which consists of generating a transactional dataset that satisfies given support constraints on the itemsets of an input set, typically the frequent ones. This paper introduces various extensions of \(\texttt{IFM}\) within a uniform framework, with the aim of generating artificial datasets that reflect more elaborate patterns of real ones (in particular, infrequency and duplicate constraints). Furthermore, to enlarge the application domain of \(\texttt{IFM}\), an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications such as social network analytics.
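To make the basic problem statement concrete, the following is a minimal sketch of inverse frequent itemset mining on a toy instance. It is not the method of the paper: it assumes each support constraint is given as a closed interval of absolute counts, fixes the number of transactions in advance, and uses exhaustive search, which is feasible only for a handful of items. The names `support`, `satisfies`, and `brute_force_ifm` are hypothetical and chosen for illustration.

```python
from itertools import combinations, combinations_with_replacement


def support(itemset, dataset):
    """Count the transactions in `dataset` that contain every item of `itemset`."""
    return sum(1 for transaction in dataset if itemset <= transaction)


def satisfies(dataset, constraints):
    """Return True if each itemset's support lies within its (lo, hi) bounds."""
    return all(
        lo <= support(itemset, dataset) <= hi
        for itemset, (lo, hi) in constraints.items()
    )


def brute_force_ifm(items, constraints, size):
    """Search all multisets of `size` non-empty transactions over `items`.

    Returns the first dataset that meets every constraint, or None if no
    dataset of that size exists. Assumption: exhaustive search stands in
    for the paper's algorithms and is only usable on tiny inputs.
    """
    items = sorted(items)
    candidates = [
        frozenset(combo)
        for r in range(1, len(items) + 1)
        for combo in combinations(items, r)
    ]
    for dataset in combinations_with_replacement(candidates, size):
        if satisfies(dataset, constraints):
            return list(dataset)
    return None


# Toy constraints (assumed for illustration): {a, b} must occur in exactly
# two transactions, {c} in exactly one, and {a, b, c} in none.
constraints = {
    frozenset("ab"): (2, 2),
    frozenset("c"): (1, 1),
    frozenset("abc"): (0, 0),
}
result = brute_force_ifm("abc", constraints, size=3)

# An infeasible instance: {a} cannot be rarer than {a, b}.
infeasible = brute_force_ifm(
    "ab", {frozenset("ab"): (3, 3), frozenset("a"): (0, 2)}, size=3
)
```

The upper bound of zero on `{a, b, c}` plays the role of an infrequency constraint in this toy setting; the second call shows that some constraint sets admit no dataset at all, which is why the complexity of the problem matters.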

Updated: 2019-07-20