Sequential pattern sampling with norm-based utility
Knowledge and Information Systems (IF 2.5), Pub Date: 2019-10-26, DOI: 10.1007/s10115-019-01417-3
Lamine Diop, Cheikh Talibouya Diop, Arnaud Giacometti, Dominique Li, Arnaud Soulet

Sequential pattern mining was introduced by Agrawal and Srikant (in: Proceedings of ICDE’95, pp 3–14, 1995) two decades ago, and its usefulness has been widely demonstrated for different mining tasks and application fields such as web usage mining, text mining, bioinformatics and fraud detection. Since 1995, despite numerous optimization proposals, sequential pattern mining has remained a costly task that often generates too many patterns. This limitation, which also affects itemset mining, was circumvented by pattern sampling. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns; it ensures good interactivity while providing strong statistical guarantees thanks to its random nature. Curiously, this approach, which has been investigated for different kinds of patterns including itemsets and subgraphs, has not yet been applied to sequential patterns. In this paper, we propose the first method dedicated to sequential pattern sampling. In addition to addressing sequential data, the originality of our approach is to introduce a class of interestingness measures relying on the norm of the sequence, named norm-based utilities. In particular, this class makes it possible to add constraints on the norm of sampled patterns in order to control the length of the drawn patterns and to avoid the pitfall of the “long tail”, where the rarest patterns flood the user. We propose a new two-step random procedure integrating this class of measures, named NUSSampling, that randomly draws sequential patterns according to frequency weighted by a norm-based utility. We demonstrate that this method performs an exact sampling according to the underlying measure. Moreover, despite the use of rejection sampling, the experimental study shows that NUSSampling remains efficient. We especially focus on the benefit of norm constraints and exponential decays, which help to draw general patterns from the “head”. We also illustrate how to use these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state-of-the-art proposals, showing the value of sequential pattern sampling with norm-based utility.
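
To make the idea of drawing patterns "according to frequency weighted by a norm-based utility" more concrete, here is a minimal toy sketch in Python. It is not the authors' NUSSampling: the abstract does not give the procedure's details, so everything below is an assumption made for illustration. In particular, sequences are simplified to strings of single items, the norm of a pattern is taken to be its length, `utility` is a hypothetical exponential decay with a maximum-norm constraint, and a rejected draw restarts from step 1. Under those assumptions the sampler returns a pattern p with probability proportional to freq(p) × u(|p|), which the brute-force helper `target_distribution` lets one check on small data.

```python
import math
import random
from itertools import combinations


def utility(length, max_norm=3, decay=0.5):
    """Hypothetical norm-based utility: exponential decay, zero above max_norm."""
    if length < 1 or length > max_norm:
        return 0.0
    return decay ** length


def occurrence_weight(seq, u):
    """Total utility of all non-empty index subsets of seq (duplicates included)."""
    n = len(seq)
    return sum(math.comb(n, k) * u(k) for k in range(1, n + 1))


def is_first_occurrence(seq, idx):
    """True if idx (sorted positions) is the leftmost embedding of its pattern."""
    pos = 0
    for i in idx:
        # Greedy leftmost match of the next pattern item, searching from pos.
        if seq.index(seq[i], pos) != i:
            return False
        pos = i + 1
    return True


def sample_pattern(db, u, rng):
    """Draw one pattern with probability proportional to frequency * utility."""
    weights = [occurrence_weight(s, u) for s in db]
    while True:
        # Step 1: draw a sequence proportionally to its occurrence weight.
        seq = rng.choices(db, weights=weights, k=1)[0]
        n = len(seq)
        # Step 2: draw a length, then a uniform set of positions of that length.
        lengths = range(1, n + 1)
        length_weights = [math.comb(n, k) * u(k) for k in lengths]
        k = rng.choices(lengths, weights=length_weights, k=1)[0]
        idx = sorted(rng.sample(range(n), k))
        # Rejection (assumed scheme): keep only the leftmost occurrence, so that
        # each (sequence, pattern) pair is counted once; otherwise restart.
        if is_first_occurrence(seq, idx):
            return tuple(seq[i] for i in idx)


def target_distribution(db, u):
    """Brute-force target: P(p) proportional to freq(p) * u(|p|). Small data only."""
    scores = {}
    for s in db:
        distinct = {
            tuple(s[i] for i in idx)
            for k in range(1, len(s) + 1)
            for idx in combinations(range(len(s)), k)
        }
        for p in distinct:
            scores[p] = scores.get(p, 0.0) + u(len(p))
    total = sum(scores.values())
    return {p: w / total for p, w in scores.items() if w > 0}


if __name__ == "__main__":
    toy_db = ["abcab", "bca", "ab"]  # hypothetical toy database
    rng = random.Random(0)
    print([sample_pattern(toy_db, utility, rng) for _ in range(5)])
```

In this sketch the maximum-norm constraint caps the length of drawn patterns and the decay shifts probability mass toward short, general patterns, which is the effect the abstract attributes to norm constraints and exponential decays. Comparing empirical frequencies from many calls to `sample_pattern` against `target_distribution` is one way to sanity-check the exactness of this toy version.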

Updated: 2019-10-26