Sequence Pattern Mining with Variables, IEEE Transactions on Knowledge and Data Engineering


Sequence Pattern Mining with Variables
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2020-01-01, DOI: 10.1109/tkde.2018.2881675
James S. Okolica, Gilbert L. Peterson, Robert F. Mills, Michael R. Grimaila

Sequence pattern mining (SPM) seeks to find multiple items that commonly occur together in a specific order. One common assumption is that the relevant differences between items are captured by creating distinct items. In some domains, this leads to an exponential increase in the number of items. This paper presents a new SPM algorithm, Sequence Mining of Temporal Clusters (SMTC), that allows item differentiation through attribute variables for domains with large numbers of items. It also provides a new technique for addressing interleaving, a phenomenon that occurs when two sequences occur simultaneously, resulting in their items alternating. By first clustering items temporally and focusing on sequences only after the temporal clusters are established, SMTC sidesteps the traditional interleaving issues. SMTC is evaluated on a digital forensics dataset, a domain with a large number of items and frequent interleaving. Its results are compared with Discontinuous Varied Order Sequence Mining (DVSM) with variables added (DVSM-V). By adding variables, both algorithms reduce the data by 96 percent and identify 100 percent of the events while keeping the false positive rate below 0.03 percent. SMTC mines the data in 20 percent of the time it takes DVSM-V and provides a lower false positive rate even at higher similarity thresholds.
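
The two ideas named in the abstract, clustering items by time before looking at order and using attribute variables instead of distinct items, can be illustrated with a toy sketch. This is not the SMTC algorithm from the paper, whose details the abstract does not give. The `Event` fields, the fixed `max_gap` clustering rule, the `X0`, `X1` variable naming, and the exact-match support count are all assumptions made here for illustration, and the sketch does not attempt to reproduce the paper's handling of interleaving or its similarity thresholds.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical event record; the field names are assumptions.
    time: float
    action: str
    target: str  # attribute that would otherwise multiply the distinct items


def temporal_clusters(events, max_gap):
    """Group events by time: a new cluster starts when the gap exceeds max_gap."""
    clusters = []
    for ev in sorted(events, key=lambda e: e.time):
        if clusters and ev.time - clusters[-1][-1].time <= max_gap:
            clusters[-1].append(ev)
        else:
            clusters.append([ev])
    return clusters


def abstract_cluster(cluster):
    """Replace concrete attribute values with variables X0, X1, ... in order of first use."""
    names = {}
    pattern = []
    for ev in cluster:
        var = names.setdefault(ev.target, f"X{len(names)}")
        pattern.append((ev.action, var))
    return tuple(pattern)


def frequent_patterns(events, max_gap, min_support):
    """Count identical variable-abstracted cluster patterns and keep the frequent ones."""
    counts = Counter(abstract_cluster(c) for c in temporal_clusters(events, max_gap))
    return {p: n for p, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    log = [
        Event(0.0, "open", "a.txt"), Event(0.5, "write", "a.txt"), Event(1.0, "close", "a.txt"),
        Event(10.0, "open", "b.doc"), Event(10.4, "write", "b.doc"), Event(11.0, "close", "b.doc"),
        Event(20.0, "open", "c.txt"),
    ]
    print(frequent_patterns(log, max_gap=2.0, min_support=2))
```

In this toy log, the two open/write/close runs touch different files yet collapse to the same pattern once the file name becomes a variable, so no separate item is needed per file.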


Updated: 2020-01-01
