Scalable Mining of Contextual Outliers Using Relevant Subspace, IEEE Transactions on Systems, Man, and Cybernetics: Systems


Scalable Mining of Contextual Outliers Using Relevant Subspace
IEEE Transactions on Systems, Man, and Cybernetics: Systems ( IF 8.6 ) Pub Date : 2020-03-01 , DOI: 10.1109/tsmc.2017.2718592
Jifu Zhang , Xiaolong Yu , Yaling Xun , Sulan Zhang , Xiao Qin

In this paper, we propose a scalable mining algorithm that discovers contextual outliers using relevant subspaces. The algorithm is developed with the MapReduce programming model and runs on a Hadoop cluster. Relevant subspaces, which effectively capture the local distribution of various datasets, are quantified using the local sparseness of attribute dimensions. We design a novel way of calculating local outlier factors in a relevant subspace from the probability density of the local dataset; this approach effectively reflects the outlier degree of a data object that does not fit the distribution of the local dataset in the relevant subspace. The attribute dimensions of a relevant subspace and the local outlier factors are expressed as vital contextual information, which improves the interpretability of outliers. Importantly, the $N$ data objects with the largest local outlier factor values are selected as contextual outliers in our solution. To this end, our scalable mining algorithm, which incorporates the locality-sensitive hashing distributed strategy, is implemented on a Hadoop cluster. Experimental results on both synthetic data and stellar spectral data validate the effectiveness, interpretability, scalability, and extensibility of the algorithm.
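The pipeline outlined in the abstract (hash-based partitioning, local scoring, top-$N$ selection) can be illustrated with a small single-machine sketch. This is not the authors' implementation: the abstract gives no formulas, so the random-hyperplane hash, the per-dimension Gaussian negative log-density score, and all function names (`lsh_key`, `map_phase`, `bucket_scores`, `top_n_outliers`) are assumptions made purely for illustration, and relevant-subspace selection is approximated only by skipping dimensions with zero spread.

```python
import heapq
import math
from collections import defaultdict


def lsh_key(point, hyperplanes):
    """Hypothetical LSH: sign pattern of the point against given hyperplanes."""
    return tuple(
        sum(p * h for p, h in zip(point, plane)) >= 0 for plane in hyperplanes
    )


def map_phase(points, hyperplanes):
    """Group (index, point) pairs by hash key, mimicking a map/shuffle step."""
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[lsh_key(point, hyperplanes)].append((idx, point))
    return buckets


def bucket_scores(bucket):
    """Score each object within one bucket.

    Stand-in for the paper's local outlier factor: the sum over dimensions of
    0.5 * z**2, i.e. a Gaussian negative log-density up to constants.
    Dimensions with zero spread in the bucket are ignored.
    """
    n = len(bucket)
    dims = len(bucket[0][1])
    means = [sum(p[d] for _, p in bucket) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((p[d] - means[d]) ** 2 for _, p in bucket) / n)
        for d in range(dims)
    ]
    return {
        idx: sum(
            0.5 * ((p[d] - means[d]) / stds[d]) ** 2
            for d in range(dims)
            if stds[d] > 0
        )
        for idx, p in bucket
    }


def top_n_outliers(points, hyperplanes, n):
    """Return the n (index, score) pairs with the largest scores."""
    scores = {}
    for bucket in map_phase(points, hyperplanes).values():
        scores.update(bucket_scores(bucket))
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
```

On a real cluster, the grouping and scoring steps would correspond to map and reduce tasks, with the final top-$N$ merge done over the per-bucket results; here everything runs in one process to keep the idea visible.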


Updated: 2020-03-01
