Anomaly Detection in High Dimensional Data


Journal of Computational and Graphical Statistics (IF 1.4), Pub Date: 2020-09-30, DOI: 10.1080/10618600.2020.1807997
Priyanga Dilini Talagala _{1,2,3}, Rob J. Hyndman _{1,2}, Kate Smith-Miles _{2,4}


The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation that deviates markedly from the majority with a large distance gap. An approach based on extreme value theory is used to calculate the anomaly threshold. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies in other data structures using feature engineering. We show the situations in which the stray algorithm outperforms the HDoutliers algorithm in both accuracy and computational time. This framework is implemented in the open source R package stray.
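The two ideas named in the abstract, a distance-gap definition of an anomaly and a data-driven threshold, can be illustrated with a small Python sketch. This is not the stray package's implementation and does not reproduce its extreme value theory calculation. The function names `knn_gap_scores` and `gap_cutoff`, the parameters `k` and `factor`, and the choice to measure the first gap from zero are all assumptions made for illustration. The cutoff is a simple heuristic stand-in for the extreme-value step.

```python
import math
import statistics


def knn_gap_scores(points, k=3):
    """Score each point by the k-NN distance that follows its largest gap.

    Simplifying assumption: the first gap is measured from zero, so an
    isolated point keeps its full nearest-neighbour distance as its score.
    """
    scores = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, keeping the k smallest.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        # Successive differences between the sorted neighbour distances.
        gaps = [dists[0]] + [b - a for a, b in zip(dists, dists[1:])]
        scores.append(dists[gaps.index(max(gaps))])
    return scores


def gap_cutoff(scores, factor=3.0):
    """Return a cutoff; scores strictly above it are flagged.

    Heuristic stand-in for the extreme-value step (an assumption, not the
    paper's method): scan the upper half of the sorted scores for the first
    jump larger than `factor` times the median score.
    """
    ordered = sorted(scores)
    median = statistics.median(ordered)
    half = len(ordered) // 2
    for lo, hi in zip(ordered[half:], ordered[half + 1:]):
        if hi - lo > factor * median:
            return lo
    return ordered[-1]  # no large jump, so nothing is flagged


# A 5 x 5 unit grid plus one distant point at index 25.
points = [(float(x), float(y)) for x in range(5) for y in range(5)] + [(20.0, 20.0)]
scores = knn_gap_scores(points)
cutoff = gap_cutoff(scores)
flagged = [i for i, s in enumerate(scores) if s > cutoff]
print(flagged)  # prints [25]
```

Taking the distance after the largest gap, rather than only the nearest-neighbour distance, means a small cluster of anomalies sitting close to each other can still receive a large score, provided `k` exceeds the cluster size. For real analyses the R package stray named in the abstract is the reference implementation.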


Updated: 2020-09-30
