Benchmarking Unsupervised Outlier Detection with Realistic Synthetic Data
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2021-04-18, DOI: 10.1145/3441453
Georg Steinbuss, Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose and describe a generic process for generating datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We then describe three instantiations of this generic process that generate outliers with specific characteristics, such as local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of data reconstructed in this way. Besides showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. In summary, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
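
The core idea in the abstract (reconstruct regular instances from real data, then generate outliers with known characteristics) can be illustrated with a minimal sketch. This is not the authors' method: the per-attribute independent Gaussian model, the rejection-sampling rule for outliers, and all function and parameter names (`fit_model`, `sample_outliers`, `z_min`, `z_max`, and so on) are assumptions made here for illustration only. The sketch also assumes every attribute is non-constant, so its standard deviation is nonzero.

```python
import random
import statistics


def fit_model(regular):
    """Fit an independent Gaussian per attribute (a deliberately simple stand-in model)."""
    columns = list(zip(*regular))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def sample_regular(model, n, rng):
    """Draw n synthetic regular instances from the fitted model."""
    return [[rng.gauss(mu, sigma) for mu, sigma in model] for _ in range(n)]


def max_abs_z(model, x):
    """Largest absolute z-score of x over all attributes."""
    return max(abs(v - mu) / sigma for v, (mu, sigma) in zip(x, model))


def sample_outliers(model, n, rng, z_min=4.0, z_max=6.0):
    """Rejection-sample n points uniformly from a box of +/- z_max sigmas,
    keeping only points with at least one attribute beyond z_min sigmas."""
    outliers = []
    while len(outliers) < n:
        x = [rng.uniform(mu - z_max * sigma, mu + z_max * sigma) for mu, sigma in model]
        if max_abs_z(model, x) >= z_min:
            outliers.append(x)
    return outliers


def make_benchmark(regular, n_regular, n_outliers, seed=0):
    """Return synthetic data, ground-truth labels (1 = outlier), and the fitted model."""
    rng = random.Random(seed)
    model = fit_model(regular)
    data = sample_regular(model, n_regular, rng) + sample_outliers(model, n_outliers, rng)
    labels = [0] * n_regular + [1] * n_outliers
    return data, labels, model


if __name__ == "__main__":
    # Hypothetical "real" regular instances; in practice these would come from benchmark data.
    rng = random.Random(1)
    real_regular = [[rng.gauss(0, 1), rng.gauss(5, 2)] for _ in range(200)]
    data, labels, _ = make_benchmark(real_regular, n_regular=100, n_outliers=5)
    print(len(data), sum(labels))
```

Because the outliers are generated against a known model, their ground-truth labels and their defining property (here, an extreme value in at least one attribute) are exact, which is the benefit the abstract attributes to synthetic data. A more faithful instantiation would presumably use a richer model of the regular instances and other outlier types, such as the local outliers the abstract mentions.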

Updated: 2021-04-18