Automating Outlier Detection via Meta-Learning, arXiv - CS - Machine Learning



arXiv - CS - Machine Learning, Pub Date: 2020-09-22, DOI: arxiv-2009.10606
Yue Zhao, Ryan A. Rossi, Leman Akoglu

Given an unsupervised outlier detection (OD) task on a new dataset, how can we automatically select a good outlier detection method and its hyperparameter(s) (collectively called a model)? Thus far, model selection for OD has been a "black art", since model evaluation is infeasible due to the lack of (i) hold-out data with labels, and (ii) a universal objective function. In this work, we develop the first principled data-driven approach to model selection for OD, called MetaOD, based on meta-learning. MetaOD capitalizes on the past performance of a large body of detection models on existing outlier detection benchmark datasets, and carries over this prior experience to automatically select an effective model for a new dataset. To capture task similarity, we introduce specialized meta-features that quantify the outlying characteristics of a dataset. Through comprehensive experiments, we show that MetaOD selects a detection model that significantly outperforms the most popular outlier detectors (e.g., LOF and iForest) as well as various state-of-the-art unsupervised meta-learners, while being extremely fast. To foster reproducibility and further research on this new problem, we open-source our entire meta-learning system, benchmark environment, and testbed datasets.
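The general idea of carrying over past performance to a new dataset can be illustrated with a minimal sketch. This is not MetaOD's actual algorithm, which the abstract does not specify; it is a simple nearest-neighbor stand-in that assumes a matrix of past model scores and a table of dataset meta-features are already available. The function name `select_model`, the model labels, and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def select_model(perf, meta_train, meta_new, k=2):
    """Pick a model index for a new dataset from past performance.

    perf:       (n_datasets, n_models) past scores, higher is better
    meta_train: (n_datasets, n_features) meta-features of past datasets
    meta_new:   (n_features,) meta-features of the new dataset

    Assumption: a nearest-neighbor rule stands in for the meta-learner.
    """
    # Standardize meta-features so no single feature dominates the distance.
    mu = meta_train.mean(axis=0)
    sigma = meta_train.std(axis=0)
    sigma[sigma == 0] = 1.0
    z_train = (meta_train - mu) / sigma
    z_new = (meta_new - mu) / sigma

    # Task similarity: Euclidean distance in meta-feature space.
    dist = np.linalg.norm(z_train - z_new, axis=1)
    neighbors = np.argsort(dist)[:k]

    # Recommend the model with the best mean score on the k nearest datasets.
    return int(np.argmax(perf[neighbors].mean(axis=0)))


# Toy, made-up numbers (not results from the paper).
perf = np.array([
    [0.90, 0.60, 0.55],
    [0.85, 0.65, 0.50],
    [0.40, 0.80, 0.70],
    [0.45, 0.75, 0.90],
])
meta_train = np.array([
    [0.1, 1.0],
    [0.2, 1.1],
    [0.9, 3.0],
    [1.0, 3.2],
])
# Hypothetical model labels: a detector plus one hyperparameter setting.
models = ["LOF(k=10)", "iForest(n=100)", "kNN(k=5)"]

choice = select_model(perf, meta_train, np.array([0.15, 1.05]), k=2)
print(models[choice])
```

No labels from the new dataset are used at any point: the recommendation depends only on its meta-features and on scores recorded for earlier datasets, which is what makes this kind of selection possible in the unsupervised setting.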


Updated: 2020-09-23
