The value of human data annotation for machine learning based anomaly detection in environmental systems
Water Research (IF 11.4), Pub Date: 2021-09-27, DOI: 10.1016/j.watres.2021.117695
Stefania Russo 1 , Michael D Besmer 2 , Frank Blumensaat 3 , Damien Bouffard 4 , Andy Disch 4 , Frederik Hammes 4 , Angelika Hess 3 , Moritz Lürig 5 , Blake Matthews 6 , Camille Minaudo 7 , Eberhard Morgenroth 3 , Viet Tran-Khac 8 , Kris Villez 9

Anomaly detection is the process of identifying unexpected data samples in datasets. Automated anomaly detection is performed using either supervised machine learning models, which require a labelled dataset for their calibration, or unsupervised models, which do not require labels. While academic research has produced a vast array of tools and machine learning models for automated anomaly detection, the research community focused on environmental systems still lacks a comparative analysis that is simultaneously comprehensive, objective, and systematic. This knowledge gap is addressed for the first time in this study, where 15 different supervised and unsupervised anomaly detection models are evaluated on 5 different environmental datasets from engineered and natural aquatic systems. To this end, anomaly detection performance, labelling effort, and the impact of model and algorithm tuning are taken into account. As a result, our analysis reveals the relative strengths and weaknesses of the different approaches in an objective manner, without bias towards any particular paradigm in machine learning. Most importantly, our results show that expert-based data annotation is extremely valuable for anomaly detection based on machine learning.
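
The distinction between the two families of models can be illustrated with a minimal sketch. It is not one of the 15 models evaluated in the study: the function names, the median-deviation score, the MAD-based rule, and the toy sensor readings and expert labels below are all assumptions made for illustration. The unsupervised detector sets its threshold from the data alone, while the supervised detector calibrates its threshold against labels by maximising F1.

```python
from statistics import median


def deviation_scores(values):
    """Anomaly score (assumed for this sketch): absolute deviation from the median."""
    centre = median(values)
    return [abs(v - centre) for v in values]


def detect_unsupervised(values, k=3.0):
    """Flag points without labels, using a robust MAD-based threshold."""
    scores = deviation_scores(values)
    mad = median(scores)  # median absolute deviation
    threshold = k * 1.4826 * mad  # 1.4826 scales MAD to a Gaussian std. dev.
    return [s > threshold for s in scores]


def f1_score(predicted, labels):
    """F1 of boolean predictions against boolean labels."""
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(y and not p for p, y in zip(predicted, labels))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def detect_supervised(values, labels):
    """Calibrate the threshold on labelled data by maximising F1."""
    scores = deviation_scores(values)
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        predicted = [s >= candidate for s in scores]
        current = f1_score(predicted, labels)
        if current > best_f1:
            best_threshold, best_f1 = candidate, current
    return [s >= best_threshold for s in scores]


def flagged(flags):
    """Indices of the points marked as anomalous."""
    return [i for i, f in enumerate(flags) if f]


# Hypothetical readings; the expert labels only the large spike (index 8)
# as an anomaly and treats the smaller excursion (index 4) as a real event.
values = [10.0, 10.2, 9.9, 10.1, 14.0, 10.0, 9.8, 10.1, 25.0, 10.0]
labels = [False] * 10
labels[8] = True

print(flagged(detect_unsupervised(values)))        # [4, 8]
print(flagged(detect_supervised(values, labels)))  # [8]
```

For brevity the supervised threshold is calibrated and applied on the same series; in practice calibration and evaluation would use separate data. The toy output only shows how labels can encode expert judgement that a label-free rule cannot see, and says nothing about the performance figures reported in the paper.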




Updated: 2021-10-07