A data-driven method for detecting and diagnosing causes of water quality contamination in a dataset with a high rate of missing values
Engineering Applications of Artificial Intelligence (IF 8), Pub Date: 2020-08-05, DOI: 10.1016/j.engappai.2020.103822
Raymond Houé Ngouna, Romy Ratolojanahary, Kamal Medjaher, Fabien Dauriac, Mathieu Sebilo, Jean Junca-Bourié

The democratization of sensing devices in industrial systems has made it possible to collect large amounts of data of different types, which in turn calls for complex analyses to extract knowledge. Water resources are one of the areas that have drawn the attention of decision-makers seeking to preserve human health and safety. Recent advances in Artificial Intelligence, particularly in Machine Learning, have made it possible to leverage massive data to better address the relationship between water quality and human activities. However, a high rate of missing data and the heterogeneity of the measurements are scientific issues that cannot be solved by standard methods, especially when no prior knowledge of the label of each observation is available. In this article, a Prognostics and Health Management approach was implemented to detect and diagnose anomalies in water quality datasets, taking into account the uncertainties induced by these two issues. Fuzzy c-means was used to identify the different water quality classes, while Random Forest was applied to determine the most influential parameters with respect to potential contamination of water resources in the southwest of France. The results suggest that multiple imputation methods can handle the missingness issue, while decision rules based on well-known water quality standards can compensate for the lack of labelled observations. In addition, two potential sources of contamination (atrazine and nitrate) were identified and then validated by hydrogeology experts, prior to further online deployment of the proposed model.
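
As a rough illustration of the clustering and rule-based labelling steps named in the abstract, the sketch below runs a plain fuzzy c-means on toy data and then names each cluster with a threshold rule. It is not the authors' implementation. The function names `fuzzy_c_means` and `classify`, the toy samples, and the limits (50 mg/L for nitrate and 0.1 µg/L for atrazine, taken from EU drinking-water rules) are assumptions made for this example; the abstract does not say which standards or settings were used. The multiple imputation and Random Forest steps are left out, so the input is assumed to be complete.

```python
import random

# Assumed limits (not stated in the abstract): EU drinking-water values of
# 50 mg/L for nitrate and 0.1 ug/L for a single pesticide such as atrazine.
LIMITS = (50.0, 0.1)


def fuzzy_c_means(points, c=2, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Return (centers, memberships); each membership row sums to 1."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalised per sample.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Centers are means of the samples weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / w_sum
                            for d in range(dim)])
        # Memberships are proportional to d^(-2/(m-1)), d = distance to center.
        shift = 0.0
        for i in range(n):
            d2 = [sum((points[i][d] - ck[d]) ** 2 for d in range(dim))
                  for ck in centers]
            if min(d2) == 0.0:
                # Sample sits exactly on a center: give it full membership.
                hits = [1.0 if x == 0.0 else 0.0 for x in d2]
                new = [h / sum(hits) for h in hits]
            else:
                inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
                new = [v / sum(inv) for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        if shift < tol:
            break
    return centers, u


def classify(samples, limits=LIMITS):
    """Cluster samples, then label each cluster with a threshold rule."""
    # Express every measurement as a fraction of its limit (1.0 = at limit).
    scaled = [[v / lim for v, lim in zip(s, limits)] for s in samples]
    centers, u = fuzzy_c_means(scaled, c=2)
    # Rule: a cluster exceeds the limit if its center does so on any parameter.
    names = ["exceeds limit" if max(ck) > 1.0 else "within limits"
             for ck in centers]
    # Hard label = name of the cluster with the highest membership.
    return [names[row.index(max(row))] for row in u]


# Toy (nitrate mg/L, atrazine ug/L) pairs, invented for illustration only.
samples = [(12, 0.02), (18, 0.03), (15, 0.01), (20, 0.04),
           (65, 0.15), (72, 0.22), (58, 0.18), (80, 0.25)]
for sample, label in zip(samples, classify(samples)):
    print(sample, label)
```

Dividing each parameter by its limit puts nitrate and atrazine on a comparable scale before distances are computed, and it lets the labelling rule reduce to a comparison against 1.0. On real data with missing values, an imputation step would have to run before the clustering.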




Updated: 2020-08-05