Supervised Machine Learning and Heuristic Algorithms for Outlier Detection in Irregular Spatiotemporal Datasets, Journal of Environmental Informatics



Supervised Machine Learning and Heuristic Algorithms for Outlier Detection in Irregular Spatiotemporal Datasets
Journal of Environmental Informatics (IF 6.0), Pub Date: 2018-01-01, DOI: 10.3808/jei.201700375
K. P. Chowdhury

A central problem in time series analysis is the detection of outliers, which is further complicated when the series is irregularly sampled and has spatiotemporal components. This paper presents one heuristic and two supervised machine learning algorithms for detecting outliers in univariate time series data of this kind, and compares their results with the automatic outlier detection methodology of Chen and Liu (1993). Large environmental databases that accept pollutant measurement data from virtually any source have recently been set up across many US states and around the world. To understand and explore the viability of such databases and of the proposed methods, these procedures are applied to measurements of various surface water pollutants in the California Environmental Data Exchange Network (CEDEN). The proposed methodologies, though not as robust, give results similar to those of existing methodologies given the nature of the data, can be far less time-intensive to implement, and provide interesting insights into the database. The algorithms can therefore be used widely, with minimal computing resources and tractable results even on very large datasets. They apply to a wide variety of databases with similar measurement challenges across many disciplines, particularly in environmental settings. In particular, the results could have a large regulatory impact on accepted levels of different pollutants in California water bodies and on the amounts charged for industrial discharge into those water bodies, and they are intended to provide direction for further research and regulatory investment. Based on the results, it seems reasonable to assume that there is further room for including pollutant measurements from nongovernmental agencies in the debate on environmental pollution, specifically in California. However, the results also indicate that more inclusive use of such databases for regulatory matters must be carefully evaluated on a case-by-case basis, so that poorly collected or handled measurements do not swamp those collected with more rigor and make inference on the true population distribution of the pollutants more difficult. This is especially relevant for pollutant measurements that require more delicate sampling procedures.
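The abstract does not describe how the proposed algorithms work, so the following is not the paper's method. It is only a minimal sketch of the general problem: flagging outliers in a univariate series whose observations arrive at irregular times. It assumes a generic robust rule, the modified z-score based on the median and the median absolute deviation (MAD), computed over a time-based window instead of a fixed number of points. The function name `flag_outliers` and the parameters `window`, `threshold`, and `min_neighbors` are hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def flag_outliers(samples, window=timedelta(days=90), threshold=3.5, min_neighbors=4):
    """Flag outliers in an irregularly sampled univariate series.

    Illustrative assumption, not the paper's algorithm: each value is
    compared with the median and MAD of the other values observed within
    `window` of its timestamp (a modified z-score).

    samples: iterable of (datetime, float) pairs, in any order.
    Returns a list of (timestamp, value, is_outlier), sorted by time.
    """
    ordered = sorted(samples)
    result = []
    for i, (t, x) in enumerate(ordered):
        # Neighbors are chosen by elapsed time, not by position, so uneven
        # gaps between observations do not distort the window.
        neighbors = [
            v for j, (s, v) in enumerate(ordered)
            if j != i and abs(s - t) <= window
        ]
        if len(neighbors) < min_neighbors:
            # Too little local context to judge this point.
            result.append((t, x, False))
            continue
        med = median(neighbors)
        mad = median([abs(v - med) for v in neighbors])
        if mad == 0:
            # Score is undefined when the neighbors have zero spread;
            # this sketch conservatively leaves the point unflagged.
            result.append((t, x, False))
            continue
        # 0.6745 rescales the MAD to be comparable to a standard deviation
        # under normality.
        score = 0.6745 * abs(x - med) / mad
        result.append((t, x, score > threshold))
    return result
```

Because the window is defined in elapsed time, sparse stretches of the record simply yield fewer neighbors, and points with too few neighbors are left unflagged. A rule like this ignores the spatial component entirely; handling measurements from different stations would need additional assumptions that the abstract does not supply.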


Updated: 2018-01-01
