A computational study on imputation methods for missing environmental data



arXiv - CS - Databases, Pub Date: 2021-08-21, DOI: arxiv-2108.09500
Paul Dixneuf, Fausto Errico, Mathias Glaus

Data acquisition and recording in the form of databases are routine operations. The collection process, however, may experience irregularities, resulting in databases with missing data. Missing entries can degrade the efficiency of analysis and, consequently, the associated decision-making process. This paper focuses on databases that collect information related to the natural environment. Given the broad spectrum of recorded activities, these databases are typically of mixed type. It is therefore relevant to evaluate the performance of missing-data processing methods with this characteristic in mind. In this paper we investigate the performance of several missing-data imputation methods and their application to missing environmental data. A computational study was performed to compare the missForest (MF) method with two other imputation methods, namely Multivariate Imputation by Chained Equations (MICE) and K-Nearest Neighbors (KNN). Tests were run on 10 pretreated datasets of various types. Results revealed that MF generally outperformed MICE and KNN in terms of imputation error, with a more pronounced gap for mixed-type databases, where MF reduced the imputation error by up to 150% compared to the other methods. KNN was usually the fastest method. MF was then successfully applied to a case study on performance monitoring of Quebec wastewater treatment plants. We believe that the present study demonstrates the pertinence of using MF as an imputation method when dealing with missing environmental data.
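
The evaluation protocol implied by the abstract (hide known values, impute them, measure the error) can be sketched as follows. This is only a minimal illustration, not the authors' code: it implements a simple KNN imputer, not missForest or MICE, and it assumes that "mixed" means numerical plus categorical columns. The Gower-style distance, the two error measures (RMSE for numerical cells, proportion of falsely classified entries for categorical cells), the function names, and the synthetic data are all assumptions made for this sketch.

```python
import math
import random

MISSING = None  # marker for a missing cell (a convention chosen for this sketch)

def mask_at_random(rows, rate, rng):
    """Copy `rows`, hide each cell with probability `rate`, return (masked, hidden positions)."""
    masked = [list(r) for r in rows]
    hidden = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < rate:
                row[j] = MISSING
                hidden.append((i, j))
    return masked, hidden

def mixed_distance(a, b, ranges):
    """Gower-style distance over the columns observed in both rows (assumed metric)."""
    total, used = 0.0, 0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is MISSING or y is MISSING:
            continue
        if isinstance(x, str):           # categorical: 0 if equal, 1 otherwise
            total += 0.0 if x == y else 1.0
        else:                            # numerical: range-scaled absolute difference
            total += abs(x - y) / ranges[j] if ranges[j] else 0.0
        used += 1
    return total / used if used else math.inf

def knn_impute(rows, k=3):
    """Fill each missing cell from the k nearest rows that have that cell observed."""
    n_cols = len(rows[0])
    ranges = []
    for j in range(n_cols):
        vals = [r[j] for r in rows if r[j] is not MISSING and not isinstance(r[j], str)]
        ranges.append(max(vals) - min(vals) if vals else 0.0)
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not MISSING:
                continue
            donors = [(mixed_distance(row, other, ranges), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not MISSING]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [value for _, value in donors[:k]]
            if not nearest:
                continue                 # no comparable donor: leave the cell missing
            if isinstance(nearest[0], str):
                # majority vote; sorted() makes tie-breaking deterministic
                out[i][j] = max(sorted(set(nearest)), key=nearest.count)
            else:
                out[i][j] = sum(nearest) / len(nearest)
    return out

def imputation_errors(truth, imputed, hidden):
    """Return (RMSE over numerical cells, share of wrong categorical cells)."""
    sq, n_num, wrong, n_cat = 0.0, 0, 0, 0
    for i, j in hidden:
        t, p = truth[i][j], imputed[i][j]
        if p is MISSING:
            continue                     # cell could not be imputed; skipped here
        if isinstance(t, str):
            n_cat += 1
            wrong += int(t != p)
        else:
            n_num += 1
            sq += (t - p) ** 2
    rmse = math.sqrt(sq / n_num) if n_num else None
    pfc = wrong / n_cat if n_cat else None
    return rmse, pfc

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic mixed-type rows, generated only for illustration.
    truth = [[rng.gauss(10, 2), rng.gauss(50, 5), rng.choice(["a", "b"])]
             for _ in range(40)]
    masked, hidden = mask_at_random(truth, rate=0.1, rng=rng)
    imputed = knn_impute(masked, k=3)
    print(imputation_errors(truth, imputed, hidden))
```

Comparing methods in this setup would mean swapping `knn_impute` for another imputer while keeping the same masked data and error function, so that the reported errors are directly comparable.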


Updated: 2021-08-24
