Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile
Atmospheric Environment ( IF 5 ) Pub Date : 2019-03-01 , DOI: 10.1016/j.atmosenv.2018.11.053
María Elisa Quinteros , Siyao Lu , Carola Blazquez , Juan Pablo Cárdenas-R , Ximena Ossa , Juana-María Delgado-Saborit , Roy M. Harrison , Pablo Ruiz-Rudolph

Abstract Missing data in air quality datasets is a common problem, but it is much more severe in small cities or localities. This poses a great challenge for environmental epidemiology, as high exposures to pollutants worldwide occur in these settings and gaps in datasets hinder health studies that could later inform local and international policies. Here, we propose the use of imputation methods as a tool to reconstruct air quality datasets and have applied this approach to an air quality dataset in Temuco, a mid-size city in Chile, as a case study. We attempted to reconstruct the dataset by comparing five approaches: mean imputation, conditional mean imputation, K-Nearest Neighbor imputation, multiple imputation and Bayesian Principal Component Analysis imputation. As a base for the imputation methods, linear regression models were fitted for PM2.5 against other air quality and meteorological variables. Methods were challenged against validation sets from which data were artificially removed. Imputation methods were able to reconstruct the dataset with good performance in terms of completeness, errors, and bias, even when challenged against the validation sets. The performance improved when including covariates from a second monitoring station in Temuco. K-Nearest Neighbor imputation showed slightly better performance than multiple imputation for error (25% vs. 27%) and bias (2.1% vs. 3.9%), but presented lower completeness (70% vs. 100%). In summary, our results show that imputation methods can be a useful tool in reconstructing air quality datasets in a real-life situation.
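The validation idea described in the abstract (artificially remove known values, impute them, then score completeness, error, and bias) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the function names (`mean_impute`, `knn_impute`, `score`) are hypothetical, and the metric definitions (RMSE and mean bias, each divided by the mean of the true values) are assumptions, since the abstract does not define them. In this sketch the K-Nearest Neighbor routine leaves a row unfilled when one of its covariates is missing, which is one way completeness can fall below 100%.

```python
import math
import random


def mean_impute(target):
    """Replace each missing (None) value with the mean of the observed values."""
    observed = [v for v in target if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in target]


def knn_impute(target, covariates, k=5):
    """Fill each missing target with the mean target of its k nearest donors.

    Distance is Euclidean over the covariate vectors. Rows that have a
    missing covariate cannot be matched and are left as None.
    """
    donors = [(x, y) for x, y in zip(covariates, target)
              if y is not None and None not in x]
    filled = []
    for x, y in zip(covariates, target):
        if y is not None or None in x:
            filled.append(y)  # already observed, or impossible to match
            continue
        nearest = sorted(donors, key=lambda d: math.dist(d[0], x))[:k]
        filled.append(sum(d[1] for d in nearest) / len(nearest))
    return filled


def score(truth, imputed, held_out):
    """Return (completeness, normalized RMSE, normalized mean bias)."""
    pairs = [(truth[i], imputed[i]) for i in held_out
             if imputed[i] is not None]
    completeness = len(pairs) / len(held_out)
    mean_truth = sum(t for t, _ in pairs) / len(pairs)
    rmse = math.sqrt(sum((p - t) ** 2 for t, p in pairs) / len(pairs))
    bias = sum(p - t for t, p in pairs) / len(pairs)
    return completeness, rmse / mean_truth, bias / mean_truth


# Synthetic data (assumption): PM2.5 depends linearly on temperature and wind.
random.seed(0)
n = 200
covariates = [(random.uniform(0, 20), random.uniform(0, 5)) for _ in range(n)]
pm25 = [80 - 2 * t - 5 * w + random.gauss(0, 3) for t, w in covariates]

# Validation set: hide 40 known PM2.5 values.
held_out = random.sample(range(n), 40)
hidden = set(held_out)
masked = [None if i in hidden else v for i, v in enumerate(pm25)]

# Also drop one covariate on 5 of the held-out rows to mimic sensor gaps.
for i in held_out[:5]:
    covariates[i] = (None, covariates[i][1])

mean_scores = score(pm25, mean_impute(masked), held_out)
knn_scores = score(pm25, knn_impute(masked, covariates), held_out)

for name, (comp, err, bias) in [("mean", mean_scores), ("knn", knn_scores)]:
    print(f"{name}: completeness={comp:.0%} error={err:.1%} bias={bias:+.1%}")
```

The covariates are left unscaled here for brevity. With real monitoring data, standardizing each covariate before computing distances would usually be a sensible first step.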

Updated: 2019-03-01