Using a land use regression model with machine learning to estimate ground level PM2.5

doi:10.1016/j.envpol.2021.116846

Environmental Pollution

Volume 277, 15 May 2021, 116846

https://doi.org/10.1016/j.envpol.2021.116846

Highlights

• Estimating long-term daily PM_2.5 concentration with machine learning models.
• Land-use patterns were included in machine learning models by using land-use regression.
• Explanatory power of daily PM_2.5 concentration was increased from 0.58 to 0.94.
• XGBoost outperformed random forest and deep neural network algorithms.

Abstract

Ambient fine particulate matter (PM_2.5) has been ranked as the sixth leading risk factor globally for death and disability. Modelling methods that rely on a limited number of monitoring stations are required to capture the continuous spatial and temporal variations of PM_2.5 at a sufficient resolution. This study utilized a land use regression (LUR) model with machine learning to assess the spatial-temporal variability of PM_2.5. Daily average PM_2.5 data were collected from 73 fixed air quality monitoring stations that belonged to the Taiwan EPA on the main island of Taiwan. Nearly 280,000 observations from 2006 to 2016 were used for the analysis. Several datasets were collected to determine spatial predictor variables, including the EPA environmental resources dataset, a meteorological dataset, a land-use inventory, a landmark dataset, a digital road network map, a digital terrain model, the MODIS Normalized Difference Vegetation Index (NDVI) database, and a power plant distribution dataset. First, conventional LUR and Hybrid Kriging-LUR were utilized to identify the important predictor variables. Then, deep neural network, random forest, and XGBoost algorithms were used to fit the prediction model based on the variables selected by the LUR models. Data splitting, 10-fold cross validation, external data verification, and seasonal-based and county-based validation methods were used to verify the robustness of the developed models. The results demonstrated that the proposed conventional LUR and Hybrid Kriging-LUR models captured 58% and 89% of PM_2.5 variations, respectively. When the XGBoost algorithm was incorporated, the explanatory power of the models increased to 73% and 94%, respectively. The Hybrid Kriging-LUR with XGBoost algorithm outperformed the other integrated methods. This study demonstrates the value of combining a Hybrid Kriging-LUR model with an XGBoost algorithm for estimating the spatial-temporal variability of PM_2.5 exposures.
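The validation schemes named in the abstract can be illustrated with a short sketch. The excerpt does not say how the folds or county groups were built, so the code below only shows one common reading: shuffled 10-fold splits, plus a hold-one-group-out split that could be applied to counties or seasons. The function names, the seed, and the example labels are assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out one group at a time.

    `groups` holds one label per observation, e.g. a county or a season
    (hypothetical labels; the excerpt does not describe the grouping).
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    for label, test_idx in by_group.items():
        train_idx = [i for i, g in enumerate(groups) if g != label]
        yield label, train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(25, k=10):
        print(len(train_idx), len(test_idx))
    for label, train_idx, test_idx in leave_one_group_out(["A", "A", "B", "C"]):
        print(label, train_idx, test_idx)
```

Holding out whole groups is a stricter check than random folds, because the model is scored on places or periods it has never seen rather than on days interleaved with the training data.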

Introduction

The health effects of ambient fine particulate matter (PM) have been broadly investigated worldwide for decades and PM ranked as the sixth leading risk factor globally for death and disability (Forouzanfar et al., 2016). It is well documented that PM_2.5 (PM with an aerodynamic diameter < 2.5 μm) is associated with various short-term and long-term adverse health effects, including increased risk of respiratory diseases, cardiovascular mortality, type 2 diabetes mellitus, hypertensive disorders of pregnancy, neurological hospital admissions, and premature mortality (Bai et al., 2020; Kioumourtzoglou et al., 2016; Pope III & Dockery, 2006). As such, a more precise method to assess PM_2.5 exposures is needed in environmental epidemiological studies. Some prior studies estimated PM_2.5 concentrations by applying data from sparse monitoring stations. This method is not feasible for large cohort studies, particularly those focused on rural areas. Hence, environmental modelling methods need to be developed in order to capture PM_2.5 spatial and temporal continuous variations at a sufficient resolution based on limited monitoring stations.

Previous studies have used aerosol optical depth (AOD) to retrieve ground-level PM estimations or have used it as a predictor variable for model improvement (Di et al., 2019; Shtein et al., 2019; Stafoggia et al., 2019). Although satellite-derived images can provide AOD properties that cover almost the entire global surface at a moderate spatial resolution, these remote sensing images are easily affected by cloudy weather and water/snow glint reflectance (Sayer et al., 2013; Stafoggia et al., 2019). Because of this limitation, AOD products have commonly been applied to continents rather than to islands or areas with cloudy weather conditions (Wu et al., 2018). For instance, AOD data for Taiwan, an island, can be missing at a rate higher than 70%. Therefore, an alternative method to estimate PM should be developed for areas such as Taiwan where too much AOD data may be missing.

Land-use regression (LUR) has been used worldwide to depict intra-urban variation in air pollution concentrations at fine spatial-temporal resolution because it can consider different types of land-use variables when assessing target pollutants (Beelen et al., 2013; Eeftens et al., 2012; Wu et al., 2017; Young et al., 2016). LUR models use a set of geographic data sources as predictor variables in multiple linear regressions to estimate air pollution concentrations. A more detailed land-use predictor dataset offers the potential to obtain an LUR model with greater explanatory power, so the land-use/land cover dataset can play an important role in model performance (Beelen et al., 2013; Eeftens et al., 2012; Hellack et al., 2017). The strength of LUR is that the important predictor variables affecting air pollution concentrations can be identified by a stepwise variable selection procedure, thereby reducing the dimension of the predictor variables in the resultant models. Recently, more research in air pollution modelling has used non-linear statistical methods, such as generalized additive models (Li et al., 2013; Yang et al., 2018; Zou et al., 2017), and machine learning algorithms (Chen et al., 2019; Di et al., 2019; Shtein et al., 2019; Stafoggia et al., 2019; Weichenthal et al., 2016) to improve estimation accuracy. Among them, machine learning models, which can capture non-linear relationships between predictor variables and air pollution concentrations, improve prediction performance more than traditional regression approaches do. Araki et al. (2018) used random forest (RF) to estimate metropolitan NO₂ variation in Japan, and RF outperformed traditional regression approaches.
Another study developed multiple modelling approaches to estimate air pollution concentrations in Europe and also concluded that RF and artificial neural networks performed better than linear regression methods (Chen et al., 2019). A limited number of studies have used extreme gradient boosting (XGBoost) algorithms to estimate air pollution concentrations. Offering the ability to enable slow learners to investigate the bias of a loss function, XGBoost has the potential to outperform some tree-based machine learning algorithms (Chen and Guestrin, 2016). However, one shortcoming of machine learning models is that most of them are unable to select important variables before training. Training estimation models with an enormous number of predictor variables may result in overfitting and reduced computational efficiency. Without a variable selection procedure, the predictors used in these algorithms may have limited interpretability and offer limited reasonable explanations for variations in air pollution. Thus, it is important to use a variable selection method before training a machine learning algorithm.
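The boosting idea mentioned above, in which many small learners are added one after another and each is fitted to the error left by the ensemble so far, can be sketched in a few lines. The code below is plain gradient boosting for squared-error loss with one-split regression stumps on a single feature. It is not XGBoost itself, which adds regularization and second-order gradient information, and the function names and default settings are illustrative assumptions.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump on a single feature by least squares."""
    mean_all = sum(residuals) / len(residuals)
    values = sorted(set(x))
    # Fallback when the feature has a single value: predict the mean everywhere.
    best = (float("inf"), values[0], mean_all, mean_all)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_rounds=100, learning_rate=0.1):
    """Build an additive model of stumps, each fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual y - pred.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        pred = [pi + learning_rate * (left_mean if xi <= threshold else right_mean)
                for xi, pi in zip(x, pred)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, xi):
    """Sum the shrunken stump outputs on top of the base prediction."""
    total = model["base"]
    for threshold, left_mean, right_mean in model["stumps"]:
        total += model["learning_rate"] * (left_mean if xi <= threshold else right_mean)
    return total
```

The small learning rate means each stump corrects only a fraction of the remaining error, which is why many rounds are needed and why a non-linear response can be approximated by very simple pieces.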

Given that the shortcoming of machine learning models in selecting proper predictor variables can be solved by applying LUR, this study aims to use an integrated approach combining LUR and machine learning models to improve the daily concentration estimation of PM_2.5 during 2006–2016 on the main island of Taiwan. In detail, important predictor variables identified by the stepwise variable selection from the LUR procedures were applied to develop the prediction models using three types of machine learning algorithms, namely deep neural network (DNN), RF, and XGBoost, for improving the accuracy of PM_2.5 variation predictions. The mixed spatial prediction models that combine the strength of LUR in identifying the most influential emission predictors and the predictability of machine learning in estimating non-linear trends would be more broadly effective than techniques that rely solely on LUR or machine learning.
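The variable selection step of this approach can be sketched as a greedy forward search over candidate predictors. The excerpt does not give the entry or removal criteria of the stepwise procedure, so the version below simply adds the predictor that raises the R² of an ordinary least-squares fit the most and stops when the gain falls under a hypothetical threshold (`min_gain`). The helper names are illustrative assumptions, and collinear predictors are not handled.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R² of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] * n] + columns
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, design)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_select(predictors, y, min_gain=0.01):
    """Greedily add the predictor with the largest R² gain until gains are small."""
    selected, best_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {name: r_squared([predictors[s] for s in selected + [name]], y)
                  for name in remaining}
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        selected.append(name)
        remaining.remove(name)
        best_r2 = scores[name]
    return selected, best_r2
```

The names returned by `forward_select` would then define the reduced feature set passed to a DNN, RF, or XGBoost model, so the learner never sees the discarded candidates.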

Study area

Taiwan is an island located southeast of China, across the Taiwan Strait. The area of Taiwan is 36,193 km², and the country consists of 14 counties and 368 townships. The total population of Taiwan is 23.5 million people and 22 million motorcycles and cars are registered on the island (MOTC, 2020). For decades heavy traffic has given rise to concerns about air pollution on this densely populated island. The topography of Taiwan is characterized by a great plain in the western portion and major

Measured PM_2.5 concentrations for the study period

Fig. 3 illustrates the observed daily averages of PM_2.5 concentrations in Taiwan from 2006 to 2016. Seasonal variations were obtained from the daily trend for each year. The highest PM_2.5 concentrations appeared in winter, while the lowest appeared in summer (Fig. S1). The annual mean PM_2.5 concentration (30.47 μg/m³) was higher than the air quality standard in Taiwan (15 μg/m³). The highest levels of PM_2.5 concentration were observed at traffic stations while the

Discussion

Findings of this study demonstrate that the Hybrid Kriging-LUR integrated with XGBoost had the best performance (R² = 0.94, RMSE = 4.41 μg/m³) among the proposed models. Compared with conventional LUR, the best model increased explanatory power by 36 percentage points, with R² rising from 0.58 to 0.94. It is worth noting that combining Kriging interpolation, LUR, and machine learning methods can improve model efficiency by reducing the variable dimension while also increasing explanatory power.
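For readers who want to reproduce the two reported metrics on their own predictions, a minimal sketch follows. It uses the standard definitions of R² and RMSE. The observed and predicted values are made-up numbers for illustration and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical daily PM2.5 values in μg/m³, for illustration only.
    observed = [10.0, 20.0, 30.0, 40.0]
    predicted = [12.0, 18.0, 33.0, 37.0]
    print(round(r2(observed, predicted), 3), round(rmse(observed, predicted), 3))
```

On the reported figures, the change from R² = 0.58 to R² = 0.94 is an absolute gain of 0.36, which corresponds to a relative gain of about 62% over the conventional LUR value.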

Conclusion

This study utilizes conventional LUR, Hybrid Kriging-LUR, and two LUR models incorporated with deep neural network, random forest, and XGBoost machine learning algorithms for PM_2.5 variation estimation in Taiwan. After comparing the eight models, the results showed that the proposed method incorporating Hybrid Kriging-LUR and XGBoost machine learning algorithms estimates PM_2.5 concentrations more accurately in Taiwan than do other models. Moreover, utilizing a stepwise variable selection

Declaration of competing interest

The authors declare no conflict of interest.

Acknowledgements

This study is funded by the Ministry of Science and Technology, R.O.C. (MOST 108-2621-M-006-017 -) and Academia Sinica, Taiwan, under “Trans-disciplinary PM_2.5 Exposure Research in Urban Areas for Health-oriented Preventive Strategies (II)”, Project No.: AS–SS–110-02. The authors are grateful to the National Aeronautics and Space Administration (NASA) and to the U.S. Geological Survey (USGS) for data.

References (52)

S. Araki et al., Spatiotemporal land use random forest model for estimating metropolitan NO2 exposure in Japan, Sci. Total Environ. (2018)
W. Bai et al., Association between ambient air pollution and pregnancy complications: a systematic review and meta-analysis of cohort studies, Environ. Res. (2020)
R. Beelen et al., Development of NO2 and NOx land use regression models for estimating air pollution exposure in 36 study areas in Europe–The ESCAPE project, Atmos. Environ. (2013)
G. Chen et al., A machine learning method to estimate PM2.5 concentrations across China with remote sensing, meteorological and land use information, Sci. Total Environ. (2018)
J. Chen et al., A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide, Environ. Int. (2019)
T.H. Chen et al., A hybrid kriging/land-use regression model with Asian culture-specific sources to assess NO2 spatial-temporal variations, Environ. Pollut. (2020)
Q. Di et al., An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution, Environ. Int. (2019)
B. Friedman et al., SOA and gas phase organic acid yields from the sequential photooxidation of seven monoterpenes, Atmos. Environ. (2018)
B. Hellack et al., Land use regression modeling of oxidative potential of fine particles, NO2, PM2.5 mass and association to type two diabetes mellitus, Atmos. Environ. (2017)
L. Huang et al., Development of land use regression models for PM2.5, SO2, NO2 and O3 in Nanjing, China, Environ. Res. (2017)
J. Kammer et al., Observation of nighttime new particle formation over the French Landes forest, Sci. Total Environ. (2018)
K. Kim et al., A review on the human health impact of airborne particulate matter, Environ. Int. (2015)
S.C. Lee et al., Characteristics of emissions of air pollutants from burning of incense in a large environmental chamber, Atmos. Environ. (2004)
L. Li et al., Estimating spatiotemporal variability of ambient air pollutant concentrations with a hierarchical model, Atmos. Environ. (2013)
C.Y. Lin et al., Long-range transport of aerosols and their impact on the air quality of Taiwan, Atmos. Environ. (2005)
D.J. Nowak et al., Air pollution removal by urban trees and shrubs in the United States, Urban For. Urban Green. (2006)
A. Sæbø et al., Plant species differences in particulate matter accumulation on leaf surfaces, Sci. Total Environ. (2012)
B. Srimuruganandam et al., Source characterization of PM10 and PM2.5 mass using a chemical mass balance model at urban roadside, Sci. Total Environ. (2012)
M. Stafoggia et al., Estimation of daily PM10 and PM2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model, Environ. Int. (2019)
S. Weichenthal et al., A land use regression model for ambient ultrafine particles in Montreal, Canada: a comparison of linear regression and a machine learning approach, Environ. Res. (2016)
C.D. Wu et al., Land-use regression with long-term satellite-based greenness index and culture-specific sources to model PM2.5 spatial-temporal variability, Environ. Pollut. (2017)
C.D. Wu et al., A hybrid kriging/land-use regression model to assess PM2.5 spatial-temporal variability, Sci. Total Environ. (2018)
H. Xu et al., National PM2.5 and NO2 exposure models for China based on land use regression, satellite measurements, and universal kriging, Sci. Total Environ. (2019)
S. Yin et al., Quantifying air pollution attenuation within urban parks: an experimental approach in Shanghai, China, Environ. Pollut. (2011)
K. Yu et al., Indoor air pollution from gas cooking in five Taiwanese families, Build. Environ. (2015)
Y. Zhan et al., Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm, Atmos. Environ. (2017)


^☆: This paper has been recommended for acceptance by Pavlos Kassomenos.
