Using a land use regression model with machine learning to estimate ground level PM2.5

https://doi.org/10.1016/j.envpol.2021.116846

Highlights

  • Estimating long-term daily PM2.5 concentration with machine learning models.

  • Land-use patterns were included in machine learning models by using land-use regression.

  • Explanatory power of daily PM2.5 concentration was increased from 0.58 to 0.94.

  • XGBoost outperformed random forest and deep neural network algorithms.

Abstract

Ambient fine particulate matter (PM2.5) has been ranked as the sixth leading risk factor globally for death and disability. Modelling methods that rely on a limited number of monitoring stations are required to capture continuous spatial and temporal variations in PM2.5 at sufficient resolution. This study utilized a land use regression (LUR) model with machine learning to assess the spatial-temporal variability of PM2.5. Daily average PM2.5 data were collected from 73 fixed air quality monitoring stations that belonged to the Taiwan EPA on the main island of Taiwan. Nearly 280,000 observations from 2006 to 2016 were used for the analysis. Several datasets were collected to determine spatial predictor variables, including the EPA environmental resources dataset, a meteorological dataset, a land-use inventory, a landmark dataset, a digital road network map, a digital terrain model, the MODIS Normalized Difference Vegetation Index (NDVI) database, and a power plant distribution dataset. First, conventional LUR and Hybrid Kriging-LUR were utilized to identify the important predictor variables. Then, deep neural network, random forest, and XGBoost algorithms were used to fit the prediction model based on the variables selected by the LUR models. Data splitting, 10-fold cross validation, external data verification, and seasonal-based and county-based validation methods were used to verify the robustness of the developed models. The results demonstrated that the proposed conventional LUR and Hybrid Kriging-LUR models captured 58% and 89% of PM2.5 variations, respectively. When the XGBoost algorithm was incorporated, the explanatory power of the models increased to 73% and 94%, respectively. The Hybrid Kriging-LUR with XGBoost algorithm outperformed the other integrated methods. This study demonstrates the value of combining a Hybrid Kriging-LUR model with an XGBoost algorithm for estimating the spatial-temporal variability of PM2.5 exposures.

Introduction

The health effects of ambient fine particulate matter (PM) have been broadly investigated worldwide for decades and PM ranked as the sixth leading risk factor globally for death and disability (Forouzanfar et al., 2016). It is well documented that PM2.5 (PM with an aerodynamic diameter < 2.5 μm) is associated with various short-term and long-term adverse health effects, including increased risk of respiratory diseases, cardiovascular mortality, type 2 diabetes mellitus, hypertensive disorders of pregnancy, neurological hospital admissions, and premature mortality (Bai et al., 2020; Kioumourtzoglou et al., 2016; Pope III & Dockery, 2006). As such, a more precise method to assess PM2.5 exposures is needed in environmental epidemiological studies. Some prior studies estimated PM2.5 concentrations by applying data from sparse monitoring stations. This method is not feasible for large cohort studies, particularly those focused on rural areas. Hence, environmental modelling methods need to be developed in order to capture PM2.5 spatial and temporal continuous variations at a sufficient resolution based on limited monitoring stations.

Previous studies have used aerosol optical depth (AOD) to retrieve ground-level PM estimations or have used it as a predictor variable for model improvement (Di et al., 2019; Shtein et al., 2019; Stafoggia et al., 2019). Although satellite-derived images can provide AOD properties that cover almost the entire global surface at a moderate spatial resolution, these remote sensing images are easily affected by cloudy weather and water/snow glint reflectance (Sayer et al., 2013; Stafoggia et al., 2019). Because of this limitation, AOD products have commonly been applied to continents rather than to islands or areas with cloudy weather conditions (Wu et al., 2018). For instance, in Taiwan, which is an island, AOD data can be missing at a rate higher than 70%. Therefore, an alternative method to estimate PM should be developed for areas such as Taiwan where too much AOD data may be missing.

Land-Use Regression (LUR) has been used worldwide to depict intra-urban variation in air pollution concentration at fine spatial-temporal resolution because of its ability to consider different types of land-use variables in assessing target pollutants (Beelen et al., 2013; Eeftens et al., 2012; Wu et al., 2017; Young et al., 2016). LUR models use a set of geographic sources as predictor variables to develop multiple linear regressions in order to estimate air pollution concentrations. A more detailed land-use predictor dataset offers the potential to obtain a LUR model with greater explanatory power. Thus, the land-use/land-cover dataset can play an important role in model performance (Beelen et al., 2013; Eeftens et al., 2012; Hellack et al., 2017). The strength of LUR is that important predictor variables that affect air pollution concentration can be identified by a stepwise variable selection procedure, thereby reducing the dimension of predictor variables in the resultant models. Recently, more research in the field of air pollution modelling has utilized non-linear statistical methods, such as generalized additive models (Li et al., 2013; Yang et al., 2018; Zou et al., 2017), and machine learning algorithms (Chen et al., 2019; Di et al., 2019; Shtein et al., 2019; Stafoggia et al., 2019; Weichenthal et al., 2016) to improve estimation accuracy. Among them, machine learning models, with their ability to capture non-linear relationships between predictor variables and air pollution concentrations, improve a model's prediction performance more than traditional regression approaches do. Araki et al. (2018) used random forest (RF) to estimate metropolitan NO2 variation in Japan, and RF outperformed traditional regression approaches. Another study developed multiple modelling approaches to estimate air pollution concentration in Europe, and it also concluded that the results of RF and artificial neural networks were better than those of linear regression methods (Chen et al., 2019). A limited number of studies have used extreme gradient boosting (XGBoost) algorithms to estimate air pollution concentration. Because it builds an ensemble of weak learners sequentially, each one fitted to reduce the remaining error of a loss function, XGBoost has the potential to outperform some other tree-based machine learning algorithms (Chen and Guestrin, 2016). However, one of the shortcomings of machine learning models is that most of them are unable to select important variables before training each algorithm. Training estimation models with an enormous number of predictor variables may lead to overfitting and reduced computational efficiency. Without a variable selection procedure, the predictors used in these algorithms may have limited interpretability and may not offer reasonable explanations for variations in air pollution. Thus, it is important to use a variable selection method before training a machine learning algorithm.
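
The stepwise variable selection idea can be illustrated with a minimal sketch. The Python code below is not the authors' procedure: it assumes a simple forward selection rule that adds the predictor giving the largest gain in R2 and stops once the gain falls below a hypothetical threshold, `min_gain`. The function names, the threshold, and the toy data are all illustrative assumptions, and LUR studies often apply further criteria (for example, expected coefficient signs) that are omitted here. Only the standard library is used so that the sketch runs anywhere.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R2 of an ordinary least squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(predictors, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the predictor with the largest R2 gain."""
    selected, best_r2 = [], 0.0
    remaining = set(predictors)
    while remaining:
        scores = {name: r_squared([predictors[s] for s in selected + [name]], y)
                  for name in remaining}
        name = max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
        if scores[name] - best_r2 < min_gain:       # hypothetical stopping rule
            break
        selected.append(name)
        best_r2 = scores[name]
        remaining.remove(name)
    return selected, best_r2


# Synthetic toy values (not data from the study); predictor names are hypothetical.
predictors = {
    "road_density": [1, 2, 3, 4, 5, 6, 7, 8],
    "ndvi": [0.8, 0.3, 0.6, 0.2, 0.7, 0.1, 0.5, 0.4],
    "irrelevant": [1, -1, 1, -1, -1, 1, -1, 1],
}
y = [20 + 3 * r - 10 * v for r, v in zip(predictors["road_density"], predictors["ndvi"])]

selected, r2 = forward_stepwise(predictors, y)
print(selected, round(r2, 3))
```

In this toy setup the response is built from two of the three predictors, so the selection keeps those two and drops the third. Only the retained variables would then be passed on to a machine learning model.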

Given that the shortcoming of machine learning models in selecting proper predictor variables can be solved by applying LUR, this study aims to use an integrated approach combining LUR and machine learning models to improve the daily concentration estimation of PM2.5 during 2006–2016 on the main island of Taiwan. In detail, important predictor variables identified by the stepwise variable selection from the LUR procedures were applied to develop the prediction models using three types of machine learning algorithms, namely deep neural network (DNN), RF, and XGBoost, to improve the accuracy of PM2.5 variation predictions. The mixed spatial prediction models, which combine the strength of LUR in identifying the most influential emission predictors with the ability of machine learning to estimate non-linear trends, would be more broadly effective than techniques that rely solely on LUR or machine learning.
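
To make the second stage more concrete, here is a minimal sketch of gradient boosting for squared error using depth-one regression trees (stumps), trained on a small set of already selected predictors. It is a simplified stand-in for the boosting idea and not XGBoost itself, which adds regularization, second-order gradient information, and many engineering optimizations. It is also not the authors' configuration: `n_rounds`, `learning_rate`, the function names, and the synthetic data are illustrative assumptions.

```python
import math


def fit_stump(rows, residuals):
    """Return (feature, threshold, left_mean, right_mean) of the best single split."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [e for r, e in zip(rows, residuals) if r[j] <= thr]
            right = [e for r, e in zip(rows, residuals) if r[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((e - lm) ** 2 for e in left) + sum((e - rm) ** 2 for e in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]


def stump_value(stump, row):
    j, thr, lm, rm = stump
    return lm if row[j] <= thr else rm


def boost(rows, y, n_rounds=500, learning_rate=0.1):
    """Each round fits a stump to the current residuals, which are the
    negative gradient of the squared-error loss, and adds a shrunken copy."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump_value(stump, r) for p, r in zip(pred, rows)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, row) for s in stumps)


def rmse(observed, predicted):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Synthetic, non-linear toy data (not from the study): columns are two
# hypothetical selected predictors, a road density and an NDVI value.
rows = [(1, 0.8), (2, 0.3), (3, 0.6), (4, 0.2), (5, 0.7), (6, 0.1), (7, 0.5), (8, 0.4)]
y = [10 + 0.5 * road ** 2 - 10 * ndvi for road, ndvi in rows]

model = boost(rows, y)
baseline_rmse = rmse(y, [sum(y) / len(y)] * len(y))       # predicting the mean
train_rmse = rmse(y, [predict(model, r) for r in rows])   # boosted ensemble
print(round(baseline_rmse, 2), round(train_rmse, 4))
```

The training error shown here only demonstrates that the ensemble can follow a non-linear pattern. Judging a real model requires held-out data, as in the cross validation and external verification the study describes.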

Section snippets

Study area

Taiwan is an island located southeast of China, across the Taiwan Strait. The area of Taiwan is 36,193 km2, and the country consists of 14 counties and 368 townships. The total population of Taiwan is 23.5 million people and 22 million motorcycles and cars are registered on the island (MOTC, 2020). For decades heavy traffic has given rise to concerns about air pollution on this densely populated island. The topography of Taiwan is characterized by a great plain in the western portion and major

Measured PM2.5 concentrations for the study period

Fig. 3 illustrates the observed daily averages of PM2.5 concentrations in Taiwan from 2006 to 2016. Seasonal variations were obtained from the daily trend for each year. The highest level of PM2.5 concentrations appeared during winters, while the lowest levels appeared during summers (Fig. S1). The annual mean PM2.5 concentration (30.47 μg/m3) was higher than the air quality standard in Taiwan (15 μg/m3). The highest levels of PM2.5 concentration were observed at traffic stations while the

Discussion

The findings of this study demonstrate that the Hybrid Kriging-LUR integrated with XGBoost had the best performance (R2 = 0.94, RMSE = 4.41 μg/m3) among the proposed models. Compared with conventional LUR, the best model increased explanatory power by 36 percentage points, with R2 rising from 0.58 to 0.94. It is worth noting that combining Kriging interpolation, LUR, and machine learning methods can improve model efficiency by reducing the dimension of the predictor variables, as well as increasing statistical performance.
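
For readers less familiar with the two metrics quoted above, the following sketch shows how R2 and RMSE are conventionally computed from observed and predicted concentrations, and checks the arithmetic behind the 0.58 to 0.94 comparison. The function names and the sample concentrations are illustrative assumptions, not data from the study.

```python
import math


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error, in the units of the observations (here ug/m3)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Hypothetical daily PM2.5 values (ug/m3), for illustration only.
observed = [12.0, 30.0, 45.0, 22.0, 18.0]
predicted = [14.0, 28.0, 41.0, 25.0, 17.0]
print(round(r2_score(observed, predicted), 3), round(rmse(observed, predicted), 3))

# R2 values reported in the text: 0.58 (conventional LUR) and 0.94 (best model).
gain_points = round((0.94 - 0.58) * 100)   # absolute gain, in percentage points
relative_gain = (0.94 - 0.58) / 0.58       # gain relative to the 0.58 baseline
print(gain_points, round(relative_gain, 2))
```

The absolute gain is 36 percentage points, whereas the gain relative to the conventional LUR baseline is roughly 62%, so the two ways of reporting an improvement should not be confused.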

Conclusion

This study utilizes conventional LUR, Hybrid Kriging-LUR, and two LUR models incorporated with deep neural network, random forest, and XGBoost machine learning algorithms for PM2.5 variation estimation in Taiwan. After comparing the eight models, the results showed that the proposed method incorporating Hybrid Kriging-LUR and XGBoost machine learning algorithms estimates PM2.5 concentrations more accurately in Taiwan than do other models. Moreover, utilizing a stepwise variable selection

Declaration of competing interest

The authors declare no conflict of interest.

Acknowledgements

This study is funded by the Ministry of Science and Technology, R.O.C. (MOST 108-2621-M-006-017) and Academia Sinica, Taiwan, under “Trans-disciplinary PM2.5 Exposure Research in Urban Areas for Health-oriented Preventive Strategies (II)”, Project No.: AS-SS-110-02. The authors are grateful to the National Aeronautics and Space Administration (NASA) and to the U.S. Geological Survey (USGS) for data.

References (52)

  • J. Kammer et al. Observation of nighttime new particle formation over the French Landes forest. Sci. Total Environ. (2018)
  • K. Kim et al. A review on the human health impact of airborne particulate matter. Environ. Int. (2015)
  • S.C. Lee et al. Characteristics of emissions of air pollutants from burning of incense in a large environmental chamber. Atmos. Environ. (2004)
  • L. Li et al. Estimating spatiotemporal variability of ambient air pollutant concentrations with a hierarchical model. Atmos. Environ. (2013)
  • C.Y. Lin et al. Long-range transport of aerosols and their impact on the air quality of Taiwan. Atmos. Environ. (2005)
  • D.J. Nowak et al. Air pollution removal by urban trees and shrubs in the United States. Urban For. Urban Green. (2006)
  • A. Sæbø et al. Plant species differences in particulate matter accumulation on leaf surfaces. Sci. Total Environ. (2012)
  • B. Srimuruganandam et al. Source characterization of PM10 and PM2.5 mass using a chemical mass balance model at urban roadside. Sci. Total Environ. (2012)
  • M. Stafoggia et al. Estimation of daily PM10 and PM2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int. (2019)
  • S. Weichenthal et al. A land use regression model for ambient ultrafine particles in Montreal, Canada: a comparison of linear regression and a machine learning approach. Environ. Res. (2016)
  • C.D. Wu et al. Land-use regression with long-term satellite-based greenness index and culture-specific sources to model PM2.5 spatial-temporal variability. Environ. Pollut. (2017)
  • C.D. Wu et al. A hybrid kriging/land-use regression model to assess PM2.5 spatial-temporal variability. Sci. Total Environ. (2018)
  • H. Xu et al. National PM2.5 and NO2 exposure models for China based on land use regression, satellite measurements, and universal kriging. Sci. Total Environ. (2019)
  • S. Yin et al. Quantifying air pollution attenuation within urban parks: an experimental approach in Shanghai, China. Environ. Pollut. (2011)
  • K. Yu et al. Indoor air pollution from gas cooking in five Taiwanese families. Build. Environ. (2015)
  • Y. Zhan et al. Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmos. Environ. (2017)

This paper has been recommended for acceptance by Pavlos Kassomenos.