Applicability of machine learning in modeling of atmospheric particle pollution in Bangladesh

Shahriar, Shihab Ahmad; Kayes, Imrul; Hasan, Kamrul; Salam, Mohammed Abdus; Chowdhury, Shawan

doi:10.1007/s11869-020-00878-8

Published: 20 July 2020

Air Quality, Atmosphere & Health, Volume 13, pages 1247–1256 (2020)

Abstract

Atmospheric particle pollution causes acute and chronic health effects. Predicting the concentrations of PM₂.₅ and PM₁₀ is therefore a prerequisite for avoiding these consequences and mitigating their complications. This research used machine learning (ML) models, namely linear support vector machine (L-SVM), medium Gaussian support vector machine (M-SVM), Gaussian process regression (GPR), artificial neural network (ANN), and random forest regression (RFR), together with a time series model, PROPHET. Atmospheric NOₓ, SO₂, CO, and O₃, along with meteorological variables from Dhaka, Chattogram, Rajshahi, and Sylhet for the period 2013 to 2019, were used as explanatory variables. Results showed that GPR performed best overall, particularly for Dhaka, in predicting the concentrations of both PM₂.₅ and PM₁₀, while ANN performed best for Chattogram and Sylhet in predicting PM₂.₅. For PM₁₀ in these two cities, however, M-SVM and RFR were selected, respectively. This study therefore recommends “ensemble learning” models that combine several of the best models to advance the application of ML in predicting pollutant concentrations in Bangladesh.

Introduction

Atmospheric particulate matter (PM) pollution, particularly PM₂.₅ and PM₁₀, poses a severe and growing threat to global public health (Orioli et al. 2018). Exposure to high concentrations of PM is strongly associated with health hazards such as respiratory diseases, cancer, and cardiovascular disease (Kim et al. 2015). In a clinical meta-analysis, Kim et al. (2015) revealed that about 3% of cardiopulmonary and 5% of lung cancer deaths globally are attributable to PM exposure. The study also argued that PM in the atmosphere poses a greater threat to public health than other ambient air pollutants. Moreover, a recent study revealed that an increase of 1 μg m⁻³ in PM₂.₅ could raise the death rate of coronavirus disease 2019 (COVID-19) by 15% (Wu et al. 2020). Numerous scientific studies have thus provided strong evidence of the association between health hazards and PM concentration, which arises mostly from the size and composition of the particles. Both particle fractions are made up of other subclasses of pollutants, the major ones being water-soluble ions, i.e., sulfates, nitrates, and ammonium, with minor constituents such as metal ions, organic and elemental carbon, and volatile organics. They can be emitted into the air from natural or anthropogenic sources or formed secondarily in the atmosphere (Lu et al. 2016).

Since air quality is vital for health and the environment, it is essential to establish proper control mechanisms. Pollution modeling can act as a preliminary step toward such mechanisms (Salnikov and Karatayev 2011). Generally, atmospheric pollution modeling (APM) is defined as a numerical tool that describes the causal relationships among emissions, meteorology, atmospheric concentrations, deposition, and other factors (Daly and Zannetti 2007). APM techniques are mainly categorized into three types, i.e., physical models, dispersion models, and machine learning models (Sportisse 2007). Other models are also broadly implemented in atmospheric sciences, i.e., Gaussian models, Lagrangian models, and Eulerian models. Commonly, these models work on the basis of continuous emission records and conservation of mass (Gaussian models), the trajectory of air parcels and wind data (Lagrangian models), and gridded atmospheric properties (Eulerian models). Apart from those, prognostic models known as chemical transport models (CTMs) (e.g., GEOS-Chem, CMAQ, WRF-Chem) process the emission, transport, and chemical conversion of trace gases and aerosols simultaneously with meteorological parameters. These models incorporate atmospheric science and multi-processing computational approaches, including real-time updated emission inventory inputs and meteorological records (Daly and Zannetti 2007). Unfortunately, their application is limited by complexities in geophysical characteristics, i.e., land use and terrain (Jiménez and Dudhia 2013). Several recent studies have also argued that traditional deterministic models struggle to capture the non-linearity among pollutant concentrations, meteorology, land use, and emission and dispersion sources (Shimadera et al. 2016; Chen et al. 2017). Machine learning algorithms seem promising for addressing these limitations (Rybarczyk and Zalakeviciute 2018).

Traditional statistical approaches are limited to describing variables in terms of probability and statistical averages. In contrast, machine learning models such as artificial neural network (ANN), support vector machine (SVM), and random forest regression (RFR) have become some of the most popular methods for handling non-linear uncertainties and trends and achieving better forecasting accuracy (Joharestani et al. 2019). Although these models do not explicitly simulate environmental processes, they generally exhibit better predictive performance than CTMs on spatiotemporal scales when extensive monitoring records exist (Marshall et al. 2008). Several studies have been conducted in different countries to evaluate the performance of machine learning models in air quality modeling and forecasting (Kang et al. 2018). However, based on the relevant literature, the use of machine learning in air pollution modeling has been limited in Bangladesh, although multiple studies have investigated particulate pollution there (Begum et al. 2011; Begum and Hopke 2018). The most widely used statistical technique for forecasting air quality in Bangladesh has been seasonal ARIMA (Islam et al. 2020). Considering these observations, this study aims to investigate the application of machine learning models, i.e., ANN, L-SVM, M-SVM, Gaussian process regression (GPR), and RFR, together with a time series model, PROPHET, to particle pollution modeling in four metropolitan cities in Bangladesh. Among them, ANN, L-SVM, and RFR have been used in many air pollution studies across the world, whereas only limited work has investigated GPR and M-SVM for pollution modeling (Rybarczyk and Zalakeviciute 2018). PROPHET, a new additive time series model developed by Facebook’s Core Data Science Team, was also implemented in this study for comparison with the machine learning models. The rationale for using this model was its ability to forecast non-linear trends with yearly and weekly seasonality and holiday effects. In addition, the study demonstrates the regional relationships of PM₂.₅ and PM₁₀ concentrations with meteorological parameters and other air pollutants, i.e., nitrogen oxides (NOₓ), sulfur dioxide (SO₂), carbon monoxide (CO), and ozone (O₃), which are later used as explanatory variables in the models.

Methodology

Air monitoring stations and data

This study used 24-h air quality (PM₂.₅, PM₁₀, NOₓ, SO₂, CO, O₃) and meteorological data (mean temperature, rainfall, relative humidity, barometric pressure, wind speed, wind direction, and solar radiation) from four air monitoring stations, i.e., Dhaka, Chattogram, Rajshahi, and Sylhet, provided by the Department of Environment (DoE), Government of Bangladesh. Among them, Dhaka and Chattogram are ranked 19th and 76th in the world, respectively, for poor air quality (WHO 2018). The first air monitoring station (MS-1), in Dhaka, was located at 23.78° N and 90.36° E and characterized by heavy traffic and transportation. MS-1 was positioned about 100 m from the main road, and the roof was approximately 7 m above the ground. The second air monitoring site (MS-2), in Chattogram, was located at 22.32° N and 91.80° E. The sampling inlets of MS-2 were positioned on the flat roof of the monitoring site shelter, about 7 m above the ground. Unlike MS-1, the site was in a residential area and, therefore, not much influenced by local sources. The third monitoring site (MS-3), in Rajshahi, was located at 24.38° N and 88.61° E, approximately 3 km north of downtown and 10 m from a moderate traffic source. The sampler inlet was placed on a flat roof 5 m above the ground. The fourth and final monitoring site (MS-4) was situated in Sylhet (24.89° N and 91.87° E). MS-4 was 20 m from the Kin Bridge of Sylhet and characterized by moderate traffic. The roof was about 12 m above the ground. At every station, the intake nozzle of the sampler was placed 1.8 m above the roof with proper ventilation. To measure PM₂.₅ and PM₁₀ concentrations, an automatic, real-time suspended particulate monitor (Beta Gauge 101 M; ENVIRONMENT SA, France) was used. The UV-fluorescence AF22M (TELEDYNE/API, USA), chemiluminescence gas analyzer AC32M, UV photometric ozone analyzer-42M, and dispersive infra-red carbon monoxide analyzer-12M (ENVIRONMENT SA, France) were used to measure the concentrations of SO₂, NOₓ, O₃, and CO, respectively. To maintain quality assurance and control of the data, calibration was performed routinely. During processing, the data were checked for outliers, and if 75% of the data in a day were not available for any parameter because of power failure or equipment downtime, the values were considered non-representative and excluded from the analysis. The total data capture for MS-1, MS-2, MS-3, and MS-4 was 90.4%, 90%, 87.6%, and 86.5%, respectively, from January 2013 to June 2019.

Data pre-processing

The study performed data pre-processing to maintain the consistency of the raw dataset, which contained 2372 observations. To detect and remove spatiotemporal outliers from the raw data, the Z score method was used before the calculation of statistical parameters, consistent with previous studies (Barzeghar et al. 2020). First, the series were transformed into Z scores. Observations in the transformed series were then rejected from the original series when they met the following conditions: (i) the absolute Z score was greater than 4 (|Z_t| > 4), (ii) the increment from the previous value of the series was larger than 9 (Z_t − Z_{t−1} > 9), and (iii) the ratio of the Z score to its centered moving average of order 3 (MA3) was greater than 2 (Z_t/MA3(Z_t) > 2). To fill in missing values, the study used the nearest neighbor (NN) method, which aims to provide unbiased and valid estimates of associations based on information from the available data. It is widely regarded as a standard method for handling missing values in many areas of research (Li et al. 2019). The algorithm is a similarity-based concept that relies on distance metrics. In this work, we used the Minkowski norm (D) given by Eq. (1) as the distance metric, which takes the Euclidean form when p = 2:

$$ \mathrm{D}={\left({\sum}_{i=1}^n{\left|{x}_i-{y}_i\right|}^p\right)}^{1/p} $$

(1)

where x_i and y_i are the test sample and training data, respectively.
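The three rejection rules and the distance in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names are hypothetical, the population standard deviation is assumed for the Z scores, an observation is assumed to be rejected if any one rule fires, and the imputation simply copies values from the single closest complete row.

```python
from statistics import mean, pstdev


def z_scores(series):
    """Transform a series into Z scores (population SD is an assumption)."""
    mu, sd = mean(series), pstdev(series)
    return [(v - mu) / sd for v in series]


def outlier_flags(series):
    """Flag observations that meet rule (i), (ii), or (iii) from the text."""
    z = z_scores(series)
    flags = []
    for t, zt in enumerate(z):
        rule_i = abs(zt) > 4                      # |Z_t| > 4
        rule_ii = t > 0 and (zt - z[t - 1]) > 9   # Z_t - Z_{t-1} > 9
        rule_iii = False
        if 0 < t < len(z) - 1:                    # centered MA3 needs both neighbours
            ma3 = (z[t - 1] + zt + z[t + 1]) / 3
            rule_iii = ma3 != 0 and zt / ma3 > 2  # Z_t / MA3(Z_t) > 2
        flags.append(rule_i or rule_ii or rule_iii)
    return flags


def minkowski(x, y, p=2):
    """Minkowski distance of Eq. (1); p = 2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def impute_nearest(row, complete_rows, p=2):
    """Fill None entries in `row` from the closest complete row."""
    known = [j for j, v in enumerate(row) if v is not None]

    def dist(other):
        return minkowski([row[j] for j in known], [other[j] for j in known], p)

    nearest = min(complete_rows, key=dist)
    return [nearest[j] if v is None else v for j, v in enumerate(row)]
```

For example, `impute_nearest([9.0, None, 29.0], [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])` takes the missing middle value from the second row, which is the closer of the two on the known columns.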

Feature selection

A model-free method, i.e., the Boruta algorithm (BA), was used in this study to identify the features for the models’ prediction. The overall procedure was conducted in six steps: (a) creating duplicate copies of the predictors, i.e., meteorological parameters and pollutants; (b) randomly shuffling the duplicate copies to remove their correlation with the outcome; (c) applying a random forest algorithm (RFA) to find the most important predictors based on higher mean importance values; (d) estimating the Z score from the mean and standard deviation; (e) finding the maximum Z score (Z_max) among the duplicate predictors; and (f) repeating the above steps until the iterations were complete, separately for each air monitoring station.
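The shadow-feature comparison at the core of steps (a), (b), (e), and (f) can be sketched as follows. This is a simplified illustration only: to stay within the Python standard library, absolute Pearson correlation stands in for the random forest importance of step (c), the Z score and statistical test of the full Boruta algorithm are omitted, and all names are hypothetical.

```python
import random
from statistics import mean, pstdev


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return abs(cov / (sx * sy))


def shadow_feature_hits(features, target, n_iter=50, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) copy.

    features: dict mapping predictor name -> list of values
    target:   list of outcome values (e.g., daily PM concentration)
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for col in features.values():
            shadow = list(col)      # (a) duplicate the predictor
            rng.shuffle(shadow)     # (b) shuffle to break its link with the outcome
            shadow_scores.append(abs_corr(shadow, target))
        threshold = max(shadow_scores)  # (e) best score among the shadows
        for name, col in features.items():
            if abs_corr(col, target) > threshold:
                hits[name] += 1
    return hits                     # (f) accumulated over all iterations
```

A predictor that beats the best shadow in nearly every iteration is a candidate for retention, while one that rarely does is a candidate for rejection.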

PROPHET

PROPHET decomposes the time series into three main components: seasonality, overall trend, and holidays (Papacharalampous and Tyralis 2018). Equation (2) combines these three components:

$$ y(t)=g(t)+s(t)+h(t)+{\epsilon}_t $$

(2)

where g(t) = the trend function, which models non-periodic changes in the value of the time series; s(t) = periodic changes (e.g., weekly and yearly seasonality); h(t) = the effects of holidays, which occur on potentially irregular schedules over one or more days; and ε_t = any idiosyncratic changes that are not accommodated by the model.
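The additive structure of Eq. (2) can be illustrated with a toy model. This sketch does not call the PROPHET library and does not fit anything: the linear trend, the single weekly Fourier term, the holiday offset, and every parameter value are hypothetical choices made only to show how the components add up.

```python
import math


def trend(t, slope=0.05, offset=60.0):
    """g(t): a linear trend (assumed form, hypothetical parameters)."""
    return offset + slope * t


def seasonality(t, period=7.0, a=3.0, b=1.5):
    """s(t): one Fourier term with a weekly period."""
    angle = 2.0 * math.pi * t / period
    return a * math.sin(angle) + b * math.cos(angle)


def holiday_effect(t, holidays, effect=-10.0):
    """h(t): a fixed offset on days listed as holidays."""
    return effect if t in holidays else 0.0


def additive_model(t, holidays=frozenset()):
    """Noise-free part of Eq. (2): g(t) + s(t) + h(t)."""
    return trend(t) + seasonality(t) + holiday_effect(t, holidays)
```

In the fitted model, the residual between an observation and `additive_model(t)` plays the role of ε_t.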

Artificial neural network

ANN models are based on interacting neurons that transfer signals to one another along weighted connections (Feng et al. 2015). In ANN model systems, each neuron is linked to all neurons in the preceding and following layers (Bai et al. 2016). In the input layer, every input value is regarded as a neuron. All input values are first weighted, and the weighted values are then passed to the hidden layers, where every neuron produces an output value. Equation (3) gives the output:

$$ o=f\left({\sum}_{j=0}^n{w}_{ji}{x}_j\right) $$

(3)

where f = non-linear activation function, x_j = jth input to the neuron, w_ji = synaptic weight connecting the jth input to neuron i, and n = the number of inputs (Gomez-Sanchis et al. 2006). All data sets need to be normalized before training the ANN model, which is a basic step in ANN studies. The normalization equation (Eq. 4) is given below:

$$ {NI}_{ij}=\frac{I_{\left(i,j\right)}-\min (j)}{\max (j)-\min (j)} $$

(4)

where I = input value, NI = normalized value, i = index of the pattern (observation), and j = index of the variable, so that min(j) and max(j) are the minimum and maximum of variable j. In this study, a multilayer perceptron (MLP) was used, as it is the most classical type of ANN. After experimenting with several MLP structures, the study settled on two hidden layers.
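Equations (3) and (4) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, tanh is an assumed choice for the non-linear function f, and the j = 0 term of Eq. (3) is assumed to be a bias input fixed at x₀ = 1.

```python
import math


def min_max_normalize(column):
    """Eq. (4): rescale one variable to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def neuron_output(inputs, weights, activation=math.tanh):
    """Eq. (3): o = f(sum_j w_j * x_j), with x_0 = 1 as an assumed bias input.

    weights must hold one more value than inputs (the first is the bias weight).
    """
    x = [1.0] + list(inputs)
    return activation(sum(w * xj for w, xj in zip(weights, x)))
```

A hidden layer is then a list of such neurons applied to the same normalized inputs, and a second hidden layer applies the same operation to the outputs of the first.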

Random forest regression

As the RF model is a supervised learning algorithm, three user-defined parameters must be set when applying it: the number of variables used to grow each tree (mtry), which determines the strength of the trees in the forest and the correlations among them; the number of trees in the forest (ntree); and the minimum size of terminal nodes (nodesize). The predictive performance of the RF model is enhanced by increasing the strength of the trees and decreasing the correlation among them (Brokamp et al. 2017). First, n training subsets D₁, D₂, …, D_n are drawn from the total training data set D using bootstrap sampling (BS). Second, a regression tree is built on each subset, giving n regression trees and, accordingly, n regression results. Finally, the output is obtained by aggregating the results of the individual regression trees. In this study, we selected mtry = 5 and ntree = 500, with BS from the input data. Therefore, regression trees grew based on training data for each of one-fourth the total samples.
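The bootstrap-and-aggregate procedure described above can be sketched with a deliberately small base learner. This is not a full random forest: each "tree" is a depth-1 regression stump on a single predictor, so mtry and nodesize play no role, and the function names and default values are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the split with the lowest squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # exclude max so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:                          # all x identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Draw bootstrap subsets D_1..D_n, fit one stump on each, average the outputs."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean([tree(x) for tree in trees])
```

Averaging over many bootstrap-trained trees is what reduces the variance of the individual trees; a real random forest additionally samples mtry candidate predictors at each split.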

Gaussian process regression

Gaussian process models (GPM) are probabilistic and non-parametric, and generally work on the basic principles of Bayesian probability. In this study, exponential GPR was used. The task of GPR was to infer a mapping from a set of D-dimensional regression vectors, denoted by the regression matrix X = [x₁, x₂, …, x_n]^T, to a vector of output data y. The data set is denoted as:

$$ D={\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^N;\quad {x}_i\in \boldsymbol{X};\quad {y}_i\in \mathbb{R} $$

(5)

The outputs (y₁, y₂, ……., y_n) are usually assumed to be noisy realizations of the underlying function f (x_i). A GP model assumes that the output is a realization of a GP with a joint probability density function:

$$ f=N\left(m,k\right);m:\boldsymbol{X}\to \mathrm{\mathbb{R}};k:C\left({x}_i,{x}_j\right) $$

(6)

where m = mean function and k = covariance function, both of which are functions of the inputs, and N in Eq. (6) denotes the GP. Usually, m is set to 0, and k defines the characteristics of the process to be modeled. To make predictions, the study used the posterior, with the marginal likelihood used for selecting hyperparameters. The posterior predictive distribution is expressed as the following equation:

$$ p\left(\left.y\right|x\right)=\iint p\left(\left.y\right|f,\boldsymbol{X},\theta \right)\,p\left(\left.f\right|\theta \right)\,p\left(\theta \right)\, df\, d\theta $$

(7)

The quantity of interest in the study was the log marginal likelihood, since the quality of its approximation is linked to that of the posterior approximation:

$$ l\left(\theta \right)=-\frac{1}{2}\log \left(\left|k\right|\right)-\frac{1}{2}{y}^T{k}^{-1}y-\frac{N}{2}\log \left(2\pi \right) $$

(8)
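Equation (8) can be evaluated numerically with a Cholesky factorization of the covariance matrix. The sketch below is an illustration under stated assumptions, not the study's implementation: a zero mean function, one-dimensional inputs, the exponential kernel k(x, x′) = σ_f² exp(−|x − x′|/ℓ), and an added noise variance on the diagonal. The function names and default hyperparameters are hypothetical.

```python
import math


def exponential_kernel(a, b, length=1.0, signal=1.0):
    """Assumed exponential covariance between two scalar inputs."""
    return signal ** 2 * math.exp(-abs(a - b) / length)


def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def log_marginal_likelihood(xs, ys, length=1.0, signal=1.0, noise=0.1):
    """Eq. (8): -0.5*log|K| - 0.5*y^T K^{-1} y - (N/2)*log(2*pi)."""
    n = len(xs)
    K = [[exponential_kernel(xs[i], xs[j], length, signal)
          + (noise ** 2 if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # Forward substitution solves L z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * log_det - 0.5 * sum(v * v for v in z) - 0.5 * n * math.log(2.0 * math.pi)
```

Hyperparameters can then be chosen by comparing this value across candidate settings of `length`, `signal`, and `noise` and keeping the largest.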

Support vector machine

In this study, the concentration of particulate matter was predicted with the following SVM formula:

$$ {\hat{Y}}_0=\sum \limits_{i=1}^n\left({\alpha}_i-{\alpha}_i^{\ast}\right)K\left({X}_i,{X}_0\right);\quad K\left({X}_i,{X}_0\right)=\exp \left(-\frac{{\left\Vert {X}_i-{X}_0\right\Vert}^2}{\gamma}\right) $$

(9)

where α_i and $ {\alpha}_i^{\ast } $ = Lagrange multipliers associated with the support vectors X_i, X₀ = the input at which the prediction is made, γ = kernel scale parameter, and K(X_i, X₀) = radial basis kernel function. The choice of an appropriate kernel function (KF) is one of the main features of SVM applications, since SVMs are characterized by their use of KFs. The KF makes it possible to represent data that are non-linear in the input space in a space where they are, in essence, linear; an optimization procedure can then be applied as in the linear case. This provides a means to dimension the problem properly; however, the results still depend on a good selection of training data. The Gaussian kernel function (GKF) is defined as:

$$ K\left({X}_i,{X}_0\right)=\exp \left(\frac{-{\left\Vert {X}_i-{X}_0\right\Vert}^2}{2{\sigma}^2}\right) $$

(10)

where σ = kernel width, so that Eq. (10) corresponds to the kernel in Eq. (9) with γ = 2σ². The GKF provides an estimate of the consistency of the forecast in the form of the variance of the predictive distribution, and the analysis can be used to estimate the evidence in favor of a particular choice of covariance function. The covariance or kernel function can be seen as a model of the data, thus providing a principled method for model selection (Singh et al. 2013). In this study, linear SVM (L-SVM) and medium Gaussian SVM (M-SVM) were used.
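The Gaussian kernel and the kernel-weighted prediction sum can be written out as a short sketch. It assumes that the coefficients α_i and α_i* have already been obtained by training, which is not shown. The function names are hypothetical, and the optional `bias` term is an assumption that does not appear in the formula above.

```python
import math


def gaussian_kernel(x, x0, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - x0||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x0))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svr_predict(x0, support_vectors, alpha, alpha_star, sigma=1.0, bias=0.0):
    """Kernel-weighted sum over support vectors, evaluated at the query point x0."""
    return bias + sum(
        (a - a_star) * gaussian_kernel(xi, x0, sigma)
        for xi, a, a_star in zip(support_vectors, alpha, alpha_star)
    )
```

Support vectors far from the query point contribute almost nothing, because the kernel decays with squared distance; a smaller `sigma` makes that decay sharper.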

Evaluation metrics

The performance of the models was evaluated using the coefficient of determination (R²) (Eq. 12), root mean square error (RMSE) (Eq. 13), and mean absolute error (MAE) (Eq. 14):

$$ {\mathrm{R}}^2=\frac{\sum_{\mathrm{i}=1}^{\mathrm{n}}{\left({\mathrm{y}}_{\mathrm{i}}-{\mathrm{y}}_{\mathrm{mean}}\right)}^2}{\sum_{\mathrm{i}=1}^{\mathrm{n}}{\left({\mathrm{x}}_{\mathrm{i}}-{\mathrm{x}}_{\mathrm{mean}}\right)}^2} $$

(12)

$$ \mathrm{RMSE}=\sqrt{\frac{\sum_{\mathrm{i}=1}^{\mathrm{n}}{\left({\mathrm{x}}_{\mathrm{i}}-{\mathrm{y}}_{\mathrm{i}}\right)}^2}{\mathrm{n}}} $$

(13)

$$ \mathrm{MAE}=\frac{1}{n}{\sum}_{i=1}^n\left|{x}_i-{y}_i\right| $$

(14)

where x_i = ith observed value, (i = 1, 2, 3, ……, n), x_mean = mean of observed value, y_i = ith simulated value, and y_mean = mean of simulated value.
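The three metrics can be computed as below. The sketch follows Eqs. (12) to (14) as written, with hypothetical function names. Note that Eq. (12) is a ratio of simulated to observed variation, so it can give different values from the 1 − SS_res/SS_tot form of R² used by many libraries.

```python
import math
from statistics import mean


def r2_eq12(observed, simulated):
    """Eq. (12): variation of simulated values over variation of observed values."""
    x_mean, y_mean = mean(observed), mean(simulated)
    numerator = sum((y - y_mean) ** 2 for y in simulated)
    denominator = sum((x - x_mean) ** 2 for x in observed)
    return numerator / denominator


def rmse(observed, simulated):
    """Eq. (13): root mean square error."""
    n = len(observed)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(observed, simulated)) / n)


def mae(observed, simulated):
    """Eq. (14): mean absolute error."""
    n = len(observed)
    return sum(abs(x - y) for x, y in zip(observed, simulated)) / n
```

RMSE and MAE carry the units of the data (μg m⁻³ here), and RMSE penalizes large individual errors more heavily than MAE.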

Results and discussions

Data overview

The descriptive statistics for air pollutants and meteorological variables are presented in Table 1. Among the four stations, the highest annual mean concentration of PM₁₀ was observed in Dhaka (160.4 μg m⁻³) and the lowest in Rajshahi (109.4 μg m⁻³). Likewise, the highest concentration of PM₂.₅ was recorded in Dhaka (90.5 μg m⁻³) and the lowest in Rajshahi (67.1 μg m⁻³). The annual mean concentrations of PM₂.₅ and PM₁₀ in Chattogram were 70.6 μg m⁻³ and 133.2 μg m⁻³, respectively, the second highest among the stations. The seasonal and annual patterns of PM₂.₅ and PM₁₀ are compared with the WHO air quality standards in Fig. S6. The annual averages of PM₂.₅ and PM₁₀ concentrations in Dhaka, Chattogram, Rajshahi, and Sylhet all exceed the WHO standards; in Dhaka, the concentration is about six times greater than the standard. Moreover, the annual PM concentrations at the stations surpassed the Bangladesh National Ambient Air Quality Standard (BNAAQS), which sets annual limits of 15 μg m⁻³ for PM₂.₅ and 50 μg m⁻³ for PM₁₀ (Table S5).

Table 1 Annual mean (μ) ± standard error (SE), and standard deviation (SD) of the air pollutants’ concentration and meteorological parameters across the four air monitoring stations in Bangladesh during 2013–2019

Seasonal variation of PM

The overall statistics of seasonal meteorological patterns across the stations are presented in supplementary Tables S1, S2, S3, and S4. The concentrations of PM₁₀ and PM₂.₅ across the stations were clearly highest in winter and lowest in the monsoon season, and the seasonal fluctuation pattern was almost the same at every station. In winter, the highest mean concentrations of PM₂.₅ and PM₁₀ were observed in Dhaka (186.6 μg m⁻³ and 284.9 μg m⁻³, respectively), and the lowest in Sylhet (146.0 μg m⁻³) and Rajshahi (207.7 μg m⁻³), respectively. In the monsoon season, however, the highest PM₂.₅ concentration was found in Rajshahi (31.4 μg m⁻³) and the lowest in Sylhet (26.4 μg m⁻³). These statistics indicate a relationship between particulate matter and the seasonal meteorological variables. Manju et al. (2018) found a similar relationship between meteorological parameters and air pollutants in India. The correlations between the meteorological parameters and particulate matter in this study are illustrated in Fig. S4 and Fig. S5. Begum et al. (2011) revealed that brick kilns are responsible for the highest concentration of PM in Dhaka during winter, as the northwestern wind transports PM from the brick kilns located in Dhaka. However, a positive correlation was found between PM and temperature during the monsoon season in Dhaka, which can be explained by the high summer temperatures after rainfall in that season. Besides, the combined effect of high wind and temperature can increase the concentration (Kayes et al. 2019). Apart from the significant correlation with meteorological parameters, PM was also highly correlated with other gaseous air pollutants (Fig. S5). In Dhaka, PM₂.₅ was significantly correlated with PM₁₀, CO, NOₓ, and SO₂, especially during March, April, May, and June. During those months, the highest correlation for PM₂.₅ and PM₁₀ was found with CO because of on-road traffic congestion. During the pre-monsoon and post-monsoon periods, SO₂ was highly correlated with NOₓ at every air monitoring station. During those periods, traffic was not the only significant source of NOₓ; a substantial amount of NOₓ was also emitted to the atmosphere from the main source of SO₂ emissions. Begum et al. (2011) revealed that the main source of SO₂ emissions in Dhaka is brick kilns.

Model execution

Before model execution, the data were split into two subsets, i.e., training data and testing data. The study used 5-fold cross-validation for data splitting. In each fold, the dataset was separated into a training set (75% of the total data) and a hold-out test set (the remaining 25%) used to evaluate the performance of the models after training. After dividing the data sets, the study used BA to select the most important variables before running the models. The results showed that, for both PM₂.₅ and PM₁₀, temperature (Temp), relative humidity (RH), barometric pressure (BP), and wind direction (WD) were the most important predictors among the meteorological variables across the stations (Fig. S7). For PM₁₀ prediction, the relative importance scores of the most significant meteorological predictor, Temp, in Dhaka (MS-1), Chattogram (MS-2), Rajshahi (MS-3), and Sylhet (MS-4) were 13.7, 13.8, 14.27, and 14.08, respectively. The next most important meteorological variable was RH (12.8, 12.9, 13.6, and 12.11 for MS-1, MS-2, MS-3, and MS-4, respectively). Unlike for PM₁₀, however, the most influential meteorological predictor for PM₂.₅ prediction was BP (15.12, 14.83, 14.7, and 14.74 for MS-1, MS-2, MS-3, and MS-4, respectively). Among the chemical species fed into the models to predict PM₂.₅ and PM₁₀, SO₂ and NOₓ were the most influential explanatory variables. For PM₂.₅ prediction, the relative importance scores of SO₂ were 12.58, 12.55, 12.34, and 12.35 at MS-1, MS-2, MS-3, and MS-4, respectively.
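Fold-based splitting can be sketched as an index generator. This is a generic illustration with hypothetical names, not the study's splitting code, and the random shuffling of rows is an assumption. Note that with k = 5 each held-out fold contains 20% of the rows, whereas k = 4 reproduces the 75%/25% proportion quoted above.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # assumed: rows are shuffled once
    folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Each observation appears in exactly one test fold, so every row is used for evaluation once and for training k − 1 times.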

PROPHET

The study used a time series model, PROPHET, for comparison with the machine learning models. PROPHET did not perform better than the machine learning models in predicting PM₂.₅ and PM₁₀. For PM₂.₅ prediction, the R² values of PROPHET for MS-1, MS-2, MS-3, and MS-4 were 0.72, 0.74, 0.78, and 0.75, respectively. It also showed poor results for PM₁₀ prediction; in particular, the model performed worst at MS-1. Samal et al. (2019) and Ye (2019) used this time series model to predict air pollutants in India and China, respectively. The model performed better in Samal et al. (2019) than in this study: its RMSE value (3.54 μg m⁻³) was quite satisfactory for SPM simulation. Unlike this study, Ye (2019) found RMSE values of 10.34 μg m⁻³ and 15.5 μg m⁻³ for PM₂.₅ and PM₁₀ prediction, respectively.

L-SVM and M-SVM

Table 2 shows that the two SVM models are comparable in terms of prediction metrics, and both performed well in predicting daily mean PM₂.₅ and PM₁₀ concentrations. In particular, M-SVM performed better than L-SVM. During the training period, M-SVM showed higher R² and lower RMSE values than L-SVM. For PM₂.₅ prediction at MS-1, MS-2, MS-3, and MS-4, the RMSE values of M-SVM were 8.89 μg m⁻³, 10.6 μg m⁻³, 9.89 μg m⁻³, and 10.2 μg m⁻³, respectively, whereas for L-SVM they were 8.57 μg m⁻³, 12.3 μg m⁻³, 10.7 μg m⁻³, and 10 μg m⁻³, respectively. As with PM₂.₅, M-SVM showed better performance for PM₁₀ prediction. For PM₁₀ prediction, the highest R² value among MS-1, MS-2, MS-3, and MS-4 was obtained at MS-3 for both models (L-SVM = 0.93 and M-SVM = 0.94). The lowest MAE value (4.87 μg m⁻³) was found at MS-1 with M-SVM. Over-fitting was controlled in this study during model execution. Generally, over-fitting occurs when the results of testing are greater than the validation (Mehdipour et al. 2018). Singh et al. (2013) used an SVM model to predict urban air quality in India, with RMSE values of 9.14 μg m⁻³ and 9.22 μg m⁻³ during the testing and training periods, respectively. In Tehran, however, Mehdipour et al. (2018) obtained much lower RMSE values (0.0501 μg m⁻³ and 0.519 μg m⁻³ during testing and training, respectively) using SVM models to predict PM₂.₅ concentration.

Table 2 Performance metrics (R², RMSE, and MAE) of the models during testing and training period. Here, MS-1, MS-2, MS-3, and MS-4 are representing Dhaka, Chattogram, Rajshahi, and Sylhet respectively. The values in italics indicate the best results among the models

ANN

For MS-2 and MS-4, the best prediction results were obtained using ANN. For PM₂.₅ prediction, the lowest RMSE (9.42 μg m⁻³) and MAE (4.93 μg m⁻³) values were found at MS-2 during the test period. For PM₁₀ prediction at MS-1, MS-2, MS-3, and MS-4, the RMSE values of the ANN were 13.8 μg m⁻³, 14.2 μg m⁻³, 14.7 μg m⁻³, and 14.9 μg m⁻³, respectively. Özdemir and Taner (2014) used multiple linear regression and ANN to predict PM₁₀ concentration in Turkey. The accuracy of the back-propagation feed-forward ANN (BPNN) with two hidden layers was 87% and 49% for the urban and industrial zones, respectively. Feng et al. (2015) combined ANN with a trajectory model and a wavelet model to improve the forecasting performance for PM₂.₅ in China. With the ANN model specifically, they found RMSE values ranging from 28.63 to 36.78 μg m⁻³.

RFR

RFR produced the best results for predicting PM₁₀ at MS-4. During the testing period, RFR showed higher R² and lower RMSE and MAE values than L-SVM, M-SVM, and PROPHET. For PM₂.₅ prediction at MS-1, MS-2, MS-3, and MS-4, the RMSE values of RFR ranged from 9.2 to 11.6 μg m⁻³ and from 9.59 to 11.9 μg m⁻³ during the training and testing periods, respectively. As with PM₂.₅, RFR performed better than the SVMs for PM₁₀. For PM₁₀ prediction, the best metrics among MS-1, MS-2, MS-3, and MS-4 were obtained at MS-3 during model execution (R² = 0.91, RMSE = 13.7 μg m⁻³, MAE = 7.57 μg m⁻³). In China, Hu et al. (2017) experimented with RF and obtained an R² value of 0.80 on average. Joharestani et al. (2019), however, used 23 features, including meteorological variables, geographic data, and ground-measured concentration data, to predict PM₂.₅ in Tehran. In that study, the R² and RMSE values varied from 0.66 to 0.78 and from 14.47 to 15.30 μg m⁻³, respectively.

GPR

Among the models in this study, GPR showed the best performance for both PM₂.₅ and PM₁₀ concentrations at MS-1 (R² = 0.91, RMSE = 7.68 μg m⁻³, MAE = 3.59 μg m⁻³ for PM₂.₅; R² = 0.90, RMSE = 12.8 μg m⁻³, MAE = 7.62 μg m⁻³ for PM₁₀) and MS-3 (R² = 0.92, RMSE = 8.72 μg m⁻³, MAE = 4.17 μg m⁻³ for PM₂.₅; R² = 0.91, RMSE = 12.1 μg m⁻³, MAE = 6.89 μg m⁻³ for PM₁₀), in both the training and testing periods. During training, the R², RMSE, and MAE values ranged from 0.91 to 0.94, 7.68 to 11.3 μg m⁻³, and 3.59 to 6.87 μg m⁻³ for PM₂.₅, respectively, and from 0.87 to 0.95, 12.5 to 11.2 μg m⁻³, and 6.76 to 7.61 μg m⁻³ for PM₁₀, respectively. The worst performance of GPR was found for PM₁₀ prediction at MS-2 (Table 2). In Tehran, Mehdipour et al. (2018) used a Bayesian network to predict PM₂.₅, with a final RMSE value of 0.1077. Figure 1 presents the overall validation results of the models.

Model comparison and proposed model

Figure 2 presents the selection of the best model for each monitoring station. The training results showed that over-fitting was well controlled in this study. Fig. 1 and Table 2 show that, for PM₂.₅ and PM₁₀, the PROPHET time series model performed worse than the machine learning models. GPR showed the best performance among all the models, particularly at MS-1 and MS-3, and was therefore selected as the best model for predicting both PM₂.₅ and PM₁₀ at those stations. In contrast, ANN performed well only for PM₂.₅ prediction at MS-2 and MS-4. For PM₁₀, RFR and M-SVM were selected for MS-4 and MS-2, respectively. The results of the models were further compared in terms of explanatory variables. Initially, the study used only meteorological variables to predict particulate matter. The initial model validation results using only meteorological parameters are presented in the supplementary section (Table S6). When chemical species such as NOₓ, SO₂, CO, and O₃ were then fed into the machine learning models, the study obtained more meaningful results than before: including source pollutants decreased the RMSE values of the models. Therefore, when developing machine learning models, the study recommends using more source pollutants and meteorological variables to obtain better results.

Conclusion

Machine learning provides reliable forecasting of atmospheric pollution. This research, therefore, explored the application of ML models in the management of PM and air quality in Bangladesh. Five ML models (ANN, L-SVM, M-SVM, RFR, and GPR) were used and compared with a time series model, PROPHET. For Dhaka and Rajshahi, GPR showed the best results in terms of the R², RMSE, and MAE evaluation metrics. Therefore, the study recommends using GPR to predict the concentrations of both PM_2.5 and PM₁₀ at those two stations. For Chattogram and Sylhet, however, the study proposes individual models: ANN for PM_2.5 at both stations, and M-SVM and RFR, respectively, for PM₁₀. The study also recommends using data from a longer period to examine the performance of the models. Moreover, hybrid models could be developed and compared with the models used here. Overall, the results of this study show that machine learning can provide useful information that government officials and policy makers in different countries can use to issue early alerts of atmospheric pollution incidents and thereby protect citizens from exposure.

References

Bai Y, Li Y, Wang X, Xie J, Li C (2016) Air pollutants concentrations forecasting using back propagation neural network based on wavelet decomposition with meteorological conditions. Atmos Pollut Res 7(3):557–566. https://doi.org/10.1016/j.apr.2016.01.004
Barzeghar V, Sarbakhsh P, Hassanvand MS et al (2020) Long-term trend of ambient air PM₁₀, PM_2.5, and O₃ and their health effects in Tabriz city, Iran, during 2006–2017. Sustain Cities Soc 54:101988. https://doi.org/10.1016/j.scs.2019.101988
Begum BA, Hopke PK (2018) Ambient air quality in Dhaka Bangladesh over two decades: impacts of policy on air quality. Aerosol Air Qual Res 18:1910–1920. https://doi.org/10.4209/aaqr.2017.11.0465
Begum BA, Biswas SK, Hopke PK (2011) Key issues in controlling air pollutants in Dhaka, Bangladesh. Atmos Environ 45(40):7705–7713. https://doi.org/10.1016/j.atmosenv.2010.10.022
Brokamp C, Jandarov R, Rao MB, LeMasters G, Ryan P (2017) Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches. Atmos Environ 151:1–11. https://doi.org/10.1016/j.atmosenv.2016.11.066
Chen J, Chen H, Wu Z, Hu D, Pan JZ (2017) Forecasting smog-related health hazard based on social media and physical sensor. Infor Syst 64:281–291. https://doi.org/10.1016/j.is.2016.03.011
Daly A, Zannetti P (2007) Air pollution modeling: an overview. In: Zannetti P (ed) Ambient air pollution. The EnviroComp Institute, California, pp 15–28. http://home.iitk.ac.in/~anubha/Modeling.pdf
Feng X, Li Q, Zhu Y, Hou J, Jin L, Wang J (2015) Artificial neural networks forecasting of PM_2.5 pollution using air mass trajectory based geographic model and wavelet transformation. Atmos Environ 107:118–128. https://doi.org/10.1016/j.atmosenv.2015.02.030
Gomez-Sanchis J, Martín-Guerrero JD, Soria-Olivas E et al (2006) Neural networks for analysing the relevance of input variables in the prediction of tropospheric ozone concentration. Atmos Environ 40(32):6173–6180. https://doi.org/10.1016/j.atmosenv.2006.04.067
Hu X, Belle JH, Meng X, Wildani A, Waller LA, Strickland MJ, Liu Y (2017) Estimating PM2.5 concentrations in the conterminous United States using the random forest approach. Environ Sci Technol 51(12):6936–6944. https://doi.org/10.1021/acs.est.7b01210
Islam MM, Sharmin M, Ahmed F (2020) Predicting air quality of Dhaka and Sylhet divisions in Bangladesh: a time series modeling approach. Air Qual Atmos Health 13:607–615. https://doi.org/10.1007/s11869-020-00823-9
Jiménez PA, Dudhia J (2013) On the ability of the WRF model to reproduce the surface wind direction over complex terrain. J Appl Meteorol Climatol 52:1610–1617. https://doi.org/10.1175/JAMC-D-12-0266.1
Joharestani MZ, Cao C, Ni X, Bashir B, Talebiesfandarani S (2019) PM_2.5 prediction based on random forest, XGBoost, and deep learning using multisource remote sensing data. Atmosphere 10(7):373. https://doi.org/10.3390/atmos10070373
Kang GK, Gao JZ, Chiao S et al (2018) Air quality prediction: big data and machine learning approaches. Int J Environ Sci Develop 9(1):8–16. https://doi.org/10.18178/ijesd.2018.9.1.1066
Kayes I, Shahriar SA, Hasan K et al (2019) The relationships between meteorological parameters and air pollutants in an urban environment. Global J Environ Sci Manag 5(3):265–278. https://doi.org/10.22034/gjesm.2019.03.01
Kim KH, Kabir E, Kabir S (2015) A review on the human health impact of airborne particulate matter. Environ Int 74:136–143. https://doi.org/10.1016/j.envint.2014.10.005
Li C, Wang Z, Li B, Peng ZR, Fu Q (2019) Investigating the relationship between air pollution variation and urban form. Build Environ 147:559–568. https://doi.org/10.1016/j.buildenv.2018.06.038
Lu HY, Mwangi JK, Wang LC, Wu YL, Tseng CY, Chang KH (2016) Atmospheric PM_2.5 characteristics and long-term trends in Tainan city, southern Taiwan. Aerosol Air Qual Res 16(10):2488–2511. https://doi.org/10.4209/aaqr.2016.07.0332
Manju A, Kalaiselvi K, Dhananjayan V, Palanivel M, Banupriya GS, Vidhya MH, Panjakumar K, Ravichandran B (2018) Spatio-seasonal variation in ambient air pollutants and influence of meteorological factors in Coimbatore, southern India. Air Qual Atmos Health 11(10):1179–1189. https://doi.org/10.1007/s11869-018-0617-x
Marshall JD, Nethery E, Brauer M (2008) Within-urban variability in ambient air pollution: comparison of estimation methods. Atmos Environ 42:1359–1369. https://doi.org/10.1016/j.atmosenv.2007.08.012
Mehdipour V, Stevenson DS, Memarianfard M, Sihag P (2018) Comparing different methods for statistical modeling of particulate matter in Tehran, Iran. Air Qual Atmos Health 11(10):1155–1165. https://doi.org/10.1007/s11869-018-0615-z
Orioli R, Cremona G, Ciancarella L, Solimini AG (2018) Association between PM₁₀, PM_2.5, NO₂, O₃ and self-reported diabetes in Italy: a cross-sectional, ecological study. PLoS One 13(1):e0191112. https://doi.org/10.1371/journal.pone.0191112
Özdemir U, Taner S (2014) Impacts of meteorological factors on PM₁₀: artificial neural networks (ANN) and multiple linear regression (MLR) approaches. Environ Foren 15(4):329–336. https://doi.org/10.1080/15275922.2014.950774
Papacharalampous GA, Tyralis H (2018) Evaluation of random forests and Prophet for daily streamflow forecasting. Adv Geosci 45:201–208. https://doi.org/10.5194/adgeo-45-201-2018
Rybarczyk Y, Zalakeviciute R (2018) Machine learning approaches for outdoor air quality modelling: a systematic review. Appl Sci 8(12):2570. https://doi.org/10.3390/app8122570
Salnikov VG, Karatayev MA (2011) The impact of air pollution on human health: focusing on the Rudnyi Altay industrial area. Am J Environ Sci 7(3):286–294. https://doi.org/10.3844/ajessp.2011.286.294
Samal KKR, Babu KS, Das SK, Acharaya A (2019) Time series based air pollution forecasting using SARIMA and Prophet model. In proceedings of the 2019 international conference on information technology and computer communications, pp 80-85. https://doi.org/10.1145/3355402.3355417
Shimadera H, Kojima T, Kondo A (2016) Evaluation of air quality model performance for simulating long-range transport and local pollution of PM_2.5 in Japan. Adv Meteorol 2016:5694251. https://doi.org/10.1155/2016/5694251
Singh KP, Gupta S, Rai P (2013) Identifying pollution sources and predicting urban air quality using ensemble learning methods. Atmos Environ 80:426–437. https://doi.org/10.1016/j.atmosenv.2013.08.023
Sportisse B (2007) A review of current issues in air pollution modeling and simulation. Comput Geosci 11:159–181. https://doi.org/10.1007/s10596-006-9036-4
WHO (2018) WHO global ambient air quality database (update 2018). World Health Organization. https://www.who.int/airpollution/data/cities/en/
Wu X, Nethery RC, Sabath BM, Braun D, Dominici F (2020) Exposure to air pollution and COVID-19 mortality in the United States. medRxiv. https://doi.org/10.1101/2020.04.05.20054502
Ye Z (2019) Air pollutants prediction in Shenzhen based on ARIMA and Prophet method. In: E3S web of conferences, EDP sciences, 136:05001. https://doi.org/10.1051/e3sconf/201913605001

Acknowledgments

We would like to thank the Department of Environment, Government of Bangladesh for providing the data and gratefully acknowledge Shajedul Islam from the Department of Environmental Science and Disaster Management, NSTU, for his support in data preparation.

Author information

Authors and Affiliations

Department of Environmental Science and Disaster Management, Noakhali Science and Technology University, Noakhali, 3814, Bangladesh
Shihab Ahmad Shahriar, Imrul Kayes, Kamrul Hasan & Mohammed Abdus Salam
School of Biological Sciences, The University of Queensland, Brisbane, Australia
Shawan Chowdhury

Corresponding author

Correspondence to Imrul Kayes.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1

(DOCX 4514 kb)

About this article

Cite this article

Shahriar, S.A., Kayes, I., Hasan, K. et al. Applicability of machine learning in modeling of atmospheric particle pollution in Bangladesh. Air Qual Atmos Health 13, 1247–1256 (2020). https://doi.org/10.1007/s11869-020-00878-8

Received: 16 April 2020
Accepted: 09 July 2020
Published: 20 July 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s11869-020-00878-8
