Coupling of quantile regression into boosted regression trees (BRT) technique in forecasting emission model of PM10 concentration

Shaziayani, Wan Nur; Ul-Saufie, Ahmad Zia; Ahmat, Hasfazilah; Al-Jumeily, Dhiya

doi:10.1007/s11869-021-01045-3


Open access
Published: 24 May 2021

Volume 14, pages 1647–1663, (2021)

Air Quality, Atmosphere & Health


Wan Nur Shaziayani¹, Ahmad Zia Ul-Saufie² (ORCID: orcid.org/0000-0002-4084-6647), Hasfazilah Ahmat² & Dhiya Al-Jumeily³


Abstract

Air pollution is becoming a significant global environmental issue. Air pollution sources in Malaysia are either mobile or stationary. Motor vehicles are among the mobile sources, while stationary sources include emissions from urban development, quarrying, power plants and the petrochemical industry. The most prominent contaminant in Peninsular Malaysia is particulate matter (PM₁₀), which contributes more to the Air Pollution Index (API) than any other pollutant. The aim of this study is to determine which loss function, quantile regression (QR) or ordinary least squares (OLS), performs better in a boosted regression tree (BRT) model for predicting PM₁₀ concentration in Alor Setar, Klang and Kota Bharu, Malaysia. Model comparison using the coefficient of determination (R²), prediction accuracy (PA), index of agreement (IA), normalized absolute error (NAE) and root mean square error (RMSE) shows that QR is slightly better than OLS, with R² (0.60–0.73), PA (0.78–0.85), IA (0.86–0.92), NAE (0.15–0.17) and RMSE (9.52–22.15) for next-day predictions in the BRT model.


Introduction

The Air Pollution Index (API) describes the current state of air quality in a given region. The Department of Environment (DOE), Ministry of Environment and Water, is one of the government agencies responsible for monitoring air quality at 68 stations in Malaysia. The API was introduced as a measure of the cleanliness of the air (Leong et al. 2020). The Malaysia Ambient Air Quality Guidelines (MAAQG) are used to assess air quality in Malaysia and specify concentration levels for particles smaller than 2.5 μm (PM₂.₅), particles smaller than 10 μm (PM₁₀), carbon monoxide (CO), sulphur dioxide (SO₂), nitrogen dioxide (NO₂) and ozone (O₃). When a concentration stays above the level specified in the MAAQG for a long period, it has negative effects on health and the environment. The API categories used in Malaysia (good, moderate, unhealthy and hazardous) are listed in Table 1. PM₁₀ is generally identified as a major pollutant that causes unhealthy conditions (DOE 2018). PM₁₀ is therefore the main focus of this study.

Table 1 Malaysia Air Pollution Index


According to Azmi et al. (2010), the main causes of air pollution in Malaysia are either mobile sources such as cars, buses and planes or stationary sources such as power plants, open burning and wildfires, industrial facilities and others. Haze resulting from biomass burning has interrupted everyday life in Malaysia since 1982 (Latif et al. 2018), and several haze episodes have been reported since then. Extreme episodes occurred in 1997, 2005 and 2015. The severe haze of 1997 was due to forest fires and large-scale plantations, especially in southern Sumatra and central Kalimantan in neighbouring Indonesia. The city of Kuching, Sarawak, in East Malaysia was one of the areas affected by air pollution and haze in 1997. The Kuching API was recorded above 850 during that episode, the most alarming haze in Malaysia (Zakri et al. 2018). A further episode of extreme haze was reported in 2005 (Sahani et al. 2014), mainly on the west coast of Peninsular Malaysia. At that time, the smoke haze heavily affected the Klang Valley and its surrounding area. It peaked with the haze emergency of 11 August 2005, when the Air Pollution Index (API) readings in Port Klang and Kuala Selangor were above 500. The latest extreme and prolonged haze episode in Malaysia was reported in September 2015 (Huijnen et al. 2016). PM₁₀ is the most significant pollutant released by human activity (Sapini et al. 2015) and is regarded as the main atmospheric pollutant in most cities of Southeast Asia (Reddington et al. 2014) and in Malaysia (Juneng et al. 2011). PM₁₀ contributed most to Malaysia’s API until 2017. From mid-2017 until 2018, PM₂.₅ had a greater impact on the API in Malaysia (DOE 2018).

Interest in using statistical models to predict air pollution has grown in recent years. Regression techniques in particular have long been used as predictive tools in many fields, including air pollution prediction. Regression models are easy to use and efficient to run. However, they perform poorly in complex situations because they assume a linear relationship between the selected parameters (Abdullah et al. 2016). Statistical methods are also limited in clarifying the factors influencing PM₁₀, owing to their statistical assumptions and the requirement of homogeneous data. Recent studies have attempted to develop powerful computational intelligence models using machine learning algorithms such as neural networks to predict the complex PM₁₀ concentration system, and they indicate that such models can readily predict the desired value (Abdullah et al. 2017). However, machine learning models, and neural networks in particular, are usually used as a black box that offers no specific understanding of the physical characteristics behind the technique (Viotti et al. 2002).

The boosted regression tree (BRT) model, another type of machine learning that combines the advantages of regression trees with adaptive boosting, has recently been used in air pollution prediction studies. The boosting method was developed by Friedman (2001), who later added a stochastic element to the algorithm by fitting each tree to a random sample of the training data (Friedman 2002). Boosting can also be used as a general method for improving the accuracy of any learning algorithm. The BRT produces an ensemble model by adding a user-defined number of trees, each fitted to reduce a loss function (such as the root mean square error). Unlike black-box techniques, the BRT method evaluates the response with respect to each individual predictor. It is therefore possible to determine, rank and describe the relationships between variables (Yahaya et al. 2019). The BRT can also handle various types of inputs (i.e. categorical and continuous data), accepts missing values (Motevalli et al. 2019) and can work with multiple forms of loss function (Ridgeway 2012), such as Gaussian, Laplace, quantile regression (QR), Bernoulli and Poisson.

The loss function is one of the factors to consider in a BRT model. Many studies have used the ordinary least squares (OLS) loss function to minimize the squared error for continuous responses, which resulted in a better correlation between the observed values and the estimates of the generalized boosting model (GBM) (Gu et al. 2019). However, the OLS loss function is not well suited to datasets that contain outliers, such as air pollution data. According to Kudryavtsev (2009), QR has become an important robust alternative, as it is more resistant to outliers and is distribution-free, making no assumptions about the distribution of the data.

QR can be more useful and precise, since non-central locations of a distribution can be represented by its quantiles (Lingxin and Naiman 2007). QR can include models for all quantiles, evaluate the entire distribution of the variable of interest and estimate its central tendency (such as mean, median and mode). Schlink et al. (2010) found that an advantage of QR is its robustness to data that do not follow the distribution assumed by OLS. It can also be adapted to unbalanced observational frequencies. Because of these properties, QR was selected as the loss function for this study.

The aim of this study is to develop an air pollution model based on the QR loss function within the BRT method. The literature indicates that no study has used this approach to predict PM₁₀ concentrations. The results of the proposed methodology are compared with the predictions obtained from the OLS loss function using the BRT method.

Methodology

The workflow, from data preparation through model development to model evaluation, is illustrated in Fig. 1. The flow diagram was reconstructed and adapted for this research.

Data preparation

Three urban sites were selected for this study. Table 2 shows the characteristics of each station. All stations are located in Peninsular Malaysia. Alor Setar station (CA0040) is located in the northern region, Klang (CA0011) in the west coast region and Kota Bharu (CA022) in the east coast region, as shown in Fig. 2. The data come from the continuous air quality monitoring (CAQM) stations operated by the Department of Environment in Malaysia. A CAQM station is an integrated ambient air quality monitoring system outfitted with a variety of ambient air analysers and sensors to identify particular pollutants. The analysers and sensors operate in continuous mode, with the collected data captured on a microcomputer-based data acquisition system (DAS) that also controls the performance of the analysers and sensors. Data are collected hourly and transferred to a central computer for review and reporting. The United States Environmental Protection Agency (USEPA) has authorized the monitoring instruments and operational protocols of the CAQM stations (Kamarul Zaman et al. 2017).

Table 2 Characteristics of monitoring station sites


For data exploration, a descriptive analysis was carried out to identify extreme or missing values. Missing data are a problem commonly faced by researchers in environmental studies. Data discontinuities are a major obstacle for prediction models, most of which require continuous records. Missing data limit the ability to interpret the observations or draw accurate conclusions from them (Noor et al. 2014). The missing data must be treated because complete data are required for statistical analysis. This study used linear interpolation to impute missing data. According to Noor et al. (2015), linear interpolation estimates missing data better than the other methods.
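As an illustration of the imputation step, the following is a minimal sketch of linear interpolation over a daily series, assuming missing values are marked as `None`. The function name `interpolate_linear` and the sample values are hypothetical; how the authors treated gaps at the start or end of a record is not stated, so the sketch simply leaves those untouched.

```python
from typing import List, Optional

def interpolate_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps (None) on the straight line between the nearest
    observed neighbours; leading and trailing gaps are left as None."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for k in range(left + 1, right):
            weight = (k - left) / span
            filled[k] = series[left] + weight * (series[right] - series[left])
    return filled

# Hypothetical daily PM10 maxima with two gaps (values are illustrative only).
pm10 = [40.0, None, None, 70.0, 65.0, None, 55.0]
print(interpolate_linear(pm10))
```

Each missing day is placed proportionally between its two observed neighbours, so a gap of two days between 40 and 70 is filled with values near 50 and 60.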

Data pre-processing

The maximum daily data used in this study were provided by the Department of Environment (DOE), Ministry of Environment and Water of Malaysia, for the period 2002 to 2017. The data for this project are confidential, but may be obtained with Data Use Agreements with the Department of Environment (DOE), Ministry of Environment and Water of Malaysia. The data were split at random, with 80% used for training (model development) and the remaining 20% for validation (evaluating the performance of the model). The variables used in this study consist of nitrogen dioxide (NO₂; ppb), carbon monoxide (CO; ppb), sulphur dioxide (SO₂; ppb), ozone (O₃; ppb), particulate matter (PM₁₀; μg/m³) and the meteorological parameters wind speed (WS; km/h), relative humidity (RH; %) and temperature (T; °C), which serve as predictors of PM₁₀ concentrations up to 3 days ahead. All the selected parameters influence forecasts of PM₁₀ concentrations up to 3 days ahead and have been used by previous researchers, as summarized in Table 3. The general models for this study are shown in Table 4.

Table 3 Variable selection by the previous studies in the prediction of PM₁₀ concentration level


Table 4 General model of BRT


where

PM_{10,D+1}: Next-day prediction of PM₁₀ concentration
PM_{10,D+2}: Next 2 days prediction of PM₁₀ concentration
PM_{10,D+3}: Next 3 days prediction of PM₁₀ concentration
PM_{10,D}: Particulate matter (μg/m³)
CO_D: Carbon monoxide (ppb)
NO_{2,D}: Nitrogen dioxide (ppb)
SO_{2,D}: Sulphur dioxide (ppb)
O_{3,D}: Ozone (ppb)
RH_D: Relative humidity (%)
T_D: Temperature (°C)
WS_D: Wind speed (km/h)
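The pairing of day-D predictors with a PM₁₀ target one to three days later, followed by the random 80/20 split, might be set up as in the sketch below. The helper names (`make_lead_pairs`, `split_train_validation`), the fixed seed and the synthetic rows are assumptions for illustration; the study's actual data handling was done in R and is not reproduced here.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, float]

def make_lead_pairs(rows: List[Row], lead: int) -> List[Tuple[Row, float]]:
    """Pair the predictors observed on day D with PM10 on day D + lead."""
    return [(rows[d], rows[d + lead]["PM10"]) for d in range(len(rows) - lead)]

def split_train_validation(pairs, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle the pairs and cut them into training and validation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Synthetic daily records (illustrative values, not study data).
rows = [{"PM10": 40.0 + d, "CO": 500.0 + d, "WS": 10.0} for d in range(13)]
pairs_d3 = make_lead_pairs(rows, lead=3)   # targets are PM10 at D + 3
train, validation = split_train_validation(pairs_d3)
print(len(train), len(validation))
```

One such set of pairs would be built per horizon (lead = 1, 2, 3), giving a separate model for each of the three targets defined above.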

Model development

BRT is a method that increases the accuracy of a single model by fitting a number of models and combining them for prediction. BRT uses regression trees from the classification and regression tree (CART) family and applies boosting to combine sets of models (Grunwald et al. 2020). The BRT has several tuning parameters that need to be controlled: the number of trees (nt); the learning rate (lr), which is the shrinkage parameter applied at each iteration to reduce the contribution of each tree; and the tree complexity (tc), or interaction depth, which is the maximum depth of variable interactions in a tree. This study fitted BRT models with nt = 10,000, lr = 0.01 and tc = 5. The BRT model was fitted in version 3.4.2 of the R software with version 1.6-3.1 of the GBM package. The GBM offers three methods for estimating the optimum number of trees, i.e. cross validation (CV), an independent test set (test) and out-of-bag estimation (OOB).

This research used 10-fold cross validation, as suggested by Ridgeway (2010), to obtain the optimum number of trees; the results from the ten separate testing folds were then averaged. Rather than relying on a single block for testing, CV uses every block in turn, one at a time, and summarizes the results at the end. The independent test set (test) approach uses a single holdout dataset to determine the optimum number of trees (Ridgeway 2007). This research used a 50% held-out test set to find the optimum number of trees, as suggested by Ridgeway (2017). Out-of-bag estimation (OOB) is used to evaluate the classifier. According to Martinez-Munoz and Suarez (2010), in standard bagging the individual classifiers are trained on independent bootstrap samples drawn with replacement from the original data. In general, the size of these samples is chosen to match the size of the original training dataset. This prescription is arbitrary and need not be optimal in terms of the ensemble’s generalization accuracy. The accuracy of the voting classifier is equal to the average of the classifiers. A bag fraction (bag.fraction) of 0.5 was used in this research, as suggested by Ridgeway (2020), to improve predictive performance when using the OOB method.
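The cross-validation idea can be sketched as follows: partition the training rows into ten folds, record a held-out loss curve (loss versus number of trees) for each fold, average the curves and take the tree count at the minimum. This is only a schematic of the selection logic under those assumptions, not the internals of the GBM package; `k_fold_indices`, `best_number_of_trees` and the toy loss curves are hypothetical.

```python
import random
from typing import List, Sequence

def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Randomly partition row indices 0..n-1 into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def best_number_of_trees(loss_curves: Sequence[Sequence[float]]) -> int:
    """Average the per-fold held-out loss curves and return the tree count
    (1-based) at which the averaged loss is smallest."""
    n_trees = len(loss_curves[0])
    mean_curve = [sum(curve[t] for curve in loss_curves) / len(loss_curves)
                  for t in range(n_trees)]
    return 1 + min(range(n_trees), key=mean_curve.__getitem__)

folds = k_fold_indices(25, k=10)
# Toy held-out loss after 1, 2, 3 and 4 trees for two folds (made-up numbers).
curves = [[5.0, 3.0, 2.0, 4.0], [6.0, 4.0, 3.0, 5.0]]
print(best_number_of_trees(curves))
```

The independent test set approach corresponds to the same selection with a single held-out curve instead of an average over folds.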

Like other boosting algorithms, BRT constructs a model as a weighted sum of functions. The steps of the BRT algorithm are summarized as follows:

$$ {F}_0(x)=\arg \underset{\gamma }{\min}\sum \limits_{i=1}^nL\left({y}_i,\gamma \right) $$

(1)

$$ {F}_m(x)={F}_{m-1}(x)+{\gamma}_m{h}_m(x) $$

(2)

Start the model with a constant value F₀(x). The BRT algorithm combines decision trees with a loss function that measures how well the response is predicted. At each stage m, the decision tree h_m(x) is chosen to minimize the loss given the current model F_{m−1} and its fitted values F_{m−1}(x_i). The residuals r_{i, m} are computed as:

$$ {r}_{i,m}=-{\left[\frac{\partial L\left({y}_i,F\left({x}_i\right)\right)}{\partial F\left({x}_i\right)}\right]}_{F(x)={F}_{m-1}(x)} $$

(3)

r_{i, m} is the negative gradient of the loss for the ith sample at the mth tree. The tree h_m(x) is fitted with r_{i, m} as the target variable: fit a regression tree to the r_{i, m} values and create the leaf node regions R_{j, m} for j = 1, 2, …, J. The leaf weights are obtained by solving the minimization problem:

$$ {\gamma}_{jm}=\underset{\gamma }{\mathrm{argmin}}\sum \limits_{{x}_i\in {R}_{jm}}L\left({y}_i,{F}_{m-1}\left({x}_i\right)+\gamma \right) $$

(4)

The squared error is the loss function for the deterministic prediction:

$$ L\left({y}_i,F\left({x}_i\right)\right)=\frac{1}{2}{\left({y}_i-F\left({x}_i\right)\right)}^2 $$

(5)

For quantile regression, the expression below is used, where the quantile α lies in the range 0 to 1.

$$ L\left({y}_i,F\left({x}_i\right),\alpha \right)=\begin{cases}\alpha \left({y}_i-F\left({x}_i\right)\right), & \text{if } {y}_i\ge F\left({x}_i\right)\\ \left(\alpha -1\right)\left({y}_i-F\left({x}_i\right)\right), & \text{otherwise}\end{cases} $$

(6)
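To make Eq. (6) concrete, the sketch below implements the quantile (pinball) loss, the negative gradient that Eq. (3) yields for it, and the constant that minimizes the loss over a small sample, as in Eq. (1). The function names and the five sample values are hypothetical, and restricting the search for the constant to values present in the sample is a simplifying assumption.

```python
from typing import Sequence

def quantile_loss(y: float, f: float, alpha: float) -> float:
    """Eq. (6): pinball loss for observation y and prediction f."""
    diff = y - f
    return alpha * diff if diff >= 0 else (alpha - 1.0) * diff

def negative_gradient(y: float, f: float, alpha: float) -> float:
    """Eq. (3) applied to Eq. (6): alpha when y >= f, alpha - 1 otherwise.
    The loss is not differentiable at y == f; alpha is used there by convention."""
    return alpha if y >= f else alpha - 1.0

def best_constant(ys: Sequence[float], alpha: float) -> float:
    """Eq. (1), searching only over candidate values taken from the sample."""
    return min(ys, key=lambda c: sum(quantile_loss(y, c, alpha) for y in ys))

# Illustrative sample with one extreme value (not study data).
sample = [10.0, 12.0, 15.0, 20.0, 90.0]
print(best_constant(sample, 0.5))   # constant chosen by the quantile loss
print(sum(sample) / len(sample))    # constant chosen by the squared-error loss
```

With α = 0.5 the loss-minimizing constant is the sample median (15), whereas the squared-error loss of Eq. (5) is minimized by the mean (29.4), which the single extreme value pulls upward. This is the sense in which the quantile loss is less sensitive to outliers.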

R_{j, m} is the jth leaf node region of the mth tree, J is the number of leaves in the tree and υ is the learning rate. Update the current model:

$$ {F}_m(x)={F}_{m-1}(x)+\upsilon \sum \limits_{j=1}^J{\gamma}_{jm}I\left(x\in {R}_{jm}\right) $$

(7)

The procedure is iterative: once the first tree has been added to the model, its prediction error is taken into account when fitting the next tree, so that each new tree improves the accuracy of the ensemble. After M trees, the final model is:

$$ {F}_M(x)={F}_0(x)+\sum \limits_{m=1}^M\sum \limits_{j=1}^J\upsilon {\gamma}_{jm}I\left(x\in {R}_{jm}\right) $$

(8)
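The algorithm in Eqs. (1)–(8) can be sketched in a few dozen lines. The version below is a deliberately small illustration, not the GBM implementation used in the study: it assumes a single predictor, two-leaf trees (stumps) instead of tc = 5, no subsampling, and far fewer trees. All function names and the toy data are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple

def quantile(values: Sequence[float], alpha: float) -> float:
    """One minimiser of the quantile loss: an empirical alpha-quantile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(alpha * len(ordered)))]

def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> float:
    """Least-squares split point of a two-leaf tree on one predictor.
    Assumes xs holds at least two distinct values."""
    best_threshold, best_sse = None, float("inf")
    for threshold in sorted(set(xs))[:-1]:
        sse = 0.0
        for side in (True, False):
            leaf = [t for x, t in zip(xs, targets) if (x <= threshold) == side]
            centre = mean(leaf)
            sse += sum((t - centre) ** 2 for t in leaf)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold

def boost(xs, ys, alpha: Optional[float] = None, n_trees: int = 50, lr: float = 0.1):
    """alpha=None uses squared error (Eq. 5); otherwise the quantile loss (Eq. 6)."""
    f0 = mean(ys) if alpha is None else quantile(ys, alpha)        # Eq. (1)
    preds = [f0] * len(ys)
    stumps: List[Tuple[float, float, float]] = []
    for _ in range(n_trees):
        if alpha is None:                                          # Eq. (3)
            pseudo = [y - p for y, p in zip(ys, preds)]
        else:
            pseudo = [alpha if y >= p else alpha - 1.0 for y, p in zip(ys, preds)]
        threshold = fit_stump(xs, pseudo)
        gammas = []
        for side in (True, False):                                 # Eq. (4), per leaf
            resid = [y - p for x, y, p in zip(xs, ys, preds)
                     if (x <= threshold) == side]
            gammas.append(mean(resid) if alpha is None else quantile(resid, alpha))
        stumps.append((threshold, gammas[0], gammas[1]))
        preds = [p + lr * (gammas[0] if x <= threshold else gammas[1])
                 for x, p in zip(xs, preds)]                       # Eq. (7)
    return f0, stumps, lr

def predict(model, x: float) -> float:
    """Eq. (8): initial constant plus the shrunken leaf weights of every tree."""
    f0, stumps, lr = model
    return f0 + lr * sum(left if x <= t else right for t, left, right in stumps)

# Toy data: same-day PM10 (predictor) and next-day PM10 with one extreme value.
same_day = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
next_day = [22.0, 24.0, 33.0, 36.0, 41.0, 44.0, 120.0, 57.0]
ols_model = boost(same_day, next_day)             # squared-error loss
qr_model = boost(same_day, next_day, alpha=0.5)   # quantile loss at the median
print(predict(ols_model, 50.0), predict(qr_model, 50.0))
```

The only differences between the two fits are the initial constant, the pseudo-residuals and the leaf weights, which is why the loss function can be swapped without changing the rest of the boosting loop.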

Model evaluation

Performance indicators are used in this work to determine the accuracy and error of the BRT with different loss functions (OLS and QR). The indicators used to identify the best method for predicting PM₁₀ concentration were the root mean square error (RMSE), normalized absolute error (NAE), prediction accuracy (PA), index of agreement (IA) and coefficient of determination (R²). The RMSE and NAE measure model error, where a value closer to 0 indicates a better model. The other three indicators, i.e. IA, PA and R², measure the accuracy of the model output, where a value closer to 1 indicates higher accuracy. The equations displayed in Table 5 are those given by Ul-Saufie et al. (2015).

Table 5 Performance indicator


N = Number of hourly measurements sampled at a selected site.

P_i = Predicted values of hourly data.

O_i = Observed values of hourly data.

$ \overline{O} $ = Mean of the observed values of hourly data.

$ \overline{P} $ = Mean of the predicted values of hourly data.
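Because the equations of Table 5 are not reproduced in this text, the sketch below uses commonly cited forms of the five indicators; these are assumptions and may differ from the authors' definitions in normalisation (for example N versus N − 1). Here PA is taken as the correlation between predicted and observed values and R² as its square. The function name and sample numbers are hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence

def performance_indicators(predicted: Sequence[float],
                           observed: Sequence[float]) -> Dict[str, float]:
    """RMSE, NAE, PA, IA and R2 in commonly used (assumed) forms."""
    n = len(observed)
    o_bar = sum(observed) / n
    p_bar = sum(predicted) / n
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    covariance = sum((p - p_bar) * (o - o_bar) for p, o in zip(predicted, observed))
    spread_p = sum((p - p_bar) ** 2 for p in predicted)
    spread_o = sum((o - o_bar) ** 2 for o in observed)
    pa = covariance / sqrt(spread_p * spread_o)   # correlation form (assumption)
    ia_denominator = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2
                         for p, o in zip(predicted, observed))
    return {
        "RMSE": sqrt(squared_error / n),
        "NAE": sum(abs(p - o) for p, o in zip(predicted, observed)) / sum(observed),
        "PA": pa,
        "IA": 1.0 - squared_error / ia_denominator,
        "R2": pa ** 2,
    }

# Illustrative values only (not study data).
scores = performance_indicators([12.0, 18.0, 33.0], [10.0, 20.0, 30.0])
print(scores)
```

RMSE and NAE shrink towards 0 as predictions improve, while PA, IA and R² approach 1, matching the interpretation given above.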

Results and discussion

The descriptive statistics and box plots for maximum daily PM₁₀ concentrations in Alor Setar, Klang and Kota Bharu from 2002 to 2017 are shown in Fig. 3. PM₁₀ concentrations were very high in Klang, Selangor, with a maximum of 643 μg/m³, over the threshold limit of 150 μg/m³, followed by Alor Setar (385 μg/m³) and Kota Bharu (198 μg/m³). This relates to the fact that Klang is the 13th busiest shipping port and the 16th busiest port in the world. Klang is among the densely populated and developed areas in Malaysia, with many industries and business activities in Port Klang. Alor Setar, Klang and Kota Bharu experienced high particulate events as well as extreme events that raised PM₁₀ concentrations, as reflected in the skewness values for Alor Setar (4.03), Klang (4.89) and Kota Bharu (1.72). As described in Shaziayani et al. (2018), a distribution is highly skewed if the skewness is less than −1 or greater than +1. The box plot shows that Alor Setar experienced its highest PM₁₀ concentration in 2016. According to the DOE, this condition was caused by land and forest fires in Central Sumatra, Indonesia, with the smoke carried by the Southwest Monsoon winds. Klang reached its highest PM₁₀ level during the haze emergency declared on 11 August 2005, when the Air Pollution Index (API) exceeded 500. Owing to massive land and forest fires in Sumatra and Kalimantan, Indonesia, Kota Bharu suffered degraded air quality during the Southwest Monsoon from August to September 2015.

The MAAQG control values for CO, NO₂, O₃, PM₁₀ and SO₂ are 8750 ppb (8-h mean reading), 40 ppb (24-h mean reading), 60 ppb (8-h mean reading), 50 μg/m³ (24-h mean reading) and 40 ppb (24-h mean reading), respectively. The descriptive statistics for Alor Setar (mean, median, standard deviation, skewness, kurtosis and maximum) are listed in Table 6. The mean values of the five air pollutants, PM₁₀ (41.99 μg/m³), O₃ (34.27 ppb), CO (560.30 ppb), NO₂ (15.20 ppb) and SO₂ (1.05 ppb), indicate that the average concentrations in Alor Setar over the 16 years from 2002 to 2017 were below the Malaysia Ambient Air Quality Guidelines (MAAQG). The mean values of the meteorological parameters were RH (89.35%), T (32.42 °C) and WS (10.53 km/h). Skewness is positive for all air pollutants: the values for CO, NO₂, O₃, PM₁₀ and SO₂ are 1.71, 1.10, 0.82, 4.03 and 0.99, respectively, indicating the existence of extreme events.

Table 6 Descriptive statistics for Alor Setar, Kedah


Table 7 summarizes the descriptive statistics for the maximum daily data of all parameters at Klang for 2002 to 2017. The mean values over the 16 years are higher than the respective medians, which indicates that the pollutant distributions are positively skewed (also called right-skewed). The maximum values of the air pollutants were 643 μg/m³ for PM₁₀, 127 ppb for O₃, 10,500 ppb for CO, 128 ppb for NO₂ and 150 ppb for SO₂. Klang has the highest mean and median values compared to the other locations. This may be because extensive industry operates in Port Klang, the most densely populated and developed region in Malaysia (AL-Dhurafi et al. 2017, 2018). Despite having the highest central tendency values, it has the smallest standard deviation, indicating that this area has continuously encountered very high concentrations.

Table 7 Descriptive statistics for Klang, Selangor


Table 8 presents the descriptive analysis of the air pollutant concentrations and meteorological parameters for Kota Bharu, Kelantan. The mean values for PM₁₀ (48.73 μg/m³), O₃ (29.21 ppb), CO (926.26 ppb) and NO₂ (15.15 ppb) were higher than the corresponding medians. The distributions of these measurements were therefore skewed to the right, indicating that several high air pollutant concentrations were observed in the years 2002–2017. Meanwhile, the mean values for RH (91.86%) and T (31.36 °C) were lower than the medians, which indicates that these distributions were skewed to the left. These results suggest that the weather in Kota Bharu was mainly hot and dry, with less humid conditions observed during the study period.

Table 8 Descriptive statistics for Kota Bharu, Kelantan


The relative influence (RI) was computed to identify the strength of the relationship between each predictor and the response variable. According to Sayegh et al. (2016), the BRT modelling technique can be used to identify the influence of different predictors on the response variable. The most important predictor of the maximum daily PM₁₀ concentration for the next day (D + 1) was the PM₁₀ concentration of the previous day, with a relative influence of 90.17% in Alor Setar, 59.72% in Kota Bharu and 54.68% in Klang. The previous day’s PM₁₀ concentration thus played a remarkable role, explaining more than 50% of the variance in the BRT model. The least important predictor was SO₂, with 0.30% in Alor Setar, 2.77% in Kota Bharu and 3.02% in Klang (Fig. 4).

Table 9 compares the BRT models fitted with the OLS loss function using the independent test set (Test), 10-fold CV and OOB methods. Performance indicators were used to assess the fit of the BRT models in order to determine which method better predicts PM₁₀ concentration in Alor Setar, Klang and Kota Bharu up to 3 days ahead. This study predicts up to 3 days ahead because, according to Perimula (2012), the government is able to announce a warning status if the API exceeds 101 for more than 72 h.

Table 9 Comparison method of best iteration (OLS)


The best BRT models with the OLS loss function, i.e. those with the lowest total ranking, are shown in Table 10. For error measures, the values are ranked from the smallest (rank = 1) to the largest (rank = 3), and for accuracy measures, from the largest (rank = 1) to the smallest (rank = 3). The ranks are then summed to give the total ranking. This procedure was repeated up to the next 3-day (D + 3) prediction to decide the best BRT models for the three stations in this study.
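The ranking procedure can be sketched as follows, assuming the total ranking is the sum of the per-indicator ranks and that ties are broken by the order in which the methods are listed (the source does not say how ties are handled). The scores in the example are invented for illustration and are not the values of Table 10.

```python
from typing import Dict

ERROR_METRICS = ("NAE", "RMSE")        # smaller is better: rank 1 = smallest
ACCURACY_METRICS = ("IA", "PA", "R2")  # larger is better: rank 1 = largest

def total_ranks(scores: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Sum each method's rank over all five performance indicators."""
    totals = {method: 0 for method in scores}
    for metric in ERROR_METRICS + ACCURACY_METRICS:
        larger_is_better = metric in ACCURACY_METRICS
        ordered = sorted(scores, key=lambda m: scores[m][metric],
                         reverse=larger_is_better)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return totals

# Hypothetical indicator values for the three tree-selection methods.
example = {
    "Test": {"NAE": 0.15, "RMSE": 10.3, "IA": 0.91, "PA": 0.84, "R2": 0.70},
    "CV":   {"NAE": 0.16, "RMSE": 10.6, "IA": 0.90, "PA": 0.83, "R2": 0.68},
    "OOB":  {"NAE": 0.17, "RMSE": 10.9, "IA": 0.89, "PA": 0.82, "R2": 0.67},
}
totals = total_ranks(example)
best_method = min(totals, key=totals.get)   # lowest total ranking wins
print(totals, best_method)
```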

Table 10 Ranking of performance indicators for the BRT model to predict D + 1 PM₁₀ concentration


The results show that, for the next-day prediction, the independent test set is better than OOB and CV at all sites. The coefficient of determination (R²) for Alor Setar, Klang and Kota Bharu was 0.70, 0.60 and 0.65, respectively, while the RMSE was 10.35, 22.13 and 10.27, respectively. The R² values between the observations and the fitted model, which indicate how well the BRT model fits, were all above 0.5, suggesting that the model is appropriate for next-day prediction using an independent test set.

For the second-day prediction, a comparison based on the lowest error (NAE and RMSE) and comparable IA, PA and R² values indicates that the best method differs by site: the independent test set for Alor Setar, CV for Klang and OOB for Kota Bharu.

For the next 3-day prediction, however, CV is the best method for Alor Setar and Klang, whereas the independent test set is the best method for predicting PM₁₀ concentration in Kota Bharu. Overall, the models performed better for the next-day prediction than for the next 2-day and next 3-day predictions.

The descriptive analysis shows that the data for this study are non-central because they contain outliers; this study therefore uses quantile regression, as explained by Kudryavtsev (2009). Performance indicators were used to identify the best quantile for predicting the next-day (D + 1) PM₁₀ concentration at Alor Setar, as summarized in Table 11. Of the five performance indicators, NAE and IA indicate that the 0.5 quantile gave a better fit than the other quantiles, although the values differed by just 0.01 from those of the 0.55 quantile. However, RMSE, PA and R² show that the 0.55 quantile is the best quantile for the PM₁₀ concentration models. The 0.55 quantile was therefore used in the PM₁₀ concentration models for the OOB method. For the CV and Test methods, the results show that the 0.5 quantile gave a better fit than the other quantiles.

Table 11 Performance indicators for PM₁₀ concentration prediction (D + 1)


After the quantile giving the best next-day PM₁₀ concentration prediction models had been chosen, the same process was repeated to find the best quantile for the next 2-day and next 3-day predictions at all selected locations.

The best quantile for predicting the next 2-day (D + 2) PM₁₀ concentration at Alor Setar is reported in Table 12. The results show that 0.5 is the best quantile for the CV and Test methods for the next 2-day prediction, while for OOB it is 0.4. The chosen quantile for the next 3-day prediction at Alor Setar is given in Table 13. All methods (OOB, CV and Test) gave the same result, with 0.55 as the best quantile.

Table 12 Performance indicators for PM₁₀ concentration prediction (D + 2)


Table 13 Performance indicators for PM₁₀ concentration prediction (D + 3)


After the best weighting had been selected for OOB, CV and Test, the next step was to determine the best method for the next-day, next 2-day and next 3-day predictions. By repeating the same procedure for the proposed PM₁₀ concentration prediction method, the best weighting function for each of the three horizons was identified for all three monitoring stations, as shown in Table 14.

Table 14 Comparing the result between quantile regression


The best prediction model for next-day PM₁₀ concentration in Alor Setar is OOB (quantile = 0.55), with errors of 0.1464 (NAE) and 9.3260 (RMSE) and accuracies of 0.9177 (IA), 0.8546 (PA) and 0.7291 (R²). For Klang and Kota Bharu, CV (quantile = 0.5) is the best method. The CV and Test models were selected to predict the PM₁₀ concentration for the next 2 days, while for the next 3 days only Alor Setar shows OOB (quantile = 0.55) as the best method, with performance indicators of 0.2463 (NAE), 14.4598 (RMSE), 0.6496 (IA), 0.5553 (PA) and 0.3078 (R²). Overall, quantile values of 0.5, 0.55 and 0.6 gave the best results when combined with the BRT method.

The best loss function for each monitoring station can be identified from the performance indicator results in Table 15. Across the five performance indicators, QR was slightly better than OLS at all sites. This is supported by Khan et al. (2019), who state that QR can be used for the prediction of extreme events.

Table 15 Comparing the best performance of statistical models for predicting PM₁₀ concentration


Norazrin et al. (2018) investigated the Bayesian regression model using a conjugate prior distribution and obtained RMSE (4.66 to 9.88), IA (0.900 to 0.929), PA (0.830 to 0.866) and R² (0.614 to 0.665). Park et al. (2018) predicted PM₁₀ concentration in Seoul metropolitan subway stations using an artificial neural network (ANN) model and reported R² of 0.39 to 0.81. Abdullah et al. (2020) reported RMSE (126.73–164.98) and NAE (0.33–0.43) using multiple linear regression for PM₁₀ forecasting during an episodic trans-boundary haze event in Malaysia. In addition, Shaziayani et al. (2018) reported that feed forward back propagation performs better than the general regression neural network in Seberang Jaya, Pulau Pinang, with an IA of as much as 0.7796 for the next-day, 0.6033 for the next 2-day and 0.8024 for the next 3-day predictions.

Overall, this implies that the performance indicator values of this study are comparable to those of previous researchers. This paper shows that α values of 0.5, 0.55 and 0.60 are the best quantiles, as recommended by Ul-Saufie et al. (2012), which is appropriate for data on air pollution in Malaysia. The proposed model can therefore be used as an alternative method to predict the concentration of PM₁₀ in Malaysia.

Figure 5 compares the observed and predicted values for Alor Setar, Kota Bharu and Klang on the validation data set. The optimum settings from the training data set were a learning rate of 0.01 and 10,000 iterations. Using the optimum values found in the training process, the accuracy of the BRT prediction was found to be 60.33 to 91.77%.

Conclusion

Overall, these results indicate that quantile regression fulfils the assumptions and gives a good BRT model for predicting maximum daily PM₁₀ concentration. The values of NAE (0.15–0.17), RMSE (9.33–22.25), R² (0.60–0.73), IA (0.85–0.92) and PA (0.78–0.85) were good for the next-day predictions. Most of the results used 0.5 as the best quantile, which represents the median of the data, but 0.55 and 0.6 were also chosen as the best quantile because the corresponding models have more outliers than the other models. Overall, the best quantiles tended to be at or above the median value (0.5). In conclusion, QR is an alternative loss function for BRT for predicting PM₁₀ concentration up to 3 days ahead at all sites and is suitable for data containing influential outliers. This model can help local authorities take action to reduce the effects of haze in Malaysia.

Data availability

The data for this project are confidential, but may be obtained with Data Use Agreements with the Department of Environment (DOE), Ministry of Environment and Water of Malaysia.

References

Abdullah S, Ismail M, Fong SY, Ahmed AMAN (2016) Evaluation for long term PM10 concentration forecasting using multi linear regression (MLR) and principal component regression (PCR) models. EnvironmentAsia 9:101–110. https://doi.org/10.14456/ea.2016.13
Abdullah S, Ismail M, Fong SY, Ahmed AMAN (2017) Evaluation for long term PM10 concentration forecasting using multi linear regression (MLR) and principal component regression (PCR) models. Environ Asia 9:101–110
Abdullah S, Napi NNLM, Ahmed AN, Mansor WNW, Mansor AB, Ismail M, Abdullah AM, Ramly ZTA (2020) Development of multiple linear regression for particulate matter (PM10) forecasting during episodic transboundary haze event in Malaysia. Atmosphere 11:1–14. https://doi.org/10.3390/atmos11030289
AL-Dhurafi N, Masseran N, Zamzuri ZH, Razali AM (2017) Modeling unhealthy Air Pollution Index using a peaks-over-threshold method. Environ Eng Sci 35:101–110
AL-Dhurafi NA, Masseran N, Zamzuri ZH (2018) Compositional time series analysis for Air Pollution Index data. Stochastic Environ Res Risk Assess 32(10):2903–2911
Azmi SZ, Latif MT, Ismail AS, Juneng L, Jemain AA (2010) Trend and status of air quality at three different monitoring stations in the Klang Valley, Malaysia. Air Qual Atmos Health 3:53–64. https://doi.org/10.1007/s11869-009-0051-1
Brunelli U, Piazza V, Pignato L, Sorbello F, Vitabile S (2007) Two-days ahead prediction of daily maximum concentrations of SO₂, O₃, PM₁₀, NO₂, CO in the urban area of Palermo, Italy. Atmos Environ 41:2967–2995
Chelani AB, Gajghate DG, Hasan MZ (2002) Prediction of ambient PM₁₀ and toxic metals using artificial neural networks. J Air Waste Manage Assoc 52:805–810
Corani G (2005) Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecol Model 185:513–529
DOE (2018) Department of Environment, Malaysia. Malaysia Environmental Quality Report 2018. Kuala Lumpur: Ministry of Energy, Science, Technology, Environment and Climate Change, Malaysia
Fernando HJS, Mammarella MC, Grandoni C, Fedele P, Di Marco R, Dimitrova R, Hyde P (2012) Forecasting PM₁₀ in metropolitan areas: efficacy of neural networks. Environ Pollut 163:62–67
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Friedman JH (2002) Stochastic gradient boosting. Computational Stat Data Anal 38:367–378
Grunwald L, Schneider AK, Schröder B, Weber S (2020) Predicting urban cold-air paths using boosted regression trees. Landscape Urban Planning:201. https://doi.org/10.1016/j.landurbplan.2020.103843
Gu H, Wang J, Ma L, Shang Z, Zhang Q (2019) Insights into the BRT (boosted regression trees) method in the study of the climate-growth relationship of Masson pine in subtropical China. Forests 10:1–20. https://doi.org/10.3390/f10030228
Huijnen V, Wooster MJ, Kaiser JW, Gaveau DLA, Flemming J, Parrington M, Inness A, Murdiyarso D, Main B, Van Weele M (2016) Fire carbon emissions over maritime Southeast Asia in 2015 largest since 1997. Sci Rep 6
Juneng L, Latif MT, Tangang F (2011) Factors influencing the variations of PM10 aerosol dust in Klang Valley, Malaysia during the summer. Atmos Environ 45:4370–4378. https://doi.org/10.1016/j.atmosenv.2011.05.045
Kamarul Zaman NAF, Kanniah KD, Kaskaoutis DG (2017) Estimating particulate matter using satellite based aerosol optical depth and meteorological variables in Malaysia. Atmos Res 193:142–162. https://doi.org/10.1016/j.atmosres.2017.04.019
Khan N, Shahid S, Juneng L, Ahmed K, Ismail T, Nawaz N (2019) Prediction of heat waves in Pakistan using quantile regression forests. Atmos Res 221:1–11. https://doi.org/10.1016/j.atmosres.2019.01.024
Kudryavtsev AA (2009) Using quantile regression for rate-making. Insurance, Math Econ 45:296–304
Latif MT, Othman M, Idris N, Juneng L, Abdullah AM, Hamzah WP, Khan MF, Sulaiman NMN, Jewaratnam J, Aghamohammadi N, Sahani M, Xiang CJ, Ahamad F, Amil N, Darus M, Varkkey H, Tangang F, Jaafar AB (2018) Impact of regional haze towards air quality in Malaysia. A review. Atmos Environ 177:28–44. https://doi.org/10.1016/j.atmosenv.2018.01.002
Leong WC, Kelani RO, Ahmad Z (2020) Prediction of Air Pollution Index (API) using support vector machine (SVM). J Environ Chem Eng 8:103208
Lingxin H, Naiman DQ (2007) Quantile regression. United Kingdom: Sage Publications
Liu W, Li X, Chen Z, Zeng G, León T, Liang J, Huang G, Gao Z, Jiao S, He X, Lai M (2015) Land use regression models coupled with meteorology to model spatial and temporal variability of NO₂ and PM₁₀ in Changsha, China. Atmos Environ 116:272–280
Lu WZ, Wang WJ, Wang XK, Yan SH, Lam JC (2004) Potential assessment of a neural network model with PCA/RBF approach for forecasting pollutant trends in Mong Kok urban air, Hong Kong. Environ Res 96:79–87
Martinez-Munoz G, Suarez A (2010) Out-of-bag estimation of the optimal sample size in bagging. Pattern Recognit 43:143–152
McKendry IG (2002) Evaluation of artificial neural networks for fine particulate pollution (PM₁₀ and PM₂.₅) forecasting. J Air Waste Manage Assoc 52:1096–1101
Motevalli A, Naghibi SA, Hashemi H, Berndtsson R, Pradhan B, Gholami V (2019) Inverse method using boosted regression tree and k-nearest neighbour to quantify effects of point and non-point source nitrate pollution in groundwater. J Cleaner Prod 228:1248–1263. https://doi.org/10.1016/j.jclepro.2019.04.293
Navares R, Aznarte JL (2020) Predicting air quality with deep learning LSTM: towards comprehensive models. Ecol Inform 55:101019
Nejadkoorki F, Baroutian S (2012) Forecasting extreme PM10 concentrations using artificial neural networks. Int J Environ Res 6:277–284
Noor NM, Yahaya AS, Ramli NA, Abdullah MMAB (2014) Mean imputation techniques for filling the missing observations in air pollution dataset. Key Eng Mater 594-595:902–908
Noor NM, Yahaya AS, Ramli NA, Abdullah MMAB (2015) Filling the missing data of air pollutant concentration using single imputation methods. Appl Mech Mater 754–755:923–932. https://doi.org/10.4028/www.scientific.net/amm.754-755.923
Norazrin R, Yahaya AS, Hamid AH, Shukri A, Abdul H (2018) Predicting PM10 concentration using Bayesian regression with non-informative prior and conjugate prior model. Engineering Sci Res 3(2):59–65. https://doi.org/10.26666/rmp.jesr.2018.2.9
Park S, Kim M, Kim M, Namgung HG, Kim KT, Cho KH, Kwon SB (2018) Predicting PM10 concentration in Seoul metropolitan subway stations using artificial neural network (ANN). J Hazard Mater 341:75–82. https://doi.org/10.1016/j.jhazmat.2017.07.050
Perez P (2012) Combined model for PM₁₀ forecasting in a large city. Atmos Environ 60:271–276
Perimula Y (2012). HAZE: steps taken to reduce hot spots. New Strait Times. Online: http://www.nst.com.my/opinion/letters-to-the-editor/haze-steps-taken-to-reduce-hot-spots-1.98115. Accessed 10 October 2012
Popescu M, Ilie C, Panaitescu L, Lungu ML, Ilie M, Lungu D (2013) Artificial neural networks forecasting of the PM₁₀ quantity in London considering the Harwell and Rochester stoke PM₁₀ measurements. J Environ Prot Ecol 14:1473–1481
Reddington CL, Yoshioka M, Balasubramaniam R, Ridley D, Toh DY, Arnold SR, Spracklen DV (2014) Environ Res Lett 9:1–12
Ridgeway G (2007). Generalized boosted models: a guide to the gbm package
Ridgeway G (2010) GBM: generalized boosted regression models. R packages version 1:6–3.1
Ridgeway G (2012). gbm: Generalized Boosted Regression Models. R package. TRL, 2007. Primary NO2 Emissions from Road Vehicles in the Hatfield and Bell Common Tunnels. Published Project Report PPR262. TRL, 2011. The Highways Agency Roadside Air Pollution Monitoring Network Report 2010 1
Ridgeway G (2017). Gbm: generalized boosted regression models. R Package Version 2.1.3. https://CRAN.R-project.org/package=gbm
Ridgeway G (2020) Generalized boosted models: a guide to the gbm package. Compute 1:1–12
Sahani M, Zainon NA, Mahiyuddin WWR, Latif MT, Hod R, Khan MF, Tahir NM, Chan CC (2014) A case-crossover analysis of forest fire haze events and mortality in Malaysia. Atmos Environ 96:257–265
Sapini ML, Rahim NZBA, Noorani MSM (2015) The behaviour of PM10 and ozone in Malaysia through non-linear dynamical systems. AIP Conference Proceedings 1682. https://doi.org/10.1063/1.4932452
Sayegh A, Tate JE, Ropkins K (2016) Understanding how roadside concentrations of NOx are influenced by the background levels, traffic density, and meteorological conditions using boosted regression trees. Atmos Environ 127:163–175. https://doi.org/10.1016/j.atmosenv.2015.12.024
Schlink U, Thiem A, Kohajda T, Richter M, Strebel K (2010) Quantile regression of indoor air concentrations of volatile organic compound (VOC). Sci Total Environ 408:3840–3851
Shaziayani WN, Ul-Saufie AZ, Ahmat H (2018) A 24-hour forecasting of PM10 concentration in urban area. https://doi.org/10.1063/1.5054208
Ul-Saufie AZ, Yahaya AS, Ramli A, Hamid HA (2012a) Future PM10 concentration prediction using quantile regression models. Ipcbee 37:15–19
Ul-Saufie AZ, Yahaya AS, Ramli A, Hamid HA (2012b) Robust regression models for predicting PM10 concentration in an industrial area. Int J Eng Technol 2:364–370
Ul-Saufie AZ, Yahaya AS, Ramli A, Hamid HA (2015) PM10 concentrations short term prediction using feedforward backpropagation and general regression neural network in a sub-urban area. J Environ Sci Technol 8:59–73. https://doi.org/10.3923/jest.2015.59.73
Viotti P, Liuti G, Di Genova P (2002) Atmospheric urban pollution: applications of an artificial neural network (ANN) to the city of Perugia. Ecol Model 148:27–46. https://doi.org/10.1016/S0304-3800(01)00434-3
Yahaya NZ, Ibrahim ZF, Yahaya J (2019) The used of the boosted regression tree optimization technique to analyse an air pollution data. Int J Recent Technol Eng 8:1565–1575. https://doi.org/10.35940/ijrte.b3807.118419
Zakri NL, Saudi ASM, Juahir H, Toriman ME, Abu IF, Mahmud MM, Khan MF (2018) Identification source of variation on regional impact of air quality pattern using chemometric techniques in Kuching, Sarawak. Int J Eng Technol 7:49

Acknowledgements

The authors thank Universiti Teknologi MARA for its support and the Department of Environment Malaysia for providing air quality monitoring data.

Funding

The research was funded by 600-IRMI/FRGS 5/3 (289/2019).

Author information

Authors and Affiliations

Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 13500, Bukit Mertajam, Pulau Pinang, Malaysia
Wan Nur Shaziayani
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 40450, Shah Alam, Selangor, Malaysia
Ahmad Zia Ul-Saufie & Hasfazilah Ahmat
Faculty of Engineering and Technology, Liverpool John Moores University, Liverpool, UK
Dhiya Al-Jumeily

Corresponding author

Correspondence to Ahmad Zia Ul-Saufie.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Shaziayani, W.N., Ul-Saufie, A.Z., Ahmat, H. et al. Coupling of quantile regression into boosted regression trees (BRT) technique in forecasting emission model of PM₁₀ concentration. Air Qual Atmos Health 14, 1647–1663 (2021). https://doi.org/10.1007/s11869-021-01045-3

Received: 17 February 2021
Accepted: 10 May 2021
Published: 24 May 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s11869-021-01045-3



Introduction