River water quality index prediction and uncertainty analysis: A comparative study of machine learning models

https://doi.org/10.1016/j.jece.2020.104599

Highlights

  • A new ensemble machine learning model is developed to predict the Water Quality Index (WQI).

  • Observed water quality variables in the Lam Tsuen River in Hong Kong are used to predict the WQI.

  • The prediction uncertainty associated with model structure and input variable selection is quantified.

  • The three models considered, and the ETR model in particular, provide accurate WQI predictions.

Abstract

The Water Quality Index (WQI) is the most common indicator used to characterize surface water quality. This study introduces a new ensemble machine learning model called Extra Tree Regression (ETR) for predicting monthly WQI values at the Lam Tsuen River in Hong Kong. The ETR model performance is compared with that of the classic standalone models, Support Vector Regression (SVR) and Decision Tree Regression (DTR). Monthly water quality data, including Biochemical Oxygen Demand (BOD), Chemical Oxygen Demand (COD), Dissolved Oxygen (DO), Electrical Conductivity (EC), Nitrate-Nitrogen (NO₃-N), Nitrite-Nitrogen (NO₂-N), Phosphate (PO₄³⁻), potential for Hydrogen (pH), Temperature (T) and Turbidity (TUR), are used to build the prediction models. Various input data combinations are investigated and assessed in terms of prediction performance, using numerical indices and graphical comparisons. The analysis shows that the ETR model generally produces more accurate WQI predictions for both training and testing phases. Although including all ten input variables achieves the highest prediction performance (testing R² = 0.98, testing RMSE = 2.99), a combination of only BOD, Turbidity and Phosphate concentration provides the second highest prediction accuracy (testing R² = 0.97, testing RMSE = 3.74). The uncertainty analysis relative to model structure and input parameters highlights a higher sensitivity of the prediction results to the former. In general, the ETR model represents an improvement on previous approaches for WQI prediction, in terms of prediction performance and reduction in the number of input parameters.

Introduction

Surface water in rivers plays a vital role in the environment, social health and economic development [[1], [2], [3]]. Various elements affect water quality in rivers, including natural factors such as rainfall and erosion, and human factors such as urban, agricultural and manufacturing practices [[4], [5], [6]]. Because surface water is the primary source of freshwater worldwide, its degradation can lead to significant consequences on drinking water availability and, more generally, on economic development and future strategies [[7], [8], [9]]. The interaction of rivers with the surrounding environment, with the related exchange of urban, industrial and agricultural contaminants along their path, leads to water pollution [10,6,11,12].

The Water Quality Index (WQI), as a classification indicator, has been widely used to evaluate and categorize the quality of both ground and surface waters and plays a significant role in water resources management [[13], [14], [15]]. This index integrates several physical and chemical factors into a single parameter that quantifies the state of water quality [16,17]. Computing this indicator provides an efficient approach for water quality assessment.

The application of the WQI was first presented by [18,19] and later adopted and modified by numerous practitioners [[20], [21], [22]]. All these applications originated from the general concept of Water Quality Index proposed by the National Sanitation Foundation (NSF) of the United States, and the NSFWQI [18] is the most commonly adopted method worldwide.

In general, WQI formulations involve lengthy calculations and thus require considerable time and effort. Also, the conventional methods for computing the WQI need a large amount of physical and chemical data, typically at daily intervals. Hence, alternative approaches to calculate the WQI in a computationally efficient and accurate way are required. Such an improvement may benefit environmental engineers when monitoring and assessing water quality.

Over the past few decades, Artificial Intelligence (AI), in the form of machine learning models, has been increasingly applied to solve various environmental engineering problems, including river water quality modeling [6,[23], [24], [25], [26]]. Machine learning models represent a significant innovation in the research on monitoring and control of several engineering processes [[27], [28], [29]] and their algorithms can be used to carry out accurate predictions without the need for complex programming. Machine learning models are ultimately based on data mining and the identification of patterns in data: algorithms are constructed using one subset of the dataset (training set) and their prediction performance is verified using a separate subset (testing set) [[30], [31], [32]].
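
As a minimal sketch of the training/testing partition described above, the following Python snippet splits a synthetic monthly record into a training subset and a held-out testing subset. The 80/20 ratio, the chronological ordering, the helper name `chronological_split` and the synthetic data are illustrative assumptions, not details taken from the study.

```python
import numpy as np


def chronological_split(X, y, train_fraction=0.8):
    """Return leading rows as the training set and trailing rows as the testing set."""
    n_train = int(len(X) * train_fraction)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]


# Synthetic stand-in for a monthly record: 120 months, 3 input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

X_train, X_test, y_train, y_test = chronological_split(X, y)
print(X_train.shape, X_test.shape)
```

Keeping the testing rows strictly separate from the rows used for fitting is what makes the testing-phase metrics an estimate of performance on unseen data.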

Our literature review shows considerable attention to WQI simulation using AI models [33]. Tripathi and Singal [34] employed the Principal Components Analysis (PCA) method to select the optimal input variable combination and provide an innovative Water Quality Index computation method for the Ganges river in India. Using this method, they could reduce 28 parameters to only nine, including Electrical Conductivity (EC), power of Hydrogen (pH), Dissolved Oxygen (DO), Total Dissolved Solids (TDS), Sulfate (SO4), Magnesium (Mg), Chlorine (Cl), Total Coliform (TC) and Biochemical Oxygen Demand (BOD), which led to a dramatic reduction in computational time. Zali et al. [35] addressed the effects of six major input parameters, namely DO, BOD, Chemical Oxygen Demand (COD), pH, Nitrate (NO3) and Suspended Solids (SS), on the Water Quality Index using Artificial Neural Networks (ANNs). Conducting a sensitivity analysis, they identified the relative importance of each parameter in WQI determination and concluded that DO, SS, and NO3 are the key input parameters. Nigam and SM [36] used a fuzzy-based model to calculate the ground Water Quality Index and compared its prediction performance with other common calculation methods, finding that the fuzzy-based model outperforms them in prediction and water quality classification. Srinivas and Singh [37] developed a novel fuzzy decision-making method to predict WQI in rivers, employing an Interactive Fuzzy model (IFWQI); their findings indicate a significant improvement in WQI prediction performance of their proposed model compared to the classic fuzzy method. Yaseen et al. [38] assessed the WQI prediction performance of hybrid methods based on Adaptive Neuro-Fuzzy Inference System (ANFIS), namely ANFIS-FCM (Fuzzy C-Means data clustering), ANFIS-GP (Grid Partition), and ANFIS-SC (Subtractive Clustering); they observed ANFIS-SC to be the best performing model. Hameed et al. [39] used two ANN models, namely Back-Propagation Neural Network (BPNN) and Radial Basis Function Neural Network (RBFNN), for describing the relationship between WQI and several chemical parameters (i.e. BOD, COD, DO, Nitrate, pH, Suspended Solids) in tropical environments, and obtained better prediction results with the RBFNN model.

While typical AI models based on ANN and ANFIS are widely developed for WQI simulation, environmental scientists have been exploring other robust and reliable AI models [25,26,33]. Tree-based models, such as Decision Trees (DTs), are another popular approach applied successfully for various hydrological and environmental problems, such as rainfall forecasting [40]. The Support Vector Machine (SVM) model is also acknowledged as a powerful machine learning technique for both linear and nonlinear regression problems and has been used in various scientific issues with high prediction accuracy [[41], [42], [43]]. Concerning the application of decision-tree and support vector regression models for water quality parameter prediction, Granata et al. [44] developed a Support Vector Regression (SVR) model and a Regression Tree (RT) model to predict concentrations of Total Suspended Solids (TSS), Total Dissolved Solids (TDS), COD and BOD; they found the SVR model to provide the best predictions. Li et al. [45] proposed a hybrid SVR model with FireFly Algorithm (FFA) to predict WQI using monthly water quality parameter data, showing a significant improvement in prediction performance when compared to the standalone SVR model. Kamyab-Talesh et al. [46] studied the optimization of the SVM model to identify the parameters that mostly affect the WQI and found that Nitrate is the most important parameter for WQI prediction. Wang et al. [47] investigated the performance of three machine learning models, SVR, SVR-GA (Genetic Algorithm) and SVR-PSO (Particle Swarm Optimization), to predict WQI by employing the spectral indicators Difference Index (DI), Normalized DI and Ratio Index (RI) which were obtained from remote sensing, and found the SVR-PSO to be the best performing model.

Although various AI models have been introduced for modeling WQI, several limitations are still observed, such as model hyperparameter tuning, the stochastic characteristics of the case study, and model flexibility. Ensemble machine learning models offer a way to overcome those limitations and have recently been gaining popularity [48]. Several ensemble bagging and boosting models are available, such as MadaBoost, Gradient Boost, AdaBoost and Random Forest, and have proved to be significantly useful tools for prediction [49,50]. More recently developed models, such as Extra Tree or Bagging Regression, are simpler to code and have shown potential to replace other classic AI algorithms [51,52]. Ensemble approaches combine standalone models, such as SVM and Decision Trees, to construct a better performing prediction model [[53], [54], [55]]. The application of ensemble machine learning models for water quality studies has been limited and is the focus of this investigation. Specifically, this study presents a novel ensemble machine learning model, namely the Extra Tree Regression (ETR) ensemble method, to overcome several drawbacks in the available approaches for WQI prediction, such as the large number of input variables and their uncertainty.

Numerous benchmark models, such as ANNs and ANFIS, could be compared with the ETR model; however, two standalone machine learning models, namely a Decision Tree Regression (DTR) model, which is based on a tree-based algorithm, and a Support Vector Regression (SVR) model, are adopted as benchmarks in the present study. Using the Python programming language in the Anaconda data science platform, the prediction performance of the three models is quantified and compared, and an uncertainty analysis relative to model structure and input parameter selection is also carried out.
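
The three-model comparison outlined above can be illustrated with a short Python sketch. It assumes scikit-learn as the modeling library (the source names only Python and Anaconda), uses synthetic data rather than the Lam Tsuen River record, and picks hyperparameters arbitrarily; none of these choices should be read as the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 samples, 10 input variables (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 10))
y = (90.0 - 3.0 * X[:, 0] - 2.0 * X[:, 1] + np.sin(X[:, 2])
     + rng.normal(scale=1.0, size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters below are illustrative, not tuned values from the study.
models = {
    "ETR": ExtraTreesRegressor(n_estimators=200, random_state=0),
    "DTR": DecisionTreeRegressor(random_state=0),
    # SVR is sensitive to feature scale, so inputs are standardized first.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = {"R2": float(r2_score(y_test, pred)), "RMSE": rmse}
    print(f"{name}: R2={results[name]['R2']:.3f}, RMSE={rmse:.3f}")
```

Fitting all three models on the same split keeps the testing-phase R² and RMSE directly comparable across models.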

Study area and data collection

The Lam Tsuen River (Fig. 1) in Hong Kong is considered for the present study. The river flows from west of Tai Po, in the New Territories of Hong Kong, into the Tolo Harbour. The Lam Tsuen River, which is the second-longest river in Hong Kong, has a length of about 10.8 km and its catchment area is about 21 km². The river originates from the north of the Tai Mo mountaintop of the Sei Fong Shan Mountains, which is the tallest mountaintop in Hong Kong with a height of 740 m above sea level.

Methodology

Fig. 2 shows an overview of the methodology used in the present study, described in detail in the following sections.

Assessment of predictive models

In this study, several input parameter combinations are investigated to determine the optimal combination for WQI prediction. Table 2 presents the values of the Pearson correlation coefficient between the various input parameters considered and the WQI. Note that, while other techniques exist to evaluate input variable combinations, such as Cross-Correlation Function (CCF), Partial Auto-Correlation Function (PACF), Auto-Correlation Function (ACF) and Gamma Technique (GT), the Pearson
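
To illustrate the Pearson-based screening of candidate inputs mentioned above, here is a minimal Python sketch that computes the correlation coefficient between each input variable and the WQI and ranks the inputs by absolute correlation. The variable names, the synthetic data and the helper `pearson_r` are assumptions made for illustration; they do not reproduce Table 2.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Synthetic monthly inputs (hypothetical values, not the observed record).
rng = np.random.default_rng(1)
n = 120
inputs = {
    "BOD": rng.gamma(2.0, 2.0, n),
    "TUR": rng.gamma(2.0, 5.0, n),
    "pH": rng.normal(7.5, 0.3, n),
}
# Synthetic WQI that depends strongly on BOD and weakly on turbidity.
wqi = 95.0 - 4.0 * inputs["BOD"] - 0.5 * inputs["TUR"] + rng.normal(0.0, 1.0, n)

correlations = {name: pearson_r(values, wqi) for name, values in inputs.items()}
ranking = sorted(correlations, key=lambda k: abs(correlations[k]), reverse=True)
for name in ranking:
    print(f"{name}: r = {correlations[name]:+.3f}")
```

Input combinations can then be built by adding variables in order of decreasing absolute correlation, which is one common way to use such a ranking.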

Conclusion

This study investigates the performance of three machine learning models, based on standalone Support Vector Regression (SVR) and Decision Tree Regression (DTR) models or based on a novel ensemble Extra Tree Regression (ETR) algorithm, to predict the Water Quality Index (WQI) of the Lam Tsuen River in Hong Kong using water quality parameters measured at monthly intervals.

Several input parameter combinations (M1-M10) are considered. Results indicate the ETR model with ten input parameters to

CRediT authorship contribution statement

Seyed Babak Haji Seyed Asadollah: Data curation, Writing - original draft, Software. Ahmad Sharafati: Conceptualization, Methodology, Writing - review & editing, Validation, Supervision. Davide Motta: Investigation, Writing - review & editing. Zaher Mundher Yaseen: Visualization, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to express their gratitude and appreciation to the Office of the Government Chief Information Officer of the Government of the Hong Kong Special Administrative Region for providing the dataset needed for this study. Our appreciation is extended to the respected editor and reviewers for their constructive comments.

References (96)

  • N.M. Gazzaz et al.

    Artificial neural network modeling of the water quality index for Kinta River (Malaysia) using water quality variables as predictors

    Mar. Pollut. Bull.

    (2012)
  • R.C. Grabowski et al.

    Hydrogeomorphology—ecology interactions in river systems

    River Res. Appl.

    (2016)
  • A. Gupta Das

    Implication of environmental flows in river basin management

    Phys. Chem. Earth, Parts A/B/C

    (2008)
  • S.M. Pandhiani et al.

    Time-series prediction of streamflows of Malaysian rivers using data-driven techniques

    J. Irrig. Drain. Eng.

    (2020)
  • N.B. Bhatti et al.

    Socio-economic impact assessment of small dams based on T-Paired sample test using SPSS software

    Civ. Eng. J.

    (2019)
  • C. Cordier et al.

    Culture of microalgae with ultrafiltered seawater: a feasibility study

    Sci. Med. J.

    (2020)
  • A.P. Singh et al.

    Managing water quality of a river using an integrated geographically weighted regression technique with fuzzy decision-making model

    Environ. Monit. Assess.

    (2019)
  • H. Cheng et al.

    Meeting China’s Water Shortage Crisis: Current Practices and Challenges

    (2009)
  • G. Shahzad et al.

    Rapid performance evaluation of water supply services for strategic planning

    Civ. Eng. J.

    (2019)
  • P. Sihag et al.

    Modelling of impact of water quality on recharging rate of storm water filter system using various kernel function based regression

    Model. Earth Syst. Environ.

    (2018)
  • D. Katyal

    Water quality indices used for surface water vulnerability assessment

    Int. J. Environ. Sci.

    (2011)
  • G.S. Solangi et al.

    Analysis of Indus Delta groundwater and surface water suitability for domestic and irrigation purposes

    Civ. Eng. J.

    (2019)
  • C.J. Vörösmarty et al.

    Global threats to human water security and river biodiversity

    Nature

    (2010)
  • P. Debels et al.

    Evaluation of water quality in the Chillán River (Central Chile) using physicochemical parameters and a modified water quality index

    Environ. Monit. Assess.

    (2005)
  • A. Lumb et al.

    A review of genesis and evolution of water quality index (WQI) and some future directions

    Water Qual. Expo. Heal.

    (2011)
  • A.A. Bordalo et al.

    A water quality index applied to an international shared river basin: the case of the Douro River

    Environ. Manage.

    (2006)
  • R.M. Brown et al.

    A Water Quality Index- Do We Dare

    (1970)
  • R.K. Horton

    An index number system for rating water quality

    J. Water Pollut. Control Fed.

    (1965)
  • C.G. Cude

    Oregon water quality index: a tool for evaluating water quality management effectiveness

    JAWRA J. Am. Water Resour. Assoc.

    (2001)
  • A. Said et al.

    An innovative index for evaluating water quality in streams

    Environ. Manage.

    (2004)
  • M. Dehghani et al.

    Uncertainty analysis of streamflow drought forecast using artificial neural networks and Monte-Carlo simulation

    Int. J. Climatol.

    (2014)
  • S. Mandal et al.

    Modeling of arsenic (III) removal by evolutionary genetic programming and least square support vector machine models

    Environ. Process.

    (2014)
  • M.J. Alizadeh et al.

    Effect of river flow on the quality of estuarine and coastal waters using machine learning models

    Eng. Appl. Comput. Fluid Mech.

    (2018)
  • K. Kargar et al.

    Estimating longitudinal dispersion coefficient in natural streams using empirical models and machine learning algorithms

    Eng. Appl. Comput. Fluid Mech.

    (2020)
  • P. Sihag et al.

    Prediction of unsaturated hydraulic conductivity using adaptive neuro-fuzzy inference system (ANFIS)

    ISH J. Hydraul. Eng.

    (2019)
  • P. Sihag et al.

    Modelling of infiltration of sandy soil using gaussian process regression

    Model. Earth Syst. Environ.

    (2017)
  • Z.M. Yaseen et al.

    An enhanced extreme learning machine model for river flow forecasting: state-of-the-art, practical applications in water resource engineering area and future research direction

    J. Hydrol.

    (2018)
  • M. Najafzadeh et al.

    Evaluation of neuro-fuzzy GMDH-based particle swarm optimization to predict longitudinal dispersion coefficient in rivers

    Environ. Earth Sci. Springer

    (2016)
  • M. Najafzadeh et al.

    Prediction of local scour depth downstream of sluice gates using data-driven models

    ISH J. Hydraul. Eng.

    (2017)
  • M. Najafzadeh et al.

    Prediction of riprap stone size under overtopping flow using data-driven models

    Int. J. River Basin Manag. IAHR

    (2018)
  • T.M. Tung et al.

    A survey on river water quality modelling using artificial intelligence models: 2000–2020

    J. Hydrol.

    (2020)
  • M.A. Zali et al.

    Sensitivity analysis for water quality index (WQI) prediction for Kinta River, Malaysia

    World Appl. Sci. J.

    (2011)
  • U. Nigam et al.

    Development of Computational Assessment Model of Fuzzy Rule Based Evaluation of Groundwater Quality Index: Comparison and Analysis with Conventional Index

    (2019)
  • R. Srinivas et al.

    Application of fuzzy multi-criteria approach to assess the water quality of River Ganges

    Soft Computing: Theories and Applications

    (2018)
  • Z.M. Yaseen et al.

    Hybrid adaptive neuro-fuzzy models for water quality index estimation

    Water Resour. Manag.

    (2018)
  • M. Hameed et al.

    Application of artificial intelligence (AI) techniques in water quality index prediction: a case study in tropical region, Malaysia

    Neural Comput. Appl.

    (2017)
  • B. Xiang et al.

    The application of a decision tree and stochastic forest model in summer precipitation prediction in Chongqing

    Atmosphere (Basel)

    (2020)
  • P. Aghelpour et al.

    Long-term monthly average temperature forecasting in some climate types of Iran, using the models SARIMA, SVR, and SVR-FA

    Theor. Appl. Climatol.

    (2019)