River water quality index prediction and uncertainty analysis: A comparative study of machine learning models
Introduction
Surface water in rivers plays a vital role in the environment, social health and economic development [[1], [2], [3]]. Various elements affect water quality in rivers, including natural factors such as rainfall and erosion, and human factors such as urban, agricultural and manufacturing practices [[4], [5], [6]]. Because surface water is the primary source of freshwater worldwide, its degradation can have significant consequences for drinking water availability and, more generally, for economic development and future strategies [[7], [8], [9]]. The interaction of rivers with the surrounding environment, with the related exchange of urban, industrial and agricultural contaminants along their path, leads to water pollution [10,6,11,12].
The Water Quality Index (WQI), as a classification indicator, has been widely used to evaluate and categorize the quality of both ground and surface waters and plays a significant role in water resources management [[13], [14], [15]]. This index integrates several physical and chemical factors into a single parameter that quantifies the state of water quality [16,17]. Computing this indicator provides an efficient approach for water quality assessment.
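As a rough illustration of how an index of this kind folds several measurements into one number, the sketch below computes a weighted arithmetic mean of per-parameter sub-index scores. It is only a sketch: the function name `weighted_wqi`, the parameter subset, the weights and the scores are hypothetical placeholders, and they are not the weights of the NSFWQI or of the index used in this study.

```python
def weighted_wqi(sub_indices, weights):
    """Aggregate per-parameter sub-index scores (0-100 scale) into one value
    using a weighted arithmetic mean. Both arguments are dicts keyed by
    parameter name."""
    if set(sub_indices) != set(weights):
        raise ValueError("sub_indices and weights must cover the same parameters")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * sub_indices[name] for name in sub_indices)
    return weighted_sum / total_weight


# Hypothetical sub-index scores and weights, for illustration only.
example_scores = {"DO": 80.0, "BOD": 60.0, "pH": 90.0}
example_weights = {"DO": 0.5, "BOD": 0.3, "pH": 0.2}
example_wqi = weighted_wqi(example_scores, example_weights)
```

Dividing by the total weight keeps the result on the same 0-100 scale even when the weights do not sum exactly to one.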
The application of the WQI was first presented by [18,19] and later adopted and modified by numerous practitioners [[20], [21], [22]]. All these applications originated from the general concept of Water Quality Index proposed by the National Sanitation Foundation (NSF) of the United States, and the NSFWQI [18] is the most commonly adopted method worldwide.
In general, WQI formulations involve lengthy calculations and thus require considerable time and effort. In addition, the conventional methods for computing the WQI require large amounts of physical and chemical data, typically at daily intervals. Hence, alternative approaches to calculate the WQI in a computationally efficient and accurate way are required. Such an improvement may benefit environmental engineers when monitoring and assessing water quality.
Over the past few decades, Artificial Intelligence (AI), in the form of machine learning models, has been increasingly applied to solve various environmental engineering problems, including river water quality modeling [6,[23], [24], [25], [26]]. Machine learning models represent a significant innovation in the research on monitoring and control of several engineering processes [[27], [28], [29]] and their algorithms can be used to carry out accurate predictions without the need for complex programming. Machine learning models are ultimately based on data mining and identification of patterns in data, through the construction of algorithms using a subset of the dataset (training set) and verification of the prediction performance using a separate subset of the dataset (testing set) [[30], [31], [32]].
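The training/testing partition described above can be sketched in a few lines of standard-library Python. The helper name `train_test_split_indices`, the 25% test fraction and the fixed seed are assumptions made for illustration; the split ratio used in the study is not stated in this excerpt.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Randomly partition sample indices into disjoint training and testing sets.

    Returns (train_indices, test_indices). The test fraction and seed are
    illustrative defaults, not values taken from the study.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Keep at least one sample on each side of the split.
    n_test = min(n_samples - 1, max(1, round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(20, test_fraction=0.25, seed=1)
```

The model is fitted only on the samples in `train_idx`, and its prediction error is then measured on the held-out samples in `test_idx`.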
Our literature review shows that WQI simulation using AI models has received considerable attention [33]. Tripathi and Singal [34] employed the Principal Components Analysis (PCA) method to select the optimal input variable combination and provide an innovative Water Quality Index computation method for the Ganges River in India. Using this method, they could reduce 28 parameters to only nine, including Electrical Conductivity (EC), power of Hydrogen (pH), Dissolved Oxygen (DO), Total Dissolved Solids (TDS), Sulfate (SO4), Magnesium (Mg), Chlorine (Cl), Total Coliform (TC) and Biochemical Oxygen Demand (BOD), which led to a dramatic reduction in computational time. Zali et al. [35] addressed the effects of six major input parameters, namely DO, BOD, Chemical Oxygen Demand (COD), pH, Nitrate (NO3) and Suspended Solids (SS), on the Water Quality Index using Artificial Neural Networks (ANNs). Conducting a sensitivity analysis, they identified the relative importance of each parameter in WQI determination and concluded that DO, SS and NO3 are the key input parameters. Nigam and SM [36] used a fuzzy-based model to calculate the groundwater Water Quality Index and compared its prediction performance with other common calculation methods, finding that the fuzzy-based model outperforms them in prediction and water quality classification. Srinivas and Singh [37] developed a novel fuzzy decision-making method to predict WQI in rivers, employing an Interactive Fuzzy model (IFWQI); their findings indicate a significant improvement in the WQI prediction performance of their proposed model compared to the classic fuzzy method. Yaseen et al. [38] assessed the WQI prediction performance of hybrid methods based on the Adaptive Neuro-Fuzzy Inference System (ANFIS), namely ANFIS-FCM (Fuzzy C-Means data clustering), ANFIS-GP (Grid Partition), and ANFIS-SC (Subtractive Clustering); they observed ANFIS-SC to be the best performing model. Hameed et al. [39] used two ANN models, namely Back-Propagation Neural Network (BPNN) and Radial Basis Function Neural Network (RBFNN), to describe the relationship between WQI and several chemical parameters (i.e. BOD, COD, DO, Nitrate, pH, Suspended Solids) in tropical environments, and obtained better prediction results with the RBFNN model.
While typical AI models based on ANN and ANFIS have been widely developed for WQI simulation, environmental scientists have been exploring other robust and reliable AI models [25,26,33]. Tree-based models, such as Decision Trees (DTs), are another popular approach that has been applied successfully to various hydrological and environmental problems, such as rainfall forecasting [40]. The Support Vector Machine (SVM) model is also acknowledged as a powerful machine learning technique for both linear and nonlinear regression problems and has been applied to various scientific problems with high prediction accuracy [[41], [42], [43]]. Concerning the application of decision-tree and support vector regression models for water quality parameter prediction, Granata et al. [44] developed a Support Vector Regression (SVR) model and a Regression Tree (RT) model to predict concentrations of Total Suspended Solids (TSS), Total Dissolved Solids (TDS), COD and BOD; they found the SVR model to provide the best predictions. Li et al. [45] proposed a hybrid SVR model with the FireFly Algorithm (FFA) to predict WQI using monthly water quality parameter data, showing a significant improvement in prediction performance when compared to the standalone SVR model. Kamyab-Talesh et al. [46] studied the optimization of the SVM model to identify the parameters that most affect the WQI and found that Nitrate is the most important parameter for WQI prediction. Wang et al. [47] investigated the performance of three machine learning models, SVR, SVR-GA (Genetic Algorithm) and SVR-PSO (Particle Swarm Optimization), to predict WQI by employing the spectral indicators Difference Index (DI), Normalized DI and Ratio Index (RI), which were obtained from remote sensing, and found SVR-PSO to be the best performing model.
Although various AI models have been introduced for modeling WQI, several limitations are still observed, such as model hyperparameter tuning, the stochastic characteristics of the case study, and model flexibility. Ensemble machine learning models offer a way to overcome those limitations and have recently been gaining popularity [48]. Several ensemble bagging and boosting models are available, such as MadaBoost, Gradient Boost, AdaBoost and Random Forest, and have proved to be significantly useful tools for prediction [49,50]. More recently developed models, such as Extra Tree or Bagging Regression, are simpler to code and have shown potential to replace other classic AI algorithms [51,52]. Ensemble approaches combine standalone models, such as SVM and Decision Trees, to construct a better-performing prediction model [[53], [54], [55]]. The application of ensemble machine learning models to water quality studies has been limited and is the focus of this investigation. Specifically, this study presents a novel ensemble machine learning model, namely the Extra Tree Regression (ETR) ensemble method, to overcome several drawbacks in the available approaches for WQI prediction, such as the large number of input variables and their uncertainty.
Numerous benchmark models, such as ANNs and ANFIS, could be compared with the ETR model; however, two standalone machine learning models, namely a Decision Tree Regression (DTR) model, which is based on a tree-based algorithm, and a Support Vector Regression (SVR) model, are adopted as benchmarks in the present study. Using the Python programming language on the Anaconda data science platform, the prediction performance of the three models is quantified and compared, and an uncertainty analysis with respect to model structure and input parameter selection is also carried out.
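A comparison of this kind might be set up as in the minimal sketch below, which assumes scikit-learn is available (the excerpt names Python and Anaconda but not the library). The synthetic data, the 25% test split, the default hyperparameters, the feature scaling before SVR, and the choice of RMSE and R² as metrics are all illustrative assumptions rather than the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, seed=0):
    """Fit ETR, DTR and SVR on the same split and report test RMSE and R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "ETR": ExtraTreesRegressor(n_estimators=100, random_state=seed),
        "DTR": DecisionTreeRegressor(random_state=seed),
        # SVR is sensitive to feature scale, so inputs are standardized first.
        "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
        results[name] = {"rmse": rmse, "r2": float(r2_score(y_test, predictions))}
    return results


# Synthetic stand-in for monthly water quality records (not the study's data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 4))
y_demo = (
    100.0
    - 30.0 * X_demo[:, 0]
    - 20.0 * X_demo[:, 1] ** 2
    + rng.normal(0.0, 1.0, size=200)
)
demo_results = compare_models(X_demo, y_demo)
```

Using one shared split for all three models keeps the comparison fair, since every model is scored on the same held-out samples.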
Study area and data collection
The Lam Tsuen River (Fig. 1) in Hong Kong is considered for the present study. The river flows from west of Tai Po, in the New Territories of Hong Kong, into the Tolo Harbour. The Lam Tsuen River, which is the second-longest river in Hong Kong, has a length of about 10.8 km and a catchment area of about 21 km². The river originates from the north of the Tai Mo mountaintop of the Sei Fong Shan Mountains, which is the tallest mountaintop in Hong Kong with a height of 740 m above sea level.
Methodology
Fig. 2 shows an overview of the methodology used in the present study, described in detail in the following sections.
Assessment of predictive models
In this study, several input parameter combinations are investigated to determine the optimal combination for prediction. Table 2 presents the values of the Pearson correlation coefficient between the various input parameters considered and the WQI. Note that, while other techniques exist to evaluate input variable combinations, such as Cross-Correlation Function (CCF), Partial Auto-Correlation Function (PACF), Auto-Correlation Function (ACF) and Gamma Technique (GT), the Pearson
Conclusion
This study investigates the performance of three machine learning models, based on standalone Support Vector Regression (SVR) and Decision Tree Regression (DTR) models or based on a novel ensemble Extra Tree Regression (ETR) algorithm, to predict the Water Quality Index (WQI) of the Lam Tsuen River in Hong Kong using water quality parameters measured at monthly intervals.
Several input parameter combinations () are considered. Results indicate the ETR model with ten input parameters to
CRediT authorship contribution statement
Seyed Babak Haji Seyed Asadollah: Data curation, Writing - original draft, Software. Ahmad Sharafati: Conceptualization, Methodology, Writing - review & editing, Validation, Supervision. Davide Motta: Investigation, Writing - review & editing. Zaher Mundher Yaseen: Visualization, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
The authors would like to express their gratitude and appreciation to the Office of the Government Chief Information Officer of the Government of the Hong Kong Special Administrative Region for providing the dataset needed for this study. Our appreciation is extended to the respected editor and reviewers for their constructive comments.
References (96)
- et al. Assessment of water quality in groundwater resources of Iran using a modified drinking water quality index (DWQI). Ecol. Indic. (2013)
- et al. Use of the water quality index and dissolved oxygen deficit as simple indicators of watersheds pollution. Ecol. Indic. (2007)
- et al. Use of water quality indices to verify the impact of Córdoba City (Argentina) on Suquı́a River. Water Res. (2000)
- et al. Use of principal component analysis for parameter selection for development of a novel water quality index: a case study of river Ganga India. Ecol. Indic. (2019)
- et al. M-AdaBoost-A based ensemble system for network intrusion detection. Expert Syst. Appl. (2020)
- Evaluation of decision trees: a multi-criteria approach. Comput. Oper. Res. (2004)
- et al. Decision tree regression for soft classification of remote sensing data. Remote Sens. Environ. (2005)
- et al. Predicting electricity energy consumption: a comparison of regression analysis, decision tree and neural networks. Energy (2007)
- et al. Uncertainty analysis of water quality index (WQI) for groundwater quality evaluation: application of Monte-Carlo method for weight allocation. Ecol. Indic. (2020)
- Real-time probabilistic forecasting of river water quality under data missing situation: deep learning plus post-processing techniques. J. Hydrol. (2020)
- Artificial neural network modeling of the water quality index for Kinta River (Malaysia) using water quality variables as predictors. Mar. Pollut. Bull.
- Hydrogeomorphology—ecology interactions in river systems. River Res. Appl.
- Implication of environmental flows in river basin management. Phys. Chem. Earth, Parts A/B/C
- Time-series prediction of streamflows of Malaysian rivers using data-driven techniques. J. Irrig. Drain. Eng.
- Socio-economic impact assessment of small dams based on T-Paired sample test using SPSS software. Civ. Eng. J.
- Culture of microalgae with ultrafiltered seawater: a feasibility study. Sci. Med. J.
- Managing water quality of a river using an integrated geographically weighted regression technique with fuzzy decision-making model. Environ. Monit. Assess.
- Meeting China’s Water Shortage Crisis: Current Practices and Challenges
- Rapid performance evaluation of water supply services for strategic planning. Civ. Eng. J.
- Modelling of impact of water quality on recharging rate of storm water filter system using various kernel function based regression. Model. Earth Syst. Environ.
- Water quality indices used for surface water vulnerability assessment. Int. J. Environ. Sci.
- Analysis of Indus Delta groundwater and surface water suitability for domestic and irrigation purposes. Civ. Eng. J.
- Global threats to human water security and river biodiversity. Nature
- Evaluation of water quality in the Chillán River (Central Chile) using physicochemical parameters and a modified water quality index. Environ. Monit. Assess.
- A review of genesis and evolution of water quality index (WQI) and some future directions. Water Qual. Expo. Heal.
- A water quality index applied to an international shared river basin: the case of the Douro River. Environ. Manage.
- A Water Quality Index- Do We Dare
- An index number system for rating water quality. J. Water Pollut. Control Fed.
- Oregon water quality index: a tool for evaluating water quality management effectiveness. JAWRA J. Am. Water Resour. Assoc.
- An innovative index for evaluating water quality in streams. Environ. Manage.
- Uncertainty analysis of streamflow drought forecast using artificial neural networks and Monte-Carlo simulation. Int. J. Climatol.
- Modeling of arsenic (III) removal by evolutionary genetic programming and least square support vector machine models. Environ. Process.
- Effect of river flow on the quality of estuarine and coastal waters using machine learning models. Eng. Appl. Comput. Fluid Mech.
- Estimating longitudinal dispersion coefficient in natural streams using empirical models and machine learning algorithms. Eng. Appl. Comput. Fluid Mech.
- Prediction of unsaturated hydraulic conductivity using adaptive neuro-fuzzy inference system (ANFIS). ISH J. Hydraul. Eng.
- Modelling of infiltration of sandy soil using gaussian process regression. Model. Earth Syst. Environ.
- An enhanced extreme learning machine model for river flow forecasting: state-of-the-art, practical applications in water resource engineering area and future research direction. J. Hydrol.
- Evaluation of neuro-fuzzy GMDH-based particle swarm optimization to predict longitudinal dispersion coefficient in rivers. Environ. Earth Sci. Springer
- Prediction of local scour depth downstream of sluice gates using data-driven models. ISH J. Hydraul. Eng.
- Prediction of riprap stone size under overtopping flow using data-driven models. Int. J. River Basin Manag. IAHR
- A survey on river water quality modelling using artificial intelligence models: 2000–2020. J. Hydrol.
- Sensitivity analysis for water quality index (WQI) prediction for Kinta River, Malaysia. World Appl. Sci. J.
- Development of Computational Assessment Model of Fuzzy Rule Based Evaluation of Groundwater Quality Index: Comparison and Analysis with Conventional Index
- Application of fuzzy multi-criteria approach to assess the Water quality of River Ganges. Soft Computing: Theories and Applications
- Hybrid adaptive neuro-fuzzy models for water quality index estimation. Water Resour. Manag.
- Application of artificial intelligence (AI) techniques in water quality index prediction: a case study in tropical region, Malaysia. Neural Comput. Appl.
- The application of a decision tree and stochastic forest model in summer precipitation prediction in Chongqing. Atmosphere (Basel)
- Long-term monthly average temperature forecasting in some climate types of Iran, using the models SARIMA, SVR, and SVR-FA. Theor. Appl. Climatol.