A feature reconstruction-based multi-task regression model for cyanobacterial distribution forecasting along the water column

doi:10.1016/j.jclepro.2021.126025

Journal of Cleaner Production

Volume 292, 10 April 2021, 126025

https://doi.org/10.1016/j.jclepro.2021.126025

Highlights

• Cyanobacterial water pollution threatens the cleaner ecosystem and human health.
• Data on cyanobacteria cell density at 11 water depths were collected for modelling.
• A multi-task regression model has been built to share knowledge among water depths.
• Forecasting error is reduced by ∼20% compared to nonlinear single-task models.
• The model applicability has been demonstrated by real-world data of a tropical lake.

Abstract

Cyanobacterial water pollution threatens the cleaner ecosystem and urban sustainability because of its harm to aquatic ecosystems and human health, which motivates the development of an effective forecasting tool for cyanobacterial blooms. Along the water column, cyanobacteria cell densities show various distribution patterns and are influenced by multiple environmental factors. Most data-driven models treat cyanobacteria forecasting at a specific water depth as a single task, which fails to share knowledge amongst water depths and results in unfavourable forecasting accuracy. This is why an increasing number of nonlinear black-box models have been built for cyanobacteria forecasting, but at the expense of model interpretability. This study aims to investigate whether forecasting accuracy and model interpretability can be enhanced by (i) using easily accessible predictors and (ii) developing a feature reconstruction-based multi-task regression model with knowledge sharing amongst water depths. Real-world data from a tropical lake are used to evaluate the effectiveness of the model. For the studied lake, the highest average cyanobacteria cell density occurs at 1.0 m, after which it decreases by over 30% at 5.5 m. The correlation coefficients of the cyanobacteria cell density time series between adjacent water depths are greater than 0.95 (P < 0.001). The forecasting results indicate that, compared to single-task nonlinear models, error reductions of 20.59%, 16.25%, and 22.70%, measured by the mean square error, are achieved for one-day-ahead, two-day-ahead, and three-day-ahead cyanobacterial bloom forecasts, respectively. Under the proposed model, the accuracy of bloom and non-bloom signals reaches up to 94.81% and 98.28%, respectively. Based on the proposed model, the relative importance of predictors, the sparsity of regression coefficients, and the covariance relationship of regression coefficients can interpret the model adequately and elucidate the mechanism of knowledge sharing and forecasting accuracy improvement.
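The error reductions quoted above are relative changes in mean square error between a baseline and a candidate model. The following minimal sketch shows one way such a percentage could be computed; the function names and the numeric values are hypothetical and are not taken from the study.

```python
def mse(actual, predicted):
    """Mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_reduction_pct(mse_baseline, mse_model):
    """Percentage by which mse_model improves on mse_baseline."""
    return 100.0 * (mse_baseline - mse_model) / mse_baseline


# Hypothetical standardised values, for illustration only (not the lake data).
observed = [0.10, 0.40, 0.35, 0.80]
baseline_forecast = [0.20, 0.30, 0.50, 0.60]  # stand-in for a single-task model
model_forecast = [0.15, 0.35, 0.45, 0.70]     # stand-in for a multi-task model

reduction = error_reduction_pct(mse(observed, baseline_forecast),
                                mse(observed, model_forecast))
```

A positive value means the candidate model has the lower error; a negative value would mean it performs worse than the baseline.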

Introduction

Harmful cyanobacterial blooms have raised worldwide concerns as they threaten the cleaner ecosystem (Nuamah et al., 2020) and human health (Burford et al., 2020). Cyanobacterial blooms not only lead to the degradation of water quality and the death of aquatic life (Huisman et al., 2018) but also produce toxins, e.g. microcystins, that threaten the sustainability of drinking water and food safety (Liu et al., 2018). As a measure for the sustainable management of freshwater resources, the accurate forecasting of cyanobacterial blooms in freshwater bodies assists decision-makers in taking proactive action to mitigate risks (Davidson et al., 2016) and helps residents to mitigate the potentially adverse effects of using contaminated water (Chen et al., 2015). The distribution of cyanobacteria depends on the water depth (Ni et al., 2018). Compared to cyanobacteria forecasting in surface water, forecasting the vertical distribution of cyanobacteria cell densities is even more crucial, as freshwater at different water depths contributes to aquatic production (e.g. aquaculture and water intakes for waterworks) and recreational activities (e.g. fishing, boating, and swimming). The variations in cyanobacteria cell densities along the water column are influenced by multiple factors. These variations, together with the multivariate nature of the aquatic system, make cyanobacterial blooms challenging to forecast.

Conventional forecasting models for cyanobacterial blooms include process-based models and data-driven models (Rousso et al., 2020). Process-based models (Qin et al., 2015) strongly depend on the domain knowledge of cyanobacterial blooms, and such models have many hyper-parameters to calibrate. They are commonly used for scenario analysis (Chen et al., 2015). Representative data-driven models include double exponential smoothing (DES) (Shuhaibar and Riffat, 2008), multiple linear regression (MLR) (Rajaee and Boroumand, 2015), autoregressive integrated moving average (ARIMA) (Chen et al., 2015), fuzzy logic (Kim et al., 2014), Bayesian hierarchical modelling (Obenour et al., 2014), hidden Markov models (Jiang et al., 2016), evolutionary algorithms (Cao et al., 2016), artificial neural networks (ANN) (Tian et al., 2017), and support vector regression (SVR) (Lou et al., 2017). ANN and SVR models have recently been used to assess water eutrophication with algae blooms (Nieto et al., 2019) and to simulate cyanobacterial blooms in tidal freshwater (Shen et al., 2019). These data-driven models normally focus on forecasting at a specific water depth. When they are used for vertical distribution forecasting, individual models need to be built for different water depths, which not only increases the modelling workload but also neglects knowledge sharing amongst water depths. This is why increasingly complicated nonlinear black-box models have been employed to improve forecasting accuracy, but at the expense of model interpretability. For example, the adaptive particle swarm optimisation-support vector regression (APSO-SVR) has been developed (Yao et al., 2020) to enhance the forecasting accuracy of a basic SVR model. Its counterpart for classification/prediction, namely the adaptive particle swarm optimisation-support vector machine (Moodi et al., 2020), has also shown favourable performance. Although these black-box data-driven models, including ANN- and SVR-based models, offer passable performance by mining some nonlinear rules, their weak interpretability restricts their applicability (Zhang et al., 2020).

Spatial-temporal information can enhance the understanding of a system and has broad applications, such as understanding the spatial-temporal characteristics of water quality (Zhang and Chen, 2011) and analysing a multi-event system (Dubey et al., 2017). In addition to the time dimension of data monitoring, the water column can naturally be regarded as a spatial dimension. As an early attempt, Torres et al. (2011) built an ANN model with three outputs to forecast cyanobacteria cell density at three water depths, including the surface, euphotic zone limit, and bottom. Mattei et al. (2018) built a depth-resolved ANN model to forecast marine phytoplankton production at six fixed depths. The depth-resolved model was found superior in forecasting accuracy compared to the depth-integrated model with multiple co-predictors and one depth-integrated phytoplankton output (Mattei et al., 2018). The accuracy improvement in these models is attributed to the multi-output mechanism (Xu et al., 2019) with shared hidden neurons, which helps integrate common information to improve model performance. Although such a knowledge-sharing method is useful, the practical requirements of model interpretability and output stability may limit the application of multi-output ANN models, which motivates the development of a simple model with competitive accuracy and better interpretability. Because of their complicated nonlinear functions, ANN models lack an intuitive and quantitative interpretation of knowledge sharing amongst water depths. The uncertainties in multiple hyper-parameters and the random seeds used in ANN models generally lead to unstable training results, which may not decrease the forecasting accuracy but could cause different interpretations of such models. Inspired by the knowledge-sharing experience amongst water depths (Mattei et al., 2018), instead of using complicated nonlinear models, such as extreme learning models (Albadra and Tiuna, 2017), deep learning models (Wang et al., 2020), and reinforcement learning models (Chen et al., 2020), this study goes a step further to investigate whether forecasting accuracy and model interpretability can be enhanced by (i) using easily accessible predictors and (ii) developing a linear multi-task learning model with knowledge sharing amongst water depths.

The problem characteristics of multiple water depths and common water-quality predictors naturally correspond to a type of intelligent model, namely multi-task learning. Multi-task learning is a transfer learning paradigm in the machine learning field (Pan and Yang, 2009). The core idea of multi-task learning lies in sharing the domain information of related tasks to improve learning performance (Zhang and Yang, 2017). It has been widely used in many fields, such as spatial-temporal event forecasting at multiple sites (Zhao et al., 2015), water quality prediction for urban waste-water stations (Liu et al., 2016), lake water quality prediction at macroscales (Collins et al., 2019), and energy demand forecasting (Tan et al., 2020). A multi-task regression can be implemented by either linear or nonlinear models, in which knowledge (e.g. predictors, features, and parameters) can be shared either implicitly or explicitly (Pan and Yang, 2009). In this study, a linear multi-task regression model with explicit knowledge sharing is extended to improve model interpretability. To date, the multi-task modelling paradigm has not been developed for cyanobacterial distribution forecasting along the water column. In this study, unlike traditional single-task regression modelling, which focuses only on a single depth without any relation to other depths, a feature reconstruction-based multi-task regression (FRMTR) model is developed to fuse data, share knowledge amongst water depths, and forecast the distribution of cyanobacteria cell density along the water column. In the proposed FRMTR model, high-level features are constructed from easily accessible predictors to enhance the forecasting ability. More details are explained in Section 2.2. The forecasting accuracy of the FRMTR model is compared with advanced models, including ANN and APSO-SVR. To evaluate the applicability of the proposed model, a full year of data at 11 water depths of a tropical lake in Singapore was provided by Singapore’s National Water Agency. These data are employed for both model evaluation and understanding the mechanisms of the forecasting model and of cyanobacterial distributions along the water column.
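The idea of explicit knowledge sharing in a linear multi-task regression can be illustrated with a generic, well-known formulation. The sketch below is not the FRMTR model, whose details belong to Section 2.2 and are not reproduced in this excerpt. It uses mean-regularised multi-task learning as a stand-in: each task (for example, one water depth) has its own weight vector, and a penalty pulls all weight vectors toward their common mean. The function name, hyper-parameters, and toy data are assumptions made for illustration.

```python
def fit_multitask(X, Y, lam=1.0, lr=0.05, epochs=2000):
    """Fit one linear model per task, with weights pulled toward their mean.

    X: list of n feature rows shared by all tasks.
    Y: list of T target lists, one per task (e.g. one per water depth).
    Minimises sum_t MSE_t + lam * sum_t ||w_t - w_mean||^2 by gradient descent.
    """
    n, d, T = len(X), len(X[0]), len(Y)
    W = [[0.0] * d for _ in range(T)]
    for _ in range(epochs):
        w_mean = [sum(W[t][j] for t in range(T)) / T for j in range(d)]
        new_W = []
        for t in range(T):
            # Gradient of the coupling penalty for task t.
            grad = [2.0 * lam * (W[t][j] - w_mean[j]) for j in range(d)]
            # Gradient of the task's own mean square error.
            for x, y in zip(X, Y[t]):
                resid = sum(wj * xj for wj, xj in zip(W[t], x)) - y
                for j in range(d):
                    grad[j] += 2.0 * resid * x[j] / n
            new_W.append([W[t][j] - lr * grad[j] for j in range(d)])
        W = new_W
    return W


# Toy data (hypothetical): two related tasks with similar true weights.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
Y = [[1.0, 0.5, 1.5, 0.5],   # generated with weights [1.0, 0.5]
     [1.2, 0.4, 1.6, 0.8]]   # generated with weights [1.2, 0.4]

W_independent = fit_multitask(X, Y, lam=0.0)  # no sharing: plain per-task fits
W_shared = fit_multitask(X, Y, lam=1.0)       # sharing: weights drawn together
```

With `lam=0.0` the tasks are fitted independently, which corresponds to the single-task setting. Increasing `lam` strengthens the coupling, so information from one task influences the coefficients of the others while each model stays linear and directly inspectable.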

Materials and methods

This part introduces the study site and related data in Section 2.1. The proposed model and its performance interpretability are presented in Section 2.2.

Results and discussion

By a conventional 7:3 division for model training and testing, the time series data were divided into a training dataset (70%, i.e. the first 256 days) and a testing dataset (30%, i.e. the remaining 109 days). Other possible divisions, e.g. 6:4 or 8:2, should also be valid, as the time series of the cyanobacteria transformation value in Fig. 3b is stationary. After feature reconstruction, all target values and features were standardised using their means and standard deviations. All forecasting
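The chronological split and the standardisation step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the rounding-up rule is chosen only because it reproduces the stated 256/109 split for 365 days, and taking the mean and standard deviation from the training part alone (with the population formula) is an assumption, since the text does not say which subset the statistics come from.

```python
from statistics import mean, pstdev


def chronological_split(series, train_parts=7, total_parts=10):
    """Split a time-ordered sequence into leading and trailing parts, no shuffling."""
    # Ceiling division in integers; rounding up is an assumption that
    # reproduces the 256/109 split for a 365-day series.
    cut = -(-len(series) * train_parts // total_parts)
    return series[:cut], series[cut:]


def standardise(values, mu, sigma):
    """Z-score values with the supplied mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


# Placeholder daily series of 365 values (not the lake data).
series = [float(day % 30) for day in range(365)]
train, test = chronological_split(series)

# Assumption: statistics come from the training part only, to avoid leakage.
mu, sigma = mean(train), pstdev(train)
train_z = standardise(train, mu, sigma)
test_z = standardise(test, mu, sigma)
```

Keeping the split in time order matters for forecasting tasks, because shuffling would let the model see days that come after the ones it is tested on.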

Conclusion

Accurate cyanobacterial distribution forecasting contributes to the proactive control of water pollution and sustainable water consumption. A feature reconstruction-based multi-task regression model, namely FRMTR, has been built for cyanobacterial distribution forecasting along the water column. The model applicability has been demonstrated using one full year of monitoring data from a tropical lake in Singapore. Compared to conventional single-task models, FRMTR shares knowledge amongst water

CRediT authorship contribution statement

Peng Jiang: Conceptualization, Data curation, Methodology, Formal analysis, Visualization, Writing - original draft, Writing - review & editing, Funding acquisition. Yibin Huang: Conceptualization, Methodology, Formal analysis, Visualization, Writing - original draft. Xiao Liu: Conceptualization, Methodology, Writing - review & editing, Validation, Funding acquisition. Jingjie Zhang: Data curation, Writing - review & editing, Validation, Funding acquisition. Karina Yew-Hoong Gin: Writing -

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme [grant number R-706-001-101-281], the China Postdoctoral Science Foundation [grant number 2018M640397], the National Natural Science Foundation of China [grant number 71673188], and the Shenzhen Science and Technology Innovation Commission [grant number KQJSCX20180322152024270]. The authors would like to

References (61)

C. Bergmeir et al. On the use of cross-validation for time series predictor evaluation. Inf. Sci. (2012)
M. Burford et al. Perspective: advancing the research agenda for improving understanding of cyanobacteria in a future of global change. Harmful Algae (2020)
H. Cao et al. Spatially-explicit forecasting of cyanobacteria assemblages in freshwater lakes by multi-objective hybrid evolutionary algorithms. Ecol. Model. (2016)
Q. Chen et al. Online forecasting chlorophyll a concentrations by an auto-regressive integrated moving average model: feasibilities and potentials. Harmful Algae (2015)
K. Davidson et al. Forecasting the risk of harmful algal blooms. Harmful Algae (2016)
R. Dubey et al. An spatiotemporal information system based wide-area protection fault identification scheme. Int. J. Electr. Power Energy Syst. (2017)
J. Guo et al. The seasonal variation of microbial communities in drinking water sources in Shanghai. J. Clean. Prod. (2020)
P. Jiang et al. A framework based on Hidden Markov model with adaptive weighting for microcystin forecasting and early-warning. Decis. Support Syst. (2016)
Y. Kim et al. A wavelet-based autoregressive fuzzy model for forecasting algal blooms. Environ. Model. Software (2014)
G. Liu et al. Characteristics and mechanisms of microcystin-LR adsorption by giant reed-derived biochars: role of minerals, pores, and functional groups. J. Clean. Prod. (2018)
Y.-W. Cheung et al. Lag order and critical values of the augmented Dickey–Fuller test. J. Bus. Econ. Stat. (1995)
S. Collins et al. Winter precipitation and summer temperature predict lake water quality at macroscales. Water Resour. Res. (2019)

