A feature reconstruction-based multi-task regression model for cyanobacterial distribution forecasting along the water column

https://doi.org/10.1016/j.jclepro.2021.126025

Highlights

  • Cyanobacterial water pollution threatens the cleaner ecosystem and human health.

  • Data on cyanobacteria cell density at 11 water depths were collected for modelling.

  • A multi-task regression model has been built to share knowledge among water depths.

  • Forecasting error is reduced by ∼20% compared to nonlinear single-task models.

  • The model applicability has been demonstrated by real-world data of a tropical lake.

Abstract

Cyanobacterial water pollution threatens the cleaner ecosystem and urban sustainability because of its harm to aquatic ecosystems and human health, which motivates the development of an effective forecasting tool for cyanobacterial blooms. Along the water column, cyanobacteria cell densities show various distribution patterns and are influenced by multiple environmental factors. Most data-driven models treat cyanobacteria forecasting at a specific water depth as a single task, which fails to share knowledge amongst water depths and results in unfavourable forecasting accuracy. This is why an increasing number of nonlinear black-box models have been built for cyanobacteria forecasting, but at the expense of model interpretability. This study aims to investigate whether forecasting accuracy and model interpretability can be enhanced by (i) using easily accessible predictors and (ii) developing a feature reconstruction-based multi-task regression model with knowledge sharing amongst water depths. Real-world data from a tropical lake are used to evaluate the effectiveness of the model. For the studied lake, the highest average cyanobacteria cell density occurs at 1.0 m, and the density decreases by over 30% at 5.5 m. The correlation coefficients of time-series cyanobacteria cell densities between adjacent water depths are greater than 0.95 (P < 0.001). The forecasting results indicate that, compared to single-task nonlinear models, error reductions of 20.59%, 16.25%, and 22.70%, measured by the mean square error, are achieved for one-day-ahead, two-day-ahead, and three-day-ahead cyanobacterial bloom forecasts, respectively. The accuracy of bloom and non-bloom signals under the proposed model reaches up to 94.81% and 98.28%, respectively. Based on the proposed model, the relative importance of predictors, the sparsity of regression coefficients, and the covariance relationship of regression coefficients can interpret the model adequately and elucidate the mechanism of knowledge sharing and forecasting accuracy improvement.

Introduction

Harmful cyanobacterial blooms have raised worldwide concerns as they threaten the cleaner ecosystem (Nuamah et al., 2020) and human health (Burford et al., 2020). Cyanobacterial blooms not only lead to the degradation of water quality and the death of aquatic life (Huisman et al., 2018) but also produce toxins, e.g. microcystins, that threaten the sustainability of drinking water and food safety (Liu et al., 2018). As a measure for the sustainable management of freshwater resources, the accurate forecasting of cyanobacterial blooms in freshwater bodies assists decision-makers in taking proactive action to mitigate risks (Davidson et al., 2016) and helps residents to mitigate the potentially adverse effects of using contaminated water (Chen et al., 2015). The distribution of cyanobacteria depends on the water depth (Ni et al., 2018). Compared to cyanobacteria forecasting in surface water, forecasting the vertical distribution of cyanobacteria cell densities is even more crucial, as freshwater at different water depths contributes to aquatic production (e.g. aquaculture and water intakes for waterworks) and recreational activities (e.g. fishing, boating, and swimming). The variations in cyanobacteria cell densities along the water column are influenced by multiple factors. These variations, together with the multivariate nature of the aquatic system, make cyanobacterial blooms challenging to forecast.

Conventional forecasting models for cyanobacterial blooms include process-based models and data-driven models (Rousso et al., 2020). Process-based models (Qin et al., 2015) strongly depend on the domain knowledge of cyanobacterial blooms, and such models have many hyper-parameters to calibrate. They are commonly used for scenario analysis (Chen et al., 2015). The representative data-driven models include double exponential smoothing (DES) (Shuhaibar and Riffat, 2008), multiple linear regression (MLR) (Rajaee and Boroumand, 2015), autoregressive integrated moving average (ARIMA) (Chen et al., 2015), fuzzy logic (Kim et al., 2014), Bayesian hierarchical modelling (Obenour et al., 2014), hidden Markov model (Jiang et al., 2016), evolutionary algorithms (Cao et al., 2016), artificial neural networks (ANN) (Tian et al., 2017), and support vector regression (SVR) (Lou et al., 2017). ANN and SVR models have been recently used to assess water eutrophication with algae blooms (Nieto et al., 2019) and simulate cyanobacterial blooms in the tidal freshwater (Shen et al., 2019). These data-driven models normally focus on forecasting at a specific water depth. Individual models for different water depths need to be built when using such models for vertical distribution forecasting, which not only increases modelling workload but also neglects knowledge sharing amongst water depths. This is why increasingly complicated nonlinear black-box models have been employed to improve forecasting accuracy, but at the expense of model interpretability. For example, the adaptive particle swarm optimisation-support vector regression (APSO-SVR) has been developed (Yao et al., 2020) to enhance the forecasting accuracy of a basic SVR model. Its counterpart for classification/prediction, namely adaptive particle swarm optimisation-support vector machine (Moodi et al., 2020), has also presented favourable performance. Although these black-box data-driven models, including ANN- and SVR-based models, offer passable performance by mining nonlinear rules, their weak interpretability restricts their applicability (Zhang et al., 2020).

Spatial-temporal information can enhance the understanding of a system and has broad applications, such as understanding the spatial-temporal characteristics of water quality (Zhang and Chen, 2011) and analysing a multi-event system (Dubey et al., 2017). Besides the time dimension of data monitoring, the water column can be naturally regarded as a spatial dimension. As an early attempt, Torres et al. (2011) built an ANN model with three outputs to forecast cyanobacteria cell density at three water depths, including the surface, euphotic zone limit, and bottom. Mattei et al. (2018) built a depth-resolved ANN model to forecast marine phytoplankton production at six fixed depths. The depth-resolved model was found superior in forecasting accuracy compared to the depth-integrated model with multiple co-predictors and one depth-integrated phytoplankton output (Mattei et al., 2018). The accuracy improvement in these models is attributed to the multi-output mechanism (Xu et al., 2019) with shared hidden neurons, which helps integrate common information to improve model performance. Although such a knowledge-sharing method is useful, the practical requirements of model interpretability and output stability may limit the applications of multi-output ANN models, which motivates the development of a simple model with competitive accuracy and better interpretability. Due to their complicated nonlinear functions, ANN models lack an intuitive and quantitative interpretation of knowledge sharing amongst water depths. The uncertainties in multiple hyper-parameters and the random seeds used in ANN models generally result in unstable training results, which may not decrease the forecasting accuracy but could cause different interpretations for such models. Inspired by the knowledge-sharing experience amongst water depths (Mattei et al., 2018), instead of using complicated nonlinear models, such as extreme learning models (Albadra and Tiuna, 2017), deep learning models (Wang et al., 2020), and reinforcement learning models (Chen et al., 2020), this study goes a step further to investigate whether forecasting accuracy and model interpretability can be enhanced by (i) using easily accessible predictors and (ii) developing a linear multi-task learning model with knowledge sharing amongst water depths.

The problem characteristics of multiple water depths and common water-quality predictors naturally correspond to a class of intelligent models, namely, multi-task learning. Multi-task learning is a transfer learning paradigm in the machine learning field (Pan and Yang, 2009). The core idea of multi-task learning lies in sharing the domain information of related tasks to improve learning performance (Zhang and Yang, 2017). It has been widely used in many fields, such as spatial-temporal event forecasting at multiple sites (Zhao et al., 2015), water quality prediction for urban waste-water stations (Liu et al., 2016), lake water quality prediction at macroscales (Collins et al., 2019), and energy demand forecasting (Tan et al., 2020). A multi-task regression can be implemented by either linear or nonlinear models, in which knowledge (e.g. predictors, features, and parameters) can be shared either implicitly or explicitly (Pan and Yang, 2009). In this study, a linear multi-task regression model with explicit knowledge sharing is extended to improve model interpretability. To date, the multi-task modelling paradigm has not been developed for cyanobacterial distribution forecasting along the water column. In this study, unlike traditional single-task regression modelling that focuses only on a single depth without any relation to other depths, a feature reconstruction-based multi-task regression (FRMTR) model is developed to fuse data, share knowledge amongst water depths, and forecast the distribution of cyanobacteria cell density along the water column. In the proposed FRMTR model, high-level features are constructed from easily accessible predictors to enhance the forecasting ability. More details are explained in Section 2.2. The forecasting accuracy of the FRMTR model is compared with advanced models, including ANN and APSO-SVR. To evaluate the applicability of the proposed model, a whole year of data at 11 water depths of a tropical lake in Singapore was first provided by Singapore’s National Water Agency. These data are employed for both the model evaluation and the mechanism understanding of the forecasting model and cyanobacterial distributions along the water column.
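
As a rough illustration of explicit knowledge sharing in a linear multi-task regression, the sketch below couples per-depth weight vectors by penalising their distance from the cross-depth mean. This mean-regularised formulation is a generic stand-in chosen for brevity and is not the FRMTR model, whose formulation is left to Section 2.2. The function names, the penalty weight `lam`, the toy predictors, and the three hypothetical "depths" are all assumptions made for illustration.

```python
# Mean-regularised linear multi-task regression (illustrative stand-in, not FRMTR).
# Objective: sum_t mean_i (y_ti - w_t . x_i)^2 + lam * sum_t ||w_t - w_bar||^2,
# where w_bar is the average weight vector over all tasks (water depths).

def fit_mean_regularised(X, Y, lam, lr=0.05, steps=2000):
    """X: n rows of d shared predictors; Y: one list of n targets per task."""
    n, d, n_tasks = len(X), len(X[0]), len(Y)
    W = [[0.0] * d for _ in range(n_tasks)]
    for _ in range(steps):
        w_bar = [sum(w[j] for w in W) / n_tasks for j in range(d)]
        new_W = []
        for t in range(n_tasks):
            resid = [sum(W[t][j] * X[i][j] for j in range(d)) - Y[t][i]
                     for i in range(n)]
            grad = [(2.0 / n) * sum(resid[i] * X[i][j] for i in range(n))
                    + 2.0 * lam * (W[t][j] - w_bar[j])  # pull towards the shared mean
                    for j in range(d)]
            new_W.append([W[t][j] - lr * grad[j] for j in range(d)])
        W = new_W
    return W


def spread(W):
    """Sum of squared distances of the task weights from their mean."""
    d = len(W[0])
    w_bar = [sum(w[j] for w in W) / len(W) for j in range(d)]
    return sum((w[j] - w_bar[j]) ** 2 for w in W for j in range(d))


# Hypothetical toy data: two predictors, three related "depths".
X = [[1, -1], [1, 1], [-1, 1], [-1, -1],
     [0.5, 0], [0, 0.5], [-0.5, 0], [0, -0.5]]
true_W = [[1.0, 0.5], [0.8, 0.6], [0.6, 0.7]]  # assumed, similar across depths
Y = [[sum(w[j] * x[j] for j in range(2)) for x in X] for w in true_W]

W_single = fit_mean_regularised(X, Y, lam=0.0)  # independent single-task fits
W_shared = fit_mean_regularised(X, Y, lam=2.0)  # coupled multi-task fit
print(spread(W_single), spread(W_shared))
```

With `lam=0` each depth is fitted on its own; a positive `lam` shrinks the per-depth coefficients towards a common vector, which is one simple way to make the sharing between tasks explicit and inspectable.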

Section snippets

Materials and methods

This part introduces the study site and related data in Section 2.1. The proposed model and its performance interpretability are presented in Section 2.2.

Results and discussion

Following a conventional 7:3 division for model training and testing, the time series data were divided into a training dataset (70%, i.e. the first 256 days) and a testing dataset (30%, i.e. the remaining 109 days). Other possible divisions, e.g. 6:4 or 8:2, should also be valid as the time series of the cyanobacteria transformation value in Fig. 3b is stationary. After feature reconstruction, all target values and features were standardised using their means and standard deviations. All forecasting
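
The split and standardisation described in this snippet can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the synthetic daily series are hypothetical, the 7:3 split is rounded up so that 365 days give 256 training days, and the means and standard deviations are taken from the training part only (the snippet does not say which part of the data they were computed on, nor whether a population or sample standard deviation was used).

```python
import statistics


def chronological_split(series, train_parts=7, total_parts=10):
    """Split a time-ordered sequence without shuffling; the earliest part trains."""
    # Integer ceiling of len * 7/10, which avoids floating-point rounding issues.
    n_train = -(-len(series) * train_parts // total_parts)
    return series[:n_train], series[n_train:]


def standardise(train, test):
    """Scale both parts with the training mean and population standard deviation."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    if std == 0:
        raise ValueError("training series is constant; cannot standardise")
    return ([(x - mean) / std for x in train],
            [(x - mean) / std for x in test])


# Hypothetical daily series covering one year (365 values).
daily_values = [float(day % 30) for day in range(365)]
train, test = chronological_split(daily_values)
z_train, z_test = standardise(train, test)
print(len(train), len(test))  # 256 109
```

Keeping the split chronological rather than random preserves the temporal order needed for one-day-ahead to three-day-ahead forecasting.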

Conclusion

Accurate cyanobacterial distribution forecasting contributes to the proactive control of water pollution and sustainable water consumption. A feature reconstruction-based multi-task regression model, namely FRMTR, has been built for cyanobacterial distribution forecasting along the water column. The model applicability has been demonstrated using one whole year of monitoring data from a tropical lake in Singapore. Compared to conventional single-task models, FRMTR shares knowledge amongst water

CRediT authorship contribution statement

Peng Jiang: Conceptualization, Data curation, Methodology, Formal analysis, Visualization, Writing - original draft, Writing - review & editing, Funding acquisition. Yibin Huang: Conceptualization, Methodology, Formal analysis, Visualization, Writing - original draft. Xiao Liu: Conceptualization, Methodology, Writing - review & editing, Validation, Funding acquisition. Jingjie Zhang: Data curation, Writing - review & editing, Validation, Funding acquisition. Karina Yew-Hoong Gin: Writing -

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme [grant number R-706-001-101-281], the China Postdoctoral Science Foundation [grant number 2018M640397], the National Natural Science Foundation of China [grant number 71673188], and the Shenzhen Science and Technology Innovation Commission [grant number KQJSCX20180322152024270]. The authors would like to

References (61)

  • G. Liu et al.

    Model optimization of SVM for a fermentation soft sensor

    Expert Syst. Appl.

    (2010)
  • F. Mattei et al.

    A depth-resolved artificial neural network model of marine phytoplankton primary production

    Ecol. Model.

    (2018)
  • Z. Ni et al.

    Pollution control and in situ bioremediation for lake aquaculture using an ecological dam

    J. Clean. Prod.

    (2018)
  • L.A. Nuamah et al.

    Constructed wetlands, status, progress, and challenges. The need for critical operational reassessment for a cleaner productive ecosystem

    J. Clean. Prod.

    (2020)
  • B. Qin et al.

    Cyanobacterial bloom management through integrated monitoring and forecasting in large shallow eutrophic Lake Taihu (China)

    J. Hazard Mater.

    (2015)
  • T. Rajaee et al.

    Forecasting of chlorophyll-a concentrations in South San Francisco Bay using five different models

    Appl. Ocean Res.

    (2015)
  • B.Z. Rousso et al.

    A systematic literature review of forecasting and predictive models for cyanobacteria blooms in freshwater lakes

    Water Res.

    (2020)
  • J. Shen et al.

    A data-driven modeling approach for simulating algal blooms in the tidal freshwater of James River in response to riverine nutrient loading

    Ecol. Model.

    (2019)
  • Z. Tan et al.

    Combined electricity-heat-cooling-gas load forecasting model for integrated energy system based on multi-task learning and least square support vector machine

    J. Clean. Prod.

    (2020)
  • W. Tian et al.

    An optimization of artificial neural network model for predicting chlorophyll dynamics

    Ecol. Model.

    (2017)
  • X. Wang et al.

    A comprehensive integrated catchment-scale monitoring and modelling approach for facilitating management of water quality

    Environ. Model. Software

    (2019)
  • Y. Wang et al.

    Detection of weak transient signals based on wavelet packet transform and manifold learning for rolling element bearing fault diagnosis

    Mech. Syst. Signal Process.

    (2015)
  • M.R. Williams et al.

    Uncertainty in nutrient loads from tile-drained landscapes: effect of sampling frequency, calculation algorithm, and compositing strategy

    J. Hydrol.

    (2015)
  • X. Xiao et al.

    A novel single-parameter approach for forecasting algal blooms

    Water Res.

    (2017)
  • L. You et al.

    Investigation of pharmaceuticals, personal care products and endocrine disrupting chemicals in a tropical urban catchment and the influence of environmental factors

    Sci. Total Environ.

    (2015)
  • M.A.A. Albadra et al.

    Extreme learning machine: a review

    Int. J. Appl. Eng. Res.

    (2017)
  • B.P. Bv et al.

    Computational performance analysis of neural network and regression models in forecasting the societal demand for agricultural food harvests

    Int. J. Grid High Perform. Comput. (IJGHPC)

    (2020)
  • C. Chen et al.

    Model-free emergency frequency control based on reinforcement learning

    IEEE Transactions on Industrial Informatics

    (2020)
  • Y.-W. Cheung et al.

    Lag order and critical values of the augmented Dickey–Fuller test

    J. Bus. Econ. Stat.

    (1995)
  • S. Collins et al.

    Winter precipitation and summer temperature predict lake water quality at macroscales

    Water Resour. Res.

    (2019)