Research paper
The effect of calibration data length on the performance of a conceptual hydrological model versus LSTM and GRU: A case study for six basins from the CAMELS dataset

https://doi.org/10.1016/j.cageo.2021.104708

Highlights

  • LSTM and GRU approach GR4H at a notable rate, considering their number of parameters.

  • GRU models benefit more from additional calibration data than LSTM models do.

  • Compared to LSTM, the calibration of GRU is more robust in the case of limited data.

Abstract

We systematically explore the effect of calibration data length on the performance of a conceptual hydrological model, GR4H, in comparison to two Artificial Neural Network (ANN) architectures that have only recently been introduced to the field of hydrology: Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU). We implemented a case study for six river basins across the contiguous United States, with 25 years of meteorological and discharge data. Nine years were reserved for independent validation, and two years served as warm-up periods (one each for the calibration and validation periods); from the remaining 14 years, we sampled increasing amounts of data for model calibration and found pronounced differences in model performance. While GR4H required less data to converge, LSTM and GRU caught up at a remarkable rate, considering their number of parameters. However, LSTM and GRU exhibited higher calibration instability than GR4H. These findings confirm the potential of modern deep-learning architectures in rainfall-runoff modelling, but also highlight notable differences between them with regard to the effect of calibration data length.

Introduction

According to Mount et al. (2016), a hydrological model is a functional relationship in which the model output (y) is determined by its structure (f), inputs (x), parameters (p), and the residual error ε between the modelled variable and its true (or observed) value:

y = f(x, p) + ε

Following Mount et al. (2016), conceptual hydrological models put a strong emphasis on structure as a result of knowledge-driven model reduction, of how we formalize our concept of a hydrological system. In contrast, data-driven methods trust in the explanatory power of the input data itself as they “simultaneously ‘discover’ the model structure and optimize its parameters [and minimize] the role of a priori hypotheses” (Mount et al., 2016). As a consequence, the number of parameters that need to be optimized (or calibrated) is relatively small for conceptual hydrological models (typically on the order of 10^0–10^2), and relatively high (on the order of 10^3–10^5) for data-driven models, such as Artificial Neural Networks (ANN).
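To illustrate this contrast with concrete numbers (a sketch for illustration, not figures from the paper): a conceptual model such as GR4H has four free parameters, while the trainable weights of recurrent layers grow with the hidden-state size. The textbook parameter-count formulas for LSTM and GRU layers are:

```python
def lstm_params(d: int, h: int) -> int:
    # An LSTM layer has 4 gate computations (input, forget, output, candidate),
    # each with an input weight matrix (h x d), a recurrent matrix (h x h),
    # and a bias vector (h).
    return 4 * (h * d + h * h + h)

def gru_params(d: int, h: int) -> int:
    # A GRU layer has 3 gate computations (update, reset, candidate),
    # so roughly 3/4 of the LSTM's weights for the same hidden size.
    # (Note: some implementations, e.g. Keras' default GRU variant,
    # add a second bias term per gate.)
    return 3 * (h * d + h * h + h)

# e.g. 3 meteorological inputs, 64 hidden units (hypothetical sizes)
print(lstm_params(3, 64))  # 17408
print(gru_params(3, 64))   # 13056
```

Either count is three to four orders of magnitude larger than GR4H's four parameters, which is the asymmetry the calibration-length experiment probes.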

Numerous studies have been devoted to different aspects of hydrological model calibration, such as the choice of the objective function (Fowler et al., 2018) or the use of auxiliary information (Nijzink et al., 2018). Yet, the majority of hydrological models are still calibrated against discharge time series, using squared-error-based objective functions (Kisi et al., 2013; Arsenault et al., 2018), and their validity requires examination on independent data (Refsgaard et al., 2005). To that end, split-sampling techniques have been suggested by various authors as a basis for calibration-validation experiments, as diagnostic tools to assess the reliability of models and the robustness of their parameterization (e.g., Coron et al., 2012; Kisi et al., 2012; Thirel et al., 2015; Motavita et al., 2019).

In light of limited data availability for calibration and validation, such techniques also provide a means to study the effect of the amount (or length) of calibration data on the reliability of models. A fair number of studies have investigated this effect for conceptual hydrological models (e.g., Sorooshian et al., 1983; Yapo et al., 1996; Boughton, 2007; Perrin et al., 2007; Li et al., 2010; Arsenault et al., 2018; Motavita et al., 2019). For data-driven ANN models, Anctil et al. (2004), Cigizoglu and Kisi (2005), and Toth and Brath (2007) investigated the effect of calibration data in the context of runoff forecasting, i.e. on the skill of runoff prediction at lead times of several days, using observed runoff and meteorological forcing as predictors. Rainfall-runoff modelling, however, aims to predict runoff from meteorological variables only. This is a different problem, and we found only one study – conducted by Gauch et al. (2019) – that investigates the effect of calibration data length on the performance of continental-scale data-driven rainfall-runoff models (namely, Extreme Gradient Boosting and Long Short-Term Memory networks). However, that study does not compare this effect for conceptual hydrological models versus various data-driven models at the basin scale, which is the gap that we address with the present study.

Thus, the objective of this study is to investigate the effect of calibration data length on the ability of a conceptual hydrological model (GR4H) and two data-driven ANN models to predict runoff from meteorological forcing (i.e., precipitation, temperature, and potential evaporation) at an hourly resolution. We do not limit our analysis to a single ANN architecture. Instead, we include two modern deep-learning architectures, Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU), which have only recently been introduced to the field of hydrological modelling (Marçais and de Dreuzy, 2017; Kratzert et al., 2018; Shen, 2018; Xu and Niu, 2018; Reichstein et al., 2019). Specifically, Kratzert et al. (2018, 2019) have identified LSTM as superior to traditional ANN architectures and to a conceptual hydrological model. Therefore, we specifically examine to what extent this superiority could depend on the amount of data that is available for model calibration.

Furthermore, our experimental design (Sect. 2.2) allows us to investigate the effect of model instability, i.e., the sensitivity of model performance, as measured over the validation period, to the choice of calibration data. This instability reflects both the variability between calibration sub-periods of the same length and the variability between sub-periods of different lengths.

Altogether, this study aims to provide a new perspective on the role of calibration data length, and to serve as a guideline for establishing corresponding calibration-validation experiments. We start with an outline of the experimental setup as well as the underlying data for the meteorological forcing and the observed discharge (Sect. 2); the different models are described in Sect. 3, followed by details on model calibration (Sect. 3.3). We then provide the results of our experiments in Sect. 4 and conclude with Sect. 5.

Section snippets

Study area and data

In this study, the CAMELS dataset (Newman et al., 2015; Addor et al., 2017) of discharge and meteorological time series, as well as attributes of 671 river basins with relatively small human influence in the contiguous United States serves as a basis for basin selection. Additionally, we use the results by Kratzert et al. (2020), who presented the streamflow simulation efficiency of several hydrological models for 531 CAMELS basins at daily temporal resolution as a proxy estimate of

GR4H hydrological model

GR4H is a process-based, lumped, conceptual hydrological model, which was developed at Irstea by Mathevet (2005) for runoff predictions at an hourly resolution (Fig. 2). GR4H has four free parameters (Table 2, X1–X4) and mainly mimics the structure of the daily GR4J model (Perrin et al., 2003): precipitation is partitioned into two components, the production store (S; parameter X1 describes its capacity), and effective rainfall. Water which percolates from the production store merges with the

Results and discussion

Fig. 5 illustrates the main results of our study regarding the effect of calibration data length on the performance of hourly runoff simulations by different models and for different basins as well. The figure shows the change of NSE over the independent 9-year validation period (2011–2019) as a function of the number of calibration years n. Each of the solid lines represents the average NSE value over the available combinations of sub-periods of length n used for calibration. The shades
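The NSE (Nash–Sutcliffe efficiency) used as the performance score above can be computed as in this minimal NumPy sketch (a standard formulation, not code from the paper's repository):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the model's
    squared error to the variance of the observations around their mean.
    NSE = 1 is a perfect fit; NSE = 0 means the model performs no better
    than predicting the mean observed flow."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

print(nse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```

Averaging this score over all sub-periods of a given calibration length n, as Fig. 5 does, separates the systematic effect of data length from the sub-period-to-sub-period scatter.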

Conclusions

In a recent review, Dramsch (2020) has pointed out that “machine learning has seen the fastest successes in domains where decisions are cheap (e.g., click advertising), data are readily available (e.g., online shops), and the environment is simple (e.g., games) or unconstrained (e.g., image generation). Geoscience generally is at the opposite […]. Decisions are expensive, […]. Data are expensive, sparse, and noisy. The environment is heterogeneous and constrained by physical limitations.” The

Funding

G.A. was financially supported by Geo.X, the Research Network for Geosciences in Berlin and Potsdam, and by ClimXtreme – the Research Network on Climate Change and Extreme Events.

Data availability

Discharge data as well as meteorological forcing data for the studied basins are available in the corresponding repository on Github (https://github.com/hydrogo/KALIv2, accessed 02.09.2020).

Computer code availability

We support this paper by making the entire computational workflow available online in the corresponding repository on Github (https://github.com/hydrogo/KALIv2, accessed 02.09.2020). The code is written in the Python 3 programming language (https://www.python.org/, accessed 02.09.2020) and is based on open-source software libraries, namely Numpy (Oliphant, 2006), Pandas (McKinney et al., 2010), Scipy (Virtanen et al., 2020), Numba (Lam et al., 2015), Tensorflow (Abadi et al., 2015), and Keras (

CRediT authorship contribution statement

Georgy Ayzel: ran the experiments, performed the analysis, and wrote the manuscript. Maik Heistermann: Co-designed and supervised the study, co-wrote the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank Frederik Kratzert and two other anonymous referees for constructive and critical comments which substantially contributed to the development of the study and the manuscript.

References (63)

  • M. Abadi et al.

    TensorFlow: large-scale Machine Learning on Heterogeneous Systems

    (2015)
  • R.J. Abrahart et al.

    Two decades of anarchy? Emerging themes and outstanding challenges for neural network river forecasting

    Prog. Phys. Geography-Earth Environ.

    (2012)
  • N. Addor et al.

    The camels data set: catchment attributes and meteorology for large-sample studies

    Hydrol. Earth Syst. Sci.

    (2017)
  • F. Anctil et al.

    Impact of the length of observed records on the performance of ann and of conceptual parsimonious rainfall-runoff forecasting models

    Environ. Model. Software

    (2004)
  • R. Arsenault et al.

    The hazards of split-sample validation in hydrological model calibration

    J. Hydrol.

    (2018)
  • G. Ayzel

    KALIv2: calibration and validation experiment results

  • Y. Bengio et al.

    Learning long-term dependencies with gradient descent is difficult

    IEEE Trans. Neural Network.

    (1994)
  • T. de Boer-Euser et al.

    Looking beyond general metrics for model comparison – lessons from an international model intercomparison study

    Hydrol. Earth Syst. Sci.

    (2017)
  • W. Boughton

    Effect of data length on rainfall-runoff modelling

    Environ. Model. Software

    (2007)
  • K. Cho et al.

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    (2014)
  • F. Chollet

    Keras

    (2015)
  • H.K. Cigizoglu et al.

    Flow prediction by three back propagation techniques using k-fold partitioning of neural network training data

    Nord. Hydrol

    (2005)
  • L. Coron et al.

    Crash testing hydrological models in contrasted climate conditions: an experiment on 216 australian catchments

    Water Resour. Res.

    (2012)
  • J.S. Dramsch

    Chapter one - 70 years of machine learning in geoscience in review

  • W.R. van Esse et al.

    The influence of conceptual model structure on model performance: a comparative study for 237 French catchments

    Hydrol. Earth Syst. Sci.

    (2013)
  • K. Fowler et al.

    Improved rainfall-runoff calibration for drying climate: choice of objective function

    Water Resour. Res.

    (2018)
  • M. Gauch et al.

    The Proper Care and Feeding of Camels: How Limited Training Data Affects Streamflow Prediction

    (2019)
  • I. Goodfellow et al.

    Deep Learning

    (2016)
  • H.V. Gupta et al.

    Reconciling theory with observations: elements of a diagnostic approach to model evaluation

    Hydrol. Process.

    (2008)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • J.D. Hunter

    Matplotlib: a 2d graphics environment

    Comput. Sci. Eng.

    (2007)
  • S.K. Jain et al.

    Fitting of hydrologic models: a close look at the nash–sutcliffe index

    J. Hydrol. Eng.

    (2008)
  • O. Kisi et al.

    Forecasting daily lake levels using artificial intelligence approaches

    Comput. Geosci.

    (2012)
  • O. Kisi et al.

    Modeling rainfall-runoff process using soft computing techniques

    Comput. Geosci.

    (2013)
  • W.J.M. Knoben et al.

    Technical note: inherent benchmark or not? comparing nash–sutcliffe and kling–gupta efficiency scores

    Hydrol. Earth Syst. Sci.

    (2019)
  • F. Kratzert et al.

    Rainfall–runoff modelling using long short-term memory (lstm) networks

    Hydrol. Earth Syst. Sci.

    (2018)
  • F. Kratzert et al.

    Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets

    Hydrol. Earth Syst. Sci.

    (2019)
  • F. Kratzert et al.

    A note on leveraging synergy in multiple meteorological datasets with deep learning for rainfall-runoff modeling

    Hydrol. Earth Syst. Sci. Discuss.

    (2020)
  • S.K. Lam et al.

    Numba: a llvm-based python jit compiler

  • Y. LeCun et al.

    Deep learning

    Nature

    (2015)
  • C.z. Li et al.

    Effect of calibration data series length on performance and optimal parameters of hydrological model

    Water Sci. Eng.

    (2010)