Research paper
The effect of calibration data length on the performance of a conceptual hydrological model versus LSTM and GRU: A case study for six basins from the CAMELS dataset

https://doi.org/10.1016/j.cageo.2021.104708

Highlights

  • LSTM and GRU approach GR4H at a notable rate, considering their number of parameters.

  • GRU models benefit more from additional calibration data than LSTM models do.

  • Compared to LSTM, the calibration of GRU is more robust in the case of limited data.

Abstract

We systematically explore the effect of calibration data length on the performance of a conceptual hydrological model, GR4H, in comparison to two Artificial Neural Network (ANN) architectures that have only recently been introduced to the field of hydrology: Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU). We implemented a case study for six river basins across the contiguous United States, with 25 years of meteorological and discharge data. Nine years were reserved for independent validation, and two years served as warm-up periods (one each for the calibration and validation periods); from the remaining 14 years, we sampled increasing amounts of data for model calibration and found pronounced differences in model performance. While GR4H required less data to converge, LSTM and GRU caught up at a remarkable rate, considering their number of parameters. However, LSTM and GRU exhibited higher calibration instability than GR4H. These findings confirm the potential of modern deep-learning architectures in rainfall-runoff modelling, but also highlight notable differences between them with regard to the effect of calibration data length.

Introduction

According to Mount et al. (2016), a hydrological model is a functional relationship in which the model output (y) is determined by its structure (f), inputs (x), parameters (p), and the residual error ε between the modelled variable and its true (or observed) value:

y = f(x, p) + ε

Following Mount et al. (2016), conceptual hydrological models put a strong emphasis on structure as a result of knowledge-driven model reduction, of how we formalize our concept of a hydrological system. In contrast, data-driven methods trust in the explanatory power of the input data itself as they “simultaneously ‘discover’ the model structure and optimize its parameters [and minimize] the role of a priori hypotheses” (Mount et al., 2016). As a consequence, the number of parameters that need to be optimized (or calibrated) is relatively small for conceptual hydrological models (typically on the order of 10^0–10^2), and relatively high (on the order of 10^3–10^5) for data-driven models, such as Artificial Neural Networks (ANN).
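To illustrate this contrast with concrete numbers (a sketch for illustration, not figures from the paper): a conceptual model such as GR4H has four free parameters, while the trainable weights of recurrent layers grow with the hidden-state size. The textbook parameter-count formulas for LSTM and GRU layers are:

```python
def lstm_params(d: int, h: int) -> int:
    # An LSTM layer has 4 gate computations (input, forget, output, candidate),
    # each with an input weight matrix (h x d), a recurrent matrix (h x h),
    # and a bias vector (h).
    return 4 * (h * d + h * h + h)

def gru_params(d: int, h: int) -> int:
    # A GRU layer has 3 gate computations (update, reset, candidate),
    # so roughly 3/4 of the LSTM's weights for the same hidden size.
    # (Note: some implementations, e.g. Keras' default GRU variant,
    # add a second bias term per gate.)
    return 3 * (h * d + h * h + h)

# e.g. 3 meteorological inputs, 64 hidden units (hypothetical sizes)
print(lstm_params(3, 64))  # 17408
print(gru_params(3, 64))   # 13056
```

Either count is three to four orders of magnitude larger than GR4H's four parameters, which is the asymmetry the calibration-length experiment probes.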

Numerous studies have been devoted to different aspects of hydrological model calibration, such as the choice of the objective function (Fowler et al., 2018) or the use of auxiliary information (Nijzink et al., 2018). Yet, the majority of hydrological models are still calibrated against discharge time series, using squared-error-based objective functions (Kisi et al., 2013; Arsenault et al., 2018), and their validity requires examination on independent data (Refsgaard et al., 2005). To that end, split-sampling techniques have been suggested by various authors as a basis for calibration-validation experiments, as diagnostic tools to assess the reliability of models and the robustness of their parameterization (e.g., Coron et al., 2012; Kisi et al., 2012; Thirel et al., 2015; Motavita et al., 2019).

In light of limited data availability for calibration and validation, such techniques also provide a means to study the effect of the amount (or length) of calibration data on the reliability of models. A fair number of studies have investigated this effect for conceptual hydrological models (e.g., Sorooshian et al., 1983; Yapo et al., 1996; Boughton, 2007; Perrin et al., 2007; Li et al., 2010; Arsenault et al., 2018; Motavita et al., 2019). For data-driven ANN models, Anctil et al. (2004), Cigizoglu and Kisi (2005), and Toth and Brath (2007) investigated the effect of calibration data in the context of runoff forecasting, i.e. on the skill of runoff prediction at lead times of several days, using observed runoff and meteorological forcing as predictors. Rainfall-runoff modelling, however, aims to predict runoff from meteorological variables only. This is a different problem, and we found only one study – conducted by Gauch et al. (2019) – that investigates the effect of calibration data length on the performance of continental-scale data-driven rainfall-runoff models (namely, Extreme Gradient Boosting and Long Short-Term Memory networks). However, that study does not compare this effect for conceptual hydrological models versus various data-driven models at the basin scale, which is the gap that we address with the present study.

Thus, the objective of this study is to investigate the effect of calibration data length on the ability of a conceptual hydrological model (GR4H) and two data-driven ANN models to predict runoff from meteorological forcing (i.e., precipitation, temperature, and potential evaporation) at an hourly resolution. We do not limit our analysis to a single ANN architecture. Instead, we include two modern deep-learning architectures, Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU), which have only recently been introduced to the field of hydrological modelling (Marçais and de Dreuzy, 2017; Kratzert et al., 2018; Shen, 2018; Xu and Niu, 2018; Reichstein et al., 2019). Specifically, Kratzert et al. (2018, 2019) have identified LSTM as superior to traditional ANN architectures and to a conceptual hydrological model. Therefore, we specifically examine to what extent this superiority could depend on the amount of data that is available for model calibration.

Furthermore, our experimental design (Sect. 2.2) allows us to investigate the effect of model instability, i.e., the sensitivity of model performance, as measured over the validation period, to the choice of calibration data. This instability reflects both the variability between calibration sub-periods of the same length and the variability between sub-periods of different lengths.

Altogether, this study aims to provide a new perspective on the role of calibration data length, and to serve as a guideline for establishing corresponding calibration-validation experiments. We start with an outline of the experimental setup as well as the underlying data for the meteorological forcing and the observed discharge (Sect. 2); the different models are described in Sect. 3, followed by details on model calibration (Sect. 3.3). We then provide the results of our experiments in Sect. 4 and conclude with Sect. 5.

Section snippets

Study area and data

In this study, the CAMELS dataset (Newman et al., 2015; Addor et al., 2017) of discharge and meteorological time series, as well as attributes of 671 river basins with relatively small human influence in the contiguous United States serves as a basis for basin selection. Additionally, we use the results by Kratzert et al. (2020), who presented the streamflow simulation efficiency of several hydrological models for 531 CAMELS basins at daily temporal resolution as a proxy estimate of

GR4H hydrological model

GR4H is a process-based, lumped, conceptual hydrological model, which was developed at Irstea by Mathevet (2005) for runoff predictions at an hourly resolution (Fig. 2). GR4H has four free parameters (Table 2, X1–X4) and mainly mimics the structure of the daily GR4J model (Perrin et al., 2003): precipitation is partitioned into two components, the production store (S; parameter X1 describes its capacity), and effective rainfall. Water which percolates from the production store merges with the

Results and discussion

Fig. 5 illustrates the main results of our study regarding the effect of calibration data length on the performance of hourly runoff simulations by different models and for different basins as well. The figure shows the change of NSE over the independent 9-year validation period (2011–2019) as a function of the number of calibration years n. Each of the solid lines represents the average NSE value over the available combinations of sub-periods of length n used for calibration. The shades
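The NSE (Nash–Sutcliffe efficiency) used as the performance score above can be computed as in this minimal NumPy sketch (a standard formulation, not code from the paper's repository):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the model's
    squared error to the variance of the observations around their mean.
    NSE = 1 is a perfect fit; NSE = 0 means the model performs no better
    than predicting the mean observed flow."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

print(nse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```

Averaging this score over all sub-periods of a given calibration length n, as Fig. 5 does, separates the systematic effect of data length from the sub-period-to-sub-period scatter.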

Conclusions

In a recent review, Dramsch (2020) has pointed out that “machine learning has seen the fastest successes in domains where decisions are cheap (e.g., click advertising), data are readily available (e.g., online shops), and the environment is simple (e.g., games) or unconstrained (e.g., image generation). Geoscience generally is at the opposite […]. Decisions are expensive, […]. Data are expensive, sparse, and noisy. The environment is heterogeneous and constrained by physical limitations.” The

Funding

G.A. was financially supported by Geo.X, the Research Network for Geosciences in Berlin and Potsdam, and by ClimXtreme – the Research Network on Climate Change and Extreme Events.

Data availability

Discharge data as well as meteorological forcing data for the studied basins are available in the corresponding repository on Github (https://github.com/hydrogo/KALIv2, accessed 02.09.2020).

Computer code availability

We support this paper by making the entire computational workflow available online in the corresponding repository on Github (https://github.com/hydrogo/KALIv2, accessed 02.09.2020). The code is written in the Python 3 programming language (https://www.python.org/, accessed 02.09.2020) and is based on open-source software libraries, namely Numpy (Oliphant, 2006), Pandas (McKinney et al., 2010), Scipy (Virtanen et al., 2020), Numba (Lam et al., 2015), Tensorflow (Abadi et al., 2015), and Keras (

CRediT authorship contribution statement

Georgy Ayzel: ran the experiments, performed the analysis, and wrote the manuscript. Maik Heistermann: Co-designed and supervised the study, co-wrote the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank Frederik Kratzert and two other anonymous referees for constructive and critical comments which substantially contributed to the development of the study and the manuscript.

References (63)

  • M. Abadi et al.

    TensorFlow: large-scale Machine Learning on Heterogeneous Systems

    (2015)
  • R.J. Abrahart et al.

    Two decades of anarchy? Emerging themes and outstanding challenges for neural network river forecasting

    Prog. Phys. Geography-Earth Environ.

    (2012)
  • N. Addor et al.

    The camels data set: catchment attributes and meteorology for large-sample studies

    Hydrol. Earth Syst. Sci.

    (2017)
  • F. Anctil et al.

    Impact of the length of observed records on the performance of ann and of conceptual parsimonious rainfall-runoff forecasting models

    Environ. Model. Software

    (2004)
  • R. Arsenault et al.

    The hazards of split-sample validation in hydrological model calibration

    J. Hydrol.

    (2018)
  • G. Ayzel

    KALIv2: calibration and validation experiment results

  • Y. Bengio et al.

    Learning long-term dependencies with gradient descent is difficult

    IEEE Trans. Neural Network.

    (1994)
  • T. de Boer-Euser et al.

    Looking beyond general metrics for model comparison – lessons from an international model intercomparison study

    Hydrol. Earth Syst. Sci.

    (2017)
  • W. Boughton

    Effect of data length on rainfall-runoff modelling

    Environ. Model. Software

    (2007)
  • K. Cho et al.

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    (2014)
  • F. Chollet

    Keras

    (2015)
  • H.K. Cigizoglu et al.

    Flow prediction by three back propagation techniques using k-fold partitioning of neural network training data

    Nord. Hydrol

    (2005)
  • L. Coron et al.

    Crash testing hydrological models in contrasted climate conditions: an experiment on 216 australian catchments

    Water Resour. Res.

    (2012)
  • J.S. Dramsch

    Chapter one - 70 years of machine learning in geoscience in review

  • W.R. van Esse et al.

    The influence of conceptual model structure on model performance: a comparative study for 237 French catchments

    Hydrol. Earth Syst. Sci.

    (2013)
  • K. Fowler et al.

    Improved rainfall-runoff calibration for drying climate: choice of objective function

    Water Resour. Res.

    (2018)
  • M. Gauch et al.

    The Proper Care and Feeding of Camels: How Limited Training Data Affects Streamflow Prediction

    (2019)
  • I. Goodfellow et al.

    Deep Learning

    (2016)
  • H.V. Gupta et al.

    Reconciling theory with observations: elements of a diagnostic approach to model evaluation

    Hydrol. Process.

    (2008)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • J.D. Hunter

    Matplotlib: a 2d graphics environment

    Comput. Sci. Eng.

    (2007)
  • S.K. Jain et al.

    Fitting of hydrologic models: a close look at the nash–sutcliffe index

    J. Hydrol. Eng.

    (2008)
  • O. Kisi et al.

    Forecasting daily lake levels using artificial intelligence approaches

    Comput. Geosci.

    (2012)
  • O. Kisi et al.

    Modeling rainfall-runoff process using soft computing techniques

    Comput. Geosci.

    (2013)
  • W.J.M. Knoben et al.

    Technical note: inherent benchmark or not? comparing nash–sutcliffe and kling–gupta efficiency scores

    Hydrol. Earth Syst. Sci.

    (2019)
  • F. Kratzert et al.

    Rainfall–runoff modelling using long short-term memory (lstm) networks

    Hydrol. Earth Syst. Sci.

    (2018)
  • F. Kratzert et al.

    Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets

    Hydrol. Earth Syst. Sci.

    (2019)
  • F. Kratzert et al.

    A note on leveraging synergy in multiple meteorological datasets with deep learning for rainfall-runoff modeling

    Hydrol. Earth Syst. Sci. Discuss.

    (2020)
  • S.K. Lam et al.

    Numba: a llvm-based python jit compiler

  • Y. LeCun et al.

    Deep learning

    Nature

    (2015)
  • C.z. Li et al.

    Effect of calibration data series length on performance and optimal parameters of hydrological model

    Water Sci. Eng.

    (2010)