Introduction

Estimating gas fluxes across the air–water interface in lakes is fundamental for understanding their biogeochemical, environmental and ecological functioning. Accurate estimates of lakes carbon (CO2 and CH4), nitrogen (N2 and N2O) and oxygen (O2) fluxes with the atmosphere are key to constrain their biogeochemical cycles (Likens 2010), quantify whole lake metabolism (Dugan et al. 2016) and evaluate greenhouse gas emissions at regional and global scales (Raymond et al. 2013; Soued et al. 2015; Hastie et al. 2017). Estimating air–water fluxes of environmental aquatic contaminants (Hornbuckle et al. 1994; Bidleman 1999; Poissant et al. 2000) and biogenic volatile organic compounds (VOC) in lakes (Fink 2007) is also critical for preserving lake ecosystem services. Gas flux across the air–water interface (F) is described by using Fick’s first law of diffusion as the difference in the surface water (Cwtr) and air-equilibrium (Ceq) gas concentrations, multiplied by the air–water gas transfer velocity (k):

$$F = k \cdot \left( {C_{wtr} - C_{eq} } \right),$$
(1)

where positive \(F\) implies flux from the water to the atmosphere. There are methods to estimate gas fluxes such as the use of floating chambers (Engle and Melack 2000; Matthews et al. 2003) or by eddy covariance (Anderson et al. 1999). However, these methods can be time and/or cost consuming and may be difficult to be applied to capture potential spatial variation in multiple systems at the same time (Cole et al. 2010; Schilder et al. 2013; Erkkilä et al. 2018). Alternatively, fluxes can be modelled using Eq. 1, based on gas concentrations and k. However, while gas concentrations are usually relatively straightforward to measure, k is by far more difficult to estimate with high confidence.

The air–water gas transfer velocity of sparingly soluble gases is driven by near-surface water turbulence, which in lakes is mainly generated by wind stress over the lake surface and thermal convection when colder waters masses at the surface sink due to greater density (MacIntyre et al. 1995). Wind is the main source of near-surface turbulence in many lakes (Read et al. 2012), however, the efficiency of this wind-to-turbulence transfer may be modulated by several lake characteristics and other hydrodynamic processes. For a given wind speed, larger fetch lengths will result in larger wave heights, greater turbulence, and thus higher k values (Schilder et al. 2013; Vachon and Prairie 2013; Gålfalk et al. 2013). Surface heat flux, which determines whether a lake is warming or cooling, also affects near-surface turbulence. A warming surface water will stratify thermally and will suppress the wind-driven turbulence (negative buoyancy flux), while a cooling lake surface will generate additional turbulence from the convective movements of water masses (positive buoyancy flux) (MacIntyre et al. 2010). This effect is related to lake area and latitude (Read et al. 2012). Using wind to predict k in lakes is thus far from being direct, however, compared with turbulence measurements, wind speed is relatively easy to measure and therefore the most widely used predictor in empirical models of k for lakes.

Several empirical wind-based k models are currently available for lakes, with k often standardized to a Schmidt number of 600 (k600) to characterize CO2 transfer at 20 °C water temperature. However, each model was calibrated in a distinct system, under specific conditions, and using different methods to determine k600. This results in a wide range of model parameterizations (Table 1). The most common methods to derive k600 include floating chambers (e.g. Vachon and Prairie 2013), mass balance approaches (e.g. Cole and Caraco 1998) and the eddy covariance technique (e.g. MacIntyre et al. 2010). These methods are fundamentally different in their approach, in addition to having specific issues that may affect the resulting k600 estimates (see MacIntyre et al. (1995), Jähne and Haussecker (1998), Wanninkhof et al. (2009) and Cole et al. (2010) for more detailed discussions). A key difference between methods, however, is related to their scales of spatial and temporal integration (SIN and TIN, respectively), ranging from centimeters and minutes (floating chambers); to meters and hours (eddy covariance technique) and the whole lake and days (mass balance approach) (Fig. 1a). The different SIN and TIN will have implications for the measured k600 for a given wind speed. For example, when the relationship between k600 and wind speed has a positive curvature, the average k600 derived over a period of variable winds will be greater than over a period of steady winds of the same average wind speed (Wanninkhof et al. 1987; Livingstone and Imboden 1993). Longer TIN may thus increase its estimated value for a given averaged wind speed. The effect of SIN is less well understood and documented (Schilder et al. 2013; Paranaíba et al. 2018). Wind measured in the center of the lake may potentially record higher wind speeds than the average wind speed integrated over the whole lake area (Venäläinen et al. 2003; Docquier et al. 2016). This intra-lake spatial variability depends on the size and shape of the lake and whether the lake is sheltered by trees, mountains or buildings (Kwan and Taylor 1994; Markfort et al. 2010; Vachon and Prairie 2013). As a result, complex-shape lakes should, in theory, have greater within-lake spatial variability in wind-driven turbulence and thus k600. The average k600 derived from mass balances (whole-lake, i.e. SIN = 1) should thus be lower than the measured k600 at the center of the lake from the eddy covariance technique (SIN < 1) and floating chambers (SIN << 1) (Fig. 1c).

Table 1 Models for predicting the air–water gas transfer velocity (k600, in cm h−1) based on wind speed at 10 m height (U10, in m s−1) alone or in combination with lake area (LA, in km2)
Fig. 1
figure 1

Spatial and temporal scales, and the prediction space of common wind-based air–water gas transfer velocity (k) models in lakes. a Models vary in the space- (SIN) and time integration (TIN) of the underlying method to measure k, and b in the surface area (LA) and the shoreline development index (SDI) of the lakes they are calibrated for. In a and b, dashed and dotted lines cover the range of conditions in case they varied throughout the training dataset. c For a given dominant wind direction lake surface turbulence will be heterogeneous due to lake shape, size and sheltering effect. The different spatial integrations of k measurements between floating chambers (red area), eddy covariance (white shaded area), and mass balance (whole lake area) should result in different average k values. CC98 Cole and Caraco (1998), VP13 Vachon and Prairie (2013), CW03 Crusius and Wanninkhof (2003), M10 MacIntyre et al. (2010), G07 Guérin et al. (2007) and L18 Li (2018) (color figure onine)

Available wind-based k600 models also differ in the number and geometrical dimensions of the lakes they are calibrated against, in particular the surface area (LA) and the shoreline development index (SDI), the latter describing the ratio of the shoreline length of a given lake relative to the shoreline length of a circular lake of the same area (Fig. 1b). Each model has been calibrated under specific conditions. How the models perform under different conditions still remains widely unknown. This uncertainty is typically dealt with by averaging across predictions from a number of different k600 models (Raymond et al. 2013; Hastie et al. 2017). However, model averaging does not necessarily reduce prediction errors nor provide more accurate estimates (Dormann et al. 2018). Existing uncertainties in k600 modelling call for a systematic evaluation of the context-dependence of the bias of currently used k600 models and attempts to develop new models in order to test whether uncertainties can be reduced and quantified relative to environmental conditions.

Our first aim is to evaluate the predictions of a selection of currently available wind-based k600 models using published k600 measurements in global lakes. We selected six commonly used or recently published models that differ in their methods of k600 measurements, their SIN and TIN and the geometry of the lake under which the models were calibrated (Fig. 1 and Table 1). Specifically, we assess context-dependent model prediction performance using multivariate regression tree analyses and discuss how the environmental conditions (average wind speed), the geometry (LA and SDI) and latitude (Lat) of the lakes studied, and the relevant scales of SIN and TIN related to the method affect the model performance. We hypothesized that none of the available models generally performs better than other models (H1), but that certain models outperform others under certain environmental conditions or system characteristics (H2). Our second aim is to explore new parametrizations using the existing lake k600 data that allow more flexibility by accounting for lake-specific effects and explicitly provide prediction errors. We hypothesized that the wind speed effect on k600 will vary among the different lakes due to their different LA, SDI, SIN or Lat (H3) and hence including these additional variables will generally improve k600 predictions among a wide range of lakes (H4).

Materials and methods

Data compilation and standardization

We compiled data on 2297 simultaneous estimates of k and wind speed from 79 global lakes and 10 reservoirs (Online Resources Table S1), however, only a subset was used to accommodate our analyses. The data were retrieved from all peer-reviewed scientific papers (n = 46) we could find via the search engine “web of science” and papers cited therein. As keywords, we used “lake”, “pond” or “reservoir” in combination with “wind”, and either “k600”, “gas exchange”, “piston velocity”, “reaeration” or “gas transfer”. We only included k estimates that were based on sparingly soluble gases where air–water gas exchange is dominantly water-side controlled. Data were either provided by corresponding authors, extracted from tables, or digitized from figures using the web tool WebPlotDigitizer (Rohatgi 2019) as indicated in Online Resources Table S1. All k estimates were derived from the floating chamber, mass balance or eddy covariance approach and based on a variety of tracer gases (carbon dioxide, methane, oxygen, radon, helium, neon, krypton, sulfur hexafluoride, mercury and propane). The eddy covariance data were binned according to wind speed following the methodology described in the respective original papers. We documented the method-specific SIN and TIN as described in Online Resources Text S1. SIN was expressed relative to the total lake surface area. To make the data comparable across studies, we used k600 and wind speed at 10 m height above ground (U10). If k values were not reported as k600, we converted the estimates of k for a given Schmidt number (Sc) to a Schmidt number of 600 \(k_{600} = k\left( {\frac{600}{{Sc}}} \right)^{ - n}\) where n was 2/3 for low wind speeds (U10 ≤ 3.7 m s−1) and 1/2 for high wind speeds (U10 > 3.7 m s−1) as suggested by Jähne et al. (1987). We used Schmidt number parameterizations by Wanninkhof (2014) for carbon dioxide, methane, oxygen, radon, helium (3He), neon, krypton, and sulfur hexafluoride, by Kuss et al. (2009) for mercury and by Witherspoon and Saraf (1965) for propane. Parameterizations were chosen for fresh- or saltwater, depending on the salinity classification in the original paper. For each lake, we also compiled Lat, LA and SDI, calculated following Hutchinson (1957):\(SDI = \frac{p}{{2\sqrt {\pi LA} }},\) where p is the lake perimeter. These measures were extracted from the HydroLakes dataset (for \(LA\)> 10 ha, Messager et al. (2016)), by on-screen digitization of lake surfaces in Google Maps, or, if available, as shown in maps provided in the original papers (for LA ≤10 ha). For Swedish lakes, we relied on the ViVaN dataset because of its higher accuracy and resolution relative to the HydroLakes dataset (Nisell et al. 2007).

General analytical approach

We first evaluated the predictions of each of the six k600 models listed in Table 1 relative to observed k600 in our dataset. We then analyzed context-dependent model biases using multivariate regression tree (MRT) analysis to identify experimental and method-specific conditions (average U10, LA, SDI, SIN, TIN, absolute value of Lat) under which certain k600 models would perform better than other models. Finally, we explored new k600 parametrizations using multivariate regression analysis. For these analyses, we only included those 46 lakes with at least six k600 observations each, and a total of 2222 observations. We chose this threshold to maximize the total number of observations and the number of lakes, but fulfill recommendations of a minimum larger than five observations per lake (Theall et al. 2011) and an average larger than 30 observations per lake (Scherbaum and Ferreter 2009, Online Resources Fig. S1).

Evaluating available wind-based k600 models

We evaluated models by comparing observed and predicted k600 based on four performance measures and summarized these in an integrated performance index. First, we calculated the root mean squared deviation (RMSD) of the predicted values relative to the 1:1 line of observed vs. predicted values (Piñeiro et al. 2008). The smaller the RMSD, the better is the model fit. Second, we calculated the coefficient of determination, \(R^{2} = 1 - \frac{{SS_{res} }}{{SS_{tot} }}\), where \(SS_{res} = \mathop \sum \nolimits_{i} \left( {y_{i} - f_{i} } \right)^{2}\) is the residual sum of square and \(SS_{tot} = \mathop \sum \nolimits_{i} \left( {y_{i} - \overline{y}_{i} } \right)^{2}\) is the total sum of squares with observed values yi, predicted values fi and the mean of the observed data \(\overline{y}_{i}\). To account for differences in model complexity, we further adjusted R2 for the number of predictor variables v relative to the number of observations \(m:R_{adj} ^{2} = 1 - (1 - R^{2} )(m - 1)/(m - v - 1). R_{adj} ^{2} = 1\) implies perfect fit, \(R_{adj}^{2} > 0\) implies the model fits the data better than the arithmetic mean of the data, and \(R_{adj}^{2} < 0\) implies that the arithmetic mean fits the data better than the model predictions. Third and fourth, we calculated the intercept and slope of the linear regression line of observed vs. predicted k600. We tested whether the intercept was significantly different from zero based on the significance of the intercept of the linear regression of observed vs. predicted k600. We also tested whether the slope was significantly different from 1 based on the significance of the slope of the linear regression of observed-minus-predicted vs. predicted k600. The larger the p-value of the intercept and slope, the closer the regression line of observed vs. predicted k600 is to the 1:1 line (intercept = 0, slope = 1). Linear regressions were fit using the ‘lm’ function in R 3.6.1 (R Core Team 2019). We ranked each model, with the highest rank assigned to the lowest RMSD, highest R2adj, and highest p-values of the intercept and slope tests explained above. We used the median rank among all performance measures as an integrated performance index scaling from 1 (best performance) to 6 (worst performance).

We used MRT analysis (De’Ath 2002) to evaluate if any models perform better than others under certain conditions. MRT forms groups (leaves) of lakes by repeated splitting of the data. Each split is defined by a simple rule based on specific conditions and is selected to cluster together lakes for which the different model performance patterns are similar. For example, a cluster could be created if under specific conditions hypothetical models A and B perform better than hypothetical models C and D. In practice, MRT clusters a matrix of dependent variables under the constraints of a matrix of independent variables. As dependent variables, we chose the integrated performance index of the different k600 models. As independent variables, we chose average U10, LA, SDI, the method-specific SIN and average TIN, and the absolute value of Lat. To account for potential variability in tree structures depending on the threshold number of lakes to be included per leaf, we fitted a series of trees with threshold numbers ranging from 1 to 23. We fitted MRTs using the ‘mvpart’ function of the R package ‘mvpart’ (De’Ath 2014), using the Euclidean distance as a dissimilarity index. We selected the best tree within one standard error of the overall best using tenfold cross-validation. We evaluated the extent to which the MRT can explain variability in model performance among lakes based on its R2 and tenfold cross-validated relative error. As an additional evaluation, we also performed an unconstrained cluster analysis using the same number of clusters and the same metric of dissimilarity as in the MRT analysis using the ‘kmeans’ function in R (Hartigan and Wong 1979). If the unconstrained cluster analysis yields a higher R2 than the MRT analysis, then unobserved factors account for additional variation in model performance (De’Ath 2002). We also tested if certain models perform differently within each leaf relative to grand means using a Kruskal–Wallis one-way analysis of variance (‘kruskal.test’ function in R). In case of significant overall differences (p < 0.05), we tested model-specific differences using pairwise Wilcoxon rank sum tests (‘pairwise.wilcox.test’ function in R). We chose nonparametric hypothesis tests to account for the relatively small sample size.

Parametrizing new wind-based k600 models

To explore new k600 parameterizations, we fitted a series of regression models to the dataset of lakes with at least six k600 observations. We fitted nonlinear mixed-effects models following the multilevel approach with cross-level interactions by Bryk and Raudenbush (1992). We used a 2-level model, where U10 was included to explain variation in k600 at the (within-lake) observation level and LA, Lat, SDI and SIN to explain variation in the U10- k600 relationships at the (among-) lake level. We tested three functional forms commonly used in the literature (Table 1): linear (\(k_{600} = a \cdot U_{10} + c\)), exponential (\(k_{600} = a \cdot exp(U_{10} \cdot b)\)) and power (\(k_{600} = a \cdot U_{10}^{b} + c\)). For each functional form we tested to what extent the slope (a) or shape (b) of the k600-U10 relationship is a function of either LA or SDI, based on previous evidence on the potential lake-specific transfer from wind to turbulence (Schilder et al. 2013; Vachon and Prairie 2013; Gålfalk et al. 2013). We also tested whether the offset (c) is a function of either LA or Lat, based on previous evidence of lake-specific additional wind-independent turbulence by e.g. heat fluxes (MacIntyre et al. 2010; Read et al. 2012). The effect of SIN has never been tested, therefore we allowed it to have an influence on all parameters (a, b and c). We fitted models that cover all combinations of these predictor variables, amounting to 16 (linear, exponential) or 64 (power) candidate models. We did not test for TIN effects because it would not be useful as a predictive variable and require additional but unavailable data on the frequency distribution of U10 for each time integration interval (Wanninkhof et al. 1987; Livingstone and Imboden 1993). We evaluated the fits of all candidate models using the Akaike Information Criterion (AIC) and selected, for each of the three model types, the model with the lowest AIC and all parameters significant as the final model. For the final models, we report the RMSD, R2adj, and slope and intercept of linear regressions of observed vs. predicted k600 values and present 95% confidence intervals for mean predictions, representing a typical lake, following an approach by Bolker (2008). See Online Resources Text S2 for more details on the modelling procedure.

Results

Data set

The compiled data covered k600 values from 0.01 to 57.62 cm h−1, U10 from 0 to 13 m s−1, LA from 181 m2 to 1342 km2 and SDI from 1.0 to 22.5 (Fig. 2). k600 increased with U10 and was generally < 10 cm h−1 for U10 < 2 m s−1 and > 10 cm h−1 for U10 > 8 m s−1. U10 increased generally with LA, but decreased for very large LA (Fig. 2b), because these had very high SDI (Online Resources Fig. S2). k600 was not strongly related to SDI (Fig. 2c). TIN varied from 0.0024 to 122 days and SIN varied from 5·10–11 to 1 (Online Resources Table S1).

Fig. 2
figure 2

Air–water gas transfer velocity (k600) plotted against wind speed at 10 m height (U10) (a), lake area (LA) (b), and shoreline development index (SDI) (c). Blue circles denote floating chamber measurements (FC), green crosses denote the mass balance approach (MB), and red triangles denote the eddy covariance technique (EC). Histograms show data distributions across the plot axes (color figure onine)

General performance of wind-based k600 models

Table 2 summarizes the general performance of each model in predicting k600 for a specific lake. Model performance was similar for five of the six models with median RMSDs of 2.2–4.3 cm h−1, R2adj values of − 0.9 to − 0.2, intercepts of − 0.9 to 1.9 cm h−1 and slopes of 0.7–1.7. Negative R2adj suggest that predictions were not better than using mean k600 observations. Only 2–39% of all cases showed positive R2adj depending on the model. Intercepts and slopes of linear regressions of observed vs. predicted values were significantly different from 0 and 1 in 30–61% and 41–52% of all cases, respectively. The model L18 by Li (2018) performed generally very poorly outside its training domain. Complete results for each model and lake are reported in Online Resources Table S2.

Table 2 Performance of empirical wind-based models for predictions of air–water gas transfer velocities (k600) in 46 lakes with at least 6 observations per lake

Predicting variability in model performance

The available experimental conditions (average U10, LA, SDI, SIN, TIN, absolute value of Lat) were poor predictors of k600 model performance. This was indicated by the low explanatory power of the MRT analysis, and variable tree structures, depending on the threshold set for the minimum number of lakes included in each leaf. Depending on this threshold, two different trees were generated (Fig. 3, Online Resources Fig. S3). Cross-validated relative errors of these trees were between 1.14 and 1.17 and hence close to one, indicating poor prediction (De’Ath 2002). This was confirmed by the small proportion of variance in model performance ranks that was explained by the MRTs (R2 = 0.10–0.11). The unconstrained cluster analyses explained a much higher proportion of variance (R2 = 0.32–0.36), indicating that model performance ranks clustered relatively strongly and that k600 models performed differently under different conditions. However, the lower R2 of MRT relative to the unconstrained cluster analysis suggests that observed conditions could only partly account for the difference in model performance (De’Ath 2002).

Fig. 3
figure 3

Multivariate regression tree (MRT) to explain patterns in ranked model performance of air–water gas transfer velocity (k600) predictions according to six published models. Ranks scale from 1 (best performance) to 6 (worst performance). The potential explanatory variables were average wind speed at 10 m height (U10), lake area (LA), shoreline development index (SDI), the absolute value of latitude (Lat), space integration (SIN) and time integration (TIN). The MRT is valid for a minimum number of lakes of n ≥ 5. The R2 of the MRT was 0.10, and the cross-validated relative error was 1.17 (SE = 0.12). The boxplots show the distributions by medians (lines), interquartile ranges (boxes) and 1.5 times the interquartile ranges (whiskers). Circles mark values outside these ranges. The p values indicate the significance of differences among models within a leaf (Kruskal–Wallis one-way analysis of variance). The letters a, b and c denote significant differences between boxplots within leaves (pairwise Wilcoxon rank sum test, p < 0.05). n is the number of lakes in each leaf. CC98 Cole and Caraco (1998), VP13 Vachon and Prairie (2013), CW03 Crusius and Wanninkhof (2003), M10 Macintyre et al. (2010), G07 Guerin et al. (2007), L18 Li (2018)

Factors influencing wind-based k600 model performance

The two MRTs suggest that model performance was a function of TIN or LA, depending on the threshold number of lakes chosen. Apart from L18 always being the worst model, we can observe the following structures. For a threshold of 1–4 lakes per leaf, model performance was determined by LA (Online Resources Fig. S3). For very large systems (LA ≥ 375 km2), the model VP13 by Vachon and Prairie (2013) performed slightly but not statistically significantly better than the other models. For systems smaller than 375 km2, models performed equally. For a threshold of minimum 5 lakes per leaf, model performance was primarily determined by TIN (Fig. 3). For TIN < 105 min, VP13 performed slightly, but not statistically significantly better than the other models. For TIN ≥ 105 min, the model G07 by Guerin et al. (2007) performed significantly better than VP13.

Differences in model performance ranks were reflected by differences in the individual performance measures. For example, the higher ranks of G07 relative to VP13 for TIN ≥ 105 min were due to lower RMSDs and higher R2adj (Online Resources Fig. S4). The generally poor performance of L18 was reflected by extremely high RMSD, low R2adj and slopes almost always being significantly different from 1.

New k600 model parametrizations

Among all the tested linear, exponential and power models, we identified a suite of potential candidates for new k600 parameterizations in which all their parameters were statistically significant (p < 0.05), and which had similar statistical support (evidence ratio ≥ 0.05) as the respective models with the lowest AIC (Table 3). The candidate models all included lake- or method-specific characteristics in addition to U10. Hence, including these variables significantly improved the models relative to models with U10 alone. The slope of the linear model type was equivalently explained by LA, SDI or SIN, and the intercept of the power model type was equivalently explained by SIN, LA, or no predictor variable. These equivalences likely result, at least in part, from correlations between LA, SDI and SIN (Online Resources Fig. S2). The linear and power models with the lowest AIC showed a similar structure, where their slope component (a) increased with LA and their intercept component (c) decreased with SIN. We did not find any variable that significantly modulated the shape (b) component in the power model. The slope (a) and shape (b) component of the exponential model with the lowest AIC decreased with SIN and increased with SDI, respectively (Table 3).

Table 3 Performance characteristics of linear, exponential and power mixed-effects models to predict air–water gas transfer velocity (k600) from wind speed at 10 m height (U10), lake area (LA), shoreline development index (SDI) and/or space integration (SIN)

For each functional shape, we exemplarily chose the model with the lowest AIC for further evaluation. Accordingly, the linear and power model predictions explained similar variability in k600 (65%) and had similar RMSD (3.35 and 3.34 cm h−1, Online Resources Table S3). The linear regressions of observed vs. predicted k600 followed closely the 1:1 line, with intercepts and slopes not significantly different from 0 and 1, respectively (Fig. 4a,c). Accounting for lake-specific intercepts and slopes would not improve the fit of observed and predicted k600 (Likelihood ratio test, Online Resources Table S6), suggesting a good fit among all lakes. Model residuals were homogeneous with a mean near zero across the whole range of predicted k600, U10, LA, SDI, SIN, and Lat (Online Resources Fig. S5–S6). Finally, the linear and power model parameters were robust against the threshold minimum number of k600 observations per lake (Online Resources Fig. S8). In contrast, the exponential model with the lowest AIC explained only 35% of the variability in k600 with a RMSD of 4.89 cm h−1 (Online Resources Table S3). The linear regressions of observed and predicted k600 diverged from the 1:1 line, with intercept and slope significantly different from 0 and 1, respectively (Fig. 4b). Accounting for lake-specific intercepts and slopes would improve the fit of observed and predicted k600 (Likelihood ratio test, Online Resources Table S6), suggesting that the exponential model was significantly biased for some lakes. Model residuals were heterogeneous across the whole range of predicted k600, U10, LA, SDI, SIN, and Lat (Online Resources Fig. S7). The model parameter coefficients varied with the threshold minimum number of k600 observations per lake (Online Resources Fig. S8). Overall, these evaluations suggest that the linear and power models fitted the data substantially better and with less bias than the exponential model.

Fig. 4
figure 4

Observed vs. predicted air–water gas transfer velocities (k600) according to the linear (a), exponential (b) and power (c) model with the lowest Akaike Information Criterion values among all candidate models of the respective functional shape. Blue circles denote floating chamber measurements (FC), green crosses denote the mass balance approach (MB), and red triangles denote the eddy covariance technique (EC). The solid line shows the linear regression line, the dashed line is the 1:1 line (color figure onine)

The k600 predictions based on our new parameterizations were surprisingly similar for the linear and power models with the lowest AIC values, because of their similar model structure and the power exponent is close to 1 (Fig. 5). Predictions by the exponential model tended to be higher for relatively low (< 1 m s−1) and high (> 7 m s−1) U10. These mixed-effect models integrate a large variability in lake-specific model parameterizations (see grey lines in Fig. 5). For example, U10-k600 slopes (a) varied roughly between 0 and 5, and power exponents (b) varied from near 0 to up to 10 (Online Ressources Fig. S10). Our mixed-effects model predictions fell largely within the range of predictions by published models (Fig. 5). For example, the intercept of our linear model was similar to M10 and CW03 for whole-lake space integrations (SIN = 1) and similar to G07, CC98, and VP13 for the minimum space integration found in our dataset (SIN = 5·10–11). The slopes were within the range of slopes in VP13.

Fig. 5
figure 5

Relationship between air–water gas transfer velocities (k600) and wind speed at 10 m height (U10) predicted by the linear (a), exponential (b) and power (c) model with the lowest Akaike Information Criterion values among all candidate models of the respective functional shape, and six published models (d). Colored shadings show predictions for different lake areas (LA) or shoreline development indices (SDI). Predictions in a–c are exemplarily shown for whole-lake space integrations (SIN = 1). The error bar shows the variation in intercepts of predictions in this study, for a whole lake space integration (SIN = 1, lower boundary), and the minimum space integration found in our dataset (SIN = 5·10–11, upper boundary). Grey lines show least-square predictions for each individual lake (in c three lakes are missing due to model fitting failure). The model equations are k600 = (0.328 · log10(LA) + 1.581) · U10−0.066 · logit(SIN) + 1.266 (a), k600 = (− 0.057 · logit(SIN) + 2.366) · exp(U10 · (0.144 · log10(SDI) + 0.156)) (b), and k600 = (0.281 · log10(LA) + 1.361) · U101.097 − 0.072 · logit(SIN) + 1.401 (c). Note that the model in b is not meaningful (for more details, see Discussion section). CC98 Cole and Caraco (1998), CW03 Crusius and Wanninkhof (2003), VP13 Vachon and Prairie (2013), M10 MacIntyre et al. (2010), G07 Guérin et al. (2007), L18 Li (2018) (color figure onine)

The k600 predictions showed large uncertainties, as exemplarily shown in Fig. 6 for the linear model with the lowest AIC value. The lower and upper bounds of 95% confidence intervals were 10–250% smaller or larger than the mean predictions, respectively. This ratio was relatively small (< 25%) for U10 > 2 m s−1 and LA > 1 km2 but increased drastically towards smaller LA where the density of available data was relatively scarce, and towards lower U10. The increase in prediction uncertainties towards lower U10 was more pronounced for whole-lake integrations (SIN = 1, Fig. 6a–c) relative to smaller space integrations of, for example, 1 m2 (SIN = 1 m2/LA, Fig. 6d–f).

Fig. 6
figure 6

Predicted air–water gas transfer velocities (k600) as a function of wind speed at 10 m height (U10), and lake area (LA) according to the linear model with the lowest Akaikes Information Criterion value. Shown are mean predictions (a, d) and the 95% confidence interval (CI), i.e. the 2.5% bound (b, e) and 97.5% bound (C, F) expressed relative to mean predictions for whole-lake space integration (SIN = 1) (ac) and SIN = 10–6 km2/LA (df). Grey dots show observations. The CIs take uncertainties from fixed effects into account

Compared with previous k600 models, none of our new models performed better. Our linear, exponential and power models with the lowest AIC had median RMSDs of 3.3, 3.6 and 3.4, R2adj of − 0.1, − 0.2 and − 0.1 and intercepts and slopes of observed vs. predicted k600 of − 0.3, − 0.7 and − 0.3, and 1.0, 1.2 and 0.9, respectively (c.f. Table 2). The proportion of positive R2adj was 0.41, 0.39 and 0.37, the proportion of intercepts significantly different from 0 was 0.28, 0.46 and 0.28, and the proportion of slopes significantly different from 1 was 0.41, 0.46 and 0.41, respectively. Overall, the integrated performance indices of our new models were not significantly different from most previous models, except for L18 (pairwise Wilcoxon rank sum test, Online Resources Fig. S9).

Discussion

How well can wind speed predict k600 over global lakes?

Previously published models showed that U10 alone can explain a high share of variance in measured k600 when applied within their calibration domain (R2 = 0.68–0.93; Table 1), suggesting that U10 can be a robust predictor under specific conditions. However, several studies have also failed to identify U10 as a predictor of k600 in specific lakes (Cole and Caraco 1998; Matthews et al. 2003; Xiao et al. 2014; Podgrajsek et al. 2015; Holgerson et al. 2017), and the number of unpublished unsuccessful attempts is unknown. Our global data synthesis allowed us to evaluate U10 as a predictor of k600 over a wide range of lakes. This dataset, i.e. 2222 simultaneous k600 and U10 measurements from 46 lakes and reservoirs, is by far more extensive than the database of previously published wind-based models (Table 1). We show that applying these wind-based models on new lakes and under new conditions results in poor and arbitrary k600 predictability, suggesting that U10 parameterizations derived from one or few systems cannot always be used to predict k600 beyond the training data sets. The R2adj of observed vs. predicted k600 was on average negative, and only in a minority of cases, U10-based predictions were more accurate than simply assuming a mean k600 independent of U10. (Table 2). These results signify that there is no wind-based model that predicts k600 well in all types of lakes.

Among all lakes, none of the tested published models clearly performed better than the others, in line with our first hypothesis (H1). However, one model (L18) performed worse than all other models likely because it was developed in a reservoir with significant lateral water flow as an additional source of turbulence (Li 2018). Some models seemed to perform slightly better than others under specific conditions such as very large LA or short TIN (Fig. 3 and Online Resources Fig. S3). However, these differences were small (typically < 2 performance index ranks) and not statistically significant. Overall, we did not clearly identify conditions under which certain models performed better than others, and therefore reject our second hypothesis (H2). This result is further supported by the very low explanatory power of the MRTs (R2 = 0.10–11) relative to the unconstrained cluster analysis (R2 = 0.32–0.36), implying that there was little structure in model performance and that this structure was mainly determined by unobserved factors, or by secondary factors that were observed but remained unrevealed due to a lack of statistical power.

The generally poor and nearly random performance of published k600 models suggests that k600 predictions are associated with large errors, especially when models are used to extrapolate to new conditions. This finding emphasizes our second research question whether a more general model fitted to a wider range of training data, and including additional easy to obtain predictor variables could improve k600 predictions or at least better account for prediction uncertainties.

Functional shape of wind-based k600 parametrizations

To properly assess the new k600 parameterizations and the effects of additional predictor variables, we first have to evaluate the shape of the U10-k600 relationship. The U10-k600 relationship can have many different shapes among lakes (Fig. 5), and this can be due to several factors. For example, non-linear U10-k600 relationships can be a result of k600 being enhanced at low U10 due to convection (MacIntyre et al. 2010; Polsenaere et al. 2013) or chemical enhancement of reactive gases (Wanninkhof et al. 1987). Our models were not flexible enough to accommodate such variability, which is reflected in high uncertainties at low U10 (Fig. 6). k600 can also be enhanced by bubbles formed by breaking waves, leading to an accelerating increase at U10 > 12 m s−1 (Broecker and Siems 1984). This process was negligible in our dataset because most observations were done below the U10 threshold for breaking waves. Some studies hypothesized the presence of microbubbles affecting k600 measurements from less soluble gases (e.g. CH4; Prairie and Giorgio 2013; McGinnis et al. 2015). While the exact drivers of this phenomenon are still unknown, it could indeed affect the lake-specific U10-k600 shape. We accounted for potential variability among lakes in the shape of the U10-k600 relationships by the exponent b in our power model. Interestingly, b varied widely among lakes but none of the tested lake- or method-related characteristics could significantly explain this high variability. This highlights the need for future studies to identify other factors that could explain high between-lake variability in U10-k600 shapes. Our best estimate for the exponent b in the power model was near 1 (Table 3). Alternative parameterizations that assume an exponential shape resulted in poor model fits with strong biases. Therefore, we conclude that the U10-k600 relationship of a typical lake in our database is linear and recommend the use of our linear model parameterizations for applications across a wide range of lakes. We provide code implemented in the program R to estimate k600 and associated uncertainties as a function of U10, LA, and SIN based on our linear model with the lowest AIC value (Online Resources Code S1, Data S1, and Data S2). We emphasize that this model is one of several other linear models that fitted the data equally well.

Modulators of wind-speed effect on k600

Our new multilevel modelling procedure showed that LA, SDI and/or SIN can significantly explain between-lake differences in the U10-k600—relationships, which supports our third hypothesis (H3). The U10 effect on k600 increased with LA likely as a result of progressively larger fetch length, wave build-up and hence turbulence (Schilder et al. 2013; Vachon and Prairie 2013). The modulating effect of LA in our dataset was less strong than the effect found by Vachon and Prairie (2013) and Guerin et al. (2007) (Online Resources Fig. S11, but note absence of effects in Wanninkhof et al. (1987)). This variability in the effect of LA on U10 can arise from differences in shoreline sheltering of the lakes considered. Relatively many small lakes included in our dataset (Sebacher et al. 1983; Leuning et al. 1984; Boyd and Teichert-Coddington 1992; Denmead and Freney 1992) had only little shoreline sheltering which allowed k600 to respond relatively strongly to U10. As a result, the LA effect on U10-k600 relationships was rather weak (Online Resources Fig. S11). In contrast, the very large LA effects reported by Guérin et al. (2007) may be due to the long effective fetch length in the elongated estuaries included in this study, allowing larger wave heights for a given U10. The larger LA effects reported by Vachon and Prairie (2013) and Guérin et al. (2007) could also be a result of estimating k600 locally in the lake center, for which fetch length may matter more than for whole-lake integrated data. Hence, the effect of LA on U10-k600 relationships is far from universal and may need further investigations. Our dataset covered both sheltered and unsheltered systems, and also a wide range of SDI. We argue that all these effects were partly captured by the addition of LA as a modulator of the U10 effect, making our model less specific and more generic than previous models.

Our modelling procedure quantified, for the first time, the effect of SIN on k600 estimates. This scale is strongly related to the method of estimating k600. Predicted k600 was around 2.4 cm h−1 higher for small-scale integrations (10–6 km2/LA) in the lake center (typical for floating chambers) relative to integrations over the whole lake (typical for mass balances). With the mass balance approach, k600 are integrated over the whole lake surface while floating chambers and the eddy covariance technique usually integrate smaller areas near the lake center (Fig. 1c). Here, U10 and hence k600 is usually higher than near-shore (Venäläinen et al. 2003; Schilder et al. 2013; Vachon and Prairie 2013; Docquier et al. 2016). Therefore smaller SIN resulted in higher k600 in our model. It is important to note, however, that our models predict local k600 only for areas around the lake center and that higher or lower k600 should be expected in near-shore areas, depending on the wind direction and lake shape (Vachon and Prairie 2013). The SIN effect should raise the awareness among k600 model users about which proportion of the lake the anticipated k600 values should integrate (Fig. 1c). This finding has implications for calculations of local vs. whole-lake gas fluxes where the scale of concentration and k600 estimates should match. For example, flux calculations should be based on SIN << 1 if gas concentrations are measured in the center of the lake, and on SIN = 1, if gas concentrations are measured at (multiple) points that are representative for the whole lake surface.

Some of the selected best candidate models included SDI, suggesting the U10 effect on k600 to increase in lakes with more complex shoreline geometry. This finding is rather counter-intuitive as one would think that k600 should decrease with SDI, given the increased shoreline sheltering simulating the effect of a small lake (Schilder et al. 2013; Vachon and Prairie 2013; Gålfalk et al. 2013). The positive effect of SDI likely resulted from SDI being highly correlated with LA (Online Resources Fig. S2A) and LA having a positive effect on k600. We, therefore, conclude that sheltering due to complex shorelines is not a dominant modulator of the U10-k600 relationship and do not recommend using our models that include SDI.

Performance and applicability of new parametrization on global lakes

Accounting for the effects of LA or SIN on lake-level U10-k600 relationships improved model fits relative to single-level models based on U10 alone. However, despite this improvement, even our best k600 models did not clearly outperform the previously published k600 models most of which only included U10 (Online Resources Fig. S9), which provides support to falsify our fourth hypothesis (H4). However, even if including additional variables in addition to U10 did not improve the statistical model fit, it may contribute to a more accurate geographic explanation of variations in k600 among a wide range of lakes.

Our best linear model is designed to fit average conditions across a wide range of global lakes and to account for their variability by estimating the error. We regard our model to be the globally least biased k600 model available, as it averages out method and system-specific errors across a wide range of conditions and explicitly predicts these errors. This approach fills an important gap, relative to previous k600 models which were developed under limited conditions (Table 1). Our new model also greatly expands previous limits of the calibration data sets in terms of U10 and LA and should include most conditions encountered in the world (U10 = 0–13 m s−1, LA = 183 m2 to 1342 km2, SDI = 1–22.5; (c.f. Verpoorter et al. 2014)). Hence, our model yields predictions to represent, to the best possible, an average global lake, and error predictions to account for potential spatiotemporal variability. With these characteristics, our new model is suitable for large-scale applications such as upscaling gas fluxes to regional and global scales.

Within our prediction domain, prediction errors were large (up to 10 cm h−1 or > 200% of mean predictions) and must be accounted for in large-scale or global gas flux estimates. Interestingly, prediction errors varied non-linearly with U10 and LA. While k600 was rather well constrained under conditions that have previously been relatively commonly sampled (intermediate U10 and LA), large uncertainties still exist in relative terms in small lakes (LA < 1 km2) and for low wind speeds (U10 < 2 m s−1), conditions that are common globally (Verpoorter et al. 2014, https://globalwindatlas.info/). Under these conditions, k600 can be dominantly driven by many other factors in addition to U10 (Crusius and Wanninkhof 2003; Read et al. 2012; Holgerson et al. 2017). Future data collections should focus on low wind speeds and small lakes, to identify underlying drivers of high variability in k600 and by that reduce prediction uncertainties.

Limitations and way forward for better wind-based k600 models

Several factors that are known to potentially influence k600 were not accounted for in our new k600 models. Such factors include surface films, turbulence due to convective cooling, bubble-mediated gas transfer, gas-specific behavior (e.g. chemical enhancement), boundary layer stability, stratification, variability in wind stress and the wave field for given U10 levels due to shoreline sheltering, or method-specific issues beyond spatial and temporal integrations (Wanninkhof et al. 1987; MacIntyre et al. 1995; Jähne and Haussecker 1998). To capture these drivers is often difficult and labor-intensive, hence, relevant data are only available for a limited number of systems. Accounting for environmental factors beyond wind may improve k600 predictions in specific lakes (MacIntyre et al. 2010; Polsenaere et al. 2013; Heiskanen et al. 2014). Here, a way forward is to develop new models based on mechanistic first principles, relating environmental factors to turbulent kinetic energy dissipation as the primary driver of k600 (Zappa et al. 2007; Vachon et al. 2010). However, it remains to be assessed whether these factors would matter and to what extent mechanistic models that account for these factors could improve k600 predictions at larger spatial scales. If turning out important, simple proxies of otherwise difficult to measure processes would need to be found to develop k600 models that are applicable over many lakes.

Our analysis indicates that our ability to predict k600 based on empirical wind-based models could approach an upper limit that is not necessarily determined by a lack of understanding of the controls of k600, but by a lack of methodological consistency. First, models with additional predictor variables or relatively many lakes included (e.g. CC98, VP13), did not perform significantly better than single lake/single variable models (e.g. G07, M10). Second, there was also only little structure in the model’s performance relative to the experimental or environmental conditions under which the data were collected (Fig. 3 and Online Resources Fig. S3). These observations could be either explained by a true lack of structure or that this structure is masked by high noise in the collected k600 data due to measurement errors or inconsistencies in methodologies.

One important methodological inconsistency with consequences for the predictability of k600 is the way U10 is measured. In our data set, U10 was mainly measured on or within 3 km of the lake surface (76% or 89% of observations from all lakes and 42% or 85% of observations from lakes with at least 6 observations, Online Resources Table S1), but some were measured even farther inland. Measurements were also scaled to 10 m height from different measurement heights. Wind speed point measurements are not always representative of the whole lake mean U10 (Fig. 1c; Venäläinen et al. 2003; Docquier et al. 2016). Wind height scaling is also associated with uncertainties (Large and Pond 1981). To reduce these uncertainties, the spatial resolution of wind speed measurements should be increased to match the scale and extent of k600 estimates [e.g. several anemometers over the lake, (Wanninkhof et al. 1987)].

Many other methodological and environmental issues that could lead to noise in U10-k600 relationships have been discussed in the literature. In essence, every approach addresses air–water gas transfer from a different angle, with characteristic scales and with method-specific advantages and disadvantages (MacIntyre et al. 1995; Jähne and Haussecker 1998). Our new k600 models account for most of these uncertainties as they integrate the variability in previous studies carried out under widely different conditions. However, despite the wide variety of studies included, more efforts are needed to measure k600 in lakes types that go beyond our dataset and to assess how representative our data collection is for global conditions of lake-atmosphere gas transfer.

Implications and conclusions

With the currently available set of k600 models with their limited calibration domain, researchers have been in need to extrapolate k600 to their system of interest without being able to properly quantify and account for resulting uncertainties. This practice would strongly limit their ability to gain insights into ecological and biogeochemical processes in specific systems (Dugan et al. 2016; Kiuru et al. 2019) and to upscale the lakes’ air–water gas fluxes to the globe (Raymond et al. 2013). Based on an evaluation of existing wind-based k600 models against an extensive set of published U10 and k600 estimates, we conclude that extrapolation can lead to significant biases in k600 predictions.

Building on the growing awareness that “the gas transfer velocity is not simply a function of the wind speed” (Jähne and Haußecker 1998), we found here that U10 is generally a poor universal predictor of k600 over lakes or reservoirs, no matter which of the existing model parameterizations are applied. Prediction errors remain high even in new parameterizations calibrated against the global data set. Therefore, we conclude (in agreement with Cole et al. 2010) that wind-based models are currently very limited in their use to scale k600 across lakes and advise better measurements rather than models of k600 when accurate estimations for specific lakes are needed. Similar challenges of scaling k600 across systems remain in streams and rivers (Hall and Ulseth 2019).

For the future development of lake models (c.f. Tan and Zhuang 2015; Stepanenko et al. 2016) or larger scale applications (c.f. Zwart et al. 2018), we emphasize accounting for large uncertainties when modelling k600. To do so, we propose a new k600 model that provides estimates of means and uncertainties in k600 as a function of U10, LA, and SIN. Large uncertainties in k600 models may be overcome by developing mechanistic models from first principles. Until this is achieved, the best approach remains to calibrate empirical constants against extensive data sets. Progress on these lines will improve the development of coupled atmosphere–land surface–lake models and incorporation of lakes in earth system models (MacKay et al. 2009).

The results from this study emphasize careful thought about the strength and limitations, in particular the calibration domain, and spatial scale of integration of the anticipated means to estimate k600, no matter if k600 is estimated empirically or modelled. By an informed choice of the most suitable k600 model, researchers should be able to limit uncertainties in k600 predictions within acceptable boundaries, or as George Box phrased it: “Essentially, all models are wrong, but some are useful”.