Sampling bias mitigation for species occurrence modeling using machine learning methods

https://doi.org/10.1016/j.ecoinf.2020.101091

Highlights

  • We propose methods to measure and mitigate spatial and detection bias in SDM.

  • Spatial bias mitigation improves representativeness of spatial data variability.

  • Applying sampling reliability weights to observations reduces detection bias.

  • Spatial bias had a larger effect on prediction and accuracy than detection bias.

  • Bias mitigation should be a standard practice in species occurrence modeling.

Abstract

The identification and mitigation of sampling biases are commonly overlooked in species distribution modeling, even though bias can seriously compromise the validity of modeling outcomes. Here we propose methods to 1) detect and mitigate spatial and detection sampling biases in the use of machine learning methods for modeling species occurrence and 2) assess the magnitude of bias and the effectiveness of bias mitigation on modeling prediction, variable importance, and model performance. We illustrate these techniques through the calibration of boosted decision trees for the prediction of annual occurrences of Aedes albopictus, an invasive disease vector, in South-East Pennsylvania between 2001 and 2015. Methods consist of the application of spatial filters and the assignment of sampling reliability weights to observed locations. We tested the performance of spatial bias mitigation by comparing the frequency distribution obtained for predictors before and after filtering with the distribution that would be obtained under an ideal sampling design. We also tested the performance of detection bias mitigation by comparing the importance of variables representing detection bias before and after the assignment of reliability weights. Results show that spatial filtering reduced differences between the frequency distribution obtained with the unfiltered data and the distribution that would be obtained under a reference sampling design. The assignment of sampling reliability weights to observations reduced the relative influence of detection bias on fitted models. The mitigation of spatial bias had a larger effect on modeling prediction and accuracy estimates than detection bias mitigation. Spatial sampling bias mitigation largely tended to reduce the number of years of predicted A. albopictus occurrence, while detection bias mitigation tended to increase it.
Our results highlight the importance of identifying, quantifying and mitigating observation biases as a standard practice in the use of machine learning methods for species occurrence modeling because biases can compromise the reliability of modeling outcomes and interpretation.

Introduction

Species distribution models (SDMs) have provided valuable knowledge about critical environmental processes explaining patterns of species occupancy such as changes in temperature or habitat loss. Therefore, SDMs are useful tools for biodiversity conservation and planning (Elith and Graham, 2009) and for understanding and controlling the advance of invasive species, especially in rapidly changing environments (Ward, 2007). Despite the rapid progress in the development of techniques for SDMs, the ability to adequately characterize the geographic range attributable to a given species is heavily constrained by biases associated with the collection of input data on species presence/absence.

Two common biases in input data for species distribution modeling have to do with 1) the spatial clustering of sampling points, commonly towards easily accessible areas (Reddy and Dávalos, 2003) and 2) the uncertainty associated with the realization of a species in sampled locations where occurrence can be confounded with detectability (Royle et al., 2005). We refer to these sources of bias as spatial and detection biases. Spatial sampling bias is often described in the literature as “sampling bias” (Kramer-Schadt et al., 2013) while detection sampling bias is considered as a source of imperfect detection (Lahoz-Monfort et al., 2014). We instead use the terms spatial and detection bias respectively because they refer to different aspects of sampling bias. While the first one denotes “where”, the second relates to “when” and “how often” a sample is collected.

The goals of this paper are to propose methods for 1) detecting and mitigating spatial and detection biases in the use of machine learning methods for modeling species occurrence and 2) assessing the magnitude of bias and the effectiveness of bias mitigation on modeling prediction, variable importance, and model performance. The research questions are:

  1. How can spatial and detection bias be quantified and then mitigated using machine learning methods for SDMs?

  2. How can the effectiveness of spatial and detection bias mitigation approaches be measured?

  3. What is the effect of bias mitigation on modeling prediction and performance?

Spatial bias refers to the spatial aggregation or clustering of sampled locations in a study area of interest. Species distribution maps should ideally be built with samples collected following a suitable sampling design (e.g. random, systematic). However, in most cases, samples are collected opportunistically, and criteria related to accessibility or cost are prioritized over representativeness of the variation in environmental conditions in the area of analysis (Phillips et al., 2009). Spatial bias often translates into oversampling of easily accessible locations such as roads or populated areas (Kadmon et al., 2004; Reddy and Dávalos, 2003).

Spatial bias introduces an unequal representation of the spatial variability of the covariates used for species occurrence prediction. This bias affects modeling outcomes through 1) a mis-representation of the sampling accuracy, 2) distorted estimates of variable importance or significance and 3) a limited generality and transferability of the model, especially to extrapolate predictions to under-observed locations (Anderson and Gonzalez Jr., 2011; Kramer-Schadt et al., 2013). Despite the shortcomings of using biased location data for species distribution models, methods to control for location bias are rarely applied (Yackulic et al., 2013).

Several methods have been proposed to reduce the effect of sampling bias on model over-fitting. They include 1) the use of background data portraying a sampling bias similar to presence data, 2) the application of spatial filters or 3) spatial thinning of sampled locations (Aiello-Lammens et al., 2015; Fourcade et al., 2014; Phillips et al., 2009). The performance of geographic bias mitigation has been typically evaluated based on the outcomes of goodness-of-fit or model accuracy (Beck et al., 2014; Boria et al., 2014; Lintz et al., 2013). However, methods to assess the effectiveness of geographic filtering to reduce bias in the representation of the environmental conditions in the study area are lacking.
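As an illustration of the spatial filtering idea mentioned above, the following is a minimal sketch of one common variant: overlaying a square grid and keeping a single randomly chosen sample per cell. The function name `grid_filter`, the cell size and the example coordinates are assumptions made for illustration; they are not taken from the study, and the filter used there may differ.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in projected units, e.g. metres


def grid_filter(points: List[Point], cell_size: float, seed: int = 0) -> List[Point]:
    """Keep one randomly chosen point per square grid cell (hypothetical helper)."""
    rng = random.Random(seed)
    cells: Dict[Tuple[int, int], List[Point]] = {}
    for x, y in points:
        # Floor division assigns each point to the cell that contains it.
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append((x, y))
    # Visit cells in sorted order so the result is repeatable for a given
    # input order and seed.
    return [rng.choice(cells[key]) for key in sorted(cells)]


# Three clustered points share one 1 km cell; two isolated points do not.
samples = [(10.0, 10.0), (20.0, 15.0), (30.0, 40.0), (1500.0, 200.0), (2500.0, 2500.0)]
filtered = grid_filter(samples, cell_size=1000.0)
```

Larger cells remove more clustered points at the cost of sample size, so the cell size is a tuning choice rather than a fixed property of the method.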

Here we propose a method to measure the effectiveness of geographic filtering at representing the variation of environmental conditions in the study area. We do this by comparing the frequency distribution of the covariate values at sampled locations with the frequency distribution that would be obtained under an ideal sampling design. This approach agrees with Phillips et al. (2009), who state that spatial bias is a problem not so much because of the level of aggregation or clustering of the data as because of the inadequate representation of the variability of environmental conditions in the study area.
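The comparison described above can be sketched for a single covariate as follows. This is a hedged illustration, not the study's implementation: it summarizes the difference between two empirical cumulative distributions by their largest absolute gap (the Kolmogorov-Smirnov distance), which is one possible summary chosen here as an assumption. The function names and the toy covariate values are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def ecdf(sample: Sequence[float]) -> Callable[[float], float]:
    """Return F, where F(v) is the share of sample values <= v."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda v: bisect_right(ordered, v) / n


def max_ecdf_gap(sampled: Sequence[float], reference: Sequence[float]) -> float:
    """Largest absolute gap between two ECDFs (Kolmogorov-Smirnov distance)."""
    f_s, f_r = ecdf(sampled), ecdf(reference)
    # Both ECDFs are step functions, so the largest gap occurs at a data value.
    grid = sorted(set(sampled) | set(reference))
    return max(abs(f_s(v) - f_r(v)) for v in grid)


# Toy covariate values: a reference (random-design) sample, an opportunistic
# sample clustered at low values, and the same sample after filtering.
reference = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
unfiltered = [0, 0, 1, 1, 1, 2, 2, 3, 8, 9]
filtered = [0, 1, 3, 4, 6, 8, 9]

gap_before = max_ecdf_gap(unfiltered, reference)
gap_after = max_ecdf_gap(filtered, reference)
```

Under this summary, filtering is judged effective for a covariate when `gap_after` is smaller than `gap_before`; repeating the calculation per covariate shows where bias remains.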

Detection bias relates to what the literature calls “imperfect detection”, or the probability that sampled locations include false absences (Royle et al., 2005). When false absences are not accounted for, the resulting model predicts not only species presence (desirable) but also species detectability (undesirable) (Lahoz-Monfort et al., 2014). Not accounting for false absences can lead to biased occurrence estimates and inflated model performance. The extent to which false absences can affect model performance and evaluation depends largely on the magnitude of the errors and their correlation with environmental characteristics (Guillera-Arroita, 2017). Despite the detrimental influence of imperfect detection on SDMs, this bias has been largely ignored in the literature (Kellner and Swihart, 2014).

Recent developments in statistical tools and software have facilitated the consideration of imperfect detection in species distribution modeling (Fiske and Chandler, 2011). Standard methods consist of hierarchical parametric statistical models that account for presence and detectability as two separate processes (Lahoz-Monfort et al., 2014). Fitting occupancy models within hierarchical frameworks is not always straightforward (Welsh et al., 2013; Banks-Leite et al., 2014; Guillera-Arroita et al., 2014; Merrow et al., 2014; Field et al., 2016). These models share challenges common to all parametric models: dealing with variable interactions and non-linearities, the need for data transformations or standardization to meet linearity, Gaussianity or other model assumptions, and model convergence problems (Hutchinson et al., 2011).
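To make the "two separate processes" concrete, the simplest textbook form of a site-occupancy likelihood can be written as below. This is a sketch under simplifying assumptions that are not stated in the text: a constant occupancy probability, a constant per-visit detection probability, and independent visits. The symbols are notation introduced here for illustration.

```latex
% \psi : probability that a site is occupied (presence process)
% p    : probability of detecting the species on one visit, given presence
% K    : number of visits to the site
% d    : number of those visits on which the species was detected
\Pr(\text{history with } d \ge 1 \text{ detections}) = \psi \, p^{d} (1 - p)^{K - d}
\Pr(\text{no detections in } K \text{ visits}) = \psi \, (1 - p)^{K} + (1 - \psi)
```

The second expression shows why an observed absence is ambiguous: it mixes a missed presence, with probability ψ(1 − p)^K, and a true absence, with probability 1 − ψ.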

The use of machine learning methods in SDMs has grown substantially in recent years. Their advantages include minimal requirements for data pre-processing, their efficiency in processing large data sets, their ability to predict without data transformations or rescaling, and their capacity to handle missing observations, non-linearities and variable interactions (Hutchinson et al., 2011). Despite these virtues, attempts to incorporate imperfect detection in the use of machine learning algorithms for SDMs have been limited. Instead, most applications of machine learning have ignored absence data and focused on presence data, which allow prediction of species habitat suitability only, not occurrence. For this purpose, pseudo-absence data are generated by drawing random points in the study area (Phillips et al., 2009). This approach runs counter to the recommendation by several authors encouraging the use of absence data in SDMs whenever they are available, even if detection probability is less than 1 (Brotons et al., 2004; Guillera-Arroita, 2017; Ward et al., 2009; Yackulic et al., 2013). This recommendation implies that all observations contain meaningful information that can contribute to model parameterization, even if they are not perfect.

In this work we evaluate a method to mitigate the effect of imperfect detection in the prediction of species occupancy using machine learning methods. The approach consists of assigning reliability weights to absence observations based on the influence of sampling frequency (number of repeated measurements of a given site in a year) and sampling timing (the median date on which a sample was collected) on the probability of species detection. The application of weights to observations has been proposed to reduce prediction biases due to differences in the accuracy, reliability, or confidence of training samples (Hashemi and Karimi, 2018). For instance, Gholami et al. (2018) assigned weights to observations to account for imperfect detection in the application of machine learning methods to predict poaching attacks in Uganda. Mengersen et al. (2017) assigned sureness weights to presence observations from citizen science data to account for imperfect detection in the application of Support Vector Machines for modeling the distribution of jaguars in the Peruvian Amazon. Weights have also been used in machine learning methods to account for the likelihood that a pseudo-absence observation is a realized absence (Zaniewski et al., 2002), to give more prominence to observations in under-sampled areas (Elith et al., 2010), or to reduce imbalances due to differences in the number of samples representing presences and pseudo-absences (Barbet-Massin et al., 2012).
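The weighting idea can be sketched as follows. This is an illustrative assumption, not the weighting scheme used in the study: presences receive full weight, and each absence is weighted by the chance that the species would have been detected at least once had it been present, given the number of visits and a seasonal per-visit detection curve. The curve shape, its parameters (`peak_doy`, `spread`, `p_max`) and all names are hypothetical.

```python
import math
from typing import NamedTuple


class SiteYear(NamedTuple):
    present: bool    # species detected at least once that year
    n_visits: int    # sampling frequency: repeat visits in the year
    median_doy: int  # sampling timing: median day of year of the visits


def per_visit_detection(doy: int, peak_doy: int = 230, spread: float = 45.0,
                        p_max: float = 0.6) -> float:
    """Hypothetical seasonal detection curve; values are illustrative only."""
    return p_max * math.exp(-0.5 * ((doy - peak_doy) / spread) ** 2)


def reliability_weight(obs: SiteYear) -> float:
    """Weight in [0, 1]: 1 for presences, detection-based for absences."""
    if obs.present:
        return 1.0
    p = per_visit_detection(obs.median_doy)
    # Chance that n_visits independent visits would have detected the
    # species at least once had it been present.
    return 1.0 - (1.0 - p) ** obs.n_visits


observations = [
    SiteYear(present=True, n_visits=2, median_doy=200),
    SiteYear(present=False, n_visits=1, median_doy=230),
    SiteYear(present=False, n_visits=5, median_doy=230),
    SiteYear(present=False, n_visits=5, median_doy=120),
]
weights = [reliability_weight(o) for o in observations]
```

Weights of this kind can be supplied as per-observation weights when fitting boosted trees, for example through the `weights` argument of `gbm` in R or the `sample_weight` argument of `GradientBoostingClassifier.fit` in scikit-learn, so that poorly sampled absences contribute less to the fit.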

Section snippets

Overview of methods

We used in situ presence/absence observations of the Aedes albopictus mosquito collected annually between 2001 and 2015 by the Division of Vector Management within the Pennsylvania Department of Environmental Protection (DEP) for model calibration and validation. We derived spatial covariates from publicly available data sources on precipitation, temperature, vegetation status, land cover, and socioeconomic factors to predict the spatial distribution of mosquitoes in South East Pennsylvania

Spatial bias detection and mitigation

Spatial bias is expressed as the difference between the empirical cumulative frequency distribution function produced by the values of the covariates in sampled locations and the frequency distribution that would be obtained under a reference random sampling design. Results show that the magnitude of spatial bias varies among covariates (Fig. 3). Spatial filtering generally produced a cumulative distribution that resembles more closely the distribution derived from a reference random sampling

Discussion

In this work, we propose methods to measure spatial and detection sampling bias, to mitigate detection bias and to quantify the effectiveness of spatial and detection bias mitigation for species distribution modeling using machine learning methods. Spatial bias was measured by comparing the frequency distribution obtained for the values of covariates in sampled sites with the distribution that would be obtained under a reference sampling design. The rationale for this approach is that sampling

Data availability

Data available from the FigShare Repository: https://figshare.com/articles/dataset4pub_RData/10058426

Acknowledgments

This work was supported through funding from the Office of the Vice President for Research and the College of Liberal Arts at Temple University under the Targeted Funding Program. We would like to thank Michael Hutchinson and the Vector Management Program at the Pennsylvania Department of Environmental Protection for making the mosquito data available for this research. We thank Benjamin Velez-Hayes for the thoughtful conversations that helped to improve clarity in the manuscript.

References (45)

  • L. Brotons et al. Presence-absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography (2004)

  • R Core Team. R: A Language and Environment for Statistical Computing (2019)

  • J. Elith et al. Do they? How do they? Why do they differ? On finding reasons for differing performances of species distribution models. Ecography (2009)

  • J. Elith et al. The art of modelling range-shifting species. Methods Ecol. Evol. (2010)

  • C.R. Field et al. How does choice of statistical method to adjust counts for imperfect detection affect inferences about animal abundance? Methods Ecol. Evol. (2016)

  • I. Fiske et al. Unmarked: an R package for fitting hierarchical models of wildlife occurrence and abundance. J. Stat. Softw. (2011)

  • Y. Fourcade et al. Mapping species distributions with MAXENT using a geographically biased sample of presence data: a performance assessment of methods for correcting sampling bias. PLoS One (2014)

  • J.H. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Stat. (2001)

  • J.H. Friedman et al. Additive logistic regression: a statistical view of boosting. Ann. Stat. (2000)

  • S. Gholami et al. Adversary models account for imperfect crime data: forecasting and planning against real-world poachers

  • B. Greenwell et al. Package ‘gbm’

  • G. Guillera-Arroita. Modelling of species distributions, range dynamics and communities under imperfect detection: advances, challenges and opportunities. Ecography (2017)

Cited by (6)

  • Population trends from count data: Handling environmental bias, overdispersion and excess of zeroes. Ecological Informatics (2022)

    Citation excerpt: Several types of biases have been described in ecological data. Geographical (Anderson and Gonzalez, 2011) or spatial biases (Gutierrez-Velez and Wiese, 2020; Kramer-Schadt et al., 2013; Phillips et al., 2009) are due to differences in sampling effort in the study area, resulting in a spatial clustering of sampling sites. They can lead to environmental biases (i.e. inadequate representation of the variability of environmental covariates in the study area; Phillips et al., 2009; Kramer-Schadt et al., 2013).

  • SAMPLING BIAS WORSEN THE PREDICTIVE ABILITY OF NICHE MODELS. Revista de Gestao Social e Ambiental (2024)