Sampling bias mitigation for species occurrence modeling using machine learning methods
Introduction
Species distribution models (SDMs) have provided valuable knowledge about critical environmental processes, such as changes in temperature or habitat loss, that explain patterns of species occupancy. SDMs are therefore useful tools for biodiversity conservation and planning (Elith and Graham, 2009) and for understanding and controlling the advance of invasive species, especially in rapidly changing environments (Ward, 2007). Despite rapid progress in the development of SDM techniques, the ability to adequately characterize the geographic range of a given species is heavily constrained by biases in the collection of input data on species presence/absence.
Two common biases in input data for species distribution modeling concern 1) the spatial clustering of sampling points, commonly towards easily accessible areas (Reddy and Dávalos, 2003) and 2) the uncertainty associated with the realization of a species in sampled locations, where occurrence can be confounded with detectability (Royle et al., 2005). We refer to these sources of bias as spatial and detection biases. Spatial sampling bias is often described in the literature as “sampling bias” (Kramer-Schadt et al., 2013), while detection sampling bias is treated as a source of imperfect detection (Lahoz-Monfort et al., 2014). We instead use the terms spatial bias and detection bias, respectively, because they refer to different aspects of sampling bias: the first denotes “where” a sample is collected, while the second relates to “when” and “how often”.
The goals of this paper are to propose methods for 1) detecting and mitigating spatial and detection biases in the use of machine learning methods for modeling species occurrence and 2) assessing the magnitude of bias and the effectiveness of bias mitigation on modeling prediction, variable importance, and model performance. The research questions are:
1. How can spatial and detection bias be quantified and then mitigated using machine learning methods for SDMs?
2. How can the effectiveness of spatial and detection bias mitigation approaches be measured?
3. What is the effect of bias mitigation on modeling prediction and performance?
Spatial bias refers to the spatial aggregation or clustering of sampled locations in a study area of interest. Species distribution maps should ideally be built with samples collected following a suitable sampling design (e.g. random, systematic). In most cases, however, samples are collected opportunistically, and criteria related to accessibility or cost are prioritized over representing the variation in environmental conditions across the area of analysis (Phillips et al., 2009). Spatial bias often translates into oversampling of easily accessible locations such as roads or populated areas (Kadmon et al., 2004; Reddy and Dávalos, 2003).
Spatial bias introduces an unequal representation of the spatial variability of the covariates used for species occurrence prediction. This bias affects modeling outcomes through 1) a mis-representation of the sampling accuracy, 2) distorted estimates of variable importance or significance and 3) a limited generality and transferability of the model, especially to extrapolate predictions to under-observed locations (Anderson and Gonzalez Jr., 2011; Kramer-Schadt et al., 2013). Despite the shortcomings of using biased location data for species distribution models, methods to control for location bias are rarely applied (Yackulic et al., 2013).
Several methods have been proposed to reduce the effect of sampling bias on model over-fitting. They include 1) the use of background data portraying a sampling bias similar to presence data, 2) the application of spatial filters or 3) spatial thinning of sampled locations (Aiello-Lammens et al., 2015; Fourcade et al., 2014; Phillips et al., 2009). The performance of geographic bias mitigation has been typically evaluated based on the outcomes of goodness-of-fit or model accuracy (Beck et al., 2014; Boria et al., 2014; Lintz et al., 2013). However, methods to assess the effectiveness of geographic filtering to reduce bias in the representation of the environmental conditions in the study area are lacking.
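As an illustration of the third option, the sketch below thins clustered records by keeping one randomly chosen record per square grid cell. This is a simple grid filter written for illustration only; it is not the algorithm implemented in any of the cited packages, and the function name `grid_thin`, the cell size, and the coordinates are hypothetical.

```python
import random


def grid_thin(points, cell_size, seed=0):
    """Keep one randomly chosen record per square grid cell.

    points: iterable of (x, y) coordinates in projected units.
    cell_size: side length of a grid cell in the same units (assumed value).
    """
    rng = random.Random(seed)
    cells = {}
    for x, y in points:
        # Records falling in the same cell share the same integer key.
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append((x, y))
    # Sort the keys so the result is reproducible for a given seed.
    return [rng.choice(cells[key]) for key in sorted(cells)]


# Hypothetical records: three clustered near the origin, two isolated.
clustered = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.9), (5.5, 5.1), (9.7, 0.3)]
thinned = grid_thin(clustered, cell_size=1.0)
```

A larger cell size removes more clustered records at the cost of a smaller training set, so the choice is a trade-off that depends on the study area.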
Here we propose a method to measure the effectiveness of geographic filtering at representing the variation of environmental conditions in the study area. We do this by comparing the frequency distribution of the values of the modeling covariates at sampled locations with the frequency distribution that would be obtained under an ideal sampling design. This approach is in agreement with Phillips et al. (2009), who state that spatial bias is a problem not so much because of the level of aggregation or clustering of the data but because of the inadequate representation of the variability of environmental conditions in the study area.
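A minimal sketch of this kind of comparison follows, assuming the bias for one covariate is summarized as the largest absolute gap between two empirical cumulative distribution functions (a Kolmogorov-Smirnov-type distance). The summary statistic, the function names, and the synthetic covariate values are assumptions made for illustration, not the exact procedure of this study.

```python
import bisect
import random


def ecdf(values):
    """Return a function F(t) giving the share of values <= t."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda t: bisect.bisect_right(ordered, t) / n


def max_ecdf_gap(sampled, reference):
    """Largest absolute gap between the two empirical CDFs."""
    f_s, f_r = ecdf(sampled), ecdf(reference)
    # Both step functions only change at observed values, so it is
    # enough to evaluate the gap at the pooled data points.
    grid = sorted(set(sampled) | set(reference))
    return max(abs(f_s(t) - f_r(t)) for t in grid)


rng = random.Random(42)
# Hypothetical covariate (e.g. elevation) for every cell of a study area.
landscape = [rng.uniform(0, 1000) for _ in range(5000)]
# Reference design: simple random sample of cells.
reference = rng.sample(landscape, 500)
# Opportunistic design: low values are far more likely to be visited.
biased = [v for v in landscape if rng.random() < (1 - v / 1000) ** 2][:500]
# A second random sample shows the gap expected from chance alone.
unbiased = rng.sample(landscape, 500)

bias_gap = max_ecdf_gap(biased, reference)
chance_gap = max_ecdf_gap(unbiased, reference)
```

Repeating the calculation after filtering the sampled locations would indicate whether the filter moves the sampled distribution closer to the reference one.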
Detection bias relates to what the literature calls “imperfect detection”, or the probability of including false absences in sampled locations (Royle et al., 2005). When false absences are not accounted for, the resulting model predicts not only species presence (desirable) but also species detectability (undesirable) (Lahoz-Monfort et al., 2014). Not accounting for false absences can lead to biased occurrence estimates and inflated model performance. The extent to which false absences affect model performance and evaluation depends largely on the magnitude of the errors and their correlation with environmental characteristics (Guillera-Arroita, 2017). Despite the detrimental influence of imperfect detection on SDMs, this bias has been largely ignored in the literature (Kellner and Swihart, 2014).
Recent developments in statistical tools and software have facilitated the consideration of imperfect detection in species distribution modeling (Fiske and Chandler, 2011). Standard methods consist of hierarchical parametric statistical models that account for presence and detectability as two separate processes (Lahoz-Monfort et al., 2014). Fitting occupancy models within hierarchical frameworks is not always straightforward (Welsh et al., 2013; Banks-Leite et al., 2014; Guillera-Arroita et al., 2014; Merrow et al., 2014; Field et al., 2016). These models involve challenges common to all parametric models, including dealing with variable interactions and non-linearities, the need for data transformations or standardization to meet linearity, Gaussianity or other model assumptions, and model convergence problems (Hutchinson et al., 2011).
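For reference, the two-process structure can be written in the standard site-occupancy form. The symbols used below (occupancy probability $\psi_i$ at site $i$, latent occupancy state $z_i$, detection probability $p_{ij}$ and observation $y_{ij}$ at visit $j$, and number of visits $K$) are notation introduced here for illustration and do not come from this paper.

```latex
z_i \sim \mathrm{Bernoulli}(\psi_i), \qquad
y_{ij} \mid z_i \sim \mathrm{Bernoulli}(z_i \, p_{ij}), \qquad
\Pr(\text{no detection in } K \text{ visits} \mid z_i = 1) = \prod_{j=1}^{K} (1 - p_{ij})
```

Under this formulation, a recorded absence is more likely to be false when a site was visited few times or when per-visit detectability was low.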
The use of machine learning methods in SDMs has grown substantially in recent years. Their advantages include minimal requirements for data pre-processing, the capacity to process large data sets efficiently, the ability to predict without data transformations or rescaling, and the capacity to handle missing observations, non-linearities and variable interactions (Hutchinson et al., 2011). Despite these virtues, attempts to incorporate imperfect detection in the use of machine learning algorithms for SDMs have been limited. Instead, most applications of machine learning have ignored absence data and focused on presence data, which allow prediction of species habitat suitability only, rather than occurrence. For this purpose, pseudo-absence data are generated by drawing random points in the study area (Phillips et al., 2009). This approach runs counter to the recommendation of several authors who encourage the use of absence data in SDMs whenever they are available, even if detection probability is less than 1 (Brotons et al., 2004; Guillera-Arroita, 2017; Ward et al., 2009; Yackulic et al., 2013). This recommendation implies that all observations contain meaningful information that can contribute to model parameterization, even if they are not perfect.
In this work we evaluate a method to mitigate the effect of imperfect detection in the prediction of species occupancy using machine learning methods. The approach consists of assigning reliability weights to absence observations based on the influence of sampling frequency (number of repeated measurements of a given site in a year) and sampling timing (the median date on which a sample was collected) on the probability of species detection. The application of weights to observations has been proposed to reduce prediction biases due to differences in the accuracy, reliability, or confidence of training samples (Hashemi and Karimi, 2018). For instance, Gholami et al. (2018) assigned weights to observations to account for imperfect detection in the application of machine learning methods to predict poaching attacks in Uganda. Mengersen et al. (2017) assigned sureness weights to presence observations from citizen science data to account for imperfect detection in the application of Support Vector Machines for modeling the distribution of jaguars in the Peruvian Amazon. Weights have also been used in machine learning methods to account for the likelihood that a pseudo-absence observation is a realized absence (Zaniewski et al., 2002), to give more prominence to observations in under-sampled areas (Elith et al., 2010), or to reduce imbalances due to differences in the number of samples representing presences and pseudo-absences (Barbet-Massin et al., 2012).
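The sketch below shows one way such reliability weights could be constructed, assuming that the weight of an absence equals the chance that at least one of the repeated visits would have detected the species had it been present. The seasonal detection curve, its parameters (`peak_day`, `width`, `p_max`), and the record fields are hypothetical placeholders, not the weighting scheme fitted in this study.

```python
import math


def per_visit_detection(day_of_year, peak_day=220, width=45.0, p_max=0.6):
    """Hypothetical seasonal detection curve, highest near peak_day."""
    return p_max * math.exp(-0.5 * ((day_of_year - peak_day) / width) ** 2)


def absence_weight(n_visits, median_day):
    """Chance that n_visits would have detected the species if present."""
    p = per_visit_detection(median_day)
    return 1.0 - (1.0 - p) ** n_visits


def observation_weights(records):
    """Presences keep full weight; absences are down-weighted."""
    return [
        1.0 if r["present"] else absence_weight(r["n_visits"], r["median_day"])
        for r in records
    ]


# Hypothetical site-year records.
records = [
    {"present": True, "n_visits": 1, "median_day": 120},
    {"present": False, "n_visits": 1, "median_day": 120},  # early, one visit
    {"present": False, "n_visits": 6, "median_day": 220},  # peak, six visits
]
weights = observation_weights(records)
```

Weights of this kind can be passed to learners that accept per-observation weights, for example through the `sample_weight` argument of `GradientBoostingClassifier.fit` in scikit-learn.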
Section snippets
Overview of methods
We used in situ presence/absence observations of the Aedes albopictus mosquito collected annually between 2001 and 2015 by the Division of Vector Management within the Pennsylvania Department of Environmental Protection (DEP) for model calibration and validation. We derived spatial covariates from publicly available data sources on precipitation, temperature, vegetation status, land cover, and socioeconomic factors to predict the spatial distribution of mosquitoes in South East Pennsylvania
Spatial bias detection and mitigation
Spatial bias is expressed as the difference between the empirical cumulative frequency distribution function produced by the values of the covariates in sampled locations and the frequency distribution that would be obtained under a reference random sampling design. Results show that the magnitude of spatial bias varies among covariates (Fig. 3). Spatial filtering generally produced a cumulative distribution that resembles more closely the distribution derived from a reference random sampling
Discussion
In this work, we propose methods to measure spatial and detection sampling bias, to mitigate detection bias and to quantify the effectiveness of spatial and detection bias mitigation for species distribution modeling using machine learning methods. Spatial bias was measured by comparing the frequency distribution obtained for the values of covariates in sampled sites with the distribution that would be obtained under a reference sampling design. The rationale for this approach is that sampling
Data availability
Data available from the FigShare Repository: https://figshare.com/articles/dataset4pub_RData/10058426
Acknowledgments
This work was supported through funding from the Office of the Vice President for Research and the College of Liberal Arts at Temple University under the Targeted Funding Program. We would like to thank Michael Hutchinson and the Vector Management Program at the Pennsylvania Department of Environmental Protection for making the mosquito data available for this research. We thank Benjamin Velez-Hayes for the thoughtful conversations that helped to improve clarity in the manuscript.
References (45)
- et al. Species-specific tuning increases robustness to sampling bias in models of species distributions: an implementation with Maxent. Ecol. Model. (2011)
- et al. Spatial bias in the GBIF database and its effect on modeling species' geographic distributions. Ecol. Inform. (2014)
- et al. Characterising performance of environmental models. Environ. Model. Softw. (2013)
- et al. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models. Ecol. Model. (2014)
- Explaining the unsuitability of the kappa coefficient in the assessment and comparison of the accuracy of thematic maps obtained by image classification. Remote Sens. Environ. (2020)
- et al. Effect of inventory method on niche models: random versus systematic error. Ecol. Inform. (2013)
- et al. Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecol. Model. (2002)
- et al. spThin: an R package for spatial thinning of species occurrence records for use in ecological niche models. Ecography (2015)
- et al. Assessing the utility of statistical adjustments for imperfect detection in tropical conservation science. J. Appl. Ecol. (2014)
- et al. Selecting pseudo-absences for species distribution models: how, where and how many? Methods Ecol. Evol. (2012)
- Presence-absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography
- R: A Language and Environment for Statistical Computing
- Do they? How do they? Why do they differ? On finding reasons for differing performances of species distribution models. Ecography
- The art of modelling range-shifting species. Methods Ecol. Evol.
- How does choice of statistical method to adjust counts for imperfect detection affect inferences about animal abundance? Methods Ecol. Evol.
- Unmarked: an R package for fitting hierarchical models of wildlife occurrence and abundance. J. Stat. Softw.
- Mapping species distributions with MAXENT using a geographically biased sample of presence data: a performance assessment of methods for correcting sampling bias. PLoS One
- Greedy function approximation: a gradient boosting machine. Ann. Stat.
- Additive logistic regression: a statistical view of boosting. Ann. Stat.
- Adversary models account for imperfect crime data: forecasting and planning against real-world poachers
- Package ‘gbm’
- Modelling of species distributions, range dynamics and communities under imperfect detection: advances, challenges and opportunities. Ecography