Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data
Methods in Ecology and Evolution (IF 6.6), Pub Date: 2020-11-15, DOI: 10.1111/2041-210x.13525
Valerie A. Steen 1,2, Morgan W. Tingley 1,3, Peter W. C. Paton 2, Chris S. Elphick 1

  1. Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique.
  2. To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics.
  3. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration.
  4. Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
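
The three data treatments compared in the abstract (spatial thinning, majority-only thinning and class balancing) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the square-grid thinning rule, the choice to thin each class separately, the balancing by random downsampling of the majority class, and all function names are assumptions made for illustration only.

```python
import random
from collections import defaultdict

# Each record is (x, y, detected): projected coordinates plus a 0/1 label.
# All function names and the grid-cell thinning rule are hypothetical.

def split_classes(records):
    """Return (minority, majority) lists, judged by class counts."""
    presences = [r for r in records if r[2] == 1]
    absences = [r for r in records if r[2] == 0]
    if len(presences) <= len(absences):
        return presences, absences
    return absences, presences

def grid_thin(records, cell_size, rng):
    """Keep one randomly chosen record per square grid cell."""
    cells = defaultdict(list)
    for x, y, label in records:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y, label))
    return [rng.choice(group) for group in cells.values()]

def thin_all(records, cell_size, rng):
    """Spatially thin both classes (assumed here: each class separately)."""
    minority, majority = split_classes(records)
    return grid_thin(minority, cell_size, rng) + grid_thin(majority, cell_size, rng)

def thin_majority_only(records, cell_size, rng):
    """Retain every minority-class record; thin only the majority class."""
    minority, majority = split_classes(records)
    return minority + grid_thin(majority, cell_size, rng)

def balance_classes(records, rng):
    """Downsample the majority class to the size of the minority class."""
    minority, majority = split_classes(records)
    return minority + rng.sample(majority, len(minority))

def prevalence(records):
    """Sample prevalence: the fraction of records that are detections."""
    return sum(r[2] for r in records) / len(records)

if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: a rare species (about 5% detections) on a 10 x 10 area.
    data = [(rng.uniform(0, 10), rng.uniform(0, 10), int(rng.random() < 0.05))
            for _ in range(2000)]
    treatments = [
        ("baseline", data),
        ("thin all", thin_all(data, 1.0, rng)),
        ("majority-only thinning", thin_majority_only(data, 1.0, rng)),
        ("class balancing", balance_classes(data, rng)),
    ]
    for name, subset in treatments:
        print(f"{name}: n={len(subset)}, prevalence={prevalence(subset):.3f}")
```

Each treatment changes both the sample size and the sample prevalence passed to the model. That connects to the abstract's closing point: the match between sample prevalence and true species prevalence may matter most when calibration is the goal.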


Updated: 2020-11-15