Effects of ignoring survey design information for data reuse, Ecological Applications


Effects of ignoring survey design information for data reuse
Ecological Applications ( IF 4.3 ) Pub Date : 2021-04-25 , DOI: 10.1002/eap.2360
Scott D Foster ₁, Jarno Vanhatalo ₂,₃, Verena M Trenkel ₄, Torsti Schulz ₃, Emma Lawrence ₅, Rachel Przeslawski ₆, Geoffrey R Hosack ₁


Data are currently being used, and reused, in ecological research at an unprecedented rate. To ensure appropriate reuse, however, we need to ask the question: “Are aggregated databases currently providing the right information to enable effective and unbiased reuse?” We investigate this question, with a focus on designs that purposefully favor the selection of some sampling locations (upweighting the probability of selection of those locations). Such designs are common; examples include designs with uneven inclusion probabilities and stratified designs. We perform a simulation experiment by creating data sets with progressively more uneven inclusion probabilities and examine the resulting estimates of the average number of individuals per unit area (density). The effect of ignoring the survey design can be profound, with biases of up to 250% in density estimates when naive analytical methods are used. This density estimation bias is not reduced by adding more data. Fortunately, the estimation bias can be mitigated by using an appropriate estimator or an appropriate model that incorporates the design information. These remedies are available, however, only when essential information about the survey design is available: the sample location selection process (e.g., inclusion probabilities), and/or the covariates used in their specification. The results suggest that such information must be stored and served with the data to support meaningful inference and data reuse.
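The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experiment: the population model, the `unevenness` parameter, the use of Poisson sampling, and the choice of the Hájek (inverse-probability-weighted) mean as the "appropriate estimator" are all assumptions made for illustration. It compares the naive sample mean with a design-weighted mean when sites with higher density are more likely to be sampled.

```python
import numpy as np


def simulate(unevenness, n_sites=10_000, expected_n=200, n_reps=500, seed=0):
    """Return (relative bias of naive mean, relative bias of Hajek mean).

    All settings here are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical population: a habitat covariate x drives density per site.
    x = rng.uniform(0.0, 1.0, n_sites)
    density = rng.poisson(np.exp(1.0 + 2.0 * x)).astype(float)
    true_mean = density.mean()

    # Inclusion probabilities favour high-x sites.
    # unevenness = 0 gives every site the same probability.
    raw = np.exp(unevenness * x)
    pi = expected_n * raw / raw.sum()

    naive, hajek = [], []
    for _ in range(n_reps):
        # Poisson sampling: each site is included independently with prob. pi.
        chosen = rng.random(n_sites) < pi
        y = density[chosen]
        w = 1.0 / pi[chosen]  # design weights
        naive.append(y.mean())  # ignores the design
        hajek.append(np.sum(w * y) / np.sum(w))  # uses the design
    return (np.mean(naive) / true_mean - 1.0,
            np.mean(hajek) / true_mean - 1.0)


for k in (0.0, 1.0, 2.0, 4.0):
    naive_bias, hajek_bias = simulate(k)
    print(f"unevenness={k:.0f}  naive bias={naive_bias:+.1%}  "
          f"weighted bias={hajek_bias:+.1%}")
```

In this toy setting the naive bias grows with `unevenness` and does not shrink if `expected_n` is increased, whereas the weighted estimate stays close to the true mean. The weighted estimate can only be computed because `pi` is known, which is the kind of design information the abstract argues must be stored with the data.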


Updated: 2021-04-25
