Handling missing values in trait data
Global Ecology and Biogeography (IF 6.4), Pub Date: 2020-10-18, DOI: 10.1111/geb.13185
Thomas F. Johnson, Nick J. B. Isaac, Agustin Paviolo, Manuela González‐Suárez

Aim: Trait data are widely used in ecological and evolutionary phylogenetic comparative studies, but often values are not available for all species of interest. Researchers traditionally have excluded species without data from analyses, but estimation of missing values using imputation has been proposed as a better approach. However, imputation methods have largely been designed for randomly missing data, yet trait data are often not missing at random (e.g. more data for bigger species). Here we evaluate the performance of approaches for handling missing values considering biased datasets.

Location: Any

Time period: Any

Major taxa studied: Any

Methods: We simulated continuous traits and separate response variables to test performance of nine imputation methods and complete-case analysis (excluding missing values from the dataset) under biased missing data scenarios. We characterized performance by estimating error in imputed trait values (deviation from the true value) and inferred trait-response relationships (deviation from the true relationship between a trait and response).

Results: Generally, Rphylopars imputation produced the most accurate estimate of missing values and best preserved the response-trait slope. However, estimates of missing data were still inaccurate, even with only 5% of values missing. Under severe biases, errors were high with every approach. Imputation was not always the best option, with complete-case analysis frequently outperforming Mice imputation, and to a lesser degree BHPMF imputation. Mice, a popular approach, performed poorly when the response variable was excluded from the imputation model.

Main conclusions: Imputation can effectively handle missing data under some conditions, but is not always the best solution. None of the methods we tested could effectively deal with severe biases, which may be common in trait datasets. We recommend rigorous data checking for biases before and after imputation and propose variables that can assist researchers working with incomplete datasets to detect data biases and minimise errors.
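
The evaluation design described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code and does not reproduce any of the nine imputation methods they tested. It simulates one continuous trait and a linearly related response, removes trait values preferentially from small-valued species, and then compares complete-case analysis with simple mean imputation (a stand-in chosen here only for brevity). All function names, the exponential weighting scheme, the sample size, the noise level and the true slope of 0.5 are assumptions made for illustration.

```python
import numpy as np


def simulate_trait_and_response(n_species=500, true_slope=0.5, seed=0):
    """Simulate one continuous trait and a response that depends on it linearly."""
    rng = np.random.default_rng(seed)
    trait = rng.normal(loc=0.0, scale=1.0, size=n_species)
    response = true_slope * trait + rng.normal(scale=0.5, size=n_species)
    return trait, response


def biased_missing_mask(trait, frac_missing, bias_strength, rng):
    """Boolean mask of values to delete; smaller trait values are more likely missing.

    bias_strength = 0 gives uniform weights (missing completely at random);
    larger values concentrate the missing data among small-valued species.
    The exponential weighting is an illustrative assumption.
    """
    z = (trait - trait.mean()) / trait.std()
    weights = np.exp(-bias_strength * z)
    n_missing = int(round(frac_missing * trait.size))
    missing_idx = rng.choice(trait.size, size=n_missing, replace=False,
                             p=weights / weights.sum())
    mask = np.zeros(trait.size, dtype=bool)
    mask[missing_idx] = True
    return mask


def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x."""
    return np.polyfit(x, y, deg=1)[0]


def evaluate(frac_missing=0.3, bias_strength=2.0, true_slope=0.5, seed=0):
    """Compare complete-case analysis with mean imputation under biased missingness."""
    trait, response = simulate_trait_and_response(true_slope=true_slope, seed=seed)
    rng = np.random.default_rng(seed + 1)
    mask = biased_missing_mask(trait, frac_missing, bias_strength, rng)

    # Complete-case analysis: drop species whose trait value is missing.
    cc_slope = ols_slope(trait[~mask], response[~mask])

    # Mean imputation: fill every missing value with the observed mean.
    imputed = trait.copy()
    imputed[mask] = trait[~mask].mean()
    imp_slope = ols_slope(imputed, response)

    # Error in imputed values: deviation from the true (simulated) values.
    rmse = float(np.sqrt(np.mean((imputed[mask] - trait[mask]) ** 2)))
    return {
        "n_missing": int(mask.sum()),
        "imputation_rmse": rmse,
        "complete_case_slope_error": float(cc_slope - true_slope),
        "mean_imputation_slope_error": float(imp_slope - true_slope),
    }


if __name__ == "__main__":
    # Sweep from unbiased (0.0) to strongly biased missingness.
    for strength in (0.0, 1.0, 3.0):
        print(strength, evaluate(bias_strength=strength))
```

Running the script prints the two error measures named in the abstract (error in imputed values and deviation from the true trait-response slope) for increasing bias strength. The numbers depend entirely on the assumed settings and say nothing about the performance of Rphylopars, Mice or BHPMF.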

Updated: 2020-10-18