Pseudoreplication in genomic-scale data sets
Molecular Ecology Resources (IF 7.7), Pub Date: 2021-08-05, DOI: 10.1111/1755-0998.13482
Robin S Waples, Ryan K Waples, Eric J Ward

In genomic-scale data sets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df’) compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here, we measured pseudoreplication (quantified by the ratio df’/df) for a common metric of genetic differentiation (FST) and a common measure of linkage disequilibrium between pairs of loci (r²). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df’ and df’/df by measuring the rate of decline in the variance of mean FST and mean r² as more loci were used. For both indices, df’ increases with Ne and genome size, as expected. However, even for large Ne and large genomes, df’ for mean r² plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for FST, but df’/df ≤ 0.01 can occur in data sets using tens of thousands of loci. Commonly-used block-jackknife methods consistently overestimated var(FST), producing very conservative confidence intervals. Predicting df’ based on our modelling results as a function of Ne, L, S, and genome size provides a robust way to quantify precision associated with genomic-scale data sets.
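
The idea of estimating df’ from how fast the variance of a mean declines as loci are added can be illustrated with a toy sketch. This is not the authors' SLiM/msprime pipeline: the equicorrelated model, the function names `simulate_equicorrelated` and `effective_df`, and the parameter `rho` are all assumptions made for illustration. If every pair of loci shares a small correlation `rho`, the variance of the mean over L loci is approximately rho + (1 - rho)/L instead of 1/L, so the effective number of independent loci levels off near 1/rho no matter how many loci are added.

```python
import numpy as np


def simulate_equicorrelated(n_reps, n_loci, rho, rng):
    """Toy data (an assumption, not the paper's model): each replicate has
    n_loci per-locus values with unit variance and pairwise correlation rho."""
    shared = rng.standard_normal((n_reps, 1))      # component common to all loci
    own = rng.standard_normal((n_reps, n_loci))    # locus-specific component
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own


def effective_df(values):
    """Return (nominal df, effective df') for a (replicates x loci) array.

    df' is taken here as var(single locus) / var(mean across loci), i.e. the
    number of independent loci that would give the same variance of the mean.
    """
    values = np.asarray(values, dtype=float)
    n_loci = values.shape[1]
    var_single = values.var(axis=0, ddof=1).mean()   # average per-locus variance
    var_mean = values.mean(axis=1).var(ddof=1)       # variance of the across-locus mean
    return n_loci, var_single / var_mean


rng = np.random.default_rng(0)
for n_loci in (10, 100, 1000):
    x = simulate_equicorrelated(n_reps=2000, n_loci=n_loci, rho=0.01, rng=rng)
    df, df_eff = effective_df(x)
    print(f"L={df:5d}  df'={df_eff:8.1f}  df'/df={df_eff / df:.3f}")
```

With `rho = 0` the estimate should track the nominal number of loci; with any positive `rho`, the ratio df’/df shrinks as L grows, which is the qualitative pattern the abstract describes.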

Updated: 2021-08-05