Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data
Genome Research (IF 7), Pub Date: 2017-11-01, DOI: 10.1101/gr.220673.117
Mingxiang Teng , Rafael A. Irizarry

The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of functional genomics' public catalogs is based on ChIP-seq data. These catalogs rely on peak calling algorithms that infer protein-binding sites by detecting genomic regions with more mapped reads (coverage) than would be expected by chance given the experimental protocol's imperfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning, the GC effect varies across experiments and is strong enough that a substantial number of peaks are called differently when different laboratories perform experiments on the same cell line. However, accounting for GC-content bias in ChIP-seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signal with unwanted variability. To address this challenge, we introduce a statistical approach that models GC effects on both the nonspecific noise and the signal induced by the binding site. The method can be used to correct for this bias in binding quantification as well as to improve existing peak calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
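
To make the notion of GC-dependent coverage concrete, the following is a minimal Python sketch, not the authors' method. It computes the GC fraction of fixed windows and averages read counts within GC bins, which is one simple way to look for a GC trend in coverage. The function names (`gc_fraction`, `mean_coverage_by_gc`), the bin count, and the toy windows are all assumptions made for illustration. Because it stratifies naively, it cannot separate GC effects on background noise from GC effects on true binding signal, which is the confounding the abstract describes.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def mean_coverage_by_gc(windows, n_bins=5):
    """Group (sequence, read_count) windows into equal-width GC bins.

    Returns {bin_index: mean read count} for the non-empty bins.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, reads in windows:
        # A GC fraction of exactly 1.0 is placed in the last bin.
        b = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        sums[b] += reads
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}


# Toy, made-up windows (hypothetical data): (sequence, mapped read count).
toy_windows = [
    ("ATATATATAT", 2),
    ("ATATGATATA", 4),
    ("ATGCGATCGA", 6),
    ("GCGCGCGCGA", 9),
    ("GCGCGCGCGC", 11),
]

profile = mean_coverage_by_gc(toy_windows)
for gc_bin, mean_reads in profile.items():
    print(f"GC bin {gc_bin}: mean reads = {mean_reads:.1f}")
```

In this toy data the mean read count rises with the GC bin. In real data a profile like this mixes background and binding signal, so it should be read as a diagnostic rather than a correction.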




Updated: 2017-11-01