Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data, Genome Research



Genome Research (IF 6.2), Pub Date: 2017-10-12, DOI: 10.1101/gr.220673.117
Mingxiang Teng^{1,2,3}, Rafael A Irizarry^{1,2}


The main application of ChIP-seq technology is the detection of genomic regions that bind to a protein of interest. A large part of functional genomics’ public catalogs is based on ChIP-seq data. These catalogs rely on peak-calling algorithms that infer protein-binding sites by detecting genomic regions associated with more mapped reads (coverage) than expected by chance, where chance coverage arises because the experimental protocol lacks perfect specificity. We find that GC-content bias accounts for substantial variability in the observed coverage for ChIP-seq experiments and that this variability leads to false-positive peak calls. More concerning is that the GC effect varies across experiments, and it is strong enough that a substantial number of peaks are called differently when different laboratories perform experiments on the same cell line. However, accounting for GC-content bias in ChIP-seq is challenging because the binding sites of interest tend to be more common in high GC-content regions, which confounds real biological signals with unwanted variability. To address this challenge, we introduce a statistical approach that accounts for GC effects on both nonspecific noise and signal induced by the binding site. The method can be used to account for this bias in binding quantification as well as to improve existing peak-calling algorithms. We use this approach to show a reduction in false-positive peaks as well as improved consistency across laboratories.
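To make the notion of a GC effect on coverage concrete, the following is a minimal diagnostic sketch and not the authors' statistical model. It assumes the genome has already been cut into fixed windows, each with a sequence string and a mapped-read count. It groups windows by GC fraction and averages the counts per group. The function names (`gc_fraction`, `mean_coverage_by_gc`), the number of strata, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Windows with no unambiguous bases (e.g. all N) are reported as 0.0.
    return gc / total if total else 0.0


def mean_coverage_by_gc(windows, counts, n_strata=5):
    """Average read count per equal-width GC stratum.

    windows: iterable of sequence strings (hypothetical fixed-size bins)
    counts:  iterable of mapped-read counts, one per window
    Returns a dict mapping stratum index (0 = lowest GC) to mean count.
    """
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for seq, count in zip(windows, counts):
        # Clamp so that a GC fraction of exactly 1.0 falls in the top stratum.
        stratum = min(int(gc_fraction(seq) * n_strata), n_strata - 1)
        sums[stratum] += count
        sizes[stratum] += 1
    return {s: sums[s] / sizes[s] for s in sorted(sums)}


# Toy data, invented for illustration only.
toy_windows = ["ATATATAT", "GCGCGCGC", "ATGCATGC", "AAAATTTT"]
toy_counts = [2, 10, 5, 4]
print(mean_coverage_by_gc(toy_windows, toy_counts))
```

A trend in mean coverage across strata would suggest a GC effect, but as the abstract notes, binding sites themselves tend to lie in high-GC regions. A simple stratified summary like this one therefore cannot separate bias from true signal, which is the reason the paper models GC effects on noise and signal jointly.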


Updated: 2017-10-12
