

Bayesian localization of CNV candidates in WGS data within minutes.
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2019-09-23, DOI: 10.1186/s13015-019-0154-7
John Wiedenhoeft¹,², Alex Cagan³,⁴, Rimma Kozhemyakina⁵, Rimma Gulevich⁵, Alexander Schliep¹,²


BACKGROUND: Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward-Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice.

RESULTS: In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler.

CONCLUSIONS: Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop.
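
To illustrate why Haar wavelets suit piecewise-constant signals such as copy-number profiles, the following is a minimal sketch of a plain (unnormalized) one-dimensional Haar transform and its inverse. It is not the paper's linear-time, in-place transform or its data structures; the function names `haar_forward` and `haar_inverse` are hypothetical, and the input length is assumed to be a power of two.

```python
def haar_forward(values):
    """Unnormalized Haar transform of a sequence whose length is a power of two.

    Returns [overall mean] followed by detail coefficients, coarsest level first.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    approx = [float(v) for v in values]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail = half the difference within a pair; approximation = pair mean.
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details


def haar_inverse(coeffs):
    """Invert haar_forward by expanding one level of detail at a time."""
    approx = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        expanded = []
        for mean, detail in zip(approx, level):
            expanded.extend((mean + detail, mean - detail))
        approx = expanded
    return approx


# A signal with a single jump: only the coefficient spanning the jump is nonzero.
signal = [4, 4, 8, 8]
coeffs = haar_forward(signal)
restored = haar_inverse(coeffs)
```

In this toy signal, every detail coefficient that does not span the jump is zero, so runs of zero coefficients mark stretches that can be treated as a single block. This is only meant to convey the intuition behind wavelet-based compression of the data.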


Updated: 2019-11-01
