Linear time minimum segmentation enables scalable founder reconstruction. Algorithms for Molecular Biology

Linear time minimum segmentation enables scalable founder reconstruction.
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2019-05-17, DOI: 10.1186/s13015-019-0147-6
Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen

BACKGROUND: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain the contiguities of the original sequences as well as possible. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $R = \{R_1, \ldots, R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1, n]$ into a set $P$ of disjoint segments such that each segment $[a, b] \in P$ has length at least $L$ and the number $d(a, b) = |\{R_i[a, b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a, b]$ is minimized over $[a, b] \in P$, that is, the largest $d(a, b)$ among the segments is as small as possible. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a, b) : [a, b] \in P\}$ founder sequences representing the original $R$ such that crossovers happen only at segment boundaries.

RESULTS: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$ time solution.

CONCLUSIONS: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
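
To make the problem definition in the abstract concrete, the following is a minimal brute-force sketch of the minimum segmentation problem in Python. It is not the paper's $O(mn)$ algorithm and is not taken from the linked repository; it is a naive dynamic program that recomputes each $d(a, b)$ from scratch, so it is only suitable for toy inputs. The function names `distinct_count` and `min_segmentation` are hypothetical, and positions are 0-based half-open intervals rather than the 1-based closed intervals $[a, b]$ used above.

```python
from math import inf


def distinct_count(rows, a, b):
    """d(a, b): number of distinct substrings rows[i][a:b] (0-based, half-open)."""
    return len({r[a:b] for r in rows})


def min_segmentation(rows, min_len):
    """Naive DP for the minimum segmentation problem (illustration only).

    Returns (best, segments), where `best` is the smallest achievable value of
    max d(a, b) over all partitions of [0, n) into segments of length at least
    `min_len`, and `segments` is one partition attaining it.
    """
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("all haplotype strings must have the same length")

    # best[j]: optimal value for the prefix [0, j); inf if no valid partition.
    best = [inf] * (n + 1)
    prev = [-1] * (n + 1)  # prev[j]: start of the last segment in an optimum
    best[0] = 0

    for j in range(min_len, n + 1):
        # The last segment is [i, j) and must have length >= min_len.
        for i in range(0, j - min_len + 1):
            if best[i] == inf:
                continue
            cost = max(best[i], distinct_count(rows, i, j))
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    if best[n] == inf:
        raise ValueError("no valid segmentation for this min_len")

    # Backtrack through prev[] to recover the segment boundaries.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i, j))
        j = i
    segments.reverse()
    return best[n], segments


if __name__ == "__main__":
    haplotypes = ["ACGT", "ACGA", "TCGT"]
    print(min_segmentation(haplotypes, 2))
```

In this toy input, treating the whole range as one segment gives three distinct strings, while splitting it into two segments of length 2 gives two distinct blocks per segment, so two founder sequences suffice when $L = 2$.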


Updated: 2019-11-01
