Convex hulls in Hamming space enable efficient search for similarity and clustering of genomic sequences


BMC Bioinformatics (IF 2.9), Pub Date: 2020-12-30, DOI: 10.1186/s12859-020-03811-z
David S. Campo, Yury Khudyakov

In molecular epidemiology, comparison of intra-host viral variants among infected persons is frequently used to trace transmission in human populations and to detect outbreaks of viral infection. Ultra-deep sequencing (UDS) greatly increases the sensitivity of transmission detection but creates considerable computational challenges when all pairs of sequences must be compared.

We developed a new population comparison method based on convex hulls in Hamming space. We applied this method to a large set of UDS samples obtained from unrelated cases infected with hepatitis C virus (HCV) and compared its performance with three previously published methods. The convex hull in Hamming space is a data structure that provides information on: (1) the average Hamming distance within a set; (2) the average Hamming distance between two sets; (3) the closeness centrality of each sequence; and (4) lower and upper bounds on all pairwise distances between the members of two sets. This filtering strategy rapidly and correctly removes 96.2% of all pairwise HCV sample comparisons, outperforming all previous methods. The convex hull distance (CHD) algorithm showed variable performance depending on the sequence heterogeneity of the studied populations in real and simulated datasets, suggesting that clustering methods could improve its performance. To address this issue, we developed a new clustering algorithm, k-hulls, that reduces the heterogeneity of the convex hull. This efficient algorithm is an extension of the k-means algorithm and can be used with any type of categorical data. It is 6.8 times more accurate than k-mode, a previously developed clustering algorithm for categorical data.

CHD is a fast and efficient filtering strategy that massively reduces the computational burden of pairwise comparison among large samples of sequences, thus aiding the calculation of transmission links among infected individuals using threshold-based methods. In addition, the convex hull efficiently provides important summary metrics for intra-host viral populations.
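
To make the four summary quantities more concrete, the following is a minimal sketch of how average distances and distance bounds can be derived from a compact per-position summary of each sample, without enumerating sequence pairs. It is not the authors' CHD implementation: the per-position count profile and the function names (`profile`, `mean_between`, `mean_within`, `distance_bounds`) are assumptions made for illustration, and the sequences are assumed to be aligned and of equal length.

```python
from collections import Counter


def profile(seqs):
    """Per-position symbol counts for aligned, equal-length sequences."""
    return [Counter(column) for column in zip(*seqs)]


def mean_between(prof_a, n_a, prof_b, n_b):
    """Exact mean Hamming distance over all pairs (a, b), using profiles only."""
    total = 0.0
    for ca, cb in zip(prof_a, prof_b):
        # Number of (a, b) pairs that agree at this position.
        matches = sum(ca[s] * cb[s] for s in ca)
        total += 1.0 - matches / (n_a * n_b)
    return total


def mean_within(prof, n):
    """Exact mean Hamming distance over unordered pairs of distinct members (n >= 2)."""
    total = 0.0
    for c in prof:
        same = sum(v * v for v in c.values())
        total += (n * n - same) / (n * (n - 1))
    return total


def distance_bounds(prof_a, prof_b):
    """Lower and upper bounds that hold for every pair (a, b)."""
    lower = upper = 0
    for ca, cb in zip(prof_a, prof_b):
        sa, sb = set(ca), set(cb)
        if sa.isdisjoint(sb):
            lower += 1  # no shared symbol: every pair differs here
        if not (len(sa) == 1 and sa == sb):
            upper += 1  # at least one pair may differ here
    return lower, upper


def hamming(x, y):
    """Plain pairwise Hamming distance, used only to check the summaries."""
    return sum(a != b for a, b in zip(x, y))


# Toy samples (hypothetical data, not from the paper).
sample_a = ["ACGT", "ACGA", "ACTT"]
sample_b = ["TCGT", "TCGA"]
prof_a, prof_b = profile(sample_a), profile(sample_b)

lower, upper = distance_bounds(prof_a, prof_b)
avg_ab = mean_between(prof_a, len(sample_a), prof_b, len(sample_b))
avg_aa = mean_within(prof_a, len(sample_a))
```

In a threshold-based transmission analysis, a pair of samples whose lower bound already exceeds the distance threshold can be discarded without comparing any individual sequences, which is the kind of filtering the abstract describes. How tight such bounds are depends on how heterogeneous each sample is, which is consistent with the abstract's motivation for clustering a sample into more homogeneous groups first.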


Updated: 2020-12-30
