Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2019-11-15, DOI: 10.1186/s13015-019-0157-4
Christophe Ambroise 1, Alia Dehman 2, Pierre Neuvial 3, Guillem Rigaill 1,4, Nathalie Vialaneix 5

Background
Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. But a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of \(10^4\) to \(10^5\) for each chromosome.

Results
By assuming that the similarity between physically distant objects is negligible, we are able to propose an implementation of adjacency-constrained HAC with quasi-linear complexity. This is achieved by pre-calculating specific sums of similarities, and storing candidate fusions in a min-heap. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds.
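The heap-based selection of adjacent fusions described above can be illustrated with a minimal Python sketch. It is not the adjclust implementation: the function names (`block_sum_table`, `adjacency_constrained_hac`), the bandwidth parameter `h`, and the choice of a Ward-type linkage written in terms of similarity sums are all assumptions made for illustration. For brevity the sketch precomputes a dense 2D prefix-sum table, which is quadratic in memory; the quasi-linear cost reported in the paper relies on precomputing only band-limited sums instead.

```python
import heapq


def block_sum_table(sim, h=None):
    """2D prefix sums of sim, with entries farther than h from the diagonal
    treated as zero (the 'negligible distant similarity' assumption)."""
    p = len(sim)
    h = p if h is None else h
    table = [[0.0] * (p + 1) for _ in range(p + 1)]
    for i in range(p):
        for j in range(p):
            value = sim[i][j] if abs(i - j) <= h else 0.0
            table[i + 1][j + 1] = (value + table[i][j + 1]
                                   + table[i + 1][j] - table[i][j])
    return table


def adjacency_constrained_hac(sim, h=None):
    """Merge only adjacent clusters; return a list of (a, m, b, cost),
    meaning intervals [a, m) and [m, b) were fused at that linkage cost."""
    p = len(sim)
    table = block_sum_table(sim, h)

    def block(r0, r1, c0, c1):
        # Sum of similarities over rows [r0, r1) and columns [c0, c1).
        return table[r1][c1] - table[r0][c1] - table[r1][c0] + table[r0][c0]

    def ward(a, m, b):
        # Assumed Ward-type linkage between [a, m) and [m, b),
        # expressed with within-cluster and cross-cluster similarity sums.
        nu, nv = m - a, b - m
        suu, svv, suv = block(a, m, a, m), block(m, b, m, b), block(a, m, m, b)
        return nu * nv / (nu + nv) * (
            suu / nu ** 2 + svv / nv ** 2 - 2.0 * suv / (nu * nv))

    end = {i: i + 1 for i in range(p)}         # cluster start -> cluster end
    left_of = {i: i - 1 for i in range(p)}     # cluster start -> left neighbour start
    heap = [(ward(i, i + 1, i + 2), i, i + 1, i + 2) for i in range(p - 1)]
    heapq.heapify(heap)                        # min-heap of candidate fusions

    merges = []
    while heap and len(merges) < p - 1:
        cost, a, m, b = heapq.heappop(heap)
        # Lazy deletion: skip candidates whose clusters no longer exist.
        if end.get(a) != m or end.get(m) != b:
            continue
        merges.append((a, m, b, cost))
        end[a] = b
        del end[m]
        del left_of[m]
        if b < p:
            left_of[b] = a
        # Only the two fusions involving the new cluster need to be pushed.
        if left_of[a] >= 0:
            heapq.heappush(heap, (ward(left_of[a], a, b), left_of[a], a, b))
        if b < p:
            heapq.heappush(heap, (ward(a, b, end[b]), a, b, end[b]))
    return merges
```

Calling `adjacency_constrained_hac(sim)` on a symmetric similarity matrix given as a list of lists returns the sequence of fusions in the order they were performed. Each fusion pushes at most two new candidates, so the heap never holds more than a number of entries linear in the number of loci; stale entries are discarded when popped rather than searched for and removed.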

Availability and implementation
Software and sample data are available as an R package, adjclust, which can be downloaded from the Comprehensive R Archive Network (CRAN).




Updated: 2019-11-15