

Fast Optimal Circular Clustering and Applications on Round Genomes
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 3.6), Pub Date: 2021-05-04, DOI: 10.1109/tcbb.2021.3077573
Tathagata Debnath, Mingzhou Song


Round genomes are found in bacteria, plant chloroplasts, and mitochondria. Genetic or epigenetic marks can present biologically interesting clusters along a circular genome. The circular data clustering problem groups $N$ points on a circle into $K$ clusters to minimize the within-cluster sum of squared distances. Repeatedly applying the $K$-means algorithm takes quadratic time, impractical for large circular datasets. To overcome this issue, we developed a reproducible fast optimal circular clustering (FOCC) algorithm of worst-case $\mathcal{O}(KN \log^2 N)$ time. The core is a fast optimal framed clustering algorithm, which we designed by integrating two divide-and-conquer and one bracket dynamic programming strategies. The algorithm is optimal based on a property of monotonic increasing cluster borders over frames on linearized data. On clustering 50,000 circular data points, FOCC outruns brute-force or heuristic circular clustering by three orders of magnitude in time. We produced clusters of CpG sites and genes along three round genomes, exhibiting higher quality than heuristic clustering. More broadly, the presented subquadratic-time algorithms offer the fastest known solution to not only framed and circular clustering, but also angular, periodical, and looped clustering. We implemented these algorithms in the R package ‘OptCirClust’ (https://CRAN.R-project.org/package=OptCirClust).
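
To make the problem statement concrete, the following Python sketch solves circular clustering by brute force: it cuts the circle at every point in turn, linearizes the data into a frame, and runs a plain dynamic program for optimal one-dimensional $K$-clustering on each frame. This is not the FOCC algorithm and not the OptCirClust API. The function names `linear_kmeans` and `circular_kmeans` are hypothetical, and the cubic-time dynamic program is an assumption chosen for brevity; it only illustrates the kind of repeated-frame baseline that the abstract describes as impractical for large $N$.

```python
from typing import List, Tuple


def linear_kmeans(x: List[float], k: int) -> Tuple[float, List[int]]:
    """Optimal k-clustering of sorted 1-D points (requires len(x) >= k).

    Returns the minimum within-cluster sum of squares (SSQ) and the
    start index of each cluster. Simple O(k * n^2) dynamic program.
    """
    n = len(x)
    # Prefix sums of values and squared values give O(1) cluster costs.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def ssq(i: int, j: int) -> float:
        # SSQ of points x[i:j] around their mean; clamp rounding noise.
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / (j - i))

    inf = float("inf")
    dp = [[inf] * (n + 1) for _ in range(k + 1)]   # dp[c][j]: best SSQ of x[:j] in c clusters
    arg = [[0] * (n + 1) for _ in range(k + 1)]    # start index of the last cluster
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = dp[c - 1][i] + ssq(i, j)
                if cand < dp[c][j]:
                    dp[c][j], arg[c][j] = cand, i

    # Backtrack the cluster start indices from the last cluster to the first.
    starts, j = [], n
    for c in range(k, 0, -1):
        j = arg[c][j]
        starts.append(j)
    return dp[k][n], starts[::-1]


def circular_kmeans(points: List[float], k: int,
                    circumference: float) -> Tuple[float, List[List[float]]]:
    """Brute-force circular clustering: try every cut point on the circle."""
    x = sorted(p % circumference for p in points)
    n = len(x)
    best = (float("inf"), 0, [])
    for s in range(n):
        # Frame starting at x[s]; wrapped points are shifted by one lap.
        frame = x[s:] + [v + circumference for v in x[:s]]
        cost, starts = linear_kmeans(frame, k)
        if cost < best[0]:
            best = (cost, s, starts)
    cost, s, starts = best
    rotated = x[s:] + x[:s]
    bounds = starts + [n]
    clusters = [rotated[a:b] for a, b in zip(bounds, bounds[1:])]
    return cost, clusters


if __name__ == "__main__":
    cost, clusters = circular_kmeans([0.1, 0.2, 9.9, 5.0, 5.1], k=2, circumference=10.0)
    print(cost, clusters)
```

In the example call, the points 9.9, 0.1, and 0.2 sit next to each other across the origin of a circle of circumference 10, so the circular solution groups them together, whereas clustering the same coordinates on a line would split them.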


Updated: 2021-05-04
