Fast Optimal Circular Clustering and Applications on Round Genomes.
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 4.5), Pub Date: 2021-05-04, DOI: 10.1109/tcbb.2021.3077573
Tathagata Debnath, Mingzhou Song

Round genomes are found in bacteria, plant chloroplasts, and mitochondria. Genetic or epigenetic marks can present biologically interesting clusters along a circular genome. The circular data clustering problem groups N points on a circle into K clusters to minimize the within-cluster sum of squared distances. Repeatedly applying the K-means algorithm takes quadratic time, impractical for large circular datasets. To overcome this issue, we developed a fast, reproducible, and optimal circular clustering (FOCC) algorithm of worst-case O(KN log^2 N) time. The core is a fast optimal framed clustering algorithm, which we designed by integrating two divide-and-conquer and one bracket dynamic programming strategies. The algorithm is optimal based on a property of monotonic increasing cluster borders over frames on linearized data. On clustering 50,000 circular data points, FOCC outruns brute-force or heuristic circular clustering by three orders of magnitude. We produced clusters of CpG sites and genes along three round genomes, exhibiting higher quality than heuristic clustering. More broadly, the presented subquadratic-time algorithms offer the fastest known solution to not only framed and circular clustering, but also angular, periodical, and looped clustering. We implemented these algorithms in the R package OptCirClust (https://CRAN.R-project.org/package=OptCirClust).
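
To make the problem statement concrete, the following is a minimal Python sketch of the brute-force baseline that the abstract contrasts against, not the FOCC algorithm itself and not the OptCirClust API. It assumes points are given as positions on a circle of known circumference, linearizes them by appending a shifted copy, and solves an ordinary optimal 1-D K-means by dynamic programming on every frame of N consecutive points. The function names (`linear_kmeans_cost`, `circular_kmeans_cost`) and the `circumference` parameter are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def linear_kmeans_cost(xs: Sequence[float], k: int) -> float:
    """Optimal within-cluster sum of squares for SORTED 1-D points.

    Plain O(k * n^2) dynamic programming over contiguous clusters.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of points")

    # Prefix sums of x and x^2 give any cluster's cost in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssq(i: int, j: int) -> float:
        """Sum of squared deviations from the mean of xs[i..j], inclusive."""
        total = s1[j + 1] - s1[i]
        return max(0.0, (s2[j + 1] - s2[i]) - total * total / (j - i + 1))

    inf = float("inf")
    # prev[j]: best cost of putting the first j points into m - 1 clusters.
    prev: List[float] = [0.0] + [inf] * n
    for m in range(1, k + 1):
        cur = [inf] * (n + 1)
        for j in range(m, n + 1):
            # The last cluster covers xs[i..j-1]; the first i points use m - 1.
            cur[j] = min(prev[i] + ssq(i, j - 1) for i in range(m - 1, j))
        prev = cur
    return prev[n]


def circular_kmeans_cost(points: Sequence[float], k: int,
                         circumference: float) -> float:
    """Brute-force circular clustering: try every frame of n points."""
    xs = sorted(p % circumference for p in points)
    n = len(xs)
    # Linearize the circle by appending a copy shifted by one full turn.
    extended = xs + [x + circumference for x in xs]
    return min(linear_kmeans_cost(extended[s:s + n], k) for s in range(n))


if __name__ == "__main__":
    pts = [0.5, 9.5, 4.0, 5.0]  # positions on a circle of circumference 10
    print(linear_kmeans_cost(sorted(pts), 2))   # ignores the wrap-around
    print(circular_kmeans_cost(pts, 2, 10.0))   # groups 9.5 with 0.5
```

In this toy input, 9.5 and 0.5 are neighbors across the origin, so the circular cost is much lower than the linear one. Because the sketch reruns the full dynamic program on every frame, it is far slower than the subquadratic method described above and is useful only for checking small cases.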

Updated: 2021-05-04