Gclust: A Parallel Clustering Tool for Microbial Genomic Data.
Genomics, Proteomics & Bioinformatics ( IF 11.5 ) Pub Date : 2020-01-07 , DOI: 10.1016/j.gpb.2018.10.008
Ruilin Li 1 , Xiaoyu He 1 , Chuangchuang Dai 1 , Haidong Zhu 1 , Xianyu Lang 2 , Wei Chen 1 , Xiaodong Li 1 , Dan Zhao 1 , Yu Zhang 1 , Xinyin Han 1 , Tie Niu 2 , Yi Zhao 2 , Rongqiang Cao 2 , Rong He 2 , Zhonghua Lu 2 , Xuebin Chi 3 , Weizhong Li 4 , Beifang Niu 5

The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses these resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs). Genome identity measures between two sequences are calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
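
To make the notion of a maximal exact match concrete, the following is a minimal sketch in Python, not Gclust's implementation. It finds MEMs by brute force rather than with sparse suffix arrays, so it is only suitable for short toy sequences. The function names `find_mems` and `mem_coverage`, the `min_len` parameter, and the coverage-based identity measure are all assumptions made for illustration; the abstract does not specify the exact identity formula Gclust uses.

```python
def find_mems(a, b, min_len=3):
    """Return maximal exact matches between a and b as (pos_a, pos_b, length).

    Brute-force illustration only; Gclust itself uses sparse suffix arrays.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A match that can be extended to the left is not left-maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


def mem_coverage(a, b, min_len=3):
    """Fraction of positions in a covered by at least one MEM with b.

    Hypothetical identity measure chosen for illustration, not Gclust's formula.
    """
    if not a:
        return 0.0
    covered = [False] * len(a)
    for i, _, k in find_mems(a, b, min_len):
        for p in range(i, i + k):
            covered[p] = True
    return sum(covered) / len(a)
```

Calling `mem_coverage("AAAACCCC", "AAAAGGGG")` reports the share of the first sequence that lies inside some exact match of at least three bases with the second. A clustering tool could compare such a score against a threshold to decide whether two genomes belong to the same cluster, though the threshold and scoring used by Gclust may differ.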

Updated: 2020-04-21