A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering


Journal of Applied Genetics (IF 2.0), Pub Date: 2020-01-24, DOI: 10.1007/s13353-020-00543-x
Houshang Dehghanzadeh, Mostafa Ghaderi-Zefrehei, Seyed Ziaeddin Mirhoseini, Saeid Esmaeilkhaniyan, Ishaku Lemu Haruna, Hamed Amirpour Najafabadi


Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its exon set was calculated for orders one to four. The Kullback-Leibler divergence was then calculated from the relative entropy of the genes and exons. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. The AdaBoost algorithm was used to aggregate the clustering results. Finally, the AdaBoost results were examined with the GeneMANIA prediction server to assess them from a gene annotation point of view. All calculations were performed in MATLAB software (2015). An examination of the genes' metabolic pathways based on their annotations showed that the proposed clustering method yielded correct, logical, and fast results. At the same time, the method avoids the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
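
The entropy and distance steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation, and it rests on several assumptions that the abstract does not confirm: "order k" is read as the Shannon entropy of the k-mer (word of length k) frequency distribution, the Kullback-Leibler divergence is symmetrized so that it behaves like a distance, and a pseudocount is added so that unseen k-mers do not produce infinite values. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed probabilities of all 4**k DNA words of length k in seq.

    Windows containing characters outside ACGT (e.g. N) are ignored.
    The pseudocount is an assumption of this sketch, not of the paper.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def shannon_entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes p and q share the same keys and q has no zero entries,
    which holds for distributions built with a positive pseudocount.
    """
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)


def kl_distance_matrix(seqs, k):
    """Pairwise symmetrized KL distances between sequences at order k."""
    dists = [kmer_distribution(s, k) for s in seqs]
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (kl_divergence(dists[i], dists[j])
                       + kl_divergence(dists[j], dists[i]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not data from the study.
    genes = ["ATATATTAATAT", "TATAATATTATA", "GCGCGGCCGCGC"]
    for order in range(1, 5):
        entropies = [shannon_entropy(kmer_distribution(g, order)) for g in genes]
        print(order, [round(h, 3) for h in entropies])
    for row in kl_distance_matrix(genes, 2):
        print([round(d, 3) for d in row])
```

A distance matrix of this kind could then be handed to a hierarchical clustering routine. For example, SciPy's `scipy.cluster.hierarchy.linkage` accepts the method names `single`, `complete`, `average`, `weighted`, `centroid`, and `median`, which match the linkage variants listed in the abstract; whether its results agree with the authors' MATLAB runs is not something this sketch establishes.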


Updated: 2020-01-24
