Discovering regulatory motifs of genetic networks using the indexing-tree based algorithm: a parallel implementation
Engineering Computations (IF 1.6), Pub Date: 2020-06-26, DOI: 10.1108/ec-02-2020-0108
Abedalmuhdi Almomany, Ahmad M. Al-Omari, Amin Jarrah, Mohammad Tawalbeh

Purpose

Motif discovery has become a significant challenge in the era of big data, with hundreds of genomes requiring annotation. The importance of motifs has led many researchers to develop tools and algorithms for finding them. This paper proposes a new algorithm that increases the speed and accuracy of the motif discovery process, the two main weaknesses of existing motif discovery algorithms.

Design/methodology/approach

All motifs are sorted in a tree-based indexing structure, where each motif is built from a combination of the nucleotides ‘A’, ‘C’, ‘T’ and ‘G’. A full motif is discovered by extending the search around a 4-mer of nucleotides in both directions, left and right. The resulting motifs may be identical or degenerate, and they vary in length.
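The seed-and-extend idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it swaps the tree-based index for a plain dictionary keyed by 4-mer, it handles exact (identical) motifs only and ignores degenerate ones, and the names `index_kmers`, `extend_seed`, `find_motifs` and the `min_support` parameter are hypothetical, introduced only for this example.

```python
from collections import defaultdict


def index_kmers(sequences, k=4):
    """Map each k-mer to the (sequence index, offset) pairs where it occurs.

    Assumption: a dictionary stands in for the paper's tree-based index.
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def _agree(sequences, occurrences, offset):
    """True if every occurrence has the same base at position pos + offset."""
    bases = set()
    for seq_id, pos in occurrences:
        i = pos + offset
        if i < 0 or i >= len(sequences[seq_id]):
            return False  # ran off the end of a sequence
        bases.add(sequences[seq_id][i])
    return len(bases) == 1


def extend_seed(sequences, occurrences, k=4):
    """Grow a k-mer seed left and right while all occurrences still match."""
    left = 0
    while _agree(sequences, occurrences, -(left + 1)):
        left += 1
    right = 0
    while _agree(sequences, occurrences, k + right):
        right += 1
    seq_id, pos = occurrences[0]
    return sequences[seq_id][pos - left:pos + k + right]


def find_motifs(sequences, k=4, min_support=2):
    """Extend every seed that occurs in at least min_support sequences."""
    motifs = set()
    for occurrences in index_kmers(sequences, k).values():
        support = len({seq_id for seq_id, _ in occurrences})
        if support >= min_support:
            motifs.add(extend_seed(sequences, occurrences, k))
    return motifs


# Toy input (made up for illustration): the three sequences share "GACGTA".
toy = ["TTGACGTACC", "AAGACGTAGG", "CGGACGTATT"]
print(find_motifs(toy, min_support=3))
```

In this toy input, each of the three shared seeds (`GACG`, `ACGT`, `CGTA`) extends to the same 6-mer, so duplicates collapse in the returned set.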

Findings

The developed implementation discovers conserved string motifs in DNA without having prior information about the motifs. Even for a large data set that contains millions of nucleotides and thousands of very long sequences, the entire process is completed in a few seconds.

Originality/value

Experimental results demonstrate the efficiency of the proposed implementation: for a real sequence data set of 1,270,000 nucleotides spread across 2,000 samples, the overall discovery process takes 5.9 s on an Intel Core i7-6700 @ 3.4 GHz machine and 26.7 s on an Intel Xeon x5670 @ 2.93 GHz machine. In addition, the authors improved computational performance by parallelizing the implementation to run on multi-core machines using the OpenMP framework. The speedup achieved by parallelization scales in proportion to the number of processors, with an efficiency close to 100%.
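For readers unfamiliar with the terms, speedup and parallel efficiency are conventionally defined as the serial run time divided by the parallel run time, and that ratio divided by the number of processors. The short Python sketch below shows the arithmetic; the function names and the timings in the usage lines are hypothetical and are not measurements from the paper.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    """Parallel efficiency E = S / p; 1.0 corresponds to 100%."""
    return speedup(t_serial, t_parallel) / n_processors


# Hypothetical timings, chosen only to illustrate the formulas:
# a run that takes 8.0 s on one core and 2.0 s on four cores.
print(speedup(8.0, 2.0))        # 4.0
print(efficiency(8.0, 2.0, 4))  # 1.0
```

An efficiency close to 100% therefore means the run time falls almost exactly in inverse proportion to the number of processors used.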




Updated: 2020-06-26