ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time.
Interdisciplinary Sciences: Computational Life Sciences (IF 3.9), Pub Date: 2020-06-10, DOI: 10.1007/s12539-020-00380-w
Manoj Kumar Pal, Tapobrata Lahiri, Rajnish Kumar

A protein sequence holds a wealth of experimental information that is yet to be exploited for identifying protein homologues. The published literature shows that dynamic programming, heuristic and HMM profile-based alignment techniques, as well as alignment-free techniques, do not directly use the ordered profile of a protein's physicochemical properties to identify its homologues. These works also lack crucial benchmarking or validation, without which their incorporation into search engines may appear questionable. To address this, the present work offers a fixed-dimensional numerical representation of protein sequences that extends the concept of the periodicity count value of nucleotide types (2017), so that Euclidean distance can serve as a direct similarity measure between two proteins. Rather than being benchmarked against BLAST and PSI-BLAST only, the new similarity measure was also compared with Needleman–Wunsch and Smith–Waterman. To strengthen the comparison, this work introduces, for the first time, two novel benchmarking methods based on the correlation of "similarity scores" and of "proximity of ranked outputs from a standard sequence alignment method" between all possible pairs of search techniques, including the new one presented in this paper. The novel numerical representation of a protein is found to reduce the computational complexity of protein sequence search to O(log(n)). It may also enable various other similarity-based operations, such as clustering, phylogenetic analysis and classification of proteins, on the basis of the properties used to build this numerical representation.
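The idea of comparing proteins through fixed-length vectors and Euclidean distance can be illustrated with a minimal sketch. The abstract does not describe how the ProtPCV vector is built, so the sketch below substitutes a simple 20-dimensional amino acid composition vector as a stand-in. The names `composition_vector` and `rank_by_distance`, and the toy sequences, are hypothetical and are not taken from the paper.

```python
import math

# The 20 standard amino acids, one vector dimension per residue type.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Map a sequence of any length to a 20-dimensional frequency vector.

    This is a stand-in representation, NOT the ProtPCV vector from the paper.
    """
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def rank_by_distance(query, database):
    """Return (distance, name) pairs sorted from most to least similar."""
    q = composition_vector(query)
    scored = [
        (math.dist(q, composition_vector(seq)), name)
        for name, seq in database.items()
    ]
    return sorted(scored)


# Toy sequences, invented for illustration only.
database = {
    "seq_a": "MKTAYIAKQR",
    "seq_b": "MKTAYIAKQRQISFVK",
    "seq_c": "GGGGGGGGGG",
}
ranking = rank_by_distance("MKTAYIAKQR", database)
for distance, name in ranking:
    print(f"{name}\t{distance:.3f}")
```

This sketch scans the whole database linearly. Because every sequence maps to a vector of the same dimension, the vectors could instead be stored in a spatial index such as a k-d tree, which is one common way to answer nearest-neighbour queries in sublinear time. The abstract does not say which structure underlies the reported O(log(n)) complexity.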



Updated: 2020-06-10