PASV: Automatic protein partitioning and validation using conserved residues
bioRxiv - Bioinformatics. Pub Date: 2021-02-02, DOI: 10.1101/2021.01.20.427478
Ryan M. Moore, Amelia O. Harrison, Daniel J. Nasko, Jessica Chopyk, Metehan Cebeci, Barbra D. Ferrell, Shawn W. Polson, K. Eric Wommack

Background: Increasingly, researchers use protein-coding genes from targeted PCR amplification or direct metagenomic sequencing in community and population ecology. Analysis of protein-coding genes presents different challenges from those encountered in traditional SSU rRNA studies. Most protein-coding sequences are annotated based on homology to other computationally annotated sequences, which can lead to inaccurate annotations. Therefore, the results of sensitive homology searches must be validated to remove false positives and assess functionality. Multiple lines of in silico evidence can be gathered by examining conserved domains and residues identified through biochemical investigations. However, manually validating sequences in this way can be time-consuming and error-prone, especially in large environmental studies.

Results: An automated pipeline for protein active site validation (PASV) was developed to improve validation and partitioning accuracy for protein-coding sequences, combining multiple sequence alignment with expert domain knowledge. PASV was tested using commonly misannotated proteins: ribonucleotide reductase (RNR), alternative oxidase (AOX), and plastid terminal oxidase (PTOX). PASV partitioned 9,906 putative Class I alpha and Class II RNR sequences from bycatch in a global viral metagenomic investigation with >99% true positive and true negative rates. PASV predicted the class of 2,579 RNR sequences in >98% agreement with manual annotations. PASV correctly partitioned all 336 tested AOX and PTOX sequences.

Conclusions: PASV provides an automated and accurate way to address post-homology search validation and partitioning of protein-coding marker genes. Source code is released under the MIT license and is available, with documentation and usage examples, on GitHub at https://github.com/mooreryan/pasv.
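The general idea of checking conserved residues through an alignment can be illustrated with a minimal Python sketch. This is not PASV's code or its command-line interface. It assumes a query has already been aligned to a reference sequence by an external aligner, and that an expert has chosen which reference positions matter. The function names, the toy sequences, and the chosen positions are all hypothetical.

```python
from typing import Iterable


def column_for_reference_position(aligned_ref: str, position: int) -> int:
    """Map a 1-based position in the ungapped reference to a 0-based alignment column."""
    seen = 0
    for col, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError(f"reference has no position {position}")


def residue_signature(aligned_ref: str, aligned_query: str, positions: Iterable[int]) -> str:
    """Collect the query residues that line up with the chosen reference positions."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        aligned_query[column_for_reference_position(aligned_ref, p)] for p in positions
    )


def partition(signature: str, expected: str) -> str:
    """Label a query by whether its signature matches the expected residues."""
    return "pass" if signature == expected else "fail"


# Toy data (hypothetical): the ungapped reference is "MKCE" and the
# residues of interest sit at reference positions 3 (C) and 4 (E).
ALIGNED_REF = "MK-CE"
KEY_POSITIONS = [3, 4]

for name, aligned_query in [("query_1", "MKACE"), ("query_2", "MKAS-")]:
    sig = residue_signature(ALIGNED_REF, aligned_query, KEY_POSITIONS)
    print(name, sig, partition(sig, expected="CE"))
```

Grouping queries by their signature string, rather than only by pass or fail, would be one way to partition sequences into classes. Whether that matches PASV's own behavior should be checked against its documentation.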

Updated: 2021-02-03