Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments. Database: The Journal of Biological Databases and Curation


Database: The Journal of Biological Databases and Curation (IF 5.8), Pub Date: 2020-06-05, DOI: 10.1093/database/baaa042
Andrew F Neuwald¹˒², Christopher J Lanczycki³, Theresa K Hodges¹, Aron Marchler-Bauer³


For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may tend to misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
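The hierarchical arrangement described in the abstract can be pictured with a small toy model. The sketch below is only an illustration under stated assumptions: it is not the MAPGAPS algorithm or its file format, and the names `SubgroupMSA`, `to_parent`, `project_to_root` and `merge` are hypothetical. It assumes that each subgroup MSA carries a column map playing the role of a template MSA, that is, a record of which parent column each subgroup column corresponds to, and it shows how rows from different subgroups might then be re-expressed in one shared set of columns.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubgroupMSA:
    """Toy stand-in for one node of a hierarchical MSA (hypothetical)."""
    name: str
    rows: Dict[str, str]  # sequence id -> aligned row; all rows share one width
    parent: Optional["SubgroupMSA"] = None
    # to_parent[i] is the parent column matching this MSA's column i,
    # or None if column i is an insertion relative to the parent.
    to_parent: List[Optional[int]] = field(default_factory=list)


def width(msa: SubgroupMSA) -> int:
    """Number of columns in an MSA (taken from its first row)."""
    return len(next(iter(msa.rows.values())))


def project_to_root(msa: SubgroupMSA, row: str) -> str:
    """Re-express one aligned row in the column coordinates of the root MSA."""
    node = msa
    while node.parent is not None:
        out = ["-"] * width(node.parent)
        for i, ch in enumerate(row):
            j = node.to_parent[i]
            if j is not None:  # insertion columns are dropped in this toy model
                out[j] = ch
        row = "".join(out)
        node = node.parent
    return row


def merge(nodes: List[SubgroupMSA]) -> Dict[str, str]:
    """Collect every row of every node in root coordinates."""
    merged: Dict[str, str] = {}
    for node in nodes:
        for seq_id, row in node.rows.items():
            merged[seq_id] = project_to_root(node, row)
    return merged


# Invented example data: a root MSA, two subgroups, and one nested subgroup.
root = SubgroupMSA("root", {"r1": "ACDEF"})
sub_a = SubgroupMSA("A", {"a1": "ACEF"}, parent=root, to_parent=[0, 1, 3, 4])
sub_b = SubgroupMSA("B", {"b1": "AXCDEF"}, parent=root,
                    to_parent=[0, None, 1, 2, 3, 4])
sub_c = SubgroupMSA("C", {"c1": "AEF"}, parent=sub_a, to_parent=[0, 2, 3])

merged = merge([root, sub_a, sub_b, sub_c])
for seq_id, row in merged.items():
    print(seq_id, row)
```

In this toy model, the nested subgroup `C` reaches the root columns through two successive maps (first into `A`, then into the root), which mirrors the idea of aligning each subgroup against MSAs higher up the hierarchy. Dropping insertion columns is a simplification chosen here for brevity.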


Updated: 2020-06-05
