MSA Transformer
bioRxiv - Synthetic Biology, Pub Date: 2021-08-27, DOI: 10.1101/2021.02.12.430858
Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, Alexander Rives

Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
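The following is a minimal sketch, not the authors' implementation, of the interleaved row and column attention described in the abstract: row attention mixes information across residue positions within each aligned sequence, while column attention mixes information across sequences at each alignment column. All class names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Sketch of an MSA block with interleaved row/column self-attention (assumed design).
import torch
import torch.nn as nn

class MSABlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Row attention: attends over positions within each aligned sequence.
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Column attention: attends over sequences at each alignment column.
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_sequences, seq_len, dim) -- embeddings of one MSA.
        r = self.norm1(x)
        x = x + self.row_attn(r, r, r, need_weights=False)[0]   # rows act as the batch
        c = self.norm2(x).transpose(0, 1)                       # (seq_len, num_sequences, dim)
        x = x + self.col_attn(c, c, c, need_weights=False)[0].transpose(0, 1)
        return x

msa = torch.randn(16, 128, 64)   # 16 aligned sequences, 128 columns, toy embedding dim 64
out = MSABlock()(msa)
print(out.shape)                 # torch.Size([16, 128, 64])
```

In practice the full model stacks many such blocks with feed-forward layers and trains them with a masked language modeling objective over tokens of the MSA, but those details are omitted from this toy example.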

Updated: 2021-08-30