Generating novel protein sequences using Gibbs sampling of masked language models, bioRxiv - Synthetic Biology



bioRxiv - Synthetic Biology. Pub Date: 2021-01-27, DOI: 10.1101/2021.01.26.428322
Sean R. Johnson, Sarah Monaco, Kenneth Massie, Zaid Syed

Recently developed language models (LMs) based on deep neural networks have demonstrated the ability to generate fluent natural language text. LMs pre-trained on protein sequences have shown state-of-the-art performance on a variety of downstream tasks. Protein LMs have also been used to generate novel protein sequences. In the present work, we use Gibbs sampling of BERT-style LMs, pre-trained on protein sequences using the masked language modeling task, to generate novel protein sequences. We evaluate the quality of the generated sequences by comparing them to natural sequences from the same family. In particular, we focus on proteins from the chorismate mutase type II family, which has been used in previous work as an example target for protein generative models. We find that Gibbs sampling from BERT-style models pre-trained on millions to billions of protein sequences can generate novel sequences that retain key features of related natural sequences. Further, we find that smaller models, either fine-tuned or trained from scratch on family-specific data, can equal or surpass the generation quality of large pre-trained models by some metrics. The ability to generate novel, natural-like protein sequences could contribute to the development of improved protein therapeutics and protein catalysts for industrial chemical production.
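As a rough illustration of the idea, the sketch below shows one common form of Gibbs sampling from a masked model: repeatedly mask a single position, ask the model for a distribution over residues at that position given the rest of the sequence, and resample. The abstract does not specify the masking schedule or sampling details used in the paper, so the sweep order here is an assumption. The function `toy_masked_model` is a hypothetical stand-in for a real BERT-style protein LM, and the seed sequence is an arbitrary placeholder, not data from the paper.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "?"  # placeholder mask token (assumption, not a real tokenizer symbol)


def toy_masked_model(tokens, position):
    """Hypothetical stand-in for a masked protein LM.

    Returns a probability for each amino acid at `position`, conditioned
    on the immediate neighbours: residues matching a neighbour get extra
    weight. A real model would condition on the whole sequence.
    """
    neighbours = []
    if position > 0:
        neighbours.append(tokens[position - 1])
    if position < len(tokens) - 1:
        neighbours.append(tokens[position + 1])
    weights = [1.0 + 3.0 * neighbours.count(aa) for aa in AMINO_ACIDS]
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_sample(model, seed_sequence, n_sweeps, rng):
    """Resample every position once per sweep, in random order."""
    tokens = list(seed_sequence)
    for _ in range(n_sweeps):
        for position in rng.sample(range(len(tokens)), len(tokens)):
            tokens[position] = MASK                   # hide the current residue
            probs = model(tokens, position)           # p(residue | rest of sequence)
            tokens[position] = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return "".join(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    # Arbitrary placeholder seed; in practice one might start from a
    # natural family member or a fully masked sequence.
    print(gibbs_sample(toy_masked_model, "MKTAYIAKQR", n_sweeps=5, rng=rng))
```

With a real pre-trained model, `toy_masked_model` would be replaced by a forward pass that returns the softmax over the vocabulary at the masked position; the sampling loop itself would stay the same.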


Updated: 2021-01-27
