Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
bioRxiv - Synthetic Biology, Pub Date: 2020-12-15, DOI: 10.1101/622803
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
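
To illustrate what "identified by linear projections" can mean in practice, here is a minimal sketch of a linear probe. It does not use the paper's model or data: the embeddings are synthetic Gaussian clusters standing in for per-residue representations, the three classes are an assumed stand-in for secondary-structure labels, and the function names (`fit_linear_probe`, `predict`, `make_synthetic_embeddings`) are hypothetical. The probe itself is an ordinary least-squares fit to one-hot labels, chosen here for simplicity rather than taken from the source.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes):
    """Fit a least-squares linear map from fixed features to one-hot labels."""
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(num_classes)[labels]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def predict(W, features):
    """Project features with the probe and pick the highest-scoring class."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(X @ W, axis=1)


def make_synthetic_embeddings(n, centers, rng):
    """Draw labelled points around class centers (a stand-in for real embeddings)."""
    num_classes, dim = centers.shape
    labels = rng.integers(0, num_classes, size=n)
    features = centers[labels] + rng.normal(scale=1.0, size=(n, dim))
    return features, labels


rng = np.random.default_rng(0)
dim, num_classes = 32, 3  # assumed sizes, for illustration only
centers = rng.normal(scale=2.0, size=(num_classes, dim))

train_x, train_y = make_synthetic_embeddings(2000, centers, rng)
test_x, test_y = make_synthetic_embeddings(500, centers, rng)

# The features are never updated; only the linear map W is fitted.
W = fit_linear_probe(train_x, train_y, num_classes)
accuracy = float(np.mean(predict(W, test_x) == test_y))
print(f"held-out probe accuracy: {accuracy:.3f}")
```

Because only the linear map is trained, high held-out accuracy in a setup like this indicates that the label information is already linearly accessible in the features, which is the sense in which a probe "identifies" encoded structure.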

Updated: 2020-12-17