Learning the molecular grammar of protein condensates from sequence determinants and embeddings [Biophysics and Computational Biology]
Proceedings of the National Academy of Sciences of the United States of America (IF 9.4), Pub Date: 2021-04-13, DOI: 10.1073/pnas.2019053118
Kadi L Saar 1, 2 , Alexey S Morgunov 1 , Runzhang Qi 1 , William E Arter 1 , Georg Krainer 1 , Alpha A Lee 2 , Tuomas P J Knowles 2, 3

Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine-learning models for predicting protein liquid–liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/.
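As a rough illustration of two of the sequence-level quantities named in the abstract, the sketch below computes the Shannon entropy of a sequence's residue composition and the fraction of residues falling in a given class. It is not the DeePhase implementation: the residue groupings `HYDROPHOBIC` and `POLAR`, the function names, and the toy sequence are assumptions made for this example, and the paper's own feature definitions may differ.

```python
import math
from collections import Counter

# Assumption: illustrative residue groupings, not taken from the paper.
HYDROPHOBIC = set("AVILMFW")
POLAR = set("STNQ")


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    # H = -sum(p_i * log2(p_i)) over the residue types present in seq.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def residue_fraction(seq: str, residues: set) -> float:
    """Fraction of positions in seq whose residue belongs to residues."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(r in residues for r in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical low-complexity toy sequence, for demonstration only.
    toy = "GSYGSYGSYGSY"
    print("entropy (bits):", shannon_entropy(toy))
    print("hydrophobic fraction:", residue_fraction(toy, HYDROPHOBIC))
    print("polar fraction:", residue_fraction(toy, POLAR))
```

A sequence built from few residue types in repeated patterns gives a low entropy value, which is the sense in which the abstract describes LLPS-prone proteins as having lower Shannon entropy than typical database sequences.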




Updated: 2021-04-08