Machine learning models for predicting protein condensate formation from sequence determinants and embeddings
bioRxiv - Biophysics Pub Date : 2020-10-26 , DOI: 10.1101/2020.10.26.354753
Kadi L. Saar , Alexey S. Morgunov , Runzhang Qi , William E. Arter , Georg Krainer , Alpha A. Lee , Tuomas P. J. Knowles

Intracellular phase separation of proteins into biomolecular condensates is increasingly recognised as an important phenomenon for cellular compartmentalisation and regulation of biological function. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, here, we established an in silico strategy for understanding on a global level the associations between protein sequence and condensate formation, and used this information to construct machine learning classifiers for predicting liquid-liquid phase separation (LLPS) from protein sequence. Our analysis highlighted that LLPS-prone sequences are more disordered, hydrophobic and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database, and have their disordered regions enriched in polar, aromatic and charged residues. Using these determining features together with neural network based word2vec sequence embeddings, we developed machine learning classifiers for predicting protein condensate formation. Our model, trained to distinguish LLPS-prone sequences from structured proteins, achieved high accuracy (93%; 25-fold cross-validation) and identified condensate forming sequences from external independent test data at 97% sensitivity. Moreover, in combination with a classifier that had developed a nuanced insight into the features governing protein phase behaviour by learning to distinguish between sequences of varying LLPS propensity, the sensitivity was supplemented with high specificity (approximated ROC-AUC of 0.85). These results provide a platform rooted in molecular principles for understanding protein phase behaviour. The predictor is accessible from https://deephase.ch.cam.ac.uk/ .
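One of the sequence features named in the abstract, Shannon entropy, can be illustrated with a short sketch. The abstract does not state how the authors computed it, so the definition below (entropy of the residue composition, in bits per residue) and the function name `shannon_entropy` are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits per residue) of a sequence's residue composition.

    Assumed definition: H = -sum_i p_i * log2(p_i), where p_i is the
    fraction of positions occupied by residue type i. This may differ
    from the definition used by the authors.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy sequences (hypothetical, not from the paper's datasets):
low_complexity = "GGSGGSGGSGGSGGS"   # only two residue types
high_complexity = "ACDEFGHIKLMNPQR"  # fifteen distinct residue types

print(shannon_entropy(low_complexity))
print(shannon_entropy(high_complexity))
```

Under this definition, a repetitive, low-complexity sequence scores lower than a compositionally diverse one, which matches the direction of the abstract's observation that LLPS-prone sequences have lower Shannon entropy.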

Updated: 2020-10-30