Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences
Physical Review E ( IF 2.4 ) Pub Date : 2020-03-20 , DOI: 10.1103/physreve.101.032413
Carlos A. Gandarilla-Pérez , Pierre Mergny , Martin Weigt , Anne-Florence Bitbol

Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of the two inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and that an iterative pairing algorithm makes predictions possible even without a training set. Noise in the training data degrades performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly as the dataset grows. This implies that it is generally beneficial to incorporate more data even if their quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
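
The synthetic-data setup described in the abstract can be illustrated with a minimal sketch: an Ising model whose couplings live on a block-structured random graph, sampled with single-spin-flip Metropolis Monte Carlo. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `block_couplings` and `metropolis_sample`, the edge probabilities `p_in` and `p_out`, the uniform coupling strength `J`, the temperature, and the choice to draw each sequence independently from a random start.

```python
import numpy as np


def block_couplings(n_blocks, block_size, p_in, p_out, interacting, J=1.0, rng=None):
    """Symmetric coupling matrix on a stochastic-block-model graph.

    Each block stands for one protein. Edges inside a block appear with
    probability p_in; edges between two blocks listed in `interacting`
    appear with probability p_out; all other block pairs are uncoupled.
    (Illustrative parameterization, not the paper's.)
    """
    rng = np.random.default_rng(rng)
    n = n_blocks * block_size
    block = np.repeat(np.arange(n_blocks), block_size)  # block index of each site
    same = block[:, None] == block[None, :]

    # Mark which pairs of blocks are declared interaction partners.
    inter = np.zeros((n_blocks, n_blocks), dtype=bool)
    for a, b in interacting:
        inter[a, b] = inter[b, a] = True
    cross = inter[block[:, None], block[None, :]]

    prob = np.where(same, p_in, np.where(cross, p_out, 0.0))
    # Draw each undirected edge once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return J * (upper | upper.T).astype(float)


def metropolis_sample(couplings, n_samples, n_sweeps=50, temperature=2.0, rng=None):
    """Draw spin configurations (+1/-1) with single-spin-flip Metropolis.

    Energy convention: H(s) = -sum_{i<j} J_ij s_i s_j.
    Each sample starts from a fresh random configuration.
    """
    rng = np.random.default_rng(rng)
    n = couplings.shape[0]
    samples = np.empty((n_samples, n), dtype=int)
    for m in range(n_samples):
        spins = rng.choice([-1, 1], size=n)
        for _ in range(n_sweeps * n):
            i = rng.integers(n)
            # Energy change if spin i is flipped.
            delta_e = 2.0 * spins[i] * (couplings[i] @ spins)
            if delta_e <= 0 or rng.random() < np.exp(-delta_e / temperature):
                spins[i] = -spins[i]
        samples[m] = spins
    return samples


if __name__ == "__main__":
    # Two "proteins" of 10 sites each, declared as interaction partners.
    J = block_couplings(2, 10, p_in=0.5, p_out=0.1, interacting=[(0, 1)], rng=0)
    seqs = metropolis_sample(J, n_samples=100, rng=1)
    print(seqs.shape)
```

Each row of `seqs` plays the role of one concatenated sequence pair; an inference method such as DCA would then be applied to such samples to recover the interblock couplings.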

Updated: 2020-03-21