Unsupervised multi-instance learning for protein structure determination,Journal of Bioinformatics and Computational Biology

Unsupervised multi-instance learning for protein structure determination
Journal of Bioinformatics and Computational Biology (IF 0.9), Pub Date: 2021-02-11, DOI: 10.1142/s0219720021400023
Fardina Fathmiul Alam, Amarda Shehu

Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
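The three stages named in the abstract (organize decoys into bags, identify relevant bags, draw individual instances) can be illustrated with a minimal sketch. The abstract does not say how bags are formed, how relevance is scored, or how instances are drawn, so every concrete choice below is an assumption made only for illustration: bags come from a greedy leader clustering over a precomputed pairwise distance matrix, relevance is approximated by bag size, and the drawn instance is the bag medoid (one possible non-parametric choice). The function names `make_bags`, `relevant_bags`, `draw_instance`, and `select_decoys` are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only; not the authors' algorithm.
# Input: `dist`, a symmetric matrix (list of lists) of pairwise decoy
# distances, e.g. RMSD values computed elsewhere.

def make_bags(dist, radius):
    """Stage 1 (assumed): greedy leader clustering.

    Each decoy joins the first bag whose leader lies within `radius`;
    otherwise it starts a new bag and becomes that bag's leader.
    """
    leaders, bags = [], []
    for i in range(len(dist)):
        for b, lead in enumerate(leaders):
            if dist[i][lead] <= radius:
                bags[b].append(i)
                break
        else:
            leaders.append(i)
            bags.append([i])
    return bags


def relevant_bags(bags, top_k):
    """Stage 2 (assumed): treat the `top_k` largest bags as relevant."""
    return sorted(bags, key=len, reverse=True)[:top_k]


def draw_instance(dist, bag):
    """Stage 3 (assumed, non-parametric): return the bag medoid, i.e. the
    member with the smallest summed distance to the other members."""
    return min(bag, key=lambda i: sum(dist[i][j] for j in bag))


def select_decoys(dist, radius, top_k):
    """Run the three stages and return one decoy index per relevant bag."""
    bags = make_bags(dist, radius)
    return [draw_instance(dist, bag) for bag in relevant_bags(bags, top_k)]


if __name__ == "__main__":
    # Toy 1-D "decoys": two tight groups and one outlier.
    coords = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    dist = [[abs(a - b) for b in coords] for a in coords]
    print(select_decoys(dist, radius=0.5, top_k=2))
```

Keeping the stages as separate functions means any one of them can be swapped out, for example replacing the size-based relevance rule or substituting a parametric draw, without touching the others.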


Updated: 2021-02-11
