Critiquing Protein Family Classification Models Using Sufficient Input Subsets.

Journal of Computational Biology (IF 1.7), Pub Date: 2020-08-04, DOI: 10.1089/cmb.2019.0339
Brandon Carter (1, 2), Maxwell Bileschi (2), Jamie Smith (2), Theo Sanderson (2), Drew Bryant (2), David Belanger (2), Lucy J Colwell (2, 3)

In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
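The idea of a subset of sequence positions that is "alone sufficient for classification" can be illustrated with a minimal sketch. It assumes a greedy backward-selection reading of SIS: mask positions one at a time, always choosing the position whose removal leaves the model most confident, then add positions back in reverse order until confidence reaches a threshold. The classifier `toy_model`, the mask symbol `X`, the example sequence, and the threshold are all hypothetical and are not taken from the paper; the authors' actual models and extensions are not shown here.

```python
from typing import Callable, List, Sequence

MASK = "X"  # assumed placeholder symbol for a masked residue


def toy_model(seq: Sequence[str]) -> float:
    """Hypothetical classifier: confidence is the fraction of three
    motif residues found at their expected positions."""
    motif = {2: "C", 5: "H", 7: "C"}
    hits = sum(1 for i, aa in motif.items() if i < len(seq) and seq[i] == aa)
    return hits / len(motif)


def back_select(model: Callable[[Sequence[str]], float],
                seq: Sequence[str]) -> List[int]:
    """Greedily mask every position, each time masking the one whose
    removal keeps the model's confidence highest. Returns the positions
    in the order they were masked (least important first)."""
    current = list(seq)
    remaining = list(range(len(seq)))
    order: List[int] = []
    while remaining:
        def confidence_without(i: int) -> float:
            trial = current.copy()
            trial[i] = MASK
            return model(trial)

        best = max(remaining, key=confidence_without)
        current[best] = MASK
        remaining.remove(best)
        order.append(best)
    return order


def sufficient_subset(model: Callable[[Sequence[str]], float],
                      seq: Sequence[str],
                      threshold: float) -> List[int]:
    """Return positions that, with everything else masked, keep the
    model's confidence at or above `threshold`. Returns an empty list
    if the full sequence does not reach the threshold."""
    if model(seq) < threshold:
        return []
    order = back_select(model, seq)
    masked = [MASK] * len(seq)
    kept: List[int] = []
    # Restore positions from most to least important.
    for i in reversed(order):
        masked[i] = seq[i]
        kept.append(i)
        if model(masked) >= threshold:
            break
    return sorted(kept)


if __name__ == "__main__":
    print(sufficient_subset(toy_model, "AACAAHACAA", threshold=0.9))  # [2, 5, 7]
```

In this toy setting the recovered subset is exactly the three motif positions, which is the kind of check the abstract describes: whether the positions a model relies on correspond to something biologically meaningful rather than to a dataset artifact.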

Updated: 2020-08-08
