Amino acid encoding for deep learning applications.
BMC Bioinformatics ( IF 2.9 ) Pub Date : 2020-06-09 , DOI: 10.1186/s12859-020-03546-x
Hesham ElAbd 1 , Yana Bromberg 2, 3, 4 , Adrienne Hoarfrost 2 , Tobias Lenz 5 , Andre Franke 1 , Mareike Wendorff 1
The number of applications of deep learning algorithms in bioinformatics is increasing, as they usually achieve superior performance over classical approaches, especially when larger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as continuous vectors through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction (a process called 'end-to-end learning') has led to state-of-the-art results in many fields. Although the usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and a hybrid CNN-RNN. By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding.
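The encoding strategies compared in the abstract all reduce to the same mechanics: each of the 20 standard amino acids maps to one row of an encoding matrix, whether that matrix is one-hot, BLOSUM62, VHSE8, or a randomly initialized matrix updated by backpropagation (end-to-end learning). A minimal NumPy sketch of that lookup step, with a hypothetical residue ordering and a randomly initialized stand-in for a learned embedding (the actual gradient updates are omitted):

```python
import numpy as np

# The 20 standard amino acids in a fixed (hypothetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Map a peptide string to an (L, 20) one-hot matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding

def embed(sequence, encoding_matrix):
    """Look up each residue's row in an arbitrary (20, d) encoding matrix:
    one-hot, BLOSUM62, VHSE8, or a learned embedding all work the same way."""
    indices = [AA_INDEX[aa] for aa in sequence]
    return encoding_matrix[indices]

# End-to-end learning initializes the matrix randomly and then updates it
# by backpropagation; here we only show the initialization and the lookup.
learned = np.random.default_rng(0).normal(size=(20, 8)).astype(np.float32)
peptide = "ACDW"
one_hot = one_hot_encode(peptide)        # shape (4, 20)
embedded = embed(peptide, learned)       # shape (4, 8)
```

Note that the embedding dimension d (20 for one-hot and BLOSUM62, 8 for VHSE8) is exactly the knob the study identifies as the major factor controlling model performance.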
Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.
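The proposed random-vector baseline can be sketched directly: sample one fixed vector per amino acid at the target dimension, then freeze it. Such a scheme carries no biochemical information, only distinguishability, so any performance gap between it and a curated encoding of the same dimension isolates the encoding's information content. A minimal sketch (function name and seed are illustrative, not from the paper):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_encoding(dim, seed=42):
    """Assign each amino acid a fixed random vector of size `dim`.
    Vectors are sampled once and then frozen, so the scheme only
    distinguishes residues; it encodes no biochemical properties."""
    rng = np.random.default_rng(seed)
    return {aa: rng.normal(size=dim) for aa in AMINO_ACIDS}

# Match the dimension of the encoding under test, e.g. 8 for VHSE8.
baseline = random_encoding(dim=8)
```

Fixing the seed keeps the baseline reproducible across the benchmarked architectures, so each model sees the same random vectors.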

Updated: 2020-06-09