Probe Efficient Feature Representation of Gapped K-mer Frequency Vectors from Sequences Using Deep Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics



IEEE/ACM Transactions on Computational Biology and Bioinformatics (IF 3.6), Pub Date: 2018-08-31, DOI: 10.1109/tcbb.2018.2868071
Zhen Cao, Shihua Zhang

Gapped k-mer frequency vectors (gkm-fvs) have been proposed for extracting sequence features. Coupled with a support vector machine (gkm-SVM), gkm-fvs have been used to achieve effective sequence-based predictions. However, the heavy cost of computing a large kernel matrix prevents gkm-SVM from using large amounts of data, and it is unclear how to combine gkm-fvs with other data sources in the context of a string kernel. On the other hand, the high dimensionality, collinearity, and sparsity of gkm-fvs hinder the use of many traditional machine learning methods without a kernel trick. Therefore, we proposed gkm-DNN, a flexible and scalable framework that achieves feature representation from high-dimensional gkm-fvs using deep neural networks (DNNs). We first proposed a more concise version of gkm-fvs, which significantly reduces their dimension. We then implemented, for the first time, an efficient method to calculate the gkm-fv of a given sequence. Finally, we adopted a DNN model with gkm-fvs as inputs to achieve efficient feature representation and perform a prediction task. Taking transcription factor binding site prediction as an illustrative application, we applied gkm-DNN to 467 small and 69 big human ENCODE ChIP-seq datasets to demonstrate its performance and compared it with the state-of-the-art method gkm-SVM.
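To make the notion of a gapped k-mer frequency vector concrete, the following is a minimal sketch of the usual gapped k-mer counting idea: slide a window of length l over the sequence and, for every choice of k informative positions, count the pattern with the remaining positions masked. It is not the concise variant or the efficient algorithm described in the abstract, whose details are not given here. The function name `gapped_kmer_counts`, the `-` gap symbol, and the values l=4, k=2 are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gapped_kmer_counts(seq, l=4, k=2):
    """Count gapped k-mers: each l-mer window, with k positions kept and l-k masked as '-'.

    Hypothetical helper for illustration; a naive sketch, not the paper's method.
    """
    seq = seq.upper()
    counts = Counter()
    position_sets = list(combinations(range(l), k))
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        for kept in position_sets:
            key = "".join(window[i] if i in kept else "-" for i in range(l))
            counts[key] += 1
    return counts


# Full (unreduced) feature space size under this definition: C(l, k) * 4^k.
full_dimension = comb(4, 2) * 4 ** 2

counts = gapped_kmer_counts("ACGTAC", l=4, k=2)
```

Under this definition the vector is sparse: a short sequence touches only a small fraction of the C(l, k) * 4^k possible features, which illustrates the high dimensionality and sparsity the abstract refers to.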


Updated: 2020-04-22
