Predicting Sites of Epitranscriptome Modifications Using Unsupervised Representation Learning Based on Generative Adversarial Networks
Frontiers in Physics (IF 3.1), Pub Date: 2020-05-01, DOI: 10.3389/fphy.2020.00196
Sirajul Salekin, Milad Mostavi, Yu-Chiao Chiu, Yidong Chen, Jianqiu Zhang, Yufei Huang

Epitranscriptomics is an exciting area that studies different types of modifications in transcripts, and predicting such modification sites from the transcript sequence is of significant interest. However, the scarcity of positive sites for most modifications poses critical challenges for training robust algorithms. To circumvent this problem, we propose MR-GAN, a generative adversarial network (GAN)-based model that is trained in an unsupervised fashion on entire pre-mRNA sequences to learn a low-dimensional embedding of transcriptomic sequences. MR-GAN was then applied to extract embeddings of the sequences in a training dataset we created for nine epitranscriptome modifications, namely m6A, m1A, m1G, m2G, m5C, m5U, 2′-O-Me, pseudouridine (Ψ), and dihydrouridine (D), for which positive samples are very limited. Prediction models were trained on the embeddings extracted by MR-GAN. We compared their prediction performance with that of models using one-hot encoding of the training sequences and with SRAMP, a state-of-the-art m6A site prediction algorithm, and demonstrated that the learned embeddings outperform one-hot encoding by a significant margin, with up to 15% improvement. Using MR-GAN, we also investigated the sequence motifs for each modification type and uncovered known motifs as well as new motifs that could not be found from the sequences directly. The results demonstrate that transcriptome features extracted using unsupervised learning can lead to high precision in predicting multiple types of epitranscriptome modifications, even when the data size is small and extremely imbalanced.
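
For readers unfamiliar with the one-hot baseline mentioned in the abstract, the following is a minimal sketch of how an RNA sequence can be one-hot encoded. It is not the authors' code: the function name `one_hot_encode`, the alphabet order `ACGU`, the treatment of `T` as `U`, and the all-zero row for unknown symbols are assumptions made for illustration only.

```python
# Illustrative sketch only; alphabet order and unknown-symbol handling are assumptions.
NUCLEOTIDES = "ACGU"


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide of the input sequence.

    Each row has a single 1 at the position of the nucleotide in NUCLEOTIDES.
    Symbols outside the alphabet (for example "N") become an all-zero row.
    """
    rows = []
    # Normalize case and treat DNA-style "T" as RNA "U" (an assumption).
    for base in sequence.upper().replace("T", "U"):
        row = [0] * len(NUCLEOTIDES)
        index = NUCLEOTIDES.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


# Encode an arbitrary short sequence; the result has shape (5, 4).
encoded = one_hot_encode("GGACU")
for base, row in zip("GGACU", encoded):
    print(base, row)
```

A representation like this grows with the sequence window and contains no information beyond the nucleotides themselves, whereas the abstract describes MR-GAN as learning a low-dimensional embedding from unlabeled pre-mRNA sequences.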




Updated: 2020-06-19