Pretraining model for biological sequence data
Briefings in Functional Genomics (IF 4), Pub Date: 2021-04-23, DOI: 10.1093/bfgp/elab025
Bosheng Song, Zimeng Li, Xuan Lin, Jianmin Wang, Tian Wang, Xiangzheng Fu

With the development of high-throughput sequencing technology, biological sequence data reflecting life information have become increasingly accessible. Particularly against the background of the COVID-19 pandemic, biological sequence data play an important role in detecting diseases, analyzing their mechanisms and discovering specific drugs. In recent years, pretraining models that emerged in natural language processing have attracted widespread attention in many research fields, because they can both decrease training cost and improve performance on downstream tasks. Pretraining models are used to embed biological sequences and to extract features from large biological sequence corpora, supporting a comprehensive understanding of biological sequence data. In this survey, we provide a broad review of pretraining models for biological sequence data. We first introduce biological sequences and the corresponding datasets, including brief descriptions and accessible links. Subsequently, we systematically summarize popular pretraining models for biological sequences in four categories: CNN, word2vec, LSTM and Transformer. Then, we present some applications of the proposed pretraining models to downstream tasks to explain the role of pretraining models. Next, we provide a novel pretraining scheme for protein sequences and a multitask benchmark for protein pretraining models. Finally, we discuss the challenges and future directions of pretraining models for biological sequences.

Updated: 2021-04-23