Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data
Journal of Computational Biology ( IF 1.7 ) Pub Date : 2023-02-02 , DOI: 10.1089/cmb.2022.0424
Prakash Chourasia 1 , Sarwan Ali 1 , Simone Ciccolella 2 , Gianluca Della Vedova 2 , Murray Patterson 1

The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of SARS-CoV-2 samples currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it can be used as input to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
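
The abstract does not spell out how Reads2Vec builds its vector, so the following is only a minimal sketch of the general idea of an alignment-free, fixed-length embedding computed directly from raw reads. It uses a plain k-mer frequency spectrum, a common alignment-free representation, and should not be read as the Reads2Vec method itself. The function names `kmer_index` and `embed_reads`, the default `k=3`, and the toy reads are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every k-mer over ACGT to a fixed position in the output vector."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def embed_reads(reads, k=3):
    """Pool k-mer counts over all reads of one sample and normalize them.

    Hypothetical helper (not from the paper): the result always has
    length 4**k, regardless of how many reads there are or how long
    each read is, and no assembly or alignment is performed.
    """
    index = kmer_index(k)
    counts = [0] * len(index)
    for read in reads:
        read = read.upper()
        # Slide a window of width k over the read; reads shorter than k
        # yield an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            pos = index.get(read[i:i + k])
            if pos is not None:  # skip k-mers with N or other ambiguous bases
                counts[pos] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Toy reads of different lengths, invented for this example.
sample_reads = ["ACGTACGT", "TTGACN", "GGC"]
vector = embed_reads(sample_reads, k=3)
print(len(vector))  # 64, i.e. 4**3
```

Because every sample maps to a vector of the same length, the output can be stacked into a matrix and handed to any standard classifier or clustering routine, which is the property the abstract emphasizes.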

Updated: 2023-02-03