VCFdbR: A method for expressing biobank-scale Variant Call Format data in a SQLite database using R

bioRxiv - Bioinformatics. Pub Date: 2020-06-04, DOI: 10.1101/2020.04.28.066894
Tanner Koomar, Jacob J Michaelson

As exome and whole-genome sequencing cohorts grow in size, the data they produce strain the limits of current tools and data structures. The Variant Call Format (VCF) was originally created as part of the 1000 Genomes Project. Flexible and concise enough to describe the genetic variation of thousands of samples in a single flat file, the VCF has become the standard for communicating the results of large-scale sequencing experiments. Because of their static, text-based structure, however, VCF files remain cumbersome to parse and filter interactively, even with the aid of indexing. Building on previous concepts, we propose a pipeline for converting VCFs to simple SQLite databases, which allow rapid searching and filtering of genetic variants while minimizing memory overhead. Code can be found at https://github.com/tkoomar/VCFdbR.
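
To make the idea of a VCF-to-SQLite conversion concrete, here is a minimal sketch in Python using only the standard library. It is not the authors' pipeline, which is written in R and lives in the repository linked above. The two-table layout (`variants`, `genotypes`), the column names, the function name `vcf_to_sqlite`, and the three-variant sample data are all assumptions made for illustration. The sketch streams the file line by line, so only one record is held in memory at a time, and adds an index on chromosome and position so that region queries do not need a full scan.

```python
import sqlite3

# Tiny illustrative VCF (hypothetical data, not taken from the paper).
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
chr1\t100\trs1\tA\tG\t50\tPASS\tAF=0.5\tGT\t0/1\t1/1
chr1\t200\trs2\tC\tT\t10\tq10\tAF=0.01\tGT\t0/0\t0/1
chr2\t300\t.\tG\tA\t99\tPASS\tAF=0.25\tGT\t0/1\t0/0
"""


def vcf_to_sqlite(vcf_text, conn):
    """Load VCF text into two tables: one row per variant, one row per genotype call."""
    # Assumed schema for this sketch; the real VCFdbR schema may differ.
    conn.execute(
        "CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, chrom TEXT, "
        "pos INTEGER, rsid TEXT, ref TEXT, alt TEXT, qual REAL, "
        "filter_status TEXT, info TEXT)"
    )
    conn.execute(
        "CREATE TABLE genotypes (variant_id INTEGER REFERENCES variants(variant_id), "
        "sample TEXT, gt TEXT)"
    )
    samples = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, rsid, ref, alt, qual, flt, info = fields[:8]
        cur = conn.execute(
            "INSERT INTO variants (chrom, pos, rsid, ref, alt, qual, filter_status, info) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (chrom, int(pos), rsid, ref, alt,
             None if qual == "." else float(qual), flt, info),
        )
        variant_id = cur.lastrowid
        gt_idx = fields[8].split(":").index("GT")  # locate GT within FORMAT
        conn.executemany(
            "INSERT INTO genotypes VALUES (?, ?, ?)",
            [(variant_id, sample, call.split(":")[gt_idx])
             for sample, call in zip(samples, fields[9:])],
        )
    # Index so that region lookups avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_variants_region ON variants (chrom, pos)")
    conn.commit()


conn = sqlite3.connect(":memory:")
vcf_to_sqlite(VCF_TEXT, conn)

# Region query: passing variants on chr1 between positions 1 and 1000.
rows = conn.execute(
    "SELECT chrom, pos, ref, alt FROM variants "
    "WHERE chrom = ? AND pos BETWEEN ? AND ? AND filter_status = 'PASS'",
    ("chr1", 1, 1000),
).fetchall()

# Carrier query: samples with a non-reference genotype at rs2.
carriers = conn.execute(
    "SELECT sample FROM genotypes JOIN variants USING (variant_id) "
    "WHERE rsid = ? AND gt != '0/0'",
    ("rs2",),
).fetchall()
```

Storing genotypes in a long table keyed by `variant_id` is one possible design; it keeps the schema independent of the number of samples at the cost of many rows for large cohorts.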


Updated: 2020-06-04
