Fast lightweight accurate xenograft sorting
Algorithms for Molecular Biology (IF 1.5), Pub Date: 2021-04-02, DOI: 10.1186/s13015-021-00181-w
Jens Zentgraf, Sven Rahmann

With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor from reads originating from the host species' (mouse) surrounding tissue. Two kinds of methods are in use. On the one hand, alignment-based tools require that reads are first mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results. We show that alignment-free methods for xenograft sorting are superior in CPU time usage and equivalent in accuracy. We improve upon the state of the art in xenograft sorting by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy. Several engineering steps (e.g., shortcuts for unsuccessful lookups, software prefetching) improve the performance even further. Our software xengsort is available under the MIT license at http://gitlab.com/genomeinformatics/xengsort. It is written in numba-compiled Python and comes with sample Snakemake workflows for hash table construction and dataset processing.
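
To make the phrase "three-way bucketed quotiented Cuckoo hashing" more concrete, the following is a minimal toy sketch in plain Python. It is not xengsort's implementation: the class name `QuotientedCuckooTable`, the helper `encode_kmer`, the affine hash parameters, the bucket size, and the use of short k-mers as keys with "host"/"graft" labels as values are all assumptions made for illustration. The idea shown is that each key has three candidate buckets, each bucket holds several slots, and a slot stores only the quotient of a bijective hash value (plus which of the three hash functions was used) rather than the full key.

```python
import random

def encode_kmer(kmer):
    """Encode a DNA k-mer as a base-4 integer (A=0, C=1, G=2, T=3)."""
    code = 0
    for base in kmer:
        code = 4 * code + "ACGT".index(base)
    return code

class QuotientedCuckooTable:
    """Toy 3-way bucketed Cuckoo hash table for integer keys in [0, universe).

    Assumption: universe is a power of two (e.g. 4**k), so every odd
    multiplier a gives a bijection x -> (a*x + b) mod universe.
    """

    def __init__(self, universe, nbuckets, bucket_size=4, seed=0):
        self.universe = universe
        self.nbuckets = nbuckets
        self.bucket_size = bucket_size
        self.rng = random.Random(seed)
        self.params = [(3, 1), (5, 7), (11, 13)]  # hypothetical (a, b) pairs
        # Each slot holds (hash choice, quotient, value); the key is implicit.
        self.buckets = [[] for _ in range(nbuckets)]

    def _locate(self, key, choice):
        a, b = self.params[choice]
        h = (a * key + b) % self.universe
        return h % self.nbuckets, h // self.nbuckets  # (bucket, quotient)

    def _restore(self, bucket, choice, quotient):
        """Recover the key from bucket index, hash choice, and quotient."""
        a, b = self.params[choice]
        h = quotient * self.nbuckets + bucket
        return ((h - b) * pow(a, -1, self.universe)) % self.universe

    def get(self, key, default=None):
        for choice in range(3):
            bucket, quotient = self._locate(key, choice)
            for c, q, value in self.buckets[bucket]:
                if c == choice and q == quotient:
                    return value
        return default

    def insert(self, key, value, max_steps=500):
        # Overwrite the value if the key is already stored.
        for choice in range(3):
            bucket, quotient = self._locate(key, choice)
            slots = self.buckets[bucket]
            for i, (c, q, _) in enumerate(slots):
                if c == choice and q == quotient:
                    slots[i] = (c, q, value)
                    return
        for _ in range(max_steps):
            # Use the first candidate bucket that still has a free slot.
            for choice in range(3):
                bucket, quotient = self._locate(key, choice)
                if len(self.buckets[bucket]) < self.bucket_size:
                    self.buckets[bucket].append((choice, quotient, value))
                    return
            # All three buckets are full: evict a random slot (random walk)
            # and continue by re-inserting the evicted key.
            choice = self.rng.randrange(3)
            bucket, quotient = self._locate(key, choice)
            slot = self.rng.randrange(self.bucket_size)
            c, q, v = self.buckets[bucket][slot]
            self.buckets[bucket][slot] = (choice, quotient, value)
            key, value = self._restore(bucket, c, q), v
        # Toy behaviour: the key held at this point is dropped.
        raise RuntimeError("insertion failed; table is too full")

table = QuotientedCuckooTable(universe=4**5, nbuckets=64)
table.insert(encode_kmer("ACGTA"), "graft")
table.insert(encode_kmer("TTGCA"), "host")
print(table.get(encode_kmer("ACGTA")))  # graft
print(table.get(encode_kmer("GGGGG")))  # None
```

Because the hash functions in this sketch are bijections, the bucket index together with the stored quotient and hash choice identifies the key exactly, so lookups have no false positives even though the key itself is never stored. A production table would pack slots into bit arrays and use better-mixing hash functions; those details are outside this sketch.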

Updated: 2021-04-02