Accurately estimating the length distributions of genomic micro-satellites by tumor purity deconvolution.
BMC Bioinformatics (IF 3). Pub Date: 2020-03-11. DOI: 10.1186/s12859-020-3349-5
Yixuan Wang 1, 2 , Xuanping Zhang 1, 2 , Xiao Xiao 3 , Fei-Ran Zhang 4 , Xinxing Yan 1, 2 , Xuan Feng 1, 2 , Zhongmeng Zhao 1, 2 , Yanfang Guan 1, 2, 5 , Jiayin Wang 1, 2

Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. Estimating the length distribution and state of a micro-satellite region is an important computational step in cancer sequencing data pipelines, and it is suggested to facilitate downstream analysis and clinical decision support. Although several state-of-the-art approaches have been proposed to identify micro-satellite instability (MSI) events, they are limited in handling regions longer than one read length. Moreover, to the best of our knowledge, all of these approaches implicitly assume that the tumor purity of the sequenced samples is sufficiently high. This assumption is inconsistent with reality; as a result, the signal in the inferred length distribution is diluted and false positive errors are introduced. In this article, we propose a computational approach, named ELMSI, which detects MSI events from next-generation sequencing data. ELMSI estimates the specific length distributions and states of micro-satellite regions from a mixed tumor sample paired with a control sample. It first estimates the purity of the tumor sample from the read counts at the filtered SNV loci. Then the algorithm identifies the length distributions and states of short micro-satellites by adding a Maximum Likelihood Estimation (MLE) step to the existing algorithm. After that, ELMSI infers the length distributions of long micro-satellites by combining a simplified Expectation Maximization (EM) algorithm with the central limit theorem, and then uses statistical tests to output the states of these micro-satellites. In our experiments, ELMSI was able to handle micro-satellites with lengths ranging from shorter than one read length up to 10 kbp. To verify the reliability of the algorithm, we first compared its ability to classify shorter micro-satellites from mixed samples with that of the existing algorithm MSIsensor. We then varied the number of micro-satellite regions, the read length, and the sequencing coverage separately to test the performance of ELMSI in estimating longer micro-satellites from mixed samples. ELMSI performed well on mixed samples and is therefore of value for improving the recognition of micro-satellite regions and for clinical decision support. The source code is maintained at https://github.com/YixuanWang1120/ELMSI for academic use only.
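
The purity-estimation and deconvolution ideas in the abstract can be illustrated with a minimal sketch. This is not the ELMSI implementation. It assumes that the filtered SNVs are clonal, heterozygous, and copy-neutral (so the expected variant allele fraction is purity / 2), and that the length distribution observed in the mixed sample is a purity-weighted linear mixture of the tumor and control distributions. The function names estimate_purity and deconvolve_lengths are hypothetical and are introduced here only for illustration.

```python
from typing import Dict, List, Tuple


def estimate_purity(snv_counts: List[Tuple[int, int]]) -> float:
    """Estimate tumor purity from (alt_reads, total_reads) pairs at SNV loci.

    Assumption (not from the paper): every SNV is clonal, heterozygous and
    copy-neutral, so the expected variant allele fraction is purity / 2.
    Pooling the counts gives the binomial maximum likelihood estimate of
    that shared fraction.
    """
    alt = sum(a for a, _ in snv_counts)
    total = sum(t for _, t in snv_counts)
    if total == 0:
        raise ValueError("no reads at the supplied SNV loci")
    return min(1.0, 2.0 * alt / total)


def normalize(hist: Dict[int, float]) -> Dict[int, float]:
    """Scale a histogram of repeat lengths so that its values sum to 1."""
    s = sum(hist.values())
    if s <= 0:
        raise ValueError("histogram has no mass")
    return {length: value / s for length, value in hist.items()}


def deconvolve_lengths(
    mixed: Dict[int, float], normal: Dict[int, float], purity: float
) -> Dict[int, float]:
    """Recover a tumor-only length distribution from a mixed sample.

    Assumed mixture model: mixed = purity * tumor + (1 - purity) * normal.
    Negative values caused by sampling noise are clipped to zero before
    the result is renormalized.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    mixed_p = normalize(mixed)
    normal_p = normalize(normal)
    lengths = sorted(set(mixed_p) | set(normal_p))
    raw = {
        length: max(
            0.0,
            (mixed_p.get(length, 0.0)
             - (1.0 - purity) * normal_p.get(length, 0.0)) / purity,
        )
        for length in lengths
    }
    return normalize(raw)


if __name__ == "__main__":
    # Toy read counts, invented for illustration only.
    purity = estimate_purity([(30, 100), (20, 100)])
    tumor = deconvolve_lengths({10: 75, 12: 25}, {10: 100}, purity)
    print(purity, tumor)
```

In this toy input, half of the tumor-sample signal comes from normal cells, so a quarter of reads at length 12 in the mixed sample corresponds to half of the tumor-only distribution. Ignoring purity would understate the shifted allele by the same factor, which is the dilution effect the abstract describes.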

Updated: 2020-03-16