TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution
Methods in Ecology and Evolution (IF 6.3), Pub Date: 2021-08-02, DOI: 10.1111/2041-210x.13696
Chao Zhang 1, Yiming Zhao 2, Edward L Braun 3, Siavash Mirarab 2

  1. Erroneous data can creep into sequence datasets for reasons ranging from contamination to annotation and alignment mistakes and reduce the accuracy of downstream analyses. As datasets keep getting larger, it has become difficult to check multiple sequence alignments visually for errors, and thus, automatic error detection methods are needed more than ever before. Alignment masking methods, which are widely used, remove entire aligned sites and may reduce signal as much as or more than they reduce the noise.
  2. The alternative we propose here is a surprisingly under-explored approach: looking for errors in small species-specific stretches of the multiple sequence alignments. We introduce a method called TAPER that uses a novel two-dimensional outlier detection algorithm. Importantly, TAPER adjusts its null expectations per site and species, and in doing so, it attempts to distinguish the real heterogeneity (signal) from errors (noise).
  3. Our results show that TAPER removes very little data yet finds much of the error. The effectiveness of TAPER depends on several properties of the alignment (e.g. evolutionary divergence levels) and the errors (e.g. their length).
  4. By enabling data cleanup with minimal loss of signal, TAPER can improve downstream analyses such as phylogenetic reconstruction and selection detection. Data errors, small or large, can reduce confidence in the downstream results, and thus, eliminating them can be beneficial even when downstream analyses are not impacted.
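
To make the idea of searching for errors in short, species-specific stretches more concrete, here is a toy Python sketch. It is not TAPER's algorithm, which the abstract does not specify. The scoring rule, the additive per-site and per-species baseline, and the names `divergence_scores`, `flag_stretches`, `threshold` and `min_len` are all assumptions made for illustration. Each cell is scored by how rare its character is in its column, the site's and the species' average scores are subtracted as a crude stand-in for adjusting expectations per site and species, and runs of consecutive high-scoring cells within one sequence are reported.

```python
from collections import Counter


def divergence_scores(alignment):
    """Score each (species, site) cell as 1 - frequency of its character in its column.

    Gap characters ("-") are ignored when counting and always score 0.
    """
    n_sites = len(alignment[0])
    scores = [[0.0] * n_sites for _ in alignment]
    for j in range(n_sites):
        column = [row[j] for row in alignment]
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        for i, c in enumerate(column):
            if c != "-":
                scores[i][j] = 1.0 - counts[c] / total
    return scores


def flag_stretches(alignment, threshold=0.25, min_len=3):
    """Return {row_index: [(start, end), ...]} with half-open suspicious runs.

    Hypothetical baseline: a cell is suspicious when its score exceeds the
    sum of its site's mean score and its species' mean score by `threshold`.
    """
    scores = divergence_scores(alignment)
    n_rows, n_sites = len(scores), len(scores[0])
    site_mean = [sum(scores[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_sites)]
    row_mean = [sum(row) / n_sites for row in scores]
    flagged = {}
    for i in range(n_rows):
        runs, start = [], None
        # Iterate one step past the end so a run touching the last site is closed.
        for j in range(n_sites + 1):
            hit = (j < n_sites
                   and scores[i][j] - site_mean[j] - row_mean[i] > threshold)
            if hit and start is None:
                start = j
            elif not hit and start is not None:
                if j - start >= min_len:
                    runs.append((start, j))
                start = None
        if runs:
            flagged[i] = runs
    return flagged
```

Because only the flagged cells of a single sequence would be masked, the rest of each column stays available to downstream analyses, which is the contrast with whole-site masking drawn in point 1. A real method would need a far more careful null model than the additive baseline assumed here.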


Updated: 2021-08-02