Detecting and removing sample contamination in phylogenomic data: an example and its implications for Cicadidae phylogeny (Insecta: Hemiptera)
Systematic Biology (IF 6.1), Pub Date: 2022-06-16, DOI: 10.1093/sysbio/syac043
Christopher L Owen 1 , David C Marshall 2 , Elizabeth J Wade 3 , Russ Meister 2 , Geert Goemans 2 , Krushnamegh Kunte 4 , Max Moulds 5 , Kathy Hill 2 , M Villet 6 , Thai-Hong Pham 7, 8 , Michelle Kortyna 9 , Emily Moriarty Lemmon 10 , Alan R Lemmon 11 , Chris Simon 2

Contamination of a genetic sample with DNA from one or more non-target species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and Next-Generation Sequencing (NGS) studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on detection of bimodal distributions of patristic distances across gene trees. When the contamination occurs between samples within a dataset, comparisons between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a dataset generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the AHE markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned dataset, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted.
One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini are reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution.
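
The screening idea described in the abstract (for a given pair of samples, look for a bimodal distribution of per-locus patristic distances with one peak near zero) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `flag_near_zero_loci`, the largest-gap split used as a stand-in for a formal bimodality test, and the thresholds `zero_cutoff` and `min_ratio` are all assumptions chosen for illustration. The input is assumed to be a mapping from locus name to the patristic distance between the two samples in that locus's gene tree.

```python
from statistics import mean


def flag_near_zero_loci(distances, zero_cutoff=0.01, min_ratio=5.0):
    """Return loci whose patristic distance falls in a near-zero cluster.

    distances   : dict mapping locus name -> patristic distance between
                  one pair of samples in that locus's gene tree.
    zero_cutoff : hypothetical bound on the mean of the lower cluster.
    min_ratio   : hypothetical factor by which the upper cluster mean
                  must exceed zero_cutoff for the split to count.

    An empty list means no near-zero cluster was detected for this pair.
    """
    items = sorted(distances.items(), key=lambda kv: kv[1])
    if len(items) < 3:
        return []  # too few loci to say anything about the distribution

    values = [d for _, d in items]

    # Split the sorted distances at the largest gap between neighbours.
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    split = max(range(len(gaps)), key=gaps.__getitem__) + 1
    low, high = values[:split], values[split:]

    # The lower cluster must sit near zero, and the upper cluster must be
    # clearly separated from it; otherwise treat the pair as unimodal.
    if mean(low) > zero_cutoff:
        return []
    if mean(high) < min_ratio * zero_cutoff:
        return []

    return [locus for locus, _ in items[:split]]


# Toy distances (made up for illustration): two loci are nearly identical
# between the samples while the rest show ordinary divergence.
pair_distances = {
    "L1": 0.21, "L2": 0.19, "L3": 0.0,
    "L4": 0.23, "L5": 0.001, "L6": 0.20,
}
suspect_loci = flag_near_zero_loci(pair_distances)
```

In this toy case the two near-zero loci are flagged as candidates for exclusion, while a pair whose distances all cluster around one value returns an empty list. A real analysis would need a principled bimodality criterion and thresholds tuned to the dataset.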

Updated: 2022-06-16