Towards Large-scale Integrative Taxonomy (LIT): resolving the data conundrum for dark taxa, Systematic Biology


Systematic Biology (IF 6.1), Pub Date: 2022-05-10, DOI: 10.1093/sysbio/syac033
Emily Hartop (1, 2, 3), Amrita Srivathsan (3, 4), Fredrik Ronquist (5), Rudolf Meier (3, 4)


New, rapid, accurate, scalable, and cost-effective species discovery and delimitation methods are needed for tackling “dark taxa”, here defined as groups for which <10% of all species are described and the estimated diversity exceeds 1000 species. Species delimitation for these taxa should be based on multiple data sources (“integrative taxonomy”) but collecting multiple types of data risks impeding a discovery process that is already too slow. We here develop Large-scale Integrative Taxonomy (LIT), an explicit method where preliminary species hypotheses are generated based on inexpensive data that can be obtained quickly and cost-effectively. These hypotheses are then evaluated based on a more expensive type of “validation data” that are only obtained for specimens selected based on objective criteria applied to the preliminary species hypotheses. We here use this approach to sort 18 000 scuttle flies (Diptera: Phoridae) into 315 preliminary species hypotheses based on NGS barcode (313 bp) clusters (using Objective Clustering (OC) with a 3% threshold). These clusters are then evaluated with morphology as the validation data. We develop quantitative indicators for predicting which barcode clusters are likely to be incongruent with morphospecies by randomly selecting 100 clusters for in-depth validation with morphology. A linear model demonstrates that the best predictors for incongruence between barcode clusters and morphology are maximum p-distance within the cluster and a newly proposed index that measures cluster stability across different clustering thresholds. A test of these indicators using the 215 remaining clusters reveals that these predictors correctly identify all clusters that are incongruent with morphology. In our study all morphospecies are true or disjoint subsets of the initial barcode clusters so that all incongruence can be eliminated by varying clustering thresholds. 
This leads to a discussion of when a third data source is needed to resolve incongruent grouping statements. The morphological validation step in our study involved 1039 specimens (5.8% of the total). The formal LIT protocol we propose would only have required the study of 915 (5.1%: 2.5 specimens per species), as we show that clusters without signatures of incongruence can be validated by only studying two specimens representing the most divergent haplotypes. To test the generality of our results across different barcode clustering techniques, we establish that the levels of incongruence are similar across Objective Clustering (OC), Automatic Barcode Gap Discovery (ABGD), Poisson Tree Processes (PTP) and Refined Single Linkage (RESL) (used by Barcode of Life Data System (BOLD) to assign Barcode Index Numbers (BINs)). OC and ABGD achieved a maximum congruence score with morphology of 89% while PTP was slightly less effective (84%). RESL could only be tested for a subset of the specimens because the algorithm is not public. BINs based on 277 of the original 1 714 haplotypes were 86% congruent with morphology while the values were 89% for OC, 74% for PTP, and 72% for ABGD.
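
The barcode-based pre-screening described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Objective Clustering behaves like single-linkage clustering on uncorrected p-distances, with two sequences joined when their distance is at or below the threshold, and it uses made-up 100 bp toy sequences rather than real 313 bp barcodes. The function names (`p_distance`, `objective_clusters`, `most_divergent_pair`, `mutate`) are hypothetical. The cluster stability index is not defined in the abstract, so it is not sketched here.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance: fraction of differing sites in two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def objective_clusters(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Assumed single-linkage clustering: sequences share a cluster when a chain
    of pairwise p-distances <= threshold connects them (union-find)."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def most_divergent_pair(cluster: set[str], seqs: dict[str, str]) -> tuple[float, str, str]:
    """Return (maximum p-distance, haplotype A, haplotype B) within one cluster."""
    members = sorted(cluster)
    if len(members) == 1:
        return 0.0, members[0], members[0]
    return max((p_distance(seqs[a], seqs[b]), a, b) for a, b in combinations(members, 2))


def mutate(seq: str, positions) -> str:
    """Toy helper (hypothetical): substitute the base at each given position."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


# Toy haplotypes: h1, h2 and h3 form a chain of 2% steps; h4 is 10% away from h1.
base = "ACGT" * 25
seqs = {
    "h1": base,
    "h2": mutate(base, [0, 1]),
    "h3": mutate(base, [0, 1, 2, 3]),
    "h4": mutate(base, range(10, 20)),
}

clusters = objective_clusters(seqs, threshold=0.03)
for cluster in clusters:
    max_dist, hap_a, hap_b = most_divergent_pair(cluster, seqs)
    print(sorted(cluster), f"max p-distance = {max_dist:.2f}", "check:", hap_a, hap_b)
```

In this toy data the cluster containing h1, h2 and h3 holds together at a 3% threshold even though h1 and h3 differ by 4%, because h2 links them. A large maximum p-distance inside a cluster is one of the two signals the abstract names as a predictor of incongruence with morphology, and the most divergent pair is the pair of specimens the proposed protocol would examine first.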


Updated: 2022-05-10
