Repositories for Taxonomic Data: Where We Are and What is Missing
Systematic Biology (IF 6.5), Pub Date: 2020-04-16, DOI: 10.1093/sysbio/syaa026
Aurélien Miralles 1, 2 , Teddy Bruy 1, 2 , Katherine Wolcott 2, 3 , Mark D Scherz 4, 5 , Dominik Begerow 6 , Bank Beszteri 7 , Michael Bonkowski 8 , Janine Felden 9, 10 , Birgit Gemeinholzer 11 , Frank Glaw 4 , Frank Oliver Glöckner 10 , Oliver Hawlitschek 4, 12 , Ivaylo Kostadinov 13 , Tim W Nattkemper 14 , Christian Printzen 15 , Jasmin Renz 16 , Nataliya Rybalka 17 , Marc Stadler 18 , Tanja Weibulat 13 , Thomas Wilke 19 , Susanne S Renner 2 , Miguel Vences 20

Abstract Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term—ideally perpetual—data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. 
For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach, that is, linking data via unique specimen identifiers and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated ≤2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
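
The cyberspecimen idea described in the abstract, linking data from different repositories via a unique specimen identifier, might be sketched roughly as follows. This is only an illustration under stated assumptions: the `DataRecord` type, the `build_cyberspecimens` helper, the identifier strings, and the repository locators are all hypothetical and are not taken from the paper or from any existing repository API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRecord:
    """One data item tied to a physical specimen (hypothetical structure)."""
    specimen_id: str  # assumed unique specimen identifier
    kind: str         # e.g. "image", "dna_barcode"
    location: str     # assumed locator inside some repository


def build_cyberspecimens(records):
    """Group records from separate repositories by specimen identifier."""
    specimens = defaultdict(list)
    for record in records:
        specimens[record.specimen_id].append(record)
    return dict(specimens)


# Made-up example records; identifiers and paths are placeholders.
records = [
    DataRecord("EXAMPLE:Herp:0001", "image", "image-repo/0001-dorsal.tif"),
    DataRecord("EXAMPLE:Herp:0001", "dna_barcode", "sequence-repo/seq-42"),
    DataRecord("EXAMPLE:Herp:0002", "image", "image-repo/0002-lateral.tif"),
]

cyberspecimens = build_cyberspecimens(records)

# All data kinds now reachable from a single specimen identifier.
kinds_for_0001 = sorted(r.kind for r in cyberspecimens["EXAMPLE:Herp:0001"])
print(kinds_for_0001)  # ['dna_barcode', 'image']
```

The point of keying everything on the specimen identifier, rather than on the taxon name, is that a later taxonomic reassignment would leave the links between images, sequences, and other data intact.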

Updated: 2020-04-16