Benchmarks for measurement of duplicate detection methods in nucleotide databases.
Database: The Journal of Biological Databases and Curation (IF 3.4), Pub Date: 2017-01-08, DOI: 10.1093/database/baw164
Qingyu Chen, Justin Zobel, Karin Verspoor

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. 
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

Database URL: https://bitbucket.org/biodbqual/benchmarks
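
As a rough illustration of why results on a single benchmark can mislead, the sketch below scores one set of predicted duplicate pairs against several benchmarks separately. It is not the authors' evaluation code: the function names, the benchmark labels and the record identifiers are all hypothetical, and each benchmark is simply assumed to be available as a collection of record-identifier pairs.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def as_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Normalise (a, b) tuples so that order within a pair does not matter."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Precision and recall of predicted duplicate pairs against one benchmark."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def evaluate(
    predicted_pairs: Iterable[Tuple[str, str]],
    benchmarks: Dict[str, Iterable[Tuple[str, str]]],
) -> Dict[str, Tuple[float, float]]:
    """Score the same predictions against every benchmark independently."""
    predicted = as_pairs(predicted_pairs)
    return {
        name: precision_recall(predicted, as_pairs(gold))
        for name, gold in benchmarks.items()
    }


if __name__ == "__main__":
    # Hypothetical record identifiers and benchmark names, for illustration only.
    predicted = [("A1", "A2"), ("A3", "A2"), ("A4", "A5")]
    benchmarks = {
        "bench_x": [("A2", "A1"), ("A2", "A3"), ("A6", "A7")],
        "bench_y": [("A4", "A5")],
    }
    for name, (p, r) in evaluate(predicted, benchmarks).items():
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Because the abstract notes that the benchmarks probably capture only a fraction of all duplicates present, a predicted pair that is missing from a benchmark is not necessarily wrong, so precision computed this way is best read as a pessimistic estimate.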

Updated: 2020-04-17