Caveat Usor: Assessing Differences between Major Chemistry Databases.
ChemMedChem (IF 3.6), Pub Date: 2018-02-23, DOI: 10.1002/cmdc.201700724
Christopher Southan

The three databases of PubChem, ChemSpider, and UniChem capture the majority of open chemical structure records with February 2018 totals of 95, 63, and 154 million, respectively. Collectively, they constitute a massively enabling resource for cheminformatics, chemical biology, and drug discovery. As meta-portals, they subsume and link out to the major proportion of public bioactivity data extracted from the literature and screening center assay results. Therefore, they not only present three different entry points, but the many subsumed independent resources present a fourth entry point in the form of standalone databases. Because this creates a complex picture it is important for users to have at least some appreciation of differential content to enable utility judgments for the tasks at hand. This turns out to be challenging. By comparing the three resources in detail, this review assesses their differences, some of which are not obvious. This includes the fact that coverage is significantly different between the 587, 282, and 38 contributing sources, respectively. This not only presents the "who-has-what" question, but also the reason "why" any particular inclusion is considered valuable is rarely made explicit. Also confusing is that sources nominally in common (i.e., having the same submitter name) can have significantly different structure counts, not only in each of the three but also from their standalone instantiations. Assessing a series of examples indicates that differences in loading dates and structural standardization are the main causes of this inter-portal discordance.

Updated: 2018-02-23