canSAR chemistry registration and standardization pipeline
Journal of Cheminformatics (IF 7.1), Pub Date: 2022-05-28, DOI: 10.1186/s13321-022-00606-7
Daniela Dolciami ₁,₂,₃, Eloy Villasclaras-Fernandez ₁, Christos Kannas ₄, Mirco Meniconi ₂,₅, Bissan Al-Lazikani ₆, Albert A Antolin ₁,₂

Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it brings a wealth of knowledge about compounds together in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates, etc.), which highlights the need to organise related compounds in hierarchies that alert the user to bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.

We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compound hierarchy: 1. Structure checker, 2. Standardization, 3. Generation of canonical tautomers and representative structures, 4. Salt strip, and 5. Generation of the abstract structure used to build the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, we carry out compound canonicalization before deriving the parent structure, similar to PubChem’s OpenEye pipeline. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds, which demonstrates the importance of this step.

We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline .
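To make the idea of a compound hierarchy more concrete, the following is a minimal Python sketch of how registered compounds might be grouped once the five steps have produced their outputs. It is not the authors' KNIME workflow and performs no chemistry. The record type `RegisteredCompound`, the functions `build_hierarchy` and `related_forms_with_data`, the identifiers (`C1`, `REP-a`, `PAR-a`, `ABS-1`) and the activity counts are all hypothetical placeholders introduced for illustration. Only the step order in `PIPELINE_STEPS` is taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Step order as listed in the abstract: canonicalization (step 3)
# comes before salt stripping (step 4). The names are illustrative labels.
PIPELINE_STEPS = (
    "structure_checker",
    "standardization",
    "canonical_tautomer_and_representative",
    "salt_strip",
    "abstract_structure",
)


@dataclass(frozen=True)
class RegisteredCompound:
    """Hypothetical pipeline output for one input structure."""
    compound_id: str      # hypothetical registry identifier
    representative: str   # placeholder key for the step 3 output
    parent: str           # placeholder key for the step 4 output
    abstract: str         # placeholder key for the step 5 output
    n_activities: int     # illustrative count of bioactivity records


def build_hierarchy(records):
    """Group records as abstract structure -> parent -> list of records."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for rec in records:
        hierarchy[rec.abstract][rec.parent].append(rec)
    return hierarchy


def related_forms_with_data(records, compound_id):
    """IDs of other forms sharing the abstract structure that carry data."""
    by_id = {rec.compound_id: rec for rec in records}
    query = by_id[compound_id]
    return sorted(
        rec.compound_id
        for rec in records
        if rec.abstract == query.abstract
        and rec.compound_id != compound_id
        and rec.n_activities > 0
    )


# Made-up records: C1 and C2 share a parent, C3 shares only the abstract
# structure with them, and C4 is unrelated.
records = [
    RegisteredCompound("C1", "REP-a", "PAR-a", "ABS-1", 0),
    RegisteredCompound("C2", "REP-b", "PAR-a", "ABS-1", 12),
    RegisteredCompound("C3", "REP-c", "PAR-b", "ABS-1", 3),
    RegisteredCompound("C4", "REP-d", "PAR-c", "ABS-2", 7),
]
hierarchy = build_hierarchy(records)
```

In this toy data set, a user looking at `C1`, which has no bioactivity records of its own, would be pointed to `C2` and `C3` through `related_forms_with_data(records, "C1")`. This is the kind of alert the abstract describes, under the assumption that each hierarchy level can be reduced to a comparable key.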

Updated: 2022-05-31
