MolData, a molecular benchmark for disease and target based machine learning

Journal of Cheminformatics (IF 7.1), Pub Date: 2022-03-07, DOI: 10.1186/s13321-022-00590-y
Arash Keshavarzi Arshadi, Milad Salem, Arash Firouzbakht, Jiann Shiun Yuan

Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing the ability both to learn abstract features and to discover complex molecular patterns directly from molecular data. Because biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions, which can hinder their use by the machine learning community. In this work, a comprehensive disease- and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases, including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.
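To make the idea of correlating bioassays concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' method: the abstract does not say which correlation measure MolData uses, so the phi coefficient (Pearson correlation for binary labels) is assumed here. The label encoding (1 = active, 0 = inactive, `None` = not screened), the function name `assay_correlation`, and the two example assays are all hypothetical.

```python
from math import sqrt
from typing import Optional, Sequence

# Assumed encoding: 1 = active, 0 = inactive, None = molecule not screened.
Label = Optional[int]


def assay_correlation(a: Sequence[Label], b: Sequence[Label]) -> Optional[float]:
    """Phi coefficient of two binary assays over molecules screened in both.

    Returns None when the assays share no molecules or when either assay
    is constant on the shared molecules (correlation is undefined).
    """
    # Keep only molecules that have a label in both assays.
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return None

    both_active = sum(1 for x, y in pairs if x == 1 and y == 1)
    active_a = sum(x for x, _ in pairs)  # actives in assay a
    active_b = sum(y for _, y in pairs)  # actives in assay b

    denom = sqrt(active_a * (n - active_a) * active_b * (n - active_b))
    if denom == 0:
        return None
    return (n * both_active - active_a * active_b) / denom


if __name__ == "__main__":
    # Hypothetical labels for six molecules in two assays.
    assay_x = [1, 1, 0, 0, None, 1]
    assay_y = [1, 1, 0, 0, 1, None]
    print(assay_correlation(assay_x, assay_y))
```

Restricting the calculation to molecules screened in both assays matters because screening data of this kind is sparse: most molecules are tested in only a small fraction of the bioassays.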

Last updated: 2022-03-07
