Multitask machine learning models for predicting lipophilicity (logP) in the SAMPL7 challenge, Journal of Computer-Aided Molecular Design


Journal of Computer-Aided Molecular Design (IF 3.5), Pub Date: 2021-07-17, DOI: 10.1007/s10822-021-00405-6
Eelke B Lenselink, Pieter F W Stouten


Accurate prediction of lipophilicity (logP) from molecular structure is a well-established field, and logP predictions are often used to drive drug discovery projects forward. Motivated by the SAMPL7 challenge, in this manuscript we describe the steps taken to construct a novel machine learning model that predicts accurately and generalizes well. The model is based on the recently described Directed-Message Passing Neural Networks (D-MPNNs). Two further enhancements were made: the inclusion of additional datasets from ChEMBL (RMSE improvement of 0.03) and the addition of helper tasks (RMSE improvement of 0.04). To the best of our knowledge, the concept of adding predictions from other models (Simulations Plus logP and logD@pH7.4, respectively) as helper tasks is novel and could be applied in a broader context. The final model that we constructed and used to participate in the challenge ranked 2nd out of 17 ranked submissions, with an RMSE of 0.66 and an MAE of 0.48 (submission: Chemprop). The model also performs well on other datasets, especially when applied retrospectively to the SAMPL6 challenge, where it would have ranked first among all submissions (RMSE of 0.35). Although our model performs well, we conclude with suggestions that are expected to improve it further.
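To make the helper-task idea and the reported metrics concrete, here is a minimal Python sketch. It is not the authors' implementation and does not use the Chemprop API. It assumes a multitask target layout in which each molecule has an experimental logP value (possibly missing) plus helper columns holding predictions from other models. The task names, the toy numbers, and the `masked_mse` function are hypothetical. Masking missing targets out of the loss is a common way to train multitask models on partially labeled data; whether the paper does exactly this is an assumption.

```python
import math
from typing import List, Optional, Sequence

# Hypothetical task layout: one primary task plus two helper tasks whose
# "labels" are predictions produced by other models.
TASKS = ["logP_exp", "helper_logP_pred", "helper_logD74_pred"]

Row = List[Optional[float]]  # None marks a missing target for that task


def masked_mse(preds: Sequence[Sequence[float]], targets: Sequence[Row]) -> float:
    """Mean squared error over all (molecule, task) pairs that have a target."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(preds, targets):
        for p, t in zip(pred_row, target_row):
            if t is None:  # missing label: skipped, so it adds nothing to the loss
                continue
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no labeled targets")
    return total / count


def rmse(preds: Sequence[float], truth: Sequence[float]) -> float:
    """Root-mean-square error on a single task."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth))


def mae(preds: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error on a single task."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)


# Toy data (made up): the second molecule has no experimental logP,
# but still contributes to training through its helper targets.
targets = [[1.0, 1.5, 0.5], [None, 2.0, 1.0]]
preds = [[1.5, 1.5, 1.0], [0.0, 1.0, 1.0]]
loss = masked_mse(preds, targets)
```

In this layout the helper columns only shape training. Challenge-style evaluation with `rmse` and `mae` would be run on the primary `logP_exp` column alone.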


Updated: 2021-07-18
