Masked Graph Modeling for Molecule Generation
ChemRxiv Pub Date : 2021-01-21
Omar Mahmood, Elman Mansimov, Richard Bonneau, Kyunghyun Cho

De novo, in-silico design of molecules is a challenging problem with applications in drug discovery and material design. Here, we introduce a masked graph model, which learns a distribution over graphs by capturing all possible conditional distributions over unobserved nodes and edges given observed ones. We train our masked graph model on existing molecular graphs and then sample novel molecular graphs from it by iteratively masking and replacing different parts of initialized graphs. We evaluate our approach on the QM9 and ChEMBL datasets using the distribution-learning benchmark from the GuacaMol framework. The benchmark contains five metrics: the validity, uniqueness, novelty, KL-divergence and Fréchet ChemNet Distance scores, the last two of which measure the similarity of the generated samples to the training, validation and test distributions. We find that KL-divergence and Fréchet ChemNet Distance scores are anti-correlated with novelty scores. By varying the generation initialization and the fraction of the graph masked and replaced at each generation step, we can increase the Fréchet score at the cost of novelty. In this way, we show that our model offers transparent and tunable control of the trade-off between these metrics, a point of control currently lacking in other approaches to molecular graph generation. We observe that our model outperforms previously proposed graph-based approaches and is competitive with SMILES-based approaches. Finally, we show that our model can generate molecules with desired values of specified properties while maintaining physicochemical similarity to molecules from the training distribution.
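
The iterative mask-and-replace sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the atom and bond vocabularies, the function names, and the graph encoding (a list of node labels plus a dictionary of edge labels) are all assumptions made for illustration, and `placeholder_model` is a hypothetical stand-in that samples uniformly instead of using a trained network conditioned on the observed parts of the graph.

```python
import random

# Hypothetical vocabularies; the abstract does not specify these.
NODE_TYPES = ["C", "N", "O", "F"]
EDGE_TYPES = ["none", "single", "double"]
MASK = "<mask>"


def placeholder_model(nodes, edges, rng):
    """Stand-in for a trained masked graph model (assumption).

    A real model would predict each masked component conditioned on the
    observed ones; this placeholder just samples uniformly at random.
    """
    new_nodes = [rng.choice(NODE_TYPES) if n == MASK else n for n in nodes]
    new_edges = {
        pair: (rng.choice(EDGE_TYPES) if label == MASK else label)
        for pair, label in edges.items()
    }
    return new_nodes, new_edges


def mask_and_replace(nodes, edges, mask_fraction, num_steps, seed=0):
    """Repeatedly mask a fraction of the graph and let the model refill it."""
    rng = random.Random(seed)
    nodes = list(nodes)   # copy so the caller's graph is not modified
    edges = dict(edges)
    components = [("node", i) for i in range(len(nodes))]
    components += [("edge", pair) for pair in edges]
    num_masked = max(1, round(mask_fraction * len(components)))

    for _ in range(num_steps):
        # Pick which nodes and edges to hide at this generation step.
        for kind, key in rng.sample(components, num_masked):
            if kind == "node":
                nodes[key] = MASK
            else:
                edges[key] = MASK
        # Replace the hidden parts with the model's samples.
        nodes, edges = placeholder_model(nodes, edges, rng)
    return nodes, edges


# Example: start from a small initialized graph with three atoms.
init_nodes = ["C", "C", "O"]
init_edges = {(0, 1): "single", (0, 2): "none", (1, 2): "single"}
new_nodes, new_edges = mask_and_replace(
    init_nodes, init_edges, mask_fraction=0.2, num_steps=10
)
```

In this sketch, `mask_fraction` and the choice of `init_nodes` and `init_edges` play the role of the two controls the abstract mentions: the fraction of the graph masked and replaced at each step, and the generation initialization.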

Updated: 2021-01-21