Chemical language models enable navigation in sparsely populated chemical space
Nature Machine Intelligence (IF 23.8), Pub Date: 2021-07-19, DOI: 10.1038/s42256-021-00368-1
Michael A. Skinnider, R. Greg Stacey, Leonard J. Foster, David S. Wishart

Deep generative models are powerful tools for the exploration of chemical space, enabling the on-demand generation of molecules with desired physical, chemical or biological properties. However, these models are typically thought to require training datasets comprising hundreds of thousands, or even millions, of molecules. This perception limits the application of deep generative models in regions of chemical space populated by a relatively small number of examples. Here, we systematically evaluate and optimize generative models of molecules based on recurrent neural networks in low-data settings. We find that robust models can be learned from far fewer examples than has been widely assumed. We identify strategies that further reduce the number of molecules required to learn a model of equivalent quality, notably including data augmentation by non-canonical SMILES enumeration, and demonstrate the application of these principles by learning models of bacterial, plant and fungal metabolomes. The structure of our experiments also allows us to benchmark the metrics used to evaluate generative models themselves. We find that many of the most widely used metrics in the field fail to capture model quality, but we identify a subset of well-behaved metrics that provide a sound basis for model development. Collectively, our work provides a foundation for directly learning generative models in sparsely populated regions of chemical space.
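The augmentation strategy named in the abstract, non-canonical SMILES enumeration, can be illustrated with a short sketch. The snippet below is an assumption about tooling, not the authors' published code: it uses RDKit's Chem.MolToSmiles with doRandom=True to produce several distinct, equally valid SMILES strings for each molecule, and the helper enumerate_smiles and the toy training set are hypothetical.

# A minimal sketch of data augmentation by non-canonical SMILES enumeration.
# Assumes RDKit is available; the paper's own implementation may differ.
from rdkit import Chem

def enumerate_smiles(smiles, n=10):
    """Return up to n distinct non-canonical SMILES strings for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    for _ in range(10 * n):  # oversample random atom orderings, then deduplicate
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n:
            break
    return sorted(variants)

# Toy usage: expand a small training set before fitting an RNN language model.
training_set = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # ethanol, phenol, acetaminophen
augmented = [v for s in training_set for v in enumerate_smiles(s, n=5)]
print(len(augmented), "augmented SMILES from", len(training_set), "molecules")

Each enumerated string encodes the same molecule, so a small training set can be expanded many-fold without introducing any new chemical structures, which is why this form of augmentation suits sparsely populated regions of chemical space.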




Updated: 2021-07-19