Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier
Journal of Cheminformatics (IF 7.1), Pub Date: 2021-10-07, DOI: 10.1186/s13321-021-00535-x
Jennifer Handsel, Brian Matthews, Nicola J Knight, Simon J Coles

We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.
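The character-by-character processing described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the special-token names (`<pad>`, `<bos>`, `<eos>`), the helper functions, and the single ethanol example pair are assumptions chosen for illustration, and the transformer itself is omitted.

```python
# Illustrative character-level tokenizer for InChI / IUPAC-name pairs.
# Token names and helper functions are hypothetical, not from the paper.
PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"


def build_vocab(strings):
    """Map the special tokens and every distinct character to an integer id."""
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate([PAD, BOS, EOS] + chars)}


def encode(text, vocab):
    """Turn a string into one id per character, wrapped in BOS/EOS."""
    return [vocab[BOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids, vocab):
    """Invert encode(), dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, BOS, EOS))


# One source/target pair: the standard InChI of ethanol and its IUPAC name.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
name = "ethanol"

source_vocab = build_vocab([inchi])
target_vocab = build_vocab([name])

source_ids = encode(inchi, source_vocab)  # would feed the encoder
target_ids = encode(name, target_vocab)   # would be predicted by the decoder
```

With word or sub-word tokenization a name such as "ethanol" would map to one or two tokens; here every character is its own token, so sequences are longer but the vocabulary stays very small.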

Updated: 2021-10-07