Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation
Machine Learning: Science and Technology (IF 6.3), Pub Date: 2020-11-03, DOI: 10.1088/2632-2153/aba947
Mario Krenn 1,2,3; Florian Häse 1,2,3,4; AkshatKumar Nigam 2; Pascal Friederich 2,5; Alán Aspuru-Guzik 1,2,3,6

The discovery of novel materials and functional molecules can help to solve some of society’s most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally, matter engineering (generally denoted as inverse design) was based heavily on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard string-based molecular representation, SMILES, shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without adapting the models; each of the generated molecule candidates is valid. In our experiments, the model’s internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, SELFIES allows for explanation and interpretation of the internal workings of the generative models.
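
The robustness claim in the abstract (every string decodes to a valid molecule) can be illustrated with a deliberately simplified sketch. The code below is not the SELFIES grammar from the paper: the token alphabet, the `decode` and `is_valid` helpers, and the restriction to unbranched chains without rings are all assumptions made for illustration. It only shows the general idea of a decoder that tracks remaining valence and clamps each requested bond order, so that no token sequence can violate valence rules.

```python
import random

# Maximum valence per element (hydrogens are implicit in this toy model).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}

# Hypothetical toy alphabet: every bond prefix combined with every element.
ALPHABET = [f"[{b}{a}]" for b in BOND_ORDER for a in MAX_VALENCE]


def decode(tokens):
    """Turn any token sequence into a linear chain (atoms, bonds)."""
    atoms, bonds = [], []
    free = 0  # unused valence on the most recently added atom
    for tok in tokens:
        body = tok[1:-1]
        prefix = body[0] if body[0] in "=#" else ""
        element = body[len(prefix):]
        if not atoms:
            # First atom: there is nothing to bond to, so the prefix is ignored.
            atoms.append(element)
            free = MAX_VALENCE[element]
            continue
        if free == 0:
            break  # chain end is saturated; remaining tokens are ignored
        # Clamp the requested bond order to what both atoms can support.
        order = min(BOND_ORDER[prefix], free, MAX_VALENCE[element])
        bonds.append((len(atoms) - 1, len(atoms), order))
        atoms.append(element)
        free = MAX_VALENCE[element] - order
    return atoms, bonds


def is_valid(atoms, bonds):
    """Check that no atom uses more bonds than its maximum valence."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(u <= MAX_VALENCE[a] for u, a in zip(used, atoms))


# Random token sequences always decode to a valence-correct chain.
rng = random.Random(0)
samples = [[rng.choice(ALPHABET) for _ in range(rng.randint(1, 12))]
           for _ in range(1000)]
assert all(is_valid(*decode(s)) for s in samples)
```

In this toy model a sequence such as `["[C]", "[=O]", "[C]"]` stops after the oxygen, because the double bond uses up the oxygen's valence and the trailing token is ignored. The representation described in the paper also covers branches and rings, which this sketch leaves out.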




Updated: 2020-11-03