Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules, Briefings in Bioinformatics

Briefings in Bioinformatics (IF 6.8), Pub Date: 2021-07-27, DOI: 10.1093/bib/bbab327
Cheng-Kun Wu, Xiao-Chen Zhang, Zhi-Jiang Yang, Ai-Ping Lu, Ting-Jun Hou, Dong-Sheng Cao


Computational methods have become indispensable tools for accelerating drug discovery and reducing the heavy dependence on time-consuming, labor-intensive experiments. Traditional feature-engineering approaches rely heavily on expert knowledge to devise useful features, which can be costly and sometimes biased. Emerging deep learning (DL) methods offer a data-driven way to learn expressive representations automatically from complex raw data. Inspired by this, researchers have applied various deep neural network models to simplified molecular input line entry specification (SMILES) strings, which contain all the composition and structure information of molecules. However, current models usually suffer from a scarcity of labeled data. This limits the generalization ability of SMILES-based DL models and prevents them from competing with state-of-the-art computational methods. In this study, we used the BiLSTM (bidirectional long short-term memory) attention network (BAN), in which we employed a novel multi-step attention mechanism to facilitate the extraction of key features from SMILES strings. In addition, SMILES enumeration was used as a data augmentation method in the training phase to substantially increase the amount of labeled data and raise the probability of mining more patterns from complex SMILES. We used SMILES enumeration again in the prediction phase to correct model prediction bias and provide more accurate predictions. Combined with the BAN model, our strategies can greatly improve the performance of latent features learned from SMILES strings. In 11 canonical absorption, distribution, metabolism, excretion and toxicity-related tasks, our method outperformed the state-of-the-art approaches.
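The two uses of SMILES enumeration described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the variant table, `enumerate_smiles`, and `toy_model` are hypothetical stand-ins, and averaging the per-variant outputs is an assumption, since the abstract does not state how predictions over enumerated strings are combined. A real pipeline would presumably generate the variants with a cheminformatics toolkit and use a trained network in place of `toy_model`.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hand-written equivalent SMILES strings, for illustration only.
# Each list contains different valid spellings of the same molecule.
TOY_VARIANTS: Dict[str, List[str]] = {
    "CCO": ["CCO", "OCC", "C(O)C"],                   # ethanol
    "CC(=O)O": ["CC(=O)O", "OC(C)=O", "C(C)(=O)O"],   # acetic acid
}


def enumerate_smiles(smiles: str) -> List[str]:
    """Hypothetical enumerator: look up equivalent strings for one molecule."""
    return TOY_VARIANTS.get(smiles, [smiles])


def augment_training_set(
    data: List[Tuple[str, float]],
    enumerate_fn: Callable[[str], List[str]],
) -> List[Tuple[str, float]]:
    """Training phase: every variant inherits the label of its molecule."""
    return [(variant, label) for smi, label in data for variant in enumerate_fn(smi)]


def predict_with_enumeration(
    smiles: str,
    predict_fn: Callable[[str], float],
    enumerate_fn: Callable[[str], List[str]],
) -> float:
    """Prediction phase: average the model output over all variants (assumed rule)."""
    return mean(predict_fn(variant) for variant in enumerate_fn(smiles))


def toy_model(smiles: str) -> float:
    """Stand-in for a trained network whose output depends on the string form."""
    return smiles.index("O") / len(smiles)


if __name__ == "__main__":
    train = [("CCO", 1.0), ("CC(=O)O", 0.0)]
    print(augment_training_set(train, enumerate_smiles))
    print(predict_with_enumeration("CCO", toy_model, enumerate_smiles))
```

Because `toy_model` returns a different value for each spelling of ethanol, the averaged prediction no longer depends on which spelling happened to be supplied, which is the intuition behind applying enumeration at prediction time.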


Updated: 2021-07-27
