Distributed Representation of Chemical Fragments
ACS Omega (IF 4.1), Pub Date: 2018-03-08, DOI: 10.1021/acsomega.7b02045
Suman K. Chakravarti

This article describes an unsupervised machine learning method for computing distributed vector representations of molecular fragments. These vectors encode fragment features in a continuous high-dimensional space and enable similarity computation between individual fragments, even for small fragments with only two heavy atoms. The method is based on a word embedding algorithm borrowed from the field of natural language processing, and approximately 6 million unlabeled PubChem chemicals were used for training. The resulting dense fragment vectors contrast with the traditional sparse “one-hot” fragment representation and capture rich relational structure in the fragment space. The vectors of small linear fragments were averaged to yield distributed vectors of bigger fragments and molecules, which were used for different tasks, e.g., clustering, ligand recall, and quantitative structure–activity relationship modeling. The distributed vectors were found to be better than standard binary fingerprints at clustering ring systems and recalling kinase ligands. This work demonstrates unsupervised learning of fragment chemistry from large sets of unlabeled chemical structures and its subsequent application to supervised training on relatively small data sets of labeled chemicals.
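
The contrast between one-hot and dense fragment vectors, and the averaging step that builds a molecule vector from fragment vectors, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the fragment labels, the three-dimensional toy vectors, and the helper names (`mean_vector`, `cosine`, `one_hot`) are hypothetical and are not taken from the article, whose vectors are learned from PubChem structures with a word embedding algorithm and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional fragment embeddings, invented for illustration.
# Learned embeddings would come from training on a large corpus of structures.
FRAGMENT_VECTORS = {
    "C-C": [0.9, 0.1, 0.0],
    "C-O": [0.2, 0.8, 0.1],
    "C-N": [0.3, 0.7, 0.2],
    "C=O": [0.1, 0.6, 0.7],
}


def mean_vector(fragments, table):
    """Average the embeddings of the listed fragments, skipping unknown ones."""
    known = [table[f] for f in fragments if f in table]
    if not known:
        raise ValueError("none of the fragments has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def one_hot(fragment, vocabulary):
    """Sparse representation: 1.0 at the fragment's position, 0.0 elsewhere."""
    return [1.0 if f == fragment else 0.0 for f in vocabulary]


vocab = sorted(FRAGMENT_VECTORS)

# With one-hot vectors, two distinct fragments are always orthogonal.
sim_one_hot = cosine(one_hot("C-O", vocab), one_hot("C-N", vocab))

# With dense vectors, similarity between distinct fragments is graded.
sim_dense = cosine(FRAGMENT_VECTORS["C-O"], FRAGMENT_VECTORS["C-N"])

# Molecule-level vectors as the mean of their fragment vectors (toy fragment lists).
molecule_a = mean_vector(["C-C", "C-O"], FRAGMENT_VECTORS)
molecule_b = mean_vector(["C-C", "C-N"], FRAGMENT_VECTORS)
sim_molecules = cosine(molecule_a, molecule_b)

print(sim_one_hot, sim_dense, sim_molecules)
```

In this toy setup the one-hot similarity of two different fragments is exactly zero, while the dense vectors give a graded value, which is the property the abstract refers to when it says the dense vectors enable similarity computation between individual fragments.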

Updated: 2018-03-08