A probabilistic molecular fingerprint for big data settings.
Journal of Cheminformatics (IF 7.1), Pub Date: 2018-12-18, DOI: 10.1186/s13321-018-0321-8
Daniel Probst, Jean-Louis Reymond

Among the molecular fingerprints available to describe small organic molecules, the extended connectivity fingerprint, up to four bonds (ECFP4), performs best in benchmarking drug analog recovery studies because it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥ 1024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC are very slow due to the curse of dimensionality. Herein we report a new fingerprint, called MinHash fingerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. With locality sensitive hashing, approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fingerprints in terms of speed and relative recovery rate, while operating in a very sparse and high-dimensional binary chemical space. In summary, MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
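
The MinHash step named in the abstract, and the way it feeds an LSH index, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation (the GitHub repository above holds that). The hash family `(a*x + b) mod PRIME`, the SHA-1 string mapping, the signature length `n_perm`, the seed, the band count, and all function names are assumptions made for this example. The fragment strings are hand-written placeholders, not the output of MHFP6's substructure extraction, which would require a cheminformatics toolkit.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # assumed modulus for the illustrative hash family


def make_permutations(n_perm=128, seed=42):
    """Draw n_perm (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(n_perm)]


def string_to_int(s):
    """Map a string (e.g. a substructure SMILES) to a 32-bit integer."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:4], "little")


def minhash(shingles, perms):
    """MinHash signature: for each hash function, keep the minimum over the set."""
    values = [string_to_int(s) for s in shingles]
    return [min((a * v + b) % PRIME for v in values) for a, b in perms]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature, n_bands=32):
    """Split a signature into bands; equal band keys mark candidate neighbors."""
    rows = len(signature) // n_bands
    return [(i, tuple(signature[i * rows:(i + 1) * rows]))
            for i in range(n_bands)]


if __name__ == "__main__":
    perms = make_permutations()
    # Hypothetical fragment sets standing in for extracted substructure SMILES.
    mol_a = {"C", "O", "CC", "CO", "CCO"}
    mol_b = {"C", "O", "CC", "CO", "CCO", "CCC", "CCCO"}
    sig_a, sig_b = minhash(mol_a, perms), minhash(mol_b, perms)
    exact = len(mol_a & mol_b) / len(mol_a | mol_b)
    print("exact Jaccard:", exact)
    print("MinHash estimate:", estimated_jaccard(sig_a, sig_b))
    shared = set(lsh_band_keys(sig_a)) & set(lsh_band_keys(sig_b))
    print("shared LSH bands:", len(shared))
```

Because each signature position is a minimum over the whole set, the signature has a fixed length regardless of how many substructures a molecule contains. This is why no folding step is needed before the band keys are placed into hash tables for approximate nearest neighbor lookup.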

Updated: 2018-12-18