Visualization of very large high-dimensional data sets as minimum spanning trees
Journal of Cheminformatics (IF 8.6), Pub Date: 2020-02-12, DOI: 10.1186/s13321-020-0416-x
Daniel Probst, Jean-Louis Reymond

The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.
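
The abstract does not spell out how the tree is built, so the following is only a toy illustration of the general idea of turning high-dimensional points into a minimum spanning tree. It assumes a brute-force k-nearest-neighbor graph with Euclidean distances and Kruskal's algorithm; it is not the TMAP implementation, and the function names `knn_edges` and `minimum_spanning_forest` are hypothetical.

```python
from math import dist


def knn_edges(points, k):
    """Return sorted (distance, i, j) edges linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        # Brute-force neighbor search: an assumption that only suits toy inputs.
        neighbors = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in neighbors:
            edges.add((d, min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)


def minimum_spanning_forest(n, edges):
    """Kruskal's algorithm with union-find; `edges` must be sorted by weight."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups cheap
            x = parent[x]
        return x

    tree = []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((i, j, d))
    return tree


# Five made-up 3-D points forming two loose groups.
points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (5.0, 5.0, 5.0),
    (5.0, 6.0, 5.0),
]
tree = minimum_spanning_forest(len(points), knn_edges(points, k=2))
for i, j, d in tree:
    print(f"{i} - {j}  (distance {d:.2f})")
```

If the neighbor graph is disconnected, this sketch returns a spanning forest rather than a single tree. Drawing the result in two dimensions would need a separate tree layout step, which is not shown here.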

Updated: 2020-02-12