A bin and hash method for analyzing reference data and descriptors in machine learning potentials
Machine Learning: Science and Technology (IF 6.3), Pub Date: 2021-04-23, DOI: 10.1088/2632-2153/abe663
Martín Leandro Paleico, Jörg Behler

In recent years, the development of machine learning potentials (MLPs) has become a very active field of research. Numerous approaches have been proposed that allow extended simulations of large systems at a small fraction of the computational cost of electronic structure calculations. The key to the success of modern MLPs is that they describe the atomic interactions with close to first-principles quality. This accuracy is reached by combining very flexible functional forms with high-level reference data from electronic structure calculations. These data sets can include up to hundreds of thousands of structures covering millions of atomic environments, to ensure that all relevant features of the potential energy surface are well represented. Handling such large data sets is becoming one of the main challenges in the construction of MLPs. In this paper we present a method, the bin-and-hash (BAH) algorithm, that addresses this problem by enabling the efficient identification and comparison of large numbers of multidimensional vectors. Such vectors emerge in multiple contexts in the construction of MLPs. Examples are the comparison of local atomic environments to identify and avoid redundant information in the reference data sets, which is costly in terms of both the electronic structure calculations and the training process; the assessment of the quality of the descriptors used as structural fingerprints in many types of MLPs; and the detection of possibly unreliable data points. The BAH algorithm is illustrated for high-dimensional neural network potentials that use atom-centered symmetry functions for the geometrical description of the atomic environments, but the method is general and can be combined with any current type of MLP.
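The abstract names the bin-and-hash idea but does not spell out its steps. The following is a minimal sketch of what the name suggests, not the authors' implementation: each component of a descriptor vector is mapped to an integer bin index, and the resulting tuple of indices is hashed so that vectors falling into the same bins share a key. The function name `bin_and_hash`, the uniform `bin_width` parameter, and the use of SHA-1 are assumptions made for illustration only.

```python
from collections import defaultdict
import hashlib

import numpy as np


def bin_and_hash(vectors, bin_width=0.1):
    """Group vectors whose components all fall into the same bins.

    vectors:   array-like of shape (n_vectors, n_dims)
    bin_width: hypothetical uniform grid spacing (an assumption,
               not a value taken from the paper)
    Returns a dict mapping hash key -> list of row indices.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Binning: map every component to an integer bin index.
    bins = np.floor(vectors / bin_width).astype(np.int64)
    groups = defaultdict(list)
    for i, row in enumerate(bins):
        # Hashing: one fixed-length key per vector of bin indices.
        key = hashlib.sha1(row.tobytes()).hexdigest()
        groups[key].append(i)
    return dict(groups)


# Toy descriptor vectors (made-up numbers, for illustration only).
descriptors = np.array([
    [0.12, 0.51, 0.93],
    [0.13, 0.55, 0.98],  # lands in the same bins as the first row
    [0.71, 0.20, 0.40],
])
groups = bin_and_hash(descriptors, bin_width=0.1)
# Index groups with more than one member are candidate redundancies.
duplicates = [idx for idx in groups.values() if len(idx) > 1]
```

Because grouping is a dictionary lookup per vector, the cost of this sketch grows linearly with the number of vectors rather than quadratically, as an all-pairs comparison would. One limitation of plain grid binning is that two nearly identical vectors on opposite sides of a bin boundary receive different keys.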




Updated: 2021-04-23