Elsevier

Science Bulletin

Volume 66, Issue 2, 30 January 2021, Pages 111-119

Article
High performance computing of DGDFT for tens of thousands of atoms using millions of cores on Sunway TaihuLight

https://doi.org/10.1016/j.scib.2020.06.025

Abstract

High performance computing (HPC) is a powerful tool to accelerate Kohn–Sham density functional theory (KS-DFT) calculations on modern heterogeneous supercomputers. Here, we describe a massively parallel implementation of the discontinuous Galerkin density functional theory (DGDFT) method on the Sunway TaihuLight supercomputer. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field (SCF) iteration to solve the KS equations with a precision comparable to that of a plane-wave basis set. In particular, the DGDFT method adopts a two-level parallelization strategy that handles various types of data distribution, task scheduling, and data communication schemes, and combines it with the master–slave multi-thread heterogeneous parallelism of the SW26010 processor, enabling large-scale HPC KS-DFT calculations on the Sunway TaihuLight supercomputer. We show that the DGDFT method can scale up to 8,519,680 processing cores (131,072 core groups) on the Sunway TaihuLight supercomputer for studying the electronic structures of two-dimensional (2D) metallic graphene systems that contain tens of thousands of carbon atoms.

Introduction

Kohn–Sham density functional theory (KS-DFT) [1], [2] is the most powerful methodology for first-principles calculations of the electronic structures of molecules and solids. However, conventional KS-DFT calculations have cubic computational complexity, O(N^3), with respect to the system size N. The computational cost and memory usage therefore increase rapidly with system size, which limits conventional KS-DFT calculations to small systems containing hundreds of atoms and makes them prohibitively expensive for first-principles materials simulations of large-scale systems that contain thousands of atoms.
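To make the cubic scaling concrete, the following minimal sketch estimates how the cost grows relative to a reference system. It assumes the cost is exactly proportional to N^3 with an unchanged prefactor, which is an idealization. The function name `relative_cost` and the example sizes (100 and 10,000 atoms) are illustrative assumptions, not benchmark values from this work.

```python
def relative_cost(n_atoms, n_ref, exponent=3.0):
    """Cost of an n_atoms system relative to an n_ref system.

    Assumes cost ~ N**exponent with the same prefactor for both systems
    (an idealized model, not a measured timing).
    """
    return (n_atoms / n_ref) ** exponent


if __name__ == "__main__":
    # Illustrative sizes only: hundreds of atoms versus tens of thousands.
    ratio = relative_cost(10_000, 100)
    print(f"10,000 atoms cost about {ratio:.0e} times as much as 100 atoms")
```

Under this idealized model, a 100-fold increase in the number of atoms corresponds to a million-fold increase in cost, which is why low-scaling methods and massive parallelism are both needed for systems of this size.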

Several low-scaling methods have been proposed to reduce the computational cost of KS-DFT calculations, such as linear scaling O(N) methods [3], [4], [5], divide-and-conquer (DAC) methods [6] and fragment molecular orbital (FMO) methods [7]. These low-scaling methods rely principally on the nearsightedness principle in molecules and semiconductors, and have been widely implemented with small localized basis sets in real space, such as Gaussian [8] and numerical atomic orbitals [4], which yield a sparse Hamiltonian in real space. Based on these low-scaling methods, several highly efficient KS-DFT materials simulation software packages have been developed, such as SIESTA [9], CP2K [10], CONQUEST [11] and HONPAS [12]. Because a sparse Hamiltonian generated in small localized basis sets requires only local data communication, these packages can take full advantage of the massive parallelism available on modern high performance computing (HPC) architectures.

However, the accuracy of these low-scaling methods depends strongly on the parameters of the localized basis sets and is difficult to improve systematically, in contrast to large uniform basis sets such as plane waves. Several KS-DFT materials simulation software packages have been developed using plane-wave basis sets, such as VASP [13] and QUANTUM ESPRESSO [14]. However, a plane-wave basis set requires a large number of basis functions to reach high accuracy, and it is difficult to exploit HPC on modern heterogeneous supercomputers with such a basis because the dense Hamiltonian requires large all-to-all data communications [15].

The recently developed discontinuous Galerkin density functional theory (DGDFT) [16], [17], [18], [19], [20] aims to combine the advantages of both small localized and large uniform basis sets: it reduces the number of basis functions to a level similar to that of numerical atomic basis sets, while maintaining a precision comparable to that of a plane-wave basis set. The DGDFT method is discretized on an adaptive local basis (ALB) set [16]. Its unique feature is that each ALB function is strictly localized in a subdomain in real space, which results in a sparse Hamiltonian whose block-tridiagonal structure is the same for both metallic and semiconducting systems. Therefore, the DGDFT method can take full advantage of the massive parallelism available on modern heterogeneous supercomputers [17].
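The block-tridiagonal sparsity described above can be illustrated with a small sketch. It assumes, purely for illustration, that the subdomains (elements) form a one-dimensional chain and that basis functions couple only within an element or between adjacent elements. The function names and the sizes used (8 elements, 4 basis functions each) are hypothetical and are not taken from the DGDFT code.

```python
def block_tridiagonal_mask(n_elements, n_basis):
    """Return a boolean matrix marking entries that may be nonzero.

    Basis function i belongs to element i // n_basis. An entry (i, j) is
    kept only if the two owning elements are the same or adjacent in the
    assumed one-dimensional chain of elements.
    """
    size = n_elements * n_basis
    return [
        [abs(i // n_basis - j // n_basis) <= 1 for j in range(size)]
        for i in range(size)
    ]


def fill_fraction(mask):
    """Fraction of matrix entries that are structurally nonzero."""
    size = len(mask)
    return sum(sum(row) for row in mask) / (size * size)


if __name__ == "__main__":
    mask = block_tridiagonal_mask(n_elements=8, n_basis=4)
    for row in mask[::4]:  # one row per element keeps the printout short
        print("".join("#" if kept else "." for kept in row))
    print(f"fill fraction: {fill_fraction(mask):.3f}")
```

In this chain model the number of nonzero blocks is 3n - 2 for n elements, so the fill fraction shrinks as elements are added. The pattern depends only on which elements are neighbors, not on the electronic character of the material, which is the property the text highlights.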

Such parallel KS-DFT calculations require increasingly complicated software development to achieve good parallel performance and scalability across the diverse ecosystem of modern heterogeneous supercomputers, in particular the widely used X86 CPU (Central Processing Unit) architectures. Large-scale KS-DFT calculations have been performed with CP2K [10] and CONQUEST [11] on the Cray supercomputer with X86 architecture. In particular, DGDFT has demonstrated scaling to 128,000 cores on the Edison supercomputer at NERSC in the USA for large-scale KS-DFT calculations on semiconducting phosphorene systems with 14,000 atoms [17].

In China, the Sunway TaihuLight [21] belongs to a new generation of the fastest supercomputers in the world. It uses the Chinese home-grown SW26010 processors, which are based on a new Sunway master–slave heterogeneous architecture. Unlike the widely used X86 CPU architectures, each master processing core on the SW26010 processor can be accelerated by 64 slave processing cores. This 64-thread parallelism bridges the gap between Open Multi-Processing (OpenMP) (16–32 threads) and Graphics Processing Unit (GPU) (256–512 threads) parallel programming. Exploiting this hardware advantage requires KS-DFT software packages to be reimplemented for the Sunway TaihuLight supercomputer.
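The relation between core groups and processing cores follows directly from this description: one master core plus 64 slave cores gives 65 cores per core group. The short sketch below checks that count against the 131,072 core groups quoted in the abstract. The constant and function names are illustrative, not part of any Sunway API.

```python
MASTER_CORES_PER_GROUP = 1   # one master processing core per core group
SLAVE_CORES_PER_GROUP = 64   # slave processing cores accelerating each master


def total_cores(core_groups):
    """Total processing cores for a given number of core groups."""
    return core_groups * (MASTER_CORES_PER_GROUP + SLAVE_CORES_PER_GROUP)


if __name__ == "__main__":
    groups = 131_072  # core groups quoted in the abstract
    print(f"{groups:,} core groups -> {total_cores(groups):,} cores")
```

For 131,072 core groups this gives 8,519,680 cores, which matches the core count reported in the abstract.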

In the present work, we describe a massively parallel implementation of the DGDFT method on the Sunway TaihuLight supercomputer. We demonstrate that the DGDFT method adopts a two-level parallelization strategy that makes use of different types of data distribution, task scheduling, and data communication schemes, and combines it with the master–slave multi-thread heterogeneous parallelism of the SW26010 processors, enabling extreme-scale HPC KS-DFT calculations for tens of thousands of atoms using millions of cores on the Sunway TaihuLight supercomputer.

Methodology

In this section, we describe the theoretical algorithms and parallel implementation of DGDFT on the Sunway TaihuLight supercomputer in detail. The key idea of DGDFT is to discretize the global KS equations by using the adaptive local basis (ALB) set in a discontinuous Galerkin (DG) framework [16]. The scalable implementation of DGDFT is based on the two-level parallelization strategy of DGDFT combined with the master–slave multi-thread heterogeneous parallelism of the SW26010 processor on the

Results and discussion

In this section, we demonstrate the computational efficiency and parallel scalability of the DGDFT method for accelerating large-scale KS-DFT calculations on the Sunway TaihuLight supercomputer. We have implemented the DGDFT method as a software package, also called DGDFT [17], which is written in the C/C++ programming language with the message passing interface (MPI) for parallel programming. DGDFT supports the Hartwigsen-Goedecker-Hutter (HGH) [28] norm-conserving pseudo-potential. In this

Conclusion and outlook

In summary, we demonstrate that DGDFT can be used to push the envelope to investigate the electronic structures of ultra-large-scale metallic systems containing tens of thousands of atoms by combining the two-level parallelization strategy of DGDFT with the master–slave multi-thread heterogeneous parallelism of the Sunway TaihuLight supercomputer. We show that DGDFT can achieve a high parallel efficiency up to 32.3% (speedup as high as 42,382.9) by using 8,519,680 processing cores (131,072
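The quoted efficiency and speedup can be cross-checked with a short calculation. The sketch below assumes that parallel efficiency is defined as the speedup divided by the number of core groups. That definition is an assumption made here because the snippet above is truncated, but it is numerically consistent with the two quoted figures. The function name is illustrative.

```python
def parallel_efficiency(speedup, n_units):
    """Parallel efficiency as speedup per parallel unit (assumed definition)."""
    return speedup / n_units


if __name__ == "__main__":
    speedup = 42_382.9     # speedup quoted in the text
    core_groups = 131_072  # core groups quoted in the text
    eff = parallel_efficiency(speedup, core_groups)
    print(f"parallel efficiency: {eff:.1%}")
```

With these inputs the ratio rounds to 32.3%, in agreement with the efficiency stated in the text.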

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgments

This work was partly supported by the Supercomputer Application Project Trail Funding from Wuxi Jiangnan Institute of Computing Technology (BB2340000016), the Strategic Priority Research Program of Chinese Academy of Sciences (XDC01040100), the National Natural Science Foundation of China (21688102, 21803066), the Anhui Initiative in Quantum Information Technologies (AHY090400), the National Key Research and Development Program of China (2016YFA0200604), the Fundamental Research Funds for

References (29)

  • H. Shang et al. Linear scaling electronic structure calculations with numerical atomic basis set. Int Rev Phys Chem (2010)
  • D.R. Bowler et al. O(N) methods in electronic structure calculations. Rep Prog Phys (2012)
  • Z. Zhao et al. A divide-and-conquer linear scaling three-dimensional fragment method for large scale electronic structure calculations. J Phys Condens Matter (2008)
  • M.J. Frisch et al. Self-consistent molecular orbital methods 25. Supplementary functions for Gaussian basis sets. J Chem Phys (1984)

    Wei Hu is currently a Professor at the Hefei National Laboratory for Physical Sciences at the Microscale (HFNL) of the University of Science and Technology of China (USTC). He received his Ph.D. degree in Theoretical and Computational Chemistry from the USTC in 2013. Then, he joined the Computational Research Division of Lawrence Berkeley National Laboratory as a postdoctoral fellow for developing high performance computing (HPC) density functional theory (DFT) software. His research interest focuses on the method development and scientific applications of large-scale DFT calculations.

    Hong An is currently a Professor at School of Computer Science and Technology, University of Science and Technology of China (USTC). She received her Ph.D. degree in Computer Science from the USTC in 2000. She is the director of the Advanced Computer System Architecture (ACSA) Lab in the USTC. Her main research focuses on parallel computer architecture, parallel programming, operating system design, and high performance computing.

    Jinlong Yang is currently a Professor at the University of Science and Technology of China (USTC). He received his Ph.D. degree in Condensed Matter Physics from USTC in 1991. He became the Dean of the School of Chemistry and Materials Science of USTC in 2009 and the Vice President of USTC in 2018. His research interest focuses on the development of first-principles methods and their applications to clusters, nanostructures, solid materials, surfaces, and interfaces.