Elsevier

Science Bulletin

Volume 66, Issue 19, 15 October 2021, Pages 1973-1985
Article
JAMIP: an artificial-intelligence aided data-driven infrastructure for computational materials informatics

https://doi.org/10.1016/j.scib.2021.06.011

Abstract

Materials informatics has emerged as a promising new paradigm for accelerating materials discovery and design. It applies machine learning methods to massive materials data from experiments or simulations to seek new materials, functionalities, and principles. Developing specialized facilities to generate, collect, manage, learn from, and mine large-scale materials data is crucial to materials informatics. We herein developed an artificial-intelligence-aided data-driven infrastructure named Jilin Artificial-intelligence aided Materials-design Integrated Package (JAMIP), an open-source Python framework that meets the research requirements of computational materials informatics. It integrates a materials production factory, a high-throughput first-principles calculation engine, automatic task submission and progress monitoring, a data extraction, management, and storage system, and machine-learning-based data mining functions. We have also integrated specific features, such as an inorganic crystal structure prototype database to facilitate high-throughput calculations, and essential modules for machine learning studies of functional materials. We demonstrate how the code is useful in exploring the materials informatics of optoelectronic semiconductors, taking halide perovskites as a typical case. By following the principles of automation, extensibility, reliability, and intelligence, the JAMIP code is a promising and powerful tool for the fast-growing field of computational materials informatics.

Introduction

Currently, materials science research is entering a new paradigm featuring materials informatics, which applies machine learning techniques to massive materials data [1], [2], [3], [4] to accelerate materials discovery and design. It was first promoted in part by the Materials Genome Initiative [5] and emerged in practice as the result of the fast development of experimental materials synthesis, characterization approaches, and theoretical materials simulation through available computation resources that yield massive materials data [1]. By utilizing the recognized predictive power of materials informatics, encouraging breakthroughs have been made, including the design of new materials, the identification of new relationships [6], [7], and the prediction of new principles [8], [9]. In the fields of both experimental and computational materials informatics, there is an urgent need to develop powerful facilities/infrastructures to meet research requirements under this new paradigm.

Generating massive data on composition-structure-property relationships is the essential first step for materials informatics studies. In the field of computational materials science, there are generally three techniques for generating materials data. The first is direct calculation of one or a few materials. This was used frequently under the old paradigm, when theorists wanted to study material properties observed experimentally or to explore new physics in previously unstudied materials. Such data are scattered across the literature and may be gathered manually or by automatic text extraction approaches [10], [11]. The second is high-throughput (HT) calculation, where extremely large numbers of materials spanning different chemical compositions are automatically calculated based on prototype structures [12], [13]. Along this direction, a series of computational infrastructures have been developed and used in high-throughput calculations [12], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. The accumulated computational materials data have led to the formation of databases that are accessible for query and research purposes [25], [26], [27], [28], [29]. The third is crystal structure search, which seeks new stable and metastable structures for a given chemical composition. This method complements high-throughput calculations. It explores the potential energy surface of materials, usually by combining artificial intelligence global optimization methods with energetic calculations [25], [30], [31], [32], [33], [34]. Because structure search studies focus mainly on the lowest-energy ground-state structure, the generated data are usually regarded as intermediate products and are not accumulated for materials informatics studies.
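The high-throughput route described above starts from a prototype structure and decorates its sites with different elements. The following is a minimal sketch of that enumeration step only, assuming a hypothetical ABX3 site list and an illustrative element pool. The names `PROTOTYPE_SITES`, `CANDIDATES`, `enumerate_compositions`, and `formula` are invented for this example and are not part of JAMIP or its prototype database.

```python
from itertools import product

# Hypothetical prototype: site labels of an ABX3 cell (illustrative only).
PROTOTYPE_SITES = ("A", "B", "X", "X", "X")

# Illustrative candidate elements per site; not a list taken from the paper.
CANDIDATES = {
    "A": ["Cs", "Rb"],
    "B": ["Pb", "Sn", "Ge"],
    "X": ["Cl", "Br", "I"],
}

def enumerate_compositions(sites, candidates):
    """Yield one element assignment per unique site label of the prototype."""
    labels = sorted(set(sites))
    for choice in product(*(candidates[label] for label in labels)):
        yield dict(zip(labels, choice))

def formula(sites, assignment):
    """Build a formula string such as 'CsPbI3' from a site decoration."""
    parts = []
    for label in dict.fromkeys(sites):  # unique labels, in prototype order
        count = sites.count(label)
        parts.append(assignment[label] + (str(count) if count > 1 else ""))
    return "".join(parts)

formulas = [
    formula(PROTOTYPE_SITES, assignment)
    for assignment in enumerate_compositions(PROTOTYPE_SITES, CANDIDATES)
]
print(len(formulas), formulas[:3])
```

In a real workflow each enumerated composition would be turned into a full crystal structure and passed to a first-principles calculation. Here only the combinatorial step is shown, which is what makes the number of candidates grow quickly with the size of the element pool.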

Machine learning techniques act as an engine that triggers and accelerates materials discovery in computational materials informatics [35], [36], [37]. Various machine learning approaches are already available for use in computational materials informatics, such as supervised learning, unsupervised learning, and active learning [38]. Supervised learning requires a manually constructed training process, through which the model learns the mapping between material descriptors and material properties. With such a model, the underlying physical laws linking descriptors to material properties can be mined [39], [40]. Additionally, it can be used to reverse-engineer new materials [41], for example by predicting energy-related materials for catalysis, batteries, solar cells, and gas capture [42], [43], [44]. For instance, Chen et al. [44] reported a powerful machine learning model based on the extreme gradient boosting regression (XGBR) algorithm to predict the Gibbs free energy changes of CO adsorption on dispersed metal-nonmetal co-doped graphene electrocatalysts. Unsupervised learning, by contrast, is generally used to uncover differences among materials or chemical systems in terms of unlabeled "descriptors". Clustering algorithms are the most widely used for this purpose. For example, Chen et al. [45] used the K-means clustering algorithm to analyze the oxygen diffusion pattern, hopping statistics, and site occupation within crystals. In addition, active learning and Bayesian optimization, with their self-optimizing models, allow the materials space to be searched for high-quality candidates using limited data, and contribute to both experimental and computational materials design [46].
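To make the unsupervised case concrete, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. The two-dimensional "descriptor" points and the initial centroids are made-up toy values, and the function name `kmeans` is chosen for this example. It is not the implementation used in the cited studies or in JAMIP.

```python
def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])),
            )
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Toy, unlabeled descriptor vectors forming two visible groups.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The initial centroids are passed in explicitly so the run is deterministic. Practical implementations usually pick them randomly or with a seeding heuristic and repeat the run several times.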

The software/infrastructure for computational materials informatics needs to meet at least the following requirements. (1) Adaptability for diverse materials discovery and design: the software should flexibly handle different material systems, from inorganic and organic to organic-inorganic hybrid systems, and different material variations/combinations, such as alloys or defects, surfaces or interfaces, and heterostructures or superlattices. To accommodate the potential complexity of different materials with different properties, computational workflow modules should be highly scalable and allow flexible combinations of different types of computing tasks. (2) Efficiency and reliability in data generation and management: a high-level automation framework to initialize, run, and analyze large-scale high-throughput calculations is essential. Monitoring computational tasks in real time and reporting/correcting potential errors are important. The software should have tools to efficiently extract and collect results and store them in a self-contained database. (3) Orientation by materials functionality: the software should be oriented by the functionality of materials (e.g., photovoltaic, thermoelectric, ferroelectric, and catalytic), which is central to materials discovery. Workflow modules need to be rationally designed to calculate the functionalities of different material systems effectively. (4) Integration of data generation, storage, processing, and learning: it is desirable to integrate data generation, storage, processing, and learning modules within a unified infrastructure framework, which greatly facilitates data flow and conversion, as well as the efficiency of data mining. In the integrated framework, newly calculated data motivated by feedback from data-learning procedures may accelerate knowledge accumulation.
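Requirement (2) combines task monitoring, error correction, and storage in a self-contained database. The following is a minimal sketch of that pattern using only the Python standard library. It assumes a hypothetical retry policy and table layout, and the names `run_with_retry`, `flaky_calc`, and `always_fails` are invented stand-ins. It does not reproduce the JAMIP task manager or its database schema.

```python
import sqlite3

def run_with_retry(task, max_retries=2):
    """Run a task, retrying on failure. Returns (status, result, attempts)."""
    for attempt in range(1, max_retries + 2):
        try:
            return "done", task(attempt), attempt
        except RuntimeError:
            # A real workflow would adjust the inputs here before resubmitting.
            continue
    return "failed", None, max_retries + 1

def flaky_calc(attempt):
    """Stand-in for a calculation that fails on its first attempt."""
    if attempt == 1:
        raise RuntimeError("did not converge")
    return 1.23  # placeholder number, not a computed property

def always_fails(attempt):
    """Stand-in for a calculation that never succeeds."""
    raise RuntimeError("did not converge")

# Self-contained result store (in memory for this example).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE results "
    "(name TEXT PRIMARY KEY, status TEXT, value REAL, attempts INTEGER)"
)

for name, task in [("job_a", flaky_calc), ("job_b", always_fails)]:
    status, value, attempts = run_with_retry(task)
    db.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (name, status, value, attempts),
    )
db.commit()

rows = db.execute(
    "SELECT name, status, value, attempts FROM results ORDER BY name"
).fetchall()
print(rows)
```

Recording the status and attempt count alongside each value keeps failed tasks visible in the same store as successful ones, so they can be reviewed or resubmitted later.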

In this paper, we report on an artificial-intelligence-aided data-driven infrastructure that we have been developing to fulfill the above requirements for computational materials informatics. It is an open-source, Python-based integrated framework named the Jilin Artificial-intelligence-aided Materials-design Integrated Package (JAMIP). The code integrates closely connected units for Data generation (e.g., high-throughput materials calculations and screening), Data collection (e.g., automatic data extraction, management, and storage), and Data learning (e.g., integrated feature engineering and machine-learning-based data mining functions). Below, we describe the JAMIP code framework and its use in materials-informatics-related research on optoelectronic semiconductors, taking halide perovskites as examples. The code and a manual describing its detailed usage options are freely available for download at www.jamip-code.com.

Section snippets

Code framework

The organization of JAMIP abides by the data lifecycle in the research field of computational materials informatics, from data generation to collection and learning, as shown in Fig. 1. In the data generation stage, users perform the large-scale high-throughput materials calculations, which are done through obtaining structure prototypes from the JAMIP database, generating candidate structures through the materials production factory, initializing the input parameters of HT calculation tasks,

Demonstration of usage

In principle, our JAMIP code can be used to perform materials-informatics-related research on the functional material system of which the functionality-involved properties can be accurately described by first-principles calculations. In this section, we briefly describe its use in studying halide perovskites-based semiconductors for optoelectronic applications. Halide perovskites (HPs, formula as ABX3), as promising optoelectronic materials (e.g., photovoltaic solar cells, light-emitting

Discussion and conclusion

To summarize, we have reported the development of an artificial-intelligence-aided data-driven infrastructure named Jilin Artificial-intelligence aided Materials-design Integrated Package (JAMIP), which is designed purposely to meet the requirements of the studies of computational materials informatics. It is an open-source software implemented mainly in Python language. With the emphasis on automation, extensibility, reliability, and intelligence, it is integrated by intimately connected units

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61722403, 92061113, and 12004131) and the Interdisciplinary Research Grant for PhDs of Jilin University (101832020DJX043). Calculations were performed in part at the High-Performance Computing Center of Jilin University. We acknowledge the important contributions to the development of JAMIP code from the following group members: Yuwei Li, Zhun Liu, Dongwen Yang, Qiaoling Xu, Yawen Li, Xueting Wang, Yilin Zhang,

References (60)

  • D.C. Lonie et al. XtalOpt: an open-source evolutionary algorithm for crystal structure prediction. Comput Phys Commun (2011)
  • Y. Wang et al. CALYPSO: a method for crystal structure prediction. Comput Phys Commun (2012)
  • B. Gao et al. Interface structure prediction via CALYPSO method. Sci Bull (2019)
  • Y. Liu et al. Materials discovery and design using machine learning. J Mater (2017)
  • M.A. Khan et al. The crystal structure of indium diiodide, indium(I) tetraiodoindate(III), In[InI4]. Inorg Chim Acta (1985)
  • X.-G. Zhao et al. Rational design of halide double perovskites for optoelectronic applications. Joule (2018)
  • A. Agrawal et al. Perspective: materials informatics and big data: realization of the “fourth paradigm” of science in materials science. APL Mater (2016)
  • G.R. Schleder et al. From DFT to machine learning: recent approaches to materials science-a review. J Phys: Mater (2019)
  • C. Draxl et al. Big data-driven materials science and its FAIR data infrastructure. Handb Mater Model (2020)
  • L. Himanen et al. Data-driven materials science: status, challenges, and perspectives. Adv Sci (2019)
  • T. Kalil et al. Materials genome initiative: a renaissance of American manufacturing (2011)
  • L.M. Ghiringhelli et al. Big data of materials science: critical role of the descriptor. Phys Rev Lett (2015)
  • Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, et al. Convolutional networks on graphs for learning molecular...
  • M. Schmidt et al. Distilling free-form natural laws from experimental data. Science (2009)
  • S.H. Rudy et al. Data-driven discovery of partial differential equations. Sci Adv (2017)
  • K. Fundel et al. RelEx—Relation extraction using dependency parse trees. Bioinformatics (2007)
  • V. Tshitoyan et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature (2019)
  • S. Luo et al. High-throughput computational materials screening and discovery of optoelectronic semiconductors. Wiley Interdiscip Rev: Comput Mol Sci (2021)
  • Z. Liu et al. Computational functionality-driven design of semiconductors for optoelectronic applications. InfoMat (2020)
  • X.-G. Zhao et al. Design of lead-free inorganic halide perovskites for solar cells via cation-transmutation. J Am Chem Soc (2017)

    Xin-Gang Zhao received his Ph.D. degree in Physical Chemistry from Jilin University in 2017. Since 2018, he has been working as a research associate at the University of Colorado Boulder, USA. His research interest mainly focuses on functionalized material design and exploring the relationship between the underlying local symmetry-breaking and physical properties in the perovskite system.

    Kun Zhou obtained his B.S. degree from Jilin University in 2019. He is currently a Ph.D. candidate at the College of Materials Science and Engineering, Jilin University. His research interest mainly focuses on the design of novel photoelectric materials based on first-principles calculations combined with machine learning.

    Bangyu Xing is currently a Ph.D. candidate at the College of Materials Science and Engineering, Jilin University. His research interest includes the development of new material design methods based on first-principles calculations, high-throughput calculations, and machine learning.

    Ruoting Zhao received his Bachelor’s degree from the College of Materials Science and Engineering, Changchun University of Science and Technology in 2019. He is currently a Master’s candidate at the College of Materials Science and Engineering, Jilin University. His research interest mainly focuses on combining first principles and machine learning to design optoelectronic materials and explore structure-properties relationships.

    Yuhao Fu received his Ph.D. degree from Jilin University in 2017. During 2017–2019, he worked as a postdoctoral researcher at the University of Missouri-Columbia, USA. Currently, he is an associate professor of the College of Physics at Jilin University. His research interest mainly focuses on the development of simulation methods on material design and prediction of transport properties, and exploring microscopic transport mechanisms in functional semiconductors.

    Lijun Zhang obtained his B.S. degree from Northeast Normal University in 2003, and completed his Ph.D. degree at Jilin University in 2008. He then worked as a postdoctoral researcher at Oak Ridge National Laboratory (2008–2010) and National Renewable Energy Laboratory (2010–2013), and became a research assistant professor at the University of Colorado at Boulder (2013–2014). He is currently a Tang-Aoqing Distinguished professor of the School of Materials Science and Engineering at Jilin University. His current research interest focuses on materials by design and band structures engineering of functional semiconductors for optoelectronic applications.

    1 These authors contributed equally to this work.
