JAMIP: an artificial-intelligence aided data-driven infrastructure for computational materials informatics

doi:10.1016/j.scib.2021.06.011

Science Bulletin

Volume 66, Issue 19, 15 October 2021, Pages 1973-1985

Abstract

Materials informatics has emerged as a promising new paradigm for accelerating materials discovery and design. It applies machine learning methods to massive materials data from experiments or simulations to seek new materials, functionalities, and principles. Developing specialized facilities to generate, collect, manage, learn from, and mine large-scale materials data is crucial to materials informatics. We have developed an artificial-intelligence-aided data-driven infrastructure named the Jilin Artificial-intelligence aided Materials-design Integrated Package (JAMIP), an open-source Python framework that meets the research requirements of computational materials informatics. It integrates a materials production factory, a high-throughput first-principles calculation engine, automatic task submission and progress monitoring, a data extraction, management, and storage system, and machine-learning-based data mining functions. It also includes specific features such as an inorganic crystal structure prototype database to facilitate high-throughput calculations, and essential modules for machine learning studies of functional materials. We demonstrate how the code is useful in exploring the materials informatics of optoelectronic semiconductors, taking halide perovskites as a typical case. By following the principles of automation, extensibility, reliability, and intelligence, the JAMIP code is a promising and powerful tool for the fast-growing field of computational materials informatics.

Introduction

Materials science research is entering a new paradigm, materials informatics, which applies machine learning techniques to massive materials data [1], [2], [3], [4] to accelerate materials discovery and design. It was promoted in part by the Materials Genome Initiative [5] and emerged in practice from the rapid development of experimental materials synthesis, characterization approaches, and theoretical materials simulation, which, together with available computational resources, yield massive materials data [1]. By exploiting the predictive power of materials informatics, encouraging breakthroughs have been made, including the design of new materials, the identification of new relationships [6], [7], and the prediction of new principles [8], [9]. In both experimental and computational materials informatics, there is an urgent need to develop powerful facilities and infrastructures that meet the research requirements of this new paradigm.

Generating massive materials data on composition-structure-property relationships is the first step in materials informatics studies. In computational materials science, there are generally three techniques for generating materials data. The first is direct calculation of a single material or a few materials. This was used frequently in the old paradigm, when theorists wanted to study material properties observed experimentally or to explore new physics in previously unstudied materials. Such data are scattered across the literature and may be gathered manually or by automatic text extraction approaches [10], [11]. The second is high-throughput (HT) calculation, in which extremely large numbers of materials spanning different chemical compositions are calculated automatically based on prototype structures [12], [13]. Along this direction, a series of computational infrastructures have been developed and used in high-throughput calculations [12], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. The accumulated computational materials data have led to databases that are accessible for query and research purposes [25], [26], [27], [28], [29]. The third is crystal structure search, which seeks new stable and metastable structures at a given chemical composition and thus complements high-throughput calculations. It explores the potential energy landscape of materials, usually by combining artificial-intelligence global optimization methods with energy calculations [25], [30], [31], [32], [33], [34]. Because the object of greatest concern in structure search studies is the lowest-energy ground-state structure, the other generated materials data are usually regarded as intermediate products and are not accumulated for materials informatics studies.
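The prototype-based route can be illustrated with a minimal Python sketch. It is not taken from JAMIP: the prototype (site labels with multiplicities) and the candidate element lists are hypothetical choices made only to show how substituting elements on the sites of one prototype yields a set of candidate compositions.

```python
from itertools import product

# Hypothetical prototype: site label -> number of atoms per formula unit.
PROTOTYPE = {"A": 2, "B": 1, "X": 6}

# Hypothetical candidate elements for each site (illustrative only).
CANDIDATES = {
    "A": ["K", "Rb", "Cs"],
    "B": ["Sn", "Te"],
    "X": ["Cl", "Br", "I"],
}


def formula(assignment, prototype):
    """Build a formula string such as 'Cs2SnI6' from a site -> element mapping."""
    parts = []
    for site, count in prototype.items():
        element = assignment[site]
        parts.append(element if count == 1 else f"{element}{count}")
    return "".join(parts)


def enumerate_candidates(prototype, candidates):
    """Yield one composition for every combination of site substitutions."""
    sites = list(prototype)
    for elements in product(*(candidates[site] for site in sites)):
        yield formula(dict(zip(sites, elements)), prototype)


compositions = list(enumerate_candidates(PROTOTYPE, CANDIDATES))
print(len(compositions), compositions[:3])
```

In a real high-throughput study each composition would be mapped onto the atomic positions of the prototype structure and passed to a first-principles code; the sketch stops at the composition level.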

Machine learning techniques act as an engine that triggers and accelerates materials discovery in computational materials informatics [35], [36], [37]. Various classes of machine learning methods are already available for use in computational materials informatics, such as supervised learning, unsupervised learning, and active learning [38]. Supervised learning requires a training process on labeled data, from which the model learns the mapping between material descriptors and material properties. With such a model, the underlying physical laws connecting descriptors and material properties can be mined [39], [40]. It can also be used for the inverse design of new materials [41], for example in predicting energy-related materials for catalysis, batteries, solar cells, and gas capture [42], [43], [44]. For instance, Chen et al. [44] reported a machine learning model based on the extreme gradient boosting regression (XGBR) algorithm to predict the Gibbs free energy changes of CO adsorption in dispersed metal-nonmetal codoped graphene electrocatalysts. Unsupervised learning, by contrast, is generally used to uncover differences among materials or chemical systems in terms of unlabeled descriptors. Clustering algorithms are the most widely used for this purpose. For example, Chen et al. [45] used the K-means clustering algorithm to analyze oxygen diffusion patterns, hopping statistics, and site occupation within crystals. In addition, active learning and Bayesian optimization, in which the model refines itself as new data arrive, make it possible to search the potential materials space for high-quality materials with limited data, and contribute to experimental and computational materials design [46].
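As an illustration of clustering on unlabeled descriptors, the following is a minimal one-dimensional K-means (Lloyd's algorithm) in plain Python. It is a teaching sketch, not the procedure of Chen et al. [45]: the descriptor values and the initial centers are made-up numbers, and a production study would normally use a library implementation on multi-dimensional descriptors.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm for one-dimensional data with given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, clusters


# Made-up, unlabeled descriptor values forming two obvious groups.
descriptor = [1.0, 1.5, 2.0, 7.0, 7.5, 8.0]
centers, clusters = kmeans_1d(descriptor, centers=[0.0, 10.0])
print(centers, clusters)
```

The algorithm receives no property labels; the two groups emerge only from the distances between descriptor values, which is the sense in which the method is unsupervised.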

The software/infrastructure for computational materials informatics needs to meet at least the following requirements:

(1) Adaptability to diverse materials discovery and design. The software should flexibly handle different material systems, from inorganic and organic to organic-inorganic hybrid systems, and different material variations and combinations, such as alloys or defects, surfaces or interfaces, and heterostructures or superlattices. To accommodate the potential complexity of different materials with different properties, computational workflow modules should be highly scalable and allow flexible combinations of different types of computing tasks.

(2) Efficiency and reliability in data generation and management. A high-level automation framework to initialize, run, and analyze large-scale high-throughput calculations is essential. Monitoring computational tasks in real time and reporting or correcting potential errors are important. The software should have tools to efficiently extract and collect results and store them in a self-contained database.

(3) Orientation by materials functionality. The software should be oriented by the functionality of materials (e.g., photovoltaic, thermoelectric, ferroelectric, and catalytic), which is central to materials discovery. The workflow modules need to be designed rationally so that the functionalities of different material systems can be calculated effectively.

(4) Integration of data generation, storage, processing, and learning. It is desirable to integrate the data generation, storage, processing, and learning modules in a unified infrastructure framework, which greatly facilitates data flow and conversion as well as the efficiency of data mining. In the integrated framework, newly calculated data motivated by feedback from the data-learning procedures may accelerate knowledge accumulation.
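Requirement (2) can be sketched in a few lines of Python using only the standard library. Everything here is an assumption made for illustration and none of it is JAMIP's API: `run_task` is a stand-in for a first-principles calculation, the `steps` parameter and the "double it and retry" correction rule are hypothetical, and the stored energy is a placeholder rather than a computed result.

```python
import sqlite3


def run_task(name, params):
    """Stand-in for a first-principles calculation (hypothetical).

    It fails when the hypothetical 'steps' budget is too small, mimicking
    a run that did not converge.
    """
    if params["steps"] < 60:
        raise RuntimeError("not converged")
    return {"energy": 0.0}  # placeholder value, not a real result


def correct(params):
    """Hypothetical error handler: double the step budget before retrying."""
    return {**params, "steps": params["steps"] * 2}


def run_with_monitoring(tasks, db, max_retries=2):
    """Run each task, retry with corrected parameters on failure, store the outcome."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(name TEXT PRIMARY KEY, status TEXT, attempts INTEGER, energy REAL)"
    )
    for name, params in tasks.items():
        status, energy = "failed", None
        for attempt in range(1, max_retries + 2):
            try:
                energy = run_task(name, params)["energy"]
                status = "done"
                break
            except RuntimeError:
                params = correct(params)
        db.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (name, status, attempt, energy),
        )
    db.commit()


db = sqlite3.connect(":memory:")
tasks = {"CsPbI3": {"steps": 60}, "CsSnI3": {"steps": 20}, "CsGeI3": {"steps": 5}}
run_with_monitoring(tasks, db)
rows = db.execute(
    "SELECT name, status, attempts FROM results ORDER BY name"
).fetchall()
print(rows)
```

Recording the status and the number of attempts next to each result keeps failed runs visible in the database instead of silently dropping them, so they can be inspected or resubmitted later.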

In this paper, we report on an artificial-intelligence-aided data-driven infrastructure that we have been developing to fulfill the above requirements for computational materials informatics. It is an open-source Python-integrated framework named the Jilin Artificial-intelligence-aided Materials-design Integrated Package (JAMIP). The code integrates three tightly connected units: Data generation (e.g., high-throughput materials calculations and screening), Data collection (e.g., automatic data extraction, management, and storage), and Data learning (e.g., integrated feature engineering and machine-learning-based data mining functions). Below, we describe the JAMIP code framework and its use in materials-informatics-related research on optoelectronic semiconductors, taking halide perovskites as an example. The code and a manual describing its detailed usage options are freely available for download at www.jamip-code.com.
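To make the idea of screening in the Data generation unit concrete, here is a small Python sketch of one possible filter for ABX₃ halide perovskite candidates: charge neutrality. The source does not state which screening criteria JAMIP applies, so this criterion is an assumption chosen for illustration, and assigning a single common oxidation state to each element is a simplification.

```python
from itertools import product

# One common oxidation state per element (a simplifying assumption of this
# sketch; several of these elements can adopt other oxidation states).
OXIDATION = {
    "Rb": 1, "Cs": 1,
    "Sn": 2, "Pb": 2, "Bi": 3,
    "Br": -1, "I": -1,
}


def is_charge_neutral(a, b, x):
    """Return True if the formal charges in one ABX3 formula unit sum to zero."""
    return OXIDATION[a] + OXIDATION[b] + 3 * OXIDATION[x] == 0


# Illustrative candidate pools for the A, B, and X sites.
pool = product(["Rb", "Cs"], ["Sn", "Pb", "Bi"], ["Br", "I"])
neutral = [f"{a}{b}{x}3" for a, b, x in pool if is_charge_neutral(a, b, x)]
print(neutral)
```

With these assumed oxidation states the Bi-containing candidates are rejected, because a trivalent B-site cation cannot be balanced by one monovalent A-site cation and three halide anions.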

Code framework

The organization of JAMIP follows the data lifecycle in computational materials informatics, from data generation to collection and learning, as shown in Fig. 1. In the data generation stage, users perform large-scale high-throughput materials calculations, which proceed by obtaining structure prototypes from the JAMIP database, generating candidate structures through the materials production factory, initializing the input parameters of HT calculation tasks,

Demonstration of usage

In principle, the JAMIP code can be used for materials-informatics-related research on any functional material system whose functionality-related properties can be accurately described by first-principles calculations. In this section, we briefly describe its use in studying halide-perovskite-based semiconductors for optoelectronic applications. Halide perovskites (HPs, with formula ABX₃), as promising optoelectronic materials (e.g., photovoltaic solar cells, light-emitting

Discussion and conclusion

To summarize, we have reported the development of an artificial-intelligence-aided data-driven infrastructure named the Jilin Artificial-intelligence aided Materials-design Integrated Package (JAMIP), which is designed to meet the requirements of computational materials informatics studies. It is open-source software implemented mainly in Python. With an emphasis on automation, extensibility, reliability, and intelligence, it integrates tightly connected units

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61722403, 92061113, and 12004131) and the Interdisciplinary Research Grant for PhDs of Jilin University (101832020DJX043). Calculations were performed in part at the High-Performance Computing Center of Jilin University. We acknowledge the important contributions to the development of JAMIP code from the following group members: Yuwei Li, Zhun Liu, Dongwen Yang, Qiaoling Xu, Yawen Li, Xueting Wang, Yilin Zhang,

Xin-Gang Zhao received his Ph.D. degree in Physical Chemistry from Jilin University in 2017. Since 2018, he has been working as a research associate at the University of Colorado Boulder, USA. His research interest mainly focuses on functionalized material design and exploring the relationship between the underlying local symmetry-breaking and physical properties in the perovskite system.

References (60)

S. Curtarolo et al.
AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations
Comput Mater Sci
(2012)
S. Curtarolo et al.
AFLOW: an automatic framework for high-throughput materials discovery
Comput Mater Sci
(2012)
G. Pizzi et al.
AiiDA: automated interactive infrastructure and database for computational science
Comput Mater Sci
(2016)
K. Mathew et al.
MPInterfaces: a materials project based Python tool for high-throughput computational screening of interfacial systems
Comput Mater Sci
(2016)
X. Yang et al.
MatCloud: a high-throughput computational infrastructure for integrated management of materials simulation, data and resources
Comput Mater Sci
(2018)
W. Zhu et al.
SEHC: a high-throughput materials computing framework with automatic self-evaluation filtering
Sci Eng: B
(2020)
K. Mathew et al.
Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows
Comput Mater Sci
(2017)
G. Wang et al.
ALKEMIE: an intelligent computational platform for accelerating materials discovery and design
Comput Mater Sci
(2021)
P. Gorai et al.
TE design lab: a virtual laboratory for thermoelectric material design
Comput Mater Sci
(2016)
C.W. Glass et al.
USPEX—Evolutionary crystal structure prediction
Comput Phys Commun
(2006)
D.C. Lonie et al.
XtalOpt: an open-source evolutionary algorithm for crystal structure prediction
Comput Phys Commun
(2011)
Y. Wang et al.
CALYPSO: a method for crystal structure prediction
Comput Phys Commun
(2012)
B. Gao et al.
Interface structure prediction via CALYPSO method
Sci Bull
(2019)
Y. Liu et al.
Materials discovery and design using machine learning
J Mater
(2017)
M.A. Khan et al.
The crystal structure of indium diiodide, indium(I) tetraiodoindate(III), In[InI₄]
Inorg Chim Acta
(1985)
X.-G. Zhao et al.
Rational design of halide double perovskites for optoelectronic applications
Joule
(2018)
A. Agrawal et al.
Perspective: materials informatics and big data: realization of the “fourth paradigm” of science in materials science
APL Mater
(2016)
G.R. Schleder et al.
From DFT to machine learning: recent approaches to materials science-a review
J Phys: Mater
(2019)
C. Draxl et al.
Big data-driven materials science and its FAIR data infrastructure
Handb Mater Model
(2020)
L. Himanen et al.
Data-driven materials science: status, challenges, and perspectives
Adv Sci
(2019)
T. Kalil et al.
Materials genome initiative: a renaissance of American manufacturing
(2011)
L.M. Ghiringhelli et al.
Big data of materials science: critical role of the descriptor
Phys Rev Lett
(2015)
Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, et al. Convolutional networks on graphs for learning molecular...
M. Schmidt et al.
Distilling free-form natural laws from experimental data
Science
(2009)
S.H. Rudy et al.
Data-driven discovery of partial differential equations
Sci Adv
(2017)
K. Fundel et al.
RelEx—Relation extraction using dependency parse trees
Bioinformatics
(2007)
V. Tshitoyan et al.
Unsupervised word embeddings capture latent knowledge from materials science literature
Nature
(2019)
S. Luo et al.
High-throughput computational materials screening and discovery of optoelectronic semiconductors
Wiley Interdiscip Rev: Comput Mol Sci
(2021)
Z. Liu et al.
Computational functionality-driven design of semiconductors for optoelectronic applications
InfoMat
(2020)
X.-G. Zhao et al.
Design of lead-free inorganic halide perovskites for solar cells via cation-transmutation
J Am Chem Soc
(2017)

Kun Zhou obtained his B.S. degree from Jilin University in 2019. He is currently a Ph.D. candidate at the College of Materials Science and Engineering, Jilin University. His research interest mainly focuses on the design of novel photoelectric materials based on first-principles calculations combined with machine learning.

Bangyu Xing is currently a Ph.D. candidate at the College of Materials Science and Engineering, Jilin University. His research interest includes the development of new material design methods based on first-principles calculations, high-throughput calculations, and machine learning.

Ruoting Zhao received his Bachelor’s degree from the College of Materials Science and Engineering, Changchun University of Science and Technology in 2019. He is currently a Master’s candidate at the College of Materials Science and Engineering, Jilin University. His research interest mainly focuses on combining first-principles calculations and machine learning to design optoelectronic materials and explore structure-property relationships.

Yuhao Fu received his Ph.D. degree from Jilin University in 2017. During 2017–2019, he worked as a postdoctoral researcher at the University of Missouri-Columbia, USA. Currently, he is an associate professor at the College of Physics, Jilin University. His research interest mainly focuses on the development of simulation methods for materials design and the prediction of transport properties, and on exploring microscopic transport mechanisms in functional semiconductors.

Lijun Zhang obtained his B.S. degree from Northeast Normal University in 2003, and completed his Ph.D. degree at Jilin University in 2008. He then worked as a postdoctoral researcher at Oak Ridge National Laboratory (2008–2010) and National Renewable Energy Laboratory (2010–2013), and became a research assistant professor at the University of Colorado at Boulder (2013–2014). He is currently a Tang-Aoqing Distinguished professor of the School of Materials Science and Engineering at Jilin University. His current research interest focuses on materials by design and band structures engineering of functional semiconductors for optoelectronic applications.

¹: These authors contributed equally to this work.
