JAMIP: an artificial-intelligence aided data-driven infrastructure for computational materials informatics
Introduction
Currently, materials science research is entering a new paradigm featuring materials informatics, which applies machine learning techniques to massive materials data [1], [2], [3], [4] to accelerate materials discovery and design. It was first promoted in part by the Materials Genome Initiative [5] and emerged in practice as the result of the fast development of experimental materials synthesis, characterization approaches, and theoretical materials simulation through available computation resources that yield massive materials data [1]. By utilizing the recognized predictive power of materials informatics, encouraging breakthroughs have been made, including the design of new materials, the identification of new relationships [6], [7], and the prediction of new principles [8], [9]. In the fields of both experimental and computational materials informatics, there is an urgent need to develop powerful facilities/infrastructures to meet research requirements under this new paradigm.
Generating massive data on composition-structure-property relationships is the first step in materials informatics studies. In computational materials science, there are generally three techniques for generating materials data. The first is direct calculation of a single material or a few materials. This was used frequently in the old paradigm, when theorists wanted to study material properties observed experimentally or to explore new physics in previously unstudied materials. Such data are scattered across the literature and may be gathered manually or by automatic text extraction approaches [10], [11]. The second is high-throughput (HT) calculation, where extremely large numbers of materials spanning different chemical compositions are automatically calculated based on prototype structures [12], [13]. Along this direction, a series of computational infrastructures have been developed and used in high-throughput calculations [12], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. The accumulated computational materials data have led to the formation of databases that are accessible for query and research purposes [25], [26], [27], [28], [29]. The third is crystal structure search, which seeks new stable and metastable structures at a given chemical composition and thus complements high-throughput calculations. It explores the potential energy surface of materials, usually by combining artificial-intelligence-based global optimization methods with energy calculations [25], [30], [31], [32], [33], [34]. Because the object of greatest concern in structure search studies is the lowest-energy ground-state structure, the other generated data are usually regarded as intermediate products and are not accumulated for materials informatics studies.
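The prototype-based enumeration behind high-throughput calculations can be illustrated with a minimal sketch. It assumes an ABX3 prototype and a small, hypothetical table of candidate elements for each site; neither the element lists nor the function name come from JAMIP itself.

```python
from itertools import product

# Hypothetical substitution table for an ABX3 prototype (illustrative only,
# not taken from the JAMIP database).
SITE_CHOICES = {
    "A": ["Cs", "Rb", "K"],
    "B": ["Pb", "Sn", "Ge"],
    "X": ["Cl", "Br", "I"],
}


def enumerate_candidates(site_choices):
    """Return one formula string per combination of site occupants."""
    sites = sorted(site_choices)  # fixed site order: A, B, X
    candidates = []
    for a, b, x in product(*(site_choices[s] for s in sites)):
        candidates.append(f"{a}{b}{x}3")
    return candidates


candidates = enumerate_candidates(SITE_CHOICES)
print(len(candidates), candidates[:3])
```

Each resulting formula would, in a real workflow, be mapped onto the prototype structure and submitted as one calculation task. The number of candidates grows as the product of the per-site choices, which is why automation becomes essential.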
Machine learning techniques act as the engine that triggers and accelerates materials discovery in computational materials informatics [35], [36], [37]. Various classes of machine learning methods are already available for use in computational materials informatics, such as supervised learning, unsupervised learning, and active learning [38]. Supervised learning requires a training process on labeled data, through which the model learns the mapping between material descriptors and material properties. Using such a model, the underlying physical laws connecting descriptors and material properties can be mined [39], [40]. Additionally, it can be used for the inverse design of new materials [41], such as predicting energy-related materials for catalysis, batteries, solar cells, and gas capture [42], [43], [44]. For instance, Chen et al. [44] reported a machine learning model based on the extreme gradient boosting regression (XGBR) algorithm to predict the Gibbs free energy changes of CO adsorption in dispersed metal-nonmetal codoped graphene electrocatalysts. Unsupervised learning, by contrast, is generally used to uncover differences among materials or chemical systems in terms of unlabeled “descriptors”. Clustering algorithms are the most widely used for this purpose. For example, Chen et al. [45] used the K-means clustering algorithm to analyze the oxygen diffusion pattern, hopping statistics, and site occupation within crystals. In addition, active learning and Bayesian optimization, with their self-optimizing models, allow the potential materials space to be searched for high-quality materials from a limited amount of data, and contribute to both experimental and computational materials design [46].
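As an illustration of the unsupervised case, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) written with the Python standard library only. The two-dimensional descriptor values and the choice of initial centroids are made-up toy inputs, not data from any cited study.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm on tuples of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])),
            )
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Toy unlabeled descriptors (made-up values) forming two well-separated groups.
descriptors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(descriptors, [descriptors[0], descriptors[3]])
print(labels)
```

The initial centroids are passed in explicitly so that the result is deterministic. Production code would normally rely on a tested library implementation with a more careful initialization.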
The software/infrastructure for computational materials informatics needs to meet at least the following requirements: (1) Adaptability for diverse materials discovery and design: the software should flexibly deal with different material systems, from inorganic and organic to organic-inorganic hybrid systems, and with different material variations/combinations, such as alloys or defects, surfaces or interfaces, and heterostructures or superlattices. To accommodate the potential complexity of different materials with different properties, the computational workflow modules should be highly scalable and allow flexible combinations of different types of computing tasks. (2) Efficiency and reliability in data generation and management: a high-level automation framework to initialize, run, and analyze large-scale high-throughput calculations is essential. Monitoring computational tasks in real time and reporting/correcting potential errors are important. The software should have tools to efficiently extract and collect results and store them in a self-contained database. (3) Orientation by materials functionality: the software should be oriented by the functionality of materials (e.g., photovoltaic, thermoelectric, ferroelectric, and catalytic), which is central to materials discovery. The workflow modules need to be designed rationally so that the functionalities of different material systems can be calculated effectively. (4) Integration of data generation, storage, processing, and learning: it is desirable to integrate the data generation, storage, processing, and learning modules within a unified infrastructure framework, which greatly facilitates data flow and conversion, as well as the efficiency of data mining. In the integrated framework, new calculations motivated by feedback from the data-learning procedures may accelerate knowledge accumulation.
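Requirement (2) can be sketched in a few lines: tasks are run, a failed task is retried, and the outcomes are collected in a self-contained database. This is a hedged illustration only. The function names, the table layout, and the simulated failure are all hypothetical, the stored number is a placeholder rather than a physical result, and nothing here reflects the actual JAMIP implementation.

```python
import sqlite3


def fake_calculation(formula, attempt):
    """Stand-in for a first-principles job (hypothetical).

    Fails on the first attempt for one input to mimic a convergence error.
    """
    if formula == "CsSnI3" and attempt == 0:
        raise RuntimeError("SCF not converged")
    # Placeholder number, not a physical result.
    return {"formula": formula, "value": float(len(formula))}


def run_with_retry(formula, max_attempts=2):
    """Run a task, retrying on failure; return (result or None, attempts used)."""
    for attempt in range(max_attempts):
        try:
            return fake_calculation(formula, attempt), attempt + 1
        except RuntimeError:
            continue  # a real workflow would adjust input parameters here
    return None, max_attempts


# In-memory SQLite database standing in for a self-contained results store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "formula TEXT PRIMARY KEY, value REAL, attempts INTEGER, status TEXT)"
)
for formula in ["CsPbI3", "CsSnI3"]:
    result, attempts = run_with_retry(formula)
    status = "done" if result else "failed"
    value = result["value"] if result else None
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)", (formula, value, attempts, status)
    )
conn.commit()

rows = conn.execute(
    "SELECT formula, attempts, status FROM results ORDER BY formula"
).fetchall()
print(rows)
```

Recording the number of attempts and a status next to each result keeps the provenance of every entry queryable, which matters once the database feeds a learning step.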
In this paper, we report on an artificial-intelligence-aided data-driven infrastructure that we have been developing to fulfill the above requirements for computational materials informatics. It is an open-source Python-integrated framework named the Jilin Artificial-intelligence-aided Materials-design Integrated Package (JAMIP). The code consists of tightly connected units for Data generation (e.g., high-throughput materials calculations and screening), Data collection (e.g., automatic data extraction, management, and storage), and Data learning (e.g., integrated feature engineering and machine-learning-based data mining functions). Below, we describe the JAMIP code framework and its usage in carrying out materials-informatics-related research on optoelectronic semiconductors, taking halide perovskites as examples. The code and a manual describing its detailed usage options are freely available for download at www.jamip-code.com.
Code framework
The organization of JAMIP abides by the data lifecycle in the research field of computational materials informatics, from data generation to collection and learning, as shown in Fig. 1. In the data generation stage, users perform the large-scale high-throughput materials calculations, which are done through obtaining structure prototypes from the JAMIP database, generating candidate structures through the materials production factory, initializing the input parameters of HT calculation tasks,
Demonstration of usage
In principle, the JAMIP code can be used to perform materials-informatics-related research on any functional material system whose functionality-related properties can be accurately described by first-principles calculations. In this section, we briefly describe its use in studying halide perovskite-based semiconductors for optoelectronic applications. Halide perovskites (HPs, formula as ABX3), as promising optoelectronic materials (e.g., photovoltaic solar cells, light-emitting
Discussion and conclusion
To summarize, we have reported the development of an artificial-intelligence-aided data-driven infrastructure named Jilin Artificial-intelligence aided Materials-design Integrated Package (JAMIP), which is designed purposely to meet the requirements of the studies of computational materials informatics. It is an open-source software implemented mainly in Python language. With the emphasis on automation, extensibility, reliability, and intelligence, it is integrated by intimately connected units
Conflict of interest
The authors declare that they have no conflict of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61722403, 92061113, and 12004131) and the Interdisciplinary Research Grant for PhDs of Jilin University (101832020DJX043). Calculations were performed in part at the High-Performance Computing Center of Jilin University. We acknowledge the important contributions to the development of JAMIP code from the following group members: Yuwei Li, Zhun Liu, Dongwen Yang, Qiaoling Xu, Yawen Li, Xueting Wang, Yilin Zhang,
References (60)
- et al. AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput Mater Sci (2012)
- et al. AFLOW: an automatic framework for high-throughput materials discovery. Comput Mater Sci (2012)
- et al. AiiDA: automated interactive infrastructure and database for computational science. Comput Mater Sci (2016)
- et al. MPInterfaces: a materials project based Python tool for high-throughput computational screening of interfacial systems. Comput Mater Sci (2016)
- et al. MatCloud: a high-throughput computational infrastructure for integrated management of materials simulation, data and resources. Comput Mater Sci (2018)
- et al. SEHC: a high-throughput materials computing framework with automatic self-evaluation filtering. Sci Eng: B (2020)
- et al. Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput Mater Sci (2017)
- et al. ALKEMIE: an intelligent computational platform for accelerating materials discovery and design. Comput Mater Sci (2021)
- et al. TE design lab: a virtual laboratory for thermoelectric material design. Comput Mater Sci (2016)
- et al. USPEX—Evolutionary crystal structure prediction. Comput Phys Commun (2006)
- XtalOpt: an open-source evolutionary algorithm for crystal structure prediction. Comput Phys Commun
- CALYPSO: a method for crystal structure prediction. Comput Phys Commun
- Interface structure prediction via CALYPSO method. Sci Bull
- Materials discovery and design using machine learning. J Mater
- The crystal structure of indium diiodide, indium(I) tetraiodoindate(III), In[InI4]. Inorg Chim Acta
- Rational design of halide double perovskites for optoelectronic applications. Joule
- Perspective: materials informatics and big data: realization of the “fourth paradigm” of science in materials science. APL Mater
- From DFT to machine learning: recent approaches to materials science-a review. J Phys: Mater
- Big data-driven materials science and its FAIR data infrastructure. Handb Mater Model
- Data-driven materials science: status, challenges, and perspectives. Adv Sci
- Materials genome initiative: a renaissance of American manufacturing
- Big data of materials science: critical role of the descriptor. Phys Rev Lett
- Distilling free-form natural laws from experimental data. Science
- Data-driven discovery of partial differential equations. Sci Adv
- RelEx—Relation extraction using dependency parse trees. Bioinformatics
- Unsupervised word embeddings capture latent knowledge from materials science literature. Nature
- High-throughput computational materials screening and discovery of optoelectronic semiconductors. Wiley Interdiscip Rev: Comput Mol Sci
- Computational functionality-driven design of semiconductors for optoelectronic applications. InfoMat
- Design of lead-free inorganic halide perovskites for solar cells via cation-transmutation. J Am Chem Soc
Xin-Gang Zhao received his Ph.D. degree in Physical Chemistry from Jilin University in 2017. Since 2018, he has been working as a research associate at the University of Colorado Boulder, USA. His research interest mainly focuses on functionalized material design and exploring the relationship between the underlying local symmetry-breaking and physical properties in the perovskite system.
Kun Zhou obtained his B.S. degree from Jilin University in 2019. He is currently a Ph.D. candidate at the College of Materials Science and Engineering, Jilin University. His research interest mainly focuses on the design of novel photoelectric materials based on first-principles calculations combined with machine learning.
Bangyu Xing is currently a Ph.D. candidate at the College of Materials Science and Engineering, Jilin University. His research interest includes the development of new material design methods based on first-principles calculations, high-throughput calculations, and machine learning.
Ruoting Zhao received his Bachelor’s degree from the College of Materials Science and Engineering, Changchun University of Science and Technology in 2019. He is currently a Master’s candidate at the College of Materials Science and Engineering, Jilin University. His research interest mainly focuses on combining first principles and machine learning to design optoelectronic materials and explore structure-properties relationships.
Yuhao Fu received his Ph.D. degree from Jilin University in 2017. During 2017–2019, he worked as a postdoctoral researcher at the University of Missouri-Columbia, USA. Currently, he is an associate professor of the College of Physics at Jilin University. His research interest mainly focuses on the development of simulation methods on material design and prediction of transport properties, and exploring microscopic transport mechanisms in functional semiconductors.
Lijun Zhang obtained his B.S. degree from Northeast Normal University in 2003, and completed his Ph.D. degree at Jilin University in 2008. He then worked as a postdoctoral researcher at Oak Ridge National Laboratory (2008–2010) and National Renewable Energy Laboratory (2010–2013), and became a research assistant professor at the University of Colorado at Boulder (2013–2014). He is currently a Tang-Aoqing Distinguished professor of the School of Materials Science and Engineering at Jilin University. His current research interest focuses on materials by design and band structures engineering of functional semiconductors for optoelectronic applications.
1 These authors contributed equally to this work.