Introduction

Dielectric materials are among the most vital components for microelectronic device manufacturing. They are used in memory devices, capacitor-based energy storage, field-effect transistors, etc1,2,3. The dielectric constant (denoted here as ϵ), more commonly referred to as the relative permittivity, is the factor by which the electric field strength decreases inside a material compared to the vacuum when it is placed near a finite electric charge. The ϵ values of commonly used dielectric materials range between 20 and 301,4,5—for example, Ta2O5 (ϵ ~ 23–27, Eg = 4.2 eV)1,2,6,7 and TiO2 (ϵ = 27, Eg = 3.5 eV)1,2,8. There is a high demand to find novel materials with high ϵ to increase the device performance and reliability. Typically, ϵ and Eg are inversely related2,9 in a compound. As a result, although several materials are reported to have even larger ϵ values, they often have a small Eg9,10,11,12, making the dielectric vulnerable to leakage currents under exposure to large electric fields1,2. Therefore, compounds with high ϵ and large band gaps are preferred while designing charge storage applications and microelectronic devices.

One of the methods to find high-ϵ compounds is to calculate the dielectric constants and band gaps of a large number of compounds that are available in large materials databases such as the Open Quantum Materials Database (OQMD)13,14, Materials Project (MP)15, etc using ab initio methods such as density functional theory (DFT). However, since the accurate calculation of dielectric properties using density functional perturbation theory16 (DFPT) is computationally very expensive, it would be practically unfeasible to estimate the dielectric constants of tens of thousands of materials available in those databases using high-throughput methods. In this work, we employ an advanced screening strategy to identify compounds with better dielectric properties. Thus, the goal of this work is to find dielectric materials with large values for both ϵ and Eg by screening materials databases but at the expense of conducting as few DFPT calculations as possible. To accomplish this task, we have employed a materials design strategy comprised of statistical optimization models and DFPT calculations on a small set of compounds. While our training set consists of a small amount of data (dielectric constants) from the MP, the search-space contains a vast set of compounds available in the OQMD.

Several online data repositories exist today that are dedicated to hosting large sets of open-sourced inorganic crystal structure data generated from high-throughput (HT) DFT calculations such as the MP15, OQMD13,14, and AFLOWLib17 among others18,19. The design and discovery of novel materials using statistical modeling has become an active research area20,21,22 in recent times, largely attributed to the availability of such HT datasets. Recently, multiple studies have reported HT-generation of dielectric data and subsequent analysis9,23,24. For example, Morita et al. reported25 machine learning modeling of data from MP11,12,15 to assess the reliability of the theoretical models currently available to describe the dielectric properties of crystals.

In this work, we use the MP dataset of 1864 dielectric tensors11,12 to train statistical models and subsequently identify dielectrics from the set of stable materials in the OQMD. Thus the MP data forms the training-data and the set of materials from OQMD forms the search-space for the materials design. This work is a successful demonstration of the scenario where the data obtained from multiple sources can be utilized to discover new compounds. The negligible difference found between the representation vectors, which are also called as feature vectors in machine learning, generated for equivalent materials in MP and OQMD made the cross-database design possible in this work. Overall, we conducted three design cycles which required us to perform dielectric calculations for just 17 materials using DFPT. We report the dielectric constant values of all the 17 materials among which three of them (HoClO, Eu5SiCl6O4, and Tl3PbBr5) have very large ϵ (69 < ϵ < 101) and Eg (2.9 eV < Eg < 5.5 eV) values making them part of the Pareto front of the known data, and four other materials (Sr2LuBiO6, Bi5IO7, Bi3ClO4, and Bi3BrO4) have moderately large ϵ (20 < ϵ < 40) and Eg (2.3 eV < Eg < 2.7 eV) values.

Results

Materials design strategy

Our objective is to find large band gap materials with optimal dielectric constants. Since the dielectric tensor of a compound has nine components, the optimization of all nine components leads to a nine-objective optimization problem which is difficult to solve with training-data of size ~2000. Thus, we specifically optimize the largest eigenvalue of the dielectric tensor, referred to from here onward as ϵ, via statistical modeling through the materials design workflow, as depicted in Fig. 1. The workflow is similar to the strategies that have been previously reported in literature26,27, where each design cycle consists of three steps—data processing, statistical modeling, and ab initio DFPT calculations. The largest eigenvalue of the total dielectric tensor is chosen as the property to be optimized because that is the highest possible dielectric behavior from a single crystal when it is aligned perfectly along the corresponding direction between two metallic plates. The total dielectric tensor is calculated as the sum of ionic and electronic dielectric tensors. The good agreement between dielectric tensor eigenvalues obtained from MP’s DFPT HT framework and experimentally measured dielectric constant values was reported by Petousis et al.28. We preferred the largest eigenvalue over the average of eigenvalues because the latter value may severely underestimate the highest possible dielectric behavior from a single crystal (Supplementary Fig. 1), even though it is a popular choice to estimate the polycrystalline dielectric constant12,28. The new data produced from DFPT calculations at the end of each cycle is fed into the next design cycle. In the first step, we collected the relevant data from the MP database (training-data) and OQMD (search-space). All materials in the training-data have a known value for ϵ and Eg, while the materials in the search-space have known values of Eg but their ϵ values are unknown. In the second step, Modeling, we created an ensemble of artificial neural network (ANN)29 models, fit on the training-data, which learn to predict the ϵ value of materials when their crystal structures and Eg values are known. Using this ANN ensemble, we predicted the ϵ of each material in the search-space. Since the prediction was done from an ensemble, the results were a distribution of ϵ values for each material, contrary to the usage of a single ANN model where a single prediction value is obtained. The trained ANN ensemble was used to predict the ϵ-distributions of 11,102 stable non-metallic materials in the search-space, obtained from the OQMD.

Fig. 1: Materials design workflow used in this work.
figure 1

The three parts of the design workflow shown here together complete a single design cycle. The newly computed DFPT results from a design cycle are fed back into the training-data for the next design cycle. Since the dielectric tensor of a crystal is of shape 3 × 3, optimizing all nine components of the dielectric tensor leads to a nine-objective optimization problem. Thus, we specifically optimize the largest value of dielectric constant (ϵ) among all crystallographic directions, also referred to as the target property hereinafter. This scalar value is quantified as the largest eigenvalue of the total dielectric tensor. The training-data in this work consist of nearly 2000 compounds from the MP for which ϵ and Eg were known and the training-data came from the OQMD, for which only Eg were known at the beginning of this work.

Further, the predicted distribution of ϵ was input into the Efficient Global Optimization (EGO)26 algorithm. EGO takes into account the distribution’s mean and standard deviation to rank the materials in search-space based on their potential to increase the chances of finding high-ϵ materials in this workflow within as few design cycles as possible. In this work, the optimization in dielectrics refers to the identification of dielectrics with large ϵ values. The reason for employing an EGO algorithm to explore the search-space is to account for the uncertainty in ANN model predictions when the available training-data may not have sampled the material space uniformly. The advantages of EGO-based optimization in materials design were first reported and benchmarked by Balachandran et al.26,30,31. In this work, we used the EGO algorithm to select the best candidates that are either predicted to have a high ϵ value or have a large uncertainty in their ANN-ensemble predictions. Materials that belong to the latter category are from the regions of materials yet to be sampled by the training-data. The DFPT characterization of such materials is expected to increase the reliability of ANN-ensemble predictions after each design cycle and eventually lead to better optimization of dielectrics during the course of this work.

The metric that is used to rank the materials is called expected improvement, or E(I). More details on how the E(I) is calculated, are provided in the “Methods” section. A few (5–6) materials were selected in this step with the highest values of E(I) and carried onto the next step—DFPT calculations. In this final step, the dielectric tensors of the selected materials were calculated using DFPT calculations. If DFPT results show that any of the materials have a high value of Eg and ϵ, we stop the design workflow at that point. Otherwise, a new design cycle is started after transferring the newly computed ϵ values and the corresponding materials to the training-data from the search-space. With an increased size of training-data, the ANN ensemble is expected to have less uncertainty in ϵ predictions in the new design cycle. The design cycle was repeated with feedback three times in total in this work until three materials with very large values for Eg and ϵ were found.

Data

A dataset containing information about crystal structures, chemical compositions, band gap energy values, and dielectric tensors of 1864 stable materials was obtained from the MP11,12,15 data repository. This dataset was used to generate the training-data. The target property, ϵ, was obtained for each material in this database from its calculated dielectric tensor. Another dataset consisting of 11,102 stable, non-metallic materials containing information about crystal structures, chemical compositions, and band gap energy values was obtained from OQMD13,14. This OQMD dataset was used to generate the search-space in which the search to find dielectrics was conducted. The dielectric tensor data of all crystals included in the search-space were unknown at the beginning of this work.

The materials need to be represented as vectors of uniform length in order to be input into a statistical model. We generated the material representations using the Magpie32 crystal property generator tool. Magpie generates a set of physical features (such as the mean electronegativity of constituent atoms, average coordination number inside the unit cell, etc.) from a given chemical composition and crystal structure. Within Magpie, the crystal’s structure-related features are generated by building Voronoi tessellations inside the crystal and finding the nearest neighbors of each individual atom33. Magpie generated 271 input features that include 145 composition-based, and 126 structure-based features to represent each material. In addition to these, the material’s DFT Eg value was also added as an extra feature to the representation vector since it is already known for all materials in both MP and OQMD datasets. The addition of Eg increased the size of the representation vector to 272, which was generated for each material in training-data and search-space. The input feature-vector size was further reduced to 100 using the widely-used feature reduction techniques such as principal component analysis and model-based selection, implemented in the Scikit-learn python library34. The set of material representation vectors of training-data and the search-space, in addition to the target values associated with the training-data, completes the first step of materials design as depicted in Fig. 1. The size of the training-dataset increases after each design cycle as a result of conducting DFPT calculations on new materials from the search-space.

Statistical modeling utilizing data from multiple computational material databases is prone to errors arising from the differences in the DFT parameters used at each database’s high-throughput calculation strategy. Here, we have investigated the difference in Magpie-generated features for equivalent materials in OQMD and MP, cross-referenced based on their associated Inorganic Crystal Structure Database35 (ICSD) Collection Codes. In total, 1717 out of 1864 materials in training-data had an ICSD Collection Code associated with them. The crystal structures from OQMD corresponding to all the 1717 ICSD materials were obtained, and their Magpie-generated features were compared against that of the structures obtained from MP as a part of the training-data. The results, as plotted in Fig. 2a, show negligible (≤2%) relative difference in 263 out of a total of 271 Magpie features, while the other eight features have low relative differences (≤7%). All 145 composition-based features are computed to be identical across the databases, as expected. The finite difference in some of the structure-based features originates because of the difference in the accuracy of crystal structural minimization across databases. Band gap, which joins the Magpie features to form the final material representation vector, was also compared between OQMD and MP for the 1717 equivalent materials, as shown in Fig. 2b. Band gap values showed a mean and median absolute deviation of 0.1 eV and 0.0 eV respectively, pointing toward a negligible difference between the calculations of band gap for materials included in the training-data across OQMD and MP. Overall, the materials representation vector considered in this design is generated in a cross-comparable manner across OQMD and MP structures with very low errors.

Fig. 2: Comparison of material representation vectors between the OQMD and MP structures.
figure 2

The difference in material representation vectors of the structures obtained from OQMD and MP for 1717 materials. a Mean absolute difference in the Magpie-generated representational feature vectors on structures obtained from the MP and OQMD for 1717 materials in the training-data. Crystal structures of all 1717 materials were first obtained from MP as a part of generating training-data, and further cross-referenced to find their equivalent structures in OQMD based on their ICSD Collection Codes. ICSD Collection Codes were not available for the rest of the 143 materials in the MP training-data. b Difference in band gap values of 1717 materials from training-data that have a corresponding structure entry in MP and OQMD, which are cross-referenced based on their ICSD Collection Codes.

The ϵ values in the training-data obtained from MP are predominantly concentrated in the range of 0 to 25, making it difficult to model the data reliably for materials with large ϵ due to a possible bias toward smaller values. Less than 5% of the materials in the training-data have ϵ > 50. The median of ϵ values in the MP dataset is 12.2 while the mean and standard deviation are 20.2 and 42.8 respectively. The distribution of ϵ in training-data is shown in Supplementary Fig. 2. The large spread of ϵ values is decreased upon a log-scale transformation, as shown in Fig. 3a. A smaller spread of target values helps stabilize the machine learning model during the training by reducing the probability of excessive changes in internal parameters, such as the weights in an ANN. We also analyzed the correlation between ϵ and Eg values for the materials in the training-data, and it is given in Supplementary Fig. 3.

Fig. 3: Results from statistical modeling.
figure 3

a ANN model validation on a test set of 373 materials split from the training-data. We used an ensemble of ANNs to predict a distribution of values for each material. This particular model-fit plot is taken from a single ANN model that was part of the ensemble in design cycle 2. The 373 materials plotted here were not seen by this particular ANN model at any stage during the training. These predictions are made only for this particular ANN model to show its learning capabilities, and it is not part of the design workflow that we created. In the design workflow, each ANN model in the ensemble is exposed only to a unique subset of the full MP training-data, excluding 373 randomly chosen materials. Further, in the design workflow, this trained ANN model is used to predict the dielectric values of only the search-space materials from OQMD, not the 373 unseen materials from the MP dataset. The model was trained to predict \({\log }_{2}(\epsilon )\) because the ϵ values were highly non-uniform in the training-data with most of the values below 25, making some of the very large values outliers. A log-scale transformation of ϵ reduced the numerical difference between the largest ϵ value and the median, making the former less of an outlier in ANN modeling. The model fit shown in this plot has an R2 score of 70%, and a Spearman's rank correlation of 85%. b This plot shows the predicted ϵ-distributions and corresponding E(I) values on the same test dataset consisting of 373 materials split from the training-data. The error bars represent the standard deviation in ANN-ensemble predictions which is quantified as the uncertainty of ANN modeling. For a clearer perspective, the radius and color of the circles represent the same quantity—the expected improvement, E(I), value calculated using the EGO algorithm. A point without an outer circle around it represents a material with a negligible (<10−3) value for E(I). In this figure, only 25 materials have an E(I) value that is greater than 10−3.

The original dataset downloaded from MP listed BeO (MP ID: mp-1794) as having large ab initio computed values for ϵ(=312) and Eg(=8.2 eV). This large value of ϵ is possibly caused by the improper relaxation of the primitive cell of BeO in MP that leads to a large volume change. Hence, the succeeding calculations on this compound such as DFPT may be incorrect. We conducted a separate DFT cell-relaxation and DFPT calculation for BeO using VASP starting with the MP’s initial structure and find that the computed ϵ value for the correctly relaxed structure is 4—well in agreement with the previously reported values in literature36. This compound was removed from the training-data before proceeding further. We looked up other materials in training-data with very high ϵ and smaller Eg individually and confirmed that they did not have a large cell-volume change upon relaxation in MP.

Statistical modeling

The predictions from trained machine learning models, such as ANNs, are often prone to errors arising from the insufficient sampling of material space by training-data. We needed to quantify the uncertainty associated with the ϵ value predictions even though the available ANN algorithms explicitly do not provide that value from a single ANN model. So we created an ensemble of ANNs, each of which was trained on a randomly chosen subset of the training-data, and has different architectures and internal parameters. An ANN ensemble containing 2000 independent ANN models was created and trained at each design cycle. Each ANN in the ensemble predicted a single ϵ value upon inputting a material-representation vector, resulting in a distribution of 2000 predicted ϵ values for each material in the search-space. The standard deviation of each of the predicted ϵ-distribution was defined as the uncertainty of ANN modeling for the corresponding material.

Further, a statistical single-objective optimization algorithm, called EGO26,37,38,39,40, was used in this work to evaluate the ϵ-distribution and quantify a measure of probable optimization associated with each material in the search-space. EGO is not a method to model the data and predict ϵ. Instead, EGO is an algorithm to select the best candidates from a given search-space, based on their ϵ-distributions predicted by the ANN ensemble, in order to discover as many high-ϵ materials from as few design cycles as possible. Here, the desired optimization is the maximization of ϵ among all the materials in the search-space. The quantified measure of predicted optimization in EGO is called expected improvement, denoted as E(I). Conceptually, the E(I) of a material in search-space is the quantified probability with which a DFPT calculation of ϵ for that material will lead to the identification of high-ϵ material in the design workflow within as few design cycles as possible. Figure 3a shows the results from an ANN model validation as a part of model training during the second design cycle. The values of E(I) computed for the same validation data split from the training-data are shown in Fig. 3b. A simplified illustration of E(I) with the help of an example is given below.

Example illustration of E(I)

Suppose the predicted ϵ-distribution belonging to a material M1 in the search-space has a large standard deviation. Then it is highly probable that the material M1 belongs to a part of the material representation vector space which was not sampled very well in the training set. Computing the ϵ of M1 using DFPT and feeding back that information to the training-data will lead to better ANN modeling in the subsequent design cycles. Thus, M1 will have a large value of E(I). Now consider another material M2 in search-space with a large mean and a small standard deviation for its predicted ϵ-distribution. The material M2 belongs to a part of the material representation vector space that was sufficiently sampled by the training-data. So it is highly probable that M2 will turn out to be a high-ϵ material upon DFPT calculations. Because of that, M2 will also have a large value of E(I).

In EGO, the calculation of E(I) for a general optimization problem proceeds as follows (also shown in Fig. 4).

Fig. 4: The optimization algorithm.
figure 4

The value \({y}_{t}^{{\rm{max}}}\) represents the currently available highest value of ϵ among all materials in the training-data. μ and σ represent the mean and standard deviation of the ANN-ensemble predicted distribution of ϵ for a material (blue dot) in the search-space. Within the predicted distribution, which is assumed as a normalized Gaussian function here, the region above \({y}_{t}^{{\rm{max}}}\) represents the region of improvement—as shown in green. If the ab initio DFPT calculation determines that the material's ϵ value exists within the green-shaded region, it will be considered as an improvement over the current best value \({y}_{t}^{{\rm{max}}}\) in training-data.

Let Y be the target property to be maximized and φ(Y) be the predicted distribution of Y for a given search-space material. The value, φ(Y = y) is the probability when the value of Y is y. The largest value of the target property in the training-data is denoted as \({y}_{t}^{{\rm{max}}}\). The EGO algorithm, as formulated by Jones et al.38, computes the expected improvement, E(I), as:

$$E(I)=\int\nolimits_{{y}_{t}^{{\rm{max}}}}^{\infty }(y-{y}_{t}^{{\rm{max}}})\ \varphi (Y=y)\ dy$$
(1)

As mentioned in Balachandran et al.26, if the predicted distribution is approximated as a normal (i.e., Gaussian) distribution with a mean μ and a standard deviation σ, the above equation can be re-written as:

$$E(I)=\sigma [\phi (z)+z{{\Phi }}(z)]$$
(2)

where, \(z=\frac{\mu -{y}_{t}^{{\rm{max}}}}{\sigma }\), ϕ is the probability density function, and Φ is the cumulative distribution function38 of the normal distribution, φ(Y).

For dielectric design, Y is the dielectric constant (ϵ) of a candidate material, and \({y}_{t}^{{\rm{max}}}\) is the highest value of ϵ in the training-data obtained from DFPT calculations. In the MP dataset, the largest ϵ value is for TiO2 with ϵ = 988 and Eg = 1.8 eV. But our goal in this work is to find materials with large ϵ’s, not necessarily higher than 988 as long as the Eg’s are greater than 1.8 eV. Thus the \({y}_{t}^{{\rm{max}}}\) in this work was set at 100.0 for all design cycles, instead of setting it at 988.0, to consider the search-space materials whose ϵ values are predicted to be sufficiently high. The φ(Y) is approximated to be a normal distribution with the same mean, μ, and standard deviation, σ, as that of the original ϵ-distribution predicted by the ANN ensemble for each search-space material.

Design cycles with feedback

The ϵ values of a few materials selected from the statistical modeling are computed from DFPT calculations, as shown in the final segment of a design cycle in Fig. 1. The results from the DFPT calculations are used to determine whether to conduct any further design cycles. In this work, we conducted the design cycles until at least one high-ϵ dielectric with a large Eg is identified. When no such materials are found during a design cycle, all the selected materials along with their newly DFPT-estimated ϵ values are transferred from search-space to training-data, resulting in a feedback of information prior to the beginning of the next design cycle. The feedback is one of the most crucial parts of our material design workflow because it results in a better sampling of material representation vector space by training-data and thus, more reliable ANN model predictions during the next design cycle. The advantage of the feedback mechanism is prominent during the quantification of uncertainty which is used directly by the EGO algorithm to identify the best candidates for the next set of DFPT calculations. After the end of a design cycle, the uncertainty on predicting the ϵ values is decreased for the set of materials which are similar to the materials whose ϵ values were calculated using DFPT in the given cycle.

In addition to the feedback mechanism, another factor that influenced the candidate selection in the design workflow is the minimum cutoff imposed on the band gap values of materials when they are included in the search-space. The reason for implementing a cutoff is to externally introduce a character of multi-objective optimization in this work. Without explicitly setting a minimum band gap limit, the candidate selection process that is dictated by the EGO algorithm tries to optimize only a single objective, which is the ϵ value. We conducted three design cycles sequentially with feedback of the newly calculated data into training-data after each cycle. In the first design cycle, we set no band gap minimum cutoffs to allow the full exploration of the search-space that consists of 11,102 non-metals from OQMD. In the second design cycle, a minimum cutoff of 2.25 eV was set, leaving 6191 materials in the search-space. In the final cycle, the minimum cutoff was increased to 5 eV to limit the candidate selection only to the materials with very high Eg. Hence, the search-space size in the final cycle was reduced to 1046 materials. The workflow that we adopted in this work deviates from the ideal situation where a dedicated multi-objective optimization statistical algorithm will be used to find a material with high ϵ and large Eg values. Since the band gap values are already available for all materials in the search-space, the best approach here was to implement a statistical optimization algorithm to quickly find high-ϵ materials while the preference for large band gap values is achieved by manually setting a minimum cutoff. This work stands as an example for the modifications required to practically implement the statistical algorithms that are often benchmarked on idealistic scenarios.

New dielectric materials

The materials that are part of the Pareto front of MP data are listed in Table 1, while the Pareto front of training-data at each design cycle is plotted in Fig. 5. Since the maximization of ϵ and Eg values are considered as optimal in this study, each material in the Pareto front has a higher value of either ϵ or Eg than any other material in the corresponding training-data. Therefore, the modification of the training-data’s Pareto front by any of the newly calculated dielectric constants after each design cycle may indicate the identification of suitable, high-dielectric materials.

Table 1 The Pareto front of dielectric materials dataset from Materials Project.
Fig. 5: Evolution of the Pareto front with design cycles.
figure 5

The ϵ and Eg values of the training-data and newly characterized dielectric materials are plotted for a design cycle 1, b design cycle 2, and c design cycle 3. All the data shown in these plots originated from DFPT calculations. Plot a shows the original Pareto front of the dataset from Materials Project (MP) because none of the materials measured in cycle 1 became part of the Pareto front—predominantly owing to their low band gap values. Assigning no restrictions on the band gap of search-space materials during the first cycle directed the design algorithm to pick two materials without any preference for large band gaps. The numerical values of the materials in the Pareto front of the MP dataset are given in Table 1. In both b and c, only the materials with Eg values greater than 2.0 eV are plotted to highlight the area where some of the newly discovered dielectrics in their corresponding cycles joined the Pareto front. Due to this cropping, two materials from the MP dataset which are actually in the Pareto front in plots b and c with very high ϵ values—tetragonal TiO2 (ϵ = 988, Eg = 1.8 eV) and cubic KTaO3 (ϵ = 640, Eg = 2.1 eV), are not shown here.

During the first design cycle, the EGO algorithm picked out the five most promising candidates with the largest E(I) values in the search-space. The ϵ values of these five selected materials were calculated using DFPT. Two materials among them turned out to have very high ϵ values (~370) but very low Eg (~0.5 eV). The low Eg values are not unexpected since the EGO algorithm implemented in this work aims to maximize only the ϵ values. None of the materials selected in this cycle modified the Pareto front of the MP dataset, as shown in Fig. 5a. The ϵ values of these five materials were appended to the training-data prior to starting the next design cycle.

Five materials were selected in the second cycle and their dielectric constants were calculated. Our calculations predict a large dielectric constant for one of the five new materials—tetragonal Tl3PbBr5 (ϵ = 101, Eg = 2.9 eV). Tl3PbBr5 joined the Pareto front, as shown in Fig. 5b. Three other new materials—Bi5IO7 (ϵ = 36, Eg = 2.7 eV), Bi3ClO4 (ϵ = 39, Eg = 2.3 eV), and Bi3BrO4 (ϵ = 39, Eg = 2.3 eV), have moderately large ϵ values, even though they did not improve the existing Pareto front. All the five new materials were appended into the training-data before proceeding to begin the third design cycle.

During the third and final design cycle consisting of only materials with very large Eg in search-space, seven new candidate materials were selected to do DFPT calculations. Two among them—Eu5SiCl6O4 (ϵ = 69, Eg = 5.5 eV) and HoClO (ϵ = 75, Eg = 5.2 eV) joined the Pareto front due to their large ϵ and Eg values, as shown in Fig. 5c. In total, three new dielectric materials in the Pareto front were discovered after three design cycles and 17 new DFPT calculations were performed in the entire workflow. No further design cycles were conducted since we have already identified multiple compounds with high ϵ and Eg, which remained unexplored experimentally.

The ϵ values of all 17 materials which were obtained in this work are given in Table 2. The ϵ and Eg of all materials belonging to the Pareto front of the MP dataset is listed in Table 1 for comparison. Among all the newly discovered dielectrics with large ϵ values, tetragonal HoClO and monoclinic Eu5SiCl6O4 stand out because of their very large DFT-calculated band gap energies (5.2 eV and 5.5 eV respectively). These two rare earth oxychlorides are reported to have been experimentally synthesized41,42,43,44 but their dielectric properties remained unstudied to the extent of our knowledge. Both of these compounds are mixed-anionic inorganic compounds—a class of emerging functional materials45. Interestingly, the monoclinic Eu5SiCl6O4 has 32 atoms in its primitive unit cell which often exceeds the maximum cutoff on the number of atomic sites in HT studies involving computationally expensive material properties11,19.

Table 2 Dielectric constants of 17 materials calculated using DFT in this work.

Thermodynamic stability of a dielectric when in contact with Si or other semiconductors is an important requirement for it to be used in electronic applications. Several of the high-ϵ dielectrics identified in the published literature were shown to be unstable while forming an interface with Si in subsequent experimental studies conducted at or above the room temperature. The formation of SiOx and other undesired metal oxides were reported at the interface between Si and the popular high-ϵ dielectrics such as Ta2O346,47,48, TiO249,50, BaTiO351, and SrTiO352,53. The thermodynamic stability between two compounds can be assessed from the phase diagram involving those compounds. In this work, the phase diagram is constructed by computing the convex hull54 of formation energies of all the materials that belong to a given phase space spanned by their constituent elements. Each of the compounds that form the convex hull not only has the lowest formation energy at its composition but also has lower energy than any linear combination of other materials in that phase space. The difference between the formation energy of a compound and energy at the convex hull for the same composition is called as the hull distance (Ehd). By definition, each material that is on the convex hull has a hull distance of zero (i.e., Ehd = 0) and is considered to be stable. On the other hand, every material that falls above the convex hull is considered as metastable (0 < Ehd ≤ 50 meV per atom) or unstable (Ehd > 50 meV per atom) depending on the magnitude of Ehd according to the heuristic conventions adopted in literature31,55,56,57,58. The presence of a tie-line between two compounds in a convex hull phase diagram indicates that they are thermodynamically stable phases when in contact with each other. Our thermodynamic stability analysis on Ta2O3, TiO2, BaTiO3, and SrTiO3 in OQMD using the qmpy API14 showed no tie-lines connecting any of them to Si, indicating they are unstable when in contact with Si. This is consistent with the published results46,47,48,49,50,51,52,53. We also analyzed Gd2O3, a high ϵ (~2059) that is proven to be stable against Si60, and found that a tie-line does exist between Si and Gd2O3. These phase diagram plots are provided in Supplementary Fig. 6. In Fig. 6, we report a phase diagram to assess the stability of newly discovered high-ϵ dielectrics—HoClO and Eu5SiCl6O4. The phase diagram shows that both these materials are thermodynamically stable with the semiconductors such as Si, Ge, GaAs, GaN, and SiC at 0K, a requirement for them to be used in microelectronic devices where an interface with one of the common semiconductors is often necessary61. The next most promising candidate, tetragonal Tl3PbBr5, has a very large ϵ (101) but possesses a relatively smaller band gap (2.9 eV) and is computed to be thermodynamically metastable at 0K (Ehd = 16 meV per atom) according to the data obtained from the OQMD. Tl3PbBr5 is also reported in the literature to have been experimentally synthesized62,63,64, without any mention of its dielectric properties.

Fig. 6: Phase diagram of all stable compounds in Ho-Cl-O-Eu-Si-Ge-Ga-As-C-N phase space from OQMD (as of January 2022).
figure 6

The two new most promising dielectrics, HoClO and Eu5SiCl6O4 are plotted in large green circles in the center. The elements (Ho, Eu, Si, Cl, Ge, Ga, As, C, N, and O) and semiconductors of interest (Si, Ge, GaAs, SiC, and GaN) are plotted in the middle layer in medium-sized yellow circles. All other stable compounds in the phase diagram are plotted in small dark circles in the outermost layer. Tie-lines between the new dielectrics and the semiconductors or elements are shown as thick red lines. Other tie-lines from the dielectrics to the rest of the stable materials in the outer layer are drawn as narrow gray lines. Another 2326 tie-lines exist in this phase diagram that do not include either of the dielectrics, and are not shown in this network-plot for better visibility of relevant information. The elements and compounds without any visible tie-lines in the outermost layer do not have tie-lines with HoClO or Eu5SiCl6O4, but they have tie-lines with some of the other materials in the outer layer—making them part of this phase diagram. There exist a tie-line from each dielectric material to each semiconductor that is considered here for comparison. This indicates that both HoClO and Eu5SiCl6O4 are in thermodynamic equilibrium with Si, Ge, GaAs, GaN, and SiC at 0K. Thermodynamic stability is a requirement for dielectrics that needs to form a stable interface with semiconductors in electronic applications61.

Discussion

We report the identification of three dielectric materials that contain a combination of high-dielectric constant and large band gap—HoClO(ϵ = 75, Eg = 5.2 eV), Eu5SiCl6O4(ϵ = 69, Eg = 5.5 eV), and Tl3PbBr5(ϵ = 101, Eg = 2.9 eV). These compounds modify the Pareto front of previously known high-throughput dielectric constants data available from the MP database. Our screening strategy also uncovers four other dielectric materials with large Eg and moderately large ϵ—Sr2LuBiO6(ϵ = 24, Eg = 2.4 eV), Bi5IO7(ϵ = 36, Eg = 2.7 eV), Bi3ClO4(ϵ = 39, Eg = 2.3 eV), and Bi3BrO4(ϵ = 39, Eg = 2.3 eV)—at the cost of conducting only 17 DFPT calculations overall. We utilize the data available in the open-source databases (OQMD, MP) to build a statistical optimization model and use it to select the best candidates after searching among 11,102 stable non-metals that are available in the OQMD. Among the newly discovered dielectrics, two mixed-anionic materials—HoClO and Eu5SiCl6O4 are shown to have tie-lines with multiple, commonly used semiconductors on their phase diagrams, that indicate their thermodynamic equilibrium.

The presence of rare earth elements such as Ho and Eu in dielectrics can be a challenge for their use in practical applications. However, the ongoing efforts toward increasing their availability such as efficient recycling of rare earth materials65,66 can result in a sufficient supply of elements for mass production of small electronic components. In particular, Ho is an underutilized element in the industry67 even though it is more abundant in the earth’s crust than other widely mined elements such as Mo, Bi, and precious metals68. Eu is more abundant on earth’s crust than Ho and some of the heavily mined elements such as W and As68. Hence, an active exploration of cheaper and easier extraction methods for rare earth elements may make it feasible to include them in mass-produced electronics in the near future. The presence of toxic elements such as Pb and Tl can stand as a barrier against including Tl3PbBr5 in consumer electronics. Since mixed-anionic materials are an emerging class of functional materials, our identification of promising dielectric materials in this family opens up further research opportunities on rational design of high-performance dielectrics and their experimental characterizations.

We also assessed the thermodynamic stability of the new dielectrics by creating a large convex hull diagram containing the best two new dielectrics (HoClO and Eu5SiCl6O4) and several commonly used materials in electronics. The relevance of this analysis is also provided in detail along with examples of previously reported high-ϵ dielectrics46,47,48,49,50,51,52,53 that were later found out to be unstable when in contact with common electronic component materials such as SiO2. Our convex hull analysis indicates that both HoClO and Eu5SiCl6O4 are stable against the common electronic materials that we considered.

To understand what features of HoClO, Eu5SiCl6O4, and Tl3PbBr5 make them the best dielectric candidates in this study, we have calculated their electronic structures and partial density of states (Supplementary Fig. 5). Our analysis shows that the top of the valence bands and bottom of the conduction bands in these compounds consists of primarily the contributions from the anions (Cl, Br) and cations (Ho, Eu, Tl), respectively. This analysis indicates that having lighter anions (such as Cl, Br) is advantageous as their valence orbitals making up the valence band edge in those compounds will have lower energies, hence, a relatively larger band gap that is desired in high-ϵ materials.

In addition to the identification of high-dielectrics, we successfully demonstrated an implementation of a cross-database statistical design for computational materials selection. Datasets from the MP and OQMD repositories are used in this work as training-data and search-space, respectively. The successful identification of new materials from such a workflow is another motivation for actively moving toward the interoperability of materials databases, which is one of the four pillars of FAIR data principles69 in scientific data management. Therefore, better interoperability across databases amplifies the flexibility in utilizing materials data while solving a complex materials problem.

Lastly, this work also stands as an example of the practical implementation of a computational design strategy for property optimization via data-informed material selection. A multi-objective optimization problem (maximizing ϵ and Eg) is converted into a single objective optimization using statistical methods (maximizing ϵ) combined with explicit constraining of band gap values (higher Eg) among materials since Eg is already available for all materials in the search-space. The deviation from the ideal, statistically benchmarked multi-objective optimization workflows27 enabled the efficient utilization of resources and resulted in the identification of three high-ϵ dielectrics at the cost of just 17 new DFPT calculations.

Methods

ANN modeling

The individual models in the ANN ensemble consisted of a single hidden layer with the number of neurons in the range of 102. The exact number of neurons varied randomly within a small range (10–30) to avoid any bias that may arise from model architecture since the subset of training-data for each ANN was randomly sampled. Each ANN ensemble consisted of 2000 independent ANNs. Thus, the ϵ-distribution for each material consisted of 2000 independent ϵ predictions. A new ANN ensemble was created and trained for each new design cycle to learn the incremented training-data. The Nadam optimizer is used for network optimization during the training. Both L2 layer regularization and early-stopping callback as implemented in Keras70, are implemented for each ANN in the ensemble to prevent over-fitting. On average, it took between 300 to 400 epochs to reach the local minimum of the loss function. Each epoch is a full iteration of fitting the training-data to update the internal weights of an ANN. Validation details of one of the randomly chosen ANN models from the ensemble are plotted in Fig. 3a for reference. Feature dimensional reduction prior to the training of ANNs was done using the principal component analysis algorithm implemented in scikit-learn34. Model validation during the training of one of the 2000 ANN models in the second design cycle is plotted in Fig. 3a.

DFPT calculations

We performed all DFT calculations using the Vienna Ab initio Simulation Package (VASP)71,72 with potentials derived using the projector-augmented wave73,74 method. We calculated the total dielectric constant (sum of electronic and ionic components) values for selected materials using DFPT as implemented in VASP. All the compounds were fully relaxed before the dielectric calculations. We used an energy cutoff of 520 eV, k-mesh of 6000 k-points per reciprocal atom, and an energy-threshold of 10−8 eV during the self-consistent calculations. The forces on the atoms after structural relaxations were less than 10−3 eV Å−1. We used the generalized gradient approximation75 to approximate the exchange-correlation energies of the electrons. A detailed discussion on DFPT calculations is provided in the Supplementary Methods section included within the Supplementary Material. We did DFPT calculations on a set of well-known dielectrics and a few rare earth compounds, and benchmarked the results against previously reported results in the literature. These results indicate the reliability of our calculated ϵ values, which are provided in Supplementary Table 2. Specifically, two rare earth oxides (EuO and Ho2O3) and one rare earth halide (EuF2) were benchmarked to test the accuracy of the standard DFPT calculations in modeling these compounds. Furthermore, our calculations reveal that no imaginary phonon modes appear in HoClO, Eu5SiCl6O4, and Tl3PbBr5, the best high-ϵ materials identified in this work. More details are provided in Supplementary Table 1 and Supplementary Fig. 4.