Introduction

Motivation

Driven by the success of machine learning (ML) in commercial applications (e.g., product recommendations and advertising), there are significant efforts to exploit these tools to analyze scientific data. One such effort is the emerging discipline of Materials Informatics, which applies ML methods to accelerate the selection, development, and discovery of materials by learning structure–property relationships. Materials Informatics researchers are increasingly adopting ML methods in their workflows to predict materials’ physical, mechanical, optoelectronic, and thermal properties (e.g., crystal structure, melting temperature, formation enthalpy, bandgap). While commercial use cases and material science applications may appear similar in their overall goals, we argue that fundamental differences exist in the corresponding data, tasks, and requirements. Applying ML techniques without careful consideration of their assumptions and limitations may lead to missed opportunities at best, and to a waste of substantial resources and incorrect scientific inferences at worst. In the following, we describe unique challenges that the Materials Informatics community must overcome before ML solutions can gain universal acceptance in material science.

Learning from underrepresented and distributionally skewed data

One of the fundamental assumptions of current ML methods is the availability of densely and uniformly sampled (or balanced) training data. When certain classes are under-represented in the data, standard ML algorithms may provide incorrect inferences across the classes of the data.1 Unfortunately, in several material science applications, balanced data is exceedingly rare, and various forms of extrapolation are required due to underrepresented data and severe class distribution skews. As an example, materials scientists are often interested in designing (or discovering) compounds with uncommon targeted properties, e.g., high \({T}_{{\mathrm {C}}}\) superconductivity or large \(ZT\) for improved thermoelectric power,2 shape memory alloys (SMAs) with the targeted property of very low thermal hysteresis,3 and bandgap energy in the desired range (0.9–1.7 eV) for solar cells.4 In such applications, we encounter highly imbalanced data (with targeted materials being in the minority class) due to these design choices or constraints. Consider the task of predicting material properties (e.g., bandgap energy, formation energy, stability, etc.) from a set of feature vectors (or descriptors) corresponding to crystalline compounds. One representative database for such a data set is the open quantum materials database (OQMD),5 which contains several properties of crystalline compounds as calculated using density functional theory (DFT). The OQMD contains data sets with strongly imbalanced distributions of target variables, i.e., material properties. In Fig. 1, we plot the histogram of several commonly targeted properties; the data set exhibits severe distribution skews. For example, \(95 \%\) of the compounds in the OQMD are possibly conductors with bandgap values equal to zero. Note that if the sole aim of the ML model is to maximize overall accuracy, the ML algorithm will perform quite well by simply ignoring or discarding the minority class. In practice, however, correctly classifying and learning from the minority class of interest may matter more than occasionally misclassifying the majority classes.

Fig. 1

Histograms (number of compounds vs. targeted property bin) of targeted properties of the OQMD database show heavily skewed distributions. We show that conventional machine-learning approaches: (a) produce inaccurate inferences in sparse regions of the property-space and (b) are overconfident in the accuracy of such predictions. The proposed approach overcomes these shortcomings

Explainable ML methods without compromising the model accuracy

One might think that increasing model complexity (and thereby the model’s representative power) can address the challenges of underrepresented and distributionally skewed data. However, this can only superficially solve some of these problems.6 Furthermore, increasing the complexity of ML models may increase the overall accuracy of the system at the cost of making the model very hard to interpret. Understanding why an ML model made a certain prediction or recommendation is crucial, since it is this understanding that provides the confidence to make decisions that lead to new hypotheses and ultimately new scientific insights. Several existing approaches define explainability as the inverse of complexity and achieve explainability at the cost of accuracy.7,8,9 This introduces a risk of producing explainable but misleading predictions. With the advent of highly predictive but opaque ML models, it has become more important than ever to understand and explain their predictions and to devise explainable scientific ML techniques that do not sacrifice predictive power.

Better evaluation and uncertainty quantification techniques for building trust in ML

For a credible use of ML in material science applications, we need the ability to rigorously quantify ML performance. Traditionally, the quality of an ML model is measured by its accuracy on test data using cross-validation. Considering the scarcity of densely sampled data in several material science problems,10,11 high accuracy on the test data can hardly provide confidence in the quality and generality of ML systems. A natural solution is to use a model’s own reported confidence (or uncertainty) score to quantify trust in the prediction. However, a model’s confidence score alone may not be very reliable. For example, in computer vision, well-crafted perturbations to images can cause classifiers to make mistakes (such as identifying a panda as a gibbon or confusing a cat with a computer) with very high confidence.12 As we will show later, this problem also persists in the Materials Informatics pipeline (especially with distributional skewness). Nevertheless, knowing when a classifier’s (or regressor’s) prediction can be trusted is useful in several other applications for building assured ML solutions. Therefore, we need to augment current validation techniques with additional components that quantify the generalization performance of scientific ML algorithms, and devise reliable uncertainty quantification methods to establish trust in these predictive models.

Literature survey

In the recent past, the materials science community has used ML methods to build predictive models for several applications.13,14,15,16,17,18,19,20,21,22,23,24,25,26 Seko et al.21 considered the problem of building ML models to predict the melting temperatures of binary inorganic compounds. The problem of predicting the formation enthalpy of crystalline compounds using ML models was considered recently.14,15,27 Predictive models for crystal structure formation at a given composition are also being developed.16,28,29,30 The problems of bandgap energy prediction for certain classes of crystals31,32 and mechanical property prediction for metal alloys have also been considered in the literature.24,25 Ward et al.4 proposed a general-purpose ML framework to predict diverse properties of crystalline and amorphous materials, such as bandgap energy and glass-forming ability.

Thus far, research on applying ML methods to material science applications has predominantly focused on improving the overall accuracy of predictive models. However, imbalanced learning, explainability, and reliability of ML methods in material science have not received significant attention. As mentioned earlier, these aspects pose a real obstacle to deriving correct and reliable scientific inferences and to the universal acceptance of ML solutions in material science, and they deserve to be tackled head on.

Note that some of these issues have been studied in the past by the ML and quantitative structure–activity relationship (QSAR) communities.33,34,35,36 The problem of learning from imbalanced data has received significant attention in the ML community for applications such as sentiment analysis, text mining, and video mining.1 It has also received some attention in QSAR applications.37 Ranking descriptors based on their relative importance38 is a widely used approach to explain ML models and has been used in QSAR modeling.39 Other issues, such as the proper way to select a representative test set from a parent database34,35 and the applicability domain,34,35,36 which is closely related to the correct way of evaluating model performance, have also been considered for QSAR applications.

We draw motivation from these approaches and study imbalanced learning, explainability, and reliability of ML methods in the context of Materials Informatics applications.

Our contributions

In this paper, we take some first steps toward addressing the challenge of building reliable and explainable ML solutions for Materials Informatics applications. The main contributions of the paper are twofold. First, we identify shortcomings in the training, testing, and uncertainty quantification steps of existing Materials Informatics pipelines when learning from underrepresented and distributionally skewed (or imbalanced) data. Our findings raise serious concerns regarding the reliability of existing Materials Informatics pipelines. Second, to overcome these challenges, we propose a general-purpose explainable and reliable machine-learning method that enables reliable learning from underrepresented and distributionally skewed data. We propose the following solutions: (1) a learning architecture that biases the training process toward the goals of imbalanced domains; (2) sampling approaches that manipulate the training data distribution so as to allow the use of standard ML models; and (3) reliable evaluation metrics and uncertainty quantification methods that better capture the application bias. To improve explainability, as opposed to other existing approaches that train an independent regression model per property, we employ a simple and computationally cheap partitioning scheme. This scheme first partitions the data into subclasses of materials based on their property values and then trains a separate, simpler regression model for each group. Note that our approach differs in its motivation (and operation) from the similar concept utilized by Ward et al.4 and Zhu et al.40 Our motivation behind partitioning is to enhance “explainability”, as opposed to the previous approaches,4 where a computationally expensive exhaustive search was performed to find artificial groups that enhance the accuracy of predictions. In our case, the explainability-enhancing partitioning scheme in fact hurts predictive performance (or accuracy). To compensate for this loss, we utilize transfer learning, exploiting correlations among different material properties to improve the regression performance. We show that the proposed transfer learning technique can overcome the performance loss due to the simplicity of the models. To further improve the interpretability of the ML system, we add a rationale generator component to our framework. The goal of the rationale generator is twofold: (1) provide explanations corresponding to an individual prediction and (2) provide explanations corresponding to the regression model. For an individual prediction, the rationale generator provides explanations in terms of prototypes (similar but known compounds). This helps a material scientist use his/her domain knowledge to verify whether similar known compounds or prototypes satisfy the imposed requirements or constraints. For regression models, the rationale generator provides global explanations regarding whole material sub-classes by reporting feature importance for every material sub-class. Finally, we propose a new evaluation metric and a trust score to better quantify confidence and establish trust in ML predictions. We demonstrate the applicability of our technique in two applications: (1) predicting five physically distinct properties of crystalline compounds, and (2) identifying potentially stable solar cell materials.

Results and discussions

First, we discuss the proposed ML method with a focus on reliability and explainability using the data from the OQMD. Next, we demonstrate the application of our approach in two material science problems.

General-purpose reliable and explainable ML framework

To solve the problem of reliable learning and inference from distributionally skewed data, we propose a general-purpose ML framework (see Fig. 2). Instead of developing yet another ML algorithm to improve accuracy for a specific application, our objective is to develop generic methods that improve reliability, explainability, and accuracy in the presence of imbalanced data. The proposed framework is agnostic to the type of training data, can utilize a variety of already-developed ML algorithms, and can be reused for a broad variety of material science problems. The framework is composed of three main components: (1) a training procedure for learning from imbalanced data, (2) a rationale generator for model-level and decision-level explainability, and (3) reliable testing and uncertainty quantification techniques to evaluate the prediction performance of ML pipelines.

Fig. 2

An illustration of proposed ML pipeline for material property prediction

Training procedure

Building an ML model for materials property prediction can be posed as a regression problem, where the goal is to predict continuous property values from a set of material attributes/features. The challenge in our target task is that, due to distributional skewness, ML models do not generalize well, specifically in domains (minority classes) that are not well represented in the available labeled data.1 To solve this problem, we propose a generic ML-training process that is applicable to a broad range of materials science applications that suffer from distributionally skewed data. We explain the proposed training process with the help of the following running example: a material scientist is interested in learning an ML model targeting a specific class of material properties, e.g., stable wide-bandgap materials in a certain targeted range. In most cases, we have domain knowledge about the range of property values for specific classes of materials, e.g., conductors have bandgap energies equal to zero, typical semiconductors have bandgap energies in the range of \(1.0\)–\(1.5\) eV, whereas wide-bandgap materials have bandgap energies >\(2.0\) eV. These requirements introduce a partition of the property space into multiple material classes. Such a partition can also be introduced artificially by imposing constraints on the gradient of the property values so that compounds with similar property values fall in the same class. Given \(N\) training data samples with \(M\) distinct target properties \({\{{X}_{i},({Y}_{i}^{1},\ldots ,{Y}_{i}^{M})\}}_{i=1}^{N}\), where \({X}_{i}\) is the feature/attribute vector and \({Y}_{i}^{j}\) is the \(j\)th property value corresponding to compound \(i\), the steps in the proposed training procedure are as follows:

  1.

    Partition the property space into \(K\) regions/classes and obtain transformed training data samples \({\{{X}_{i},({Z}_{i}^{1},\ldots ,{Z}_{i}^{M})\}}_{i=1}^{N}\), where \({Z}_{i}^{j}\in \{1,\ldots ,K\}\).

  2.

    For each property \(j\in \{1,\ldots ,M\}\), perform sub-sampling of the compounds in the \(K\) distinct classes to obtain an evenly distributed training set \({\{{X}_{i},{Z}_{i}^{j}\}}_{i=1}^{{N}_{j}}\). Other, more sophisticated sampling techniques41 or generative modeling approaches can also be used.

  3.

    Train \(M\) multi-class classifiers (one per property) on balanced datasets \({\{{X}_{i},{Z}_{i}^{j}\}}_{i=1}^{{N}_{j}}\) to predict which class a compound belongs to.

  4.

    For every \((j,k)\) pair, train a regressor on \({\{{X}_{i},{Y}_{i}^{j}\}}_{i=1}^{{N}_{j}}\) to predict property values \({\hat{Y}}_{i}^{j}\).

  5.

    Finally, utilize correlation among properties to improve the model accuracy by employing transfer learning (explained next).

At test time, to predict the \(j\)th property of a test compound, the ML algorithm first identifies the class the test compound belongs to using the trained \(j\)th multi-class classifier. Next, depending on the predicted class \(k\) for property \(j\), the \((j,k)\)th regressor is used, along with the transfer learning step, to predict the property value of the test compound. Below, we provide details and justifications for each of these steps in our ML pipeline.
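Before detailing these steps, we give a minimal sketch (in Python, using scikit-learn and XGBoost as in our experiments) of steps 1–4 and the test-time routing. Function names, the boundary handling in the partition step, and the choice to fit each class regressor on the full class-\(k\) subset are illustrative assumptions; hyperparameters are left at their defaults.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingRegressor

def partition(y, thresholds):
    """Step 1: map continuous property values to classes 0..K-1 (boundary handling is illustrative)."""
    return np.digitize(y, thresholds, right=True)

def undersample(X, y, z, rng):
    """Step 2: randomly drop majority-class samples so all classes are equally represented."""
    n_min = min(np.sum(z == k) for k in np.unique(z))
    keep = np.hstack([rng.choice(np.where(z == k)[0], n_min, replace=False)
                      for k in np.unique(z)])
    return X[keep], y[keep], z[keep]

def train_property(X, y, thresholds, rng=None):
    """Steps 1-4 for one property: a K-class classifier plus one regressor per class."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = partition(y, thresholds)
    Xb, yb, zb = undersample(X, y, z, rng)
    clf = XGBClassifier(objective="multi:softmax").fit(Xb, zb)        # step 3
    regs = {k: GradientBoostingRegressor().fit(X[z == k], y[z == k])  # step 4
            for k in np.unique(z)}
    return clf, regs

def predict_property(x, clf, regs):
    """Test time: route the compound to its predicted class, then apply that class regressor."""
    k = int(clf.predict(x.reshape(1, -1))[0])
    return float(regs[k].predict(x.reshape(1, -1))[0])
```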

Steps \(1\)–\(3\) transform a regression problem into a multi-class classification problem on sub-sampled training data. The goal of this transformation is to balance the distribution of the least represented material classes against the more frequent observations. Note that the proposed framework is general enough to utilize other sophisticated imbalanced learning strategies (such as ensemble learning, data pre-processing, and cost-based learning) to further improve performance. Furthermore, instead of having a single model trained on the entire training set, having smaller and simpler models for different classes of materials helps to gain a better understanding of sub-domains through the rationale generator (explained later).

Next, we explain the proposed transfer learning technique, which exploits correlations present among different material properties to improve the regression performance. We devise a simple knowledge transfer scheme that utilizes the marginal estimates/predictions from step \(4\), where regressors were trained independently for different properties. Note that, for each compound \(i\), we get an independent estimate \({\hat{{\bf{Y}}}}_{{\bf{i}}}\approx \{{\hat{Y}}_{i}^{1},\ldots ,{\hat{Y}}_{i}^{M}\}\) from step 4. In step 5, we augment the original attribute vector \({X}_{i}\) with the independent estimates \({\hat{{\bf{Y}}}}_{{\bf{i}}}\) to get a modified attribute vector \({\hat{X}}_{i}=[{X}_{i},{\hat{{\bf{Y}}}}_{{\bf{i}}}]\). Now, for every \((j,k)\) pair, we train a regressor on \({\{{\hat{X}}_{i},{Y}_{i}^{j}\}}_{i=1}^{{N}_{j}}\) to predict property values \({\hat{Y}}_{i}^{j}\). The modified attribute vector takes information from the other properties into account. Because these physical properties are highly correlated, we show later that this simple knowledge transfer scheme significantly improves the regression performance.
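A sketch of the step-5 augmentation, under the assumption that `marginal_models` is a list of per-property models whose `predict` returns the step-4 estimate for each compound (the routing through the per-class regressors is collapsed into one callable per property for brevity; names are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def augment_features(X, marginal_models):
    """Step 5: append the M marginal property estimates to each attribute vector."""
    Y_hat = np.column_stack([m.predict(X) for m in marginal_models])
    return np.hstack([X, Y_hat])  # X_hat_i = [X_i, Y_hat_i^1, ..., Y_hat_i^M]

def train_joint_regressor(X_class, y_class, marginal_models):
    """Retrain the (j, k) regressor on the augmented attributes (the 'joint' regressor)."""
    X_hat = augment_features(X_class, marginal_models)
    return GradientBoostingRegressor().fit(X_hat, y_class)
```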

Rationale generator

The goal of the rationale generator is to provide: (a) decision-level explanations, and (b) model-level explanations. Decision-level explanations provide reasoning such as: what made the ML algorithm make a specific decision/prediction? Model-level explanations, on the other hand, focus on providing understanding at the class level, e.g., which chemical attributes help in discriminating among insulators, semi-conductors, and conductors?

Decision level explanations: The proposed ML pipeline explains its predictions for previously unseen compounds by providing similar known examples (or prototypes). Explanation by examples is motivated by studies of human reasoning showing that the use of examples (analogy) is fundamental to the development of effective strategies for better decision-making.42 Example-based explanations are widely used to improve user explainability of complex ML models. In our context, for every unseen test example, in addition to the predicted property values, we provide similar experimentally known compounds together with their similarity to the test compound in the feature space. Because our feature space is heterogeneous (containing both continuous and categorical features), Euclidean distance is not reliable; we therefore quantify similarity using Gower’s metric.43 Gower’s metric can measure similarity between data containing a combination of logical, numerical, categorical, or text entries. The distance is always a number between \(0\) (similar) and \(1\) (maximally dissimilar). Furthermore, as a consequence of breaking a large regression problem into a multi-class classification followed by a simpler regression problem, we can also provide the logical sequence of decisions taken to reach a prediction, which cannot be obtained from a single/global model.
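A minimal implementation of the Gower distance used for prototype retrieval, assuming per-feature ranges have been precomputed over the training set (a dedicated library could be substituted):

```python
import numpy as np

def gower_distance(x, y, is_categorical, ranges):
    """Gower distance in [0, 1]: mean per-feature dissimilarity over mixed-type features.

    is_categorical: one boolean flag per feature; ranges: (max - min) of each
    numeric feature over the training set (ignored for categorical features).
    """
    d = []
    for xi, yi, cat, r in zip(x, y, is_categorical, ranges):
        if cat:
            d.append(0.0 if xi == yi else 1.0)            # categorical: simple mismatch
        else:
            d.append(abs(xi - yi) / r if r > 0 else 0.0)  # numeric: range-normalized difference
    return float(np.mean(d))
```

Prototype retrieval then amounts to ranking experimentally known compounds by this distance to the test compound and reporting the nearest few as explanations.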

Model level explanations: Knowing which chemical attributes are important in a model’s prediction (feature importance) and how they are combined can be very powerful in helping material scientists understand and trust automatic ML systems. Due to the structure of our pipeline (classification followed by regression), we can provide more fine-grained feature-importance explanations than a single regression model. Specifically, we break the feature importance of attributes for predicting a material property into: (1) feature importance for discriminating among different material classes (inter-class) and (2) feature importance for regression on a material sub-domain (intra-class). This provides a more in-depth explanation of the property prediction process.
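A sketch of how the two levels of feature importance can be read off the tree-ensemble models; here `clf` and `regs` denote the per-property classifier and per-class regressors from the training sketch above, and `feature_names` the attribute names (all names are assumptions):

```python
import numpy as np

def ranked_importances(model, feature_names, top=10):
    """Rank attributes by the model's impurity-based feature importances."""
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1][:top]
    return [(feature_names[i], float(imp[i])) for i in order]

# Inter-class: attributes that separate, e.g., conductors / semiconductors / insulators.
inter_class = ranked_importances(clf, feature_names)
# Intra-class: attributes that drive the regression within one material sub-class (here class 2).
intra_class = ranked_importances(regs[2], feature_names)
```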

Robust model performance evaluation and uncertainty quantification

The distributionally skewed training data biases the learning system towards solutions that may not be in accordance with the user’s end goal. Most existing learning systems work by searching the space of possible models with the goal of optimizing some criterion (or numerical score). These metrics are usually related to some form of average performance over the whole train/test data and can be misleading when the sampled train/test data are not representative of the true distribution. More specifically, commonly used evaluation metrics (such as mean squared error, R-squared, etc.) assume an unbiased (or uniform) sampling of the test data and break down in the presence of distributionally skewed test data (shown later). Therefore, we propose to perform class-specific evaluations (by partitioning the property space into multiple classes of interest), which better characterize the predictive performance of ML models in the presence of distributionally skewed data. We also recommend visualizing predicted and actual property values for the test data in combination with the numeric scores (which are averaged over all test examples) to build better intuition about the predictive performance.
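A sketch of the class-specific evaluation, computing the usual metrics separately within each property-space partition rather than pooled over the skewed test set (threshold handling mirrors the partition step above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def class_specific_scores(y_true, y_pred, thresholds):
    """Report per-class MAE and R^2 instead of a single pooled score."""
    classes = np.digitize(y_true, thresholds, right=True)
    scores = {}
    for k in np.unique(classes):
        m = classes == k
        scores[int(k)] = {
            "n": int(m.sum()),
            "MAE": mean_absolute_error(y_true[m], y_pred[m]),
            "R2": r2_score(y_true[m], y_pred[m]) if m.sum() > 1 else float("nan"),
        }
    return scores
```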

Note that having a robust evaluation metric only partially solves the problem, as ML models are susceptible to over-confident extrapolations. As we will show later, in imbalanced learning scenarios, ML models make overconfident extrapolations that have a higher probability of being wrong (e.g., predicting a conductor to be an insulator with \(99 \%\) confidence). In other words, a model’s own confidence score cannot be trusted. To overcome this problem, we use a set of labeled, experimentally known compounds as side information to help determine a model’s trustworthiness for a particular unseen test example. The trust score is defined as follows:

$$T({X}_{i})=1-\frac{d\left({X}_{i},{\{{X}_{j}\}}_{j\in {c}_{i}}\right)}{d\left({X}_{i},{\{{X}_{j}\}}_{j\in {c}_{i}}\right)+d\left({X}_{i},{\{{X}_{j}\}}_{j\notin {c}_{i}}\right)}.$$
(1)

The trust score \(T\) takes into account the average Gower distance \(d\) from the test sample \({X}_{i}\) to samples in the same class \({c}_{i}\) vs. the average Gower distance to nearby samples in other classes. \(T\) ranges from \(0\) to \(1\), where a higher value indicates a more trustworthy prediction.
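A direct transcription of Eq. (1), assuming `gower` is a two-argument distance callable (e.g., the Gower distance above with its auxiliary arguments bound), `X_ref`/`z_ref` are labeled reference compounds (e.g., from the ICSD) with their classes, and the averages are taken over the \(k\) nearest reference samples; the restriction to nearby samples is our reading of the text:

```python
import numpy as np

def trust_score(x_test, predicted_class, X_ref, z_ref, gower, k=10):
    """Eq. (1): 1 - d_same / (d_same + d_other), with d the average Gower distance
    from the test compound to nearby labeled reference compounds."""
    d = np.array([gower(x_test, x_ref) for x_ref in X_ref])
    d_same = np.sort(d[z_ref == predicted_class])[:k].mean()   # nearby same-class references
    d_other = np.sort(d[z_ref != predicted_class])[:k].mean()  # nearby other-class references
    return 1.0 - d_same / (d_same + d_other)
```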

Example applications

In this section, we discuss two distinct applications of our reliable and explainable ML pipeline to demonstrate its versatility: predicting five physically distinct properties of crystalline compounds and identifying potentially stable solar cell materials. In both cases, we use the same general framework, i.e., the same attributes and ML pipeline. Through these examples, we discuss all aspects of creating reliable and explainable ML models: building a reliable machine-learning model from distributionally skewed training data, generating explanations to gain a better understanding of the data/model, evaluating model accuracy (section “Predicting properties of crystalline compounds”), and employing the model to predict new materials (section “Stable solar cell material prediction”).

Predicting properties of crystalline compounds

DFT provides a means of predicting properties of chemical compounds based on quantum mechanical modeling. However, the utility of DFT is limited by its computational complexity. An alternative approach is to use ML to train a surrogate model on a representative set of (input, output) pairs from prior DFT calculations. The surrogate then emulates DFT, producing approximate answers at dramatically lower computational cost (several orders of magnitude faster) and enabling rapid screening of candidate materials. A potential drawback of this approach is that it requires many (potentially hundreds of thousands of) DFT calculations in order to generate a suitable training set. Fortunately, several such training sets already exist.

Data set: We follow the lead of Ward et al.4 and use the OQMD for training purposes. The OQMD contains the results of DFT calculations on ~\(300,\!000\) diverse compounds. Of these, we select only the lowest-energy compound for each composition. This yields a training set containing \(228,\!573\) unique examples. We use the same set of \(145\) attributes/features as Ward et al.4 to represent each compound. Using these features, we consider the problem of developing reliable and explainable ML models to predict five physically distinct properties currently available through the OQMD: bandgap energy (eV), volume/atom (\(\rm{\AA }^{3}\) per atom), energy/atom (eV per atom), thermodynamic stability (eV per atom), and formation energy (eV per atom).44 Units for these properties are omitted in the rest of the paper for ease of notation. A description of the \(145\) attributes (inputs) and \(5\) properties (outputs) is provided in the Supplementary Materials.

Method: We quantify the predictive performance of our approach using five-fold cross-validation.45 The data set is divided into five subsets; each time, one of the five subsets is used as the test set and the other four are put together to form the training set, and the average error across all five trials is computed. Following the procedure in the section “Training procedure”, we partition the property space for each property into \(K=3\) classes. The decision boundary thresholds for class separation (with class distributions) are as follows: bandgap energy (\(0.0,0.4\)) with (\(94 \% ,5 \% ,1 \%\)), volume/atom (\(20.0,40.0\)) with (\(42 \% ,55 \% ,3 \%\)), energy/atom (\(-8.0,-4.0\)) with (\(1 \% ,63 \% ,26 \%\)), stability (\(0.0,1.5\)) with (\(8 \% ,89 \% ,3 \%\)), and formation energy (\(0.0,1.0\)) with (\(40 \% ,53 \% ,7 \%\)). We also tried different combinations of thresholds, and the trends in the obtained results were consistent. In practice, these thresholds can be provided by domain experts depending on the specific application (as done in the section “Stable solar cell material prediction”). Sub-sampling ratios (for obtaining an evenly distributed training set) were determined using cross-validation. We train extreme gradient boosting (XGB) classifiers46 to perform multi-class (\(K=3\)) classification using the softmax objective for each property. Next, we train gradient boosting regressors (GBRs)47 for each property–class pair independently (and refer to them as marginal regressors). Using these marginal regressors, we create augmented feature vectors for correlation-based predictions. Finally, we train another set of GBRs for each property–class pair on the augmented data (and refer to them as joint regressors, as they exploit the correlation present among properties to improve prediction performance).
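A sketch of the cross-validation loop tying the pieces together; `train_property`, `predict_property`, and `class_specific_scores` refer to the illustrative sketches above and are assumptions, not the exact code used in our experiments:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_property(X, y, thresholds, n_splits=5):
    """Five-fold CV of the per-property pipeline, reporting class-specific scores per fold."""
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        clf, regs = train_property(X[train_idx], y[train_idx], thresholds)
        y_pred = np.array([predict_property(x, clf, regs) for x in X[test_idx]])
        fold_scores.append(class_specific_scores(y[test_idx], y_pred, thresholds))
    return fold_scores
```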

Results: For the conventional scheme, we train \(M\) independent GBR regressors to directly predict properties from the features corresponding to the compounds. In Table 1, we report different error metrics to quantify the regression performance using five-fold cross-validation. Note that these metrics report an accumulated/average error score on the test sets (which comprise compounds from all partitions of the property space). These results are comparable to the state of the art4 and suggest that conventional regressors have excellent regression performance (low MAE/MSE and high R2 score). Relying on the inference made by this evaluation method, we may be tempted to use these regression models in practice for different applications (such as screening or discovery of solar cell materials). However, we next show that these metrics provide misleading inferences in the presence of distributionally skewed data. In Table 2a, we perform class-specific evaluations (i.e., we partition the property space for each property into \(K=3\) classes and evaluate on the test data belonging to each class separately). Surprisingly, Table 2a shows that conventional regressors perform particularly poorly on the minority classes for bandgap energy and stability prediction, where the data distribution is highly skewed (see Fig. 1). Unfortunately, the test data is also distributionally skewed, and thus traditional methods for assessing and ensuring generalizability of ML models provide misleading conclusions (as shown in Table 1). Class-specific evaluations, on the other hand, better characterize the predictive performance of ML models in the presence of distributionally skewed data.

Table 1 Results for conventional technique with overall prediction scores
Table 2 Class-specific prediction score comparison

In Table 2b, we show the effect of transforming a single complex regression model into an ensemble of smaller and simpler models to gain a better understanding of sub-domains (steps \(1\)–\(4\) in the section “Training procedure”). We notice that the performance of these transformed simpler models is worse than that of a single complex model (as given in Table 2a). This suggests that there is a trade-off between simplicity/explainability and accuracy.

Finally, Table 2c shows how this performance loss due to the simplicity of models can be overcome using the transfer learning (or correlation based fusion) step in our pipeline. We observe that the proposed transfer-learning technique can very well exploit correlations in the property space, which results in a significant performance gain compared to the conventional regression approach. Surprisingly, we did not observe any gain when using transfer learning with the conventional technique. In fact, we observed that the models showed severe over-fitting behavior to the predicted properties. Note that this gain is achieved in spite of having smaller (thus simpler) models in our ML pipeline. This suggests that a user can achieve high accuracy similar to a single complex model without sacrificing explainability. We also observed that the sub-sampling step in our pipeline had a positive impact on the regression performance of minority classes.

Furthermore, our pipeline also quantifies uncertainties in its predictions providing a confidence score to the user. We show an illustration of the uncertainty quantification of bandgap energy and stability predictions on \(50\) test samples in Fig. 3. It can be seen that regressors perform poorly in regions with high uncertainty.

Fig. 3

Uncertainty quantification of the regressor (ground truth is in blue, predictions are in red, and gray shaded area represents uncertainty). (a) Bandgap energy and (b) stability. In several cases, regressors perform poorly in regions with high uncertainty

We would also like to point out that in cases where the data from a specific class is heavily under-represented, none of the model design strategies will improve the performance and generating new data may be the only possible solution (e.g., bandgap energy prediction for minority classes). In such cases, relying solely on a cross-validation score or confidence score may not provide reliable inference (shown later). To overcome this challenge, explainable ML can be a potentially viable solution.

Next, we show the output of the rationale generator in our pipeline. Specifically, we provide (1) model-level explanations, as well as (2) decision-level explanations, for each sub-class of materials. For model-level explanations, our pipeline provides feature importance for both the classification and regression steps using techniques discussed in ref. 38. Feature importance provides a score that indicates how useful (or valuable) each feature was in the construction of the model. The more an attribute is used to make key decisions within the (classification/regression) model, the higher its relative importance. This importance is calculated explicitly for each attribute in the data set, allowing attributes to be ranked and compared to each other. In Fig. 4, we show the feature importance for our three-class classifier for bandgap energy. It shows the attributes that help in discriminating among three classes of compounds (insulators, semi-conductors, and conductors) based on their bandgap energy values. Note that the rationale generator picked attributes related to the melting temperature, electro-negativity, and volume per atom of the constituent elements as the most important features in determining the bandgap energy level of the compounds. This is reasonable, as all these attributes are known to be highly correlated with the bandgap energy level of crystalline compounds. For example, the melting temperature of constituent elements is positively correlated with inter-atomic forces (which are in turn negatively correlated with inter-atomic distances). Decreased inter-atomic spacing increases the potential seen by the electrons in the material, which in turn increases the bandgap energy. Therefore, the band structure changes as a function of inter-atomic forces, which are correlated with melting temperature. Similarly, in a multi-element material system, as the electro-negativity difference between different atoms increases, so does the energy difference between bonding and anti-bonding orbitals. Therefore, the bandgap energy increases as the electro-negativities of the constituent elements increase; thus, the bandgap energy has a strong correlation with the electro-negativity of constituent elements. Finally, the mean volume per atom of constituent elements is also correlated with the inter-atomic distance in a material system. As explained above, inter-atomic distance is negatively correlated with the bandgap energy, and so is the mean volume per atom of constituent elements. Similar feature importance results for class-specific predictors can also be obtained (see Supplementary Material).

Fig. 4

Feature importance for three-class classification of bandgap energy. The rationale generator favors attributes related to melting temperature, electro-negativity, and volume per atom for explaining bandgap-energy predictions. These attributes are all known to be highly correlated with the bandgap energy level of crystalline compounds

In Table 3, we show four test compounds with ground truths (class, bandgap energy value), predictions (class, bandgap energy value), and the corresponding confidence scores. It can be seen that both the classifier and the regressor make wrong and over-confident predictions on minority classes (i.e., classes \(1\) and \(2\)). In other words, a higher confidence score from the model for a minority class does not necessarily imply a higher probability that the classifier (or regressor) is correct. In fact, the average confidence score for wrong predictions is \(0.80\), which highlights that the model made wrong predictions with very high confidence. For compounds in minority classes, the ML model may simply not be the best judge of its own trustworthiness. On the other hand, the proposed trust score (as given in Eq. (1)) consistently outperforms the classifier’s/regressor’s own confidence score: a higher/lower trust score implies a higher/lower probability that the classifier (or regressor) is correct. The average trust score for wrong predictions is a small value of \(0.50\). Furthermore, as our trust score is computed using distances from experimentally known compounds in the Inorganic Crystal Structure Database (ICSD),48 it also provides some confidence in a compound’s amenability to synthesis.

Table 3 Bandgap energy prediction and uncertainty quantification

Stable solar cell material prediction

To show how our ML pipeline can be used for discovering new materials, we simulate a search for stable compounds with a bandgap energy within a desired range. To evaluate the ability of our approach to locate compounds that are stable and have bandgap energies within the target range, we set up an experiment in which a model was fit on the training data set and then tasked with selecting the \(30\) compounds in the test data most likely to be stable and have a bandgap energy in the desired range for solar cells: 0.9–1.7 eV.

Data set: As before, for the training data we selected a subset of \(228,\!573\) compounds from the OQMD that represent the lowest-energy compounds at each unique composition, and we use the same \(145\) attributes as before. Using these attributes/features, we consider the problem of developing reliable and explainable ML models to predict two physically distinct properties relevant to stable solar cells: bandgap energy and stability. Note that this experiment is more challenging and more practical than that of Ward et al.,4 where the training data set was restricted to compounds reported in the ICSD as experimentally realizable (a total of \(25,\!085\) entries), so that only bandgap energy, and not stability, needed to be considered. As the test data set, we use the \(4,\!500\) compositions suggested by Meredig et al.15 to be as-yet-undiscovered ternary compounds; these compounds are not yet in the OQMD.

Method: Following the procedure in the section “Training procedure”, we partition the property space for each property into \(K=3\) classes. The decision boundary thresholds for class separation are: bandgap energy (\(0.9,1.7\)) and stability (\(0.0,1.5\)). As in the section “Predicting properties of crystalline compounds”, we use XGB classifiers46 (with default parameters) for multi-class (\(K=3\)) classification and GBRs47 for the marginal and joint regression. We use the models’ own confidence and the trust score to rank the potentially stable solar cell materials.
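A sketch of this screening step: predict the class for bandgap energy and stability for every candidate, keep those predicted to fall in the target windows, and rank by the trust score. The class indices corresponding to “bandgap in 0.9–1.7 eV” and “stable”, as well as the `trust` callable, follow the earlier sketches and are assumptions:

```python
import numpy as np

def screen_solar_candidates(X_cand, clf_gap, clf_stab, trust, top=30,
                            gap_class=1, stable_class=0):
    """Keep candidates predicted stable with in-range bandgap; rank them by trust score."""
    gap_cls = clf_gap.predict(X_cand)    # classes from thresholds (0.9, 1.7); assume 1 = 0.9-1.7 eV
    stab_cls = clf_stab.predict(X_cand)  # classes from thresholds (0.0, 1.5); assume 0 = stable
    hits = np.where((gap_cls == gap_class) & (stab_cls == stable_class))[0]
    scores = np.array([trust(X_cand[i], gap_cls[i]) for i in hits])
    return hits[np.argsort(scores)[::-1]][:top]  # indices of the most trustworthy candidates
```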

Results: We used the proposed ML pipeline to search for new stable compounds (i.e., those not yet in the OQMD). Specifically, we use the trained models to predict the bandgap energy and stability of compositions that were suggested by Meredig et al.15 to be as-yet-undiscovered ternary compounds. We found that, out of these \(4500\) compounds, \(221\) are likely to be stable and have favorable bandgap energies for solar cells. A subset, with trust scores, is shown in Table 4. Similar experimentally known prototypes (as shown in Table 4) can also serve as an initial guess for the 3D crystal structure of the predicted compounds. These recommendations appear reasonable, as four of the six suggested compounds (Cs4CrSe4, Cs3Sb3S, Cs3VSe4, Na9AgO5) can be classified as I–III–VI semiconductors, which are semiconductors that contain an alkali metal, a transition metal, and a chalcogen; I–III–VI semiconductors are a known promising class of photovoltaic materials, as many have direct bandgap energies of \(\sim \!1.5\) eV, making them well-matched to the solar spectrum. The best-known I–III–VI photovoltaic is copper–indium–gallium–selenide (CIGS), which has solar cell power conversion efficiencies on par with silicon-based solar cells. The other two identified compounds, Th2CO2 and Pm1.33PtSe3, are unique in that they contain actinide and lanthanide elements; however, from a practical perspective, the scarcity and radioactivity of these elements may make them challenging to explore experimentally. A detailed list of potentially stable solar cell compounds is provided in the Supplementary Material.

Table 4 Compositions of materials predicted using proposed ML pipeline to be stable candidates for solar cell applications with experimentally known prototypes and their distances from predicted candidates

Some open issues

There are still some issues to be resolved for the successful application of ML in material science. First, in cases where the data from a specific class is heavily under-represented, none of the model design strategies will improve performance, and generating new data may be the only possible solution. Solving this problem will require answering the following question: how many training samples are sufficient to learn a reliable model, and where should one sample if they are inadequate? Second, predictive models built on chemical attributes make recommendations (e.g., potential solar cell materials) in the form of chemical attributes. However, verifying these recommendations using DFT (or experiments) has its own challenges (e.g., identifying an appropriate crystal structure or synthesis recipe). A potentially viable solution is to bias the recommendation process towards compounds with favorable synthesis conditions. Finally, explainable ML methods based on feature importance still require a materials scientist to make sense of model/decision explanations using domain knowledge, which may suffer from human bias. Solving these problems will require significant advances over current explainable ML techniques. Interactive ML and causal inference techniques can further help in resolving some of these issues.

Methods

All machine-learning models were created using the Scikit-learn47 and XGBoost46 machine-learning libraries. The Materials Agnostic Platform for Informatics and Exploration (Magpie)4 was used to compute the attributes. Scikit-learn, XGBoost, and Magpie are available under open-source licenses. The software, training data sets and input files used in this work are provided in the Supplementary Information associated with this manuscript.