Introduction

Over the last 30 years, advances in computational materials science have led to tremendous successes in materials design, with dozens of computationally designed novel compounds,1,2 and on-demand availability of ab initio predicted properties.3 However, the materials discovery pipeline remains bottlenecked by the challenges of experimental synthesis, which can require months of trial and error before a novel compound can be made. At present, it remains difficult to determine how a predicted material can be synthesized in the laboratory, or whether it can be synthesized at all.4

Current approaches toward understanding and predicting materials synthesis include in situ X-ray diffraction (XRD) investigations,5,6 ab initio thermodynamic modeling,7,8,9 classical thermodynamics perspectives,4 and machine-learning-guided searches of synthesis parameters.10,11 Recently, exciting applications of machine-learning methods to retrosynthesis in organic chemistry have proven impactful,12,13,14 inspiring the application of similar methods to predict inorganic materials synthesis. These machine-learning investigations of organic synthesis reactions have been enabled by organic chemistry reaction databases, such as Reaxys, which include >12 million single-step reactions. There is currently no analogous database that comprehensively catalogs the reactions of inorganic materials syntheses. However, even limited databases of materials synthesis reactions can yield valuable insights into the relationships between synthesis parameters and reaction products, as exemplified by Kim et al.15,16,17 and others.11,18

To build a comprehensive inorganic materials synthesis database, synthesis procedures must be classified at high resolution on multiple levels: at a high level, the synthesis methodology; at an intermediate level, the individual experimental steps; and at a detailed level, the specific processing parameters. In principle, one could analyze sentence grammar and keywords to build a rule-based classification algorithm that identifies different types of synthesis procedures. However, this is impractical, owing both to the notorious ambiguity of natural language19,20,21 and to the complexity of solid-state chemistry terminology. Statistical classification algorithms, such as deep-learning neural networks,22,23 can achieve good text classification performance24 given large amounts of training data.25 However, no large annotated text data sets exist in materials science or chemistry to train on.

Recent advances in machine-learning research have demonstrated that semi-supervised learning methods can solve similar classification problems with much less annotated data than supervised learning methods require.26,27,28 Here, we present a semi-supervised machine-learning approach (one that uses a small amount of labeled data and a large amount of unlabeled data) for the accurate classification of synthesis procedures as described in written natural language. Using a body of 2,284,577 articles, we applied latent Dirichlet allocation (LDA)29 to identify, in an unsupervised manner, the experimental steps implied in sentences. These “experimental steps” are grouped as topics, and LDA provides a probabilistic topic distribution for each sentence. To this topic distribution, we apply the random decision forests (RF) algorithm30 (a supervised machine-learning method) to classify different types of synthesis procedures: solid-state synthesis, hydrothermal synthesis, sol–gel precursor synthesis, or none of the above. We demonstrate that the RF models can achieve high classification performance with training data sets as small as a few hundred paragraphs, which can be readily prepared by manual annotation efforts. By combining these unsupervised and supervised approaches, our machine-learning algorithm accurately captures the features and subtleties of different synthesis procedures, achieving high classification performance with results that are readily understood and interpretable by humans. Finally, we construct a machine-learned flowchart of synthesis procedures, which demonstrates that our method can build a “machine intuition” of materials synthesis procedures that goes beyond classification.

Results

Unsupervised learning of synthesis processes

Humans can categorize sentences into topics by recognizing familiar keywords. This objective is difficult to train a computer to achieve, because it is impractical to code explicit rules for the keywords of an English vocabulary that is both large (>10,000 words) and open to new materials science/chemistry terms. Furthermore, in natural language, various synonyms can be used to express the same topic, which introduces ambiguity and complexity into hard-coded rules. LDA29,31,32 is an unsupervised topic-modeling algorithm that observes common keywords over a large number of papers and automatically clusters synonymous keywords together into “topics”. We applied LDA to identify topics of synthesis from the scientific literature, and we demonstrate that the topical grouping is closely related to the conventional experimental classification of synthesis steps.

We first use LDA to identify topic–word distributions, which are a set of multinomial probability distributions over clusters of keywords, conditioned on certain topics. To demonstrate, in Table 1 we list two topics learned by LDA (a complete list of all 200 topics can be found in Table S1), together with representative sentences that we consider to discuss each topic. From a collection of thousands of unlabeled sentences, LDA learns topic–word distributions using a Bayesian inference method. As shown in the second column of Table 1, the keywords (words of highest probability) of each topic match the vocabulary often used by chemists to discuss that topic, making it possible for chemists to interpret the learned topics. For example, in Table 1 we interpret topic T1 as “(ball-)milling” and topic T2 as “high-temperature sintering”. We emphasize that the topic names, “(ball-)milling” and “sintering”, are assigned by us for the sake of convenience, and the choice of names does not affect the topic–word distributions learned by LDA.

Table 1 Two topics (topic–word distributions) selected from 200 topics learned by LDA using sentences in our database

For each sentence, LDA also infers a “document–topic” distribution, which quantifies the probability that each topic appears in the sentence. For example, in a sentence excerpted from our database, “the dried powders were calcined twice at 850 °C for 2 h and then ball milled again for 8 h”,33 39% and 60% of the words discuss the LDA-learned topics T1 and T2, respectively. LDA then interprets this sentence as having two topics, corresponding to the experimental steps “ball milling” and “sintering”. More examples can be found in Table S2. Using document–topic distributions, a computer is able to quantitatively identify the topics relevant to experimental steps in sentences, which are then used as input features for the synthesis procedure classifiers.
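
As a concrete illustration of these two distributions, the sketch below trains a toy LDA model and infers the document–topic distribution of a new sentence. It uses gensim rather than the Mallet implementation employed in this work, and the three-sentence corpus and two-topic setting are illustrative placeholders only:

```python
# Minimal LDA sketch (gensim stands in for Mallet; toy corpus, not our data).
from gensim import corpora, models

sentences = [
    ["dried", "powders", "calcined", "850", "2", "h", "ball", "milled", "8", "h"],
    ["mixture", "ground", "ball", "milled", "ethanol", "24", "h"],
    ["pellets", "sintered", "1200", "12", "h", "air"],
]  # tokenized sentences; the real corpus holds millions of sentences

dictionary = corpora.Dictionary(sentences)
corpus = [dictionary.doc2bow(s) for s in sentences]

N = 2  # 200 topics in this work; 2 suffices for the toy corpus
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=N,
                      alpha=[5.0 / N] * N, eta=0.01,  # priors as in Methods
                      passes=50, random_state=0)

# Topic-word distributions: top keywords of each learned topic (cf. Table 1)
for t in range(N):
    print(lda.show_topic(t, topn=5))

# Document-topic distribution for a new sentence (cf. Table S2)
bow = dictionary.doc2bow(["calcined", "850", "ball", "milled"])
print(lda.get_document_topics(bow, minimum_probability=0.05))
```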

Supervised classification of synthesis methodologies

LDA has now been used to automatically identify various topic–word distributions, which we labeled as specific experimental steps, for example sintering, grinding, etc. These individual steps are subprocesses of an overall synthesis methodology, such as solid-state synthesis, hydrothermal synthesis, or sol–gel precursor synthesis. Based on the topic distributions learned by LDA, the machine is next trained to classify to which of these three synthesis methodologies, if any, a synthesis paragraph corresponds.

To build the classifier, we use the random forest (RF) algorithm,30,34 a supervised machine-learning algorithm that uses an ensemble of decision trees to make classifications. We constructed a training set of synthesis paragraphs annotated by synthesis experts, consisting of 1000 training paragraphs for each of the three types of synthesis (solid-state, hydrothermal, and sol–gel precursor synthesis), as well as 3000 randomly sampled negative paragraphs from the database that do not describe any of the above three synthesis procedures. As input features for RF, we use “topic n-grams”,35 which represent sequences of LDA-derived topics in adjacent sentences within a paragraph. We used the scikit-learn Python package36 to construct learning curves to understand how much training data the RF algorithm needs.
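
The following sketch shows how such learning curves can be computed with scikit-learn; the random matrices are placeholders for the annotated topic-n-gram features and labels, so the scores it prints are not meaningful:

```python
# Learning-curve sketch: RF on (placeholder) topic n-gram indicator features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6000, 500))  # paragraphs x topic n-gram indicators
y = rng.integers(0, 4, size=6000)         # solid-state/hydrothermal/sol-gel/none

clf = RandomForestClassifier(n_estimators=20, max_depth=20, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    clf, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=10, scoring="f1_macro")

for n, f1 in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n} training paragraphs -> test F1 = {f1:.2f}")
```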

Figure 1a gives the learning curves of the RF algorithm, showing the F1 score versus the amount of training data. The RF algorithm reaches high F1 scores of ~90% when the training data set size exceeds 3000 paragraphs, but surprisingly, the models consistently converge to >80% F1 scores even when the training data set is as small as a few hundred paragraphs. Such training data sets are small enough to be readily prepared by manual annotation efforts, indicating that the LDA + RF method is a practicable machine-learning approach for classification problems of similar complexity. As summarized in Fig. 1b, the recall and precision scores are also >90%, signifying that our RF classification model is robust against false-positive and false-negative classification errors.

Fig. 1

a Learning curves of the RF model, demonstrating that the F1 score improves with more training data. The red plus and blue cross symbols represent model F1 scores evaluated on the training and test data sets, respectively. The shaded areas denote the standard deviations of the curves. The performance converges to high F1 scores with training data sets as small as a few hundred paragraphs. b Precision/recall/F1 scores of the RF model. The model was trained using 5000 training paragraphs and cross-validated using 1000 test paragraphs. Training paragraphs were randomly drawn from the annotated data set several times to calculate the standard deviation

The RF algorithm consists of an ensemble of similar decision trees, which ultimately vote together on the final synthesis classification. Using hyperparameter optimization, we determined that 20 RF trees give the best model performance (see Methods and Fig. S2). To visualize how our model classifies different types of synthesis procedures, we show in Fig. 2a one of the 20 learned decision trees in our RF model. The decision tree starts from the topmost node and branches into one of two child nodes according to whether certain topic n-grams exist in a paragraph, as defined by the criterion of each node. We highlight a representative branch from Fig. 2a in yellow and show the enlarged branch in Fig. 2b. For a paragraph that has topic “cooling-1” after topic “autoclaving” in two consecutive sentences, the decision tree changes its classification of the synthesis method from “none of the above” to the “hydrothermal” category. Because this “hydrothermal” node does not have any child nodes, no more decisions are made, and the decision tree predicts that the paragraph describes a hydrothermal synthesis procedure.
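
Individual trees of a fitted forest can be inspected directly, which is one way to produce views like Fig. 2. A sketch reusing the toy clf, X, and y from the learning-curve example above, with placeholder feature names standing in for human-readable topic n-grams:

```python
# Print the decision rules of one of the 20 trees in the fitted forest.
from sklearn.tree import export_text

clf.fit(X, y)  # clf, X, y as in the learning-curve sketch
feature_names = [f"topic_ngram_{i}" for i in range(X.shape[1])]  # placeholders
print(export_text(clf.estimators_[0], feature_names=feature_names, max_depth=3))
```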

Fig. 2

a One of the 20 decision trees learned by RF. b One particular branch, highlighted in a. Starting from the topmost node, a branch is taken when certain topic pairs exist in a paragraph. When no further branch can be taken, a terminal node predicts the type of synthesis. An RF classifier consists of many such trees and selects the majority of their predictions

In many ways, the RF algorithm classifies materials synthesis procedures much as a solid-state chemist would: by looking for patterns of experimental procedures. For example, “shake-and-bake” is a common pattern for solid-state synthesis. If a paragraph is organized as “mix the precursors and then sinter the mixture”, one would likely classify it as solid-state synthesis. This same classification decision can be found in our computer-generated decision trees, where each node contains a pattern of experimental steps (represented by LDA topic results), such as (“[ball-]milling” → “sintering”) in the third node of Fig. 2b. Moreover, because our model represents patterns of synthesis as topic pairs, we can study how words affect the detection of such patterns. As demonstrated in Fig. 2b, when a paragraph contains more keywords of the topics “(ball-)milling”, “(hot-)pelletizing”, and “annealing” (such as “milling”, “pressed”, and “annealed”) than keywords of the topics “sol formation” and “solution heating”, our model tends to predict solid-state synthesis instead of sol–gel precursor synthesis. In general, the decision trees largely resemble the underlying procedures of materials synthesis methods, explaining why the RF algorithm can automatically pick out human-understandable features and weight them accordingly.

Constructing a flowchart of synthesis procedures

In materials synthesis procedures, experimental steps do not appear randomly: they usually follow a certain procedural order, in patterns that are specific to the different types of synthesis methodologies. Similarly, LDA-learned topics do not appear in random sequences in written synthesis paragraphs. By data-mining the transition probability from one LDA topic to another between adjacent sentences, we can construct a Markov chain representation of how various experimental steps proceed into one another. We visualize these Markov chains as synthesis flowcharts, shown in Fig. 3, using a directed graph consisting of nodes and directed edges, where a node represents an experimental step and an edge represents a transition from one experimental step to another.

Fig. 3

Machine-learned flowchart showing the transitions between experimental steps for different types of synthesis. The topics associated with the nodes can be found in Table 2 and Table S1. Edges represent transitions from one step to another, and the arrows show transition directions. Double-lined edges represent transitions in both directions. A darker edge indicates a more-probable transition

The computer-generated flowchart in Fig. 3 largely summarizes the three types of synthesis procedures. The core experimental steps of each synthesis are recovered: for example, the experimental steps “mixing”, “(ball-)milling”, “(hot-)pelletizing”, and “sintering” (plus “cooling-2” and “annealing”) are all found in the solid-state synthesis category, which matches a chemist’s intuition of solid-state synthesis. The algorithm also learns important ordering information: for example, “(hot-)pelletizing” usually follows “(ball-)milling”, but “(ball-)milling” never follows “(hot-)pelletizing”. The edges between “sintering” and “(hot-)pelletizing” or “(ball-)milling” run in both directions, indicating that it is common practice in solid-state synthesis to regrind and re-pelletize sintered products. In addition, the algorithm automatically captures subtleties of syntheses: for example, “solution heating” is an intermediate step between “sol formation” and “sintering”, physically because gel-like precursor states form when the particle density in the colloid is increased by evaporating the liquid solvent; and “pH adjustment” is an optional step between “aqueous mixing” and “autoclaving”, as the formation of the final product sometimes, but not always, depends on specific pH values. Figure 3 reproduces common experimental processes from different synthesis procedures because LDA allows computers to understand individual experimental steps, and the Markov chain construction enables general procedural orderings to be learned as they were recorded in synthesis paragraphs.

Discussion

Much of the technical content in solid-state chemistry papers is locked up in the ambiguities of written natural language. Topic-modeling algorithms can teach computers to automatically elucidate structure and meaning from these complicated written texts. In this work, we combined unsupervised (LDA) and supervised (RF) machine-learning algorithms to accurately categorize different types of inorganic materials synthesis procedures by topic keywords. LDA can, without any human supervision, automatically learn the keywords associated with specific experimental steps in materials synthesis procedures, producing topic representations of sentences written in natural language. Using these topic representations, we applied RF algorithms to classify different synthesis methods with high accuracy, using a relatively modest number of manually annotated synthesis paragraphs. Finally, a Markov chain representation of synthesis processes enables the construction of flowcharts, which capture many of the subtleties involved in inorganic materials synthesis. Because little annotation effort is required, our machine-learning classifier can be readily scaled up to categorize and interpret the millions of solid-state chemistry papers in the scientific literature, which can then be data-mined and analyzed using large-scale informatics tools.

LDA helps achieve high classification performance by reducing the ambiguity of natural language. Oftentimes in English, one meaning can be expressed using different synonyms, and this ambiguity is very common in the synthesis literature. For example, “grinding” and “milling” are often used interchangeably in experimental descriptions. LDA is designed to solve this ambiguity problem by identifying the same topic (for example, topic “(ball-)milling” in Table 2) across different forms of expression. A major advantage of LDA is that it can learn topic representations without human input. This is in contrast to other NLP methods used in similar works, such as named-entity recognition (NER) or sentence dependency parsing,15,37 which are supervised classification models that must be trained on all the different synonyms carrying the same meaning. Such training is challenging owing to the limited availability of labeled text data sets in materials science, meaning there are not enough examples for supervised learning. Another risk is that neural networks trained to classify paragraphs have large numbers of parameters that can lead to overfitting, leaving them unable to classify paragraphs that use synonyms for synthesis processes not included in the training set.

Table 2 List of topics relevant to solid-state, hydrothermal and sol–gel synthesis procedures

One well-known limitation of LDA is its poor performance when modeling topics in short sentences or paragraphs.38 We observed some incorrect classification results for short paragraphs, but these occurrences are rare, as it is nearly impossible to describe a full synthesis procedure in only a few words, and short paragraphs can easily be filtered out by the length of their word sequences.

From the perspective of building an inorganic materials synthesis database, we argued that three levels of information are required: high-level classification of synthesis methodologies, intermediate-level experimental steps, and detailed-level processing parameters. We have shown that LDA is well poised to learn the high-level synthesis methodologies and the intermediate-level experimental steps. However, LDA is less capable of identifying the detailed-level processing parameters, because it is designed to model topics (collections of common objects, ideas, and facts31), whereas processing parameters appear as single words or phrases and need to be extracted using word-level algorithms such as NER. Nevertheless, LDA is capable of constraining the problem domain by clustering39 and smoothing40 documents, thus improving the performance of NER tasks.41,42

Good examples of mining materials synthesis parameters from journal articles have been shown previously by Kim et al.,15,16 who used NER to extract synthesis parameters and applied LDA as a post-processing analysis to cluster the chemistry of materials. Those algorithms are trained and evaluated on materials synthesis paragraphs without restriction to a specific domain. However, online journal articles describe a large variety of synthesis methodologies, such as the solid-state, hydrothermal, and sol–gel precursor syntheses studied in this work, in which different domain knowledge is implicitly assumed, such as the vocabulary describing experimental steps (Table 2) and the organization of these steps (Fig. 3). Proper consideration of this subtle domain knowledge is essential for machine learning to understand the synthesis literature at a higher resolution. Our semi-supervised approach automatically clusters paragraphs into small sub-domains of synthesis methodology, which provides a foundation for codifying domain knowledge and creating more sophisticated analyses of synthesis information.

Our semi-supervised machine-learning algorithms achieve high classification performance while being trained on data sets small enough to be manually annotated by individual experts. Although this work is a case study in classifying materials synthesis paragraphs, the applicability of our method is general. For example, our method could also be used to extract materials characterization information, which is a valuable text source for identifying the phases of synthesized materials. There are undoubtedly further opportunities to apply topic-modeling methods to extract other important data and concepts from scientific articles published in materials science and other fields. We believe that this work gives a blueprint for how written information, contained in the large body of published literature, can be extracted and made machine-interpretable.

Methods

The scientific articles used in this work are journal publications from Springer, Wiley, Elsevier, the Royal Society of Chemistry, and the Electrochemical Society, from which we received permission to download articles in bulk. For each publisher, we manually identified all materials-science-related journals available for download. A web-scraping engine was built using scrapy (https://scrapy.org/). Only full-text articles published after 2000 were downloaded, including metadata such as journal name, article title, article abstract, authors, etc. All data were stored in a document-oriented database implemented using a MongoDB (https://www.mongodb.com/) database instance. Because the downloaded articles are in HTML/XML format, which contains irrelevant markup and stylesheets, we developed a customized library for parsing article markup strings into text paragraphs while keeping the structure of the paper and its section headings. The current snapshot of the database contains 2,284,577 papers, from which we used the 3,210,525 paragraphs in the experimental sections of the papers to conduct this research. The experimental sections were identified by case-insensitive keyword matching in section headings. (These keywords are “experiment”, “synthesis”, and their morphological derivations.)
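
A minimal sketch of this heading filter is shown below; the regex is our illustrative guess at how “experiment”, “synthesis”, and their morphological derivations might be matched:

```python
# Identify experimental sections by case-insensitive keyword matching on
# section headings (illustrative pattern, not the exact one used).
import re

# "synthe" matches synthesis, syntheses, synthesized, synthetic, ...
HEADING_PATTERN = re.compile(r"experiment|synthe", re.IGNORECASE)

def is_experimental_section(heading: str) -> bool:
    return bool(HEADING_PATTERN.search(heading))

print(is_experimental_section("2. Experimental Procedures"))  # True
print(is_experimental_section("Synthesis of nanoparticles"))  # True
print(is_experimental_section("Results and Discussion"))      # False
```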

Plain-text paragraphs were segmented into sentences and tokenized into words using the ChemDataExtractor tokenizer,43 which is purposely trained on a scientific corpus to handle abbreviations, chemical formulas, etc. Lemmatization preprocessing35 was not performed, in order to keep the distinct meanings of different word forms, such as the verb “fired” and the noun “fire”. Common English stop-words serving as grammatical function words, such as “the”, “be”, “on”, and “that”, were removed from each sentence.
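
A sketch of this preprocessing, assuming ChemDataExtractor’s document API (chemdataextractor.doc.Paragraph) and a placeholder stop-word list:

```python
# Sentence segmentation + tokenization with ChemDataExtractor, then
# stop-word removal; no lemmatization, so "fired" and "fire" stay distinct.
from chemdataextractor.doc import Paragraph

STOP_WORDS = {"the", "be", "was", "were", "on", "that", "and", "at", "for"}  # placeholder

text = "The dried powders were calcined twice at 850 °C for 2 h and then ball milled again for 8 h."
for sentence in Paragraph(text).sentences:
    tokens = [tok.text for tok in sentence.tokens]
    print([w for w in tokens if w.lower() not in STOP_WORDS])
```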

We used the Mallet package32 to train the LDA topic models. Two parameters, α and β, which control the Dirichlet prior distributions of the topics and the words, respectively, were set to α = 5/N and β = 0.01, where N is the number of topics. An inappropriate setting of the number of topics degrades the quality of the topics learned by LDA. By maximizing the LDA model likelihood,29 we found that setting the number of topics to N = 200 produces the best performance of the LDA model without overfitting, as demonstrated in Fig. S1.
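
One way to drive Mallet from Python is through gensim’s wrapper (available in gensim ≤ 3.8); the sketch below reuses the toy corpus and dictionary from the earlier LDA sketch, and the Mallet path is a placeholder:

```python
# Train an LDA model with Mallet via gensim's wrapper (gensim <= 3.8).
from gensim.models.wrappers import LdaMallet

N = 200  # number of topics
lda = LdaMallet(
    mallet_path="/path/to/mallet/bin/mallet",  # placeholder path
    corpus=corpus, id2word=dictionary,         # from the earlier LDA sketch
    num_topics=N,
    alpha=5,  # Mallet sums alpha over topics, so alpha_k = 5/N as in the text
)             # Mallet's default beta of 0.01 matches the setting above
print(lda.show_topic(0, topn=10))  # top keywords of topic 0 (cf. Table 1)
```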

We used the RF module in the scikit-learn Python package36 to train the classification models. The “topic n-gram” features are created as indicator variables for n-topic tuples in consecutive sentences, (Ti, Ti+1, …, Ti+n−1), where each Ti is a topic appearing in the i-th sentence with probability > 0.05 and n denotes the length of the tuple; we used 1 ≤ n ≤ 3 in our study.
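
A sketch of one plausible reading of this featurization, in which a sentence can carry several topics with probability > 0.05 and each combination across n consecutive sentences becomes an indicator feature:

```python
# Build the set of topic n-gram indicator features for one paragraph.
from itertools import product

def topic_ngrams(sentence_topics, max_n=3):
    """sentence_topics: one list per sentence of the topics with p > 0.05;
    returns all topic tuples (T_i, ..., T_{i+n-1}) for 1 <= n <= max_n."""
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(sentence_topics) - n + 1):
            # one topic from each of the n consecutive sentences
            features.update(product(*sentence_topics[i:i + n]))
    return features

paragraph = [["mixing"], ["(ball-)milling", "sintering"], ["cooling-2"]]
print(topic_ngrams(paragraph))
# includes e.g. ("(ball-)milling", "cooling-2") as a 2-gram feature
```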

The training data set was annotated by synthesis experts in our research group and consists of 1000 training paragraphs for each of the three types of synthesis (solid-state, hydrothermal, and sol–gel precursor synthesis), as well as 3000 randomly sampled negative paragraphs from the database that do not contain any of the above three synthesis procedures. We annotated the data set according to a list of self-consistent definitions developed by us; these definitions can be found in the supplementary material. In total, 6000 annotated paragraphs were obtained. When developing this annotated data set, we found it important to use as few annotators as possible, as a large number of annotators led to inconsistencies in annotation due to varying interpretations of what delineates each synthesis method. Part of this ambiguity of the annotation task is intrinsic: in solid-state chemistry, there are no formal definitions of the different synthesis methodologies, and hybrids of different methods are sometimes used. The issues with annotation are described in detail in the supplementary material. We used 10-fold cross-validation to test the robustness of our model, running it 20 times to estimate the standard deviations of the performance scores. In each run, the training data set contains 5000 samples and the test data set contains 1000 samples. We did not use a development data set because we found that the model performance is nearly independent of the hyperparameters once the number of trees is ≥15 and the maximum depth of trees is ≥15, as demonstrated by the grid-search hyperparameter optimization in Fig. S2. Thus, we set the number of trees to 20 and the maximum depth of trees to 20 in all RF training.
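
A sketch of this evaluation protocol with scikit-learn, again with random placeholders standing in for the 6000 annotated paragraphs:

```python
# 10-fold cross-validation of the RF classifier, reporting the spread of
# precision/recall/F1 (placeholder data; see the learning-curve sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6000, 500))
y = rng.integers(0, 4, size=6000)

clf = RandomForestClassifier(n_estimators=20, max_depth=20, random_state=0)
scores = cross_validate(clf, X, y, cv=10,
                        scoring=("precision_macro", "recall_macro", "f1_macro"))
for key in ("test_precision_macro", "test_recall_macro", "test_f1_macro"):
    print(key, scores[key].mean(), "+/-", scores[key].std())
```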

To generate Fig. 3, we obtained the sentence topics with probability > 0.05 in our annotated data set of paragraphs and counted the topic pairs in adjacent sentences, such as “mixing → sintering”. By collecting all topic pairs, we can compute the probability that one topic follows another. This allows us to order a collection of topics into a Markov chain, which can be visualized as a directed graph in which each node is a topic and each edge is a topic pair. We weighted the edges by the normalized frequencies of topic pairs observed in paragraphs. Edges with lower occurrence frequencies were plotted with a more transparent stroke in Fig. 3, and edges with occurrence frequencies lower than 0.3 were removed from the figure.
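
A sketch of this construction with networkx; the toy paragraphs and the normalization by the most frequent pair are our assumptions, as the exact normalization scheme is not specified above:

```python
# Count adjacent-sentence topic pairs, normalize, and keep edges >= cutoff.
from collections import Counter
import networkx as nx

CUTOFF = 0.3  # normalized-frequency threshold used for Fig. 3

paragraphs = [
    [["mixing"], ["(ball-)milling"], ["sintering"]],
    [["mixing"], ["(ball-)milling"], ["sintering"], ["(ball-)milling"]],
]  # toy data: per-paragraph lists of per-sentence topics (p > 0.05)

pairs = Counter()
for para in paragraphs:
    for s1, s2 in zip(para, para[1:]):
        pairs.update((a, b) for a in s1 for b in s2)

top = max(pairs.values())  # assumed normalization: most frequent pair = 1.0
G = nx.DiGraph()
for (a, b), count in pairs.items():
    if count / top >= CUTOFF:
        G.add_edge(a, b, weight=count / top)

print(sorted(G.edges(data=True)))  # directed, weighted flowchart edges
```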