
Improving chemical reaction yield prediction using pre-trained graph neural networks

Abstract

Graph neural networks (GNNs) have proven to be effective in the prediction of chemical reaction yields. However, their performance tends to deteriorate when they are trained on a training dataset that is insufficient in terms of quantity or diversity. A promising solution to alleviate this issue is to pre-train a GNN on a large-scale molecular database. In this study, we investigate the effectiveness of GNN pre-training in chemical reaction yield prediction. We present a novel GNN pre-training method for performance improvement. Given a molecular database consisting of a large number of molecules, we calculate molecular descriptors for each molecule and reduce the dimensionality of these descriptors by applying principal component analysis. We define a pre-text task by assigning a vector of principal component scores as the pseudo-label to each molecule in the database. A GNN is then pre-trained to perform the pre-text task of predicting the pseudo-label for the input molecule. For chemical reaction yield prediction, a prediction model is initialized using the pre-trained GNN and then fine-tuned with the training dataset containing chemical reactions and their yields. We demonstrate the effectiveness of the proposed method through experimental evaluation on benchmark datasets.

Introduction

A chemical reaction is a process in which reactants are changed into products through chemical transformations. The percentage of products obtained relative to the reactants consumed is referred to as the chemical reaction yield. Predicting chemical reaction yields provides clues for exploring high-yield chemical reactions without conducting direct experiments. This is crucial for accelerating synthesis planning in organic chemistry by significantly reducing time and cost. Machine learning has been actively utilized for the fast and accurate prediction of chemical reaction yields in a data-driven manner [1,2,3,4,5,6,7,8].

Recently, deep learning has shown remarkable performance in predicting chemical reaction yields by effectively modeling the intricate relationships between chemical reactions and their yields using neural networks. Schwaller et al. [6, 7] represented a chemical reaction as a series of simplified molecular-input line-entry system (SMILES) strings and built a bidirectional encoder representations from transformers (BERT) model as the prediction model. Kwon et al. [8] represented a chemical reaction as a set of molecular graphs and built a graph neural network (GNN) that operates directly on the molecular graphs as the prediction model. The use of GNNs led to a significant improvement in predictive performance owing to their high expressive power on molecular graphs [9, 10].

Despite its effectiveness, the predictive performance of a GNN can suffer when it is trained on an insufficient training dataset in terms of quantity or diversity. For example, a GNN may not generalize well to query reactions involving substances that are not considered in the training dataset. Although the performance can be significantly improved by securing a large-scale training dataset, this is difficult in practice because of the high cost associated with conducting direct experiments to acquire the yields for a large number of chemical reactions.

To alleviate this issue, a promising solution is to pre-train a GNN on a large-scale molecular database and then adapt it to chemical reaction yield prediction. Various pre-training methods have been studied in the literature, which can be categorized into contrastive learning and pre-text task approaches [11, 12]. The contrastive learning approach pre-trains a GNN by learning molecular representations such that different views of the same molecule are mapped close together, and views of different molecules are mapped far apart [13,14,15,16,17,18]. Most existing methods based on this approach have utilized data augmentation techniques to generate different views of each molecule; however, data augmentation may alter the properties of the molecules being represented [19, 20]. The pre-text task approach acquires pseudo-labels of molecules and pre-trains a GNN to predict them [21,22,23,24,25]. Existing methods have attempted to define appropriate pre-text tasks in various ways to effectively learn molecular representations, but the process of acquiring pseudo-labels can be costly and time-consuming depending on how the pre-text task is defined. Because both approaches have their own advantages and drawbacks, it is important to choose the pre-training method that best aligns with the objective of the specific downstream task to be addressed.

In this study, we propose a novel pre-training method, MolDescPred, to improve the performance of chemical reaction yield prediction. MolDescPred follows the pre-text task approach to pre-train a GNN. Given a molecular database containing a substantial number of molecules, we calculate molecular descriptors for the molecules and reduce their dimensionality by applying principal component analysis (PCA). Each molecule is then pseudo-labeled with a vector of its principal component scores. The GNN is pre-trained to predict the pseudo-label of its input molecule. For chemical reaction yield prediction, a prediction model is initialized using the pre-trained GNN and is then fine-tuned with a training dataset composed of chemical reactions and their corresponding yields. Through experiments on benchmark datasets, we demonstrate the effectiveness of the proposed method compared to existing methods, especially when the training dataset is insufficient.

Method

Problem definition

For chemical reaction yield prediction, we aim to build an accurate prediction model f which takes a chemical reaction \((\mathcal {R}, \mathcal {P})\) as the input to predict the yield y by learning from the training dataset \(\mathcal {D}=\{(\mathcal {R}_i, \mathcal {P}_i, y_i)\}_{i=1}^N\). Given a query chemical reaction \((\mathcal {R}_*, \mathcal {P}_*)\), the prediction model f can be used to make a prediction for the yield \(y_*\) as:

$$\begin{aligned} \hat{y}_*=f(\mathcal {R}_*, \mathcal {P}_*). \end{aligned}$$
(1)

It should be noted that additional information, such as the operating conditions for chemical reactions, can be utilized as extra input for the model f. If we denote this additional information by \(\mathcal {Z}\), the problem can be formulated as learning the model f from the dataset \(\mathcal {D}'=\{(\mathcal {R}_i, \mathcal {P}_i,\mathcal {Z}_i, y_i)\}_{i=1}^N\). The input and output of the model f can be described as:

$$\begin{aligned} \hat{y}_*=f(\mathcal {R}_*, \mathcal {P}_*, \mathcal {Z}_*). \end{aligned}$$
(2)

The data representation used for the prediction model f is as follows. In a chemical reaction \((\mathcal {R}, \mathcal {P})\), \(\mathcal {R}\) and \(\mathcal {P}\) denote the sets of reactants and products, respectively. The set \(\mathcal {R}=\{\mathcal {G}^{\mathcal {R},1},\ldots ,\mathcal {G}^{\mathcal {R},m}\}\) contains m reactant molecules represented as molecular graphs, where m can vary for each reaction. The set \(\mathcal {P}=\{\mathcal {G}^{\mathcal {P}}\}\) contains a single molecular graph representing a product molecule. Each molecular graph \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\) represents the topology of a molecule. Here, \(\mathcal {V}\) and \(\mathcal {E}\) are the sets of nodes and edges associated with heavy atoms and their chemical bonds within the molecule. Hydrogen atoms are implicitly handled as node features of their neighboring heavy atoms. Each node vector \(\textbf{v}^j\in \mathcal {V}\) denotes the node features regarding the j-th heavy atom in a molecule, including the atom type, formal charge, degree, hybridization, number of adjacent hydrogens, valence, chirality, associated ring sizes, whether it accepts or donates electrons, whether it is aromatic, and whether it is in a ring. Each edge vector \(\textbf{e}^{j,k}\in \mathcal {E}\) denotes the edge features regarding the chemical bond between j-th and k-th heavy atoms, including the bond type, stereochemistry, whether it is in a ring, and whether it is conjugated.
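
To make this graph representation concrete, the following sketch extracts a simplified subset of the listed node and edge features from a SMILES string using RDKit. It is an illustrative assumption rather than the featurization code used in the study; the actual feature set and encoding differ.

```python
from rdkit import Chem

def mol_to_graph(smiles: str):
    """Convert a SMILES string into simple node/edge feature lists.

    Hydrogens stay implicit, so only heavy atoms become nodes; only a subset
    of the features listed above is extracted here.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")

    nodes = [{
        "symbol": atom.GetSymbol(),
        "formal_charge": atom.GetFormalCharge(),
        "degree": atom.GetDegree(),
        "hybridization": str(atom.GetHybridization()),
        "num_hs": atom.GetTotalNumHs(),
        "is_aromatic": atom.GetIsAromatic(),
        "in_ring": atom.IsInRing(),
    } for atom in mol.GetAtoms()]

    edges = [{
        "begin": bond.GetBeginAtomIdx(),
        "end": bond.GetEndAtomIdx(),
        "bond_type": str(bond.GetBondType()),
        "stereo": str(bond.GetStereo()),
        "is_conjugated": bond.GetIsConjugated(),
        "in_ring": bond.IsInRing(),
    } for bond in mol.GetBonds()]

    return nodes, edges
```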

Fig. 1 Three-phase procedure for training the prediction model with MolDescPred: (a) Molecular descriptors embedded in a reduced dimensionality are assigned as pseudo-labels to molecules in the pre-training dataset; (b) A GNN is pre-trained to predict the pseudo-label of each molecule in the pre-training dataset; (c) After initializing the GNN parameters with the pre-trained ones, the prediction model is fine-tuned using the training dataset for the target task

The objective of this study is to improve the performance of the prediction model f, especially in scenarios where the training dataset \(\mathcal {D}\) lacks sufficient quantity or diversity. To achieve this, the proposed method MolDescPred employs a three-phase procedure for training the prediction model, as illustrated in Fig. 1. In the first phase, we define a pre-text task based on molecular descriptors using a large molecular database. In the second phase, we pre-train a GNN from the pre-text task. In the third phase, we incorporate the pre-trained GNN as part of the model f and fine-tune the model f using the training dataset \(\mathcal {D}\). We provide a detailed description of each phase in the following subsections.

Pre-text task based on molecular descriptors

Fig. 2 Procedure of acquiring pseudo-labels for defining a pre-text task

Molecular descriptors are numerical representations of the chemical information of a molecule derived through logical and mathematical procedures [26]. Molecular descriptors have been commonly used as inputs for prediction models in a wide range of molecular property prediction tasks [27,28,29,30]. In contrast, we utilize molecular descriptors to define a pre-text task for pre-training a GNN. Specifically, molecular descriptors embedded in a reduced dimensionality are used as pseudo-labels for the molecules. Fig. 2 illustrates the procedure of acquiring the pseudo-labels for defining a pre-text task.

Given a molecular database containing a substantial number of molecules, denoted as \(\mathcal {S}=\{\mathcal {G}_i\}_{i=1}^M\), we calculate the molecular descriptors using the Mordred calculator [31]. The calculator generates 1,826 molecular descriptors per molecule, comprising 1,613 2D and 213 3D descriptors, by leveraging a wide range of chemical and structural properties; detailed information about the descriptors can be found in [31]. These descriptors can be computed efficiently and scale well to large molecules. We exclude the 3D descriptors, assuming that molecular geometry information is not available in the database. For each molecular graph \(\mathcal {G}\), a p-dimensional vector of molecular descriptors \(\textbf{d} \in \mathbb {R}^p\) is obtained as:

$$\begin{aligned} \textbf{d} = (d_{1}, \ldots , d_{p}) = \text {Mordred}(\mathcal {G}). \end{aligned}$$
(3)

The molecular descriptor vector \(\textbf{d}\) is high-dimensional and contains redundant information and noise. Thus, we apply PCA to reduce the dimensionality while preserving most of the original information [32]. The primary idea of PCA is to create new features, formed through linear combinations of the original molecular descriptors, with the objective of ensuring that these new features explain most of the variance in the molecular descriptors and are uncorrelated with each other. The objective is accomplished by eigendecomposition of the covariance matrix of the molecular descriptors calculated on \(\mathcal {S}\). This yields q eigenvectors \(\textbf{u}_1, \ldots , \textbf{u}_q\), called principal components, corresponding to the largest eigenvalues \(\lambda _1, \ldots , \lambda _q\). The j-th eigenvalue \(\lambda _j\) represents the variance explained by the j-th principal component \(\textbf{u}_j\). To obtain a reduced q-dimensional vector (\(q<p\)), we project the original vector \(\textbf{d}\) onto the q principal components as:

$$\begin{aligned} \textbf{z} = (z_{1}, \ldots , z_{q}) = (\textbf{u}_1^T\textbf{d}, \ldots , \textbf{u}_q^T\textbf{d}), \end{aligned}$$
(4)

where \(z_{j}\) is the principal component score of \(\textbf{d}\) obtained using the j-th principal component.

We establish a pre-text task by assigning each vector \(\textbf{z}_i\) as a pseudo-label to the corresponding molecular graph \(\mathcal {G}_i\). Subsequently, the pre-training dataset is formed as \(\tilde{\mathcal {S}}=\{(\mathcal {G}_i, \textbf{z}_i)\}_{i=1}^M\).
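
A minimal sketch of this pseudo-labeling step, using the Mordred calculator and scikit-learn, is shown below. The descriptor filtering is simplified relative to the procedure detailed in the Implementation section, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from mordred import Calculator, descriptors
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def build_pseudo_labels(smiles_list, q=40):
    """Compute 2D Mordred descriptors and reduce them to q principal component scores."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    calc = Calculator(descriptors, ignore_3D=True)        # 2D descriptors only
    desc = calc.pandas(mols).apply(pd.to_numeric, errors="coerce")

    # drop descriptors with missing values or no variance (simplified filtering)
    desc = desc.dropna(axis=1)
    desc = desc.loc[:, desc.std() > 0]

    # standardize each descriptor, then project onto the top-q principal components
    x = StandardScaler().fit_transform(desc.values)
    pca = PCA(n_components=q)
    z = pca.fit_transform(x)                               # pseudo-labels z_i, shape (M, q)
    return z.astype(np.float32), pca.explained_variance_   # scores and eigenvalues lambda_j
```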

Pre-training of graph neural network

GNNs have shown remarkable performance in various prediction tasks in chemistry [9, 10]. GNNs are designed to operate directly on molecular graphs, enabling them to learn informative representations by effectively capturing complex relationships within molecular graphs. Among the various GNN architectures, we employ the graph isomorphism network (GIN) owing to its high expressive power when applied to molecular graphs and its widespread usage in the literature for the pre-training of GNNs [11, 33]. Specifically, we adapt a variant of the GIN proposed by Hu et al. [21] which incorporates edge features into the input representation.

The GNN processes an input molecular graph \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\) as follows. Each node vector \(\textbf{v}^j\in \mathcal {V}\) and edge vector \(\textbf{e}^{j,k}\in \mathcal {E}\) is embedded into the initial node and edge embeddings \(\textbf{h}_v^{j, (0)}\) and \(\textbf{h}_e^{j,k}\) using the initial node and edge embedding functions \(\phi _n\) and \(\phi _e\), respectively, as:

$$\begin{aligned}&\textbf{h}_v^{j, (0)} = \phi _n(\textbf{v}^j); \end{aligned}$$
(5)
$$\begin{aligned}&\textbf{h}_e^{j,k} = \phi _e(\textbf{e}^{j,k}). \end{aligned}$$
(6)

Here, \(\phi _n\) and \(\phi _e\) are parameterized as neural networks. We then use L message passing layers to iteratively update the node embeddings by aggregating information from the neighboring nodes. At the l-th layer (\(l=1,\ldots ,L\)), each node embedding \(\textbf{h}_v^{j, (l)}\) is updated as:

$$\begin{aligned} \textbf{h}_v^{j, (l)} = \psi ^{(l)}\left( \textbf{h}_v^{j, (l-1)} + \sum _{k|\textbf{e}^{j,k}\in \mathcal {E}} \text {ReLU}(\textbf{h}_v^{k, (l-1)}+\textbf{h}_e^{j,k})\right) . \end{aligned}$$
(7)

Here, \(\psi ^{(l)}\) is the l-th node embedding function parameterized as a neural network. The final node embeddings \(\textbf{h}_v^{j, (L)}\) are combined via average pooling to extract a graph embedding \(\textbf{h}_g\) as:

$$\begin{aligned} \textbf{h}_g = \frac{1}{|\mathcal {V}|}\sum _{j|\textbf{v}^j\in \mathcal {V}}\textbf{h}_v^{j,(L)}. \end{aligned}$$
(8)

Finally, the graph embedding \(\textbf{h}_g\) is processed using a projection function r to obtain a graph-level molecular representation vector \(\textbf{h}\) as:

$$\begin{aligned} \textbf{h} = r(\textbf{h}_g). \end{aligned}$$
(9)
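
The node update of Eq. (7) and the readout of Eq. (8) can be sketched in PyTorch as below. The sparse edge-index layout, tensor shapes, and layer width are assumptions made for illustration; the released implementation may organize the computation differently.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One message passing layer in the spirit of Eq. (7)."""

    def __init__(self, dim: int = 300):
        super().__init__()
        # psi^(l): a two-layer fully-connected network with ReLU units
        self.psi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU())

    def forward(self, h_v, h_e, edge_index):
        # h_v: (num_nodes, dim) node embeddings, h_e: (num_edges, dim) edge embeddings
        # edge_index: (2, num_edges); row 0 is the center node j, row 1 its neighbor k,
        # with every bond listed in both directions
        j_idx, k_idx = edge_index
        msg = torch.relu(h_v[k_idx] + h_e)                      # ReLU(h_v^k + h_e^{j,k})
        agg = torch.zeros_like(h_v).index_add_(0, j_idx, msg)   # sum over the neighbors of j
        return self.psi(h_v + agg)

def readout(h_v_final):
    """Eq. (8): average pooling of the final node embeddings of one molecule."""
    return h_v_final.mean(dim=0)
```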

Fig. 3 Model architecture for pre-training of GNN

In the pre-training of the GNN based on the pre-text task, we use an auxiliary prediction head to further process the graph-level molecular representation vector \(\textbf{h}\) to obtain the prediction of the pseudo-label \(\hat{\textbf{z}}\). It should be noted that the prediction head is used only during the pre-training phase. Fig. 3 illustrates the model architecture for the pre-training of the GNN.

Given the pre-training dataset for the pre-text task \(\tilde{\mathcal {S}}=\{(\mathcal {G}_i, \textbf{z}_i)\}_{i=1}^M\), the GNN and prediction head are jointly trained using the loss function \(\tilde{\mathcal {L}}\) defined as:

$$\begin{aligned} \tilde{\mathcal {L}}(\textbf{z}, \hat{\textbf{z}}) = \frac{1}{q}\sum _{j=1}^q \lambda _j{(z_{j}-\hat{z}_{j})}^2, \end{aligned}$$
(10)

where \(\lambda _j\) denotes the j-th eigenvalue obtained from the PCA, so that pseudo-label dimensions explaining more variance are weighted more heavily.
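
A sketch of this eigenvalue-weighted loss in PyTorch, assuming the eigenvalues are kept as a fixed tensor computed during the PCA step:

```python
import torch

def pretext_loss(z, z_hat, eigenvalues):
    """Eq. (10): per-dimension squared error weighted by the corresponding PCA
    eigenvalue, averaged over the q dimensions and the batch."""
    # z, z_hat: (batch, q); eigenvalues: (q,)
    return (eigenvalues * (z - z_hat) ** 2).mean()
```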

Fine-tuning of prediction model

To build the prediction model f for chemical reaction yield prediction, we adapt the model architecture and learning objective presented in Kwon et al.’s study [8], except that we use the GIN architecture for the GNN component in the model [34]. The model f takes a chemical reaction \((\mathcal {R}, \mathcal {P})\) and outputs the predictive mean \(\hat{\mu }\) and variance \(\hat{\sigma }^2\) for the yield y as:

$$\begin{aligned} (\hat{\mu }, \hat{\sigma }^2)=f(\mathcal {R}, \mathcal {P}). \end{aligned}$$
(11)

The prediction model f consists of two main components, as illustrated in Fig. 4. First, a GNN processes each molecular graph within the input chemical reaction to obtain a molecular representation vector. Second, a prediction head integrates all molecular representation vectors to make a final prediction. To leverage prior knowledge acquired by learning the pre-text task, we initialize the GNN using the parameters obtained from the pre-training phase.
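
A sketch of this two-component architecture is given below. The way the molecular representation vectors are combined (summing the reactant vectors and concatenating the product vector), the head sizes, and the dropout rate are illustrative assumptions rather than the exact design of the referenced model.

```python
import torch
import torch.nn as nn

class YieldModel(nn.Module):
    """GNN encoder shared across molecules plus a head predicting (mean, variance)."""

    def __init__(self, gnn: nn.Module, dim: int = 1024):
        super().__init__()
        self.gnn = gnn  # initialized from the pre-trained GNN
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.PReLU(), nn.Dropout(0.1),
            nn.Linear(dim, 2),  # outputs (mu_hat, log sigma_hat^2)
        )

    def forward(self, reactant_graphs, product_graph):
        r = torch.stack([self.gnn(g) for g in reactant_graphs], dim=0).sum(dim=0)
        p = self.gnn(product_graph)
        mu, log_var = self.head(torch.cat([r, p], dim=-1)).unbind(dim=-1)
        return mu, torch.exp(log_var)  # predictive mean and variance of the yield
```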

For training of the model f, the parameters of the GNN component are initialized using the pre-trained GNN from the previous subsection, while the remaining parameters are randomly initialized. We are provided with a training dataset for the target task \(\mathcal {D}=\{(\mathcal {R}_i, \mathcal {P}_i, y_i)\}_{i=1}^N\), which comprises N chemical reactions and their yields. The prediction model f is fine-tuned using the loss function \(\mathcal {L}\) as described in the referenced study [8]:

$$\begin{aligned} \mathcal {L}(y, \hat{\mu }, \hat{\sigma }^2)=(1-\alpha ){(y-\hat{\mu })}^2+\alpha \left[ \frac{(y-\hat{\mu })^2}{{\hat{\sigma }}^2}+\log {\hat{\sigma }}^2 \right] , \end{aligned}$$
(12)

where the first and second terms are associated with the losses under the homoscedastic and heteroscedastic assumptions, respectively, and \(\alpha\) is the hyperparameter that controls the relative strength of the two terms.
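
A minimal sketch of Eq. (12) in PyTorch; the value of alpha below is a placeholder, not the setting used in the experiments.

```python
import torch

def yield_loss(y, mu, var, alpha=0.1):
    """Eq. (12): (1 - alpha) * squared error plus alpha * heteroscedastic term."""
    sq_err = (y - mu) ** 2
    hetero = sq_err / var + torch.log(var)
    return ((1.0 - alpha) * sq_err + alpha * hetero).mean()
```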

Fig. 4 Model architecture for chemical reaction yield prediction [8]. The GNN has the GIN architecture.

Experiments

Datasets

For pre-training, we used a subset of 10 million molecules extracted from the PubChem database, as provided by Chithrananda et al.’s study [35]. In the experiments, we excluded molecules that did not pass the sanity check in RDKit [36]. The molecules contained 25.18 heavy atoms on average, with a range of 1 to 891.

For chemical reaction yield prediction, we used two benchmark datasets, Buchwald-Hartwig [2] and Suzuki-Miyaura [37], which have been commonly used in previous studies to evaluate the performance of prediction models [6,7,8]. The Buchwald-Hartwig dataset was constructed through high-throughput experiments on the class of Pd-catalyzed Buchwald-Hartwig C-N cross-coupling reactions. It consisted of 3,955 chemical reactions and their experimentally measured yields. These reactions were generated by combining 15 aryl halides, 4 ligands, 3 bases, and 23 additives. Each chemical reaction involved 6 reactants \((m=6)\). Similarly, the Suzuki-Miyaura dataset was constructed through high-throughput experiments on the class of Suzuki-Miyaura cross-coupling reactions. The chemical reactions were generated by combinations of 15 couplings of electrophiles and nucleophiles, 12 ligands, 8 bases, and 4 solvents, resulting in a total of 5,760 chemical reactions along with their yields. The number of reactants in each chemical reaction m ranged from 6 to 14. The detailed operating conditions of the reactions, including temperature and pressure, were not reported in either of the benchmark datasets.

We evaluated the performance of the prediction model f in two different scenarios of insufficiency in the training dataset. In the quantity aspect, we utilized various training/test split ratios (70/30, 50/50, 30/70, 20/80, 10/90, 5/95, and 2.5/97.5) for both the Buchwald-Hartwig and Suzuki-Miyaura datasets. To obtain these splits, we used 10 random shuffles provided by Ahneman et al.’s study [2] for the Buchwald-Hartwig dataset and Schwaller et al.’s study [6] for the Suzuki-Miyaura dataset. In the diversity aspect, we used 4 out-of-sample training/test splits of the Buchwald-Hartwig dataset provided by Ahneman et al.’s study [2].

Implementation

In the phase of defining the pre-text task, we calculated 1,613 2D molecular descriptors for each molecule using the Mordred calculator [31]. The list of these 2D descriptors is provided in Additional file 1: Table S1. After eliminating descriptors with more than 10 missing values or with identical values across all molecules, 846 molecular descriptors remained \((p=846)\). All molecules with missing descriptors were excluded. Each molecular descriptor was standardized to have a mean of zero and a standard deviation of one. We then applied PCA to reduce the dimensionality of the molecular descriptors. We set the dimensionality q to 40, which corresponds to an explained variance of 70%. Additional file 1: Fig S1 shows the explained variance according to the reduced dimensionality determined by the number of principal components. Additional file 1: Fig S2 visualizes the principal components in relation to the original molecular descriptors, where each principal component involved a different mixture of all molecular descriptors. After dimensionality reduction, each dimension was clipped to within ±10 times its standard deviation and then re-standardized.
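
The dimensionality selection and post-processing described above can be sketched as follows; this is a simplified illustration operating on already-standardized descriptors, not the released preprocessing code.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_descriptors(x_std, target_variance=0.70):
    """Keep the smallest q whose cumulative explained variance reaches the target,
    clip scores to +/-10 standard deviations, and re-standardize each dimension."""
    pca = PCA().fit(x_std)
    q = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), target_variance)) + 1
    z = pca.transform(x_std)[:, :q]
    std = z.std(axis=0, keepdims=True)
    z = np.clip(z, -10.0 * std, 10.0 * std)
    z = (z - z.mean(axis=0)) / z.std(axis=0)
    return z, pca.explained_variance_[:q]   # pseudo-labels and their eigenvalues
```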

In the pre-training phase, we used a three-layer GIN architecture \((L=3)\) for the GNN. For the initial node and edge embedding functions \(\phi _n\) and \(\phi _e\), we used one-layer fully-connected neural networks with 300 ReLU units and 300 linear units, respectively. For the node embedding function \(\psi ^{(l)}\), we used a two-layer fully-connected neural network, where each layer had 300 ReLU units. At the last message passing layer, we replaced the second layer of \(\psi ^{(L)}\) with 300 linear units. For the projection function r, we used a one-layer fully-connected neural network with 1,024 PReLU units. For the auxiliary prediction head, we used a one-layer fully-connected neural network that served as the output layer. The pre-training was performed for 10 epochs using the Adam optimizer with a batch size of 128, a learning rate of \(5\cdot 10^{-4}\), and a weight decay of \(10^{-5}\).

In the fine-tuning phase of the prediction model f, we used the pre-trained GNN obtained in the previous phase as the initialization of the GNN component in the prediction model f. The fine-tuning was performed using the Adam optimizer with a batch size of 128 and a weight decay of \(10^{-5}\). The learning rate was initially set to \(5\cdot 10^{-4}\) and decayed to \(5\cdot 10^{-5}\) and \(5\cdot 10^{-6}\) at the 400-th and 450-th epochs, respectively, over the entire 500 epochs.
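
This corresponds to a standard step decay of the learning rate; a sketch assuming a PyTorch model object is shown below.

```python
import torch
import torch.nn as nn

def make_finetune_optimizer(model: nn.Module):
    """Adam with lr 5e-4 decayed tenfold at epochs 400 and 450 of 500 in total."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[400, 450], gamma=0.1)
    return optimizer, scheduler
```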

For the inference of the prediction model f, we used Monte-Carlo dropout [38], following the referenced study [8]. Given a query chemical reaction, we generated 30 different predictions by conducting multiple stochastic forward passes through the model f with dropout activated. The final prediction for the query was obtained by averaging them.
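
A sketch of this Monte-Carlo dropout inference step, assuming the hypothetical YieldModel interface from the earlier sketch:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, reactant_graphs, product_graph, n_samples: int = 30):
    """Average the predictive means over stochastic forward passes with dropout active."""
    model.eval()
    for m in model.modules():          # re-enable dropout layers only
        if isinstance(m, nn.Dropout):
            m.train()
    mus = [model(reactant_graphs, product_graph)[0] for _ in range(n_samples)]
    return torch.stack(mus).mean(dim=0)
```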

Baseline methods

We conducted a comprehensive evaluation of MolDescPred by comparing its effectiveness with the methods presented in previous studies on chemical reaction yield prediction. For these methods, we used the configurations specified in their respective studies.

  • Multiple Fingerprint Features (MFF) [4] represents a chemical reaction as a vector by concatenating 24 different molecular fingerprints, each generated using RDKit [36]. As a prediction model, it builds a random forest that takes this vector representation as input to predict the corresponding reaction yield.

  • YieldBERT [6] represents a chemical reaction as a reaction SMILES string and fine-tunes a pre-trained reaction BERT model released by Schwaller et al.’s study [39] for chemical reaction yield prediction.

  • YieldBERT-DA [7] is an improved version of YieldBERT, which applies data augmentation based on molecule permutations and SMILES randomization.

  • YieldMPNN [8] represents a chemical reaction as a set of molecular graphs, similar to our study. It builds a prediction model based on a message passing neural network (MPNN) architecture [34]. Despite not utilizing any prior knowledge from pre-training, YieldMPNN performed better than YieldBERT and YieldBERT-DA.

To compare MolDescPred with existing pre-training methods, we evaluated different pre-training methods for initializing the GNN component in the prediction model. Compared with MolDescPred, the only difference was the manner in which the GNN was pre-trained. The following pre-training methods were compared. For all the existing methods, the GIN was used as the GNN architecture because it yielded superior performance in the experimental results of the previous studies. Any unspecified configurations for training and inference were set identical to those of MolDescPred.

  • From-Scratch initializes all parameters of the model f randomly without any pre-training. This method is similar to YieldMPNN, but it replaces the MPNN with GIN as the GNN architecture. The training configuration for this method is identical to that of YieldMPNN.

  • MolCLR [13] pre-trains a GNN based on the contrastive learning approach. For data augmentation, it applies three graph transformation operations to generate different views of a molecular graph: atom masking, bond deletion, and sub-graph removal. The GNN learns molecular representations such that different views of the same molecular graph (i.e., positive pairs) are close and views of the different molecular graphs (i.e., negative pairs) are far apart. Because contrastive learning requires a large batch size to accommodate a large number of negative pairs, we set the batch size to 512.

  • DGI [14] pre-trains a GNN based on the contrastive learning approach. The GNN takes a molecular graph as input to produce node embeddings and a molecular representation vector. A discriminator is introduced to classify whether a pair of a node embedding and a molecular representation vector is associated with the same molecular graph. The GNN and discriminator are jointly trained such that the GNN learns molecular representations by maximizing the mutual information between the local node embeddings and the global molecular representation vector. Similar to MolCLR, we set the batch size to 512.

  • ContextPred [21] pre-trains a GNN based on the pre-text task approach. For each node in a molecular graph, it defines a context graph as a sub-graph surrounding the neighborhood of the node. The main GNN encodes a molecular graph to obtain node embeddings that aggregate information across the neighborhoods of the corresponding nodes. An auxiliary GNN, called a context GNN, is introduced to encode each context graph into a context embedding. The main GNN and context GNN are jointly trained, with the learning objective being the binary classification of whether a node embedding from the main GNN and a context embedding from the context GNN are associated with the same node in the molecular graph.

  • AttrMasking [21] pre-trains a GNN based on the pre-text task approach. It randomly masks the node features in a molecular graph and assigns the masked node features as the node-level pseudo-label to the molecular graph. The GNN learns to predict the ground-truth of the masked node features in the input molecular graph.

From a computational standpoint, the existing methods require an auxiliary model to be maintained or involve additional repetitive operations. MolCLR applies graph transformation operations to create different views of each molecular graph and performs forward passes for these views at each training epoch. DGI requires the maintenance of the discriminator. ContextPred employs the auxiliary GNN. AttrMasking generates pseudo-labels at each training epoch. These requirements introduce extra computational costs during the pre-training phase. In contrast, MolDescPred generates pseudo-labels once before pre-training and trains only a single GNN with a prediction head to predict the fixed pseudo-labels during pre-training.

Results and discussion

For the random splits, we conducted experiments for each training/test split ratio using 10 different random shuffles. For the out-of-sample splits, we repeated the experiment for each training/test split 5 times with different random seeds. We evaluated the predictive performance of each method in terms of the root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R\(^2\)) calculated on the test datasets. We report the average and standard deviation of the results over repetitions. The best and second best cases are highlighted in bold and underlined font, respectively.

Fig. 5 Graphical summary of RMSE comparison results: (a) Buchwald-Hartwig (Random Split), (b) Suzuki-Miyaura (Random Split), (c) Buchwald-Hartwig (Out-Of-Sample Split)

Fig. 6 Distribution of reaction-wise error decreases by MolDescPred, compared to From-Scratch and MolCLR: (a) Buchwald-Hartwig (Random Split), (b) Suzuki-Miyaura (Random Split), (c) Buchwald-Hartwig (Out-Of-Sample Split)

Tables 1, 2, and 3 compare the predictive performances of the baseline and proposed methods in terms of RMSE, MAE, and R\(^2\), respectively. Figure 5 summarizes the RMSE comparison results using bar plots. Across the various splits of the benchmark datasets, the performance of MolDescPred was either superior or comparable to that of the baseline methods. For the random splits of the Buchwald-Hartwig and Suzuki-Miyaura datasets, MolDescPred performed the best and the second best on average, respectively. In particular, the performance improvement was more pronounced when the size of the training dataset was smaller. For the out-of-sample splits of the Buchwald-Hartwig dataset, MolDescPred outperformed the baseline methods in 3 out of 4 splits. These results demonstrate that MolDescPred performed well under insufficiency of the training dataset in terms of quantity and diversity.

All the existing GNN pre-training methods outperformed From-Scratch, indicating that pre-training was helpful in improving the prediction performance. Among these methods, MolCLR achieved superior performance for the random splits of both the Buchwald-Hartwig and Suzuki-Miyaura datasets, but its performance slightly deteriorated on the out-of-sample splits of the Buchwald-Hartwig dataset. AttrMasking showed good performance on some of the out-of-sample splits. It should be noted that not all pre-training methods led to meaningful performance improvements, and some of them significantly underperformed YieldMPNN, implying that it is important to select an appropriate pre-training method for a specific target prediction task. Figure 6 shows the distribution of reaction-wise error decreases achieved by MolDescPred compared to From-Scratch and MolCLR, each of which is measured by the difference between the absolute error of MolDescPred and that of the compared method. The rightward skew of each distribution, characterized by a larger blue region compared to the red region, indicates that MolDescPred led to performance improvements for a greater number of chemical reactions within the test dataset.

Among the methods presented in the previous studies, YieldMPNN performed the best. YieldMPNN outperformed From-Scratch, which differs only in the GNN architecture, by a large margin in most cases. However, YieldMPNN performed worse than MolDescPred, especially on the random splits with small training datasets and out-of-sample splits. MFF showed low overall performance compared to the other methods, but the performance gap narrowed when using a smaller training dataset. Notably, MFF achieved the best performance on the 2.5/97.5 split of the Suzuki-Miyaura dataset.

To investigate the effect of the GNN architecture in the proposed method, we evaluated a variant of the proposed method, MolDescPred-MPNN, by using the MPNN as the GNN architecture. It can be considered as the application of the proposed pre-training to YieldMPNN. MolDescPred-MPNN yielded better performance than YieldMPNN in the random split experiments. While it performed significantly worse than MolDescPred on the Buchwald-Hartwig dataset, it surpassed MolDescPred on the Suzuki-Miyaura dataset. However, MolDescPred-MPNN performed worse than YieldMPNN in the out-of-sample split experiments. This indicates that the proposed method was more effective when used with the GIN.

Table 1 Comparison of predictive performance in terms of RMSE
Table 2 Comparison of predictive performance in terms of MAE
Table 3 Comparison of predictive performance in terms of R\(^2\)

To investigate the effect of the dimensionality of the pseudo-labels in the proposed method, we conducted a sensitivity analysis with respect to the explained variance determined by the number of principal components q. Figure 7 shows box plots comparing the RMSE reduction rate, relative to the 70% explained variance case, across various levels of explained variance. The detailed comparison results across different levels of explained variance can be found in Additional file 1: Table S2. In the random splits of the Buchwald-Hartwig and Suzuki-Miyaura datasets, no significant differences in performance were observed. In the out-of-sample splits of the Buchwald-Hartwig dataset, while there was no clear tendency, MolDescPred demonstrated comparable performance at 70% explained variance. Therefore, the current experimental setting, in which the dimensionality corresponds to 70% explained variance, is a reasonable choice.

Fig. 7 Sensitivity analysis regarding the number of principal components used in MolDescPred: (a) Buchwald-Hartwig (Random Split), (b) Suzuki-Miyaura (Random Split), (c) Buchwald-Hartwig (Out-Of-Sample Split)

Conclusion

In this study, we presented a GNN pre-training method, MolDescPred, to improve the performance of chemical reaction yield prediction. The proposed method defined a pre-text task by leveraging molecular descriptors. For a molecular database, we pseudo-labeled each molecule with its molecular descriptors in a reduced dimensionality obtained through PCA. Using the database, a GNN was pre-trained to predict the pseudo-label of a molecule. The pre-trained GNN served as the initialization for the GNN component of the chemical reaction yield prediction model. By fine-tuning on the target training dataset, the prediction model achieved improved performance in predicting the yields of chemical reactions. Through experimental investigations on benchmark datasets for chemical reaction yield prediction, we demonstrated the superior performance of the proposed method over the baseline methods. The proposed method was more effective when the training dataset was insufficient in terms of quantity and diversity.

In contrast to other pre-training methods that involve repetitions of complex and expensive computations, the proposed method pre-trains a GNN to perform a simple prediction task as the pre-text task. Because the molecular descriptors can be efficiently computed on a large scale, the proposed method can be easily implemented in practical applications. One important consideration is that the molecular descriptors used to define the pre-text task are not equally beneficial for the target prediction tasks. While some descriptors may provide valuable information, others may be less useful. Guided by this intuition, a potential avenue for future work to further enhance the efficiency and effectiveness of the proposed method is to investigate ways for dynamically selecting the most advantageous molecular descriptors for specific target prediction tasks.

Availability of data and materials

We implemented the proposed method based on PyTorch in Python. The source code used in this study is available online at http://github.com/hjm9702/reaction_yield_pretrained_gnn/. The benchmark datasets are publicly accessible from https://github.com/rxn4chemistry/rxn_yields/.

References

  1. Meuwly M (2021) Machine learning for chemical reactions. Chem Rev 121(16):10218–10239


  2. Ahneman DT, Estrada JG, Lin S, Dreher SD, Doyle AG (2018) Predicting reaction performance in C-N cross-coupling using machine learning. Science 360(6385):186–190


  3. Chuang KV, Keiser MJ (2018) Comment on predicting reaction performance in C-N cross-coupling using machine learning. Science 362(6416): eaat8603. https://doi.org/10.1126/science.aat8603


  4. Sandfort F, Strieth-Kalthoff F, Kühnemund M, Beecks C, Glorius F (2020) A structure-based platform for predicting chemical reactivity. Chem 6(6):1379–1390


  5. Yada A, Nagata K, Ando Y, Matsumura T, Ichinoseki S, Sato K (2018) Machine learning approach for prediction of reaction yield with simulated catalyst parameters. Chem Lett 47(3):284–287


  6. Schwaller P, Vaucher AC, Laino T, Reymond JL (2021) Prediction of chemical reaction yields using deep learning. Mach Learn Sci Technol 2(1):015016


  7. Schwaller P, Vaucher AC, Laino T, Reymond JL (2020) Data Augmentation Strategies to Improve Reaction Yield Predictions and Estimate Uncertainty. In: Proceedings of NeurIPS Workshop on Machine Learning for Molecules

  8. Kwon Y, Lee D, Choi YS, Kang S (2022) Uncertainty-aware prediction of chemical reaction yields with graph neural networks. J Cheminform 14: 2. https://doi.org/10.1186/s13321-021-00579-z


  9. Wieder O, Kohlbacher S, Kuenemann M, Garon A, Ducrot P, Seidel T et al (2020) A compact review of molecular property prediction with graph neural networks. Drug Discov Today Technol 37:1–12


  10. Hwang D, Yang S, Kwon Y, Lee KH, Lee G, Jo H et al (2020) Comprehensive study on molecular supervised learning with graph neural networks. J Chem Inform Model 60(12):5936–5945


  11. Xia J, Zhu Y, Du Y, Li SZ (2022) Pre-Training Graph Neural Networks for Molecular Representations: Retrospect and Prospect. In: Proceedings of ICML Workshop on AI for Science

  12. Xie Y, Xu Z, Zhang J, Wang Z, Ji S (2022) Self-supervised learning of graph neural networks: a unified review. IEEE Trans Pattern Anal Mach Intell 45(2):2412–2429


  13. Wang Y, Wang J, Cao Z, Farimani AB (2022) Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 4:279–287


  14. Veličković P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD (2019) Deep Graph Infomax. In: Proceedings of International Conference on Learning Representations

  15. Sun M, Xing J, Wang H, Chen B, Zhou J (2021) MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3585–3594

  16. Li S, Zhou J, Xu T, Dou D, Xiong H (2022) GeomGCL: geometric graph contrastive learning for molecular property prediction. Proc AAAI Conf Artif Intell 36:4541–4549


  17. You Y, Chen T, Shen Y, Wang Z (2021) Graph Contrastive Learning Automated. In: Proceedings of the 38th International Conference on Machine Learning. 139; 12121–12132

  18. Xia J, Wu L, Chen J, Hu B, Li SZ (2022) SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation. In: Proceedings of the ACM Web Conference. 1070–1079

  19. Trivedi P, Lubana ES, Yan Y, Yang Y, Koutra D (2022) Augmentations in Graph Contrastive Learning: Current Methodological Flaws & Towards Better Practices. In: Proceedings of the ACM Web Conference. 1538–1549

  20. You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y (2020) Graph contrastive learning with augmentations. Adv Neural Inform Process Syst 33:5812–5823


  21. Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, et al (2020) Strategies for Pre-training Graph Neural Networks. In: Proceedings of International Conference on Learning Representations

  22. Fang X, Liu L, Lei J, He D, Zhang S, Zhou J et al (2022) Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell 4:127–134


  23. Zhang Z, Liu Q, Wang H, Lu C, Lee CK (2021) Motif-based graph self-supervised learning for molecular property prediction. Adv Neural Inform Process Syst 34:15870–15882


  24. Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, et al (2020) Self-supervised graph transformer on large-scale molecular data. Adv Neural Inform Process Syst 33:12559–12571


  25. Li P, Wang J, Qiao Y, Chen H, Yu Y, Yao X et al (2021) An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform 22(6):109


  26. Todeschini R, Consonni V (2008) Handbook of molecular descriptors. Wiley-VCH. https://onlinelibrary.wiley.com/doi/book/10.1002/9783527613106


  27. Wigh DS, Goodman JM, Lapkin AA (2022) A review of molecular representation in the age of machine learning. Wiley Interdiscip Rev Comput Mol Sci 12(5):e1603


  28. Jiang D, Wu Z, Hsieh CY, Chen G, Liao B, Wang Z et al (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminform. https://doi.org/10.1186/s13321-020-00479-8


  29. Shen J, Nicolaou CA (2019) Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discov Today Technol 32:29–36


  30. Pinheiro GA, Mucelini J, Soares MD, Prati RC, Silva JLFD, Quiles MG (2020) Machine learning prediction of nine molecular properties based on the SMILES representation of the QM9 quantum-chemistry dataset. J Phys Chem A 124(47):9854–9866


  31. Moriwaki H, Tian YS, Kawashita N, Takagi T (2018) Mordred: A Molecular Descriptor Calculator. J Cheminform 10: 4. https://doi.org/10.1186/s13321-018-0258-y


  32. Jolliffe IT, Cadima J (2016) Principal component analysis: a review and recent developments. Philos Trans Royal Soc A: Math Phys Eng Sci 374(2065):20150202


  33. Xu K, Hu W, Leskovec J, Jegelka S (2019) How Powerful are Graph Neural Networks? In: Proceedings of International Conference on Learning Representations

  34. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural Message Passing for Quantum Chemistry. In: Proceedings of International Conference on Machine Learning. 1263–1272

  35. Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. In: Proceedings of NeurIPS Workshop on Machine Learning for Molecules

  36. RDKit: Open-Source Cheminformatics;. Available from: http://www.rdkit.org/

  37. Perera D, Tucker JW, Brahmbhatt S, Helal CJ, Chong A, Farrell W et al (2018) A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 359(6374):429–434


  38. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: Proceedings of International Conference on Machine Learning. 1050–1059

  39. Schwaller P, Probst D, Vaucher AC, Nair VH, Kreutter D, Laino T et al (2021) Mapping the space of chemical reactions using attention-based neural networks. Nat Mach Intell 3:144–152



Acknowledgements

The authors thank the anonymous reviewers for their valuable comments.

Funding

This work was supported by Samsung Advanced Institute of Technology and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT; Ministry of Science and ICT) (No. RS-2023-00207903).

Author information


Contributions

J.H. and Y.K. designed and implemented the methodology. Y.K. analyzed the results. Y.-S.C and S.K. supervised the research. J.H. and S.K. wrote the manuscript. All authors reviewed and approved the final manuscript.

Scientific contribution

This study shows that incorporating GNN pre-training improves the performance of the prediction model for chemical reaction yield prediction. Compared to existing methods, the proposed method requires pre-training only a single GNN for predicting fixed pseudo-labels, thereby eliminating the need for maintaining an auxiliary model or involving additional repetitive operations. The effectiveness of the proposed method is particularly pronounced when the training dataset is limited in quantity and diversity, making it practically advantageous for real-world applications.

Corresponding authors

Correspondence to Youn-Suk Choi or Seokho Kang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Table S1. List of 2D molecular descriptors from the Mordred calculator. Table S2. Comparison of RMSE across various explained variances. Figure S1. Explained variance according to the number of principal components. Figure S2. Heat map visualization of principal components.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Han, J., Kwon, Y., Choi, YS. et al. Improving chemical reaction yield prediction using pre-trained graph neural networks. J Cheminform 16, 25 (2024). https://doi.org/10.1186/s13321-024-00818-z

