Elsevier

Fluid Phase Equilibria

Volume 527, 1 January 2021, 112829
Towards a universal digital chemical space for pure component properties prediction

https://doi.org/10.1016/j.fluid.2020.112829

Highlights

  • The concept of a universal translator was used to learn molecular features and store them in a universal digital chemical space by translating and projecting the chemical representation SMILES into a high-dimensional space that can be collapsed into different molecular fingerprints.

  • This universal digital chemical space can serve as a unified regressor for predicting different pure component properties. Successful applications to computed electrical and thermodynamic properties were demonstrated.

Abstract

Computer-aided molecular design requires the ability to predict different molecular properties of interest from molecular structure. Traditional quantitative structure-property relationships were developed by extracting molecular features tailored to each property, so the feature domain differs from one property to the next. In this work, the concept of a universal translator was used to develop a universal digital chemical space by translating and projecting the chemical representation SMILES into a high-dimensional space that can be collapsed into different molecular fingerprints. We demonstrated that different kinds of pure component properties, such as electrical and thermodynamic properties, can be predicted from a single input describing the molecular structure, the SMILES string. This method eliminates the need to manually extract different molecular features for predicting different properties. The ability of the model to predict sigma profiles also paves the way for predicting phase equilibria of mixtures from molecular structure only.

Introduction

The ability to predict chemical and physical properties is a critical component of the molecular design of processes and products [1], [2]. In the field of chemical thermodynamics, a common approach is the group contribution method. In this approach, a molecule is broken down into a number of functional groups. Pure component properties are then predicted by a linear combination of parameters specific to these groups (for example, the Joback method [3]). Binary or multicomponent properties such as activity coefficients can also be predicted using interaction parameters between groups (e.g. the UNIFAC method [4]). In quantitative structure-property/activity relationships (QSPR/QSAR [5], [6]), a number of explicit features, known as molecular descriptors, are extracted from a molecular structure and then used to predict properties or activities of molecules. Machine learning has long been an integral part of QSPR. Most commonly, e.g. in the works of Chen and Wong [7], Járvás et al. [8], Faber et al. [9] and many others, molecular descriptors are still used as the regressors, but nonlinear nonparametric models are used to represent the responses.
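The linear-combination idea behind group contribution methods can be sketched in a few lines. The example below is a minimal illustration in the spirit of the Joback boiling-point correlation; the additive constant and the three group increments are quoted as commonly tabulated and should be checked against the original tables before any real use. The function name and dictionary layout are assumptions made for this sketch.

```python
# Illustrative group-contribution estimate of the normal boiling point.
# The constant and increments are quoted as commonly tabulated for the
# Joback method; verify them against the published table before use.
TB_BASE_K = 198.2  # additive constant of the boiling-point correlation

TB_INCREMENTS_K = {
    "-CH3": 23.58,
    "-CH2-": 22.88,
    "-OH (alcohol)": 92.88,
}


def estimate_boiling_point(group_counts):
    """Return Tb in kelvin as base + sum(count * increment)."""
    total = TB_BASE_K
    for group, count in group_counts.items():
        total += count * TB_INCREMENTS_K[group]
    return total


# Ethanol, CH3-CH2-OH: one methyl, one methylene, one hydroxyl group.
ethanol = {"-CH3": 1, "-CH2-": 1, "-OH (alcohol)": 1}
tb_ethanol = estimate_boiling_point(ethanol)
print(f"Estimated Tb for ethanol: {tb_ethanol:.2f} K")
```

The features here (group counts) are chosen by hand for one property, which is the limitation that implicit feature extraction tries to remove.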

More recently, advances in image processing and natural language processing were leveraged to perform implicit feature extraction. For example, simple 2-dimensional molecular structures were represented as images and used as the inputs of a convolutional network for property prediction [10], [11]. In natural language processing, translation between languages is a challenging task. A word embedding [12] procedure encodes a text representation as a set of numbers and projects them into a high-dimensional space. This high-dimensional space can be decoded back into a different textual representation with essentially the same meaning. Our previous work [13] used the word embedding technique from natural language processing to encode a text-based description, the Simplified Molecular Input Line Entry Specification (SMILES) [14], into the molecular fingerprint Molecular ACCess System (MACCS [15]). The encoded representation was found to be able to predict sigma profiles in the VT2005 database [16]. The sigma profile is the screening charge distribution of a molecule when embedded in a perfect conductor [17], an important molecular characteristic of solvation [18], [19]. A similar word embedding approach was used to predict bioactivity [20] and polymer properties [21].
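To make the embedding step concrete, the sketch below tokenizes SMILES strings, maps each token to an integer index, and looks up a vector for each token. It is an assumption-laden toy: the regular expression covers only bracket atoms, Cl, Br and single characters, and the embedding table is filled with random numbers, whereas in a trained model these weights would be learned. None of the names below come from the original work.

```python
import random
import re

# Match bracket atoms, two-letter halogens, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into atom, bond, ring and branch tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocabulary(smiles_list):
    """Assign an integer index to every token seen in the corpus."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def make_embedding(vocab_size, dim, seed=0):
    """Random lookup table; in a real model these weights are learned."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]


corpus = ["CCO", "c1ccccc1", "ClCCl", "C[NH3+]"]
vocab = build_vocabulary(corpus)
table = make_embedding(len(vocab), dim=4)

tokens = tokenize("ClCCl")
vectors = [table[vocab[tok]] for tok in tokens]
print(tokens)
print(len(vectors), "tokens, each embedded in", len(vectors[0]), "dimensions")
```

Identical tokens share one row of the table, so the two chlorine atoms in the example receive the same vector; context is added only by the layers that follow the lookup.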

In the above approaches, if a specific set of properties is used to determine the encoded feature, the feature will be a molecular descriptor vector tailored specifically to that set of properties. An analog of this approach in multivariate linear regression is canonical correlation analysis (CCA), which finds the latent variables that are most relevant to output predictions [22]. Such descriptors and models cannot be reused in molecular design problems in which other properties are the key targets. An alternative is to find a set of encoded features that is common to many different descriptors and use them to build models for predicting different properties. An analog of this approach in multivariate linear regression is principal component regression (PCR), which finds latent variables that describe the correlation between the regressor variables regardless of their predictive ability [23]. The assumption is that if all relevant physical meaning is captured, any function of the original variable space can be represented in the latent space.
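The contrast between the two linear analogs can be stated compactly. The block below is a standard textbook formulation rather than anything taken from the paper; the symbols (centered regressor matrix X, response matrix Y, covariance matrices Σ) are notation assumed for this sketch.

```latex
% CCA: the latent directions depend on the responses Y.
(a^{*}, b^{*}) = \arg\max_{a,\,b}
  \frac{a^{\top} \Sigma_{XY}\, b}
       {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}}

% PCR, step 1: the latent directions depend on X only.
w_{1} = \arg\max_{\lVert w \rVert = 1} \; w^{\top} X^{\top} X\, w ,
\qquad T = X W_{k}

% PCR, step 2: any response is regressed on the same scores T.
\hat{y} = T\, \hat{\beta}, \qquad
\hat{\beta} = \left( T^{\top} T \right)^{-1} T^{\top} y
```

Because the scores T are computed without reference to any response, the same T can be reused for every property, which is the role the shared encoded features play in the nonlinear setting.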

In terms of natural language processing, if we assume that the concepts expressed by different languages are essentially the same, a unified high-dimensional space can be established as the core of a universal translator. A similar situation exists for molecular feature extraction. There are many ways to encode the structure of molecules. For example, Open Babel is a chemical toolbox that allows translation between different molecular fingerprints [24]. In this work, we attempt to demonstrate that a high-dimensional “universal-digital-chemical-space” (UDCS) can be constructed by translating SMILES into different molecular fingerprints. Furthermore, such a UDCS can serve as a “universal regressor” to construct models for different types of pure component properties, including geometrical, electronic, and thermodynamic properties.
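The idea of one shared space feeding several outputs can be pictured with a toy example. The sketch below is not the authors' network: the encoder is omitted, the latent vector is random, all weights are random stand-ins for trained parameters, and the head names and bit lengths are placeholders. It only shows how a single latent vector could be collapsed into several fingerprints and also reused as the regressor for a scalar property.

```python
import math
import random

rng = random.Random(0)


def random_matrix(rows, cols):
    """Stand-in for trained weights (hypothetical, randomly initialised)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


LATENT_DIM = 8  # size of the shared latent vector (arbitrary here)

# One decoder head per output. Names and bit lengths are placeholders.
fingerprint_heads = {
    "fingerprint_A": random_matrix(16, LATENT_DIM),
    "fingerprint_B": random_matrix(32, LATENT_DIM),
}
property_head = random_matrix(1, LATENT_DIM)  # one scalar property


def decode_all(latent):
    """Collapse one latent vector into several fingerprints and a property."""
    outputs = {
        name: [sigmoid(v) for v in matvec(weights, latent)]
        for name, weights in fingerprint_heads.items()
    }
    outputs["property"] = matvec(property_head, latent)[0]
    return outputs


latent = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]
result = decode_all(latent)
for name, value in result.items():
    size = len(value) if isinstance(value, list) else 1
    print(name, "->", size, "output(s)")
```

Each fingerprint head returns per-bit probabilities between 0 and 1, which would be thresholded to obtain a binary fingerprint.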

Section snippets

Databases and properties

Ramakrishnan et al. [25] reported computed values of 14 properties for 133,885 small organic molecules composed of C, H, O, N and F with nine or fewer heavy atoms (GDB-9). These included 4 molecular properties (rotational constants A, B, C; zero-point vibrational energy, zpve), 5 electrical properties (dipole moment μ, isotropic polarizability, HOMO and LUMO energy levels ɛHOMO, ɛLUMO, electronic spatial extent Re2) and 2 fixed-point thermodynamic properties (H298, enthalpy at 298.15 K, and c298, heat capacity at 298.15 K).

We have

Construction of UDCS

To investigate the effect of network size, the aforementioned UDCS was trained using M=500, N=200; M=500, N=100; and M=200, N=100. The resulting models were named X1 to X3, respectively. For training, 140,944 compounds (80%, drawn at random from the three data sets) were selected from our database, and the remaining 35,237 compounds were used as test samples.
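The reported split sizes are consistent with a plain random 80/20 partition. The sketch below reproduces the counts under two assumptions that are not stated in the text: that the database holds 176,181 compounds in total (inferred as the sum of the two reported counts) and that the training size is obtained by truncating 80% of that total to an integer. The function name and seed are hypothetical.

```python
import random

TRAIN_COUNT = 140_944  # reported in the text
TEST_COUNT = 35_237    # reported in the text
TOTAL = TRAIN_COUNT + TEST_COUNT  # 176,181, inferred as the sum


def split_indices(n_total, train_fraction, seed=0):
    """Shuffle compound indices and cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    n_train = int(n_total * train_fraction)  # truncate, do not round
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(TOTAL, 0.8)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the partition reproducible, so that models of different size are compared on the same test compounds.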

Table 1 shows the binary accuracy of the translations from SMILES to various molecular fingerprints. Binary accuracy is defined as the number of

Conclusions

In this work, we have demonstrated the feasibility of developing a universal digital chemical space (UDCS) that is able to translate SMILES into various molecular fingerprints as well as predict different properties of pure components. In this way, the need to extract specific molecular features for predicting different properties, as in traditional machine learning of quantitative structure-property relationships, is eliminated. Prediction results of computed geometrical, electronic and

CRediT authorship contribution statement

Jie-Jiun Chang: Methodology, Software, Validation, Formal analysis, Writing - original draft. David Shan-Hill Wong: Conceptualization, Methodology, Writing - review & editing, Supervision, Funding acquisition. Chen-Hsuan Huang: Conceptualization, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition. Jia-Lin Kang: Conceptualization, Methodology, Writing - review & editing, Supervision, Funding acquisition. Hsuan-Hao Hsu: Software, Validation, Formal analysis,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

Jie-Jiun Chang and David Shan-Hill Wong would like to acknowledge the support of the Ministry of Science and Technology through projects MOST 108-2221-E-007-069-MY2 and 108-2218-E-007-007. Jia-Lin Kang would like to acknowledge the support of the Ministry of Science and Technology through project MOST 108-2636-E-224-001. Shang-Tai Lin and other authors from National Taiwan University would like to acknowledge the support of the Ministry of Science and Technology through project 107-2221-E-002-112-MY3.

References (44)

  • D. Rogers et al. Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J. Chem. Inf. Comput. Sci. (1994)
  • K. Varmuza et al. Statistical Modelling of Molecular Descriptors in QSAR/QSPR. (2012)
  • D.S. Chen et al. Neural network correlations of detonation properties of high energy explosives. Propellants Explos. Pyrotech. (1998)
  • F.A. Faber et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput. (2017)
  • Goh, G. B., Siegel, C., Vishnu, A., Hodas, N. O., & Baker, N. (2017). Chemception: A deep neural network with minimal...
  • C.A. Grambow et al. Accurate thermochemistry with small data sets: A bond additivity correction and transfer learning approach. J. Phys. Chem. A (2019)
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space....
  • D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. (1988)
  • J.L. Durant et al. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. (2002)
  • Mullins, E., Oldland, R., Liu, Y. A., Wang, S., Sandler, S. I., Chen, C. C., ... & Seavey, K. C. (2006). VT 2005 Sigma...
  • A. Klamt. Conductor-like screening model for real solvents: a new approach to the quantitative calculation of solvation phenomena. J. Phys. Chem. (1995)
  • S.T. Lin et al. A priori phase equilibrium prediction from a segment contribution solvation model. Ind. Eng. Chem. Res. (2002)