Molecular discovery by optimal sequential search

Li, Genyuan

doi:10.1007/s10910-019-01062-9


Original Paper
Published: 31 August 2019

Volume 57, pages 2110–2141, (2019)
Journal of Mathematical Chemistry

Genyuan Li ORCID: orcid.org/0000-0003-4573-6188¹


Abstract

In the development of a new compound in chemistry and molecular biology, especially a new medicine in the pharmaceutical industry, we often need to find one or more candidate molecules with the best desired property (e.g., binding affinity in medicine) from a large set of molecules that share the same scaffold but carry one of m distinct functional substituents at each of n different sites. The total number $N_{\mathrm{lib}}$ of molecules in this library is $m^n$. In some cases, $N_{\mathrm{lib}}$ can be very large (e.g., millions). This is a challenging task because it is costly and often infeasible to synthesize and test all of these molecules. A new algorithm, referred to as optimal sequential search, is developed to overcome this difficulty. In particular, the algorithm is chemically intuitive, using only information about molecular composition, and is accessible to practicing chemists. The algorithm can be applied to small, medium and large molecule libraries. With syntheses and property measurements for a limited number of molecules, the best candidate molecules can be effectively captured from the whole library. Three examples with library sizes of 64, 160,000 and 1,048,576 are used for illustration. For the small library, syntheses and property measurements of 17 molecules are sufficient to capture the top 7 candidate molecules; for the medium and large libraries, syntheses and property measurements of about one thousand molecules can capture most or a large part of the top 500, and especially of the top 100, candidate molecules. However, the algorithm requires many (e.g., hundreds of) iterative rounds of synthesis and property measurement, and the time cost may not be acceptable if the algorithm is performed manually. Automating the sequential search process is therefore the next task needed to make the algorithm practical.
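To make the library size $N_{\mathrm{lib}} = m^n$ concrete, the following minimal Python sketch enumerates a combinatorial library as all assignments of m substituents to n sites. The substituent labels and the (m, n) pairs are assumptions chosen only because they reproduce the three library sizes quoted in the abstract; the abstract itself gives the totals, not the factorizations.

```python
from itertools import product


def library_size(m: int, n: int) -> int:
    """Number of molecules with m possible substituents at each of n sites."""
    return m ** n


def enumerate_library(substituents, n_sites):
    """Lazily yield every molecule as a tuple of one substituent per site."""
    return product(substituents, repeat=n_sites)


# Hypothetical small library: 4 substituent labels at 3 sites.
small = list(enumerate_library("ABCD", 3))
print(len(small), small[:3])

# Assumed (m, n) pairs matching the sizes 64, 160,000 and 1,048,576.
for m, n in [(4, 3), (20, 4), (4, 10)]:
    print(f"m={m}, n={n}: N_lib={library_size(m, n)}")
```

Because `enumerate_library` returns a lazy iterator, the larger libraries can be traversed without holding every molecule in memory at once.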



Author information

Authors and Affiliations

Department of Chemistry, Princeton University, Princeton, NJ, 08544, USA
Genyuan Li


Corresponding author

Correspondence to Genyuan Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Genyuan Li–Retired from Princeton University.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (txt 8 KB)

Supplementary material 2 (xls 4816 KB)

Appendix: Deduction of the formula of the expected improvement E(I) for maximization

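Let $Y$ be the random variable describing the uncertain value of $f$ at a point $x_*$, and let $f_{\mathrm{max}}$ be the current best (largest) known value of $f$. For a one-variable function $y=f(x)$, define the improvement $I(x_*)$ as

$$\begin{aligned} I(x_*)=\max (Y-f_{\mathrm{max}}, 0). \end{aligned} \tag{26}$$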

The expected improvement $E(I(x_*))$ is then

$$\begin{aligned} E(I(x_*))={\mathbb {E}}[I(x_*)]\equiv {\mathbb {E}}[\max (Y-f_{\mathrm{max}},0)], \end{aligned} \tag{27}$$

i.e.,

$$\begin{aligned} E(I(x_*)) &= \int_{-\infty}^{\infty} \max(y-f_{\mathrm{max}},0)\,\phi(y)\,{\mathrm d}y \\ &= \int_{-\infty}^{f_{\mathrm{max}}} 0\cdot\phi(y)\,{\mathrm d}y + \int_{f_{\mathrm{max}}}^{\infty} (y-f_{\mathrm{max}})\,\phi(y)\,{\mathrm d}y \\ &= \int_{f_{\mathrm{max}}}^{\infty}(y-f_{\mathrm{max}})\frac{1}{\sqrt{2\pi}\,s} \exp\left[-\frac{(y-f_*)^2}{2s^2}\right]{\mathrm d}y, \end{aligned} \tag{28}$$

where $\phi (y)$ is the probability density function (pdf) of the normal distribution ${{\mathcal {N}}}(f_*, s^2)$ of $Y$, with mean $f_*$ and standard deviation $s$. Setting

$$\begin{aligned} u=\frac{y-f_*}{s}, \qquad y=su+f_*, \qquad {\mathrm d}y = s\,{\mathrm d}u \end{aligned} \tag{29}$$

gives

$$\begin{aligned} E(I(x_*)) &= \int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} (su+f_*-f_{\mathrm{max}}) \frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u \\ &= (f_*-f_{\mathrm{max}})\int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u + s\int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} \frac{u}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u \\ &= (f_*-f_{\mathrm{max}})\left[\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u -\int_{-\infty}^{(f_{\mathrm{max}}-f_*)/s} \frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u\right] + s\int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} \frac{u}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u \\ &= (f_*-f_{\mathrm{max}})\left[1-\Phi\left(\frac{f_{\mathrm{max}}-f_*}{s}\right)\right] + s\int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} \frac{u}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u, \end{aligned} \tag{30}$$

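where $\Phi $ denotes the standard normal cumulative distribution function. Note that, with $\phi (u)$ now denoting the standard normal pdf (rather than the pdf of ${{\mathcal {N}}}(f_*, s^2)$ used in Eq. (28)),

$$\begin{aligned} \phi (u)=\frac{1}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}, \qquad {\mathrm d}\phi (u)=-\frac{u}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}\,{\mathrm d}u. \end{aligned} \tag{31}$$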

The second term in Eq. (30) becomes

$$\begin{aligned} s\int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} \frac{u}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,{\mathrm d}u &= -s\int_{(f_{\mathrm{max}}-f_*)/s}^{\infty} {\mathrm d}\phi(u) \\ &= s\left[\phi(u)\right]_{\infty}^{(f_{\mathrm{max}}-f_*)/s} = s\,\phi\left(\frac{f_{\mathrm{max}}-f_*}{s}\right). \end{aligned} \tag{32}$$
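The identity used in this step, namely that the integral of $u\,\phi(u)$ from $a$ to infinity equals $\phi(a)$ for the standard normal pdf, can be checked numerically. The sketch below is only an illustrative verification using a midpoint rule on a truncated interval; the function names, the truncation point and the step count are assumptions made for this example.

```python
import math


def phi(u: float) -> float:
    """Standard normal probability density function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def tail_first_moment(a: float, upper: float = 12.0, steps: int = 50_000) -> float:
    """Midpoint-rule approximation of the integral of u*phi(u) over [a, upper].

    The upper limit stands in for infinity; phi(12) is negligibly small.
    """
    h = (upper - a) / steps
    total = 0.0
    for i in range(steps):
        u = a + (i + 0.5) * h
        total += u * phi(u)
    return h * total


# Compare the numerical integral with the closed form phi(a).
for a in (-1.5, 0.0, 0.7):
    print(a, tail_first_moment(a), phi(a))
```

In each case the two printed values should agree to within the quadrature error.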

Substituting Eq. (32) into (30) yields

$$\begin{aligned} E(I(x_*))=(f_*-f_{\mathrm{max}})\left[ 1-\Phi \left( \frac{f_{\mathrm{max}}-f_*}{s}\right) \right] +s\,\phi \left( \frac{f_{\mathrm{max}}-f_*}{s}\right) . \end{aligned} \tag{33}$$
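As a cross-check on Eq. (33), the symmetry of the standard normal distribution gives an equivalent form that is often easier to read. The symbol $z$ is introduced here for convenience only and does not appear in the derivation above.

$$\begin{aligned} z &= \frac{f_*-f_{\mathrm{max}}}{s}, \qquad 1-\Phi(-z)=\Phi(z), \qquad \phi(-z)=\phi(z), \\ E(I(x_*)) &= (f_*-f_{\mathrm{max}})\,\Phi(z) + s\,\phi(z) = s\left[z\,\Phi(z)+\phi(z)\right]. \end{aligned}$$

Written this way, the expected improvement is positive whenever $s>0$, and as $s\rightarrow 0$ it tends to $\max (f_*-f_{\mathrm{max}},0)$, which is the improvement of a prediction known with certainty.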

Note that only the output $y$ enters the derivation; the input $x$ appears only through $f_*$ and $s$. The result is therefore unchanged for a function $f(\mathbf{x})$ of $n>1$ variables, and Eq. (33) also holds for $f(\mathbf{x})$, i.e.,

$$\begin{aligned} E(I(\mathbf{x}_*))=(f_*-f_{\mathrm{max}})\left[ 1-\Phi \left( \frac{f_{\mathrm{max}}-f_*}{s}\right) \right] +s\,\phi \left( \frac{f_{\mathrm{max}}-f_*}{s}\right) . \end{aligned} \tag{34}$$
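The closed form in Eq. (34) is straightforward to evaluate. The following is a minimal Python sketch, not the author's implementation: the function names, the handling of $s=0$, and the candidate molecules with their values of $f_*$ and $s$ are all hypothetical and serve only to show how the formula could rank untested candidates.

```python
import math


def norm_pdf(z: float) -> float:
    """Standard normal pdf, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    """Standard normal cdf, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(f_star: float, s: float, f_max: float) -> float:
    """Expected improvement for maximization, following Eq. (34).

    f_star: mean of the normal distribution of the value at the candidate
    s:      its standard deviation
    f_max:  current best value
    """
    if s <= 0.0:
        # Assumed convention for a prediction with no uncertainty.
        return max(f_star - f_max, 0.0)
    a = (f_max - f_star) / s
    return (f_star - f_max) * (1.0 - norm_cdf(a)) + s * norm_pdf(a)


# Hypothetical candidates: name -> (f_star, s). Values are made up.
candidates = {
    "mol_A": (0.80, 0.05),
    "mol_B": (0.70, 0.30),
    "mol_C": (0.95, 0.01),
}
f_max = 0.90  # hypothetical best value measured so far

ranked = sorted(
    candidates,
    key=lambda name: expected_improvement(*candidates[name], f_max),
    reverse=True,
)
for name in ranked:
    print(name, expected_improvement(*candidates[name], f_max))
```

Note how a candidate with a lower mean but a large standard deviation (here `mol_B`) can still score well, since the second term of Eq. (34) rewards uncertainty.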


About this article

Cite this article

Li, G. Molecular discovery by optimal sequential search. J Math Chem 57, 2110–2141 (2019). https://doi.org/10.1007/s10910-019-01062-9


Received: 11 June 2019
Accepted: 26 August 2019
Published: 31 August 2019
Issue Date: October 2019
DOI: https://doi.org/10.1007/s10910-019-01062-9
