Molecular discovery by optimal sequential search

  • Original Paper
Journal of Mathematical Chemistry

Abstract

Developing a new compound in chemistry or molecular biology, especially a new medicine in the pharmaceutical industry, often requires finding the molecule or molecules with the best value of a desired property (e.g., binding affinity) in a large library of molecules that share one scaffold but carry one of m distinct functional substituents at each of n sites. The total number \(N_{\mathrm{lib}}\) of molecules in this library is \(m^n\), which can be very large (e.g., millions). The task is challenging because synthesizing and testing all of these molecules is costly and often infeasible. A new algorithm, referred to as optimal sequential search, is developed to overcome this difficulty. The algorithm is chemically intuitive, uses only information about molecular composition, and is accessible to practicing chemists. It can be applied to small, medium, and large molecule libraries. With syntheses and property measurements for a limited number of molecules, the best candidate molecules can be effectively captured from the whole library. Three examples, with library sizes of 64, 160,000, and 1,048,576, are used for illustration. For the small library, syntheses and property measurements of 17 molecules are sufficient to capture the top 7 candidate molecules; for the medium and large libraries, syntheses and property measurements of about one thousand molecules capture most or a large part of the top 500 candidates, and in particular of the top 100. However, the algorithm requires many (e.g., hundreds of) iterations of synthesis and property measurement, and the time cost may be unacceptable if they are performed manually. Automating the sequential search process is therefore the next task needed to make the algorithm practical.
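To make the library size \(m^n\) concrete, the following minimal sketch enumerates a combinatorial library as tuples with one substituent label per site. The function names, the labels "ABCD", and the factorizations (m, n) = (4, 3), (20, 4), and (4, 10) are illustrative assumptions; the abstract states only the three totals.

```python
from itertools import product


def library_size(m: int, n: int) -> int:
    """Number of molecules with m substituents available at each of n sites."""
    return m ** n


def enumerate_library(substituents, n_sites):
    """Lazily yield every molecule as a tuple of one substituent label per site."""
    return product(substituents, repeat=n_sites)


# Hypothetical labels: 4 substituents at 3 sites.
small = list(enumerate_library("ABCD", 3))
print(len(small), small[0], small[-1])

# Assumed (m, n) pairs that reproduce the three library sizes in the abstract.
for m, n in [(4, 3), (20, 4), (4, 10)]:
    print(m, n, library_size(m, n))
```

Because `enumerate_library` returns a lazy iterator, the larger libraries can be traversed without holding every molecule in memory at once.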


Author information

Corresponding author

Correspondence to Genyuan Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Genyuan Li: retired from Princeton University.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (txt 8 KB)

Supplementary material 2 (xls 4816 KB)

Appendix: Derivation of the formula for the expected improvement E(I) for maximization

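Let \(Y\) be the random variable that models the unknown value of a function of one variable \(y=f(x)\) at a candidate point \(x_*\), and let \(f_{\mathrm{max}}\) be the largest value observed so far. Define the improvement \(I(x_*)\) as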

$$\begin{aligned} I(x_*)=\max (Y-f_{\mathrm{max}}, 0). \end{aligned}$$
(26)

The expected improvement \(E(I(x_*))\) is then

$$\begin{aligned} E(I(x_*))={\mathbb {E}}[I(x_*)]\equiv {\mathbb {E}}[\max (Y-f_{\mathrm{max}},0)] \end{aligned}$$
(27)

i.e.,

$$\begin{aligned} E(I(x_*))= & {} \int _{-\infty }^{\infty } \max (y-f_{\mathrm{max}},0)\phi (y){\mathrm d}y \nonumber \\= & {} \int _{-\infty }^{f_{\mathrm{max}}} 0\phi (y){\mathrm d}y + \int _{f_{\mathrm{max}}}^{\infty } (y-f_{\mathrm{max}})\phi (y){\mathrm d}y \nonumber \\= & {} \int _{f_{\mathrm{max}}}^{\infty }(y-f_{\mathrm{max}})\frac{1}{\sqrt{2\pi }s} \exp \left[ -\frac{(y-f_*)^2}{2s^2}\right] {\mathrm d}y, \end{aligned}$$
(28)

where \(\phi (y)\) is the probability density function (pdf) of the normal distribution \({{\mathcal {N}}}(f_*, s^2)\) with mean \(f_*\) and variance \(s^2\). Setting

$$\begin{aligned} u=\frac{y-f_*}{s}, \qquad y=su+f_*, \qquad {\mathrm d}y = s{\mathrm d}u \end{aligned}$$
(29)

gives

$$\begin{aligned} E(I(x_*)) &= \int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } (su+f_*-f_{\mathrm{max}}) \frac{1}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u \\ &= (f_*-f_{\mathrm{max}})\int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } \frac{1}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u + s\int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } \frac{u}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u \\ &= (f_*-f_{\mathrm{max}})\left[ \int _{-\infty }^{\infty } \frac{1}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u -\int _{-\infty }^{(f_{\mathrm{max}}-f_*)/s} \frac{1}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u\right] \\ &\quad + s\int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } \frac{u}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u \\ &= (f_*-f_{\mathrm{max}})\left[ 1-\Phi \left( \frac{f_{\mathrm{max}}-f_*}{s}\right) \right] + s\int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } \frac{u}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u, \end{aligned}$$
(30)

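where \(\Phi \) denotes the standard normal cumulative distribution function. From this point on, \(\phi \) denotes the standard normal pdf rather than the \({{\mathcal {N}}}(f_*, s^2)\) density used in Eq. (28). Note that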

$$\begin{aligned} \phi (u)=\frac{1}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}, \qquad {\mathrm d}\phi (u)=-\frac{u}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u. \end{aligned}$$
(31)

The second term in Eq. (30) becomes

$$\begin{aligned} s\int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } \frac{u}{\sqrt{2\pi }}e^{-\frac{u^2}{2}}{\mathrm d}u &= -s \int _{(f_{\mathrm{max}}-f_*)/s}^{\infty } {\mathrm d}\phi (u) \\ &= s\left[ \phi (u)\right] _{\infty }^{(f_{\mathrm{max}}-f_*)/s} =s\phi \left( \frac{f_{\mathrm{max}}-f_*}{s}\right) . \end{aligned}$$
(32)
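The identity in Eq. (32) can be checked numerically. The sketch below compares a midpoint-rule estimate of \(\int _a^{\infty } u\,\phi (u)\,{\mathrm d}u\) with \(\phi (a)\), where \(a\) stands for \((f_{\mathrm{max}}-f_*)/s\). The function names, the truncation of the upper limit at 12, the step count, and the test values of \(a\) are assumptions chosen for illustration.

```python
import math


def std_normal_pdf(u: float) -> float:
    """Standard normal pdf, phi(u) in Eq. (31)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def tail_first_moment(a: float, upper: float = 12.0, steps: int = 20_000) -> float:
    """Midpoint-rule estimate of the integral of u*phi(u) over [a, upper].

    The infinite upper limit is truncated at `upper`, where phi is negligible.
    """
    h = (upper - a) / steps
    total = 0.0
    for i in range(steps):
        u = a + (i + 0.5) * h
        total += u * std_normal_pdf(u)
    return total * h


# Each row shows a, the numerical integral, and phi(a); the last two should agree.
for a in (-1.5, 0.0, 0.7):
    print(a, tail_first_moment(a), std_normal_pdf(a))
```

Multiplying both columns by \(s\) gives the two sides of Eq. (32).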

Substituting Eq. (32) into (30) yields

$$\begin{aligned} E(I(x_*))=(f_*-f_{\mathrm{max}})\left[ 1-\Phi \left( \frac{f_\mathrm{max}-f_*}{s}\right) \right] +s\phi \left( \frac{f_\mathrm{max}-f_*}{s}\right) . \end{aligned}$$
(33)

Note that the derivation involves only the output y and not the dimension of the input. The same result therefore holds for a function \(f(\mathbf{x})\) of \(n>1\) variables, so Eq. (33) is also valid for \(f(\mathbf{x})\), i.e.,

$$\begin{aligned} E(I(\mathbf{x}_*))=(f_*-f_{\mathrm{max}})\left[ 1-\Phi \left( \frac{f_\mathrm{max}-f_*}{s}\right) \right] +s\phi \left( \frac{f_\mathrm{max}-f_*}{s}\right) . \end{aligned}$$
(34)
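Because Eq. (34) depends on the input \(\mathbf{x}_*\) only through \(f_*\) and \(s\), it can score candidates of any dimension. The sketch below shows one plausible use, which is not necessarily the procedure of the paper: among unmeasured molecules, pick the one with the largest expected improvement. The compositions, the predicted means and standard deviations, the value of `f_max`, and the handling of \(s \le 0\) are all hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi


def expected_improvement(f_star: float, s: float, f_max: float) -> float:
    """Eq. (34): depends on the input x_* only through f_* and s."""
    if s <= 0.0:
        # Degenerate case (an assumption, not covered by the derivation):
        # without uncertainty the improvement is deterministic.
        return max(f_star - f_max, 0.0)
    z = (f_max - f_star) / s
    return (f_star - f_max) * (1.0 - _STD.cdf(z)) + s * _STD.pdf(z)


# Hypothetical predictions for unmeasured molecules:
# composition (one label per site) -> (predicted mean f_*, standard deviation s)
predictions = {
    ("A", "C", "B"): (0.90, 0.05),
    ("B", "B", "D"): (0.70, 0.40),
    ("D", "A", "A"): (0.95, 0.01),
}
f_max = 1.00  # hypothetical best property value measured so far

scores = {x: expected_improvement(mu, s, f_max) for x, (mu, s) in predictions.items()}
next_candidate = max(scores, key=scores.get)
print(next_candidate, scores[next_candidate])
```

In this example the candidate with the lowest predicted mean is selected because its large uncertainty outweighs the deficit in the mean, which illustrates how the two terms of Eq. (34) trade off exploitation against exploration.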

Cite this article

Li, G. Molecular discovery by optimal sequential search. J Math Chem 57, 2110–2141 (2019). https://doi.org/10.1007/s10910-019-01062-9

  • DOI: https://doi.org/10.1007/s10910-019-01062-9
