Multiple linear regression models for predicting the n‑octanol/water partition coefficients in the SAMPL7 blind challenge

Lopez, Kenneth; Pinheiro, Silvana; Zamora, William J.

doi:10.1007/s10822-021-00409-2

Published: 12 July 2021

Volume 35, pages 923–931, (2021)

Journal of Computer-Aided Molecular Design

Abstract

A multiple linear regression model called MLR-3 is used to predict the experimental n-octanol/water partition coefficient (log P_N) of the 22 N-sulfonamides proposed by the organizers of the SAMPL7 blind challenge. The MLR-3 model was trained with 82 molecules, including drug-like sulfonamides and small organic molecules, which resembled the main functional groups present in the challenge dataset. Our model, submitted as “TFE-MLR”, gave a root-mean-square error of 0.58 and a mean absolute error of 0.41 log P units, the highest accuracy both among empirical methods and among all ranked submissions. Overall, the results support the appropriateness of the multiple linear regression approach MLR-3 for computing the n-octanol/water partition coefficient of sulfonamide-bearing compounds. In this context, the outstanding performance of empirical methodologies, where 75% of the ranked empirical submissions achieved root-mean-square errors < 1 log P unit, supports the suitability of these strategies for obtaining accurate and fast predictions of physicochemical properties such as the partition coefficients of bioorganic compounds.

Introduction

The relevance of lipophilicity in the pharmaceutical sciences has been known for over a century [1]. Lipophilicity is the affinity of a molecule for a lipophilic environment. The logarithms of the n-octanol/water partition and distribution coefficients of neutral and ionizable compounds (log P_N and log D_pH, respectively) are the gold-standard quantitative descriptors of lipophilicity [2]. Thus, log P_N has been used to predict the ability of bioorganic compounds to cross cell membranes [3]. It is still used to assess the impact of lipophilicity on the pharmacokinetic parameters and potency [4], metabolism and excretion [5, 6], and toxicity [7] of research compounds.

A plethora of computational methods exists for predicting log P_N [2], and the SAMPL challenges aim to evaluate them through blind predictions of physical properties [8]. In the SAMPL6 log P_N challenge, two kinds of approaches were submitted: physical models, which made their predictions from molecular conformations using quantum mechanics (QM) and molecular mechanics (MM) methods, and empirical methods, which fell into two major categories, group contribution and quantitative structure–property relationship (QSPR) methods [9].

Multiple linear regression (MLR) analysis is a simple algorithm widely used in chemoinformatics. It establishes a correlation between a set of independent variables and a dependent variable [10]. Several MLR models have been built to predict the n-octanol/water partition coefficient of bioorganic compounds, using calculated molecular descriptors [11], QM electronic descriptors [12], molecular holograms containing atom type information [13], volume and surface area descriptors [14, 15], and hydrophobic area and chain descriptors [15]. Accordingly, MLR approaches have successfully predicted the log P_N of organic compounds, especially for molecules within a close chemical space such as substituted aromatic drugs [15], polychlorinated diphenyl ethers [16], blocked tripeptides [17], fragment-like small molecules in the SAMPL6 log P_N challenge [12], and sulfonamides [18].

Here, we report the results obtained with 3 different multiple linear regression (MLR) models to reproduce the experimental values of log P_N for 22 sulfonamides in the SAMPL7 log P_N blind prediction challenge. The performance of the MLR models is discussed together with an analysis of the compound with the largest deviation between the experimental and calculated log P_N values. The MLR-3 method, identified as “TFE-MLR”, was the approach submitted for ranking purposes.

Methods

Dataset

The SAMPL7 blind challenge consisted of predicting the partition coefficient between water and n-octanol (log P_N) of 22 N-acylsulfonamides synthesized by the Ballatore Lab [19] (see Fig. 1). The set consisted of amide, oxetane, thietane, thietane-1-oxide, thietane-1,1-dioxide, isoxazole and triazole N-acylsulfonamide derivatives. Most compounds in the dataset were achiral; only SM35, SM36 and SM37 had a chiral center. SMILES strings of the neutral molecules were provided by the organizers on the SAMPL7 website [20].

Multiple linear regression models

Taking into consideration the chemistry of the SAMPL7 dataset (see Fig. 1), a total of 87 small molecules (see Table S1 and SI TFE-MLR_trainingset.xlsx) were chosen to cover the chemical space needed for the challenge and used to build multiple linear regression models for predicting the experimental log P_N of the 22 N-sulfonamides proposed by the organizers of the challenge [20]. These molecules, which include drug-like sulfonamides and smaller molecules, resemble the main functional groups present in the challenge dataset (see Fig. 2).

The SMILES codes and experimental log P_N values were obtained from publicly available data in PubChem [21], DrugBank [22], and other specific sources [23,24,25]. The SMILES codes were converted to sdf files using the ChemmineOB package in R [26]. Of the 87 molecules, five were chosen randomly as a test set, with the condition that they be drug-like sulfonamides. This was done to mimic the nature of the blind challenge in terms of chemical space. In addition, given the small size of our set, we sought to keep a considerable number of observations (~ 95%, 82 molecules) to build the model.
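
The hold-out step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R workflow: the dictionary keys `name` and `drug_like_sulfonamide`, the function name, the seed, and the figure of 30 candidate sulfonamides are all hypothetical.

```python
import random

def hold_out_sulfonamides(molecules, n_test=5, seed=0):
    """Split molecules into (train, test); test holds n_test random drug-like sulfonamides."""
    candidates = [m for m in molecules if m["drug_like_sulfonamide"]]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = rng.sample(candidates, n_test)
    test_names = {m["name"] for m in test}
    train = [m for m in molecules if m["name"] not in test_names]
    return train, test

# Hypothetical stand-in for the 87-molecule set: 30 flagged sulfonamides, 57 others.
molecules = [{"name": f"mol{i}", "drug_like_sulfonamide": i < 30} for i in range(87)]
train, test = hold_out_sulfonamides(molecules)
print(len(train), len(test))  # 82 5
```

Restricting the random draw to the flagged candidates is what keeps the test set inside the chemical space of the challenge.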

For the training set, multiple linear regression (MLR) models were used to find the relationship between a selected number of descriptors (d_i) and the experimental n‑octanol/water partition coefficients (log P_N,exptl.).

$$\log P_{\text{N,exptl.}} = \sum_{i=1}^{n} c_i\, d_i + b \qquad (1)$$

In Eq. 1, b stands for the intercept [27] and c_i for the coefficients, which were estimated by regression analysis. The MLR models and the statistical analysis were done in R.
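
As an illustration of Eq. 1, the following Python sketch estimates the coefficients c_i and the intercept b by ordinary least squares. The authors did this in R; the code below is not theirs. The function names `fit_mlr` and `predict` and the toy data are assumptions, and the normal equations are solved by plain Gaussian elimination to keep the example dependency-free.

```python
def fit_mlr(X, y):
    """Ordinary least squares for Eq. 1: returns (coefficients c_i, intercept b).

    X is a list of descriptor rows, y the experimental log P_N values.
    Solves the normal equations (A^T A) w = A^T y by Gauss-Jordan elimination.
    """
    A = [list(row) + [1.0] for row in X]  # extra column of ones for the intercept b
    p = len(A[0])
    # Augmented system [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in A) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(A, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))  # partial pivoting
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    w = [M[i][p] / M[i][i] for i in range(p)]
    return w[:-1], w[-1]

def predict(X, coefs, intercept):
    """Evaluate Eq. 1 for each descriptor row."""
    return [sum(c * d for c, d in zip(coefs, row)) + intercept for row in X]

# Toy data generated from log P = 0.5*d1 - 0.3*d2 + 1.0 (not values from the paper)
X = [[1, 0], [2, 1], [3, 1], [4, 3], [5, 2]]
y = [0.5 * d1 - 0.3 * d2 + 1.0 for d1, d2 in X]
coefs, b = fit_mlr(X, y)
print([round(c, 3) for c in coefs], round(b, 3))
```

Because the toy data are exactly linear, the fit should recover the generating coefficients to within rounding error. With real descriptors, strongly intercorrelated columns make the normal equations ill-conditioned, which is one reason to inspect descriptor intercorrelations first.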

The models were trained on both functional group and molecular property-based descriptors (see Table 1). In the former case, a straightforward functional group count was used as a descriptor, whereas in the latter case, molecular properties related to lipophilicity [11] were generated to obtain a better description of this physicochemical property. All descriptors were computed with the ChemmineR [28] and ChemmineOB [26] packages, except that the number of occurrences of each functional group was computed with a modified in-house function based on these packages. Intercorrelations between descriptors were analyzed (see Fig. S1), as were the individual correlations of each descriptor with the experimental log P_N values for the training set (see Table 1).

Table 1 List of descriptors used in the present study and their coefficient of determination (R²) against experimental log P_N values for the training set

For the purpose of this study, 3 different models were tested to select the approach that best reproduces the experimental n-octanol/water partition coefficients of neutral compounds. First, the approach labeled MLR-1 used the counts of structural features represented by descriptors 1 to 19 (see Table 1). Next, the second approach (MLR-2) added descriptors related to intramolecular interactions, namely hydrogen bond acceptor sites (HBA1), hydrogen bond acceptor atoms (HBA2), and hydrogen bond donor atoms (HBD), giving descriptors 1 to 22 (see Table 1). Finally, the last model (MLR-3) appended two computed atomic contributions, the polar surface area (PSA) and molar refractivity (MR), giving descriptors 1 to 24 (see Table 1). The performance of all approaches was compared through statistical analysis (see Table 2).
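
The nesting of the three descriptor sets can be made explicit with a small Python sketch. The placeholder names `fg_1` to `fg_19` and the exact ordering within Table 1 are assumptions; only the set sizes (19, 22, and 24) and the names HBA1, HBA2, HBD, PSA, and MR come from the text.

```python
# Assumed ordering: 19 functional-group counts first, then the property descriptors.
STRUCTURAL = [f"fg_{i}" for i in range(1, 20)]  # placeholder names for descriptors 1-19
HBOND = ["HBA1", "HBA2", "HBD"]                  # descriptors 20-22
ATOMIC = ["PSA", "MR"]                           # descriptors 23-24

MODELS = {
    "MLR-1": STRUCTURAL,
    "MLR-2": STRUCTURAL + HBOND,
    "MLR-3": STRUCTURAL + HBOND + ATOMIC,
}

def select_columns(row, model):
    """Pick the descriptor values a given model uses from a {name: value} row."""
    return [row[name] for name in MODELS[model]]

for name, cols in MODELS.items():
    print(name, len(cols))
# MLR-1 19
# MLR-2 22
# MLR-3 24
```

Each model is a superset of the previous one, so any gain in fit from MLR-1 to MLR-3 can be attributed to the added descriptors.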

Table 2 Statistical parameters of MLR approaches for predicting experimental log P_N values for the training set (n = 82)

For the test set, 5 sulfonamide-bearing drugs (see Fig. 3) were randomly chosen from the original set (see Table S1). A statistical comparison between the experimental log P_N values for the test set and the values predicted by our MLR methods, as well as by other common approaches [15, 29, 30] (see Table S2), was made to further check the suitability of the 3 MLR models mentioned above (see Table 3). In addition, k-fold cross-validation with k = 5 was performed to validate the 3 models (see Table S4).

Table 3 Statistical parameters of the comparison between experimental and predicted log P_N values for the test set using the 3 MLR approaches and other common approaches

Supported by the statistical analysis and based on its predictive power for both the training and test sets (see Fig. 4), the quantitative structure–property relationship (QSPR) approach submitted to predict the SAMPL7 experimental log P_N values was the method labeled MLR-3. Details of the model, including the list of descriptors, their coefficients, and the model parameters, are listed in Table S3.

An additional test set, called DB40 (see SI TFE-MLR_DB40.xlsx), was built by filtering the DrugBank database for the sulfonyl moiety (1102 molecules) and then removing the molecules already used in the set of 87 small molecules mentioned above. Thus, a final set of 40 approved drugs and/or drug-like molecules was tested with the MLR-3 method (see Fig. S5) to verify the applicability of the method to other biologically active sulfonyl-bearing drugs. Finally, the general performance of the MLR-3 method is presented by combining the training, test, SAMPL7, and DB40 datasets (149 molecules), paying special attention to the worst predictions.

Results and discussion

The method presented in this work for predicting the log P_N of 22 N-sulfonamides in the SAMPL7 challenge dataset corresponds to the “TFE-MLR” submission. The quantitative structure–property relationship (QSPR) approach was based on the multiple linear regression model called MLR-3, as further described in the methods section.

The functional group descriptors with the highest individual correlations (see Table 1) were the number of carbon atoms (R² = 0.50), the number of aromatic rings (R² = 0.34), the number of aliphatic rings (R² = 0.30), and the counts of tertiary amine, fluoroalkyl, and primary amine groups (R² = 0.15, R² = 0.13, and R² = 0.11, respectively). The presence of representative functional groups in the training set has a direct impact on the prediction of the n-octanol/water partition coefficient, and this strategy has been exploited in a wide variety of formalisms, from atomic to fragmental strategies [31,32,33]. On the other hand, molar refractivity (R² = 0.41), hydrogen bond acceptor sites, and hydrogen bond acceptor atoms (R² = 0.11 and R² = 0.10, respectively) were the molecular property-based descriptors that best correlated with the experimental values of the training set. Both kinds of descriptors have been employed to compute n-octanol/water partition coefficients in previous works [27], where molar refractivity was used as a surrogate of molecular size, whilst hydrogen bond counts reflected intermolecular interactions. For the sake of clarity, although the hydrogen bond descriptors HBA1 and HBA2 [28] are correlated (see Fig. S1), they give differentiated information (for details see Fig. S2), i.e., HBA2 takes into account electron pairs on nitrogen atoms that are able to delocalize, whereas HBA1 does not.

We decided to submit the approach called MLR-3 because it presented the most suitable statistical parameters (see Table 2), supported by the cross-validation analysis (see Table S4). In addition, a preliminary prediction for the 5 biologically active sulfonamide-bearing drugs chosen as the test set was surprisingly accurate with this model (see Fig. 3 and Table 3). In fact, our model outperforms common algorithms for log P_N, e.g., ChemAxon [29] and DataWarrior [30], as well as MLR models trained with specific compounds such as substituted aromatic drugs, e.g., VLifeMDS [15].

Table 4 shows the predicted log P_N values for the 22 N-sulfonamides in the SAMPL7 challenge dataset. The root-mean-square error (RMSE) between the values predicted by the MLR-3 model and the experimental data is 0.58 log P units. As noted in Figures S3 and S4, our model has the lowest RMSE in the empirical methods category, a category that performed outstandingly overall: six out of the eight ranked empirical methods have an RMSE < 1 log P unit. The second-best RMSE among the ranked submissions was obtained by Chemprop [34], which consists of message passing neural networks (MPNN) created by an MIT research group. Chemprop’s submitters used this MPNN with a processed version of the OPERA [35] log P data set. This model has been used for different prediction purposes: properties, antibiotic probability, and SARS-CoV inhibition [34]. The third-lowest RMSE was obtained by GROVER (graph representation from self-supervised message passing transformer) [36], which incorporates MPNN into the transformer architecture to give more expressive encoders and flexibility. The ffsampled_deeplearning_cl1 entry also used an MPNN; this algorithm was based on a previously reported neural network (NN) [37]. ClassicalGSG used an NN whose inputs were molecular features generated with a method called Geometric Scattering for Graphs (GSG) and classical molecular dynamics [38]. Finally, TFE_Attentive_FP used a graph neural network with a novel architecture called Attentive FP. This NN architecture includes an attention mechanism that focuses on the most important parts of the inputs to achieve better predictions [39].
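
For reference, the two error metrics used to rank the submissions can be computed as in the sketch below. The function names and the four log P values are hypothetical and are not taken from Table 4.

```python
import math

def rmse(experimental, predicted):
    """Root-mean-square error between predicted and experimental values."""
    return math.sqrt(sum((p - e) ** 2 for e, p in zip(experimental, predicted))
                     / len(experimental))

def mae(experimental, predicted):
    """Mean absolute error between predicted and experimental values."""
    return sum(abs(p - e) for e, p in zip(experimental, predicted)) / len(experimental)

# Hypothetical log P values
exp_logp = [1.0, 2.0, 3.0, 4.0]
pred_logp = [1.5, 2.0, 2.5, 5.0]
print(round(rmse(exp_logp, pred_logp), 2), round(mae(exp_logp, pred_logp), 2))  # 0.61 0.5
```

Because the RMSE squares each error before averaging, a single large deviation raises it more than it raises the MAE, which is why the two metrics can differ noticeably for the same submission.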

Table 4 Calculated (submission ID “TFE-MLR”) and experimental n-octanol/water partition coefficients (log P_N) determined for the 22 sulfonamides included in the SAMPL7 dataset

Among the 17 participants/organizations with ranked submissions, which include physical (QM and MM) and empirical categories, our approach MLR-3 (submission ID: “TFE-MLR”) ranked first as determined by the root-mean-square error and the mean absolute error (see Fig. S3). Among the physical methods, two ranked quantum mechanics (QM) methods (COSMO-RS and IEFPCM MST) and no molecular mechanics (MM) method achieved an RMSE of around 1 log P unit. Their short computing time, low computational cost, and good performance make simple multiple linear regression models, as well as other empirical approaches (e.g., machine learning), attractive strategies for computing lipophilicity descriptors such as log P_N. Although most well-performing methods for computing log P_N in the SAMPL7 blind challenge were empirical [40], it must be kept in mind that empirical methods have important disadvantages compared with strategies based on molecular mechanics and/or quantum chemistry. First, they depend strongly on the training set, which limits the range of molecules that can be predicted [41] (e.g., our approach was trained to predict partition coefficients for drug-like sulfonamide compounds). Second, to the best of our knowledge, empirical methods cannot assign a partition coefficient to a specific conformation of the molecule under analysis. These facts limit subsequent applications that MM and/or QM approaches can address, e.g., the study of bioactive conformations.

For the sake of consistency with the results obtained for the training and test sets, Table 5 reports statistical parameters of the predicted log P_N values for the 22 N-sulfonamides in the SAMPL7 challenge dataset using the 3 MLR approaches described in the methods section. As expected, the submitted model obtained the highest accuracy among the MLR approaches tested. In addition, MLR-3 performed better with the SAMPL7 set (RMSE = 0.58 log P units) than with our training set (RMSE = 0.64 log P units, see Table 2).

Table 5 Statistical parameters of the comparison between experimental and predicted log P_N values for the 22 N-sulfonamides in the SAMPL7 challenge dataset using the 3 MLR approaches

Analyzing the differences between the predicted and experimental values, the notable outlier is compound SM36 (see Table 4), whose error in the predicted log P_N is roughly 3 times the model uncertainty (RMSE = 0.64). In fact, SM36 is the only compound for which our method gives an absolute error larger than 1 log unit. For comparison, the three most accurate empirical methods (MLR-3, Chemprop, and GROVER) show the same tendency: they overestimate the log P_N of compound SM36 by 1.61 log units on average. Excluding SM36 significantly improves the predictive ability of our approach, reducing the RMSE by 26% and increasing the R² by 51% (see Fig. 5).
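
How much a single outlier weighs on the RMSE can be illustrated with the sketch below. The four signed errors and the compound labels are hypothetical, so the percentage printed here is not the 26% reported for SM36; the function names are likewise assumptions.

```python
import math

def rmse_of_errors(errors):
    """RMSE computed directly from a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def effect_of_exclusion(errors, name):
    """Return (RMSE with all compounds, RMSE without `name`, percent reduction)."""
    full = rmse_of_errors(list(errors.values()))
    reduced = rmse_of_errors([e for n, e in errors.items() if n != name])
    return full, reduced, 100.0 * (full - reduced) / full

# Hypothetical signed errors (predicted minus experimental), not Table 4 values
errors = {"A": 0.3, "B": -0.4, "C": 0.5, "D": 1.9}
full, reduced, pct = effect_of_exclusion(errors, "D")
print(round(full, 2), round(reduced, 2), round(pct, 1))
```

The same comparison applied to the 22 SAMPL7 errors would show how strongly one compound drives the overall statistic.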

The experimental log P_N reported for SM36 is low considering the chemical structure of this compound. For instance, SM36 has a phenyl group in the sulfonamide moiety instead of the methyl group in SM35 (see Fig. 1). Thus, a higher log P_N value is expected for SM36 because benzene rings are significantly lipophilic fragments [42,43,44]; however, this was not the experimental observation for the pair SM35-SM36. Comparison of the analogous pairs of molecules SM29-SM30, SM32-SM33, SM41-SM42, and SM44-SM45 reveals the conventional increase in the experimental log P_N that results from replacing a methyl group with a phenyl group. Figure 6 depicts the experimental tendency observed in the log P_N for phenyl/methyl analogs, which our method also predicted except for the pair SM35–SM36.

Because the model was intended as a local model to accurately determine the n-octanol/water log P_N for the SAMPL7 dataset, its reliability can also be tested on other compounds that fall within its domain of application, namely biologically active sulfonyl-bearing drugs. For this reason, we tested the DB40 set (see methods section for details). The predictive power for DB40 was lower than for SAMPL7 (RMSE = 1.13, see Fig. S5); however, it can still represent an acceptable estimate considering that the variability of experimental values can often amount to 0.6 log P units [45].

Finally, we used the 149 compounds belonging to the training, test, SAMPL7, and DB40 datasets to further check the reliability of the MLR-3 model, and detected only six outliers with errors of roughly 3 times the model uncertainty or more (see Fig. 7, top). Here, rosuvastatin (DB40 set) shows the largest absolute deviation (3.08), followed by two other compounds of the DB40 set, vardenafil (2.12) and tirofiban (2.10), then a compound of the training set, brinzolamide (1.97), another compound of the DB40 set, meloxicam (1.89), and finally a compound of the SAMPL7 set, SM36 (1.88). For rosuvastatin and brinzolamide, the predicted log P_N values are 3.21 and 0.17, respectively (see SI TFE-MLR_DB40.xlsx and TFE-MLR_trainingset.xlsx), whereas the DrugBank database [22] reports experimental log P_N values of 0.13 and −1.80, respectively, without an available reference. A more exhaustive literature search, however, yields experimental log P_N values of 2.52 for rosuvastatin [46] and 0.82 for brinzolamide [47], which agree better with the predicted values. We therefore adopted these verified experimental values and, because it is imperative to be able to verify the sources of the experimental values, omitted vardenafil and tirofiban. The resulting set of 147 compounds was tested with our model (see Fig. 7, bottom), which reduces the RMSE between predicted and experimental data by ∼ 0.10 log P units. The remaining outliers are SM36, whose peculiarities were explained above, and meloxicam, whose log P_N our method could not properly determine because of the limitations of our local model, presumably the lack of a correct description of crucial functional groups such as enolic groups, which can present several tautomeric forms and also favor conformations with specific intramolecular hydrogen bonds [48].
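
The effect of replacing the DrugBank values with the literature values can be checked with simple arithmetic, sketched here in Python. All numbers are the ones quoted in the paragraph above; the variable and function names are assumptions.

```python
MODEL_RMSE = 0.64  # training-set RMSE quoted in the text

# name: (predicted, DrugBank value, literature value), all taken from the text
compounds = {
    "rosuvastatin": (3.21, 0.13, 2.52),
    "brinzolamide": (0.17, -1.80, 0.82),
}

def abs_deviation(predicted, experimental):
    """Absolute deviation in log P units, rounded to two decimals."""
    return round(abs(predicted - experimental), 2)

for name, (pred, drugbank, literature) in compounds.items():
    before = abs_deviation(pred, drugbank)
    after = abs_deviation(pred, literature)
    print(f"{name}: {before} -> {after} ({after / MODEL_RMSE:.1f} x RMSE)")
# rosuvastatin: 3.08 -> 0.69 (1.1 x RMSE)
# brinzolamide: 1.97 -> 0.65 (1.0 x RMSE)
```

With the literature values, both deviations fall to about one model RMSE, so neither compound remains an outlier.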

Overall, the results support the appropriateness of our multiple linear regression model for computing lipophilicity descriptors such as the n-octanol/water partition coefficient of drug-like sulfonamide compounds. Furthermore, the outstanding performance of empirical methodologies, where 75% of the ranked empirical submissions achieved root-mean-square errors < 1 log P unit, reinforces the suitability of these strategies for obtaining fast and accurate predictions of the physicochemical properties of bioorganic compounds.

Conclusions

Fast and accurate prediction of the n-octanol/water partition coefficient of pharmacologically relevant compounds is of utmost importance for evaluating their molecular quality. Within the framework of the SAMPL7 blind partition coefficient challenge, we have explored the performance of a multiple linear regression model called MLR-3 for predicting the n-octanol/water partition coefficient of 22 sulfonamides. Considering the small number of molecules in our training set and the simplicity of the descriptors used, the results are encouraging and support the efficiency of the straightforward strategy presented here for computing n-octanol/water log P_N. Even though the selection of training molecules was appropriate for the aim of this study, we are aware of the limitations of our model in terms of its application domain. Future studies will therefore focus on a more extensive and diverse set of experimental data, in order to apply the approach developed here to other kinds of bioorganic compounds and obtain a generalized model.

References

1. Waring MJ (2010) Lipophilicity in drug discovery. Expert Opin Drug Discov 5:235–248. https://doi.org/10.1517/17460441003605098
2. Yang X, Wang Y, Byrne R et al (2019) Concepts of artificial intelligence for computer-assisted drug discovery. Chem Rev 119:10520–10594
3. Lobo S (2020) Is there enough focus on lipophilicity in drug discovery? Expert Opin Drug Discov 15:261–263
4. Miller RR, Madeira M, Wood HB et al (2020) Integrating the impact of lipophilicity on potency and pharmacokinetic parameters enables the use of diverse chemical space during small molecule drug optimization. J Med Chem 63:12156–12170. https://doi.org/10.1021/acs.jmedchem.9b01813
5. Kakehashi H, Shima N, Ishikawa A et al (2020) Effects of lipophilicity and functional groups of synthetic cannabinoids on their blood concentrations and urinary excretion. Forensic Sci Int. https://doi.org/10.1016/j.forsciint.2019.110106
6. Chmiel T, Mieszkowska A, Kempińska-Kupczyk D et al (2019) The impact of lipophilicity on environmental processes, drug delivery and bioavailability of food components. Microchem J 146, 2-48
7. Chatzopoulou M, Emer E, Lecci C et al (2020) Decreasing HepG2 cytotoxicity by lowering the lipophilicity of Benzo[d]oxazolephosphinate Ester Utrophin modulators. ACS Med Chem Lett 11:2421–2427. https://doi.org/10.1021/acsmedchemlett.0c00405
8. https://www.samplchallenges.org/
9. Işık M, Bergazin TD, Fox T et al (2020) Assessing the accuracy of octanol–water partition coefficient predictions in the SAMPL6 Part II log P challenge. J Comput Aided Mol Des 34:335–370. https://doi.org/10.1007/s10822-020-00295-0
10. Peter SC, Dhanjal JK, Malik V et al (2018) Quantitative structure-activity relationship (QSAR): Modeling approaches to biological applications. In: Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics
11. Eros D, Kovesdi I, Orfi L et al (2012) Reliability of logP predictions based on calculated molecular descriptors: a critical review. Curr Med Chem 9:1819–1829. https://doi.org/10.2174/0929867023369042
12. Patel P, Kuntz DM, Jones MR et al (2020) SAMPL6 logP challenge: machine learning and quantum mechanical approaches. J Comput Aided Mol Des 34:495–510. https://doi.org/10.1007/s10822-020-00287-0
13. Plante J, Werner S (2018) JPlogP: an improved logP predictor trained using predicted data. J Cheminform 10:1–10. https://doi.org/10.1186/s13321-018-0316-5
14. Chen HF (2009) In silico log P prediction for a large data set with support vector machines, radial basis neural networks and multiple linear regression. Chem Biol Drug Des 74:142–147. https://doi.org/10.1111/j.1747-0285.2009.00840.x
15. Bahmani A, Saaidpour S, Rostami A (2017) A simple, robust and efficient computational method for n-octanol/water partition coefficients of substituted aromatic drugs. Sci Rep 7:1–14. https://doi.org/10.1038/s41598-017-05964-z
16. Yang P, Chen J, Chen S et al (2003) QSPR models for physicochemical properties of polychlorinated diphenyl ethers. Sci Total Environ 305:65–76. https://doi.org/10.1016/S0048-9697(02)00467-9
17. Yin J (2011) LogP prediction for blocked tripeptides with amino acids descriptors (HMLP) by multiple linear regression and support vector regression. Procedia Environ Sci 8:173–178. https://doi.org/10.1016/j.proenv.2011.10.028
18. Raevsky OA, Perlovich GL, Kazachenko VP et al (2009) Octanol/water partition coefficients of sulfonamides: experimental determination and calculation using physicochemical descriptors. J Chem Eng Data 54:3121–3124. https://doi.org/10.1021/je900189v
19. Francisco KR, Varricchio C, Paniak TJ et al (2021) Structure property relationships of N-acylsulfonamides and related bioisosteres. Eur J Med Chem. https://doi.org/10.1016/j.ejmech.2021.113399
20. https://github.com/samplchallenges/SAMPL7/tree/master/physical_property/logP
21. Kim S, Chen J, Cheng T et al (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. https://doi.org/10.1093/nar/gkaa971
22. Wishart DS, Knox C, Guo AC et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. https://doi.org/10.1093/nar/gkj067
23. Royal Society of Chemistry (2015) ChemSpider. Search and Share Chemistry. R. Soc. Chem.
24. Avdeef A (2003) Absorption and drug development: solubility, permeability and charge state. Wiley, New York
25. Zaragoza-Dörwald F (2012) Lead optimization for medicinal chemists: pharmacokinetic properties of functional groups and organic compounds. Wiley-VCH Verlag GmbH, New York
26. Horan K G (2017) ChemmineOB: R interface to a subset of OpenBabel functionalities. R package version 1.18.0, https://github.com/girke-lab/ChemmineOB
27. El Tayar N, Testa B, Carrupt PA (1992) Polar intermolecular interactions encoded in partition coefficients: an indirect estimation of hydrogen-bond parameters of polyfunctional solutes. J Phys Chem 96:1455–1459. https://doi.org/10.1021/j100182a078
28. Cao Y, Charisi A, Cheng LC et al (2008) ChemmineR: A compound mining framework for R. Bioinformatics 24:1733–1734. https://doi.org/10.1093/bioinformatics/btn307
29. ChemAxon, Budapest, Hungary, http://www.chemaxon.com
30. Sander T, Freyss J, Von Korff M, Rufener C (2015) DataWarrior: an open-source program for chemistry aware data visualization and analysis. J Chem Inf Model 55:460–473. https://doi.org/10.1021/ci500588j
31. Leo A, Hansch C, Elkins D (1971) Partition coefficients and their uses. Chem Rev 71:525. https://doi.org/10.1021/cr60274a001
32. Ghose AK, Crippen GM (1987) Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions. J Chem Inf Comput Sci 27:21–35. https://doi.org/10.1021/ci00053a005
33. Wang R, Fu Y, Lai L (1997) A new atom-additive method for calculating partition coefficients. J Chem Inf Comput Sci 37:615–621. https://doi.org/10.1021/ci960169p
34. http://chemprop.csail.mit.edu/
35. https://github.com/kmansouri/OPERA
36. Rong Y, Bian Y, Xu T et al (2020) GROVER: self-supervised message passing transformer on large-scale molecular data. arXiv 1–13
37. Schütt KT, Kessel P, Gastegger M et al (2019) SchNetPack: a deep learning toolbox for atomistic systems. J Chem Theory Comput 15:448–455. https://doi.org/10.1021/acs.jctc.8b00908
38. https://github.com/samplchallenges/SAMPL7/blob/master/physical_property/logP/analysis/logP_predictions/logp_DB3.csv
39. Xiong Z, Wang D, Liu X et al (2020) Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 63:8749–8760. https://doi.org/10.1021/acs.jmedchem.9b00959
40. Bergazin TD, Tielker N et al (2021) Evaluation of log P, pKa, and log D predictions from the SAMPL7 blind challenge. J Comput Aided Mol Des 4, 1-32
41. Artrith N, Butler KT, Coudert F-X et al (2021) Best practices in machine learning for chemistry. Nat Chem 13:505–508. https://doi.org/10.1038/s41557-021-00716-z
42. Fujita T, Iwasa J, Hansch C (1964) A new substituent constant, π, derived from partition coefficients. J Am Chem Soc 86:5175–5180. https://doi.org/10.1021/ja01077a028
43. Wimley WC, Creamer TP, White SH (1996) Solvation energies of amino acid side chains and backbone in a family of host-guest pentapeptides. Biochemistry 35:5109–5124. https://doi.org/10.1021/bi9600153
44. Sangster J (1997) Octanol-water partition coefficients: fundamentals and physical chemistry. Wiley-VCH Verlag GmbH, New York
45. Port A, Bordas M, Enrech R et al (2018) Critical comparison of shake-flask, potentiometric and chromatographic methods for lipophilicity evaluation (log Po/w) of neutral, acidic, basic, amphoteric, and zwitterionic drugs. Eur J Pharm Sci 122:331–340. https://doi.org/10.1016/j.ejps.2018.07.010
46. Pallicer JM, Calvet C, Port A et al (2012) Extension of the liquid chromatography/quantitative structure-property relationship method to assess the lipophilicity of neutral, acidic, basic and amphotheric drugs. J Chromatogr A 1240:113–122. https://doi.org/10.1016/j.chroma.2012.03.089
47. Brittain HG, Florey K (1992) Analytical profiles of drug substances and excipients: preface. Anal Prof Drug Subst Excip 21:1–4
48. Cysewski P (2018) Intermolecular interaction as a direct measure of water solubility advantage of meloxicam cocrystalized with carboxylic acids. J Mol Model. https://doi.org/10.1007/s00894-018-3649-0

Author information

Authors and Affiliations

School of Chemistry, University of Costa Rica, San Pedro, San José, Costa Rica
Kenneth Lopez & William J. Zamora
Advanced Computing Lab (CNCA), National High Technology Center (CeNAT-CONARE), Pavas, San José, Costa Rica
William J. Zamora
Institute of Exact and Natural Sciences, Federal University of Pará, Belém, Pará, 66075-110, Brazil
Silvana Pinheiro

Corresponding author

Correspondence to William J. Zamora.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 11657 kb)

Supplementary file2 (XLSX 16 kb)

Supplementary file3 (XLSX 22 kb)

About this article

Cite this article

Lopez, K., Pinheiro, S. & Zamora, W.J. Multiple linear regression models for predicting the n‑octanol/water partition coefficients in the SAMPL7 blind challenge. J Comput Aided Mol Des 35, 923–931 (2021). https://doi.org/10.1007/s10822-021-00409-2

Received: 05 April 2021
Accepted: 05 July 2021
Published: 12 July 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s10822-021-00409-2
