Introduction

The relevance of lipophilicity in the pharmaceutical sciences has been known for over a century [1]. Lipophilicity is the affinity of a molecule for a lipophilic environment. The logarithm of the n-octanol/water partition/distribution coefficient of neutral and ionizable compounds −log PN and log DpH, respectively- are the gold standards of quantitative descriptors of lipophilicity [2]. Thus, log PN has been used for predicting the ability of bioorganic compounds to cross cell membranes [3]. Nowadays, it is still being used for assessing the impact on pharmacokinetic parameters and potency [4], metabolism and excretion[5, 6], and toxicity [7] of research compounds.

To predict the log PN there are a plethora of computational methods [2] and SAMPL challenges aim to evaluate them through blind predictions of physical properties [8]. In the framework of the SAMPL6 log PN challenge, several approaches were submitted: physical models, which made their predictions from molecular conformations using quantum mechanics (QM) and molecular mechanics (MM) methods, whereas empirical methods participated using two major categories, group contribution and quantitative structure–property relationships (QSPR) methods [9].

Multiple Linear Regression (MLR) analysis is a simple algorithm widely used in chemoinformatics. This method establishes a correlation between independent variables and the dependent variable [10]. Several MLR models have been built to predict the n-octanol/water partition coefficient of bioorganic compounds, which encompasses different approaches based on calculated molecular descriptors [11], QM electronic descriptors [12], molecular holograms containing atom type information [13], volume and surface area descriptors [14, 15], hydrophobic area and chain descriptors [15]. Accordingly, there are successful cases for the prediction of log PN of organic compounds employing MLR approaches, especially for molecules within a close chemical space such as substituted aromatic drugs [15], polychlorinated diphenyl ethers [16], blocked tripeptides [17], fragment-like small molecules in the SAMPL6 log PN challenge [12], and sulfonamides [18].

Here, we report the results obtained by 3 different multiple linear regression (MLR) models to reproduce the experimental values of log PN for 22 sulfonamides in the SAMPL7 log PN blind prediction challenge. The performance of MLR models is discussed together with an analysis of the compound with the largest deviation between experimental and calculated log PN value. The method MLR-3 identified as “TFE-MLR” was the approach submitted for ranking purposes.

Methods

Dataset

The SAMPL7 blind challenge consisted of predicting the partition coefficient between water and n-octanol (log PN) of 22 N-acylsulfonamides synthesized by Ballatore Lab [19] (see Fig. 1). The set consisted of amide, oxetane, thietane, thietane-1-oxide, thietane-1,1-oxide, isoxazole and triazole N-acylsulfonamides derivatives. Most compounds in the dataset were achiral and just SM35, SM36 and SM37 had a chiral center. SMILES strings of the neutral molecules were provided by the organizers on the SAMPL7 website [20].

Fig. 1
figure 1

Structures of 22 N-sulfonamides in the SAMPL7 log PN challenge dataset

Multiple linear regression models

Taking into consideration the chemistry of the SAMPL7 dataset (see Fig. 1), a total of 87 small molecules (see Table S1 and SI TFE-MLR_trainingset.xlsx) were chosen, based on the chemical space needed for the challenge, to build multiple linear regression models for predicting the experimental log PN of 22 N-sulfonamides proposed by the organizers of the challenge [20]. The chemical space of these molecules, including drug-like sulfonamides and smaller molecules, resembled the main functional groups present in the challenge dataset and the drug-like sulfonamides (see Fig. 2).

Fig. 2
figure 2

Representation of a some small molecules of the training set which resembles the main functional groups in b molecules of the SAMPL7 dataset

The SMILES codes and experimental log PN values were obtained from publicly available data in PubChem [21], DrugBank [22], and other specific sources [23,24,25]. These SMILES codes were transformed to sdf files using ChemmineOB package in R [26]. From the 87 molecules, five molecules were chosen randomly with the condition to be drug-like sulfonamides. This was done to mimic the nature of the blind challenge in terms of chemical space. In addition, we have sought to maintain a considerable number of observations to build up the model (~ 95%, 82 molecules) -taking into consideration the small size of our set.

For the training set, multiple linear regression models (MLR) were used to find the existing relationship between a selected number of descriptors (di) and the experimental n‑octanol/water partition coefficients (log PN,exptl.).

$${\text{log }{\text{P}}}_{\text{N,exptl.}}\text{=}\sum_{\text{i=1}}^{\text{n}}{{\text{c}}}_{\text{i}}{{\text{d}}}_{\text{i}}\text{ + }{\text{b}}$$
(1)

In Eq. 1, b stands for the intercept [27] and ci for the coefficients, which were estimated by regression analysis. The MLR models and the statistical analysis were done in R.

Training models used both functional group and molecular property-based descriptors (see Table 1). In the former case, a straightforward functional group count was used as a descriptor; whereas in the latter case, molecular properties related to lipophilicity [11] were generated to obtain a better description of this physicochemical property. All used descriptors were computed using the packages ChemmineR [28] and ChemmineOB [26], however, the number of occurrence of a functional group was computed employing a modified in-house function of the packages mentioned before. Intercorrelations between descriptors were analyzed (see Fig. S1) as well as individual correlations for each descriptor to the experimental log PN values for the training set (see Table 1).

Table 1 List of descriptors used in the present study and their coefficient of determination (R2) against experimental log PN values for the training set

For the purpose of this study, 3 different models were tested to select the approach that best reproduces experimental values of n-octanol/water partitions coefficients of neutral compounds. First, the approach labeled MLR-1, used the count of structural features represented by descriptors from 1 to 19 (see Table 1). Next, the second approach (MLR-2) added descriptors related to intramolecular interactions as hydrogen bond acceptors sites (HBA1), hydrogen bond acceptors atoms (HBA2), and hydrogen bond donor atoms (HBD), descriptors from 1 to 22 (see Table 1). Finally, the last model (MLR-3) appended two computed atomic contributions, the polar surface area (PSA) and molar refractivity (MR), descriptors from 1 to 24 (see Table 1). The performance of all approaches was compared through statistical analysis (see Table 2).

Table 2 Statistical parameters of MLR approaches for predicting experimental log PN values for the training set (n = 82).a

For the test set, 5 sulfonamide-bearing drugs (see Fig. 3) were randomly chosen from the original set (see Table S1). A statistical comparison between the experimental log PN values for the test set and the forecasted value by our MLR methods, as well as other common approaches [15, 29, 30] (see Table S2) was made to further check the suitability of the 3 MLR models mentioned above (see Table 3). Besides, k-fold cross-validation with k = 5 was performed to validate the 3 models mentioned above (see Table S4).

Fig. 3
figure 3

Structures and experimental log PN of 5 biologically active sulfonamide-bearing drugs chosen as prediction set

Table 3 Statistical parameters of the comparison between experimental and predicted log PN values for the test set using the 3 MLR approaches and other common approaches

Supported by statistical analysis and based on predictive power in both training and test set (see Fig. 4), the quantitative structure–property relationship (QSPR) approach summited to account for predicting the SAMPL7 experimental log PN values was the method labeled MLR-3. Details of the model including the list of descriptors, their coefficients, and model parameters are listed in Table S3.

Fig. 4
figure 4

Comparison between experimental and predicted n-octanol/water log PN using the MLR-3 model for the training (blue) and test (orange) set

An additional test set, called DB40 (see SI TFE-MLR_DB40.xlsx), was built by filtering the sulfonyl moiety in the DrugBank database (1102 molecules) and after removing molecules already used in the set of 87 small molecules mentioned above. Thus, a final set of 40 approved drugs and/or drug-like molecules was tested with the MLR-3 method (see Fig. S5) to verify the applicability of the method in other biologically active sulfonyl-bearing drugs. Finally, by combining the training, test, SAMPL7, and DB40 datasets (149 molecules) and paying special attention to the worst predictions, the general performance of the MLR-3 method is also presented.

Results and discussion

The method presented in this work for predicting the log PN of 22 N-sulfonamides in the SAMPL7 challenge dataset corresponds to the “TFE-MLR” submission. The quantitative structure–property relationship (QSPR) approach was based on the multiple linear regression model called MLR-3, as further described in the methods section.

The functional group descriptors with the main individual correlations (see Table 1) were the number of carbon atoms (R2 = 0.50), number of aromatic rings (R2 = 0.34), number of aliphatic rings (R2 = 0.30), count of tertiary amine, fluoroalkyl, and primary amine groups (R2 = 0.15, R2 = 0.13, and R2 = 0.11, respectively). The presence of representative functional groups in the training set has a direct impact on the prediction of n-octanol/water partition coefficient, as this strategy has been exploited in a wide variety of formalisms, from atomic to fragmental strategies [31,32,33]. On the other hand, molar refractivity (R2 = 0.41), hydrogen bond acceptor site, and hydrogen bond acceptors atoms (R2 = 0.11 and R2 = 0.10, respectively) were the molecular property-based descriptors that best correlated the training model. Both descriptors have been employed to compute n-octanol/water partition coefficients in previous works [27], where molar refractivity was used as a surrogate of molecular size, whilst hydrogen bond counts reflected intermolecular interactions. For the sake of clarity, despite hydrogen bond descriptors (HBA1 and HBA2) [28] are correlated (see Fig S1), they give differentiated information (for details see Fig S2), i.e., HBA2 takes into account electron pairs on nitrogen atoms able to delocalize, whereas HBA1 does not.

We decided to submit the approach called MLR-3 because it presented the most suitable statistical parameters (see Table 2) supported by cross-validation analysis (see Table S4). In addition, a preliminary prediction of 5 biologically active sulfonamide-bearing drugs chosen as prediction set was surprisingly accurate using this model (see Fig. 3 and Table 3). In fact, our model outperforms the results obtained with common algorithms for log PN, e.g., ChemAxon [29] and DataWarrior [30], and MLR models trained with specific compounds as substituted aromatic drugs, e.g., VLifeMDS [15].

Table 4 shows the predicted log PN values for the 22 N-sulfonamides in the SAMPL7 challenge dataset. The root-mean-square error (RMSE) between fitted values using the MLR-3 model and experimental data is 0.58 log P units. As noted in Figures S3 and S4, our model has the lowest RMSE in the empirical methods category, contemplating the outstanding performance of these methods, where six out of the eight empirical ranked methods have a RMSE < 1 in log P units. The second best RMSE of the ranked submissions was Chemprop [34], which consists of message passing neural networks (MPNN) created by an MIT research group. Chemprop’s submitters used this MPNN with a processed version of the OPERA [35] log P data set. This model has been used for different prediction purposes: properties, antibiotic probability, and SARS-Cov inhibition [34]. The third lowest RMSE was GROVER (graph representation from self-supervised message passing transformer) [36], which incorporates MPNN into the transformation to give more expressive encoders and flexibility. ffsampled_deeplearning_cl1 entry also used a MPNN. This algorithm was based on a previously reported NN [37]. ClassicalGSG used NN for the prediction employing as inputs: molecular features generated with a method called Geometric Scattering for Graphs (GSG) and classical molecular dynamics [38]. Finally, TFE_Attentive_FP used a graph neural network with a novel architecture called Attentive FP. This NN architecture includes an attention mechanism that focuses on the most important parts of the inputs to achieve better predictions [39].

Table 4 Calculated submission ID “TFE MLR”—and experimental n-octanol/water partition coefficient -log PN—determined for the 22 sulfonamides included in the SAMPL7 dataset

Among the 17 participants/organizations allowed ranked submissions, which include physical (QM and MM) and empirical categories, our approach MLR-3 (submission id: “TFE-MLR”) is ranked at the 1st position as determined by the root-mean-squared error and mean absolute error (see Fig. S3). Comparing to physical methods, two Quantum Mechanics (QM) ranked methods (COSMO-RS and IEFPCM MST) and none Molecular Mechanics (MM) achieved an RMSE around 1 log P units (in ranked submissions). The less time-consuming, cheaper computational cost, and good performance make the simple multiple linear regression models, as well as other empirical approaches (e.g., machine learning), attractive strategies to compute lipophilic descriptors as log PN. Despite most well-performing methods for computing log PN in the SAMPL7 blind challenge belonged to empirical methodologies [40], it must be kept in mind that it presents important disadvantages regarding strategies based on molecular mechanics and/or quantum chemistry. For instance, have a high dependence on the training set as this limits the coverage of molecules that can be predicted [41] (e.g., our approach was trained for predicting partition coefficients for drug-like sulfonamides compounds) and to the best of our knowledge, empirical methods are not able to assign a partition coefficient to a specific conformation of the molecule under analysis, these facts limit subsequent applications, e.g., the study of bioactive conformations, that MM and/or QM approaches can face.

For the sake of consistency with the results obtained for the training and test set, Table 5 reports statistical parameters of predicted log PN values for the 22 N-sulfonamides in the SAMPL7 challenge dataset using the 3 MLR approaches described in the methods section. As expected, the submitted model obtained the highest accuracy among the MLR approaches tested. In addition, MLR-3 had a better performance with the SAMPL7 set (RMSE = 0.58 log P units) than with our training set (RSME = 0.64 log P units, see Table 2).

Table 5 Statistical parameters of the comparison between experimental and predicted log PN values for the 22 N-sulfonamides in the SAMPL7 challenge dataset using the 3 MLR approaches

Analyzing the difference between the predicted and experimental values, the notable outlier is the compound SM36 (see Table 4), which shows an error in the predicted log PN that roughly diverges 3 times the model uncertainty (RMSE = 0.64). In fact, SM36 is the only compound in our method with an absolute error larger than 1 log units. For the sake of comparison, it is worth noting that the three most accurate empirical methods (MLR-3, Chemprop, and GROVER) evidence the same tendency, the overestimation of the log PN of compound SM36 which amounts, on average, to 1.61 log units. Exclusion of SM36 improves significantly the ability of prediction of our approach, reducing the RMSE by 26% and increasing the R2 by 51% (see Fig. 5).

Fig. 5
figure 5

Comparison between experimental and the multiple linear regression method for determining the n-octanol/water log PN for the SAMPL7 dataset. Red point illustrates the outlier founded in our method. Top left, statistical analyses are shown for all compounds and bottom right, after exclusion of SM36

The experimental log PN reported for SM36 is low considering the chemical structure of this compound. For instance, SM35 has a phenyl group in the sulfonamide moiety instead of the methyl group in SM36 (see Fig. 1). Thus, it is expected a higher log PN value for SM36 because benzene rings are significant lipophilic fragments [42,43,44], however, it was not the experimental observation for the pair SM35-SM36. Comparison of analogous situations in pairs of molecules: SM29-SM30, SM32-SM33, SM41-SM42, and SM44-SM45 reveals the conventional increase in the experimental log PN resulting from the substitution of methyl for phenyl groups. Figure 6 depicts the experimental tendency observed in the log PN for phenyl/methyl analogs, which was the predicted situation employing our method except for the pair SM35–SM36.

Fig. 6
figure 6

Difference between experimental log PN of SAMPL7 phenyl/methyl N-sulfonamides analogs. Δ log P corresponds to the difference between log PPhenyl analogous–log PMethyl analogous

Because the model was intended as a local model to accurately determine the n-octanol/water log PN for the SAMPL7 dataset, it provides an approach to test the reliability of other compounds that comply with the domain of application of our model, this means biologically active sulfonyl-bearing drugs. For this reason, we decided to test the DB40 set (see methods section for details) whose prediction power was less than that of SAMPL7 (RSME = 1.13, see Fig. S5), however, it can still represent an acceptable estimate considering that the variability of experimental values can often amount to 0.6 units of log P [45].

Finally, we have used 149 compounds belonging to the training, test, SAMPL7, and DB40 datasets to further check the reliability of the MLR-3 model, where we have detected only six outliers exceed for 3 times the model uncertainty (see Fig. 7, top). Here, rosuvastatin (DB40 set) represents the largest absolute deviation (3.08), followed by other two compounds of the DB40 set, vardenafil (2.12) and tirofiban (2.10), next a compound of the training set, brinzolamide (1.97), then another compound of DB40 set, meloxicam (1.89), and finally, a compound of the SAMP7 set, SM36 (1.88). In the case of rosuvastatin and brinzolamide, the predicted log PN value are 3.21 and 0.17, respectively (see SI TFE-MLR_DB40.xlsx and TFE-MLR_trainingset.xlsx), whereas DrugBank Database [22] reports experimental log PN values of 0.13 and − 1.80, respectively, but without available reference. Nevertheless, conducting a more exhaustive search in the literature it is reported experimental log PN values of 2.52 for rosuvastatin [46] and 0.82 for brinzolamide [47], which are in better agreement with the predicted value. Indeed, we implement those verified experimental values in the DB40 set, and taking into account that it is imperative of being able to verify the sources of the experimental values, we decided to omit the values for vardenafil and tirofiban. Thus, a new set of 147 compounds was tested with our model (see Fig. 7, bottom) which reduces the RSME between predicted and experimental data to ∼ − 0.10 (log P units). The remaining outliers are SM36, whose peculiarities were explained above, and meloxicam which is a compound that our method was unable to properly determine its log PN value for own limitations of our local model, presumably due to lack of a correct description of crucial functional groups as enolic groups which can present several tautomeric forms and also favor conformations with specific intramolecular hydrogen bonds [48].

Fig. 7
figure 7

Comparison between experimental and predicted n-octanol/water log PN using the MLR-3 model for 149 (top) and 147 (bottom) molecules from the training (blue), test (orange), SAMPL7 (light blue), and DB40 (unfilled dots) sets (top). In the second graph (bottom), two values from the DrugBank dataset without providing the source were omitted and two experimental values were modified by those from confirmed experimental sources. Red points represent the outliers founded in both sets using our MLR-3 method (meloxicam and SM36 present the same deviation)

Overall, the results support the appropriateness of our multiple linear regression model for computing lipophilic descriptors as the n-octanol/water partition coefficient in drug-like sulfonamides compounds. Furthermore, the outstanding performance of empirical methodologies, where 75% of the ranked submissions achieved root-mean-square errors < 1 log P units, reinforce the suitability of these strategies for obtaining fast and accurate predictions of physicochemical properties of bioorganic compounds.

Conclusions

Fast and accurate predicting of the n-octanol/water partition coefficient in compounds of pharmacological relevance is of utmost importance for evaluating their molecular quality. Within the framework of the blind partition coefficient challenge SAMPL7, we have explored the performance of a multiple linear regression model called MLR-3 for predicting the n-octanol/water partition coefficient of 22 sulfonamides. Taking into consideration the small number of molecules in our training set and the simplicity of the descriptors used, the results obtained have been encouraging and support the efficiency of the straightforward strategy presented here for computing n-octanol/water log PN. Even though the selection of training molecules was appropriate for the aim of this study, we are aware of the limitations of our model in terms of the application domain. In this context, future studies will be focused on the use of a more extensive and diverse set of experimental data to apply the approach developed here to other kinds of bioorganic compounds for the sake of having a generalized model.