Introduction

Computational docking is extensively applied to predict protein–ligand binding, both for the prediction and understanding of inter-molecular interactions and for virtual screening in structure-based drug design [1,2,3]. Docking calculations require prior knowledge of the three-dimensional structure of the receptor protein. It is not uncommon that a suitable experimental structure of the target receptor is unavailable and one has to manage with a computationally modeled one. Most commonly, the modeled structures are obtained through comparative modeling, i.e., by using existing experimental structures of homologous proteins as templates for model construction [4]. Besides comparative modeling, de novo protein structure prediction is an increasingly promising approach to obtaining modeled receptor structures, given the recent significant progresses making use of deep learning [5]. Despite method advancements, the quality of modeled receptor structures cannot yet match that of high-resolution experimental structures. This is especially true when a homology model has been constructed based on a template protein sharing relatively low sequence identity with the target. In such a situation, the sequence alignment between the target and the template protein contains frequent insertions and deletions, causing the accuracy of the resulting structure model to deteriorate severely [4]. Given the large number of proteins that can potentially serve as drug targets but do not have crystal structures, it is of wide interest to consider computational docking of ligands with modeled receptor structures that are of relatively low accuracy.

Most current docking methods have not been tailored to consider homology models as input receptor structures. Instead, they have usually been optimized by considering benchmarks such as bound docking, in which experimental structures of receptors in complexes with target ligands have been used as input for docking [6,7,8,9,10,11]. As a result, the performance of docking strongly depends on the input receptor structure. It has been well-known that with usual docking approaches, results obtained by using apo- or modeled receptor structures are significantly worse than results obtained by using high resolution ligand-bound (holo-) experimental receptor structures [12,13,14]. The high sensitivity of computational docking to input receptor structures is also evidenced by the fact that when docking is applied to screen the same diverse ligand library for the same receptor, different receptor structures yielded different results [15]. This can be explained by the fact that the receptor adopts various conformations in order to establish favorable interactions with different ligands [16], and only one input conformational state (sometimes with local variations such as sidechain rotations) can be considered in one docking run. Because of this strong dependency on receptor structure, computational docking ideally should use input receptor structures coming from high quality experimental studies, preferably determined in complex with a ligand that is similar to the target small molecules that are considered for docking [17]. On the other hand, a docking method that perform well on such inputs does not necessarily lead to the same performance in applications that have to rely on less accurate receptor structures, such as homology models.

One possible way to reduce the above sensitivity to the input structure is to consider the structural flexibility of the receptor. Developments in docking programs have been made in past to allow a certain degree of receptor flexibility by including sidechain variations in e.g. AutoDock [18], GOLD [19], FlexX [11], and RosettaLigand [20]. Although these approaches can help in achieving better results in cases such as docking into the apo-structure of a receptor, they may still not be sufficient for treating modeled receptor structures, which usually contain varied extents of peptide backbone deviations from the actual receptor structure. Ensemble docking is an approach in which backbone variations are included by considering as docking targets multiple receptor conformations generated using MD simulations [21]. However, because of the common insufficiency of using only a small number of conformations to represent a vast number of possible conformations [22], there is always high risks of missing from the ensemble conformations accommodated to the binding of particular ligands. Besides these limitations, the explicit consideration of receptor flexibility in docking significantly increases the computational costs, which may make the virtual screening of large ligand libraries computationally too expensive. In addition to receptor flexibility, consideration of ligand flexibility is another approach of unbound docking. GalaxyDock3, for example, is a protein–ligand docking approach which considers full ligand conformational flexibility but the protein conformation is fixed at crystal structure [23].

In the current study, we will explore another approach that can potentially reduce the sensitivity to input receptor structure of computational docking. This approach uses fragment docking. We note that fragment docking has emerged over the past decades as a mainstream paradigm in ligand-protein docking and ligand discovery [24,25,26]. Due to their small sizes, rigid chemical fragments can be used to probe the configurational space inside a binding pocket more efficiently, hence allowing different regions of the pocket to be systematically explored [27, 28]. From the perspective of cheminformatics, the chemical complexity of a fragment is often lower than that of larger compounds, which simplifies the interpretation of structure-activity relationship (SAR) data and may consequently accelerates experimental lead optimization [24, 29]. A number of computational techniques exploiting fragment docking are already available. Examples include software like LUDI [30], GLIDE [10], LigBuilder [31], SEED [32], and web servers like ACFIS [33]. For unbound docking studies, a fragment mapping program, Fsubsite, has been validated on apo-structures of CDK2 [34].

The reasons that fragment docking may better tolerate structure inaccuracies of homology receptor models than complete ligand docking include the following. Consider the docking of a large ligand with a modeled structure of its cognate receptor. For the docking to be successful, every part of the ligand in its correct binding pose need to interact favorably with the modeled receptor structure, at least without any severe steric clashes. If we refer the immediate receptor environment of a part or a fragment of the ligand as a subpocket, any inaccuracy of any subpocket structure that leads to bad ligand-protein interactions (e.g. severe steric clashes) will lead to docking failure. In addition, the covalent connections between different fragments of the ligand translate into stringent constraints on the relative geometries between the subpockets. Structure inaccuracies that lead to errors in these relative geometries also prohibit successful docking. If not the complete ligand but only small fragments comprising the ligand are separately docked, the modeling errors at one subpocket will not affect the proper docking of fragments at other subpockets. In addition, the constraints on the relative geometries between different subpockets accommodating different fragments may be relaxed, as the docked fragments are not covalently connected. We expect that these properties of fragment docking may be exploited to increase the tolerance of structure inaccuracies when docking a multi-fragments ligand into a modeled receptor structure.

In our proposed approach, after obtaining possible docked poses of fragments, a graph theory algorithm is applied to align a complete ligand with the binding poses of its composing fragments. Graph theory-based methods have been adopted in various biological and bioinformatics studies previously, including metabolic pathway analysis [35, 36], protein flexibility predictions [37], protein side chain predictions [38], secondary structure motif comparisons [39], calcium binding site predictions [40] and more. The application of graph theory in the current work has been inspired by earlier studies using graph theory to find maximum common substructures between two structures [41,42,43,44].

Our proposed approach is benchmarked using protein–ligand complexes in which the proteins belong to different families, the cytochrome P450 (CYP450) enzymes, G-protein coupled receptors (GPCR), chaperone, bromodomains and severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2) protease. We have chosen protein families of different sequence length and binding pockets to better understand how our strategy works. The CYP450 enzymes are found in genomes of virtually all organisms. In humans, they are involved in detoxification of various drugs [45]. In plants, they are involved in biosynthesis of various secondary metabolites and thus have numerous applications in industrial biotechnology [46]. The prediction of ligand binding poses using modeled CYP450 proteins is of practical relevance for the prediction of substrate selectivity and reaction site selectivity for CYP450 proteins with unknown structures. The GPCR proteins are important regulatory elements in a wide spectrum of normal and pathological procedures [47]. They are one of the most important families of therapeutic targets for small molecule drugs [48, 49]. Chaperones play a vital role in the stabilization of an unfolded protein [50]. Bromodomain (BRD) is an evolutionary conserved domain with an approximate length of 110 amino acids. Proteins containing BRD regulate the process of gene expression [51]. SARS-CoV-2 is a novel coronavirus that has caused the coronavirus disease 2019 (COVID-19) worldwide [52, 53].

Materials and methods

Two types of docking were carried out i.e. bound and unbound. In bound docking crystal structures from PDB with bound ligands were used. Whereas in unbound docking, homology models were built and used to dock native ligands. Scripts and source codes to carry out the calculations can be downloaded from http://biocomp.ustc.edu.cn/servers/download_others.php#scripts-for-small-mol-frag-docking. Pymol or discovery studio visualizer have been used to carry out visualization [54, 55]. Origin software was used to plot graphs [56].

Structures of the receptor proteins

In bound docking, the receptor structures have been extracted from the experimentally determined PDB structures of the respective complexes (Table 2). In unbound docking, the receptor structures have been either from a structure complexed with a different ligand or, more interestingly, constructed using the homology modeling program Modeller [57]. Homology models of eleven CYP450 proteins, CYP2B4 (PDB id 1suo, 3me6), CYP2A6 (2fdu), CYP3A4 (PDB id 3ua1, 4d7d), CYP1A1 (PDB id 4i8v), CYP105AS1 (PDB id 4oqr), CYP119 (PDB id 4wqj), CYPBM3 (PDB id 4zf8), CYP126A1 (PDB id 5li8), CYP4B1 (PDB id 6c94), four GPCR proteins, 5-HT1B (PDB id 4iar), 5-HT2B (PDB id 4nc3), GRK5 (PDB id 4wnk), GRK2 (PDB id 5he0), two chaperone proteins, Hsc70/Bag1 (PDB id 3fzm), Hsp 90-alpha (PDB id 6f1n), two bromodomains, namely, human BRD2 (PDB id 4a9m), human ATAD2 (PDB id 5lj0) and one SARS-CoV-2 main protease (PDB id 6wtt) have been constructed. Structure templates for comparative modeling have been identified by sequence BLAST against the PDB database. Details of targets and templates are presented in Table 1. Templates have been chosen on the basis of sequence identity. The protein sequence of a target receptor was downloaded from UniProt database (https://www.uniprot.org/) and protein blast available at https://blast.ncbi.nlm.nih.gov was run for the given sequence against proteins in PDB excluding the target protein. The purpose for building homology models was to assess our methodology in case of poor models or where it was hard to find a good template. Therefore, templates with sequence identity above 85% were ignored. For 14 out of 20 systems, templates with sequence identity of less than 60% were chosen (Table 1). We also included templates with sequence identity as low as 19% and 25%. Besides sequence identity, query coverage was also taken into account and we made sure that it was more than 50%. For each receptor, twenty initial models were constructed based on a selected template. The initial models were optimized and refined using molecular dynamics with simulated annealing. Then one model with the minimum DOPE (Discrete Optimized Protein Energy) score [58] was selected and used for unbound docking. Further details about model quality are summarized in Supplementary Table S1. Analysis was carried out using the PROCHECK program available at https://www.ebi.ac.uk/thornton-srv/software/PROCHECK/.

Table 1 Target and template protein structures. For the target receptor, the original ligands are given. For each template, overall sequence identity and binding pocket identity with respective target are given
Table 2 PDB ids of complexes used for the bound docking tests. The ligands are listed along with corresponding fragments

Fragment docking

Figure 1 illustrates the overall workflow of the fragment-based docking approach.

We employed the BRICS (breaking of retrosynthetically interesting chemical substructures) algorithm [59] as included in the RDKit package [60] to break complete ligands into small, rigid fragments. BRICS shreds the molecules retro-synthetically, avoiding the generation of overlapping fragments. The program also avoids the generation of small terminal fragments such as single hydrogen atoms, hydroxyl, nitro and other small groups. Ligands that were fragmented are presented in Table 2 (bound docking) and Table 1 (unbound docking).

Fragment and receptor data were preprocessed using respective AutoDock tools [61]. More specifically, complete ligands were extracted from the respective PDB files. Since AutoDock requires pdbqt format, all the ligand and receptor structures were prepared using graphical user interface program of AutoDock tool [61]. Partial charges were computed using the gasteiger method and resulting molecules were saved as pdbqt format to be used for docking. Polar hydrogens were added in molecules before charges were computed. In case of fragments, SMARTS for each fragment was saved using MarvinSketch [62] and resulting structures were stored in PDB format. These rigid fragments were then converted to pdbqt format using AutoDock GUI as explained. In case of rigid docking using complete ligands, the receptor molecule was kept rigid whereas ligand was flexible with rotations in rotatable bonds switched on. Cyclic portions of ligands were treated as rigid. The heme cofactor in CYP450 binding pocket was considered as a rigid part of the receptor. For flexible docking, torsions were chosen in the receptor pocket residues which were saved as flexible pdbqt and the rest of the protein was saved as rigid pdbqt. In case of bound docking, the center of the grid for docking was defined based on the information of bound ligand. For unbound docking, pocket residues of the modeled structure were identified and then center coordinates were computed. The same center was used for all fragments of one ligand. The grid size was set to 15◊15◊15 in xyz directions for fragment docking so that fragments can span the entire pocket. All this information was written to a configuration file to be given as an input to AutoDock Vina [18] for carrying out docking. For each configuration 25 docking runs were carried out to obtain up to 500 binding poses.

Poses generated in the previous steps were clustered based on their atomic Cartesian coordinates using k-means clustering [63], with the number of clusters determined using the so-called “elbow” method [64]. After clustering, the binding pose of the minimum binding score was picked from each cluster to be used further.

Fig. 1
figure 1

Workflow of the docking method

Obtaining docked poses of complete ligands

A complete ligand is aligned to combined sets of the docked fragment poses using the following protocol. It makes use of the graph-theory algorithm for determining maximum cliques [65, 66] to identity sets of matches between fragments contained in the ligand and the docked fragment poses (Fig. 2).

To describe the method in graph theory terms, we define two sets, Q and T, respectively,

$$\text{Q}\text{u}\text{e}\text{r}\text{y} \text{s}\text{e}\text{t} \text{Q}=\{{\text{q}}_{1}, {\text{q}}_{2}, \dots , {\text{q}}_{\text{N}}\} $$
(1)
$$\text{T}\text{a}\text{r}\text{g}\text{e}\text{t} \text{s}\text{e}\text{t} \text{T}=\left\{{\text{t}}_{1}, {\text{t}}_{2}, \dots , {\text{t}}_{\text{M}}\right\}$$
(2)

Each element \({\text{q}}_{\text{i}}\) of Q corresponds to a docked fragment pose, associated with the corresponding chemical type of the fragment and the atomic positions. For convenience, we refer the chemical type of an element qi as \({\Phi }\left({\text{q}}_{\text{i}}\right)\), and the set of atomic positions as \(\text{R}\left({\text{q}}_{\text{i}}\right)\). For each type of fragment, chemically matching substructures in a complete ligand can be identified. We used the RDKit cheminformatics tools to perform the substructure matching [60], by representing the chemical structures of the fragments as SMARTS strings [62] which are matched against a complete ligand. Each element \({\text{t}}_{\text{k}}\) of the set T in formula (2) corresponds to a part of the ligand that chemically matches a fragment, its chemical type noted as \({\Phi }\left({\text{t}}_{\text{k}}\right)\) and the set of atomic positions as \(\text{R}\left({\text{t}}_{\text{k}}\right)\).

An auxiliary graph Gc representing the compatibility between the elements in the sets Q and T is defined as below. Each vertex of Gc’s corresponds to a combination of two elements associated with the same chemical type, one from Q and the other from T, namely,

$$\text{V}\text{e}\text{r}\text{t}\text{e}\text{x} \left({\text{v}}_{\text{i}\text{k}}\right)=\left({\text{q}}_{\text{i}},{\text{t}}_{\text{k}}\right), {\text{q}}_{\text{i}}\in \text{Q}, {\text{t}}_{\text{k}}\in \text{T}, \text{a}\text{n}\text{d} \,{\Phi }\left({\text{q}}_{\text{i}}\right)={\Phi }\left({\text{t}}_{\text{k}}\right) $$
(3)

An edge between two vertices \({\text{v}}_{\text{i}\text{k}}\) and \({\text{v}}_{\text{j}\text{l}}\) is defined in Gc if and only if the following conditions are met: \(\text{i}\ne \text{j}, \text{k}\ne 1\), and the two sets of atomic positions \(\text{R}\left({\text{q}}_{\text{i}}\right)\bigcup R\left({q}_{j}\right)\) and \(\text{R}\left({\text{t}}_{\text{k}}\right)\bigcup R\left({t}_{l }\right)\) can be well-aligned against each other. A clique in the graph Gc is a fully connected (or complete) subgraph in which there is an edge between every vertex in the subgraph to every other vertex in the same subgraph. A maximal clique is a clique which cannot be extended to include new vertices without violating the full connection requirement. The maximal cliques of Gc can be found with the Bron and Kerbosch algorithm [67]. Each maximal clique corresponds to an alignment between at least two fragments contained in the ligand with the same number of docked fragment poses. The definition of edges in the compatibility graph guarantees that in the alignment, a ligand fragment is only matched to a docked fragment of the same chemical type. In addition, all the inter-fragment relative geometries in the complete ligand are consistent with the relative geometries between the respectively matched docked poses.

Fig. 2
figure 2

Determining binding poses of complete ligand through alignments with docked fragment poses using the graph theory maximum clique algorithm. a Fragments are given as SMARTS input to find their matching parts in complete ligands. b Five fragments of 3 different chemical types are shown. The different 3D positions (docked poses) of the same fragment types are labeled as i and j and are colored yellow and blue, respectively. The parts matching the fragment types in a complete ligand are labeled respectively as t1, t2, t3. c Five vertices of the compatibility graph are shown, each vertex is a pairing of a docked fragment pose and a part of the complete ligand that chemically matches the fragment type. d Connectivity between the vertices indicate both chemistry and geometry compatibility of the pairings. Solid lines represent the edges between the vertices, dotted line means there is no edge. e Four maximal cliques of size 3 can be found in the compatibility graph, each corresponding to an alignment of the complete ligand to a set of docked fragment poses. f Transforming the complete ligand according to each alignment produces a binding pose of the ligand inside the receptor pocket

In the next step, for the aligned set of fragments defined by each clique, a rigid body geometry transformation is calculated to simultaneously superimpose the different aligned fragments contained in the ligand onto the correspondingly matched docked fragment poses. The resulting transformation is applied to the entire ligand, generating a possible binding pose of the complete ligand inside the binding pocket. To remove potential local steric clashes in the resulting ligand-receptor complex, the complex is subsequently energy minimized by 300 steepest descent steps using UCSF Chimera [68]. All hydrogen atoms were added prior to this minimization step. The minimized structures were scored separately by two scoring functions, namely, Cyscore [69] and DSX [70].

Results and discussion

Binding poses of fragments generated by fragment docking

The PDB ids and ligand structures of the cytochrome P450 protein–ligand complexes used as test dataset for bound docking are given in Table 2. The ligands in these complexes are of different sizes ranging from 12 to 36 heavy atoms. Fragments generated by applying the BRICS method to each ligand are given in the same table.

Docking the ligand fragments into the binding pocket of the respective receptor structure led to a large number of binding poses for each fragment type. Inspecting the poses suggested that by considering 25 docking runs for each fragment, the positioning of the fragments in the actual ligand-receptor complexes can usually be closely reproduced by one or more docked poses. On the other hand, a large number of docked poses are redundant, i.e., of similar structures. Therefore, following fragment docking, the docked poses for a given fragment were clustered and one representative pose from each cluster was chosen to be used for the subsequent steps. RMSDs of the representative poses from the respective ligand parts in the actual complex structures were computed. For every fragment, the set of representative poses included at least one experimental binding position-like pose. This indicates that because of the small sizes and conformational rigidity of the fragments, it is relatively easy for standard docking protocols to thoroughly explore and identify their favorable binding positions inside a receptor pocket. In Fig. 3, the experimental-like representative poses of fragments are shown together with the containing ligands in experimental binding poses inside the corresponding binding pockets.

We observed that the experimental-like docked poses of fragments do not always correspond to the docked poses associated with the lowest energy scores. Thus if we only retain a small number of top-ranking poses from the docking results without the clustering step to remove redundant poses, we could have missed the near native binding poses from the retained poses. Thus the clustering step minimized the chance of losing native like poses while retaining only a small number of docked poses for subsequent processing. We note that as the fragments are small and can form only reduced number of specific interactions within the binding pockets, a fragment acting as a probe can sample an entire binding pocket more efficiently compared with a complete ligand. Despite this, for a large ligand comprising many different fragments, there can still be chances that for some fragments, the respective near native poses are missed because of the lack of specific interactions for strong binding (for example, in Fig. 3, result 9f, docking of the phenyl fragment generated poses distant from the native binding position of the fragment in the complete ligand). For this reason, in the subsequent graph-theory-based fragment-ligand alignment step, an accepted alignment is required to cover not all but as many as possible the fragments that comprise the ligand. In this way, near native poses of the complete ligand can still be produced making use of the subset of fragments whose near native poses have been generated and retained.

Fig. 3
figure 3figure 3

Experimental position-like representative fragment poses (yellow) docked inside binding pockets. (1) 1suo, (2) 2fdu, (3) 3me6, (4) 4d7d, (5) 4i8v, (6) 4oqr, (7) 4wqj, (8) 4zf8, (9) 5li8 (10) 6c94. Experimental ligand positions (green) are also shown. Binding pocket residues are labelled and interactions are shown as dotted lines. Red lines represent hydrogen bonds and blue represent hydrophobic interactions

Binding poses of complete ligands derived from docked fragment poses

Bound docking

For each of the 10 ligands listed in Table 2, the number of poses generated by applying the graph theory algorithm to align the complete ligands with docked fragment poses are given in Table 3, together with the minimum RMSDs from the respective actual bound conformations in crystal structures. For some complexes, the total number of poses generated from cliques is quite high due to the large numbers of representative fragment poses chosen in the previous step. The RMSD for every pose from the experimental ligand position was calculated, and low RMSD poses have been found for every complex. Figure 4 shows the superimposed structures of the minimum RMSD docked ligands superimposed with the actual bound ligands inside the binding pockets. These results confirm that the fragment-based docking approach thoroughly searches for possible binding positions of complete ligands inside the binding pockets.

Table 3 Minimum RMSD values of the bound docking results
Fig. 4
figure 4

Minimum RMSD poses from bound docking. Green sticks represent experimental ligands and pink sticks represent docked poses

Unbound docking

For unbound docking, the fragment-based docking approach have been applied to dock the ligands contained in 20 PDB complexes (eleven CYP450-substrate, four GPCR, two chaperons, two bromodomains and one SARS-Cov-2 complex, see Table 1) with receptor structures generated by comparative modeling. The PDB IDs of the template structures as well their sequence identities with respect to the actual receptor proteins are also given in Table 1. Possible binding poses of complete ligands have been generated with the fragment-based approach described above. For simplicity, we have only aligned and transformed each ligand in its actual bound conformation, as present in the respective PDB structures. In real unbound docking, it is usually not difficult to first systematically explore the conformational space of a small molecule ligand to identify its low energy conformational states, and then separately align each conformation with the docked fragment poses by the graph theory algorithm.

The fragment-based docking results are compared with those obtained by direct docking of complete ligands with AutoDock Vina. The latter calculations included rigid receptor docking, in which the receptor structures were not allowed to change, as well as flexible docking, in which the side chains of pocket residues were allowed to rotate. Table 4 shows the minimum RMSDs of docked poses from the experimental bound structures for different docking approaches. The results suggest that for several complexes, while direct docking of the complete ligands was not able to yield any experimental result-like binding poses, the fragment-based approach was able to produce poses that were of relatively small RMSDs from experimental results. Figures 5 and 6 show the docked minimum RMSD binding poses in comparison with respective actual ligand positions in respective experimental complexes. The results obtained by the fragment-based approach and the direct docking approach are compared side by side. Results clearly indicate that the proposed fragment-based strategy outperforms complete ligand docking with AutoDock Vina, with all complexes having RMSD within 3 Å where 12 complexes present RMSD of < 2 Å and 6 complexes having the RMSD of less than 1 Å. In case of direct docking of complete ligands with AutoDock Vina, 12 out of 20 complexes yielded the RMSD of less than 3 Å, 6 complexes within RMSD of 2 Å and only 1 complex with RMSD of less than 1 Å (Table 4). We have found that complexes for which Autodock gave good results either have small protein sequence length e.g. bromodomains or they are bound to small ligands e.g. in case of 1suo_A and 4zf8_A (Table 1). Inaccuracies in the modeled structures prevent appropriate binding of large sized ligands therefore usual docking methods often fail to generate native ligand conformations in such cases. Our methodology, on the other hand, performed quite well under circumstances where the protein size is big and they are bound to large ligands e.g. CYP450, GPCR and SARS-CoV-2.

Table 4 Minimum RMSD values generated by our methodology and AutoDock (rigid, flexible) with modeled structures
Fig. 5
figure 5

Docked ligand positions obtained with modeled CYP450 receptors compared with actual ligand positions. a 1suo, b 2fdu, c 3me6, d 4d7d, e 4i8v, f 4oqr, g 4wqj, h 4zf8, i 5li8, j 6c94. Ligand binding modes generated by our fragment-based docking approach are colored light pink. Results of direct docking of complete ligands by AutoDock Vina are colored cyan. Actual ligand positions are shown as green sticks

Fig. 6
figure 6

Docking of modeled proteins. a 3ua1, b 4iar, c 4nc3, d 4wnk, e 5he0 f 3fzm, g 6f1n, h 4a9m, i 5lj0, j 6wtt. Ligand binding modes generated by our fragment-based docking approach are colored light pink. Results of direct docking of complete ligands by AutoDock Vina are colored cyan. Actual ligand positions are shown as green sticks

Different modeled receptor structures (4zf8_model, 3fzm_model, 1suo_model) were considered for which results were compared between two different homology models generated via the same respective templates (for the quality of these models see supplementary Table S1). Only slight variations in RMSD (< 0.5 Å) were observed (Table 5).

Table 5 Results indicating RMSD values using different models for three receptors

In addition to docking with modeled structures, cross docking analysis was also performed. For this purpose, three protein–ligand complexes were considered. CYP2B4 is present in various conformations in protein data bank bound with different ligands. We took 1suo bound with cpz ligand, 3me6 bound with cge ligand and 3r1a co-crystallized with tb2 ligand. Every ligand was docked with every receptor and RMSD from the native ligand conformation was calculated (Table 6). All the ligands were well aligned with the native conformations with RMSD less than 2 Å except for one ligand tb2 which yielded RMSD of 2.4 Å when cross docked with 1suo receptor. These results clearly indicate that the proposed methodology works efficiently for unbound docking.

Table 6 Cross docking RMSD results for three complexes

Scoring the docked poses of complete ligands

Before subjected to scoring, each ligand-receptor complex was first energy-minimized using Chimera [68]. As the transformed ligands may be involved in unacceptable steric clashes with the protein environment, this local optimization step is necessary. Two scoring functions have been considered: Cyscore [69] and DSX [70].

Figure 7 shows the RMSDs from respective experimental ligand positions of the top 10% scored conformations for bound docking. Evaluated with the Cyscore scoring function, for 9 of the 10 complexes, the 10% top scoring binding poses included one or more experimental result-like ones. Evaluated with the DSX scoring function, experimental result-like binding poses were included in the 10% top scoring binding poses for all complexes (Fig. 7b). Because of inaccuracies in both the structural models and the scoring functions, some false binding poses were unavoidably ranked top along with true binding poses. Visual inspection of several such false poses inside binding pockets reveal multiple favorable interactions with the pocket residues (Fig. 8). In some cases, high RMSD can be attributed to flipped conformations. The molecule is completely flipped along its axis such that the reaction center is pointed in opposite direction (Fig. 8). In native pose 1a, imidazole is pointing towards heme group but in pose 1b the molecule has flipped by 180 degrees orienting chlorophenyl towards heme group. In pose 3b, the ligand molecule has flipped along two axis, moving benzene ring away from the heme group, and flipping the carbonyl at the same time. For pose 4b, we can see the flip along both axis. The good ligand-receptor shape-complementarity as well as the favorable local interactions of the false poses cause difficulty the scoring functions to tell them from the true poses. In some cases, there could be the possibility that not all possible binding poses have been captured in the corresponding crystallographic structures, as for some CYP450 enzymes, a ligand may bind in different conformations and leading to more than one reaction sites on the ligand.

Figure 7 shows the RMSDs from respective experimental ligand positions of the top 10% scored conformations for bound docking. Evaluated with the Cyscore scoring function, for 9 of the 10 complexes, the 10% top scoring binding poses included one or more experimental result-like ones. Evaluated with the DSX scoring function, experimental result-like binding poses were included in the 10% top scoring binding poses for all complexes (Fig. 7b). Because of inaccuracies in both the structural models and the scoring functions, some false binding poses were unavoidably ranked top along with true binding poses. Visual inspection of several such false poses inside binding pockets reveal multiple favorable interactions with the pocket residues (Fig. 8). In some cases, high RMSD can be attributed to flipped conformations. The molecule is completely flipped along its axis such that the reaction center is pointed in opposite direction (Fig. 8). In native pose 1a, imidazole is pointing towards heme group but in pose 1b the molecule has flipped by 180 degrees orienting chlorophenyl towards heme group. In pose 3b, the ligand molecule has flipped along two axis, moving benzene ring away from the heme group, and flipping the carbonyl at the same time. For pose 4b, we can see the flip along both axis. The good ligand-receptor shape-complementarity as well as the favorable local interactions of the false poses cause difficulty the scoring functions to tell them from the true poses. In some cases, there could be the possibility that not all possible binding poses have been captured in the corresponding crystallographic structures, as for some CYP450 enzymes, a ligand may bind in different conformations and leading to more than one reaction sites on the ligand

Fig. 7
figure 7

RMSD distributions of the bound docking results in top 10% hits according to a Cyscore b DSX ranking

Fig. 8
figure 8

Comparison of experimental-like poses with non-experimental-like poses that have been ranked top by the scoring function. The experimental-like poses are colored pink (a) and non-experimental-like poses are colored cyan (b). 3D representation of 8 docked complexes from bound and unbound docking are presented: (1) 1suo-cpz, (2) 3me6-cge, (3) 4i8v-bhf, (4) 6c94-v16, (5) 3me6-model, (6) 3ua1-model, (7) 4zf8-model, (8) 5li8-model. The heme group is shown in gray sticks. Favorable contacts with binding pocket residues are indicated. Green lines are hydrophobic interactions of alkyl or pi-alkyl groups. Blue lines are hydrogen bonds. Pink lines are pi–pi interactions

Figure 9 displays the RMSDs from respective experimental ligand positions of the top 20% scored conformations for unbound docking. In total 20 complexes have been considered for unbound docking. The Cyscore evaluation reveals that for 12 of 20 complexes, the top 20% scoring binding poses included one or more experimental result-like poses (within 3 Å). The DSX score evaluation yielded one or more experimental result-like poses for 14 of 20 complexes in top 20% results. In general, the accuracy of scoring functions was lower for ranking the unbound docking results than the bound docking results. This can be attributed to the fact that the unbound docked complexes originated from approximately modeled structures.

Fig. 9
figure 9

RMSD distributions of unbound docking results for top 20% hits according to a Cyscore b DSX ranking

Considering multiple ligand conformations

To test the fragment-based docking approach in a more realistic scenario, we further considered different ligand conformations. Two complexes (PDB IDs 5li8 and 4wnk) have been considered as examples. The respective ligands are named as KKK and 453, as in the corresponding PDB records. For each ligand, we generated 15 different conformers (Figs. 10 and 11) using the ETKDG (Experimental-Torsion basic knowledge distance geometry) algorithm [71] contained in RDKit. Each of the resulting conformers were separately aligned to the docked fragment poses using the graph theory algorithm. Each conformer was also used for direct docking with AutoDock Vina. Complete ligand docking results for the GPCR complex (ligand 453) are as good as the fragment-based docking results. However, for the CYP450 complex (ligand KKK), fragment-based docking yielded much better RMSD results (Table 7). The binding pocket of the GPCR receptor is comparatively exposed, which makes it easier for complete ligand docking to find correct ligand orientation. On the other hand, for the docking of a relatively large ligand (ligand KKK) in to the buried pocket of the CYP450 receptor, direct docking of complete ligands seems to be rather challenging, especially with modeled receptor structures.

Table 7 Minimum RMSD values for different ligand (KKK, 453) conformations

The resulting docked poses were also energy minimized and then scored using the DSX scoring function [70]. Figure 12 shows that for 10 of 15 KKK and 453 conformations, the top 20% scoring poses included one or more binding poses close to native (< 3 Å).

Fig. 10
figure 10

15 conformations of ligand KKK ligand used in unbound docking

Fig. 11
figure 11

15 conformations of ligand 453 used for unbound docking

Fig. 12
figure 12

RMSD distributions for docking different conformations of a ligand KKK and b ligand 453 in top 20% hits according to DSX ranking

Conclusions

In this article, we introduce a fragment-based docking method in which binding poses of complete ligands are determined based on alignments with docked fragment poses. A graph theory maximum clique algorithm is adopted to identify proper alignments between a multi-fragment ligand and a set of docked poses of various types of fragments. Bound docking of 10 CYP450 complexes verified that this fragment-based approach yields valid docking results as other established docking methods: for all the tested complexes, experimental-like binding positions of complete ligands can be successfully recovered. Then the fragment-based method was challenged with unbound docking, in which the input receptor structures were not from experimental complexes, but constructed through homology modeling. For all the complexes tested, binding poses with RMSD less than 3 Å from actual ligand binding positions could be recovered by the fragment-based approach, with 12 complexes having RMSD within 2 Å. For comparison, direct docking of complete ligands into the same modeled receptor pockets generated the binding positions in 12 of 20 complexes with RMSD < 3 Å.

The relatively poor performance of unbound docking of complete ligands are consistent with the well-known fact that conventional docking approaches are highly sensitive to receptor structures, making unbound docking, especially docking with modeled receptor structures, a challenging task. The graph theory based fragment docking approach proposed here is intended to be less sensitive to receptor structures. In this new approach, different regions of a binding pocket are probed more systematically and thoroughly by small, rigid fragments. By retrieving possible binding positions of complete ligands through alignments with the docked fragments, one avoids difficulties associated with moving a relatively large ligand within the limited volume of a binding pocket according to an energy function. Because the different sub-pockets (the favorable binding sites for different fragments) are searched separately and independently, the requirements on the modeled receptor pocket to be accurate as a whole is efficiently reduced. Additionally, the relative geometries between the docked fragment poses are far less restricted than the relative geometries between the covalently linked fragments in a complete ligand. This may lead to better tolerance of inaccuracies in the relative geometries between different regions of a modeled receptor pocket. Compared with existed flexible docking approaches, the graph-based approach does not require extensive sampling of receptor structures, which is time consuming and convergence or thorough coverage of different states important for ligand binding is not easily achieved. Our graph based docking strategy is less sensitive to receptor inaccuracies and hence gave good results even with poor quality structure models. However, with this method, one will end up in having multiple similar poses that need to be filtered out. In addition, this method requires prior knowledge of possible binding conformations to pick the best binding pose. In practice, this should not cause serious problems as the conformational space of a ligand are usually not enormous and a set of candidate binding conformations may be generated from systematic conformational analysis. These theoretical considerations and the actual docking results taken together, the fragment-based docking approach proposed here provides a useful new method for computational ligand docking with approximate, modeled receptor structure.