Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) recently emerged and spread to cause a pandemic of coronavirus disease 2019 (COVID-19). Given the failure to contain the initial outbreak, the global failure to restrain the pandemic, and the absence of an effective vaccine, we may need to identify existing drugs or develop new drugs to interrupt COVID-19 at a critical juncture.

A number of targets may be of interest for the development of small molecule therapeutics for COVID-19: main protease (Mpro, 3CLpro), helicase (Nsp13), endoribonuclease (Nsp15), and 2′-O-methyltransferase (Nsp10/16) are known viral protein drug targets for SARS-CoV-2. Small molecule drugs may target the substrate binding site of Mpro, the ADP binding site of Nsp13, the active site of Nsp15, or the S-adenosylmethionine (SAM) binding site of Nsp16.

Water is essential to the description of interactions between drugs and their biomolecular targets because solvation is a key contributor to molecular recognition and binding. Energies, entropies, and structural features of water molecules can be used to identify waters that may produce favorable or unfavorable contributions to the free energy of binding upon displacement and therefore aid in the identification of ligand interactions that may or may not be desirable. Water networks and tightly bound structural waters can affect ligand–receptor binding affinities. Information on water structure and thermodynamics may be useful to screen virtual compound databases, to identify new lead drug candidates, and inform rational lead modification to improve affinity and specificity for its target [1,2,3,4]. Ignoring water molecules in binding sites may reduce the chance that a drug design project will be successful.

Solvation thermodynamic mapping (STM) is widely used in academic studies of drug–protein interactions and has been widely integrated into the workflow of drug discovery and rational design efforts at major pharmaceutical companies. The utility of STM spans a number of areas in early-stage drug development efforts including virtual screening [1, 2], formation or improvement of pharmacophores [2, 5], docking [1, 6], and rational lead modification [3, 4]. While the utility of STM is apparent, there are significant obstacles to widespread use. Of particular concern is that many existing software packages for characterizing water properties are commercial and, hence, not available to all and/or they require computational expertise in molecular dynamics, computer modeling, and statistical mechanics in order to apply. This set of skills often does not exist in wet chemistry labs whose research is dedicated to discovering and optimizing new pharmaceutical compounds.

The goal of this publication is to remove these obstacles and make publicly available solvation thermodynamic and structural maps of SARS-CoV-2 targets as a resource to the academic and industrial drug design community to aid in their pursuit of identifying small molecule treatments for COVID-19. In order to aid in screening and modification of drugs, we offer a free public repository of solvation thermodynamic maps of significant small molecule COVID-19 drug targets. Here we present solvation maps of seven targets that are likely viable for small molecule modulation. All GIST maps, active site 3D-RISM maps, and simulation data are publically available on the KurtzmanLab github (github): github.com/KurtzmanLab/COVID19_GIST_HSA. Full system 3D-RISM maps can be accessed at https://scholarworks.csun.edu/handle/10211.3/217209.

Methods

Protein preparation

Protein monomer structures were prepared using the Protein Preparation Wizard [7] in Maestro [8] with default settings. ACE and NMA groups were used to cap the protein termini. Active sites were visually inspected and compared to ligand-bound structures to ensure that protonation states and conformations were consistent with known ligand–protein interactions. Proteins were left as prepared by Maestro except for 6YB7 for which the protonation state of H163 was changed from being protonated in the delta position (HID) to the epsilon position (HIE) to produce an expected ligand protein interaction. No changes were made for other proteins. Energy minimization for hydrogen atoms was then performed in Maestro. The resulting protonation states can be found in Table 1.

Table 1 Protonation states of Mpro

A second set of structure models for SARS-CoV-2 Mpro (PDB IDs 6YB7 and 6W63) were manually prepared by one of the authors (McKay). All histidine side chains were assigned as either HIE or HID given the local environment. Protonation states for histidine residues found near the active site can be found in Table 1. All asparagine and glutamine side chains were examined and found to be in reasonable rotameric states. For these systems, the PARM@FROSST [9] small molecule extension to ff14SB [10] and AM1-BCC [11] charges were used for the ligands. All prepared structures can be found in the github repository.

Molecular dynamics simulations

Molecular dynamics simulations were performed in GPU accelerated AMBER 16 [12] using the ff14SB [10] force field and the optimal point charge (OPC) model [13] of water. Ligand force field parameters were assigned with the general AMBER force field, GAFF [14] using the Antechamber package [15] in AmberTools. Antechamber assigns charges, missing bonds, angles, dihedral angles and Lennard–Jones parameters for each atom. Ligand charges were assigned using AM1-BCC [11].

For systems with a co-crystalized ligand, the ligand was removed from the protein, and then the protein was solvated in a box of OPC water molecules with dimensions that ensured there were at least 10 Å between any atom of the protein and the box edge. Sodium or chlorine counterions were added accordingly to neutralize the system. Each system was then energetically minimized in a two-step process. The first minimization step was performed with 1500 steps of steepest descent with all protein atoms restrained harmonically using a force constant of 100 kcal/mol·Å2. For the second minimization step, only main chain heavy atoms were restrained. Following minimization, the system was heated to 300 K in a 240 ps NVT simulation with the main chain heavy atoms restrained; the temperature was regulated by Langevin thermostat with collision frequency of 1 ps. This was followed by a 20 ns NPT simulation with the atom restraints declining from 100 to 2.5 kcal/mol·Å2 in the first 10 ns. In the production phase, the temperature was regulated via a Langevin thermostat set to 300 K with a collision frequency of 2 ps. The constant pressure (1 atm) was maintained by isotropic position scaling with a relaxation time of 0.5 ps.

The SARS-CoV-2 main protease is expected to be functionally active as a dimer due to its structural homology with SARS-CoV-1 [16,17,18]. To investigate possible differences between the monomer and dimer we simulated both the monomeric crystal structure and the reported dimeric biological assemblies for 6W63 and 6YB7. We present data for several different sets of heavy atom restraints to model different amounts of protein flexibility. In one set (denoted rigid in the repository), restraints of 2.5 kcal/mol·Å2 were applied to all heavy atoms. In the second set (denoted SCflex) no restraints were applied to the side chain heavy atoms and restraints of 2.5 kcal/mol·Å2 were applied to the backbone heavy atoms. In the third set (denoted McKay structures) for both monomers and dimeric biological assemblies, carbon atoms were restrained (2.0 kcal/mol·Å2) for six residues (L87, V91, A206, A211, F294, R298) located distal to the active site. These six-residue restraints were applied to each monomer during the production simulations. All MD simulations files for all simulations are included in the github repository.

GIST

GIST maps were created using the GPU port [19] of AmberTools cpptraj-GIST [20]. Analyses were performed on the complete 50 ns production trajectory for each system (25,000 configurational snapshots). For each system, maps were created in a cubical region with 30 Å length sides centered on the geometric mean position of the co-crystalized ligand for the PDB (see Fig. 1). The resolution of the grid was 0.5 Å (0.125 Å3 per voxel). For structures with no co-crystalized ligand for the PDB entry, a homologous protein with a co-crystalized ligand was structurally aligned to the PDB structure and the geometric center of that ligand was used to define the GIST analysis region. In the case of 6JYT, the region was defined for HSA by a partial set of the residues found in the active site (K288, S289, D374, E375, R567). For the GIST analysis of 6JYT, the geometric center of ADP from a structurally aligned 2XZL was used as the center of the box. The ligands used for defining the GIST region for each structure can be found in the repository.

Fig. 1
figure 1

The co-crystalized structure of Mpro (cartoon) with ligand N3 from 6LU7. The GIST analysis was performed in the cubical region shaded in gray

Hydration site analysis

Hydration site analysis, HSA [21] was performed using the publicly available SSTMap code [22] with the default settings except for the region analysis which was set to within 10 Å of the ligand (− d 10). For each system, the analysis was per the first 20 ns (10,000 frames) of the MD production run for each protein.

Briefly, the method analyzes all the water positions from an MD trajectory and identifies high-density 1 Å radius spherical regions called hydration sites. In each hydration site, average quantities of the water molecules found in the hydration site are calculated and provide estimates for the local IST thermodynamic quantities. A number of measures that describe the local solvent structure and characterize the hydrogen-bonding environment of the water in each hydration site are also calculated. These measures can be used to characterize the enhancement or disruption of local water structure, describe the local enclosure, and describe the average hydrogen bonding interactions that water has in each hydration site with both its water neighbors and protein. Full details of the calculations are specified in a previous publication and the code is available on the github.

We also use newly developed code to determine the most probable orientations for water molecules in each hydration site. To do this, the orientations of all water molecules in each hydration site are clustered using a quaternion distance metric and the centroid orientation of each high-density cluster (generally at least 10% of the population) is recorded. The code and complete details of the method are in the github.

3D-RISM

3D-RISM maps were created for all rigid structures in the repository using rism3d.snglpnt [23] from AmberTools 19 [24]. In each case, the initial protein conformations of the production-phase molecular dynamics simulations were used after all waters had been removed. Water and solvation thermodynamics and number density distributions [25] were calculated for each protein structure with a residual tolerance of 10−6 using the partial series expansion of order-3 (PSE-3) closure [26]. The solvation box was extended to included Lennard–Jones interactions < 10−7 kcal/mol, which corresponded to a minimum distance between the protein and box edge of 47 Å. A 0.5 Å grid spacing was used with the centering = 3 option, which rounds the positions of the grid points to the nearest 0.5 Å. Oxygen and hydrogen solvent site contributions to the thermodynamics maps were combined using the molecular reconstruction technique [27].

Bulk water solvent-susceptibility input for 3D-RISM was generated with rism1d using the coincident SPC/E (cSPC/E) water model [23], the PSE-3 closure and dielectrically consistent RISM (DRISM) theory [28]. The temperature, density and dielectric constant were set to 298.15 K, 55.41 M and 78.45 respectively. A grid of 16,384 points was used with a grid spacing of 0.025 Å.

Repository contents

Structures 1–5 (SARS-CoV-2 structures)

Main protease (Mpro, 3CLpro): 6LU7 [29] (2.16 Å), 6YB7 (1.25 Å), 6M03 (2.00 Å), 6Y84 (1.39 Å), 6W63 (2.10 Å). Target the substrate binding site of Mpro.

Structure 6 (SARS-CoV-1 structure)

Helicase (Nsp13): 6JYT [30] (2.80 Å). Target (1) the ADP binding site but discourage (2) the nucleic acids binding site. No SARS-CoV-2 structure exists for this protein.

Structure 7 (SARS-CoV-2 structure)

Nsp16 (2′-O-methyltransferase, nsp 10/16): 6W4H (1.80 Å). Target the S-adenosylmethionine (SAM) binding site.

All files with prepared structures, topologies files, and molecular dynamics input and restart files are provided as well as solvation structural and thermodynamic maps described below.

Solvation thermodynamic maps

Inhomogeneous solvation theory, IST [31,32,33] provides the statistical mechanical framework for the solvation thermodynamic quantities from explicit solvent molecular dynamics simulations. Here, we use two methods: grid-based inhomogeneous solvation theory, GIST [34, 35] and HSA [21] to localize the IST thermodynamic quantities onto a three-dimensional grid and onto high density 1 Å radius spherical “hydration sites”, respectively. These localization approaches both process snapshots of the system configurations generated in molecular dynamics simulations to estimate local IST thermodynamic quantities including local energies, entropies, and number densities.

Grid based solvation maps

The repository contains grid-based solvation maps of calculated IST and 3D-RISM entropies, energies, and densities in Data Explorer (dx) format. The dx format enables visualization in standard graphics packages such as VMD and Pymol. For each target, energetic maps are provided for water’s interactions with the protein, with other water molecules, and the total interactions of the water in each voxel with the system as a whole.

GIST provides entropy maps for the total entropy as well maps that separately include the translational and orientational contributions to the total entropy. Maps are provided for all of the entropy and energy quantities for both normalized (per water quantities) and density (per voxel) quantities. A complete list of quantities can be found in Table 2. Detailed descriptions of these quantities can be found in our prior work [20, 35].

Table 2 Key GIST quantities

3D-RISM provides maps of the total entropy and the solvation free energy, as well as maps solvation energy. In all cases, only densities maps are provided. A list of quantities is provided in Table 3 and descriptions can be found in Nguyen et al. [27].

Table 3 Key 3D-RISM quantities

3D-RISM maps differ from GIST maps in that they are not limited to the ligand binding site and do not rely on solvent sampling, which may be incomplete. However, the orientations of water molecules are lost in this process and the density distributions may differ from those of explicit solvent, primarily in the height and breadth of the maxima and minima. As the files for the full thermodynamic and number density distributions are quite large, files available in the github repository have been truncated to the binding site and match the dimensions of the GIST maps.

Hydration site solvation maps

For each target, the positions and calculated thermodynamic and structural quantities for the water in each hydration site are summarized in a space delimited spreadsheet file.

The same energetic quantities as calculated for GIST (above) are calculated for each hydration site and reported in per water (normalized) units. Additionally, the HSA data includes a breakdown of the total energy into contributions from Lennard–Jones, electrostatic, and first solvation shell water–water interactions.

SSTMap also calculates a number of quantities that are aimed at characterizing the local environment surrounding each hydration site. These are aimed at better describing local water structure and the interactions of the water in the hydration site with the protein surface.

Quantities that provide a measure of local water structure include the average number of first shell neighbors each water has in its first solvation shell, the fraction of these neighbors to which the hydration site water is hydrogen bonded, and the average energy of interaction with each neighboring water. When compared to bulk water values, these quantities provide measures of whether the local water structure is enhanced or frustrated [36].

Additional quantities that characterize the interaction of the water in each hydration site with the protein include: (1) an enclosure parameter that describes how much of the region around the hydration site is protein and how much is water, (2) the average number of hydrogen bond donor and acceptor interactions that water molecules found in the hydration site have with the protein surface, and (3) lists of the protein residues that donate and accept hydrogen bonds to the water in the hydration site.

A list of thermodynamic and structural quantities can be found in Table 4. A text delimited spreadsheet file summarizing all calculated water properties is found in the HSA directory for each protein.

Table 4 HSA structural quantities

In addition, to facilitate visualization, each HSA directory includes PDB files that feature (1) the hydration site centers, (2) water molecules located at the center of each hydration site that have the most probable orientation, and (3) water molecules located at the center of each hydration site that include all probable orientation clusters.

Potential applications

Solvation thermodynamic mapping has been used in a variety of applications aimed at aiding the discovery and design of new pharmaceutical compounds. In docking, scoring terms have been added to explicitly account for solvent displacement upon ligand binding and the modified docking scoring functions have been used to help improve AUC, pose prediction, and identify novel binding ligands [1, 6, 37]. Solvation maps have also been used to create pharmacophores [2] as well as provide criteria to prioritize the selection of pharmacophore sites [5]. Both water thermodynamics and water interactions with protein surfaces have been used to direct lead modification [4, 38].

Here, we describe by example several potential applications for the GIST and HSA solvation maps provided in this repository. 3D-RISM solvation maps can be used as a complement or alternative to GIST and HSA results as they treat regions where water molecules may not be able to exchange during normal sampling, allow the hydration of the entire protein to be explored, and are based on a different theoretical framework. Certainly, users should have greater confidence for regions where all three methods agree. As the 3D-RISM maps extend beyond the active site, they may be used to investigate the dimer interface as well as investigate potential allosteric modulation distal from the substrate binding pocket.

Rational lead modification

The properties of water in and around the binding site may be used to direct the design of chemical modifications to a lead compound or fragment. The physical principles of this are that the displacement of thermodynamically unfavorable surface water upon the binding of a ligand will lead to favorable contributions to the free energy as the water is displaced to the more thermodynamically favorable environment of bulk biological water.

Here, we illustrate how solvation structural and thermodynamic solvation mapping in this repository can be used to provide insight into which modifications may lead to boosts in binding affinity.

The binding site of Mpro features a large number of energetically unfavorable hydration sites (see Fig. 2). Prior work [39, 40] suggests that the displacement of water from these hydration sites may be correlated with differences in binding affinities between congeneric pairs of ligands. Most of the hydration sites identified in Fig. 1 are displaced by N3. However, the two leftmost sites are not. We will focus on the upper left site, hydration site 7 (HS7), as it has an exceptionally unfavorable thermodynamic profile.

Fig. 2
figure 2

N3 bound to Mpro (PDB ID 6LU7). Hydration sites that are located within 7.5 angstroms of N3 and have highly unfavorable energy (ΔE > 0.5 kcal/mol with respect to neat water) are shown as transparent red spheres. The most probable water orientation for each hydration site is represented by a water molecule at the center of each sphere. The protein surface proximal (within 11 Å) to N3 is shown in gray

HS7 occupies a small cleft on the surface of the protein, which is formed by seven different residues (N28, G143, N119, T26, Y118, and C145). The water in this cleft is resolved the crystal structures of 6LU7, 6W63, 6Y84, and 6YB7. However, this water is not reported in 6M03. The water is highly enclosed by the protein (81.7%) having slightly less than 1 (0.96) water neighbor, on average, in its first solvation shell. Despite the hydration site being highly occupied (84.5% occupancy), the water is exceptionally unfavorable energetically (+ 2.6 kcal/mol) and entropically (− TS of 4.45 kcal/mol) by IST estimates. Its low entropy result is based on the water’s highly restricted translational and orientational motion. The water’s high enclosure in the protein cleft and its formation of two hydrogen bonds with the protein surface severely restrict the water’s translational freedom leading to a translational entropic penalty of 2.11 kcal by IST estimates. The two hydrogen bonds it forms with the protein surface as well as forming a hydrogen bond 82% of the time with its water neighbor located above the cleft (HS56), further restrict its orientational freedom resulting in an entropic penalty of 2.33 kcal/mole.

Despite being on a hydrophilic surface (forming on average 2.00 hydrogen bonds with the protein), the water in HS7 cannot form a full complement of hydrogen bonds, instead forming only 2.85 geometric hydrogen bonds on average compared to a bulk OPC water which would form 3.62. This deficiency of more than three quarters of a hydrogen bond, on average, is a significant contribution to the unfavorable energetic profile (+ 2.6 kcal/mole overall) of HS7.

Both the unfavorable IST energy and entropy suggest that displacing this HS7 water could lead to gains in binding affinity. In order to displace this water, an optimal chemical group must replace interactions that the water makes with the protein without disrupting the hydrogen bond network that the water is making with its neighbors. As the water in HS7 is located in a cleft, any chemical group would also need to displace its water neighbor (h-bonded water in Fig. 3). The optimal chemical group would need to both donate a hydrogen bond to the backbone carbonyl of G143 and accept a hydrogen bond from the backbone amine of N119. A hydroxy group seems ideal for this.

Fig. 3
figure 3

The most probable orientation of the water in HS7 donates a hydrogen bond (red dashed line) to the backbone carbonyl of Gly143, accepts a hydrogen bond (blue dashed line) from the backbone NH of Asn119, and donates a hydrogen bond to HS56 above the cleft wherein lies HS7

All of the numerical data in the above analysis is located in the HSA summary spread sheet for 6LU7 (6LU7_apo_flex_hsa_ summary.csv). All the data for the visualizations is likewise located in the repository.

Scoring solvation displacement in docking

Four studies outline how solvation thermodynamic mapping can be used to aid in the discovery of new leads in docking. The first two of these studies are based on our prior work on Factor Xa [39, 40], in which a displaced solvent functional used high energy and high density voxels as functional inputs to correlate with experimental measurements of differences in binding free energies between congeneric pairs of ligands [39, 40]. The third docking study [6] by Uehara and Tanaka instead used a displaced solvent functional with free energetic maps created by GIST as input whereas the fourth study [1] by Balius et al. used the displacement of voxels with high energy densities as input. The third study showed improvements in pose prediction and enrichment and the fourth showed only nominal measurable improvements to docking enrichment and pose prediction, though the method was successfully used to prospectively identify new tightly binding compounds, including the tightest binding compound to cytochrome c peroxidase. A map showing related unfavorable and favorable energy density regions for Mpro is shown in Fig. 4.

Fig. 4
figure 4

Unfavorable and favorable solvation energy density map of Mpro. Regions of unfavorable energy density (Edens > 0.1 kcal/mole/Å3) and favorable energy density (Edens > 0.1 kcal/mole/Å3) are shown in red or blue wireframe, respectively. The predicted score for a docked ligand would be penalized for displacing water from the favorable blue regions or given an affinity boost for displacing water from the red regions

The GIST maps in this repository provide the data to create the maps used in all three of the GIST-based studies. Necessary modifications of the provided GIST dx maps (e.g. creating a free energy density map from the energy and entropy density maps) can be easily created using the GIST post-processing (GISTPP) code provided on the github.

Pharmacophore creation

Solvation mapping can be used to generate water-based pharmacophore hypotheses [2] and to prioritize ligand- or protein-based pharmacophore sites [5]. Here we combine several interesting hydration sites with ligand-based pharmacophore elements.

Three pharmacophore sites were constructed using ligand–protein interactions based on analyses of co-crystalized ligands found inside the binding sites of SARS-CoV-2 Mpro structures (PDB ID 6W63, 6LU7, 6Y2F, 6Y2G, and 6M2N). These ligand-based sites appear as dotted spheres in Figs. 5 and 6.

Fig. 5
figure 5

Hybrid ligand- and water-based pharmacophore within the binding site of Mpro (PDB ID 6LU7). The ligand-based sites are shown as dotted spheres and the water-based sites are shaded spheres. Ligand-based sites have an NH group for donors or an oxygen for acceptors. The most probable water orientation is found at the center of each water-based pharmacophore site. Acceptor sites are red and donor sites are blue spheres

Fig. 6
figure 6

The same hybrid pharmacophore hypothesis as shown in Fig. 5, except the interactions with chemical groups on the surface are shown explicitly. Blue dashed lines show the pharmacophore sites donation of hydrogen bonds and red dashed lines show acceptation

The leftmost ligand-acceptor site (Figs. 5, 6) lies inside the oxyanion hole. All five of the co-crystalized ligands accept a hydrogen bond from the backbone amino group of G143 while three of five (6Y2F, 6Y2G, and 6M2N) also accept a hydrogen bond from C146. The pharmacophore site shown in Fig. 6 shows both of these interactions. The middle ligand-based site donates a hydrogen bond to the backbone carbonyl of H164. Ligands from 6W63, 6Y2G and 6Y2F make this contact. The rightmost ligand site, inside the S1 subsite, accepts a hydrogen bond from the backbone amino group of E166. Four of the five (all except 6M2N) co-crystallized ligands accept a hydrogen bond from this group. Each ligand-based site is proximal to a hydration site and GIST high-density group of voxels but none have any significant thermodynamic signal for use in prioritization. These ligand-based sites were chosen by the fact that they were well conserved across the limited number of structures available with co-crystallized ligands.

We used hydration site analysis to add three additional ligand-based pharmacophores sites. These sites are shown in shaded spheres in Figs. 5 and 6. While water-based pharmacophore sites can be chosen using other criteria (as outlined in Jung et al. [2]), here we simply chose water-based sites that are energetically unfavorable and categorized them based on their donor/acceptor interactions with the protein surface.

The first site (on the far right of Figs. 5, 6) is from HS9, which primarily accepts a hydrogen bond from the backbone amino group of residue T190 and has an unfavorable energy of − 11.28 kcal/mole (almost 1 kcal above the bulk energy of − 12.26 kcal/mole). The second site HS52 (middle left in Figs. 5, 6) has an energy of − 10.46 kcal/mole (1.8 kcal above bulk energy) and donates a hydrogen bond to T26. The third site, HS56 has an energy of − 11.69 kcal/mole (0.57 kcal less favorable than bulk) and donates a hydrogen bond to T25.

Together, the conserved ligand sites and the water-based sites create a pharmacophore hypothesis that can used to screen virtual compound databases.

While we arbitrarily chose three conserved sites from the ligand and three proximal hydration sites to construct the hypothesis outlined here, this approach allows a drug designer flexibility to choose ligand and water sites on virtually any solvent exposed surface of the protein, allowing different regions of the active site or potential allosteric sites to be targeted.

How to access data

All hydration site, 3D-RISM and GIST data is available on github with a readme.md that details directory structure and the descriptive file naming convention. Briefly, each PDB structure has its own subdirectory named after its PDB ID. Each PDB ID subdirectory has further subdirectories for simulations with apo or complexed structures and different protein restraints. Additional subdirectories for each of these include the hydration site, 3D-RISM and GIST analyses, as well as the prepared protein input files and Amber MD restart files in case longer simulations are desired. All of the above can be found on the github (github.com/KurtzmanLab/COVID19_GIST_HSA).