1 Introduction

In December 2019, a severe respiratory illness resembling that caused by severe acute respiratory syndrome coronavirus emerged in Wuhan, Hubei, China, and has since spread all over the world with high mortality. In the past, two beta coronaviruses, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), caused high mortality rates and became a threat to human life [1]. The most recent outbreak of viral pneumonia was first disclosed by the Wuhan Municipal Health Commission [2, 3], and the World Health Organization (WHO) was alerted to the pneumonia outbreak announced by Chinese officials [4]. The novel coronavirus (2019-nCoV) was isolated from the 27 patients who were initially reported, and the number of patients was subsequently revised to 31,498 as of March 23, 2020, with 3267 deaths [5]. The current 2019-nCoV outbreak shares some features with the SARS outbreak: both began in winter, were linked to live animal markets, and were caused by previously unknown coronaviruses [2, 5].

Fever, cough, and shortness of breath are the symptoms in common cases, whereas pneumonia, severe acute respiratory syndrome, and kidney failure are reported in severe cases [4]. Most of the 2019-nCoV patients are linked to the Huanan Seafood Wholesale Market, where several kinds of wildlife, including bats and snakes, as well as poultry are sold. So far, no specific wild animal has been identified as the host of the novel coronavirus. Bats are considered the native host of the novel coronavirus (2019-nCoV), although there may be other hosts in the transmission from bats to humans [5]. The Spring Festival travel rush accelerated the spread, so it is a top priority to contain the virus, develop new drugs to combat it, and cure patients in time. Knowledge about 2019-nCoV can be drawn from previous work on SARS-CoV. For SARS-CoV, a variety of modern machine learning methods, in particular deep neural networks, were used for drug discovery and development. These methods take advantage of larger datasets compiled from high-throughput screening and predict the bioactivities of compounds against a target with high accuracy [6].

The genetic sequence of 2019-nCoV is similar to that of SARS-CoV (79.5%) [7, 8]. The S-protein and the 3C-like protease are potential drug targets. The S-protein is the main target of neutralizing antibodies, and antibodies that bind this protein have the potential to stop virus entry into host cells [9]. The 3C-like protease catalyzes a chemical reaction that is important in SARS coronavirus replicase polyprotein processing [10, 11]. Neutralizing antibodies against the S-protein of SARS-CoV have been obtained from human patients, and the anti-SARS-CoV S antibody triggered fusogenic conformational changes [9]. This provides an important clue for preventing virus entry into host cells with antibodies or peptides. Inhibitors of the 3C-like protease also have the potential to prevent coronavirus maturation, and a series of unsaturated ester inhibitors of the SARS-CoV 3C-like protease has been deposited in the Protein Data Bank (PDB) (crystal structures of SARS-CoV 3C-like protease complexed with a series of unsaturated esters; PDB ID: 3TIT).

These earlier SARS inhibitors can also serve as starting points for designing inhibitors against 2019-nCoV. With the growing number of protein–ligand complex structures, deep learning algorithms for identifying or predicting potential binding compounds for a given target have become possible [12, 13]. In addition to small-molecule compounds, scientists also rely on peptides and antibodies to combat the virus because of their stronger binding affinity. In the post-genomics era, a Dense Fully Connected Neural Network (DFCNN) model can make drug discovery more effective, faster, and cheaper, because the deep layers of the model can learn more features from the data and make accurate predictions. Using such techniques, the antimalarial drug “pyrimethamine” was discovered against the dihydrofolate reductase (DHFR) enzyme, and another drug, BPM31510, is in a phase II trial in patients with advanced pancreatic cancer [14, 15, 16]. Hence, we believe that integrating such machine learning models into a drug discovery pipeline has implications for therapeutic drug targeting.

Considering the above, in the present work we took the 2019-nCov_3C-like protease as a potential target and built a structural model after systematically analyzing its sequence features. We built a pipeline around a deep learning method developed in our group, which represents molecules as vectors, to identify potential drugs (peptides or small ligands) against this protein target of the 2019-nCoV virus [13]. Our method is extremely fast in virtual drug screening: it takes less than a day to complete millions of protein–ligand or protein–peptide predictions, whereas traditional docking methods take several weeks even with the help of a supercomputer. Although the 2019-nCoV outbreak is a major challenge for clinicians [17], we believe the proposed list of potential drugs can help them rapidly validate drugs that relieve symptoms or even cure the disease.

2 Materials and Methods

2.1 Dataset and Sequence Alignment

We retrieved the virus RNA sequences from the Global Initiative on Sharing All Influenza Data (GISAID) database [18] and aligned them with a focus on the S-protein of interest and the ligand-binding region of 2019-nCov_3C-like protease. The amino acid sequences were translated from the RNA sequences with the Translate web tool (https://web.expasy.org/translate/). We used sequences from 18 patients in this work (EPI_ISL_402119 to EPI_ISL_404228). Details of the sequences and acknowledgements to the authors who submitted the data to the server are presented in Supplementary Table S1. Multiple sequence alignment was performed with the Clustal Omega program [19].
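
The translation step can be illustrated with a short sketch. This is not the Translate web tool itself; it is a minimal stand-in that assumes the standard genetic code, a reading frame starting at the first base, and termination at the first stop codon. The function name `translate` and the toy input are hypothetical.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order
# (assumption: the standard table is the appropriate one here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate an RNA (or DNA) string in frame 1 until the first stop codon."""
    seq = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks ambiguous codons
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Toy example (not a real viral sequence): Met, Phe, Ile, then a stop codon.
toy_protein = translate("AUGUUUAUUUAA")
```

In this table a single base change such as UUU to AUU converts a phenylalanine codon into an isoleucine codon, which is the kind of point mutation a codon-level comparison of patient sequences can reveal.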

2.2 Homology Modeling of 2019-nCov_3C-like Protease

The structural model of 2019-nCov_3C-like protease was built with Modeller 9.9 [20]. The SARS coronavirus 3C-like protease (PDB ID: 3TNT), which has about 96.07% amino acid sequence identity with the target, was used as the template. The software outputs multiple predicted structures, which are ranked according to the discrete optimized protein energy (DOPE) score [21]. The quality of the model was validated by examining its stereochemical quality on a Ramachandran plot. The model was further evaluated with PROCHECK [22], ERRAT [23] and QMEAN [24], and the final optimized structural model was used for further analysis.
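
A sequence identity figure of this kind is derived from a pairwise alignment of target and template. The sketch below shows one common convention, assumed here for illustration: identical positions divided by the number of aligned (gap-free) columns. The text does not state which denominator was used, so the helper `percent_identity` and the toy strings are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy alignment (not the real protease sequences): 3 of 4 columns match.
toy_identity = percent_identity("ACDE", "ACDF")
```

Other conventions divide by the length of the shorter sequence or by the full alignment length, so identity values from different tools are not always directly comparable.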

2.3 Deep Learning Model for Virtual Screening of Large Databases

In our previous work, we built a Dense Fully Connected Neural Network (DFCNN) deep learning model to reverse-search drug targets. Here we apply this model to large-scale virtual screening. Since the method has been shown to have relatively high accuracy and efficiency, it is well suited to an emerging disease outbreak. The DFCNN is a densely connected, fully connected neural network: the dense connectivity (similar to DenseNet, but with the convolution layers replaced by fully connected layers) allows deep layers without the vanishing gradient problem. The deeper layers enable the model to learn more abstract features from the data. The training data for DFCNN come from the PDBbind database [25], in which we define the crystal protein–ligand PDB complexes as positives and cross-docking complexes as negatives. The detailed process of building the deep learning model is described in our recently published work, which virtually screens targets for an input small molecule using a vector representation [13]. The overall workflow of the proposed method is shown in Fig. 1. The DFCNN model has two advantages over many other methods: it is independent of docking simulation, and its training dataset includes nonbinding decoys. Independence from docking simulation makes it extremely fast, while the inclusion of nonbinding decoys during training makes the model robust in real application scenarios.
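
The dense connectivity described above can be sketched in a few lines. This is an illustrative forward pass only, not the published DFCNN: the layer count, the growth width, the ReLU activation, the sigmoid output and the random initialization are all assumptions, and the function names are hypothetical. What it does show is the DenseNet-style idea with fully connected layers, where each layer receives the concatenation of the input and all earlier layer outputs.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Randomly initialised fully connected layer: (weights, bias)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def build_dense_net(n_input, growth, n_layers, rng):
    """Each hidden layer sees the input plus every earlier layer's output."""
    layers, width = [], n_input
    for _ in range(n_layers):
        layers.append(init_layer(width, growth, rng))
        width += growth  # concatenation widens the next layer's input
    head = init_layer(width, 1, rng)  # single output unit for a binding score
    return layers, head


def forward(x, layers, head):
    features = list(x)
    for weights, bias in layers:
        new = [max(0.0, v) for v in linear(features, weights, bias)]  # ReLU
        features = features + new  # dense connectivity: keep all earlier features
    logit = linear(features, *head)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # score in (0, 1)


rng = random.Random(0)
layers, head = build_dense_net(n_input=6, growth=4, n_layers=3, rng=rng)
score = forward([0.1] * 6, layers, head)
```

Because every layer keeps a direct path to the input and to all earlier features, gradients have short routes back through the network, which is the usual argument for why dense connectivity eases training of deeper models.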

Fig. 1

The workflow of virtual screening of small chemical compounds and tripeptides against the 2019-nCov_3C-like protease

2.4 Virtual Screening Against ChemDiv Database

The structural model of the ligand-binding region of 2019-nCov_3C-like protease is used as the target protein structure. We define the pocket as the residues within a cutoff distance of 1 nm of the known ligand (the binding site is defined using the ligand from the template PDB 3TNT). The ligand database is taken from ChemDiv (https://www.chemdiv.com/) and contains around 1,000,000 compounds. We first used the DFCNN model to perform large-scale virtual screening. The mean and standard deviation of the training dataset were used for data normalization to give more stable performance. In the second stage, the top predictions of the DFCNN model were chosen for AutoDock Vina-based docking simulation. The docking results were visualized and examined with Discovery Studio Visualizer [26]. Finally, we provide a list of proposed compounds that have the potential to bind the protein pocket.
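
The pocket definition can be written down as a simple distance filter. The sketch below assumes coordinates in ångströms (so 1 nm is 10 Å) and assumes that a residue belongs to the pocket if any of its atoms lies within the cutoff of any ligand atom; the text does not specify whether atom-level or residue-centre distances were used. The function name, residue labels and coordinates are hypothetical toy values.

```python
import math

CUTOFF_ANGSTROM = 10.0  # 1 nm expressed in angstroms


def pocket_residues(protein_atoms, ligand_coords, cutoff=CUTOFF_ANGSTROM):
    """Return sorted residue labels with at least one atom within `cutoff` of the ligand.

    protein_atoms: iterable of (residue_label, (x, y, z)) tuples
    ligand_coords: list of (x, y, z) tuples for the ligand atoms
    """
    selected = set()
    for residue, coord in protein_atoms:
        if residue in selected:
            continue  # residue already accepted through another atom
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_coords):
            selected.add(residue)
    return sorted(selected)


# Toy coordinates, not taken from any real structure.
toy_atoms = [
    ("RES1", (0.0, 0.0, 0.0)),
    ("RES1", (1.0, 0.0, 0.0)),
    ("RES2", (9.5, 0.0, 0.0)),
    ("RES3", (30.0, 0.0, 0.0)),
]
toy_ligand = [(0.0, 0.0, 0.5)]
toy_pocket = pocket_residues(toy_atoms, toy_ligand)
```

In practice the atom list would be read from the model's PDB file and the ligand coordinates from the ligand transferred from the template structure.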

2.5 Virtual Screening Against Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library, and Targetmol-Bioactive_Compound_Library

The Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library, and Targetmol-Bioactive_Compound_Library contain about 2040, 1680, and 5370 compounds, respectively. We applied the DFCNN model to screen these three libraries against 2019-nCov_3C-like protease. Compounds with high DFCNN scores are recommended as potential inhibitors for further experimental validation.

2.6 Virtual Screening Against Tripeptide Database

A tripeptide database containing all 8000 (20 × 20 × 20) tripeptides is first built. Each amino acid in the tripeptide database is converted into a molecule vector by Mol2vec [27]. For each peptide, the sum of its amino acid vectors is used as the peptide vector. The protein pocket is defined as the residues within a cutoff distance of 1 nm of the known ligand, and the pocket is likewise converted into a vector. The pocket and peptide vectors are then concatenated into a single input row with a maximum dimension of 600. We use the same DFCNN model, a densely connected, fully connected network trained on a protein–ligand dataset from the PDBbind database. Since both ligands and peptides are composed of chemical groups, a model trained on protein–ligand complexes should also be suitable for protein–small peptide interactions.
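
The construction of the tripeptide inputs can be sketched as follows. Real Mol2vec embeddings are replaced here by random placeholder vectors, and the per-vector dimension of 300 is an assumption (only the maximum input dimension of 600 is stated in the text). Zero-padding of short rows is likewise an assumption, and all function names are hypothetical.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
VECTOR_DIM = 300   # assumed size of one embedding vector
INPUT_DIM = 600    # maximum input dimension stated in the text


def enumerate_tripeptides():
    """All ordered tripeptides: 20 * 20 * 20 = 8000 sequences."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def placeholder_embeddings(dim=VECTOR_DIM, seed=0):
    """Random stand-ins for per-amino-acid Mol2vec vectors (illustration only)."""
    rng = random.Random(seed)
    return {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}


def peptide_vector(peptide, embeddings):
    """Peptide vector = element-wise sum of its amino acid vectors."""
    return [sum(values) for values in zip(*(embeddings[aa] for aa in peptide))]


def build_input(pocket_vec, peptide_vec, input_dim=INPUT_DIM):
    """Concatenate pocket and peptide vectors into one row, zero-padded to input_dim."""
    row = list(pocket_vec) + list(peptide_vec)
    if len(row) > input_dim:
        raise ValueError("concatenated vector exceeds the model input dimension")
    return row + [0.0] * (input_dim - len(row))


tripeptides = enumerate_tripeptides()
embeddings = placeholder_embeddings()
pocket_vec = [0.0] * VECTOR_DIM  # placeholder for the real pocket vector
row = build_input(pocket_vec, peptide_vector("IKP", embeddings))
```

One consequence of summing the amino acid vectors is that the representation depends only on composition: "IKP" and "PKI" map to the same vector up to floating-point rounding, so predictions are effectively made per amino acid combination rather than per ordered sequence.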

3 Results

3.1 Sequence Alignment and Homology Modeling

RNA sequences from eighteen patients, obtained from the public GISAID database, were translated into protein sequences with the Translate tool. The ligand-binding site of the template protein (3TNT) is used as the reference to define the pocket region of our homology model. We checked for mutations in the pocket region of 2019-nCov_3C-like protease and found that the pocket sequences are 100% identical across the viruses from the 18 patients. This indicates that the virus is highly conserved in this region, which makes the site suitable for drug design. The alignment of the S-protein epitope regions also shows high conservation among the patients (Supplementary Figure S1). The figure shows that the sequence EPI_ISL_402132 has a point mutation at position 32, where a phenylalanine codon is replaced by an isoleucine codon. 2019-nCoV_3C-like protease was also aligned to the SARS-CoV protease with Clustal Omega [19]. The aligned sequences are shown in Fig. 2. Both proteins have 276 amino acid residues. The figure indicates high similarity between 2019-nCoV and SARS-CoV, which is consistent with the findings of Xu et al. [5]. Using the X-ray crystallographic structure of SARS coronavirus 3C-like protease, solved at 1.59 Å resolution, as the template, a theoretical protein model was built for 2019-nCoV_3C-like protease with Modeller. Figure 3a shows the crystallographic structure of SARS_coronavirus_3C-like protease and Fig. 3b shows the homology model of 2019-nCoV_3C-like protease. There are only four mutations (T35V, A46S, S94A and K180N) between SARS_coronavirus_3C-like protease and 2019-nCoV_3C-like protease, as shown in Fig. 3a and b. In the figure, the mutated residues are marked in blue. Figure 3c shows the model structure with a known SARS_coronavirus_3C-like protease inhibitor. The binding pocket and the two-dimensional ligand interaction pattern of the target protein are shown with reference to the template. Twenty-three protein–ligand interactions are observed, including 15 hydrogen bonds, one disulphide bond and a few pi-stacking interactions, as shown in Fig. 3d. The pocket extracted from the model is used for the subsequent large-scale virtual screening.

Fig. 2

The sequence alignment of SARS_coronavirus_3C-like protease and 2019-nCov_3C-like protease

Fig. 3

The structural model of 2019-nCov_3C-like protease and its template. In a and b, the modeled 2019-nCov_3C-like protease and SARS_3C-like protease are shown with the mutated four residues marked with blue color. The ligand from the PDB 3TNT is transferred to the modeled structure (c) and based on residue distance from the transferred ligand, we define the pocket (d). The interaction between the ligand and the modeled 2019-nCov_3C-like protease is also shown (d)

3.2 Virtual Screening Against Four Small Molecular Compound Databases

The ChemDiv dataset, widely used for large-scale virtual screening, contains a large number (~1,000,000) of drug-like compounds and drug leads. The potential drug candidates from the ChemDiv dataset with the highest scores (AutoDock Vina score and our deep learning model score) are presented in Table 1. Interestingly, the compound with identifier “C998-0189” has the top Vina score compared with the other six compounds listed. The name of the compound is N~2~-(3,5-dimethylphenyl)-N~2~-(5,5-dioxido-3a,4,6,6a-tetrahydrothieno[3,4-d][1,3]thiazol-2-yl)-N~1~-[3-(trifluoromethyl)phenyl]glycinamide, with molecular formula C22H22F3N3O3S2. Its molecular weight is 497.6 g/mol, and it satisfies most drug-likeness parameters, including Lipinski’s filters. The other five recommended compounds also have reasonable Vina scores of around 7.5, with important stabilizing interactions.
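
As a quick consistency check, the reported molecular weight can be recomputed from the molecular formula. The sketch below uses abridged standard atomic weights and a simple parser that handles flat formulas only (no parentheses); the function name is hypothetical, and the 500 g/mol limit is the molecular-weight criterion of Lipinski's rule of five, the only one of the filters that can be checked from the formula alone.

```python
import re

# Abridged standard atomic weights (g/mol) for the elements in the formula.
ATOMIC_WEIGHT = {
    "C": 12.011, "H": 1.008, "N": 14.007,
    "O": 15.999, "F": 18.998, "S": 32.06,
}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a flat formula such as 'C22H22F3N3O3S2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[element] * (int(count) if count else 1)
    return total


mw = molecular_weight("C22H22F3N3O3S2")  # about 497.55 g/mol
passes_mw_filter = mw <= 500.0            # Lipinski molecular-weight criterion
```

The result agrees with the reported 497.6 g/mol to within rounding of the atomic weights and sits just under the 500 g/mol limit.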

Table 1 The selected compounds that may inhibit 2019-nCov_3C-like protease based on the DFCNN score and AutoDock Vina score

The top 100 predictions of our deep learning model against the database are shown in Supplementary Table S2. The top five compounds, with ChemDiv identifiers 8017-4328, 8017-4325, 8002-7777, 8004-0123 and 8010-0095, are listed with their high DFCNN scores. Three other well-known compound libraries were screened in the present work: Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library and Targetmol-Bioactive_Compound_Library. It is worth testing whether any natural compound can combat the virus by inhibiting 2019-nCov_3C-like protease. Table 2 shows the screening result for the Targetmol-Natural_Compound_Library. Compounds with a DFCNN score higher than 0.997 are listed in Table 2; Adenosine, Vidarabine, Mannitol, Dulcitol, d-Sorbitol, d-Mannitol, Allitol and Sodium_gluconate are the top predictions. Natural products are often the active ingredients of known herbal medicines and are relatively safe because of their long history of use. If a natural product is shown experimentally to be effective against the target, patients can easily access it by taking the corresponding herbal medicine. About 8 compounds have a score of 0.999 and about 20 compounds have a score of 0.998; these are presented in Table S2. As indicated above, most of the drugs listed by our model are antiviral drugs, and hence they can be tested against 2019-nCoV and validated in the clinical laboratory within a short time.

Table 2 The potential drug candidates selected from the Targetmol-Natural Compound Library

The screening result for the Targetmol-Approved_Drug_Library is shown in Table 3, which lists the compounds with a DFCNN score higher than 0.997. We randomly selected drugs from the list of potential drugs and performed a systematic literature search. Meglumine, Vidarabine, Adenosine, d-Sorbitol, d-Mannitol, Sodium_gluconate, Ganciclovir and Chlorobutanol are the top predictions according to the DFCNN score (Table 3). Interestingly, we found that most of the drugs in the list, such as Meglumine, Ganciclovir and Vidarabine, show antiviral activity. The list of all compounds with a score above 0.990 is provided in Table S4. The screening result for the Targetmol-Bioactive_Compound_Library is shown in Table 4, which lists the compounds with a DFCNN score higher than 0.997. Bioactive compounds are chemicals found in plants and some foods that have been studied for the prevention of various diseases. It is worth checking whether any of them can act on the target protein. We found that compounds such as Vidarabine, Adenosine, Dulcitol, d-Sorbitol, d-Mannitol, Ganciclovir and 5′-deoxyadenosine are the top predictions in the Targetmol-Bioactive_Compound_Library (Table 4). The list of all compounds with a score above 0.99 is provided in Table S5. The list in Table 4 narrows down the hit compounds for later drug development stages, such as molecular dynamics simulation or even direct experimental validation, in the search for bioactive compounds against 2019-nCov_3C-like protease.
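
Because the three Targetmol libraries overlap, some compounds appear among the top predictions of more than one screen. The short sketch below tallies this using only the compound names quoted in the text for the three libraries; the variable names are hypothetical, and the full lists in Tables 2 to 4 may contain further entries.

```python
from collections import Counter

# Top predictions as named in the text for each library.
top_hits = {
    "natural": ["Adenosine", "Vidarabine", "Mannitol", "Dulcitol", "d-Sorbitol",
                "d-Mannitol", "Allitol", "Sodium_gluconate"],
    "approved": ["Meglumine", "Vidarabine", "Adenosine", "d-Sorbitol", "d-Mannitol",
                 "Sodium_gluconate", "Ganciclovir", "Chlorobutanol"],
    "bioactive": ["Vidarabine", "Adenosine", "Dulcitol", "d-Sorbitol", "d-Mannitol",
                  "Ganciclovir", "5′-deoxyadenosine"],
}

# Count in how many libraries each compound is a top prediction.
library_count = Counter(name for hits in top_hits.values() for name in set(hits))

in_all_three = sorted(name for name, n in library_count.items() if n == 3)
in_exactly_two = sorted(name for name, n in library_count.items() if n == 2)
```

On the names quoted here, Adenosine, Vidarabine, d-Mannitol and d-Sorbitol recur in all three screens, while Dulcitol, Ganciclovir and Sodium_gluconate each recur in two, which reflects shared library content rather than independent confirmation.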

Table 3 The potential drug candidates selected from the Targetmol-Approved Drug library
Table 4 The potential drug candidates selected from the Targetmol-Bioactive Compounds

3.3 Virtual Screening Against Database of Tripeptides

Peptides have the potential to exert higher binding affinity and specificity than small-molecule compounds; meanwhile, small peptides are easier to synthesize than small molecules and antibodies. Since the known ligands of SARS_3C-like protease are compounds similar to tripeptides, and the number of tripeptide combinations of the 20 amino acids is affordable for our method, we decided to perform virtual screening on tripeptides. The screened tripeptides with a DFCNN score of 0.995 or higher (0.997, 0.996 and 0.995) for 2019-nCov_3C-like protease are shown in Table 5. A higher value indicates that the peptide is more likely to bind the pocket of 2019-nCov_3C-like protease. Our method found that peptides formed from the amino acids I, K and P have the highest probability of binding in the pocket. The combinations G, K, L; G, K, K; and K, P, V are also predicted by DFCNN to be favorable binding partners (Table 5). The list of all tripeptides with a score above 0.99 is provided in Table S6. The combination and composition of short peptides play a crucial role in the overall conformation of a protein [28, 29]. Tripeptides, pentapeptides and octapeptides are believed to be promising candidates for drug development against infectious diseases [30, 31]. Since these peptides are relatively easy to produce, many of the top predictions can be validated experimentally in a fast and inexpensive manner.

Table 5 The predicted tripeptides that have a high probability (DFCNN score ≥ 0.99) of binding the pocket of 2019-nCov_3C-like protease

4 Conclusion

Designing small-compound or peptide drugs to cure 2019-nCoV infection is extremely urgent. Effective and safe drugs are required to treat this deadly viral disease, which has caused an epidemic outbreak all over the globe. Researchers use various modern technologies to combat such diseases; deep learning is one of them, offering faster prediction and accuracy greater than ~80%. With its extremely high speed and relatively high accuracy, our DFCNN model for 3C-like protease–ligand interaction analysis is suited to the challenge of screening tens of thousands of drugs in a short time in emergency situations such as the 2019-nCoV outbreak. Our deep learning model based on DFCNN is data-driven: it learns 3C-like protease–ligand interactions from known binder and non-binder data. The model uses the binding pocket of the 3C-like protease–ligand conformation instead of the whole conformation of the complex, which makes it fast and accurate relative to conventional molecular docking procedures.

The identified potential 3C-like protease–ligand pairs can be subjected to MD simulation to further check binding stability and atomic interaction patterns, or even to estimate binding free energy with techniques such as metadynamics, in order to narrow down the candidate list. A variety of repurposed and investigational drugs have been identified in the past. Screening of National Medical Products Administration (NMPA)-approved drug libraries and other chemical libraries has identified novel agents. Hundreds of clinical trials involving remdesivir, chloroquine, favipiravir, convalescent plasma, traditional Chinese medicine (TCM) and other interventions are planned or underway. In this context, we performed deep learning-based drug screening and provided lists of potential compounds and tripeptides for 2019-nCov_3C-like protease. Since the inhibitor candidates provided include on-market drugs, the list can help facilitate drug development against 2019-nCov_3C-like protease and could be used immediately.