Introduction

Ebola virus (EBOV), a member of the Filoviridae family, is also known as Zaire ebolavirus after its country of origin, the Democratic Republic of Congo (formerly Zaire). EBOV has caused thousands of deaths through periodic outbreaks since 1976. According to the World Health Organization (WHO), the fatality rate of EBOV outbreaks varies from 25 to 90% (https://www.who.int/news-room/fact-sheets/detail/ebola-virus-disease). EBOV cases are mainly found in sub-Saharan Africa, and the virus is passed on through animals such as bats and nonhuman primates or through any patient infected with EBOV. As per WHO, the EBOV outbreak is classified as a level 3 emergency owing to its high fatality rate.

EBOV is an enveloped, non-segmented, helical virus with a negative-sense, single-stranded RNA genome of about 19 kb. It encodes eight structural proteins and one nonstructural protein. The structural proteins include the nucleoprotein (NP), glycoprotein (GP), soluble glycoprotein (sGP), RNA-dependent RNA polymerase (L) and four virion proteins (VP24, VP30, VP35, VP40) [1]. Because EBOV is an RNA virus, the development of effective antivirals against it is very challenging. Currently, Favipiravir, Remdesivir, ZMapp and INMAZEB are the four most commonly used anti-Ebola agents for the treatment of EBOV infection. Among them, Favipiravir and Remdesivir are 'experimental' category drugs that inhibit the viral polymerase, while ZMapp is a mixture of three monoclonal antibodies directed against the surface glycoprotein [2, 3]. INMAZEB, also known as REGN-EB3, is a mixture of three monoclonal antibodies, namely atoltivimab, maftivimab and odesivimab. In 2020, it became the first USFDA-approved therapeutic against EBOV infection. Favipiravir (6-fluoro-3-hydroxy-2-pyrazinecarboxamide) and Remdesivir (GS-5734) are in use as broad-spectrum antiviral drugs. Favipiravir was initially used to treat influenza virus infection but has since been used against EBOV [4]. Likewise, the anti-Ebola drug Remdesivir was repurposed to inhibit murine hepatitis virus (MHV), Middle East respiratory syndrome coronavirus (MERS-CoV), severe acute respiratory syndrome coronavirus (SARS-CoV) and Nipah virus (NiV) [5].

Numerous computational studies in the literature highlight the use of machine learning in drug development against various pathogens. Todeschini R et al. described the importance of molecular descriptors in the design of efficient drugs [6, 7]. Hansch C et al. explained the importance of physicochemical parameters in quantitative structure–activity relationship (QSAR) analysis [8]. Matta CF explored the role of biophysical and biological properties in the formulation of QSAR models [9]. Toussi CA et al. designed Ser/Thr-protein kinase inhibitors using machine-trained elastic networks [10]. Our group has previously used machine learning approaches to develop computational methods that predict antiviral compounds against various viruses, including flaviviruses, Nipah virus and coronaviruses, namely AVCpred [11], anti-Flavi [12], anti-Nipah [13] and anti-corona [14]. Recently, we developed a comprehensive repository of experimentally validated repurposed drugs against 23 viruses (including Ebola virus) responsible for causing epidemics/pandemics [15].

Furthermore, various computational approaches have been tried to identify repurposed or novel leads against EBOV. Anantpadma M et al. developed Bayesian machine learning models and identified three active molecules against EBOV, namely tilorone, pyronaridine and quinacrine [16]. Kwofie SK et al. used a pharmacoinformatics and molecular docking approach to prioritize 19 compounds against EBOV after screening 7675 natural products [17]. Zhao Z et al. used a molecular dynamics approach to screen all FDA-approved drugs and finalized 15 potent drug candidates against EBOV [18]. Ekins et al. integrated Bayesian machine learning models to filter out potential lead compounds against EBOV [19]. However, most drug repurposing studies relied on in vitro and in vivo assays, e.g., the minigenome assay [20], the GIP/HIV core pseudovirus with a firefly luciferase reporter gene [21], HIV pseudovirions with a high-throughput assay [22] and many more. Moreover, no dedicated web server for identifying promising drug candidates is available in the literature. In the current study, we developed a machine-learning-based pipeline named 'anti-Ebola' for the identification of inhibitors against Ebola virus.

Methods

Data collection

The anti-Ebola predictor was developed using the data on EBOV inhibitors available in our recently published 'DrugRepV' database [15]. This database reports 868 compounds that were experimentally validated for anti-Ebola activity. To develop regression-based models, we selected only those molecules whose antiviral activity is given in terms of IC50/EC50. Further, we applied strict quality control filters, such as uniqueness of the IC50/EC50 value, SMILES and assay, to finalize our dataset. Finally, we obtained 305 unique inhibitors with their respective half-maximal inhibitory concentration (IC50/EC50) values from our database [15]. The IC50/EC50 values were converted into the negative logarithm of the half-maximal inhibitory concentration (pIC50) using the formula:

$$ pIC_{50} = - \log_{10} \left( {IC_{50} \left( M \right)} \right) $$
(1)

where IC50 is expressed as a molar concentration (M), so that the argument of the logarithm is treated as a dimensionless number. A higher pIC50 indicates exponentially greater potency. The pIC50 has been used in the design of various regression-based prediction algorithms [12, 13, 23]. The overall methodology of anti-Ebola is shown in Fig. 1.
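As an illustration of Eq. (1), the following minimal sketch converts an IC50/EC50 value to pIC50 and back. The function names and the unit table are assumptions introduced here for illustration and are not taken from the anti-Ebola pipeline.

```python
import math

# Conversion factors to mol/L; the supported unit labels are an assumption.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50_value, unit="uM"):
    """Return pIC50 = -log10(IC50 expressed in mol/L), as in Eq. (1)."""
    if ic50_value <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_value * UNIT_TO_MOLAR[unit])


def ic50_um_from_pic50(pic50):
    """Invert Eq. (1) and report the concentration in micromolar."""
    return 10 ** (-pic50) * 1e6


if __name__ == "__main__":
    # A tenfold lower IC50 raises the pIC50 by exactly one unit.
    for ic50_um in (10.0, 1.0, 0.1):
        print(ic50_um, "uM ->", round(pic50_from_ic50(ic50_um), 3))
```

Because the scale is logarithmic, each unit increase in pIC50 corresponds to a tenfold decrease in IC50.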

Fig. 1 Overall methodology used to develop the anti-Ebola predictor

Data preparation

The chemical name of each compound was used to retrieve its simplified molecular-input line-entry system (SMILES) string, which was then converted to a 3D-SDF structure using the Open Babel (obabel) software [24]. Finally, the 3D-SDF structures were used to calculate the molecular descriptors and fingerprints.

To run the machine learning algorithms, the overall dataset (305 compounds) was divided into training/testing (T274) and independent validation (V31) datasets by randomization, which was repeated to give six sets [13, 25, 26].
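The random partitioning can be sketched as follows. This is only an illustrative sketch: the seeds, the integer compound identifiers and the function name `split_dataset` are assumptions, and the exact randomization procedure used in the study is not specified in the text.

```python
import random


def split_dataset(compound_ids, n_validation=31, seed=0):
    """Shuffle the compounds and hold out n_validation of them.

    Returns (training_testing_ids, validation_ids).
    """
    rng = random.Random(seed)  # fixed seed so each split is reproducible
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical identifiers standing in for the 305 inhibitors.
compound_ids = list(range(305))

# Six independent randomizations, each giving a T274/V31 pair.
splits = [split_dataset(compound_ids, 31, seed) for seed in range(6)]

if __name__ == "__main__":
    for number, (train, validation) in enumerate(splits, start=1):
        print(f"set {number}: T{len(train)} / V{len(validation)}")
```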

PaDEL descriptor

The 3D-SDF structures were used to calculate 1D, 2D and 3D molecular descriptors as well as fingerprints. The PaDEL software was used to calculate all 17,968 descriptors available in the software [27]. Further, to retain only relevant features and to reduce the risk of overfitting, we performed feature selection.

Feature selection

Feature selection is an important step that extracts the most relevant features, removes irrelevant ones and helps to achieve high accuracy of the developed models [28, 29]. Feature selection was performed using support vector regression (SVR) as implemented in libsvm, with a parameter that controls the number of support vectors. Finally, we extracted the 50 most relevant features out of the 17,968 descriptors (Supplementary Table S2).

Tenfold cross-validation

Tenfold cross-validation was used to develop the predictive models. In tenfold cross-validation, the training/testing dataset (T274) was divided into ten sets of approximately equal size. In each round, nine sets were combined for training and the remaining set was used for testing. Every set served once as the testing set, and the average performance over all ten testing sets represents the overall performance of the model. Further, the performance of the developed model was evaluated on the independent dataset, which was not used during training and testing.
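The procedure described above can be sketched in a few lines. This is a generic illustration, not the study's implementation: `fit` and `score` are hypothetical callables supplied by the user, and the shuffling seed is an assumption.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding spreads the remainder so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, fit, score, k=10, seed=0):
    """Average the test score over all k folds.

    fit(train_indices) returns a model; score(model, test_indices)
    returns a number. Both are hypothetical user-supplied callables.
    """
    scores = []
    for train, test in kfold_indices(n_samples, k, seed):
        model = fit(train)
        scores.append(score(model, test))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sizes = [len(test) for _, test in kfold_indices(274)]
    print(sizes)  # fold sizes for the T274 dataset
```

With 274 compounds the folds cannot be exactly equal, so the sketch produces folds of 27 or 28 compounds.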

Machine learning techniques

In the current study, we implemented three machine learning techniques (MLTs), i.e., support vector machine (SVM), random forest (RF) and artificial neural network (ANN), to develop predictive models.

SVM is a supervised machine learning method that is used for both regression and classification problems. It constructs a hyperplane or a set of hyperplanes that can be used for the regression/classification task and is very effective in high-dimensional spaces [30]. Different kernel functions can be used as the decision function. The main objective of the SVM is to find the hyperplane in N-dimensional space (N is the number of features) that best separates the data points in classification or best fits them in regression.

RF is an ensemble machine learning technique that has been used extensively for both classification and regression problems. It builds multiple decision trees from the training dataset and, for regression, outputs the mean prediction of the individual trees [31].

ANN is an organization of connected units/nodes, generally known as artificial neurons, which are analogous to the neurons in the human brain. A neural network consists of an input layer, hidden layers and an output layer, which together transform the input into a reasonable output [32].

Performance measure

The performance of the developed model was analyzed through Pearson's correlation coefficient (PCC), mean absolute error (MAE) and root mean square error (RMSE).

$$PCC=\frac{n\sum_{i=1}^{n}E_{i}^{act}E_{i}^{pred}-\sum_{i=1}^{n}E_{i}^{act}\sum_{i=1}^{n}E_{i}^{pred}}{\sqrt{n\sum_{i=1}^{n}\left(E_{i}^{act}\right)^{2}-\left(\sum_{i=1}^{n}E_{i}^{act}\right)^{2}}\,\sqrt{n\sum_{i=1}^{n}\left(E_{i}^{pred}\right)^{2}-\left(\sum_{i=1}^{n}E_{i}^{pred}\right)^{2}}}$$
(2)
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|E_{i}^{pred}-E_{i}^{act}\right|$$
(3)
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(E_{i}^{pred}-E_{i}^{act}\right)^{2}}$$
(4)

In Eqs. (2), (3) and (4), n is the size of the test set, and \({E}_{i}^{pred}\) and \({E}_{i}^{act}\) are the predicted and actual efficiencies of Ebola inhibition for the ith compound, respectively.
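The three measures can be computed directly from their standard definitions, as in the minimal sketch below. The function names and the example pIC50 values are illustrative assumptions, not results from this study.

```python
import math


def pcc(actual, predicted):
    """Pearson's correlation coefficient in its computational form."""
    n = len(actual)
    sum_a, sum_p = sum(actual), sum(predicted)
    sum_ap = sum(a * p for a, p in zip(actual, predicted))
    sum_a2 = sum(a * a for a in actual)
    sum_p2 = sum(p * p for p in predicted)
    numerator = n * sum_ap - sum_a * sum_p
    # The two square roots are multiplied in the denominator.
    denominator = math.sqrt(n * sum_a2 - sum_a ** 2) * math.sqrt(n * sum_p2 - sum_p ** 2)
    return numerator / denominator


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


if __name__ == "__main__":
    # Hypothetical pIC50 values, for illustration only.
    actual = [5.0, 6.0, 7.0, 8.0]
    predicted = [5.5, 5.5, 7.5, 8.5]
    print(pcc(actual, predicted), mae(actual, predicted), rmse(actual, predicted))
```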

Applicability domain

The robustness of the developed model was evaluated using William's plot, which depicts the relationship between standardized residuals and leverage. The warning threshold for the leverage (h) was set to h* = 3p/n, where p is the number of finally used descriptors plus one and n is the size of the training dataset. The threshold for the standardized residuals was ± 3σ [33]. The predictive model was considered robust if most data points lay within these warning thresholds [13].
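The two thresholds can be expressed as a short sketch. The numbers in the example are hypothetical, the leverage values are taken as given (computing them requires the hat matrix of the descriptor table, which is not shown), and standardizing by the sample standard deviation of the residuals is an assumption about the convention used.

```python
import math


def warning_leverage(n_descriptors, n_training):
    """Warning leverage h* = 3p/n with p = number of descriptors + 1."""
    p = n_descriptors + 1
    return 3 * p / n_training


def standardized_residuals(actual, predicted):
    """Residuals divided by their sample standard deviation (assumed convention)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [r / sd for r in residuals]


def inside_domain(leverage, std_residual, h_star, residual_limit=3.0):
    """True if a compound lies within both warning thresholds."""
    return leverage <= h_star and abs(std_residual) <= residual_limit


if __name__ == "__main__":
    # Hypothetical model with 3 descriptors trained on 20 compounds.
    h_star = warning_leverage(3, 20)
    print(h_star, inside_domain(0.2, 1.5, h_star))
```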

Chemical analysis

We analyzed the anti-Ebola compounds to check their chemical diversity. The diversity was assessed by multidimensional scaling (MDS) with a similarity cutoff of 0.4. The cluster map was constructed with the ChemmineR software [34]. Further, the chemical dendrogram was built from chemical fingerprints using the Scaffold Hunter software [35].

Web server

The best performing predictive models are implemented in the form of the web server 'anti-Ebola.' The front end of the web server is designed using HTML, CSS and PHP, while the back end is built using Python, Perl and JavaScript.

Results

Performance of QSAR models

Among the six randomized training/testing (T274) datasets, the best QSAR models displayed a PCC of 0.83, 0.98 and 0.95 for the SVM, RF and ANN machine learning techniques, respectively, on the best performing dataset (Table 1). The models were then evaluated on the independent validation (V31) dataset, where they showed PCC values of 0.65, 0.62 and 0.64 for SVM, RF and ANN, respectively (Table 1). The performance on the remaining five training/testing and independent validation datasets is provided in Supplementary Table S1.

Table 1 Performance on the training/testing (T274) and independent validation (V31) datasets for the support vector machine, random forest and artificial neural network

Applicability domain

In William's plot, which shows the relationship between standardized residuals and leverage (Fig. 2), most of the data points of both the training/testing and validation data lie within the warning thresholds, showing that the developed model is robust. We found h* to be 1.21, 1.25 and 1.18 and 3σ to be 2.0, 1.9 and 1.0 for SVM (Fig. 2a), RF (Fig. 2b) and ANN (Fig. 2c), respectively. Both h* and 3σ were plotted as warning thresholds in William's plot.

Fig. 2 Applicability domain of the anti-Ebola compounds presented by William's plot. a random forest, b support vector machine, c artificial neural network

Chemical analysis

We analyzed the anti-Ebola chemicals to explore their chemical variability. For this, we used multidimensional scaling (MDS), whose distance matrix was calculated by an 'all-against-all' comparison of the compounds through atom pair similarity measures (Fig. 3a). The generated similarity scores were converted into distance values and scaled through the cmdscale method. The cluster map shows diversity of up to 320 clusters at a similarity cutoff of 0.4. Further, a chemical dendrogram was constructed to examine the chemical scaffolds in detail using the EstateNumericalFingerprint (largest fragment, deglycosylated) physicochemical properties. It showed that the largest group of molecules, i.e., 55, falls under the parent chemical with a benzene ring (Fig. 3b). Furthermore, 32 molecules have a pyridine parent. Information on the remaining anti-Ebola molecules is provided in Fig. 3b.
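The 'all-against-all' comparison can be illustrated with a small sketch. It assumes that the atom pair similarity is a Tanimoto coefficient and that the distance is taken as one minus the similarity; the descriptor sets below are hypothetical placeholders rather than real atom pair descriptors, and the sketch does not reproduce the ChemmineR implementation.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient: shared features divided by all distinct features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 1.0  # two empty descriptor sets are treated as identical
    return len(a & b) / len(union)


def distance_matrix(descriptor_sets):
    """All-against-all distances, with distance = 1 - similarity."""
    n = len(descriptor_sets)
    return [
        [1.0 - tanimoto(descriptor_sets[i], descriptor_sets[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical descriptor sets for three compounds.
    compounds = [{"C-C", "C-N", "C-O"}, {"C-C", "C-N"}, {"C-S", "S-O"}]
    for row in distance_matrix(compounds):
        print([round(value, 2) for value in row])
```

A matrix of this form is what a classical MDS routine takes as input to place the compounds in two dimensions.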

Fig. 3 Chemical analysis of anti-Ebola compounds. a Scatter plot showing the diversity of the 305 anti-Ebola compounds, b chemical dendrogram of the anti-Ebola compounds showing the chemical side chain similarity among them

Web server

The web server 'anti-Ebola' is freely available at: https://bioinfo.imtech.res.in/manojk/antiebola. It contains the predictor, which accepts the input query as an SDF file and displays the output in tabular form with the SMILES, the predicted IC50 in μM and the structure of each compound. To make the web server more informative, we also report important drug-likeness properties of the input query, calculated with the filter-it software. These properties are Lipinski acceptors, Lipinski donors, H-bond acceptors, H-bond donors, molecular weight, logP, rotatable and rigid bonds, formal charges and molecular formula.

The H-bond acceptor count is the number of hydrogen bond acceptors. It includes an aromatic N that has no connected H atoms, is not an amide nitrogen and carries no positive charge; an aliphatic N with no connected H atoms and no positive charge; any O atom without a positive charge; and a thionyl sulfur atom. The H-bond donor count is the number of hydrogen bond donors and includes any H bonded to an N, an O or an S. The Lipinski acceptor count refers to Lipinski H-bond acceptors, i.e., any N or O atom, whether or not it is connected to an H atom. The Lipinski donor count denotes Lipinski H-bond donors, i.e., each H atom connected to N or O. Lipinski's rule of five is a rule of thumb for assessing the drug-likeness of a compound. It indicates whether the compound has certain biological, chemical and pharmacological properties that make it appropriate for human consumption.
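For readers unfamiliar with Lipinski's rule of five, the sketch below applies its commonly cited thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 Lipinski donors and at most 10 Lipinski acceptors, with one violation tolerated). It is not the filter-it implementation; the function names and the example property values are hypothetical, and the properties themselves are assumed to have been calculated beforehand.

```python
def lipinski_violations(mol_weight, logp, lipinski_donors, lipinski_acceptors):
    """Return the names of the rule-of-five criteria that a compound violates."""
    checks = {
        "molecular weight > 500": mol_weight > 500,
        "logP > 5": logp > 5,
        "H-bond donors > 5": lipinski_donors > 5,
        "H-bond acceptors > 10": lipinski_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_rule_of_five(mol_weight, logp, lipinski_donors, lipinski_acceptors,
                        max_violations=1):
    """Commonly used reading of the rule: at most one violation is tolerated."""
    violations = lipinski_violations(mol_weight, logp, lipinski_donors, lipinski_acceptors)
    return len(violations) <= max_violations


if __name__ == "__main__":
    # Hypothetical property values for two query compounds.
    print(passes_rule_of_five(350.4, 2.5, 2, 5))
    print(lipinski_violations(650.0, 6.2, 3, 8))
```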

Case study

We checked the utility of our web server by predicting the IC50/EC50 values of promising hits already identified in other studies. We used the anti-Ebola SVM predictive model to predict the anti-EBOV activity of these lead molecules. For example, Zhao et al. identified Indinavir, Maraviroc, Abacavir, etc. as good anti-EBOV compounds [18]. Interestingly, our predictive model also predicts high inhibition efficacy for Indinavir (IC50 0.03 μM), Maraviroc (IC50 0.30 μM) and Abacavir (IC50 1.27 μM). Likewise, Anantpadma M et al. identified three effective anti-EBOV drugs, namely Tilorone, Pyronaridine and Quinacrine, with Kd values of 0.73 μM, 7.34 μM and 7.55 μM, respectively [16]. Our 'anti-Ebola' web server also predicts potential inhibition efficacy for these three lead molecules: Tilorone (IC50 1.95 μM), Pyronaridine (IC50 0.50 μM) and Quinacrine (IC50 0.002 μM). Thus, these findings further support the utility of our prediction algorithm.

Discussion

Ebola is a dreadful pathogen that has caused epidemics with a high mortality rate [36]. There is a need to develop effective anti-Ebola agents, and computational approaches can accelerate research in this field [16]. Therefore, in the current study, we provide machine learning-based prediction models to identify novel and effective anti-Ebola compounds. In addition, we analyzed the chemical diversity of the available Ebola inhibitors.

We implemented three MLTs, namely SVM, RF and ANN, to develop effective predictive models. These techniques work on different principles: SVM is a kernel-based algorithm that can capture nonlinear relationships, RF works with an ensemble of decision trees, and ANN is a neural network-based algorithm. Various researchers have used these techniques in numerous studies [37,38,39,40]. Likewise, we have used them to develop predictive algorithms such as QSPpred [25], VIRsiRNApred [41], AVP-IC50Pred [42], anti-flavi [12] and many more. To develop high-quality predictive models, we extracted the most relevant features out of the 17,968 (1D, 2D, 3D and fingerprint) features calculated for the available anti-Ebola compounds. Across the three MLTs, the PCC on the training/testing dataset ranged from 0.83 to 0.98. Further, we checked the robustness of the developed models by constructing William's plot (applicability domain). We then implemented the developed models in the form of a web server named 'anti-Ebola' (https://bioinfo.imtech.res.in/manojk/antiebola/), which makes them easily accessible to users. We also analyzed the chemical diversity of the available EBOV inhibitors and found it to be high. The largest group (55 molecules) consists of derivatives of the benzene parent compound, followed by 32 molecules that are derivatives of the pyridine heterocyclic ring. Our approach is based on applying MLTs to the available experimentally validated anti-Ebola molecules. Thus, our study should be useful for the identification of new and promising anti-Ebola agents. Researchers can also use our web server to identify promising repurposed drug candidates.

A few researchers have performed computational studies to identify repurposed drugs against EBOV. These studies used Bayesian machine learning models, molecular simulations, molecular docking, etc. [16, 17, 19], with different input datasets such as natural products, FDA-approved drugs and small active molecules from repositories. Our study differs from these approaches in that we incorporated three different MLTs for the prediction of anti-EBOV agents. To develop the predictive models, we used experimentally validated anti-EBOV compounds that are chemically diverse. Furthermore, our predictive models are provided as a web server, which is not available for any of the previously published computational approaches for EBOV.

The frequent outbreaks of EBOV with a high fatality rate are a serious concern worldwide. As EBOV is a dangerous infectious pathogen that falls under the Biosafety Level-4 (BSL-4) category, working with it requires a highly specialized laboratory. Therefore, designing an anti-Ebola agent is a challenging task, and computational approaches can be of great help in speeding up the identification of effective EBOV inhibitors. In this endeavor, we have developed the machine learning-based QSAR regression model 'anti-Ebola.' We will update the web server on a yearly basis or whenever a significant amount of new data becomes available. Thus, the 'anti-Ebola' web server should help researchers to predict Ebola inhibitors and support antiviral therapeutic development.