1 Introduction

Coronavirus disease 2019 (COVID-2019) has caused alarm across the whole world [1, 2]. It has been reported to be the most harmful and dangerous outbreak since the 1918 H1N1 influenza pandemic. According to the World Health Organization, by April 10, 2020, a total of 15,225,252 cases and 100,075 deaths had been reported. COVID-2019 has thus been spreading very fast since December 2019 and has so far reached 172 countries. At present, the highest number of cases has been found in the United States of America (USA). COVID-2019 is a highly contagious disease that passes very rapidly from one person to another. It is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).

Coronaviruses responsible for serious outbreaks were first identified in 2002 in China and in 2012 in Saudi Arabia. Coronaviruses are a family of viruses responsible for illnesses ranging from cold, cough and other respiratory diseases to life-threatening diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). COVID-2019 was first identified in the city of Wuhan, China, in mid-December 2019 [2]. According to some media reports, however, COVID-2019 was already present in China in mid-November 2019. If these reports are accurate, correct information about the virus was not shared with the other countries of the world in time, which is a serious matter.

Jiang and colleagues found that the fatality rate for COVID-2019 is around 7.5% [3]. They also found that the fatality rate is 8.0% for persons in the age group of 70–79 and around 14.8% for those above the age of 80 years [3]. Their study considers individuals above the age of 50 years with underlying illnesses such as diabetes, Parkinson’s disease, and cardiovascular disease to be at the highest risk.

A person infected with COVID-2019 starts showing symptoms within 2–14 days. The patient may suffer from fever, cough, breathlessness, pneumonia, kidney failure, etc. [1]. The coronavirus spreads very rapidly from one human to another through respiratory droplets. The virus does not survive long in the air and therefore does not spread through the air itself [3].

Machine learning is a subset of Artificial Intelligence that provides automated methods for analyzing various types of data, and regression analysis is one part of it. Today, most data analysts and scientists use machine learning for data analysis in different domains. In this study, we propose regression models based on an exponential function and on quadratic, third, fourth, fifth, and sixth degree polynomials to predict COVID-2019 cases over the next 7 days. These models can help doctors and the Indian government plan for the next 7 days.

In recent times, machine learning algorithms have proved to be efficient in predicting healthcare data [4,5,6,7]. Nsoesie and colleagues have provided a systematic way to predict influenza pandemic dynamics [8]. They also surveyed the prediction approaches used in the literature, such as regression models, mass action based deterministic models, prediction rules, Bayesian networks, the SEIR model, and ARIMA forecasting. SEIR stands for susceptible (S), exposed (E), infectious (I), and recovered (R), and ARIMA stands for Auto-Regressive Integrated Moving Average. Existing studies on COVID-2019 have carried out only exploratory analysis of limited data [8,9,10,11]. No country has yet developed a medicine that reduces the effects of COVID-2019 or cures the disease completely [12]. An important part of managing this epidemic is therefore to reduce its peak, which is also called flattening the epidemic curve. Data mining researchers and data scientists have an important role in explaining the characteristics of COVID-2019 and in collecting technology and related data about the virus [13,14,15,16]. Studies of this type can support the right decisions about the epidemic and concrete plans of action. In the longer term, such studies can help strengthen infrastructure, wellbeing, vaccine development, and the response to similar epidemics, and show how such diseases can be overcome in the future.

The objectives of this study are given below:

  1. To estimate the rate of spread of the disease over the next 7 days with the help of regression analysis models.

  2. To develop machine learning-based regression analysis models for COVID-2019 cases.

  3. To forecast COVID-2019 in India over the next 7 days for better management by doctors and various government organizations.

2 Machine learning

Machine learning is an automated method for data analysis in various domains such as medical engineering, finance, business, education, and other related sectors. It is a branch of Artificial Intelligence in which machines learn from training datasets. Through machine learning, we can identify patterns, analyze data, and make decisions with little or no human intervention. Machine learning is broadly categorized into three types, which are given below:

  1. Supervised learning.

  2. Unsupervised learning.

  3. Reinforcement learning.

Supervised learning means learning with a teacher; in other words, the machine or model learns from a labelled training dataset. In supervised learning, class-label information is available in the training datasets.

Unsupervised learning, by contrast, means learning without a teacher; in other words, the learning algorithm discovers structure in the data by itself, typically with the help of a partitioning or clustering algorithm. Many clustering algorithms are available in the literature, such as K-Means, Fuzzy C-Means, hierarchical clustering methods, and so on. In reinforcement learning, an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties.

3 Regression analysis

Regression analysis is a subset of machine learning algorithms [17, 18] and one of the earliest of them. It has been described as follows: “Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (Y) based on the value of one or multiple predictor variables (X). It assumes a linear relationship between the outcome and the predictor variables”. The equation of a straight line connecting any two variables X and Y can be stated algebraically as:

$$ Y = aX + b $$
(1)

where b is the intercept on the y-axis and a is the slope of the line. Here a and b are also called the parameters of the regression model, and they are learned from the data by a suitable fitting method.
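As an illustration of how the parameters a and b of Eq. (1) can be learned, the following minimal sketch uses the ordinary least-squares solution. Least squares is an assumed choice of fitting method here, and the function name `fit_line` and the demonstration points are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Return slope a and intercept b of the least-squares line Y = aX + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the least-squares line passes through the point of means.
    b = y_mean - a * x_mean
    return a, b

# Made-up points lying exactly on Y = 2X + 1 (not COVID-2019 data).
x_demo = [0, 1, 2, 3, 4]
y_demo = [1, 3, 5, 7, 9]
a_hat, b_hat = fit_line(x_demo, y_demo)
```

Because the demonstration points lie exactly on a line, the recovered slope and intercept coincide with the values used to generate them.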

In this study, we propose six regression models, namely the exponential model and the quadratic, third degree, fourth degree, fifth degree and sixth degree polynomials. These models are described below:

$$ Y = ae^{bX} $$
(2)
$$ Y = aX^{2} + bX + c $$
(3)
$$ Y = aX^{3} + bX^{2} + cX + d $$
(4)
$$ Y = aX^{4} + bX^{3} + cX^{2} + dX + e $$
(5)
$$ Y = aX^{5} + bX^{4} + cX^{3} + dX^{2} + eX + f $$
(6)
$$ Y = aX^{6} + bX^{5} + cX^{4} + dX^{3} + eX^{2} + fX + g $$
(7)

where \( a, b, c, d, e, f \) and \( g \) are the parameters of the regression models.
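The model families above can be fitted with a few lines of code. The sketch below is one possible way to do so, not the procedure used in this study: polynomials are fitted with `numpy.polyfit`, and the exponential model is fitted by a straight-line fit to log(Y), which is an assumed simplification that requires Y > 0 and is not identical to a nonlinear least-squares fit. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares polynomial coefficients, highest power first (a, b, c, ...)."""
    return np.polyfit(x, y, degree)

def fit_exponential(x, y):
    """Fit Y = a * exp(b * X) via a linear fit to log(Y); needs all Y > 0."""
    b, log_a = np.polyfit(x, np.log(y), 1)
    return np.exp(log_a), b

# Synthetic, noise-free data generated from known parameters (not COVID-2019 data).
x_demo = np.arange(1.0, 11.0)
y_quad = 3.0 * x_demo**2 - 2.0 * x_demo + 5.0
y_exp = 4.0 * np.exp(0.3 * x_demo)

quad_coeffs = fit_polynomial(x_demo, y_quad, 2)   # estimates of a, b, c
a_exp, b_exp = fit_exponential(x_demo, y_exp)     # estimates of a, b
```

On noise-free data both fits recover the generating parameters; on real case counts the two exponential fitting strategies can give noticeably different parameters.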

3.1 Correlation coefficients

The correlation coefficient measures the strength of the linear relationship between two variables. Karl Pearson developed the formula now known as the Pearson correlation coefficient. The correlation coefficient between two random variables X and Y, usually denoted by \( r\left( {X,Y} \right) \), is a numerical measure of the linear relationship between them and is defined as:

$$ r\left( {X,Y} \right) = \frac{{Cov\left( {X,Y} \right)}}{{\sigma_{X} \sigma_{Y} }} $$
(8)

where \( Cov\left( {X,Y} \right) \), \( \sigma_{X} \) and \( \sigma_{Y} \) are defined by the following formulae:

$$ Cov\left( {X,Y} \right) = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {x_{i} - \bar{x}} \right)\left( {y_{i} - \bar{y}} \right) $$
(9)
$$ \sigma_{X} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {x_{i} - \bar{x}} \right)^{2} } $$
(10)
$$ \sigma_{Y} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {y_{i} - \bar{y}} \right)^{2} } $$
(11)

Here \( \left( {x_{i} ,y_{i} } \right), i = 1,2,3,4 \ldots ,N, \) is the set of paired input and output observations, and \( \bar{x} \) and \( \bar{y} \) are their means. The value of the correlation coefficient is interpreted as follows:

  1. If the value of the correlation coefficient is zero, there is no linear correlation between the input variable X and the output variable Y.

  2. If the value of the correlation coefficient is equal to positive one, there is a perfect positive linear relationship between the input variable and the output variable. In other words, when the input variable increases, the output variable also increases.

  3. If the value of the correlation coefficient is negative, the output variable decreases when the input variable increases, and vice versa.

Two variables that have a small or no linear correlation might still have a strong nonlinear relationship. However, calculating the linear correlation before fitting a model is a useful way to identify variables that have a simple relationship. In this study, we first calculated the correlation coefficient between the date and the number of confirmed cases of COVID-2019 in India from 1st March 2020 to 11 April 2020 [19, 20]. The resulting correlation coefficient matrix is as follows:

$$ r = \left[ {\begin{array}{*{20}c} {1.0000} & {0.8157} \\ {0.8157} & {1.0000} \\ \end{array} } \right] $$

In the matrix, the diagonal elements represent the correlation of the input variable (date) and the output variable (confirmed cases) with themselves and are therefore equal to 1. The off-diagonal elements, 0.8157, are close to 1, indicating a strong positive linear correlation between the date and the number of confirmed cases of COVID-2019 in India.
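The correlation coefficient of Eq. (8) and the 2 × 2 matrix form shown above can be reproduced with the following sketch. It takes \( \sigma_{X} \) and \( \sigma_{Y} \) to be standard deviations, that is, square roots of the mean squared deviations. The function name `pearson_r` and the synthetic exponential series are hypothetical and stand in for the real dataset, so the resulting value is not expected to equal 0.8157.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # covariance, Eq. (9)
    sigma_x = np.sqrt(np.mean((x - x.mean()) ** 2))     # standard deviation of X
    sigma_y = np.sqrt(np.mean((y - y.mean()) ** 2))     # standard deviation of Y
    return cov_xy / (sigma_x * sigma_y)                 # Eq. (8)

# Synthetic exponentially growing series (illustrative only, not the real data).
days = np.arange(1, 21)
cases = 5.0 * np.exp(0.2 * days)

r_value = pearson_r(days, cases)
# np.corrcoef returns the full matrix: ones on the diagonal, r off the diagonal.
r_matrix = np.corrcoef(days, cases)
```

For an exponentially growing series the coefficient is positive but below 1, which reflects a strong relationship that is not purely linear.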

3.2 Residuals and goodness of fit

The difference between the observed value of the response variable (Y) and the value predicted by the proposed model is known as the residual. Residuals measure how well a proposed model fits the data and are calculated with the following formula:

$$ \text{Residual} = Y_{\text{Observed}} - Y_{\text{Model}} $$
(12)

Another measure of goodness of fit is \( R^{2} \), defined by the following formula:

$$ R^{2} = 1 - \frac{{SS_{Residual} }}{{SS_{Total} }} $$
(13)

where \( SS_{Residual} \) represents the sum of the squared residuals from the regression analysis and \( SS_{Total} \) represents the sum of the squared differences of the dependent variable from its mean. Both sums are non-negative, so \( R^{2} \) cannot exceed 1, and values closer to 1 indicate a better fit.
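A minimal sketch of Eqs. (12) and (13) is given below. The function name `r_squared` and the four observed and fitted values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def r_squared(y_observed, y_model):
    """Coefficient of determination computed from residuals, Eqs. (12) and (13)."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    residuals = y_observed - y_model                        # Eq. (12)
    ss_residual = np.sum(residuals ** 2)
    ss_total = np.sum((y_observed - y_observed.mean()) ** 2)
    return 1.0 - ss_residual / ss_total                     # Eq. (13)

# Made-up observed and fitted values (not COVID-2019 data).
y_obs_demo = np.array([2.0, 4.0, 6.0, 8.0])
y_fit_demo = np.array([2.5, 3.5, 6.5, 7.5])
r2_demo = r_squared(y_obs_demo, y_fit_demo)
```

Here the squared residuals sum to 1 and the total sum of squares is 20, so \( R^{2} = 1 - 1/20 = 0.95 \).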

3.3 Adjusted \( R^{2} \) for polynomial regression analysis

In this study, polynomials of several degrees are used to analyze the COVID-2019 data. The residuals of a model can always be reduced by fitting a polynomial of higher degree, so \( R^{2} \) alone tends to favor the more complex model. The adjusted \( R^{2} \) compensates for this by penalizing the number of terms, and for polynomial regression it is defined by the following formula:

$$ R^{2}_{Adjusted} = 1 - \left( {\frac{{SS_{Residual} }}{{SS_{Total} }}} \right)*\left( {\frac{{\left( {n - 1} \right)}}{{\left( {n - d - 1} \right)}}} \right) $$
(14)

where n is the number of observations in the COVID-2019 training dataset and \( d \) is the degree of the polynomial in the proposed regression model. In this study, we compute both the simple and the adjusted \( R^{2} \) to evaluate whether the extra polynomial terms improve the predictive power of the proposed methods.
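The penalty in Eq. (14) can be seen in a short sketch. The function name `adjusted_r_squared` and the sums of squares are hypothetical, and n = 42 assumes one observation per day from 1 March 2020 to 11 April 2020.

```python
def adjusted_r_squared(ss_residual, ss_total, n, d):
    """Adjusted R^2 of Eq. (14) for n observations and polynomial degree d."""
    if n - d - 1 <= 0:
        raise ValueError("need more observations than coefficients (n > d + 1)")
    return 1.0 - (ss_residual / ss_total) * (n - 1) / (n - d - 1)

# Illustrative sums of squares only; they are not taken from the study.
ss_residual_demo, ss_total_demo = 1.0, 20.0
plain_r2 = 1.0 - ss_residual_demo / ss_total_demo

# Assumed n = 42 daily observations, compared for degrees 2 and 6.
adjusted_deg2 = adjusted_r_squared(ss_residual_demo, ss_total_demo, n=42, d=2)
adjusted_deg6 = adjusted_r_squared(ss_residual_demo, ss_total_demo, n=42, d=6)
```

For the same residual sum of squares, the higher-degree model receives the lower adjusted \( R^{2} \), so an extra term is only rewarded when it reduces the residuals enough to offset the penalty.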

4 Experimental results and discussion

In this study, we use the COVID-2019 outbreak dataset for India. The first case of COVID-2019 in India was found in the state of Kerala in January 2020. Three people in Kerala were infected at that time, all of whom had come from the city of Wuhan in China. Things escalated in March, when several cases were reported all over the country, most of them in people with a travel history to other countries. The outbreak began to grow in India in early March, and by 20 March the number of cases had risen to about 282. The Prime Minister of India, Shri Narendra Modi, first addressed the nation about the COVID-2019 pandemic on 19 March 2020 and announced a public curfew for 22 March 2020. He then addressed the nation a second time on 24 March and placed India under lockdown from 25 March to 14 April 2020. From 20 March 2020 to 10 April 2020 the COVID-2019 epidemic in India grew exponentially, and it continues to do so.

Machine learning based regression analysis can be used to create a model that forecasts the COVID-2019 outbreak in India over the next 7 days. The whole dataset of the COVID-2019 outbreak in India is available on Kaggle and the World Health Organization (WHO) website [19, 20]. In this study, we have used MATLAB software for programming and data analysis. Figure 1 shows the number of cases detected from 1st March 2020 to 11th April 2020. The regression models are fitted to the data for India stored by Kaggle between 1 March 2020 and 11 April 2020. Six regression models are used for the COVID-2019 dataset, namely the quadratic, third degree, fourth degree, fifth degree and sixth degree polynomials and the exponential model. Tables 1 and 2 show the training dataset of the COVID-2019 outbreak in India from 1st March 2020 to 11th April 2020 and the testing dataset from 12 April 2020 to 19 April 2020, respectively.

Fig. 1 Scatter plot of training dataset of the COVID-2019 in India

Table 1 Training dataset of COVID-2019 of India during 1st March 2020 to 11th April 2020
Table 2 Testing dataset of the COVID-2019 of India during 12th April 2020 to 19th April 2020

The analysis is based on the date and confirmed case data for the whole of India presented in Table 1. The regression between date and confirmed cases is computed for the dataset from 1st March 2020 to 11th April 2020 [19, 20]. For the experiments, we use the quadratic to sixth degree polynomial regression models and the exponential model. In these models, the date (say X) is the independent variable, or predictor, and the number of confirmed cases (say Y) is the dependent variable. The fitted regression equations are given below:

$$ Y = 18.74*e^{0.14X} $$
(15)
$$ Y = 8.1572*X^{2} - 214.7599*X + 1013.4 $$
(16)
$$ Y = 144.4802*X^{3} + 597.7748*X^{2} + 865.6646*X + 618.8247 $$
(17)
$$ Y = 144.4802*X^{4} + 597.7748*X^{3} + 865.6646*X^{2} + 618.8247*X + 272.4252 $$
(18)
$$ Y = - 52.17*X^{5} + 144.48*X^{4} + 766.97*X^{3} + 865.66*X^{2} + 512.09*X + 272.43 $$
(19)
$$ Y = - 90.92*X^{6} \pm 52.17*X^{5} + 505.65*X^{4} + 766.97*X^{3} + 515.23*X^{2} + 513.09*X + 320.96 $$
(20)

Equations 15–20 represent the fitted exponential, quadratic, third degree, fourth degree, fifth degree and sixth degree polynomial models, respectively.
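To show how a fitted equation is used, the sketch below evaluates the exponential model of Eq. (15) and the quadratic model of Eq. (16). The text does not state how the date X is encoded, so the day index used here (1 March 2020 taken as day 1, which makes 11 April 2020 day 42) is an assumption, and the function names are hypothetical.

```python
import math

def exponential_model(x):
    """Eq. (15): Y = 18.74 * exp(0.14 * X)."""
    return 18.74 * math.exp(0.14 * x)

def quadratic_model(x):
    """Eq. (16): Y = 8.1572 * X^2 - 214.7599 * X + 1013.4."""
    return 8.1572 * x**2 - 214.7599 * x + 1013.4

# Assumption: X is a day index with 1 March 2020 as day 1, so day 42 is 11 April 2020.
day = 42
y_exp = exponential_model(day)
y_quad = quadratic_model(day)
```

Under a different encoding of X, for example a centered and scaled date, the same coefficients would yield different values, so these evaluations are only indicative.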

Figure 2a shows the confirmed cases together with the fitted regression models, namely the exponential, quadratic, third degree, fourth degree, fifth degree and sixth degree polynomials, for the training dataset. Residuals play an important role in the regression analysis of the COVID-2019 outbreak data in India, so the residuals of all the proposed models are plotted in Fig. 2b. Figure 2a, b show that the sixth degree polynomial gives better fitted results and residuals, respectively, than the quadratic, third degree, fourth degree and fifth degree polynomials. The quadratic polynomial gives unsatisfactory results, while the exponential, third degree, fourth degree, fifth degree and sixth degree models fit the COVID-2019 dataset of India more satisfactorily (Fig. 2a). Figure 2b also shows that the residuals of the exponential fit are strongly patterned and that the residuals of the polynomial fits remain patterned as well.

Fig. 2 a Fitted curve with data of exponential, quadratic, third degree, fourth degree, fifth degree polynomial. b Residuals with dataset of exponential, quadratic, third degree, fourth degree, fifth degree polynomial

To identify the best fit among the proposed models, we have calculated \( R^{2} \) and adjusted \( R^{2} \) for each of them. Table 3 shows the calculated Sum of Squared Errors (SSE), \( R^{2} \), Degrees of Freedom for Error (DFE) and adjusted \( R^{2} \). Table 3 also shows that the sixth degree polynomial model has the lowest values of SSE and DFE and the highest values of \( R^{2} \) and adjusted \( R^{2} \) in comparison to the other models. This means that the proposed sixth degree polynomial gives better results for forecasting the COVID-2019 outbreak in India than the exponential, quadratic, third degree, fourth degree and fifth degree models.

Table 3 The results of SSE, \( R^{2} \), DFE and adjusted \( R^{2} \)
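The quantities reported in Table 3 can be computed for each polynomial degree as in the sketch below. It is not the MATLAB procedure used in this study: it relies on `numpy.polynomial.Polynomial.fit`, takes DFE to be the number of observations minus the number of fitted coefficients (an assumption that matches the denominator n - d - 1 of Eq. (14)), and uses a synthetic noisy series in place of the real data. The function name `fit_summary` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_summary(x, y, degree):
    """Return SSE, R^2, DFE and adjusted R^2 for a polynomial fit of given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Polynomial.fit rescales x internally, which keeps high degrees well conditioned.
    model = Polynomial.fit(x, y, degree)
    residuals = y - model(x)
    sse = float(np.sum(residuals ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    n = len(x)
    dfe = n - (degree + 1)              # observations minus fitted coefficients
    r2 = 1.0 - sse / ss_total
    adj_r2 = 1.0 - (sse / ss_total) * (n - 1) / dfe
    return {"degree": degree, "SSE": sse, "R2": r2, "DFE": dfe, "adj_R2": adj_r2}

# Synthetic stand-in for 42 daily observations (not the real COVID-2019 data).
rng = np.random.default_rng(0)
days = np.arange(1, 43)
cases = 20.0 * np.exp(0.14 * days) + rng.normal(0.0, 25.0, size=days.size)

summary = [fit_summary(days, cases, degree) for degree in range(2, 7)]
```

Because each lower-degree polynomial is a special case of the next higher one, SSE cannot increase with the degree, which is why the adjusted \( R^{2} \) is the more informative column when comparing degrees.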

Table 4 shows the results for the COVID-2019 outbreak training dataset of India from 1st March 2020 to 11 April 2020. The last column of this table shows the results of the proposed sixth degree polynomial, which, according to the above discussion, gives the best results for predicting the outbreak of COVID-2019 in India over the next 7 days. Table 5 shows the results for the COVID-2019 outbreak testing dataset of India from 12th April 2020 to 19 April 2020. The last column of Table 5 likewise shows the predicted results of the proposed sixth degree polynomial.

Table 4 Training datasets analysis of the COVID-2019 of India during 1st March 2020 to 11th April 2020
Table 5 Testing datasets analysis of the COVID-2019 of India during 12th April 2020 to 19th April 2020

Figure 3 compares the confirmed cases (actual results) with the results of the proposed sixth degree polynomial model (predicted results) for the training dataset of the COVID-2019. The figure shows that the results of the proposed sixth degree polynomial are very close to the confirmed cases.

Fig. 3 Comparison of confirmed case (actual result) and results of the proposed model sixth degree polynomial (predicted results) for training dataset of the COVID-2019

Figure 4 compares the confirmed cases (actual results) with the results of the proposed sixth degree polynomial model (predicted results) for the testing dataset of the COVID-2019. The figure shows that the results of the proposed sixth degree polynomial are also very close to the confirmed cases. The proposed method is therefore useful for predicting the COVID-2019 outbreak over the next 7 days from the current date.

Fig. 4 Comparison of confirmed case (actual result) and the results of the proposed model sixth degree polynomial (predicted results) for testing dataset of the COVID-2019
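The training and testing procedure described in this section can be sketched as follows: a polynomial is fitted on the training window and then evaluated on the following 7 days. The helper names, the noise-free cubic series and the use of the mean absolute percentage error as a summary are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np
from numpy.polynomial import Polynomial

def forecast(train_days, train_cases, degree, horizon):
    """Fit on the training window and predict the next `horizon` days."""
    model = Polynomial.fit(train_days, train_cases, degree)
    first_future_day = train_days[-1] + 1
    future_days = np.arange(first_future_day, first_future_day + horizon)
    return future_days, model(future_days)

def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

# Synthetic noise-free cubic series (not the real data): 42 training days, 7 test days.
all_days = np.arange(1, 50)
all_cases = 0.5 * all_days**3 + 2.0 * all_days**2 + 10.0
train_days, train_cases = all_days[:42], all_cases[:42]
test_cases = all_cases[42:]

future_days, predicted = forecast(train_days, train_cases, degree=6, horizon=7)
mape = mean_absolute_percentage_error(test_cases, predicted)
```

On this idealized series the forecast error is negligible. With noisy real data, high-degree polynomials can diverge quickly outside the training window, which is one reason to keep the forecast horizon short.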

5 Conclusion

In this paper, we have proposed six regression analysis based machine learning models for predicting the COVID-2019 outbreak in India: exponential, quadratic, third degree, fourth degree, fifth degree and sixth degree polynomial models. The models were fitted to the COVID-2019 outbreak dataset of India from 1st March 2020 to 11th April 2020 and used to predict the next 7 days, with the predictions checked against the testing dataset from 12th April 2020 to 19 April 2020. We found that the values fitted by the sixth degree polynomial are very close to the confirmed cases in the training dataset of the COVID-2019. According to Table 3, the residual error of the sixth degree polynomial is lower than that of the other proposed models, which means that this model achieved the best fit for the COVID-2019 dataset of India. We can therefore say that the proposed sixth degree polynomial regression gives the best results on both the training and testing datasets. Table 5 shows the predictions for the next 7 days and shows very little difference between the confirmed and predicted results for the COVID-2019 outbreak in India. This study is thus useful to Indian doctors and the Indian government for managing the COVID-2019 outbreak over the next 7 days: they can keep a check on hospital facilities, necessary supplies for new patients, medical aid, and isolation for the coming week or beyond. In the future, we will develop a regression model based on artificial neural networks that can be updated as data arrive at regular intervals and that automatically estimates the number of cases from weekly and bi-weekly data.