Introduction

Coronavirus disease (COVID-19) is a novel respiratory illness that originated in 2019 and can spread from person to person, as defined by Centers for Disease Control and Prevention (CDC)1. The first incidence of such disease was publicly reported to World Health Organization (WHO) as an outbreak in Wuhan, China, on 31 December 20192,3. It was assumed on an association with the consumption of wild animals sold at Huanan Seafood Wholesale Market4,5. So far, the original source of this disease has not been clearly identified and the disease is continuously spread in over 70 countries. Reverse polymerase chain reactions and genome sequencing were used for diagnostics and therapeutics measures6. COVID-19 is a member of coronavirus family, and it is contagious among humans and animals7. Coronaviruses are a group of RNA viruses; the earliest study on animal coronavirus was reported in the late 1920s8, and human coronavirus was first studied by Kendall, Bynoe and Tyrell in 1960s through extracting the viruses from patients who suffered from common colds9,10. The genome size of coronaviruses ranges between 26 and 32 kilobases11. The viruses have characteristic club-shaped spikes projected from their surface, and the surface morphology of the viruses resembles a solar corona12. The viruses can be further categorized into gamma, beta, delta and alpha coronaviruses13. Beta and alpha coronaviruses originate from bats, while gamma and delta coronaviruses spread among birds and pigs14. COVID-19 is a member of beta category, which is associated with severe diseases. The genome structure of COVID-19 is 96% similar to that of bat coronaviruses15,16. It is still not clear about the exact route of the virus jump from bats to humans.

COVID-19 had a profound impact on global social and economic development17. It caused severe demographic changes and extremely high unemployment rates with many economic activities being halted. This extraordinary event brought about some unintended consequences such as the violation of international law on the settlement of refugees due to border closure18.

Different governments and health officials have introduced various preventive measures to curb the COVID-19 virus spread, including hand sanitizers, gloves, masks, social distance, and geographical closure17. Although the geographical closure led to temporary urban air quality improvement, lockdown of towns, cities, states, and countries causes severe damage to the well-being and economic growth of society in a broad sense, and the virus poses a major threat to the international healthcare system19. The unknown nature about the peak of virus spread makes the decision of lockdown or closure a difficult task to plan in advance. This calls for an accurate early forecasting model for the ongoing spread of COVID-19 virus.

Literature review

Many studies have been carried out on the epidemic investigation of COVID-19 spread. The first category of studies is a pure statistical analysis. Important epidemic parameters were estimated20,21, including basic reproduction number22, doubling time23 and serial interval24. In addition, some advanced models were developed in handling untraced contacts25, undetected international cases26, and actual infected cases27. Statistical reasoning28,29 and stochastic simulation30,31 were also explored by a few researchers.

The second group of investigations was based on dynamic modelling. Susceptible exposed infectious recovered model (SEIR) was used in assessing various measures in the COVID-19 outbreak32,33,34,35. Furthermore, it was utilized in investigating the effect of lockdown36, transmission process37, transmission risk32, and the effect of quarantine32. The SEIR model with time delays was also developed for studying the period of incubation and recovery38,39.

Richards40 developed a flexible growth function for empirical use in the context of plant data based on von Bertalanffy’s growth function41, which was originally designed for animals. This model was later used for fitting the single-phase outbreaks of severe acute respiratory syndrome (SARS) in Hong Kong42 and Taiwan43 as well as a multi-phase outbreak of SARS in Toronto44. The Georgia State group45 recently studied short-term forecasts of the COVID-19 epidemic in Guangdong and Zhejiang, China between February 13 and 23, 2020 via a generalized logistic growth model46, the Richards model, and a sub-epidemic model47. As an extension to a logistic growth model, the Richards mode can be described by a single differential equation:

$$ \frac{dP\left( t \right)}{{dt}} = rP\left( t \right)\left[ {1 - \left( {\frac{P\left( t \right)}{K}} \right)^{a} } \right], $$
(1)

where P(t) represents the cumulative number of infected cases at time point t, r denotes a growth rate, K refers to the maximum asymptote, and a is a scaling parameter. One solution of Eq. (1) is

$$ P\left( t \right) = \frac{K}{{\left[ {1 + e^{{ - r\left( {t - t_{0} } \right)}} } \right]^{\frac{1}{a}} }}, $$
(2)

where \(t_{0}\) is the time value of the sigmoid’s midpoint. When a = 1, this model is degenerated into a simple logistic growth mode with three parameters [K, r, \(t_{0}\)]. Equation (2) represents a four-parameter model [a, K, r, \(t_{0}\)] and other variations of the Richards model could consist of up to 6 parameters. In this paper, our comparison and discussion are limited to Eq. (2).

Although there have been many recent studies with respect to the COVID-19 virus spread, an accurate forecasting model for the virus spread based on data at a very early time point is still elusive. Such a model is crucial to a decision-making process for strategic plans to achieve a balance between reduction in life loss and avoidance of economic crisis due to lockdown. In this paper, we develop a new recursive bifurcation model, apply it to the recent data in two countries (South Korea and Germany), and compare it with a simple logistic growth model and the Richards model in the context of COVID-19 virus spread.

The rest of this paper is organized as follows. In “Recursive bifurcation model”, a recurve bifurcation model is introduced to model the COVID-19 spread. A bifurcation analysis is given in “Bifurcation analysis of COVID-19 virus spread” on the data of infected population in South Korea. “Early forecasting of COVID-19 virus spread” describes the forecasting of COVID-19 virus spread based on our model and a comparative study with two existing models, followed by some concluding remarks in “Conclusions”.

Recursive bifurcation model

In this paper, we focus on the cumulative number of infected population, which is an important metric to measure the extent of the COVID-19 spread in different countries. Although the infected population in most countries follows a pattern of a logistic or sigmoid function, the logarithm of the infected population may provide more information, as shown in Fig. 1b.

Figure 1
figure 1

The number of infected population in South Korea as of April 5, 2020.

The countries, in which a bifurcation pattern occurred, include South Korea, Germany, United States, France, Canada, Australia, Malaysia, and Ecuador. Figure 2 shows the pattern in the last 5 countries of the list. The detailed information of Germany and United States data are given in “Early forecasting of COVID-19 virus spread”. By utilizing the bifurcation, we can find out the intrinsic parameters (such as growth rate) in cycle 1 and apply those parameters in the prediction for cycle 2 or beyond. More importantly, the bifurcation model performs better than the Richards model in the early forecasting of COVID-19 virus spread. A more detailed discussion about this improvement is provided in “Early forecasting of COVID-19 virus spread”.

Figure 2
figure 2

A bifurcation pattern of the infected population of COVID-19 virus spread in five countries.

Following the above idea, we introduce a recursive Tanh function to describe the cumulative number of infected population within each cycle of an entire virus spread process:

$$ log\left( {P + 1} \right) - log\left( {P_{i - 1} + 1} \right) = \left( {log\left( {P_{i} + 1} \right) - log\left( {P_{i - 1} + 1} \right)} \right)\left[ {\frac{2}{{1 + e^{{ - 2r_{i} \left( {D - D_{i - 1} } \right)}} }} - 1} \right], $$
(3)

where i refers to the i-th cycle, P is the cumulative number of infected population at any time point in the i-th cycle, D represents the number of days since the initiation of virus spread, \(P_{i - 1}\) stands for the cumulative number of infected population at the end of the (i − 1)-th cycle, \(r_{i}\) is the spread rate in the i-th cycle, and \(D_{i - 1}\) refers to the number of days at the end of the (i − 1)-th cycle. The purpose of adding 1 in the logarithm calculation is to avoid an infinity caused by the case where P = 0.

Note that Eq. (3) is not strictly a recursive formula in a conventional sense. The reason for us to call it as a recursive one is that Eq. (3) should be recursively solved starting from cycle 1 toward cycle n, if n is the last cycle for the virus spread. When n = 1, this equation is degenerated to a regular Tanh function:

$$ log\left( {P + 1} \right) = log\left( {P_{1} + 1} \right)\left[ {\frac{2}{{1 + e^{{ - 2r_{1} D}} }} - 1} \right]. $$
(4)

Bifurcation analysis of COVID-19 virus spread

In order to validate Eq. (3) for the analysis of COVID-19 virus spread, we have to select a complete virus spread process. Among all the countries, South Korea seems to be the best choice for this validation because the country provides reasonably reliable data and the virus spread in that country has been stabilized.

\(r_{i}\) in Eq. (3) represents an intrinsic attribute of the virus spread rate. It can be estimated by a linear least-squares fit of the following linear equation in a parameter space:

$$ r_{i} \left( {D - D_{i - 1} } \right) = - 0.5ln\left[ {\frac{2}{{1 + \frac{{log\left( {P + 1} \right) - log\left( {P_{i - 1} + 1} \right)}}{{log\left( {P_{i} + 1} \right) - log\left( {P_{i - 1} + 1} \right)}}}} - 1} \right], $$
(5a)
$$ r_{i} X = W, $$
(5b)

where \(X = D - D_{i}\) and \( W = - 0.5 ln \left[ {\frac{2}{{1 + \frac{{log\left( {P + 1} \right) - log\left( {P_{i - 1} + 1} \right)}}{{log\left( {P_{i} + 1} \right) - log\left( {P_{i - 1} + 1} \right)}}}} - 1} \right]\).

Figure 3a shows the result of determining the virus spread rate, \(r_{1}\). By using this r value, we predict the infected population, \(y_{p}\), which is very close to the true data, y, as shown in Fig. 3b.

Figure 3
figure 3

Determination of virus spread rate with South Korea data in cycle 1.

Furthermore, by using \(r_{1}\) in cycle 2 of South Korea data, we want to validate whether Eq. (3) is still valid by introducing Eq. (6), and \(\alpha\) should be unity (Fig. 4):

$$ log\left( {P + 1} \right) - log\left( {P_{i - 1} + 1} \right) = \alpha \left( {log\left( {P_{i} + 1} \right) - log\left( {P_{i - 1} + 1} \right)} \right)\left[ {\frac{2}{{1 + e^{{ - 2r_{i} \left( {D - D_{i - 1} } \right)}} }} - 1} \right], $$
(6a)
$$ y = \alpha Z, $$
(6b)

where \(y = log\left( {P + 1} \right) - log\left( {P_{i - 1} + 1} \right)\) and \(Z = \left( {log\left( {P_{i} + 1} \right) - log\left( {P_{i - 1} + 1} \right)} \right)\left[ {\frac{2}{{1 + e^{{ - 2r_{i} \left( {D - D_{i - 1} } \right)}} }} - 1} \right]\).

Figure 4
figure 4

Analysis of virus spread with South Korea data in cycle 2.

Based on Fig. 4, \(\alpha = 0.968\), which is very close to unity. This indirectly indicates the correctness of Eq. (3) for cycle 2 with growth rate, \(r_{1}\), from cycle 1. The bifurcation in Fig. 1a is easy to identify visually. An automatic algorithm can be created on the basis of discontinuity of tangential direction with a traverse on the curve. Since it is not the main focus of this paper, we do not explore this aspect any further herein.

Early forecasting of COVID-19 virus spread

Based on the model in “Bifurcation analysis of COVID-19 virus spread”, we design an algorithm for early forecasting of COVID-19 virus in South Korea and Germany, as given in Table 1. Since the infected population has recently been stabilized in these two countries, it is then possible to validate the accuracy of this early forecasting model.

Table 1 An algorithm for early forecasting of COVID-19 virus spread.

We first use the following formula to estimate \(\hat{\beta }_{n}\) through a linear least-squares fit:

$$ log\left( {P + 1} \right) - log\left( {P_{n - 1} + 1} \right) = \hat{\beta }_{n} \left[ {\frac{2}{{1 + e^{{ - 2r_{n} \left( {D - D_{n - 1} } \right)}} }} - 1} \right], $$
(7a)
$$ y = \hat{\beta }_{n} { }Z, $$
(7a)

where \(y = log\left( {P + 1} \right) - log\left( {P_{n - 1} + 1} \right)\) and \(Z = \left[ {\frac{2}{{1 + e^{{ - 2r_{n} \left( {D - D_{n - 1} } \right)}} }} - 1} \right]\).

Nonlinear Levenberg-Marquart least-squares fitting48 is computed to determine three unknown parameters (\(\beta_{n}\), \(\theta_{n}\,and\,D_{n}\)) simultaneously in the following equation for future forecasting:

$$ log\left( {P + 1} \right) = \beta_{n} \left[ {\frac{2}{{1 + \left( {e^{{ - 2r_{n - 1} \left( {D - D_{n} } \right)}} } \right)^{{\theta_{n} }} }} - 1} \right] + log\left( {P_{n - 1} + 1} \right), $$
(8)

where \(\theta_{n} \) is an extra parameter, which plays a role of slope control. Once \(\beta_{n}\), \(\theta_{n}\) and \(D_{n}\) are determined, Eq. (8) can be used to predict the values of infected population at future time points.

Figure 5a defines several different time points to investigate the performance of early forecast on the “future” infected population. Here, the “future” is termed only in a sense with respect to a selected time point (i.e., a reference point after which the “future” is factitiously defined) even though we already have the true infected population data for a period after that time point. \(t_{i}\) refers to the time value of the inflection point. \(0.9t_{i}\), \(0.8t_{i}\), \(...,\) and 0 equally divide the range [0, \(t_{i}\)] into ten intervals. We use a similar interval for time values that are greater than \(t_{i}\): \(1.1t_{i}\), \(1.2t_{i}\), \(...,\) and \(mt_{i}\), where \(m\) could be any real number and should be greater than unity.

Figure 5
figure 5

Notations for early forecast on future infected population.

It is shown in Fig. 5b that the infection point, \(t_{i}\), appeared 12 days later than the cycle transition point, \(t_{c}\), between cycles 1 and 2 on South Korea data; a similar pattern was also observed in U.S., Germany and the countries listed in Fig. 2. This provides a better opportunity for our bifurcation method to produce an early forecast, compared to existing methods such as the Richards method, which generates a reasonably good prediction only at a time point after the inflection point because the right part of a curve after the inflection point can’t normally be mirrored from the left part of the curve before the inflection point. In the case of Germany data (Fig. 5c), \(t_{c}\) appeared 38 days earlier than \(t_{i}\). This provides an opportunity for our bifurcation method to utilize the cycle information for early forecast on the COVID virus spread.

Tables 2 and 3 are a comparison in the 95 percent confidence interval of prediction errors of three models at a reference time point defined in Fig. 5a. The second column of these tables contains a mean value and 95 percent confidence bounds in a pair of parentheses. In general, our bifurcation model performs relatively better for an early forecast at 0.8 \(T_{i}\) or 0.9 \(T_{i}\). The forecast time point was selected to a date of writing this paper. Between the Richards model and a simple logistic growth model, the former is better in terms of relative error in one out of two cases. Since there are at least 4 parameters [a, K, r, \(t_{0}\)] in the Richards model, closely-estimated start values are needed in nonlinear least-squares fitting. Figure 6 shows the curve fitting of two-country data based on the three models used in this paper. The parameters associated with each model are given in Table 4.

Table 2 A comparison in infected population prediction at 3.55 \(T_{i} \) based on South Korea data at 0.9 \(T_{i}\).
Table 3 A comparison in infected population prediction at 2.0 \(T_{i} \) based on Germany data at 0.8 \(T_{i}\).
Figure 6
figure 6

Curve fitting of two-country data at their respective reference time points with three different models.

Table 4 Model parameters associated with the fitting in Fig. 6.

Note that for the data from South Korea and Germany, there is no multi-stage pattern if the simple logistic growth model or the Richards model is used. Only through the special treatment in our bifurcation model, did a two-stage pattern appear, allowing a more accurate forecast at an early time point (such as 0.8 \(T_{i} \) or 0.9 \(T_{i}\)) on the infected population growth at a later time point (for example, 2.0 \(T_{i}\)).

The importance of our model is supported by the results in Tables 2 and 3, where our model performs significantly better than the existing models in terms of early forecast on the growth of COVID-19 virus spread. Consequently, our model has a potential to be used in decision making for the events of virus spread in the future.

The data from United States presents a challenge to our approach and also reflects a limitation of the method. As shown in Fig. 7a, for the period from January 22, 2020 to April 17, 2020, there was no inflection point and our cycle transition point appeared very early (Day 33). This case indicates that the cycle transition point appeared at least 52 days earlier than the inflection point. However, since the inflection point did not appear even on August 3, 2020 (Fig. 7b), it is difficult to validate the accuracy of any early forecasting models on U.S. data at the time of writing this paper. It is still elusive how to design and evaluate an early forecasting model when the entire virus spread history is at its early stage without the occurrence of its inflection point. This will be a future research topic.

Figure 7
figure 7

Infected population data of COVID-19 in United States.

Conclusions

In this paper, we propose a recursive bifurcation approach for early forecasting of COVID-19 virus spread. An algorithm is developed to predict the future infected population based on ongoing existing data as of June 14, 2020. Numerical analyses were conducted in comparison with two existing models (a logistic growth model and a Richards model). The results indicate that our bifurcation model performs relatively better than the two existing models at 0.8 \(T_{i} \) or 0.9 \(T_{i}\) time point, where \(T_{i} \) refers to the inflection point of an infected population-time curve. We presented an important observation in which the cycle transition time point, \(T_{c}\), appeared much earlier than \(T_{i}\). This allows our bifurcation model to perform well in the early forecasting of COVID-19 virus spread in South Korea and Germany. However, the ongoing infection spread in United States presents a challenge to our model. It will be a future research topic on how to evaluate the forecasting of COVID-19 infection spread when its inflection point has not occurred as of August 3, 2020.