INTRODUCTION

In December 2019, a pneumonia outbreak took place in Wuhan (China), as a result of which a COVID-19 strain was first detected in a nucleic acid analysis in a patient with the pneumonia. By the end of June 2020 the pandemic captured 188 countries, with more than 10 million of detected infection cases, and 505,500 people have died. According to the cases detected, the Russian Federation is the third in ranking of infected countries, behind the United States and Brazil, with 484,630 infection cases by June 9, 2020 (see Table 1). Although a peak of the disease in the world seems to have been passed, no decrease in the number of detected cases has been observed for a sufficiently long period [1].

To create proper scenarios of disease development in Moscow (the largest number of detected cases in Russia) and Novosibirsk region (moderate detection rate) is an important step in taking appropriate measures to curb the epidemic in the regions.

Table 1. COVID-19 statistics for five leading countries in the number of cases detected by June 9, 2020

Country

Infected

Deaths

Cured

Total

New cases

USA

1,961,185

17,250

111,007

524,855

Brazil

707,412

15,654

37,134

396,737

Russia

485,253

8 595

6 141

241,917

Great Britain

288,834

1 213

40,680

1 257

Papers [2] and [3] are the first to describe some scenarios of the coronavirus epidemic development in Moscow.

The COVID-19 epidemic spread in China, taking into account the incubation period, passenger traffic, and decisions of the Health Care Commission is studied with SEIR mathematical models in [49]. A review of the mathematical models and methods for their solution can be found in [10], and a review of the literature on estimating contagiousness parameters for various countries, in [11].

A model called QSEIR that takes into account the quarantine measures in China to curb the epidemic is proposed in [12]. The model parameters are estimated by a statistical method. Some forecasts of the epidemic based on some strategies of quarantine measures in China have been made. A forecast of the pandemic development in China based on an approximation of the data with some basic functions has also been made in [13].

In paper [14], a model called SIR-D is used to estimate the contagiousness parameter, daily mortality, and recovery rate. Some parameters are found from Chinese data, and a forecast is made for three weeks up to February 29. It was shown in [15] that the Chinese epidemic would have a peak by the end of February, and gradually decline by the end of April. It was shown with machine learning methods that quarantine lifting in Hubei would cause a second peak of the epidemic in this province in mid-March and continue the epidemic to the end of April.

A mathematical model with some control measures called SEIR-HD is proposed in [16].

In paper [17], a generalized SEIR model is applied to SARS-CoV-2 spread in Italy. A particle swarm method is used to find the model parameters for increasing the reliability of 30-day forecasts. The results obtained are compared with data and forecasts for Spain and South Korea. The model is used to take into account the mobility of people according to Google COVID-19 Community Mobility Reports. In [18, 19], modified SEIR models are used to obtain scenarios of the epidemic in Italy.

In [20], a SIR-D model is used to obtain scenarios of the epidemic in Brazil taking into account school closures and isolation of elderly people.

In [21], a SEIR-D model with time-dependent parameters and calculation results are presented for some Indian districts, as well as for Lombardy (Italy) and Moscow (Russia).

The purpose of the present paper is to study the reliability of forecasts by the SEIR models versus the available information on the COVID-19 pandemic spread.

The SIR (Susceptible, Infected, Recovered) model is used as a basis for describing the spread of infectious diseases; it was proposed in the 1920s by Scottish epidemiologists Anderson Kermack and William McKendrick. According to the SIR, the population is divided into three groups: susceptible (\(S\)), infected (\(I\)), and recovered (\(R\)):

  • \(S\)—susceptible (uninfected) (age three and older);

  • \(I\)—infected with symptoms;

  • \(R\)—cured.

With time, transitions \(S \rightarrow I\) (infection) and \(I \rightarrow R\) (recovery or death) are possible. The SIR-model does not take into account many factors (for instance, population densities in different regions or different transmission routes). Therefore, the SIR has been refined many times [2224]. Today there is a whole family of models (and even open codes) developed on the basis of the SIR model [2528]:

  • SIRS: to describe the dynamics of diseases with temporary immunity.

  • SEIR (Exposed): to describe the spread of diseases taking into account incubation period.

  • SIS: to describe the spread of diseases with no immunity.

  • MSEIR (Maternally derived immunity): to take into account maternally derived immunity.

In the present paper, the following two models are used: SEIR-HCD and SEIR-D, to which we add

  • \(E\)—infected or in incubation period;

  • \(H\)—hospitalized, that is, with a severe form of the disease;

  • \(C\)—individuals in critical condition under artificial lung ventilation (ALV);

  • \(D\)—dead.

The choice of the models is based on the SEIR, since COVID-19 has a rather long incubation period (5–14 days), during which there may be carriers who do not show any symptoms of the disease [29].

In the present paper, first the coefficients of transition from one group to another are specified by solving an inverse problem with data on the number of infected individuals \(I(t)\) and \(E(t)\), and the number of the dead \(D(t)\). Then we solve a direct problem, that is, calculate a scenario of the epidemic.

The following additional information about the disease is used: duration of the incubation period, duration of the latent period (from the time of infection to the time when the infected individual becomes a source of infection), and a contagiousness parameter.

figure 1

Fig. 1. Schematic diagrams of SEIR-HCD (1) left and SEIR-D (5) right mathematical models.

With the generalized model, we can calculate scenarios of the disease in some cities of Russia: For instance, predict a disease peak in Moscow and Novosibirsk region, the time of reaching a plateau, and a decline in the epidemic in each specific case. In these models we plan to take into account social distancing, climatic conditions, age characteristics, peculiarities of human immunity, passenger intercity transportation in the Russian Federation, etc.

1. MATHEMATICAL MODEL SEIR-HCD

Mathematical model SEIR-HCD of COVID-19 spread in a specific region of the Russian Federation was first proposed in [30]. The model is based on a system of seven nonlinear ordinary differential equations in an interval \(t\in [t_0,T]\):

$$\left\{\begin{array}{l} \dfrac{dS}{dt} = - \dfrac{5-a(t-\tau)}{5}\left(\dfrac{ {\alpha_I} S(t)I(t)}{N} + \dfrac{ {\alpha_E} S(t)E(t)}{N}\right) + \gamma R(t),\\[7pt] \dfrac{dE}{dt} = \dfrac{5-a(t-\tau)}{5}\left(\dfrac{{\alpha_I} S(t)I(t)}{N} + \dfrac{ {\alpha_E} S(t)E(t)}{N}\right) - ({\kappa} +\rho) E(t),\\[7pt] \dfrac{dI}{dt} = {\kappa} E(t) - ( {\beta}+ {\nu})I(t),\\[7pt] \dfrac{dR}{dt} = {\beta} I(t) + \rho E(t) - \gamma R(t) + \varepsilon_{HR}H(t),\\[7pt] \dfrac{dH}{dt} = {\nu} I(t) + {\varepsilon_{CH}} C(t) - (\varepsilon_{HC} + \varepsilon_{HR}) H(t),\\[7pt] \dfrac{dC}{dt} = \varepsilon_{HC} H(t) - ( {\varepsilon_{CH}}+ {\mu})C(t),\\[7pt] \dfrac{dD}{dt} = {\mu} C(t)\\[7pt] \end{array}\right. \ \ {(1)}$$

with initial conditions

$$\begin{array}{c} S(t_0)=S_0,\ \ E(t_0)= {E_0},\ \ I(t_0)=I_0,\ \ R(t_0)=R_0,\\[1ex] H(t_0) = H_0,\ \ C(t_0)=C_0,\ \ D(t_0) = D_{0}. \end{array} \ \ {(2)}$$

The entire population \(N = S+E+I+R+H+C+D\) consists of seven groups (see a schematic diagram of the model in Fig. 1, left).

Table 2. Parameters of models (1), (2), (5), and (6) and their limits

Symbol

Description

Limits

SEIR-HCD

 

SEIR-D

\(a(t)\)

Self-isolation index according to Yandex

 

(0, 5)

 

\(\tau\)

Latent period (characterizes delays in the release of virions or contagiousness)

 

2 days

 

\(\alpha_I\)

Infection transmission between infected and susceptible individuals associated with virus contagiousness and social factors

 

(0, 1)

 

\(\alpha_E\)

Infection transmission between asymptomatic and susceptible

 

(0, 1)

 
 

population groups (\(\alpha_E \gg \alpha_I\))

   

\(\kappa\)

Frequency of symptoms in open cases leading to transition

 

(0, 1)

 
 

from asymptomatic to infected population

   

\(\rho\)

Recovery rate of detected cases (recovering without any symptoms)

0

 

(0, 1)

\(\beta\)

Recovery rate of infected cases

 

(0, 1)

 

\(\gamma\)

Reinfection rate. This parameter is reciprocal of virus immunity

 

0

 
 

level (0—stable immunity, 0.001—probability of reinfection)

   

\(\nu\)

Hospitalized cases with severe disease

(0, 1)

 

\(\varepsilon_{HR}\)

Probability of recovery in severe condition

0.225

 

\(\varepsilon_{HC}\)

Fraction of hospitalized cases with severe disease

0.025

 

 

under artificial lung ventilation (ALV)

   

\(\varepsilon_{CH}\)

Probability for a patient to be taken off the ALV

(0, 1)

 

\(c^\mathrm{isol}\)

Coefficient of impact of self-isolation index on infection

 

(0, 1)

\(\mu\)

COVID-19 mortality

(\(10^{-4}\), \(10^{-1}\))

 

(0, 0.1)

\(E_0\)

Initial number of asymptomatically infected

(1, 800)

 

(0, 600)

\(R_0\)

Initial number of cured

 

(0, 600)

The parameters and their average values are given in Table 2.

1.1. Inverse Problem

Assume that additional measurements about the following three functions for fixed times are available:

$$E(t_k) = b_k f_k,\ \ \ I(t_k) = (1-b_k)f_{k},\ \ \ \Delta D(t_k) = g_{k},\quad t_k\in(t_0,T),\ \ \ k=1,\ldots, K, \ \ {(3)}$$

where \(b(t)\in [0,1]\) is the fraction of asymptomatic patients in identified cases (the data are taken from stopkoronavirus.rf), \(f_k\) is the number of detected cases per day \(k\), \(g_k\) is the number of deaths per day \(k\), \(\Delta D(t_k) = D(t_{k}) - D(t_{k-1})\), and \(K\) is the number of days in the statistics.

Unknown parameters: \(q=(\alpha_E, \alpha_I, \kappa, \beta, \nu, \varepsilon_{CH}, \mu, E_0)\in\mathbb{R}^{8}\).

The inverse problem (1)–(3) is in determining the vector of parameters \(q\) using the additional measurements (3).

The inverse problem is reduced to minimizing the following objective functional:

$$J(q) = \sum\limits_{k=1}^K \Big[ (c^\mathrm{test}_k E(t_k;q) - b_kf_k)^2 + (c^\mathrm{test}_k I(t_k;q)-(1-b_k)f_k)^2 + ( \Delta D(t_k;q)-g_k)^2 \Big]. \ \ {(4)}$$

Here \(c^\mathrm{test}_k\in [0,1]\) is the ratio of the number of tests to the number of healthy individuals in the region under study per day \(k\). The functional (4) is based on the following assumptions: \(100 b_k\)% of the identified cases do not show any symptoms (group \(E(t)\)). Hence, \(100 b_k\)% and \(100 (1-b_k)\)% of the detected cases are proportional to the tested (testing coefficient \(c_k^\mathrm{test}\)) asymptomatic and symptomatic cases, respectively.

2. MATHEMATICAL MODEL SEIR-D

Model SEIR-D of COVID-19 spread is described by a system of five nonlinear ordinary differential equations in an interval \(t\in [t_0,T]\) [31] (a schematic diagram of the model is shown in Fig. 1 right):

$$\left\{ \begin{array}{l} \dfrac{dS}{dt} = - c(t-\tau) \left(\dfrac{ {\alpha_I} S(t)I(t)}{N} + \dfrac{ {\alpha_E} S(t)E(t)}{N}\right) + \gamma R(t),\\[7pt] \dfrac{dE}{dt} = c(t-\tau) \left(\dfrac{ {\alpha_I} S(t)I(t)}{N} + \dfrac{ {\alpha_E} S(t)E(t)}{N}\right) - ( {\kappa} + {\rho}) E(t),\\[7pt] \dfrac{dI}{dt} = {\kappa} E(t) - {\beta}I(t)- {\mu}I(t),\\[7pt] \dfrac{dR}{dt} = {\beta} I(t) + {\rho} E(t) - \gamma R(t),\\[7pt] \dfrac{dD}{dt} = {\mu} I(t)\\[7pt] \end{array}\right. \ \ {(5)}$$

with initial data

$$S(t_0) = S_0, \quad E(t_0) = {E_0}, \quad I(t_0) = I_0, \quad R(t_0) = {R_0}, \quad D(t_0) = D_0. \ \ {(6)}$$

Here \(N = S + E + I + R + D\) is the entire population.

A function that uses some restrictions on social mobility is

$$c(t) = 1 + {c^\mathrm{isol}}\left(1 - \frac{2}{5}a(t)\right) , \quad c(t) \in (0,2).$$

2.1. Statement of the Inverse Problem

Additional information:

  • infected per day \(f_k\), \(k = 1, \ldots, K\),

  • deaths per day \(g_k\), \(k = 1, \ldots, K\).

In the model (5) this means

$$0.58 f_k = \kappa E_{k-1}, \quad g_k = D(t_k) - D(t_{k-1}), \quad t_k\in(t_0,T),\,\ k=1,\ldots, K. \ \ {(7)}$$

There is very little information about the real number of asymptomatically infected persons. However, it seems that almost all infected individuals with symptoms are detected during the day. On average, 58% of the infected individuals detected per day (\(f_k\)) are with symptoms (\(I\)). The infected individuals detected on day \(k\) with symptoms (\(0.58 f_k\)) are a fraction of the infected individuals without symptoms on day \(k-1\) (\(\kappa E(t_{k-1}))\).

Unknown parameters of the model are \({q = (\alpha_E, \alpha_I, \kappa, \rho, \beta, \mu, c^\mathrm{isol}, E_0, R_0)} \in \mathbb{R}^{9}\).

The inverse problem (5)–(7) is in determining the vector of parameters \(q\) from the additional information (7).

The inverse problem (5)–(7) is reduced to minimizing the functional

$$J(q) = \sum\limits_{k=1}^K \Big[ w_1 (\kappa E(t_{k-1};q) - 0.58 f_k)^2 + w_2 (D(t_k;q) - D(t_{k-1};q) - g_k)^2 \Big]. \ \ {(8)}$$

The weights in the functional (8) are chosen as follows:

$$w_1 = \sum\limits_{k=1}^K \frac{f_k}{K}, \qquad w_2 = \sum\limits_{k=1}^K \frac{g_k}{K}.$$

3. NUMERICAL EXPERIMENTS AND FORECAST RESULTS

This section presents a comparison of numerical results where the inverse problems (1)–(3) are used for model SEIR-HCD, and (5)–(7) for model SEIR-D in Moscow and Novosibirsk region. The criteria of the comparison are as follows::

  1. 1.

    Accuracy of epidemic peak prediction in Moscow (date and number of detected infected individuals).

  2. 2.

    Accuracy of disease spread prediction in Moscow according to daily detected cases \(f_k\) and mortality \(g_k\), \(k=1,\ldots, K\).

  3. 3.

    Accuracy of processing of the historical data of daily detected cases \(f_k\) and mortality \(g_k\) in Moscow.

  4. 4.

    Required computing resources (runtime of algorithms, computing power).

  5. 5.

    Comparison of the necessary a priori information on the limits of the desired parameters (see Table 2, columns 3 and 4).

3.1. Data Processing

The self-isolation index data \(a(t)\) in Moscow and Novosibirsk region are taken from Yandex maps. The testing coefficient was calculated using statistical information from open sources by the following formula:

$$c^\mathrm{test}_k = \dfrac{T_k}{Z_k},$$

where \(T_k\) is the number of tests made in the region under study on day \(k\), which was obtained by multiplying the number of tests done in the country on day \(k\) by the proportion of the region’s population in the country’s population, \(Z_k\) is the number of uninfected individuals in the region by the day \(k\) calculated as the difference between the region’s population and the total number of sick and dead individuals over the entire observation period up to day \(k\) (no data on recovered individuals were used). \(T_k\) and \(Z_k\) were extrapolated for the future period of forecasting.

3.2. Algorithms for Solving the Inverse Problems

3.2.1. Model SEIR-HCD

To numerically solve the inverse problem (1)–(3), the following sequence of steps was used:

  1. 1.

    Processing of data \(f_k\), \(g_k\), \(c^\mathrm{test}_k\), \(a(t_k)\), \(b_k\), \(k=1,\ldots,K\), for time periods (a)-(d) (see Section 3.3).

  2. 2.

    Determining the parameter limits for the sought-for vector \(q\) (see Table 2, column 3).

  3. 3.

    Using a method of differential evolution with module scipy.optimize.differential_evolution of the SciPy library to determine the minimum of the functional (4) and obtain the optimal vector of parameters \(q^*\) for the mathematical model (1), (2):

    1. 3.1.

      Solving the direct problem (1), (2) and calculating the functional (4) for each iteration of the method. To solve the direct problem (1), (2) module scipy.integrate.odeint of the SciPy library was used.

    2. 3.2.

      Determining the optimal vector of parameters \(q^*\).

  4. 4.

    Solving the direct problem (1), (2) for time periods (a)–(d) using the thus found optimal parameters from \(q^*\) and making forecasts for corresponding time periods (a)–(d). The number of infected individuals per day \(\hat{f}_k\) is calculated by the formula \(\hat{f}_k = c^\mathrm{test}_k(I(t_{k}) + E(t_{k}))\).

3.2.2. Model SEIR-D

The calculations were made using the Python programming language. To numerically solve the inverse problem (5)—(7), the following sequence of steps was used:

  1. 1.

    Processing of data \(f_k\), \(g_k\), \(a(t_k)\), \(k=1,\ldots,K\), for time periods (a)-(d).

  2. 2.

    Determining the parameter limits for the sought-for vector \(q\) (see Table 2, column 4).

  3. 3.

    Determining the minimum of the functional (8) and obtaining the optimal vectors of the parameters \(q^*\) of the mathematical model (5), (6) using the following four methods:

    1. 1)

      differential evolution method (module scipy.optimize.differential_evolution),

    2. 2)

      simulated annealing method (module scipy.optimize.dual_annealing),

    3. 3)

      genetic algorithm (https://pypi.org/project/geneticalgorithm/ library),

    4. 4)

      particle swarm method (https://pythonhosted.org/pyswarm/ library).

  4. 4.

    From the vectors \(q^*\) obtained by methods 1)–4), the best one is chosen in terms of the functional (8):

    1. 1)

      Solving the direct problem (5), (6) and calculating the functional (8) for each iteration of the method. To solve the direct problem (5), (6) module scipy.integrate.ode_ivp of the SciPy library, method RK45, was used.

    2. 2)

      Determining the optimal vector of parameters \(q^*\) of the mathematical model (5), (6).

  5. 5.

    Solving the direct problem (5), (6) for time periods (a)–(d) using the thus found optimal parameters from \(q^*\) and making forecasts for corresponding time periods (a)–(d). The number of infected individuals detected per day \(\hat{f}_k\) is calculated by the formula \(\hat{f}_k = \dfrac{\kappa E(t_{k-1})}{0.58}\).

3.3. Numerical Calculations for Moscow

As initial data on the COVID-19 spread in Moscow (\(N_0 = 11~514~330\)) for SEIR-HCD model (1), we used the following statistical information for March 23, 2020:

$$S_0 = 11~514~241-q_8,\,\ E_0 = q_8,\,\ I_0 = 71,\,\ R_0 = 9,\,\ H_0 = 8,\,\ C_0 = 1,\,\ D_0 = 0.$$

In the SEIR-D (5), the following initial data were used:

$$S_0 = 11~514~330-I_0-q_8-q_9,\,\ E_0 = q_8,\,\ I_0 = 71,\,\ R_0 = q_9,\,\ D_0 = 0.$$

Data for the inverse problems were obtained from open sources for Moscow in the following time periods:

  1. (a)

    23.03.2020–21.05.2020 (\(K=60\) days), (b) 23.03.2020–11.05.2020 (\(K=50\) days),

  2. (c)

    23.03.2020–01.05.2020 (\(K=40\) days), (d) 23.03.2020–21.04.2020 (\(K=30\) days).

The calculations were made using the Python programming language. The forecast was made for the following time periods according to above data (a)–(d):

  1. (a)

    22.05.2020–21.06.2020 (30 days), (b) 12.05.2020–11.06.2020 (30 days),

  2. (c)

    02.05.2020–01.06.2020 (30 days), (d) 22.04.2020–21.05.2020 (30 days).

3.3.1. Comparison of the Results

A comparison of the results obtained by using two mathematical models, SEIR-HCD and SEIR-D, was made. The parameters of these models were specified for Moscow by optimization methods (differential evolution, simulated annealing, genetic algorithm, and particle swarm).

Figure 2 shows the number of infected individuals detected for the two models verified by using real data up to May 27, 2020 for time periods (a)–(d). Note that even with the scarce statistical data the epidemic spread forecast is more accurate with SEIR-HCD (curve 3). This means that the forecast of the number of detected cases in Moscow is rather accurate (see the ratio of the SEIR-HCD simulation result (curve 3) to the statistical data not used to solve the inverse problem (curve 2)). SEIR-D has a larger error in predicting the number of detected cases, but it describes historical data with better accuracy (see the ratio of the SEIR-D simulation result (curve 4) to the statistical data used to solve the inverse problem (curve 1)).

The parameters reconstructed for SEIR-HCD and SEIR-D are presented in Tables 3–6. An analysis of the behavior of the parameters for SEIR-HCD with the various data sets has shown that the parameters \(\varepsilon_{CH}\) and \(\beta\) have larger variation ranges, in contrast to the other parameters, as noted in Section 4 where a sensitivity analysis is performed. Thus, in solving the inverse problem some form of regularization and a priori information [32] (for instance, in the form of statistical constraints) were used in the calculations.

figure 2

Fig. 2. The number of detected cases in Moscow for measurements from 23.03.2020 to simulation periods of 90, 80, 70, 60, and 90 days (shown in the plots), respectively: Curve 1—real data, \(f_k\), used in solving the inverse problems, curve 2—real data up to 27.05.2020 not used in solving the inverse problems, curve  3—solution for SEIR-HCD, curve 4—solution for SEIR-D. The reconstructed parameters for the mathematical models are presented in Tables 3–6.

Table 3. Reconstructed parameters for 23.03.2020–21.05.2020 measurement period

Model

\(\alpha_E\)

\(\alpha_I\)

\(\kappa\)

\(\rho\)

\(\beta\)

\(\nu\)

\(\varepsilon_{CH}\)

\(\mu\)

\(c^\mathrm{isol}\)

\(E_0\)

\(R_0\)

SEIR-HCD

0.995

0.499

0.280

0.146

0.06

0.003

0.0076

795

SEIR-D

0.510

0.618

0.002

0.187

0.230

0.0052

0.9

783

187

Table 4. Reconstructed parameters for 23.03.2020–11.05.2020 measurement period

Model

\(\alpha_E\)

\(\alpha_I\)

\(\kappa\)

\(\rho\)

\(\beta\)

\(\nu\)

\(\varepsilon_{CH}\)

\(\mu\)

\(c^\mathrm{isol}\)

\(E_0\)

\(R_0\)

SEIR-HCD

0.999

0.5

0.290

0.16

0.037

0.001

0.011

798

SEIR-D

0.874

0.669

0.037

0.66

0.97

0.0163

0.346

598

48

Table 5. Reconstructed parameters for 23.03.2020–01.05.2020 measurement period

Model

\(\alpha_E\)

\(\alpha_I\)

\(\kappa\)

\(\rho\)

\(\beta\)

\(\nu\)

\(\varepsilon_{CH}\)

\(\mu\)

\(c^\mathrm{isol}\)

\(E_0\)

\(R_0\)

SEIR-HCD

0.994

0.491

0.28

0.165

0.04

0.0065

0.011

794

SEIR-D

0.849

0.716

0.011

0.58

0.36

0.0089

0.56

586

146

Table 6. Reconstructed parameters for 23.03.2020–21.04.2020 measurement period

Model

\(\alpha_E\)

\(\alpha_I\)

\(\kappa\)

\(\rho\)

\(\beta\)

\(\nu\)

\(\varepsilon_{CH}\)

\(\mu\)

\(c^\mathrm{isol}\)

\(E_0\)

\(R_0\)

SEIR-HCD

0.998

0.493

0.268

0.063

0.007

0.171

0.0473

798

SEIR-D

0.078

0.787

0.643

0.703

0.173

0.0039

0.63

17

332

Tables 7–10 present comparisons of the simulation results for \(\hat{y}\) with real data \(y\) for (a)–(d) according to the following metrics taking into account the time period length \(M\):

  • mean-square error: \(\mathrm{MSE}(y,\hat{y}) = \sum\limits_{k=1}^{M} \frac{(y_k - \hat{y}_k)^2}{M}\);

  • mean average error: \(\mathrm{MAE}(y,\hat{y}) = \sum\limits_{k=1}^{M} \frac{|y_k - \hat{y}_k|}{M}\).

Let us introduce the following notation: \(Err^\mathrm{MAE}_\mathrm{fore}=\mathrm{MAE}(f,\hat{f})\) and \(Err^\mathrm{MSE}_\mathrm{fore}=\mathrm{MSE}(f,\hat{f})\) are errors in the corresponding test period \(N\) whose data were not used in solving the inverse problems (in Fig. 2 the data are denoted by curve 2), \(Err^\mathrm{MAE}_\mathrm{desc}=\mathrm{MAE}(f,\hat{f})\) and \(Err^\mathrm{MSE}_\mathrm{desc}=\mathrm{MSE}(f,\hat{f})\) are errors in the corresponding training period \(K\) whose data \(f_k\) and \(g_k\) were used in solving the inverse problems (in Fig. 2 the data are denoted by curve 1).

To check criterion 1, let us introduce \(\delta_t\) — deviation of the epidemic peak date in the model from the real peak date on May 7, 2020 (in days), \(\delta_f\) — deviation of the epidemic peak magnitude in the model from the real peak magnitude, which was 6703 detected cases.

Table 7. Comparison of the models for data of 23.03.2020–21.05.2020, \(K=60\), \(N=6\)

Model

Criterion 1

Criterion 2

Criterion 3

\(\delta_t\)

\(\delta_f\)

\(Err^\mathrm{MAE}_\mathrm{fore}\)

\(Err^\mathrm{MSE}_\mathrm{fore}\)

\(Err^\mathrm{MAE}_\mathrm{desc}\)

\(Err^\mathrm{MSE}_\mathrm{desc}\)

SEIR-HCD

2

254

338

157690

769

931618

SEIR-D

3

930

198

54977

454

365087

Table 8. Comparison of the models for data of 23.03.2020–11.05.2020, \(K=50\), \(N=16\)

Model

Criterion 1

Criterion 2

Criterion 3

\(\delta_t\)

\(\delta_f\)

\(Err^\mathrm{MAE}_\mathrm{fore}\)

\(Err^\mathrm{MSE}_\mathrm{fore}\)

\(Err^\mathrm{MAE}_\mathrm{desc}\)

\(Err^\mathrm{MSE}_\mathrm{desc}\)

SEIR-HCD

2

174

614

556795

780

974866

SEIR-D

12

817

3346

12745883

338

247688

Table 9. Comparison of the models for data of 23.03.2020–01.05.2020, \(K=40\), \(N=26\)

Model

Criterion 1

Criterion 2

Criterion 3

\(\delta_t\)

\(\delta_f\)

\(Err^\mathrm{MAE}_\mathrm{fore}\)

\(Err^\mathrm{MSE}_\mathrm{fore}\)

\(Err^\mathrm{MAE}_\mathrm{desc}\)

\(Err^\mathrm{MSE}_\mathrm{desc}\)

SEIR-HCD

2

207

779

842300

714

889008

SEIR-D

3

2898

1546

3531484

233

105670

Table 10. Comparison of the models for data of 23.03.2020–21.04.2020, \(K=30\), \(N=36\)

Model

Criterion 1

Criterion 2

Criterion 3

\(\delta_t\)

\(\delta_f\)

\(Err^\mathrm{MAE}_\mathrm{fore}\)

\(Err^\mathrm{MSE}_\mathrm{fore}\)

\(Err^\mathrm{MAE}_\mathrm{desc}\)

\(Err^\mathrm{MSE}_\mathrm{desc}\)

SEIR-HCD

5

7537

5242

30788912

342

183419

SEIR-D

31

348120

48996

6350947086

148

69867

3.4. Numerical Calculations for Novosibirsk Region

As initial data on the COVID-19 spread in Novosibirsk region (\(N_0 = 2,798,170\)) we used the following statistical information on March 23, 2020 for SEIR-HCD:

$$S_0 = 2\,798\,170 - q_8,\ \ E_0 = q_8,\,\ I_0 = 0,\ \ R_0 = 0,\ \ H_0 = 0,\ \ C_0 = 0,\ \ D_0 = 0,$$

and for SEIR-D:

$$S_0 = 2\,798\,170-q_8-q_9,\ \ E_0 = q_8,\ \ I_0 = 0,\ \ R_0 = q_9,\ \ D_0 = 0.$$

Fig. 3 presents the results of simulation and forecast of the number of detected cases in Novosibirsk region with measurements from March 23 to May 31, 2020 (Fig. 3a) and up to June 15, 2020 (Fig. 3b) obtained using the above algorithms for these two mathematical models. Note that the simulation of real data on the detected cases obtained with SEIR-D (curve 2) is more accurate than the simulation with SEIR-HCD (curve 1).

figure 3

Fig. 3. The number of detected cases of COVID-19 in Novosibirsk region from 23.03.2020 to 21.06.2020 (90 days) with measurements from 23.03.2020 to (a) 31.05.2020 and (b) 15.06.2020. Solution for SEIR-HCD (curve 1), solution for SEIR-D (curve 2), real data \(f_k\) used in solving the inverse problems (curve 3). Tables 11 and 12 present the parameters reconstructed for these two mathematical models.

It is difficult to analyze the simulation results of mortality in Novosibirsk region for both models, since the available data are insufficient for a good qualitative forecast of mortality in this region. Note that the self-isolation index coefficient has a zero influence on contagiousness \(c^\mathrm{isol}\) in Novosibirsk region for SEIR-D. This, in fact, means that self-isolation does not affect the contagiousness in the region (see Tables 11 and 12)

Table 11. Reconstructed parameters for 23.03.2020–31.05.2020 measurement period, Novosibirsk region

Model

\(\alpha_E\)

\(\alpha_I\)

\(\kappa\)

\(\rho\)

\(\beta\)

\(\nu\)

\(\varepsilon_{CH}\)

\(\mu\)

\(c^\mathrm{isol}\)

\(E_0\)

\(R_0\)

SEIR-HCD

0.001

0.224

0.108

0.013

0.006

0.055

0.072

1001

SEIR-D

0.999

0.999

0.042

0.952

0.999

0.0188

0

99

24

Table 12. Reconstructed parameters for 23.03.2020–15.06.2020 measurement period, Novosibirsk region

Model

\(\alpha_E\)

\(\alpha_I\)

\(\kappa\)

\(\rho\)

\(\beta\)

\(\nu\)

\(\varepsilon_{CH}\)

\(\mu\)

\(c^\mathrm{isol}\)

\(E_0\)

\(R_0\)

SEIR-HCD

0.219

0.0001

0.118

0.046

0.001

0.001

0.1

6412

SEIR-D

0.271

0.999

\(1.9\cdot 10^{-5}\)

\(10^{-9}\)

0.007

0.0009

0

93

96

4. SEIR-HCD SENSITIVITY ANALYSIS

By using a software package called DAISY [33] on the structural identifiability analysis of the model (1), (2), it has been shown that the mathematical model of the spread of coronavirus in the population is identifiable. It is important to estimate the sensitivity of the parameters \(q\) to the model functions in order to control the stability of the inverse problem solving and the quality of forecasting [3436].

First we will estimate the parameters to which the solution of the model is most sensitive. They are determined by the calculation of semi-relative sensitivity.

To describe this method, assume that we wish to determine the relative sensitivity of the observed model quantities \(x = (E,I,D)^\top\) (and, hence, the model solution \(\bar{x}=(S,E,I,R,H,C,D)^\top\)) to the parameters \(q_k\), \(k = 1, \ldots, 8\). The semi-relative sensitivity of the model solution to the parameter \(q_k\) is defined by the expression \(\frac{\partial x_i(t;q)}{\partial q_k} q_k\) and calculated by formal differentiation of the ODE model

$$\left\{\begin{array}{l} \dfrac{d\bar{x}}{dt} = f(t,\bar{x};q),\\[1.5ex] \bar{x}(0) = \bar{x}_0 \end{array}\right. \ \ {(9)}$$

with respect to \(q_k\) and change in the order of differentiation with respect to time and the parameters.

Thus, we obtain a \((7 \times 8)\)-dimensional system of differential equations for the sensitivity function \(\bar{x}_q(t;q)=\partial \bar{x}(t;q)/\partial q\):

$$\dfrac{d}{dt}\left(\dfrac{\partial \bar{x}(t)}{\partial q}\right) =\dfrac{\partial f}{\partial \bar{x}} \dfrac{\partial \bar{x}(t)}{\partial q} + \dfrac{\partial f}{\partial q}$$

with initial conditions

$$\dfrac{\partial \bar{x}(0)}{\partial q} = \dfrac{\partial \bar{x}_0}{\partial q} = 0.$$

This means that the zero matrix for any initial condition is not a function of all estimated parameters. Here \(\partial f /\partial \bar{x}\) is the Jacobian of the ODE system, and \(\partial f / \partial q\) is the derivative of the right-hand side with respect to these parameters.

figure 4

Fig. 4. Sensitivity function \(c^\mathrm{test}(t)\frac{\partial x_i(t)}{\partial q_k}q_k\) for the time period from 23.03.2020 to 21.04.2020 (30 days along the horizontal line).

figure 5

Fig. 5. Sensitivity analysis for the mathematical model (1) by orthogonal method (a) and eigenvalue method (b) [35] (logarithmic scale).

This process provides some information about the sensitivity as a function of time in the interval of interest. It would be desirable to have some general measure of sensitivity of the solution to the parameters. Therefore, for each state/parameter combination we take a norm (in the space \(L_2\)) in time, and then rank the thus obtained scalars to determine the most sensitive parameters (see Table 13). The smaller \(\|\frac{\partial x_i(t)}{\partial q_k}\|_2\), the less the influence of the parameter \(q_k\) on the variable \(x_i\). For instance, the quality of determining the parameters \(\varepsilon_{CH}\) and \(\mu\) in solving the inverse problem does not depend on the available measurements of the number of infected individuals \(I(t)+E(t)\) in contrast, for example, to the coefficients \(\alpha_E\), \(\rho\), and \(\kappa\), which are more sensitive to these data.

Table 13. Semi-relative sensitivity of various model states to parameters arranged in descending order

Variable \(x_i\)

Parameter \(q_k\)

\(\Big\|c_i(t)\frac{\partial x_i(t)}{\partial q_k}q_k \Big \|_2\)

Variable \(x_i\)

Parameter \(q_k\)

\(\Big \|c_i(t)\frac{\partial x_i(t)}{\partial q_k}q_k \Big \|_2\)

\(D\)

\(\alpha_E\)

\(225.81\)

\(D\)

\(\beta\)

\(13.60\)

\(I+E\)

\(\alpha_E\)

\(156.69\)

\(D\)

\(\varepsilon_{CH}\)

\(12.64\)

\(D\)

\(\mu\)

\(89.06\)

\(I+E\)

\(\nu\)

\(3.38\)

\(D\)

\(\rho\)

\(63.27\)

\(I+E\)

\(\beta\)

\(1.21\)

\(I+E\)

\(\rho\)

\(57.73\)

\(D\)

\(\alpha_I\)

\(0.22\)

\(D\)

\(\nu\)

\(53.78\)

\(I+E\)

\(\alpha_I\)

\(0.19\)

\(D\)

\(\kappa\)

\(47.44\)

\(I+E\)

\(\mu\)

\(0.00\)

\(I+E\)

\(\kappa\)

\(30.42\)

\(I+E\)

\(\varepsilon_{CH}\)

\(0.00\)

Figure 4 presents the time variation of the sensitivity function \(c^\mathrm{test}(t)\frac{\partial x_i(t)}{\partial q_k}q_k\) versus the parameters. The more variable is the dynamics of a parameter, the higher is the sensitivity to the measurement data and, therefore, it will be determined with higher stability. In the inverse problem (1)–(3) \(\alpha_E\), \(\rho\), and \(\kappa\) are the most identifiable parameters, and \(\mu\), \(\varepsilon_{CH}\), and \(\alpha_I\) are the parameters that are least sensitive to the data.

Figure 5 shows the results of the sensitivity analysis of the parameters of the mathematical model (1) for various iterations of an orthogonal algorithm and an eigenvalue method, respectively [35]. One can see that the most identifiable are the parameters of infection between the asymptomatic and susceptible population groups \(\alpha_E\), mortality \(\mu\), and the recovery rate of identified cases \(\rho\).

Table 14. Sequences of parameters obtained by using sensitivity analysis methods for mathematical model (1): from most to least sensitive parameter

Orthogonal method

Eigenvalue method

\(\alpha_I\), \(\beta\), \(\varepsilon_{CH}\), \(\nu\), \(\rho\), \(\mu\), \(\kappa\), \(\alpha_E\)

\(\alpha_I\), \(\beta\), \(\kappa\), \(\varepsilon_{CH}\), \(\rho\), \(\mu\), \(\nu\), \(\alpha_E\)

Table 14 presents sequences of parameters obtained by the two methods. The identifiability analysis shows that the model parameters that are least sensitive (most identifiable) to the data variations (errors) are \(\alpha_E\), \(\kappa\), and \(\mu\). In other words, these parameters are more stably determined in solving the inverse problem (1)–(3). The parameters \(\alpha_I\), \(\varepsilon_{CH}\), and \(\beta\) are most sensitive (less identifiable) to the measurement errors. Therefore, it is necessary to develop a regularization algorithm to control the quality of determining the sensitive parameters.

5. CONCLUSIONS

Two mathematical models, SEIR-HCD and SEIR-D, based on the SEIR structure have been analyzed. For each of these models, inverse problems of specifying their parameters for Moscow and Novosibirsk region have been formulated. Identifiability and sensitivity to statistical data errors for these parameters have been analyzed. Algorithms for solving the inverse problems have been developed, and scenarios of the COVID-19 epidemic spread in Moscow and Novosibirsk region for various measurement and forecast time periods have been constructed.

An epidemic peak in Moscow with an error of 2 days and with 174 detected cases less than the real number was obtained by model SEIR-HCD with data on detected cases and mortality in Moscow from March 23, 2020 to May 1, 2020. A forecast of the epidemic development (the number of new detected cases per day) with the smallest error was obtained by model SEIR-HCD for all data (30, 40, and 50 days of the statistics), except for the longest period (60 days of the statistics), in which SEIR-D had the smallest error. Evidently, the use of a coarser mathematical model (with a smaller number of homogeneous groups) is justified in the case of a large amount of statistical data and a short forecasting period. In fact, SEIR-D gives the best approximation of the historical data on detected cases and mortality in Moscow.

Simulation and forecast of the development of coronavirus infection in Novosibirsk region were made for two sets of measurements: from 23.03.2020 to 31.05.2020 and up to 15.06.2020, due to the scarce statistics up to April 15, 2020. It has been shown that no epidemic peak has yet been reached. For the above measurement intervals, SEIR-D provides the most accurate simulation results for the coronavirus infection spread in Novosibirsk region.

In the case of a large simulation interval it is necessary to take into account the dynamics of measures taken to combat the coronavirus infection (mandatory use of masks in closed rooms, reopening of borders for passenger traffic, reopening of cinemas, etc.), which affects real data, quantitative and qualitative data on the reconstructed parameters, scenarios, and properties of the models. In this case the unknown parameters \(q\) of the models should be considered as functions of time (piecewise constant, piecewise linear ones, etc.). The time variation of the reconstructed parameters will help in explaining the disease development in the regions and how the introduction of some measures (or their slackening) affects the number of infected cases. In further work it is planned to consider \(\alpha_E (t)\), \(\alpha_I (t)\), \(c^\mathrm{isol}(t)\), and others as piecewise constant functions of time.