COVID-19 pandemic through the eyes of a mathematician, or why the classic SEIRD model does not work

Abstract, or about the leisure of young scientists


Over the past few weeks, my colleagues and I have ended the working day by competing in the accuracy of the forecast for the development of the COVID-19 epidemic in Russia using various non-linear regression methods. And if the forecast for tomorrow inevitably turns out to be good, then the forecast for a period of more than one week reflects reality only in general terms. It would seem that everything is clear: there are epidemiological models , there are optimization methods , there are sufficiently detailed data - it is enough to combine this together and get an accurate forecast for a month, or even six months, in advance. In this article I will share my thoughts on what is wrong with the classic SEIRD model and how to fix it. And, of course, I will open the veil of secrecy that envelops our future with you.

Sit back, a furious matan awaits us for those who know what the differential equations are (for the rest, beautiful pictures are attached).


The figure above shows the total number of confirmed cases of COVID-19 on a logarithmic scale for Russia and the three European countries in the top 5 in terms of the number of infected. The explanation is further in the text.


UFO Care Minute


The pandemic COVID-19, a potentially severe acute respiratory infection caused by the SARS-CoV-2 coronavirus (2019-nCoV), has officially been announced in the world. There is a lot of information on Habré on this topic - always remember that it can be both reliable / useful, and vice versa.

We urge you to be critical of any published information.


Official sources

, .

Wash your hands, take care of your loved ones, stay at home whenever possible and work remotely.

Read publications about: coronavirus | remote work

Model SEIRD


The SEIRD epidemic model belongs to the class of so-called compartmental models , the essence of which is to divide the population into several groups ( English compartments), in our case:$ S $( English susceptible) - susceptible,$ E $( Eng. exposed) - those who have the disease in the incubation period,$ I $( English infectious) - sick,$ R $( English recovered) - recovered,$ D $( English dead) - the dead. Then, the size of each group is compared with a variable in the system of differential equations, solving which, you can predict the dynamics of the epidemic. There are a lot of modifications of the SEIRD model, for example, SEIR is a simplified model that does not separately consider cases of recovery and death. To get acquainted with other models, I can recommend a good article on this topic.

Bit of theory


For the first time, an epidemic model in the form of a system of three differential equations for variables $ S, I, R $appeared in the work of W. Kermak and A. McKendrick in 1927.
These differential equations have the form:

$\begin{align} \frac{dS}{dt }&=-\beta \frac{SI}{N},\\ \frac{dI}{dt}&= \beta \frac{SI}{N}-\gamma I,\\ \frac{dR}{dt}&= \gamma I, \end{align}$


where, in addition to the variables familiar to us, the following constants appear: $N = S + I + R$ - total population size, $\beta$ - infection transmission rate, $\gamma$- speed of recovery.

The meaning of the Kermak and McKendrick equation is as follows: the number of susceptibles decreases in proportion to their number times the average proportion of infected in the population$I/N$, the number of infected grows at the same pace, adjusted for the fact that some of them $\gamma I $recovering, and the number of convalesces increases due to a decrease in the number of infected. It is worth noting that the SIR model contains nonlinearity$SI$, because of which the analytical solution of the system of equations becomes generally impossible, but, fortunately, the methods of numerical differentiation can easily cope with this task.

Adding another variable here$E$ (the number of people with the disease in the incubation period), we get the SEIR model:

$\begin{align} \frac{dS}{dt }&=-\beta \frac{SI}{N},\\ \frac{dE}{dt}&= \beta \frac{SI}{N}-\kappa E,\\ \frac{dI}{dt}&= \kappa E-\gamma I,\\ \frac{dR}{dt}&= \gamma I, \end{align}$


where another constant appears $\kappa$- the rate of transition of the disease from the incubation stage to the open. Figure from taken from the article .



It is immediately clear that the SEIR model is not very suitable for describing COVID-19, if only because in this model there are hidden carriers of infection$E$not contagious. This deficiency can be corrected by introducing, following Pengpeng et al. optional parameter$\theta$, characterizing the degree of infectivity of latent carriers of infection compared with diseased. The modified SEIR model, which we will try to apply to the current epidemic, will look like:

$\begin{align} \frac{dS}{dt }&=-\beta \frac{S(I + \theta E)}{N},\\ \frac{dE}{dt}&= \beta \frac{S(I + \theta E)}{N}-\kappa E,\\ \frac{dI}{dt}&= \kappa E-\gamma I,\\ \frac{dR}{dt}&= \gamma I. \end{align}$


At first glance, the resulting model promises to be completely believable.

Numerical experiment with the SEIR model


For modeling, we will try to take the following parameters, focusing on open data . Assuming that the disease lasts 14 days on average (at least how long the mild form lasts , which accounts for up to 80% of cases), we find the value$\gamma=1/14=0,0714$. Will accept$\beta=3/14=0,2143$. Magnitude$\theta=0,6$borrowed from Pengpeng et al . Given the average incubation period of 3 days, take$\kappa=1/3 = 0,33 $. We accept the population of Russia equal$N = 144,5\cdot10^6$person.

As initial conditions, we use the data for Russia on April 2, when the measures introduced at the end of March to limit the spread of infection were to have their effect, namely:

$\begin{align} &S_0=3548,\\ &I_0= 3283,\\ &E_0=0,5 I_0. \end{align}$


Rating $E_0$we took it relatively arbitrarily, and it doesn’t matter, because, as you know, something will go wrong.

As a result of modeling by the Euler method with a step of 1 day from April 2 to April 24 inclusive, we obtain the graphs shown below: on the left in a linear scale, on the right in a logarithmic.



Round markers mark real data on the total number of cases in Russia, square markers on the number of patients. At first glance, the results look good, except for one: with the parameters of the model, we clearly did not guess. And here optimization methods come to our aid.

Optimize it


Optimization methods, if the reader is not familiar with them, are algorithms that allow finding the minimum of some objective function. In our case, we have before us the problem of nonlinear regression: how to choose the vector of parameters of the differential equation$\mathbf{x} = (\beta, \gamma, \kappa, E_0)^\top$ so that the set of solution points of the differential equation $F$ was as close as possible to the set of observation points $\mathcal{F}$.

We use the standard deviation as a measure of the model error. The objective function will take the form

$ f(\mathbf{x}) = \frac{1}{M}\sqrt{\sum_{i=1}^{M}(F_{i} - \mathcal{F}_{i})^2 + \sum_{i=1}^{M}(G_{i} - \mathcal{G}_{i})^2}, $


Where $M$ - the number of points $F$ - the total number of cases of infection that the model gives, $\mathcal{F}$ - the actual total number of cases, $G$ - the number of patients at the moment, which gives the model, $\mathcal{G}$- the real total of current active cases.

Using the Optimization Toolbox in MATLAB, we will adjust the model parameters to the observation data. As a result, we obtain the solution shown in the figure below.

image

At first glance, everything is fine. The discrepancy turned out equal$f(\mathbf{x}) = 131,98$, and “on the convex sea eye” the fit of the solution went well. Let's look at the received parameters:

$\begin{align} &\beta = 0,374,\\ &\gamma = 0,0117,\\ &E_0 = 7,84\cdot 10^6,\\ &\kappa = 4,81\cdot 10^{-5}. \end{align}$


The value of almost 8 million latent patients, with about 60 thousand cases recorded
on April 24, is something dubious. We also found that the average transition time to the active phase of the disease is$1/\kappa = 2079$days.

Why did this happen? Everything will become clear if we analyze the shape of the curve on a long time scale. To do this, take our SEIR model with "plausible" parameters and simulate it over a long period of time (in this experiment I took on a new meaning$\beta=0,186$):



The curve corresponding to the total number of cases has a characteristic S-shape on a linear scale. The optimization program tried to give this form a curve. In addition, the forecast itself with "plausible" parameters is horrific - according to it, almost 90% of the country's population will be ill by September - it is obviously unrealistic if you look at the results for other countries (the same picture as at the beginning of the article, only on a linear scale ):



Here I compare the three European countries in the top 5 in the number of cases, and Russia. It can be seen that we are about a month behind the pace of the epidemic, and that for a month now the growth in the total number of cases in all three countries is almost linear (and even slower than linear), in contrast to the results obtained in the SEIR model. This raises three questions:

  1. ?
  2. SEIR, ?
  3. , , ?

I will begin by answering the third question. When we predict something, we face a rather unpleasant task: the data on which we build the model are not ideal - they contain errors, noise, and the model built on their basis also contains some error. When we continue the time series with our model points, the error accumulates - and rather quickly, if we predict a function that increases in time. And this is the case in our case. Moreover, the model is a model that reflects the real situation is very limited. The sudden development of the epidemic in a new big city, the use of a more effective method of treatment, a change in the method of collecting information - all this can make so many mistakes in real data that a long-term forecast will be completely far from reality.

Crutches and bicycles: we modify the SEIR model


Let's try to answer the question why the growth of the epidemic is slowing down to a linear one. With the number of infected people that we have now, the scale effect associated with the limited speed of communication between people begins to play a significant role.

More precisely, we recall: the number of cases in the SEIR model is directly proportional to the average number of cases in the population$I/N$. This rule works well in small populations, where everyone can communicate with everyone, and patients are evenly distributed. In reality, especially on the scale of tens and hundreds of thousands of people, if you take two randomly ill people, it turns out that they not only never communicated with each other and did not see each other, they did not even ride in the same subway car. And indeed they live in different cities. All that unites them is a chain of social ties, which led to the fact that the virus was transmitted to them.

As an example, I built an epidemic model in the form of a cellular automaton, where each cell interacts with only 4 neighboring ones. This is equivalent to the fact that each individual in the population has 4 social contacts - this is a very small number for the human population, but the faster the effect of limiting social connections will manifest itself. At each iteration with a probability of 0.1, each of the 4 neighbors of the infected cell can be infected. The disease lasts an average of 14 days. The simulation results for a pool of 200x200 cells are presented in the figure below, where$k$Is the iteration number.



Blue color indicates susceptible, yellow - sick, green - recovered. The most interesting thing is how the plots of the number of cases look. And they look something like planned: after a short phase of sub-exponential growth, just like in the SEIR model, there is a protracted phase of linear growth - just like in reality.



My goal was not to get a picture that resembled reality quantitatively. If you want more credibility, I can recommend the project of Sergey Potekhin, which was recently published on Habré . For meticulous readers, the stricter proof of linear growth is given below.

Proof of the linear growth epidemic theorem in a large population
: $d$- . . , 20 ( ) $d \approx 4$. $d$- . $n^{\frac{1}{d}}$, , $P$, $2P$. , ,

$n_{k+1}=\left(n_k^{\frac{1}{d}} + 2P \right)^d,$



$n$ : $n_k = n(t_k); n_{k+1} = n(t_k + 1)$. , :

$n'_{k+1}=n'_{k}\left(1 + \frac{2P}{n_k^{\frac{1}{d}}} \right)^{d-1}.$



, $n_k$ $n'_{k+1}=n'_{k}$, , .

4% 16- .



Turning to the global statistics on the epidemic, we will see the same thing: for a month now, the growth has been linear, despite the fact that classical models promised us a continuation of the exponential rise in the number of cases.



Tasks for the curious
  1. COVID-19, offline .
  2. .1 ?
  3. .1 2, .


Now about the modification of the SEIR model. The simplest thing we can do is to multiply the non-linear component by some function, depending on the number of patients. With a small number of cases, this function should be close to 1, with a large number, it should asymptotically tend to zero. The simplest suitable candidate is

$\mathcal{\varphi}(I,E) = e^{-\alpha(I +\theta E)^{K_0}}.$


Selection of parameters $\alpha$ and $K_0$can compensate for the exponential growth in the original model.

Add to the model, for greater information, and the component$D$- the number of deaths. Get a modified version of the SEIRD model:

$\begin{align} \frac{dS}{dt }&=-\beta \frac{S(I + \theta E) \mathcal{\varphi}(I,E)}{N},\\ \frac{dE}{dt}&= \beta \frac{S(I + \theta E) \mathcal{\varphi}(I,E)}{N}-\kappa E,\\ \frac{dI}{dt}&= \kappa E-\gamma I-\mu D,\\ \frac{dR}{dt}&= \gamma I,\\ \frac{dD}{dt}&= \mu I. \end{align} $



The simulation results are shown in the figure below.



The standard error in comparison with the original model is almost unchanged. The parameter values ​​are already realistic. For convenience, I designated the initial number of infected in the active phase as$I_0$.

$ \ begin {align} & \ beta = 0.219, \\ & \ gamma = 0.0102, \\ & E_0 = 0.13 \ cdot I_0, \\ & \ kappa = 1/3, \\ & \ mu = 1 , 13 \ cdot 10 ^ {- 3}.  \ end {align} $



The model interpolates derivatives very well too - the values ​​of the daily increase in the number of cases and the number of deaths.



Let's try to make a forecast. We take a forecasting horizon of 2 months and continue to model solutions with the parameters found by the optimization program.



At first glance, not bad, but you can’t wish such a forecast to your beloved homeland: the number of new cases will continue to decline, but the total number will continue to grow. In this case, an epidemic can be stopped only with the help of a vaccine, or by waiting until almost the entire population has been ill. The number of new deaths is set at approximately 200 per day. This is a clear illustration of what will happen if we do not strengthen measures to combat the epidemic. Is that what awaits us? And for the sake of this not very bright future, many of us diligently sit at home, having bought buckwheat and toilet paper?

Below I will consider two scenarios, and looking at the foggy distance of the coming months of April 28, 2020, I can’t say for sure which one will develop events further. Now, at the moment of breaking the curve of new cases, we are at a point where it is doubly problematic to predict something.

US scenario


The world hegemon was in an extremely unenviable position. Delayed in making key decisions that would slow the growth of the epidemic in the beginning, he is still unable to cope with the natural increase in new cases.

The modified SEIRD model, trained at the first 33 points, starting from March 2, plus or minus realistically predicts the course of the epidemic in April.



As you can see, the growth in April is almost perfectly linear. The model slightly overstates mortality in April, but the overall picture is correct.



This picture shows the daily increase in new cases and deaths in the United States. It is very similar to what the model predicted for Russia.

Germany scenario


Disciplined Germans managed to turn the curve in their favor, and its growth is slower than linear. Moreover, to make the model relevant, I had to manually add an increase in the recovery rate on April 6$ \ gamma $1.7 times, otherwise such a sharp drop in the number of cases in terms of the SEIRD model cannot be explained.



The model was trained at the first 27 points, starting from March 10. I also changed the nonlinear function. For Germany, the time-dependent exponent is better:

$ \ mathcal {\ varphi} (t) = K_0 e ^ {- \ alpha t}. $


This type of function indicates a cumulative increase in the number of interrupted social ties and, accordingly, the spread of infection. Here you have a clear illustration of the benefits of self-isolation.



The above shows the daily increase in new cases and deaths. As in the case of the United States, real data contain pronounced fluctuations with a period of 7 days. This means that on weekends the number of contacts increases, and consequently, the number of infected ones grows. [UPD: on the contrary, it is decreasing, as the Yandex self-isolation index may indirectly testify . ]

Conclusion


Making predictions - both short-term and long-term - is not just a tribute to curiosity. In the event of an epidemic, you need to know how many beds should be prepared, how many ventilators to produce, for how many months to stockpile personal protective equipment for doctors. Officials must understand whether the measures taken are sufficient or whether new prohibitions and restrictions should be introduced. Ideally, the model should reflect reality so well that it would be possible to see the force of action of each newly adopted measure, and then it would be possible to strengthen useful measures and cancel decisions that turned out to be useless.

Although with some reservations, everyone who has not yet acquired immunity from COVID-19 can really be described with the letter$ S $, the whole variety of sick people - the letter $ I $, etc. Moreover, the SEIRD model can even explain something. But she can predict something in the distant future very roughly.

I deliberately cited in the article only a negative scenario when we repeat the fate of the United States in terms of the dynamics of the epidemic. If this scenario turns out to be true, by the end of June we will have more than 300 thousand registered cases of the disease and more than 10 thousand deaths. Although there are prerequisites for the fact that this scenario does not translate into reality, I would advise you to relate to it according to the principle: "hope for the best, prepare for the worst." As the saying goes, if NASA's best minds have taken on the fight against the epidemic in the United States , then this is really a bad thing.

So far, all that remains for us is to visit public places less, to usewith proper respirators , wash your hands, do not forget to wipe the smartphone with alcohol after you take it out on the street, and follow other simple recommendations.

But still, is it possible to make more accurate predictions? Yes of course. But about this some other time.

If you have a desire to play with the source code and suggest a development scenario yourself, then here is the link to the github .

All Articles