When will the pandemic decline? Evaluating in Python with Pandas

Hello everyone.

I have seen several COVID-19 dashboards, but none of them showed the main thing: a forecast of when the epidemic will decline. So I wrote a small Python script. It takes the data from the Johns Hopkins CSSE tables on GitHub, splits it by country, and draws trend lines. Based on those trends it predicts when a decline in infections can be expected in each of the top 20 countries by number of COVID-19 cases. I admit that I am not an expert in epidemiology. The calculations below are based on publicly available data and common sense. All the details are under the cut.

Update from 04/10/2020 - the main table and graphs have been updated.

Update from 04/16/2020 - user h3cksl updated the graphs and posted them on his website.

Update from 04/29/2020 - It has now become apparent that the curves of daily infections differ a lot between countries, and the shape seems to depend heavily on the scale of the outbreak. The ideal bell-shaped curve modeled on South Korea is more an exception than the rule. A more common shape has a sharp rise to the peak followed by a decline that is stretched out over time. In reality, therefore, the decline in incidence will take longer than this model predicted in early April.


At first, in order to assess the timing of the epidemic, I used publicly available dashboards. One of the first was the website of the Center for Systems Science and Engineering of Johns Hopkins University.

image

It is rather pretty in its ominous simplicity. Until recently it could not show the daily increase in infections, but that has been fixed. It is also interesting because it shows the spread of the pandemic on a world map. Given the dashboard's black-and-red palette, I don't recommend looking at it for long. I think it could bring on anxiety attacks that turn into various forms of panic.

To keep my finger on the pulse, I preferred the COVID-19 page on the Worldometer site. It presents the information by country as a table:

image

Clicking on a country takes you to more detailed information about it, including daily and cumulative curves for infected / recovered / dead.

There are other dashboards. For example, for in-depth information about our country, you can simply type “COVID-19” into the Yandex search bar and you will find this:

image

Well, I will stop there. Neither on these three dashboards nor on several others could I find a forecast of when the pandemic will finally decline.

A bit about graphs


Right now the infection is still gaining momentum, so the most interesting graphs are those showing the daily increase in the number of cases. Here is an example for the USA:

image

It can be seen that the number of infected people is growing every day. On such graphs, of course, we are not talking about all those who have become infected (it is not possible to determine their exact number), but only about confirmed cases. But for brevity, hereinafter we will call these confirmed infected simply “infected”.

What does a tamed pandemic look like? An example here is the graph for South Korea:

image

It can be seen that the daily increase in infections first grew, then reached a peak, went down and settled at a roughly constant level (about 100 people per day). This is not a complete victory over the virus, but it is a significant success. It allowed the country to restart its economy (although, they say, the Koreans managed not to slow it down) and return to a more or less normal life.

It can be assumed that the situation in other countries will develop in a similar way. That is, growth at the first stage will be replaced by stabilization, and then the number of sick people will decline every day. The result is a bell-shaped curve. Let's look for the peaks of the curves of the most “infected” countries.

Download and process data


Data on the spread of the coronavirus are updated on an ongoing basis in the CSSEGISandData/COVID-19 repository on GitHub. The source tables contain cumulative curves; they can be downloaded and pre-processed with a small script.

Like this
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)

# Load the cumulative time series (confirmed, deaths, recovered) from the CSSE repository
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
confirmed = pd.read_csv(url + 'time_series_covid19_confirmed_global.csv', sep = ',')
deaths = pd.read_csv(url + 'time_series_covid19_deaths_global.csv', sep = ',')
recovered = pd.read_csv(url + 'time_series_covid19_recovered_global.csv', sep = ',')

# Rename the date columns from 'm/d/yy' to 'dd.mm.yy'
def to_ddmmyy(label):
    month, day, year = (int(part) for part in label.split('/'))
    return '{0:02d}.{1:02d}.{2:d}'.format(day, month, year)

new_cols = list(confirmed.columns[:4]) + [to_ddmmyy(c) for c in confirmed.columns[4:]]
confirmed.columns = new_cols
recovered.columns = new_cols
deaths.columns = new_cols

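# Convert the cumulative totals into daily increments.
# Whole columns are replaced so that the integer columns can become float (the first day is NaN).
date_cols = list(confirmed.columns[4:])
confirmed_daily = confirmed.copy()
confirmed_daily[date_cols] = confirmed[date_cols].diff(axis=1)
deaths_daily = deaths.copy()
deaths_daily[date_cols] = deaths[date_cols].diff(axis=1)
recovered_daily = recovered.copy()
recovered_daily[date_cols] = recovered[date_cols].diff(axis=1)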

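# Smooth the daily confirmed cases with a centered moving average (the trend line).
# The frame is transposed because rolling(axis=1) is deprecated in recent pandas versions.
smooth_conf_daily = confirmed_daily.copy()
smooth_conf_daily.iloc[:,4:] = smooth_conf_daily.iloc[:,4:].T.rolling(window=8, min_periods=2, center=True).mean().T
smooth_conf_daily.iloc[:,4:] = smooth_conf_daily.iloc[:,4:].round(1)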

# Last date for which data are available
last_date = confirmed.columns[-1]

# Top 20 countries by total number of confirmed cases
confirmed_top = confirmed.iloc[:, [1, -1]].groupby('Country/Region').sum().sort_values(last_date, ascending = False).head(20)
countries = list(confirmed_top.index)

# Share of all confirmed cases that falls on the top 20 countries
conf_tot_ratio = confirmed_top.sum() / confirmed.iloc[:, 4:].sum().iloc[-1]

# Optionally add a country that is not in the top 20
# countries.append('Russia')

# Legend labels with the worldwide totals on the last date
l1 = 'Infected at ' + last_date + ' - ' + str(confirmed.iloc[:, 4:].sum().iloc[-1])
l2 = 'Recovered at ' + last_date + ' - ' + str(recovered.iloc[:, 4:].sum().iloc[-1])
l3 = 'Dead at ' + last_date + ' - ' + str(deaths.iloc[:, 4:].sum().iloc[-1])

# Plot the worldwide cumulative curves
fig, ax = plt.subplots(figsize = [16,6])
ax.plot(confirmed.iloc[:, 4:].sum(), '-', alpha = 0.6, color = 'orange', label = l1)
ax.plot(recovered.iloc[:, 4:].sum(), '-', alpha = 0.6, color = 'green', label = l2)
ax.plot(deaths.iloc[:, 4:].sum(), '-', alpha = 0.6, color = 'red', label = l3)

ax.legend(loc = 'upper left', prop=dict(size=12))
ax.xaxis.grid(which='minor')
ax.yaxis.grid()
ax.tick_params(axis = 'x', labelrotation = 90)
plt.title('COVID-19 in all countries. The top 20 countries account for {:.2%} of all confirmed cases.'.format(conf_tot_ratio.iloc[0]))
plt.show()


Thus, you can get the same curves as on dashboards:

image

The script contains a little extra that is not needed to display this graph but will be needed in the next steps. For example, I immediately built a list of the 20 countries with the largest number of infected as of today. Together, by the way, they account for about 90% of all infected.

Let's move on. The daily figures for new cases are quite “noisy”, so it would be nice to smooth the chart. Let's add a trend line to the daily charts (I used a centered moving average). The result looks something like this:
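As a small illustration of how such a top-N list and its share of the total can be obtained, here is a sketch on made-up numbers. The helper name `top_share`, the `total` column and the toy countries are my own assumptions for the example, not part of the script above.

```python
import pandas as pd

# Toy cumulative totals; the country names and numbers are made up for illustration.
toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B', 'C', 'D'],
    'total': [50, 30, 15, 4, 1],
})

def top_share(df, n):
    """Return the n largest countries by total and their share of the grand total."""
    # Provinces of the same country are summed first, as in the script above.
    by_country = df.groupby('Country/Region')['total'].sum().sort_values(ascending=False)
    top = by_country.head(n)
    return list(top.index), top.sum() / by_country.sum()

top_countries, share = top_share(toy, 2)
print(top_countries, '{:.0%}'.format(share))
```

Grouping before sorting matters because some countries are split into several province rows in the source tables.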

image

The trend is shown in black. It is a fairly smooth curve with a single peak. It remains to find the beginning of the outbreak (red arrow) and the peak incidence (green arrow). The peak is simple: it is the maximum of the smoothed data. If the maximum falls on the rightmost point of the graph, the peak has not yet been passed. If the maximum lies further to the left, the peak is most likely already behind us.

The start of the outbreak can be found in different ways. For simplicity, let's take it to be the moment when the number of infected per day begins to exceed 1% of the maximum number of infected per day.

The graph is fairly symmetrical. That is, the time from the onset of the epidemic to the peak is approximately equal to the time from the peak to the decline. So if we find the number of days from the beginning to the peak, we will have to wait approximately the same number of days until the epidemic subsides.
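Here is a minimal sketch of the smoothing and the peak check on made-up data. The numbers, the window of 3 and the helper name `find_peak` are assumptions chosen to keep the example short; this version checks the position of the maximum rather than comparing values.

```python
import pandas as pd

# Made-up noisy daily counts, labelled with 'dd.mm.yy' strings.
labels = ['{:02d}.04.20'.format(d) for d in range(1, 11)]
daily = pd.Series([1, 4, 2, 9, 6, 14, 10, 5, 6, 2], index=labels, dtype=float)

# Centered moving average; window=3 keeps the toy example readable.
trend = daily.rolling(window=3, min_periods=2, center=True).mean()

def find_peak(trend):
    """Return (peak_label, peak_passed) for a smoothed daily series."""
    peak_label = trend.idxmax()
    # The peak counts as passed only if the maximum is not the last point.
    peak_passed = peak_label != trend.index[-1]
    return peak_label, peak_passed

print(find_peak(trend))
```

If only the first four days were available, the maximum of the smoothed series would sit on the last point and the function would report that the peak has not been passed yet.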
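The 1% rule can be sketched like this. The function name `outbreak_start` and the sample values are hypothetical. The sketch returns the first day above the threshold, which is a close variant of the rule rather than an exact copy of any particular implementation.

```python
import pandas as pd

def outbreak_start(trend, fraction=0.01):
    """Return the label of the first day on which the trend exceeds fraction * max."""
    threshold = trend.max() * fraction
    above = trend[trend > threshold]
    # Nothing exceeds the threshold, for example, when the series is all zeros.
    return above.index[0] if not above.empty else None

# Made-up smoothed values, labelled with 'dd.mm.yy' strings.
labels = ['{:02d}.03.20'.format(d) for d in range(1, 7)]
trend = pd.Series([0.0, 0.5, 2.0, 40.0, 150.0, 90.0], index=labels)

print(outbreak_start(trend))
```

With a maximum of 150 the threshold is 1.5, so the first two days are skipped and the third day is taken as the start.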
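The symmetry assumption comes down to simple date arithmetic. A minimal sketch, with hypothetical start and peak dates and a helper name `projected_end` of my own:

```python
from datetime import date, timedelta

def projected_end(start, peak):
    """Mirror the start-to-peak duration onto the other side of the peak."""
    days_to_peak = (peak - start).days
    return peak + timedelta(days=days_to_peak)

# Hypothetical dates: the outbreak starts on March 1 and peaks 25 days later.
start = date(2020, 3, 1)
peak = date(2020, 3, 26)

print(projected_end(start, peak))  # 2020-04-20
```

This is only as good as the symmetry assumption itself: if the decline is slower than the rise, the real end date will be later.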

We will put all this logic into the following script.

This one
from datetime import timedelta, datetime

# Build the summary table for the top 20 countries (plus any added manually).
confirmed_top = confirmed_top.rename(columns={str(last_date): 'total_confirmed'})
dates = list(confirmed.columns[4:])

for country in countries:

    # Country-level series, summed over all provinces of the country
    conf_total = confirmed.loc[confirmed['Country/Region'] == country, dates].sum()
    deaths_total = deaths.loc[deaths['Country/Region'] == country, dates].sum()
    trend = smooth_conf_daily.loc[smooth_conf_daily['Country/Region'] == country, dates].sum()

    # Total cases, cases on the last day, and mortality rate
    confirmed_top.loc[country, 'total_confirmed'] = conf_total.iloc[-1]
    confirmed_top.loc[country, 'last_day_conf'] = conf_total.iloc[-1] - conf_total.iloc[-2]
    confirmed_top.loc[country, 'mortality, %'] = round(deaths_total.iloc[-1] / conf_total.iloc[-1] * 100, 1)

    # Maximum of the trend line and the date on which it is reached
    smoothed_conf_max = round(trend.max(), 2)
    peak_date = trend.idxmax()
    peak_day = dates.index(peak_date)

    # Start of the outbreak: count the days on which the trend is below 1% of its maximum
    start_day = (trend < smoothed_conf_max / 100).sum()
    start_date = dates[start_day - 1]

    # The peak is passed if the last trend value is not the maximum
    peak_passed = round(trend.iloc[-1], 2) != smoothed_conf_max

    # Fill in the summary table
    confirmed_top.loc[country, 'trend_max'] = smoothed_conf_max
    confirmed_top.loc[country, 'start_date'] = start_date
    confirmed_top.loc[country, 'peak_date'] = peak_date
    confirmed_top.loc[country, 'peak_passed'] = peak_passed
    confirmed_top.loc[country, 'days_to_peak'] = peak_day - start_day

    # If the peak is passed, project the end date symmetrically: peak + days_to_peak
    if peak_passed:
        peak = datetime.strptime(peak_date, '%d.%m.%y').date()
        confirmed_top.loc[country, 'end_date'] = (peak + timedelta(days=int(peak_day - start_day))).strftime('%d.%m.%Y')

# Manual correction for China: the start date is set to the first date in the data set
confirmed_top.loc['China', 'start_date'] = '22.01.20'
confirmed_top


At the output we get just such a table for the top 20 countries.

Update 10.04.2020: At the time of writing, Russia was not in the top twenty, but on April 7 it appeared in 20th place, and by April 10 it had moved to 18th. The table also includes the dates when quarantine measures were introduced in different countries.

image

The following columns are in the table:
total_confirmed - the total number of infected in the country;
last_day_conf - the number of infected on the last day;
mortality, % - mortality rate (number of dead / number of infected, as a percentage);
trend_max - trend maximum;
start_date - date the epidemic began in the country;
peak_date - peak date;
peak_passed - peak flag (if True - peak passed);
days_to_peak - how many days have passed from the beginning to the peak;
end_date - end date of the epidemic.

Update 10.04.2020: the incidence peak has been reached in 14 of the 20 countries. On average, 25 days pass from the beginning of mass infections to the peak:

image
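An average like this can be read off the summary table by keeping only the countries whose peak has been passed. The sketch below uses a made-up table; only the column names `peak_passed` and `days_to_peak` are taken from the article.

```python
import pandas as pd

# Made-up summary rows; only the column names follow the table described above.
summary = pd.DataFrame(
    {'peak_passed': [True, True, False, True],
     'days_to_peak': [20, 30, 35, 25]},
    index=['A', 'B', 'C', 'D'],
)

# Average start-to-peak duration over the countries that are already past the peak.
mean_days = summary.loc[summary['peak_passed'], 'days_to_peak'].mean()
print(mean_days)
```

Countries that have not reached the peak are excluded because their `days_to_peak` is still growing.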

According to the logic above, the situation should return to normal in almost all of the countries considered by the end of April. Whether it does, only time will tell.

Update 10.04.2020: You can see that the charts of European countries, unlike the chart of South Korea, show a more gradual decline after the peak.

There is another script that allows you to display graphs for each of the countries.

This one
@interact
def plot_by_country(country = countries):
    
    # Legend labels with the totals for the selected country on the last date
    l1 = 'Infected at ' + last_date + ' - ' + str(confirmed.loc[confirmed['Country/Region'] == country, last_date].sum())
    l2 = 'Recovered at ' + last_date + ' - ' + str(recovered.loc[recovered['Country/Region'] == country, last_date].sum())
    l3 = 'Dead at ' + last_date + ' - ' + str(deaths.loc[deaths['Country/Region'] == country, last_date].sum())
    
    # Collect the daily series for the selected country in one DataFrame
    # (only the date columns are summed, so the text columns are left out)
    df = pd.DataFrame({'confirmed_daily': confirmed_daily.loc[confirmed_daily['Country/Region'] == country].iloc[:, 4:].sum()})
    df['recovered_daily'] = recovered_daily.loc[recovered_daily['Country/Region'] == country].iloc[:, 4:].sum()
    df['deaths_daily'] = deaths_daily.loc[deaths_daily['Country/Region'] == country].iloc[:, 4:].sum()
    df['smooth_conf_daily'] = smooth_conf_daily.loc[smooth_conf_daily['Country/Region'] == country].iloc[:, 4:].sum()
    
    # Create the figure
    fig, ax = plt.subplots(figsize = [16,6], nrows = 1)
    plt.title('COVID-19 daily dynamics in ' + country)
    
    # Daily bars for infected, recovered and dead, plus the trend line in black
    ax.bar(df.index, df.confirmed_daily, alpha = 0.5, color = 'orange', label = l1)
    ax.bar(df.index, df.recovered_daily, alpha = 0.6, color = 'green', label = l2)
    ax.bar(df.index, df.deaths_daily, alpha = 0.7, color = 'red', label = l3)
    ax.plot(df.index, df.smooth_conf_daily, alpha = 0.7, color = 'black')
    
    # Mark the start of the outbreak and the peak on the trend line.
    # ax.plot is used instead of ax.plot_date, which is deprecated in recent matplotlib versions.
    start_date = confirmed_top[confirmed_top.index == country].start_date.iloc[0]
    start_point = smooth_conf_daily.loc[smooth_conf_daily['Country/Region'] == country, start_date].sum()
    ax.plot([start_date], [start_point], 'o', alpha = 0.7, color = 'black')
    shift = confirmed_top.loc[confirmed_top.index == country, 'trend_max'].iloc[0]/40
    plt.text(start_date, start_point + shift, 'Start at ' + start_date, ha ='right', fontsize = 20)
    
    peak_date = confirmed_top[confirmed_top.index == country].peak_date.iloc[0]
    peak_point = smooth_conf_daily.loc[smooth_conf_daily['Country/Region'] == country, peak_date].sum()
    ax.plot([peak_date], [peak_point], 'o', alpha = 0.7, color = 'black')
    plt.text(peak_date, peak_point + shift, 'Peak at ' + peak_date, ha ='right', fontsize = 20)
    
    ax.xaxis.grid(False)
    ax.yaxis.grid(False)
    ax.tick_params(axis = 'x', labelrotation = 90)
    ax.legend(loc = 'upper left', prop=dict(size=12))
    
    # Add the country name and the estimate of when the outbreak ends.
    max_pos = max(df['confirmed_daily'].max(), df['recovered_daily'].max())
    if confirmed_top[confirmed_top.index == country].peak_passed.iloc[0]:
        estimation = 'peak is passed - end of infection ' + confirmed_top[confirmed_top.index == country].end_date.iloc[0]
    else:
        estimation = 'peak is not passed - end of infection is not clear'
    plt.text(df.index[len(df.index)//2], 3*max_pos//4, country, ha ='center', fontsize = 50)
    plt.text(df.index[len(df.index)//2], 2*max_pos//3, estimation, ha ='center', fontsize = 20)
    
    #plt.savefig(country + '.png', bbox_inches='tight', dpi = 75)
    plt.show()


Here is the current situation in Russia (update of 04/10/2020):

image

Unfortunately, the peak has not yet been passed, so it is difficult to predict when the incidence will decrease. But given that the outbreak has lasted 26 days so far, we can expect to see the peak within a week, after which the incidence will begin to decline. I would venture to suggest that the situation will normalize in early May (I have always been an optimist, I must say).

Thank you very much for reading to the end. Stay healthy, and may the force be with us. Below are the charts for the rest of the top twenty countries, in descending order of the number of infected. If you need more recent data, you can run the scripts given in the text and everything will be recalculated for the current date.

Update 10.04.2020 - the graphs have been updated.

USA
image

Spain
image

Italy
image

Germany
image

France
image

China
image

Iran
image

United Kingdom
image

Turkey
image

Switzerland
image

Belgium
image

Netherlands
image

Canada
image

Austria
image

Portugal
image

Brazil
image

South Korea
image

Israel
image

Sweden
image

Norway
image
