Confidence interval for the number of patients with coronavirus (mortality calculation)

A popular argument for a viral publication about coronavirus - how can one get any statistics for three cases? You can’t draw conclusions on such small samples! Everyone who studied social sciences absorbed this story about sample sizes with mother's alma milk. And this is correct in those situations with which we usually deal - with selective statistics.

For the case of the three dead, these statistics are very indirectly related. In those years when I was still teaching mathematical methods for psychologists at the university, I always tried to stop at this place - something that this whole course is about is not related to actual data. Only to the problem, when we need to make a conclusion about the general population from a random sample.

And here we have the number 3. Three dead, not some kind of vector, not a table or sample. It is a fact. Three of the dead came to us completely by chance. They died.

So, we will consider one of the simplest methods for determining the number of cases - by the mortality rate and the number of deaths. Suppose that we know mortality and it is 1%. In this situation, it would be logical and correct to consider that the number of people recovered will be 297. But what is the reliability of this judgment? Can we just brush aside that we have three dead, stating that three are not statistics?

image

This question will be answered by a negative binomial distribution and its prophet - Wikipedia. There are many Greek letters, if you, like me, are afraid of them, then I will tell you what happens. This distribution just answers the question of how many times it is necessary to roll the die so that the six falls five times. I use the R programming language for calculations, in which there is a ready-made function that allows you to evaluate the confidence interval.

qnbinom(p=c(.025,.975),size=3, prob=0.01)

Here p is 2.5% from below and 2.5% from above, between which the desired range is located.

The result is a confidence interval from 60 to 717. Not so bad! It is likely that the three dead do not mean at all 297 recovered, but only sixty! But maybe for seven hundred. :-(

For the most suspicious, who do not believe in a negative binomial distribution, I can offer numerical modeling. In general, if you don’t know how to calculate by formulas and distributions, model! In any incomprehensible situation, model, Monte Carlo is waiting for you.

We will write the random_infected function, which simulates a situation of disease and death.

random_infected <- function(deaths, fatality_rate)
{
  dead = 0
  all = 1
  while (dead < deaths) {
    if (runif(1) < fatality_rate) {
      all = all + 1
      dead = dead + 1
    } else
      all = all + 1
  }
  return(all)
}

This function does the following - rolls an “n-faced” cube (using uniform distribution). If one is dropped, then it increases the number of dead and the number all by one. And if it didn’t, then only the number all. Each roll of this die is a sick person who can either die or recover. As soon as we have the number of deaths specified by the deaths parameter, we stop and report how many times the die was rolled (the number is all). The likelihood of a one falling out on our imaginary cube is mortality, in our case the fatality_rate parameter.

infected_sizes<-replicate(100000,random_infected(deaths=3,fatality_rate=0.01))

And now let's calculate this number 100 thousand times. I have an old laptop, so I’m reluctant to wait until a million are counted.

After that, you can calculate the arithmetic mean of the numbers obtained. I got 301.2 - very similar to the expected number 300. This is how the distribution of the number of rolls of our death cube looks like:

library(ggplot2)
theme_set(theme_classic())

g <- ggplot(data.frame(infected_sizes=infected_sizes), aes(infected_sizes))
g + geom_density(alpha=0.8,fill="plum")

image

Here it is - a negative binomial distribution, please love and favor. Based on such data, it is possible to give approximate answers to questions - what is the probability that the total number of cases is less than fifty (1.2%) or more than 1000 (0.3%).

Of course, these are just estimates. They are based on data that may be incorrect. We do not know about the true mortality of the coronavirus. But the lower this mortality rate, the more cases of disease occur in one deceased and the greater the estimate of the extent of the pandemic.

Let me remind you that we roll this die instantly. To the model of calculating mortality, which was used in the sensational article by Thomas Pueyo, I have a small complaint. There, we assume that, based on the 3 deaths on day X, a mortality rate of 1% and the knowledge that the average time between infection and death is 17 days, 300 people were infected on day X-17. However, such a calculation is valid only if the number of sick people is the same every day. Since 17 days is not a strict number, it also has confidence intervals and errors. If we have a rapid increase in the number of cases, then among those who died on day X, we have a certain number of people infected not 17 days ago, but 16 or 15 days, and maybe 10 days ago. Perhaps there are even more of them than those who became infected 17 days ago. In this way,in a situation of a rapid increase in the number of cases, such a return calculation can lead to overestimated estimates of the prevalence of the disease. In general, everything is complicated.

PS Thanks to Gregory Demin for a hint about the type of distribution.

All Articles