Differential privacy: comparing libraries



This article is a practical, applied comparison; it does not explain differential privacy in full detail. If you want to study the topic more deeply, see the links at the end of the article.

The promise of differential privacy (more precisely, ε-differential privacy) is to provide a measurable way to balance privacy and accuracy when publishing aggregate statistics computed over private datasets.

Let's say you have a settings slider:

  • move it to the left - you keep full privacy for the people whose data is in the dataset, but add a lot of noise to the information, reducing the accuracy of the aggregate statistics;
  • move it to the right - you get perfectly accurate aggregate statistics, but reveal private information about the people in the dataset.

The point is that the slider can be set anywhere on the scale, choosing a balance suited to a specific use case.

How the data is transformed


ε-differential privacy lets you strike a balance between privacy and accuracy using a positive value of ε (epsilon). If ε is small, more privacy is preserved but accuracy suffers. If ε is large, privacy is sacrificed for the sake of accuracy. The value of ε ranges from 0 to infinity.

Differential privacy libraries implement various mechanisms that take epsilon as an input parameter and add random noise, inversely proportional to ε, to the values in the original dataset. That is, the smaller the epsilon, the more noise is added.
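
As an illustration, here is a minimal sketch of the Laplace mechanism that such libraries commonly rely on. It is a simplified illustration, not code from any of the libraries compared below; numpy is assumed.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # The noise scale is sensitivity / epsilon: the smaller the epsilon,
    # the larger the scale and the noisier the released value.
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# The same query released with two different epsilon values:
print(laplace_mechanism(70.0, sensitivity=1.0, epsilon=0.1))   # very noisy
print(laplace_mechanism(70.0, sensitivity=1.0, epsilon=10.0))  # close to 70.0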

Some libraries take additional parameters and offer ways to control the random noise, for example the probability distribution from which the random numbers are drawn (Laplace distribution, normal distribution, and so on).

Some libraries also implement the concept of a privacy budget: each call to a function consumes part of a budget pre-allocated by the user. The theoretical basis is that every new release of data makes it more likely that an attacker can extract information about the people in the dataset. Once the budget is exhausted, the library may return an error instead of a value.
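
To make the idea concrete, here is a toy sketch of budget tracking. The class and its names are invented for illustration and do not belong to any of the libraries compared below.

class PrivacyBudget:
    """Toy privacy budget tracker: each released value consumes part of the budget."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            # Budget exhausted: refuse to release another value.
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)   # first release succeeds
budget.spend(0.5)   # second release succeeds, budget is now 0
# budget.spend(0.1) # would raise RuntimeError: no budget left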

Comparison of Differential Privacy Libraries


We will compare three libraries, running the same operations on the same dataset:

  • diffprivlib by IBM (Python);
  • differential-privacy by Google (C++);
  • diffpriv (R).

We will see how the size of the dataset and the desired level of privacy (epsilon) affect accuracy. In each case, we will compare the results obtained with the different libraries.

Test dataset


We will randomly generate a dataset containing a single column with people's weights in kilograms. We will treat this information as confidential: it must be kept secret. Each weight is represented as a double-precision real number.

The weights are generated from a normal distribution with a mean of 70 kg and a standard deviation of 30.

For the purposes of the study, we will make it possible to generate datasets of different sizes.

Here is the weight generation code:

import random

def generate_weight_dataset(dataset_size):
    outpath = f"weights_{dataset_size}.csv"
    mu = 70  # mean
    sigma = 30  # standard deviation
    with open(outpath, "w+") as fout:
        for i in range(dataset_size):
            random_weight = random.normalvariate(mu, sigma)  # normal distribution
            random_weight = abs(random_weight)  # make sure weights are positive
            random_weight = round(random_weight, 1)  # round to 1 decimal digit
            line = f"{random_weight}"
            print(line, file=fout)
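
For example, the 10,600-line dataset used below can be produced by calling the function above:

generate_weight_dataset(10600)  # writes weights_10600.csv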

For a dataset of size 10,600, the generated data will look like this:


The actual average should be around 70, because we used a normal distribution with mean=70. However, it is not exact, since the weights are randomly generated. In addition, there are no negative weights in our dataset, and the final values are rounded to one digit after the decimal point. In this case, the actual average was 70.34812570579449 and the standard deviation was 29.488380395675765.
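
These figures can be reproduced by reading the generated file back and computing the ordinary, non-private statistics (a small sketch; the exact numbers depend on the random generation):

import statistics

# Read the generated file back (name produced by generate_weight_dataset(10600)).
with open("weights_10600.csv") as fin:
    weights = [float(line) for line in fin]

print(statistics.mean(weights))   # ~70.35 for the dataset described above
print(statistics.stdev(weights))  # ~29.49

This weights list is also what the Python examples below pass to the DP mean.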

Note that we did all this just to make the dataset look like real data. The distribution of the values will not affect the measurements discussed below.

Using the libraries


Let's look at how the libraries are used. We will always compare them with the same parameters; for example, the input epsilon value is the same in every case. As the measured statistic we will use the mean of the dataset. All of the libraries considered implement operations such as mean, variance, and sum using noise-adding mechanisms, most often the Laplace mechanism. The mean was chosen arbitrarily: it lets us test all the libraries with the same mechanism.

We compute the differentially private (DP) mean using the IBM library (Python):

import diffprivlib.tools.utils

def dp_mean(weights, epsilon, upper):
    dp_mean = diffprivlib.tools.utils.mean(weights, epsilon=epsilon, range=upper)
    return dp_mean

upper = max(weights)
epsilon = 1.0
dpm = dp_mean(weights, epsilon, upper)

The same with the Google library (C++):

double dp_mean(std::list<double> weights, double epsilon, double lower, double upper) {
    std::unique_ptr<differential_privacy::BoundedMean<double>> mean_algorithm =
        differential_privacy::BoundedMean<double>::Builder()
            .SetEpsilon(epsilon)
            .SetLower(0.0)
            .SetUpper(upper)
            .Build()
            .ValueOrDie();

    for (auto weight : weights) {
        mean_algorithm->AddEntry(weight);
    }
    double privacy_budget = 1.0;
    auto output = mean_algorithm->PartialResult(privacy_budget).ValueOrDie();
    return differential_privacy::GetValue<double>(output);
}

Note that we spend the entire privacy budget of 1.0 on each call to the Google library, while the IBM library does not accept such a parameter.

Now we compute the DP mean using diffpriv (R):

library(diffpriv)

dp_mean <- function(xs, epsilon) {
  ## a target function we'd like to run on private data X, releasing the result
  target <- function(X) mean(X)

  ## target seeks to release a numeric, so we'll use the Laplace mechanism---a
  ## standard generic mechanism for privatizing numeric responses
  n <- length(xs)
  mech <- DPMechLaplace(target = target, sensitivity = 1/n, dims = 1)

  r <- releaseResponse(mech, privacyParams = DPParamsEps(epsilon = epsilon), X = xs)
  r$response
}

How privacy affects accuracy


The accuracy of the mean can be measured simply by computing the difference between the DP mean and the actual mean.

Let's see what happens to the accuracy of the mean weight as epsilon changes. Since random noise is added, we perform 100 runs and compute the mean squared error to see how strongly each library distorts the values.

We expect the error (the difference between the actual mean and the DP mean) to be inversely related to epsilon: if epsilon is large, the error should be small, and vice versa. I tested epsilon values ranging from e^-10 to e^10 (note the logarithmic scale).

The test used a dataset of an arbitrarily chosen constant size of 10,600 lines.
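
The measurement itself can be sketched as follows. It is shown for the IBM library and mirrors the dp_mean call above; the exact API may differ between diffprivlib versions, and the other two libraries are measured the same way.

import statistics
import numpy as np
import diffprivlib.tools.utils

with open("weights_10600.csv") as fin:
    weights = [float(line) for line in fin]

def mse_of_dp_mean(weights, epsilon, upper, runs=100):
    # Mean squared error of the DP mean against the true mean over `runs` runs.
    actual_mean = statistics.mean(weights)
    errors = [
        diffprivlib.tools.utils.mean(weights, epsilon=epsilon, range=upper) - actual_mean
        for _ in range(runs)
    ]
    return sum(e * e for e in errors) / runs

upper = max(weights)
for epsilon in np.exp(np.arange(-10.0, 11.0)):  # e^-10 ... e^10, logarithmic grid
    print(epsilon, mse_of_dp_mean(weights, epsilon, upper))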

Mean squared error



Indeed, the error decreases as epsilon increases. Interestingly, for very small epsilon values the error of the Google library levels off at a bound; this is not observed with the IBM and diffpriv libraries.

The reason lies in this snippet of the Google library's source code, in numerical-mechanisms.h:

  // Adds differentially private noise to a provided value. The privacy_budget
  // is multiplied with epsilon for this particular result. The privacy budget
  // should be in (0, 1], and is a way to divide an epsilon between multiple
  // values. For instance, to add noise to two different values with a given
  // epsilon, each could use a privacy budget of 0.5 (or 0.4 and 0.6, etc.).
  virtual double AddNoise(double result, double privacy_budget) {
    if (privacy_budget <= 0) {
      privacy_budget = std::numeric_limits<double>::min();
    }
    // Implements the snapping mechanism defined by Mironov
    // (2012, "On Significance of the Least Significant Bits for Differential Privacy").
    double noise = distro_->Sample(1.0 / privacy_budget);
    double noised_result =
        Clamp<double>(LowerBound<double>(), UpperBound<double>(), result) +
        noise;
    double nearest_power = GetNextPowerOfTwo(diversity_ / privacy_budget);
    double remainder =
        (nearest_power == 0.0) ? 0.0 : fmod(noised_result, nearest_power);
    double rounded_result = noised_result - remainder;
    return ClampDouble<double>(LowerBound<double>(), UpperBound<double>(),
                               rounded_result);
  }

In bounded-mean.h:

    // Normal case: add noise to the normalized sum and compute the mean.
    double normalized_sum = sum_mechanism_->AddNoise(
        sum - raw_count_ * midpoint_, remaining_budget);
    double average = normalized_sum / noised_count + midpoint_;
    AddToOutput<double>(&output, Clamp<double>(lower_, upper_, average));
    return output;

When epsilon is very small (roughly below 0.013), the diversity of the Laplace noise (equal to 1/epsilon) becomes extremely large, and the noised, normalized sum gets rounded to the nearest multiple of a huge power of two, which in practice means it becomes zero. As a result, the library returns a DP mean equal to the midpoint of the range used for the mean. This explains the flat start of the red line on the chart.
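
The effect of that rounding step can be reproduced with a simplified Python re-implementation of just the power-of-two snapping. This is an illustration of the snippet above, not the library's actual code, and the input numbers are made up.

import math

def next_power_of_two(x):
    # Simplified stand-in for GetNextPowerOfTwo from the snippet above.
    return 2.0 ** math.ceil(math.log2(x))

def snap(noised_result, diversity):
    nearest_power = next_power_of_two(diversity)
    remainder = math.fmod(noised_result, nearest_power)
    return noised_result - remainder

# With a tiny epsilon, the diversity (1 / epsilon) is huge, so a typical noised
# sum is smaller than the nearest power of two and snaps all the way to zero:
print(snap(370.4, diversity=1.0 / 0.001))  # 0.0 -> the mean collapses to the midpoint

# With a moderate epsilon, the rounding barely changes the result:
print(snap(370.4, diversity=1.0 / 1.0))    # ~370.0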

Diffpriv has a smaller mean squared error, which means better accuracy than the other two libraries at the same epsilon. To make sure diffpriv provides the same level of privacy as its competitors, it has to be used with a lower epsilon value.

Standard deviation


The standard deviation of the error for 100 runs looks like this:


With the Google library, the deviation is approximately constant for small epsilon values and then quickly catches up with the IBM library. In general, diffpriv has a smaller standard deviation than the others.

How dataset size affects accuracy


Here we are talking about the effect of dataset size on the accuracy of the DP mean.

One of the risks with a small dataset is that individual people have a large impact on the aggregate value, here the mean. In the extreme, if there is only one person in the dataset, a perfectly accurate mean equals that person's exact weight. Let's see how the libraries help avoid disclosing individual information as the dataset size changes.

We will use an arbitrary constant epsilon value of 1.0.
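
The corresponding experiment can be sketched the same way (again shown for the IBM library; the dataset sizes below are arbitrary examples rather than the exact grid behind the charts):

import random
import statistics
import diffprivlib.tools.utils

def generate_weights(dataset_size, mu=70, sigma=30):
    # Same generation scheme as at the beginning of the article, kept in memory.
    return [round(abs(random.normalvariate(mu, sigma)), 1) for _ in range(dataset_size)]

epsilon = 1.0
for size in (10, 100, 1000, 10000, 30000):
    weights = generate_weights(size)
    actual_mean = statistics.mean(weights)
    upper = max(weights)
    errors = [
        diffprivlib.tools.utils.mean(weights, epsilon=epsilon, range=upper) - actual_mean
        for _ in range(100)
    ]
    print(size, sum(e * e for e in errors) / len(errors))  # size, mean squared error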

Mean squared error



As the dataset size decreases, the mean squared error increases, and vice versa. This was expected: when there are few people, more noise has to be added to avoid disclosing private data. The downside is that for small datasets the aggregate values can become practically useless because of the very low accuracy.

With diffpriv, the error depends much more erratically on the dataset size. Still, the same pattern as in the other two libraries is visible: as the dataset grows, the mean squared error decreases. But this holds only for datasets up to about 30,000 lines; after that the error changes little. Also note the anomalous drop in the error at a dataset size of 17,912.

Standard deviation


You might ask how much the added noise varied across the 100 runs for a given dataset size. To answer this, let's plot the standard deviation of the error over 100 runs for each dataset size.


With small datasets, the variation in the error is higher. However, the Google library shows a peak at a dataset size of 6: both smaller and larger datasets fluctuate less. The IBM library does not show this; its graph is closer to linear.

Again, diffpriv generally has a lower standard deviation.

Conclusion


We saw that the measurements from three different differential privacy libraries match the initial expectations. However, different implementations lead to different results, especially in edge cases such as very small epsilon values with the Google library. In practice this should not be a problem, because users rarely work with very large or very small datasets or with very small epsilon values. Still, in such situations a library could issue a warning.

The behavior of diffpriv differs markedly from the other two libraries, so we recommend that you carefully choose the epsilon value.

References


