Confidential Machine Learning. PySyft Library

Habr, hello!


This article is about privacy-preserving machine learning. We will discuss why and how to protect users' privacy when training, for example, neural networks.


Most of this article retells talks and lectures by Andrew Trask, the leader of the OpenMined community: people united by the topic of privacy in machine learning. In particular, OpenMined develops the PySyft library, a wrapper over PyTorch, TensorFlow, or Keras for private machine learning. We will get to know PySyft over the course of this article.


Motivation


Suppose we want to build a classifier of human tumors. If we succeed, we will help millions of people. Our first step is to find a suitable dataset. The trouble is that such data is private and hard to obtain: most people do not want to talk publicly about their illnesses.


I'll clarify why data anonymization is not enough. In 2006, as part of a competition to build a better recommendation system, Netflix published about 100 million movie ratings from roughly 500,000 users. In the dataset, the names of people and films were replaced with identifiers. Nevertheless, researchers were able to re-identify individuals by cross-referencing the ratings with public IMDb data. More details are in the original paper.


So we need something more than anonymization. In the rest of this article, I will argue that it is possible to train neural networks on data we never get direct access to. Then privacy is preserved and we can build the tumor classifier. We will also be able to work on other diseases, such as dementia or depression. If we learn to work with private data in machine learning, we can tackle important global problems.


Remote Execution / Federated Learning


Suppose for a second that we are Apple. We want to make our services better, for example, to improve auto-completion. For that we need data: which words users type and in what order. We could download this data from iPhones and iPads, store it on the company's servers, and the dataset would be ready. But then we would violate privacy!


So here is our first idea: if the data cannot come to the model, the model goes to the data. We send the neural network to users' devices, train it locally on their data, and get the model back with updated weights. A side benefit of remote execution is that training can run in parallel on many devices at once.


PySyft is a Python library built for exactly this kind of private machine learning. It hooks into PyTorch, so the familiar tensor API keeps working, but tensors can live on remote workers while we operate on them through pointers. Let's see how it looks with Torch tensors.


# Import PyTorch and PySyft
import torch as th
import syft as sy

# Extend the standard PyTorch tensors with PySyft functionality
hook = sy.TorchHook(th)

# Create a "virtual" worker named bob; it imitates a separate remote machine.
bob = sy.VirtualWorker(hook, id="bob")

# Create tensors x and y and send them to bob. Locally we keep only pointers to the remote tensors.
x = th.tensor([1,2,3,4,5]).send(bob)
y = th.tensor([1,1,1,1,1]).send(bob)

# The addition is performed on bob's machine; z is a pointer to the result.
z = x + y

# Check which tensors bob is storing now
bob._objects
# {5102423178: tensor([1, 2, 3, 4, 5]),
#  6031624222: tensor([1, 1, 1, 1, 1]),
#  4479039083: tensor([2, 3, 4, 5, 6])}

# To get the result back, call .get(); the tensor is then removed from bob
z = z.get()
z
# tensor([2, 3, 4, 5, 6])

bob._objects
# {5102423178: tensor([1, 2, 3, 4, 5]), 
#  6031624222: tensor([1, 1, 1, 1, 1])}

Notice that after z.get() the result disappeared from bob's storage: we received the value, and bob no longer keeps it. That is the whole mechanics of remote execution. But is moving the computation to the data enough? Could the trained model itself still reveal something private about the data it was trained on?
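
To show how this scales from single operations to training, here is a minimal sketch of one round of training on bob's data, based on the same old PySyft (0.2.x) API as the snippet above; the toy model, data, and hyperparameters are made up for illustration.

# A toy model and some training data that "belongs" to bob
model = th.nn.Linear(2, 1)
data = th.tensor([[1., 1.], [0., 1.], [1., 0.], [0., 0.]]).send(bob)
target = th.tensor([[1.], [1.], [0.], [0.]]).send(bob)

# Send the model to bob and train it where the data lives
model = model.send(bob)
opt = th.optim.SGD(params=model.parameters(), lr=0.1)

for _ in range(10):
    opt.zero_grad()
    pred = model(data)                   # forward pass runs on bob's side
    loss = ((pred - target) ** 2).sum()  # loss is also just a pointer
    loss.backward()
    opt.step()

# Bring the updated weights back; the raw data never left bob
model = model.get()

In real federated learning this loop runs on many workers in parallel, and only the weight updates come back to be averaged.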


Differential Privacy


Even if training happens on the user's device, the trained model can still leak private information. For example, a bag-of-words model can effectively memorize sensitive pairs along the lines of {"name": "disease name"}. We need a way to limit such leaks. Differential Privacy is a family of techniques that bound how much the result of a computation can reveal about any single person in the dataset.


The classic illustration is a survey with a sensitive question, one that people will not answer honestly if asked directly. So we ask each respondent to flip a coin where nobody else can see the result. If it lands heads, they answer honestly. If it lands tails, they flip the coin a second time and answer 'yes' or 'no' depending on that second flip. Now no individual answer can be trusted, and that is exactly the point: it gives people cover to participate.


Suppose the second flip gives 'yes' or 'no' with probability 50/50. Then half of the answers are honest and half are pure noise, and the noise can be removed statistically. Say we survey many people and get 60% 'yes'. Of those observed answers, 35% are honest 'yes', 25% are random 'yes', 15% are honest 'no', and 25% are random 'no'. Scaling the honest half back up, the true distribution is 70% 'yes' and 30% 'no'. The aggregate statistics survive the noise.
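
The same arithmetic in code: a small simulation sketch (the 70% "true" share and the population size are made up for the example) that recovers the real statistics from the noisy answers.

import random

def randomized_response(true_answer):
    # First coin flip: on heads, answer honestly
    if random.random() < 0.5:
        return true_answer
    # On tails, flip again and answer randomly
    return random.random() < 0.5

# Simulate a population where the true share of 'yes' is 70%
n = 100_000
true_answers = [random.random() < 0.70 for _ in range(n)]
noisy_answers = [randomized_response(a) for a in true_answers]

observed_yes = sum(noisy_answers) / n        # ~0.60 = 0.5 * 0.70 + 0.25
estimated_yes = (observed_yes - 0.25) / 0.5
print(observed_yes, estimated_yes)           # ~0.60, ~0.70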


Each respondent thereby gets plausible deniability: any particular 'yes' may simply be the result of a coin flip, so no single answer can be held against anyone. At the same time, with enough respondents the noise averages out and we recover accurate statistics about the group. The price of privacy is noise, so the fewer the participants, the less accurate the estimate.


The same idea carries over to machine learning. Noise can be added to the data itself before it leaves the device, or to the result of a query or training procedure over the whole dataset (local and global differential privacy, respectively). In both cases the noise is calibrated so that the contribution of any single person stays hidden.
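
In code, the global variant often boils down to adding calibrated noise to the result of a query over the whole dataset. Here is a minimal sketch of the Laplace mechanism; the query, the epsilon value, and the toy data are illustrative, not a prescription.

import numpy as np

def private_count(values, threshold, epsilon=0.5):
    # Counting query: how many entries exceed the threshold.
    # Its sensitivity is 1: adding or removing one person
    # changes the true answer by at most 1.
    true_count = int(np.sum(np.array(values) > threshold))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 35, 41, 52, 29, 63, 47]
print(private_count(ages, threshold=40))  # true answer is 4, plus noise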



Let's sum up. With 'Remote Execution' we can train a model on data that we never download and never see directly. With Differential Privacy we can limit what the trained model and its outputs reveal about any individual. Together, these tools let us work with private data without betraying the people behind it.


And that is not the whole toolbox. There are further techniques, such as Secure Multi-Party Computation and Homomorphic Encryption, which make it possible to compute on encrypted data. Perhaps a topic for a separate article?
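
As a teaser, here is the core trick behind secure multi-party computation, additive secret sharing: a value is split into random shares so that no single party learns anything, yet the parties can add their shares locally and the sum still reconstructs correctly. This is a plain-Python sketch of the idea, not the PySyft API.

import random

Q = 2**31 - 1  # work modulo a large number so each share looks random

def share(secret, n_parties=2):
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

x_shares = share(5)
y_shares = share(3)

# Each party adds its own shares locally; nobody ever sees 5 or 3
z_shares = [(a + b) % Q for a, b in zip(x_shares, y_shares)]
print(reconstruct(z_shares))  # 8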


A couple of real-world notes:


  • Apple already does something like this in practice: models are improved using data on users' devices, and the raw data does not leave them.
  • The second idea, "Differential Privacy", is also used in production; Apple, for example, applies it when collecting usage statistics from devices.
