About implementing a deep learning library in Python

Deep learning technologies have come a long way in a short period of time, from simple neural networks to fairly complex architectures. To support the rapid spread of these technologies, various deep learning libraries and platforms have been developed. One of the main goals of such libraries is to provide developers with simple interfaces for creating and training neural network models. They let their users focus on the problems being solved rather than on the subtleties of model implementation. To achieve this, the implementation of the basic mechanisms has to be hidden behind several levels of abstraction, which, in turn, makes it harder to understand the basic principles on which deep learning libraries are built.



The article we are publishing in translation is aimed at analyzing how the low-level building blocks of deep learning libraries work. First, we briefly discuss the essence of deep learning; this will let us understand the functional requirements for the corresponding software. Then we walk through the development of a simple but working deep learning library in Python using NumPy. This library is capable of end-to-end training of simple neural network models. Along the way, we'll talk about the various components of deep learning frameworks. The library we will be looking at is quite small, less than 100 lines of code, which means it is fairly easy to understand. The full project code, which we will be working with, can be found here.

General information


Typically, deep learning libraries (such as TensorFlow and PyTorch) consist of the components shown in the following figure.


Components of the deep learning framework

Let's analyze these components.

▍ Operators


The terms “operator” and “layer” are usually used interchangeably. These are the basic building blocks of any neural network. Operators are vector functions that transform data. Frequently used operators include linear and convolution layers, pooling (sub-sampling) layers, and the ReLU (rectified linear unit) and sigmoid activation functions.
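
For intuition: an operator is just a vector function that maps an input array to an output array. For example, a linear layer and the ReLU activation can be written in NumPy roughly as follows (a simplified sketch that ignores gradients for now; the library versions of these operators appear later):

import numpy as np

def linear(x, W, b):
    # affine transformation: y = x*W + b
    return np.dot(x, W) + b

def relu(x):
    # element-wise rectified linear unit: max(0, x)
    return np.maximum(0, x)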

▍ Optimizers


Optimizers are the foundation of deep learning libraries. They define how model parameters are adjusted using their gradients, guided by the optimization objective. Well-known optimizers include SGD, RMSProp, and Adam.

▍ Loss functions


Loss functions are analytic and differentiable mathematical expressions that are used as a surrogate for the optimization goal of the problem being solved. For example, the cross-entropy loss and the hinge (piecewise linear) loss are commonly used in classification problems.
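
As an illustration, the cross-entropy loss over softmax probabilities can be computed in NumPy roughly as follows (a simplified sketch that assumes integer class labels; it shows only the forward computation, while a loss layer in the library also needs a backward pass):

import numpy as np

def softmax_cross_entropy(logits, labels):
    # logits: (batch, num_classes), labels: (batch,) integer class indices
    shifted = logits - np.max(logits, axis=1, keepdims=True)   # for numerical stability
    probs   = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))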

▍ Initializers


Initializers provide the initial values for model parameters, that is, the values the parameters have at the start of training. Initialization plays an important role in training neural networks, since poor initial parameters may mean that the network learns slowly or does not learn at all. There are many ways to initialize the weights of a neural network. For example, you can assign them small random values drawn from a normal distribution. Here is a page where you can learn about the different types of initializers.
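
For example, the small-random-normal scheme just mentioned (and used later in the Model class) takes a single line of NumPy; a common refinement is to scale the standard deviation by the layer's fan-in, as in He initialization:

import numpy as np

in_nodes, out_nodes = 2, 100

# small random values from a normal distribution
W = 0.01 * np.random.randn(in_nodes, out_nodes)

# He initialization: scale by sqrt(2 / fan_in), which often works better for ReLU networks
W_he = np.random.randn(in_nodes, out_nodes) * np.sqrt(2.0 / in_nodes)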

▍ Regularizers


Regularizers are tools that help prevent overfitting and help the network generalize. Overfitting can be fought in explicit or implicit ways. Explicit methods impose structural constraints on the weights, for example by minimizing their L1 norm or L2 norm, which makes the weight values sparser or more evenly distributed, respectively. Implicit methods are specialized operators that transform intermediate representations, either through explicit normalization, as in the batch normalization (BatchNorm) technique, or by changing the network connectivity, as in the DropOut and DropConnect algorithms.
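
As a sketch of the explicit approach: L2 regularization adds a penalty proportional to the squared L2 norm of the weights to the loss, which amounts to adding a term proportional to the weights themselves to their gradient. This is essentially what the weight_decay term in the SGD optimizer shown later does (up to a constant factor):

import numpy as np

def l2_penalty(W, lam):
    # contribution of L2 regularization to the loss
    return lam * np.sum(W ** 2)

def l2_penalty_grad(W, lam):
    # its contribution to the gradient with respect to W
    return 2.0 * lam * W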

The components described above usually belong to the interface part of the library. Here, by the “interface part” I mean the entities the user can interact with. They give the user convenient tools for efficiently designing a neural network architecture. As for the internal mechanisms of such libraries, they provide support for automatically computing the gradients of the loss function with respect to the various parameters of the model. This technique is commonly called automatic differentiation (AD).

Automatic differentiation


Every deep learning library provides the user with some automatic differentiation capabilities. This lets the user focus on describing the structure of the model (the computational graph) and hand the task of computing gradients over to the AD module. Let's work through an example to see how it all works. Suppose we want to compute the partial derivatives of the following function with respect to its input variables x₁ and x₂:

y = sin(x₁) + x₁ * x₂

The following figure, which I borrowed from here, shows the computational graph and the calculation of the derivatives using the chain rule.


Computational graph and calculation of derivatives by a chain rule

What you see here is the “reverse mode” of automatic differentiation. The well-known backpropagation algorithm is a special case of this algorithm for the case where the function at the top is a loss function. AD exploits the fact that any complex function is composed of elementary arithmetic operations and elementary functions; as a result, its derivatives can be computed by applying the chain rule to those operations.
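
The same result is easy to check numerically. For y = sin(x₁) + x₁ * x₂, the chain rule gives ∂y/∂x₁ = cos(x₁) + x₂ and ∂y/∂x₂ = x₁, which a few lines of NumPy confirm via finite differences:

import numpy as np

x1, x2 = 0.5, 2.0
eps = 1e-6

y = np.sin(x1) + x1 * x2

# analytical derivatives obtained with the chain rule
dy_dx1 = np.cos(x1) + x2   # ~2.8776
dy_dx2 = x1                # 0.5

# numerical check with finite differences
dy_dx1_num = (np.sin(x1 + eps) + (x1 + eps) * x2 - y) / eps
dy_dx2_num = (np.sin(x1) + x1 * (x2 + eps) - y) / eps

print(dy_dx1, dy_dx1_num)
print(dy_dx2, dy_dx2_num)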

Implementation


In the previous section, we examined the components needed to create a deep learning library capable of building and end-to-end training of neural networks. To keep the example simple, I imitate the design pattern of the Caffe library here. We declare two abstract classes, Function and Optimizer. In addition, there is a Tensor class, a simple structure containing two multidimensional NumPy arrays: one stores parameter values, the other stores their gradients. All parameters in the various layers (operators) will be of type Tensor. Before we go any further, take a look at the general outline of the library.


Library UML diagram
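
The Tensor structure mentioned above can be as simple as the following sketch (the exact definition lives in the project code; here I only assume that it holds a value array and a gradient array of the same shape):

import numpy as np

class Tensor(object):
    def __init__(self, shape):
        self.data = np.zeros(shape, dtype=np.float32)   # parameter values
        self.grad = np.zeros(shape, dtype=np.float32)   # gradients of the loss with respect to them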

At the time of writing, the library contains implementations of the linear layer, the ReLU activation function, the SoftmaxWithLoss layer, and the SGD optimizer. As a result, the library can be used to train classification models consisting of fully connected layers with a nonlinear activation function. Now let's look at some details of the abstract classes we have.

The abstract class Function provides an interface for operators. Here is its code:

class Function(object):
    def forward(self): 
        raise NotImplementedError
    
    def backward(self): 
        raise NotImplementedError
    
    def getParams(self): 
        return []

All operators are implemented by inheriting from the abstract class Function. Each operator must provide implementations of the forward() and backward() methods. Operators may also implement the optional getParams() method, which returns their parameters (if any). The forward() method receives input data and returns the result of transforming it with the operator; it also takes care of the internal bookkeeping needed to compute gradients. The backward() method accepts the partial derivatives of the loss function with respect to the operator's outputs and computes the partial derivatives of the loss function with respect to the operator's input data and parameters (if any). Note that the backward() method is, in essence, what gives our library the ability to perform automatic differentiation.

To see all of this in a concrete example, let's take a look at the implementation of the Linear operator:

class Linear(Function):
    def __init__(self,in_nodes,out_nodes):
        self.weights = Tensor((in_nodes,out_nodes))
        self.bias    = Tensor((1,out_nodes))
        self.type = 'linear'

    def forward(self,x):
        output = np.dot(x,self.weights.data)+self.bias.data
        self.input = x 
        return output

    def backward(self,d_y):
        self.weights.grad += np.dot(self.input.T,d_y)
        self.bias.grad    += np.sum(d_y,axis=0,keepdims=True)
        grad_input         = np.dot(d_y,self.weights.data.T)
        return grad_input

    def getParams(self):
        return [self.weights,self.bias]

The forward() method implements the transformation Y = X*W + b and returns the result. In addition, it saves the input value X, since it is needed in backward() to compute the gradients. The backward() method receives d_y, the partial derivatives of the loss function with respect to the output Y, and computes the partial derivatives with respect to the input X and the parameters W and b. It returns the partial derivatives computed with respect to the input X, which are passed on to the previous layer.
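
For comparison, the ReLU activation mentioned earlier fits the same Function interface. Here is a minimal sketch of how it could be implemented (the actual project code may differ in details, for example in the value of the type attribute):

class ReLU(Function):
    def __init__(self):
        self.type = 'relu'

    def forward(self, x):
        self.mask = x > 0      # remember which inputs were positive
        return x * self.mask

    def backward(self, d_y):
        # the gradient passes through only where the input was positive
        return d_y * self.mask

Since ReLU has no parameters, the default getParams() inherited from Function, which returns an empty list, is enough.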

The abstract class Optimizer provides an interface for optimizers:

class Optimizer(object):
    def __init__(self,parameters):
        self.parameters = parameters
    
    def step(self): 
        raise NotImplementedError

    def zeroGrad(self):
        for p in self.parameters:
            p.grad = 0.

All optimizers are implemented by inheriting from the base class Optimizer. A class describing a particular optimization algorithm must provide an implementation of the step() method. This method updates the model parameters using their partial derivatives with respect to the loss function being optimized. References to the model parameters are passed to the __init__() function. Note that the common functionality for resetting gradient values is implemented in the base class itself.

Now, to understand this better, let's consider a concrete example: the implementation of the stochastic gradient descent (SGD) algorithm with support for momentum and weight decay:

class SGD(Optimizer):
    def __init__(self,parameters,lr=.001,weight_decay=0.0,momentum = .9):
        super().__init__(parameters)
        self.lr           = lr
        self.weight_decay = weight_decay
        self.momentum     = momentum
        self.velocity     = []
        for p in parameters:
            self.velocity.append(np.zeros_like(p.grad))

    def step(self):
        for i,p in enumerate(self.parameters):
            # store the updated velocity so that momentum accumulates across steps
            self.velocity[i] = self.momentum*self.velocity[i]+p.grad+self.weight_decay*p.data
            p.data           = p.data-self.lr*self.velocity[i]

Solving a real problem


Now we have everything we need to train a (deep) neural network model using our library. To do this, we need the following entities:

  • Model: the computational graph.
  • Data and target values: the data used to train the network.
  • Loss function: a surrogate for the optimization goal.
  • Optimizer: the mechanism for updating the model parameters.

The following pseudocode describes a typical training loop:

model                               # the computational graph
data,target                         # data for training the network
loss_fn                             # the loss function
optim                               # the optimizer that updates the model parameters
Repeat:                             # until convergence or for a fixed number of iterations
   optim.zeroGrad()                 # reset the parameter gradients
   output = model.forward(data)     # forward pass
   loss   = loss_fn(output,target)  # compute the loss
   grad   = loss.backward()         # gradient of the loss with respect to the model output
   model.backward(grad)             # backward pass: compute parameter gradients
   optim.step()                     # update the model parameters

Although this is not strictly necessary in a deep learning library, it can be useful to put the functionality above into a separate class. This saves us from repeating the same steps every time we train a new model (an idea that matches the philosophy of high-level abstractions in frameworks like Keras). To achieve this, we declare a Model class:

class Model():
    def __init__(self):
        self.computation_graph = []
        self.parameters        = []

    def add(self,layer):
        self.computation_graph.append(layer)
        self.parameters+=layer.getParams()

    def __innitializeNetwork(self):
        for f in self.computation_graph:
            if f.type=='linear':
                weights,bias = f.getParams()
                weights.data = .01*np.random.randn(weights.data.shape[0],weights.data.shape[1])
                bias.data    = 0.

    def fit(self,data,target,batch_size,num_epochs,optimizer,loss_fn):
        loss_history = []
        self.__innitializeNetwork()
        data_gen = DataGenerator(data,target,batch_size)
        itr = 0
        for epoch in range(num_epochs):
            for X,Y in data_gen:
                optimizer.zeroGrad()
                for f in self.computation_graph: X=f.forward(X)
                loss = loss_fn.forward(X,Y)
                grad = loss_fn.backward()
                for f in self.computation_graph[::-1]: grad = f.backward(grad) 
                loss_history+=[loss]
                print("Loss at epoch = {} and iteration = {}: {}".format(epoch,itr,loss_history[-1]))
                itr+=1
                optimizer.step()
        
        return loss_history
    
    def predict(self,data):
        X = data
        for f in self.computation_graph: X = f.forward(X)
        return X

This class includes the following functionality:

  • Computational graph: the add() method adds operators (layers) to the model, appending them to computation_graph and collecting their parameters.
  • Parameter initialization: before training, the weights of linear layers are initialized with small random values from a normal distribution, and the biases are set to zero.
  • Model training: the fit() method trains the model on the given data for the specified number of epochs and returns the loss history.
  • Prediction: the predict() method runs a forward pass of the trained model on the given data and returns its outputs.

Since this class is not a basic building block of deep learning systems, I implemented it in a separate module, utilities.py. Note that the fit() method uses the DataGenerator class, whose implementation lives in the same module. This class is just a wrapper around the training data that generates mini-batches for each training iteration.
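
For completeness, a minimal DataGenerator could look roughly like this (a sketch of the idea; the actual implementation in utilities.py may differ in details such as shuffling):

import numpy as np

class DataGenerator(object):
    def __init__(self, data, target, batch_size):
        self.data       = data
        self.target     = target
        self.batch_size = batch_size

    def __iter__(self):
        # yield successive mini-batches of the training data in random order
        indices = np.random.permutation(len(self.data))
        for start in range(0, len(self.data), self.batch_size):
            batch = indices[start:start + self.batch_size]
            yield self.data[batch], self.target[batch]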

Model training


Now let's consider the last piece of code, which trains a neural network model using the library described above. I am going to train a multilayer network on data arranged in a spiral; I was inspired by this publication. The code for generating and visualizing this data can be found in the file utilities.py.


Data with three classes arranged in a spiral. 

The previous figure shows a visualization of the data on which we will train the model. This data is not linearly separable, so we can hope that a network with a hidden layer will correctly find the nonlinear decision boundaries. Putting together everything we have talked about, we get the following code fragment, which trains the model:

import numpy as np
import dl_numpy as DL
import utilities

batch_size        = 20
num_epochs        = 200
samples_per_class = 100
num_classes       = 3
hidden_units      = 100
data,target       = utilities.genSpiralData(samples_per_class,num_classes)
model             = utilities.Model()
model.add(DL.Linear(2,hidden_units))
model.add(DL.ReLU())
model.add(DL.Linear(hidden_units,num_classes))
optim   = DL.SGD(model.parameters,lr=1.0,weight_decay=0.001,momentum=.9)
loss_fn = DL.SoftmaxWithLoss()
model.fit(data,target,batch_size,num_epochs,optim,loss_fn)
predicted_labels = np.argmax(model.predict(data),axis=1)
accuracy         = np.sum(predicted_labels==target)/len(target)
print("Model Accuracy = {}".format(accuracy))
utilities.plot2DDataWithDecisionBoundary(data,target,model)

The image below shows the same data and the decision boundaries of the trained model.


Data and decision boundaries of the trained model

Summary


As deep learning models become more complex, the capabilities of the corresponding libraries tend to grow, along with the amount of code needed to implement those capabilities. But the most basic functionality of such libraries can still be implemented in a relatively compact form. Although the library we created can be used for end-to-end training of simple networks, it is still limited in many ways. It lacks the capabilities that let deep learning frameworks be used in areas such as computer vision, speech recognition, and text recognition. And, of course, the possibilities of such frameworks are not limited to that.

I believe anyone can fork the project whose code we examined here and, as an exercise, add to it whatever they would like to see in it. Here are some mechanisms you could try to implement yourself:

  • Operators: convolution, subsampling.
  • Optimizers: Adam, RMSProp.
  • Regularizers: BatchNorm, DropOut.

I hope this material has allowed you to get at least a glimpse of what happens under the hood of deep learning libraries.

Dear readers! What deep learning libraries do you use?

Source: https://habr.com/ru/post/undefined/

