Deep learning has come a long way in a short time, from simple neural networks to fairly complex architectures. To support the rapid spread of these technologies, various libraries and deep learning platforms have been developed. One of the main goals of such libraries is to give developers simple interfaces for creating and training neural network models, so that users can focus on the task being solved rather than on the subtleties of model implementation. To achieve this, the implementation of the basic mechanisms often has to be hidden behind several levels of abstraction, which in turn makes it harder to understand the basic principles on which deep learning libraries are built.
This article, a translation of which we are publishing, examines how the low-level building blocks of deep learning libraries are designed. First, we briefly discuss the essence of deep learning, which helps us understand the functional requirements for the corresponding software. Then we look at the development of a simple but working deep learning library in Python using NumPy. This library can train simple neural network models end to end. Along the way, we'll talk about the various components of deep learning frameworks. The library we'll be considering is quite small, less than 100 lines of code, so it is easy to follow. The full project code can be found here.
General information
Typically, deep learning libraries (such as TensorFlow and PyTorch) consist of the components shown in the following figure.
Figure: Components of a deep learning framework
Let's analyze these components.
▍ Operators
The terms “operator” and “layer” are usually used interchangeably. These are the basic building blocks of any neural network. Operators are vector-valued functions that transform data. Frequently used operators include linear and convolution layers, pooling (subsampling) layers, and the ReLU (rectified linear) and sigmoid activation functions.
▍ Optimizers
Optimizers are the foundation of deep learning libraries. They describe how model parameters are adjusted according to certain criteria, taking the optimization goal into account. Well-known optimizers include SGD, RMSProp and Adam.
▍ Loss functions
Loss functions are analytic, differentiable mathematical expressions used as a surrogate for the optimization goal of the problem being solved. For example, cross-entropy loss and hinge loss (a piecewise linear function) are commonly used in classification problems.
▍ Initializers
Initializers provide initial values for model parameters. These are the values the parameters have at the beginning of training. Initializers play an important role in the training of neural networks, since poorly chosen initial parameters can cause the network to learn slowly or not at all. There are many ways to initialize the weights of a neural network; for example, you can assign them small random values drawn from a normal distribution. Here is a page where you can learn about the different types of initializers.
▍ Regularizers
Regularizers are tools that prevent the network from overfitting and help it generalize. Overfitting can be countered in explicit or implicit ways. Explicit methods impose structural constraints on the weights, for example by minimizing their L1 norm or L2 norm, which makes the weight values sparser or more evenly distributed, respectively. Implicit methods are specialized operators that transform intermediate representations, either through explicit normalization, for example batch normalization (BatchNorm), or by changing the network connectivity using the DropOut and DropConnect algorithms.
The above components usually belong to the interface part of the library. By “interface part” I mean the entities with which the user can interact. They give the user convenient tools for efficiently designing a neural network architecture. As for the internal mechanisms of the libraries, they provide support for automatically calculating the gradients of the loss function with respect to the various parameters of the model. This technique is commonly called automatic differentiation (AD).
Automatic differentiation
Each deep learning library provides the user with some automatic differentiation capabilities. This lets the user focus on describing the structure of the model (the computation graph) and hand the task of calculating gradients over to the AD module. Let's take an example to see how it all works. Suppose we want to calculate the partial derivatives of the following function with respect to its input variables x₁ and x₂:
y = sin(x₁) + x₁ * x₂
The following figure, which I borrowed from here, shows the computation graph and the calculation of derivatives using the chain rule.
Figure: Computational graph and calculation of derivatives by the chain rule
What you see here is the “reverse mode” of automatic differentiation. The well-known backpropagation algorithm is a special case of this algorithm in which the function at the top is a loss function. AD exploits the fact that any composite function consists of elementary arithmetic operations and elementary functions. As a result, derivatives can be computed by applying the chain rule to these operations.
Implementation
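Before moving on to the library code, the reverse-mode calculation for the example function y = sin(x₁) + x₁ * x₂ from the previous section can be traced by hand. The sketch below is only an illustration: the function names and the test point (2, 3) are my own choices, and the finite-difference helper is there purely as a sanity check.

```python
import math


def f(x1, x2):
    """The example function y = sin(x1) + x1 * x2."""
    return math.sin(x1) + x1 * x2


def f_with_grad(x1, x2):
    """Return y and both partial derivatives using a hand-written reverse pass."""
    # Forward pass: evaluate every node of the computation graph.
    a = math.sin(x1)   # a = sin(x1)
    b = x1 * x2        # b = x1 * x2
    y = a + b          # y = a + b

    # Backward pass: apply the chain rule from the output back to the inputs.
    dy = 1.0           # dy/dy
    da = dy * 1.0      # y = a + b, so dy/da = 1
    db = dy * 1.0      # y = a + b, so dy/db = 1
    # x1 feeds both a and b, so its two contributions are added.
    dx1 = da * math.cos(x1) + db * x2
    dx2 = db * x1
    return y, dx1, dx2


def numeric_grad(x1, x2, eps=1e-6):
    """Central finite differences, used only to check the result."""
    dx1 = (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps)
    dx2 = (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps)
    return dx1, dx2


y, dx1, dx2 = f_with_grad(2.0, 3.0)
n1, n2 = numeric_grad(2.0, 3.0)
print("reverse mode:", dx1, dx2)
print("finite diff: ", n1, n2)
```

Each operator in the library plays the role of one node in such a graph: its backward step multiplies the incoming derivative by the local derivative and passes the result on.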
In the previous section, we examined the components needed to build a deep learning library for creating and training neural networks end to end. To keep the example simple, I imitate the design pattern of the Caffe library here. We declare two abstract classes, Function and Optimizer. In addition, there is a class Tensor, which is a simple structure containing two multidimensional NumPy arrays: one stores parameter values, the other stores their gradients. All parameters in the various layers (operators) are of type Tensor. Before we go any further, take a look at the general outline of the library.
Figure: Library UML diagram
At the time of writing, the library contains implementations of the linear layer, the ReLU activation function, the SoftMaxLoss layer, and the SGD optimizer. As a result, the library can be used to train classification models that consist of fully connected layers and use a nonlinear activation function. Now let's look at some details of the abstract classes.
The abstract class Function provides an interface for operators. Here is its code:

class Function(object):
    def forward(self):
        raise NotImplementedError

    def backward(self):
        raise NotImplementedError

    def getParams(self):
        return []
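To make the interface more concrete, here is a sketch of the Tensor structure and of a parameter-free operator, ReLU. Neither is listed in this article, so the details are assumptions: the zero-initialized float32 arrays, the 'activation' type label, and the mask-based backward pass are my own illustration, not the library's actual code.

```python
import numpy as np


class Tensor:
    """Parameter container: one array for values, one for their gradients."""

    def __init__(self, shape):
        # Zero initialization is an assumption made for this sketch.
        self.data = np.zeros(shape, dtype=np.float32)
        self.grad = np.zeros(shape, dtype=np.float32)


class Function:
    """Operator interface, repeated here so the example runs on its own."""

    def forward(self):
        raise NotImplementedError

    def backward(self):
        raise NotImplementedError

    def getParams(self):
        return []


class ReLU(Function):
    """Element-wise max(0, x). It has no parameters, so getParams() is inherited."""

    def __init__(self):
        self.type = 'activation'  # hypothetical label

    def forward(self, x):
        self.mask = x > 0                    # remember where the input was positive
        return np.where(self.mask, x, 0.0)

    def backward(self, d_y):
        return d_y * self.mask               # gradient passes only through positive inputs


relu = ReLU()
x = np.array([[-1.0, 2.0], [3.0, -4.0]])
out = relu.forward(x)
grad_in = relu.backward(np.ones_like(x))
print(out)
print(grad_in)
```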
All operators are implemented by inheriting from the abstract class Function. Each operator must implement the methods forward() and backward(). Operators may also implement the optional method getParams(), which returns their parameters (if any). The method forward() receives input data and returns the result of the operator's transformation. It also takes care of any internal bookkeeping needed for calculating gradients. The method backward() accepts the partial derivatives of the loss function with respect to the operator's outputs and computes the partial derivatives of the loss function with respect to the operator's inputs and parameters (if any). Note that the method backward(), in essence, gives our library the ability to perform automatic differentiation.
To see all this in a specific example, let's take a look at the implementation of the operator Linear:

class Linear(Function):
    def __init__(self,in_nodes,out_nodes):
        self.weights = Tensor((in_nodes,out_nodes))
        self.bias = Tensor((1,out_nodes))
        self.type = 'linear'

    def forward(self,x):
        output = np.dot(x,self.weights.data)+self.bias.data
        self.input = x  # cached for the weight gradient in backward()
        return output

    def backward(self,d_y):
        # d_y holds the partial derivatives of the loss with respect to the output
        self.weights.grad += np.dot(self.input.T,d_y)
        self.bias.grad += np.sum(d_y,axis=0,keepdims=True)
        grad_input = np.dot(d_y,self.weights.data.T)
        return grad_input

    def getParams(self):
        return [self.weights,self.bias]
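One way to gain confidence in a backward() implementation like this one is a numerical gradient check: perturb each input or parameter slightly and compare the resulting change in a scalar loss with the analytic gradient. The sketch below repeats the Linear math in a stripped-down form so that it runs on its own. The float64 Tensor, the toy loss L = sum(Y), and the helper numeric_grad are assumptions made for this illustration.

```python
import numpy as np


class Tensor:
    def __init__(self, shape):
        self.data = np.zeros(shape)   # float64 keeps the finite differences accurate
        self.grad = np.zeros(shape)


class Linear:
    """Same forward/backward math as the Linear operator in the article."""

    def __init__(self, in_nodes, out_nodes):
        self.weights = Tensor((in_nodes, out_nodes))
        self.bias = Tensor((1, out_nodes))

    def forward(self, x):
        self.input = x
        return np.dot(x, self.weights.data) + self.bias.data

    def backward(self, d_y):
        self.weights.grad += np.dot(self.input.T, d_y)
        self.bias.grad += np.sum(d_y, axis=0, keepdims=True)
        return np.dot(d_y, self.weights.data.T)


def numeric_grad(loss, array, eps=1e-6):
    """Central-difference gradient of the scalar loss() for each entry of array."""
    grad = np.zeros_like(array)
    for idx in np.ndindex(*array.shape):
        old = array[idx]
        array[idx] = old + eps
        plus = loss()
        array[idx] = old - eps
        minus = loss()
        array[idx] = old              # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
layer = Linear(3, 2)
layer.weights.data[...] = rng.normal(size=(3, 2))
layer.bias.data[...] = rng.normal(size=(1, 2))
x = rng.normal(size=(4, 3))


def loss():
    # Toy scalar loss L = sum(Y), so dL/dY is an array of ones.
    return layer.forward(x).sum()


layer.forward(x)                                  # caches the input
grad_x = layer.backward(np.ones((4, 2)))          # analytic gradients
num_w = numeric_grad(loss, layer.weights.data)
num_b = numeric_grad(loss, layer.bias.data)
num_x = numeric_grad(loss, x)

print("weights match:", np.allclose(layer.weights.grad, num_w, atol=1e-5))
print("bias match:   ", np.allclose(layer.bias.grad, num_b, atol=1e-5))
print("input match:  ", np.allclose(grad_x, num_x, atol=1e-5))
```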
The method forward() implements the transformation Y = X*W + b and returns the result. In addition, it saves the input value X, since it is needed in the method backward() to calculate the gradient with respect to W. The method backward() receives the partial derivatives dY of the loss function with respect to the output value Y and computes the partial derivatives with respect to the input value X and the parameters W and b. It returns the partial derivatives with respect to the input value X, which are passed on to the previous layer.
The abstract class Optimizer provides an interface for optimizers:

class Optimizer(object):
    def __init__(self,parameters):
        self.parameters = parameters

    def step(self):
        raise NotImplementedError

    def zeroGrad(self):
        for p in self.parameters:
            p.grad = 0.
All optimizers are implemented by inheriting from the base class Optimizer. A class describing a particular optimization method must implement the method step(). This method updates the model parameters using the partial derivatives of the loss function being optimized with respect to those parameters. A reference to the model parameters is passed in through __init__(). Note that the common functionality for resetting gradient values is implemented in the base class itself.
Now, to better understand all this, consider a specific example: an implementation of the stochastic gradient descent (SGD) algorithm with support for momentum and weight decay:

class SGD(Optimizer):
    def __init__(self,parameters,lr=.001,weight_decay=0.0,momentum = .9):
        super().__init__(parameters)
        self.lr = lr
        self.weight_decay = weight_decay
        self.momentum = momentum
        self.velocity = []
        for p in parameters:
            self.velocity.append(np.zeros_like(p.grad))

    def step(self):
        for p,v in zip(self.parameters,self.velocity):
            # update v in place so that the stored velocity carries over to the next step
            v[...] = self.momentum*v+p.grad+self.weight_decay*p.data
            p.data=p.data-self.lr*v
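The update rule used here, v ← momentum·v + grad + weight_decay·w followed by w ← w − lr·v, can be tried out in isolation on a toy problem. The following is a minimal sketch under my own assumptions: Param is a stand-in for the library's Tensor, sgd_step is a hypothetical free function rather than the SGD class, and the objective f(w) = sum(w²) is chosen only because its gradient, 2w, is easy to write down.

```python
import numpy as np


class Param:
    """Stand-in for the library's Tensor: a value array and a gradient array."""

    def __init__(self, value):
        self.data = np.array(value, dtype=float)
        self.grad = np.zeros_like(self.data)


def sgd_step(params, velocity, lr=0.1, momentum=0.9, weight_decay=0.0):
    """One update: v <- momentum*v + grad + weight_decay*w, then w <- w - lr*v."""
    for p, v in zip(params, velocity):
        v *= momentum                         # in place, so the velocity persists
        v += p.grad + weight_decay * p.data
        p.data -= lr * v


w = Param([5.0, -3.0])
velocity = [np.zeros_like(w.data)]
for _ in range(500):
    w.grad = 2.0 * w.data                     # gradient of f(w) = sum(w**2)
    sgd_step([w], velocity)

print(w.data)                                 # ends up very close to the minimum at 0
```

The important detail is that the velocity arrays are modified in place. If the loop variable were simply rebound to a new array, the stored velocity would stay at zero and the momentum term would have no effect.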
Solving a real problem
Now we have everything needed to train a (deep) neural network model using our library. For this we need the following entities:
- Model: the computation graph.
- Data and target values: the data for training the network.
- Loss function: a surrogate for the optimization goal.
- Optimizer: a mechanism for updating the model parameters.
The following pseudocode describes a typical training cycle:

model
data,target
loss_fn
optim
Repeat:
    optim.zeroGrad()
    output = model.forward(data)
    loss = loss_fn.forward(output,target)
    grad = loss_fn.backward()
    model.backward(grad)
    optim.step()
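The loss function in this loop has to produce both the loss value and its gradient with respect to the model output. The article does not list the library's softmax loss, so the following is only a sketch. It assumes integer class labels, a mean over the batch, and a forward(scores, target) / backward() interface like the one that the library's fit() method calls.

```python
import numpy as np


class SoftmaxWithLoss:
    """Softmax followed by mean cross-entropy; targets are integer class labels."""

    def forward(self, scores, target):
        # Subtract the row maximum for numerical stability; softmax is unchanged.
        shifted = scores - scores.max(axis=1, keepdims=True)
        exp = np.exp(shifted)
        self.probs = exp / exp.sum(axis=1, keepdims=True)
        self.target = np.asarray(target, dtype=int)
        n = scores.shape[0]
        return -np.log(self.probs[np.arange(n), self.target]).mean()

    def backward(self):
        # d(loss)/d(scores) = (softmax - one_hot(target)) / batch_size
        n = self.probs.shape[0]
        grad = self.probs.copy()
        grad[np.arange(n), self.target] -= 1.0
        return grad / n


loss_fn = SoftmaxWithLoss()
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
target = np.array([0, 1])
loss = loss_fn.forward(scores, target)
grad = loss_fn.backward()
print(loss)
print(grad)
```

Each row of the gradient sums to zero, because the softmax probabilities sum to one and exactly one is subtracted from the entry of the correct class.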
Although this is not strictly necessary in a deep learning library, it can be useful to wrap the above functionality in a separate class. This spares us from repeating the same steps every time we train a new model (the idea is in line with the high-level abstractions of frameworks like Keras). To achieve this, we declare a class Model:

class Model():
    def __init__(self):
        self.computation_graph = []
        self.parameters = []

    def add(self,layer):
        self.computation_graph.append(layer)
        self.parameters+=layer.getParams()

    def __initializeNetwork(self):
        for f in self.computation_graph:
            if f.type=='linear':
                weights,bias = f.getParams()
                weights.data = .01*np.random.randn(weights.data.shape[0],weights.data.shape[1])
                bias.data = 0.

    def fit(self,data,target,batch_size,num_epochs,optimizer,loss_fn):
        loss_history = []
        self.__initializeNetwork()
        data_gen = DataGenerator(data,target,batch_size)
        itr = 0
        for epoch in range(num_epochs):
            for X,Y in data_gen:
                optimizer.zeroGrad()
                # forward pass through every layer
                for f in self.computation_graph: X=f.forward(X)
                loss = loss_fn.forward(X,Y)
                grad = loss_fn.backward()
                # backward pass through the layers in reverse order
                for f in self.computation_graph[::-1]: grad = f.backward(grad)
                loss_history+=[loss]
                print("Loss at epoch = {} and iteration = {}: {}".format(epoch,itr,loss_history[-1]))
                itr+=1
                optimizer.step()
        return loss_history

    def predict(self,data):
        X = data
        for f in self.computation_graph: X = f.forward(X)
        return X
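This class provides the following functionality:
- Computation graph: the method add() lets you describe a sequential model. Internally, the class simply stores the operators in a list named computation_graph.
- Parameter initialization: at the start of training, the class automatically initializes the weights of linear layers with small random values and sets the biases to zero.
- Model training: the method fit() provides a common interface for training models. It takes the training data, the optimizer, and the loss function.
- Model inference: the method predict() provides a common interface for making predictions with the trained model.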
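The method fit() iterates over a DataGenerator object, which is not listed in this article. A minimal sketch of such a mini-batch iterator might look as follows. Only the constructor arguments (data, target, batch_size) come from the call in fit(); the shuffle flag and everything else are assumptions made for this illustration.

```python
import numpy as np


class DataGenerator:
    """Hypothetical mini-batch iterator; reshuffles on every pass over the data."""

    def __init__(self, data, target, batch_size, shuffle=True):
        self.data = np.asarray(data)
        self.target = np.asarray(target)
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        n = self.data.shape[0]
        order = np.random.permutation(n) if self.shuffle else np.arange(n)
        for start in range(0, n, self.batch_size):
            idx = order[start:start + self.batch_size]
            yield self.data[idx], self.target[idx]


data = np.arange(10).reshape(5, 2)
target = np.arange(5)
gen = DataGenerator(data, target, batch_size=2, shuffle=False)
for X, Y in gen:
    print(X.shape, Y)
```

Because __iter__() is a generator function, every "for X, Y in gen" loop starts a fresh pass over the data, which is what a loop over several epochs needs.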
Since this class is not a basic building block of deep learning systems, I implemented it in a separate module, utilities.py. Note that the method fit() uses the class DataGenerator, whose implementation is in the same module. This class is just a wrapper around the training data that generates mini-batches for each training iteration.
Model training
Now consider the last piece of code, in which a neural network model is trained using the library described above. I am going to train a multilayer network on data arranged in a spiral. I was prompted by this publication. The code for generating and visualizing this data can be found in the file utilities.py.
Figure: Data with three classes arranged in a spiral
The previous figure shows the data on which we will train the model. The classes can only be separated by a nonlinear boundary. We can hope that a network with a hidden layer can find the nonlinear decision boundaries correctly. Putting together everything we have talked about gives the following code fragment, which trains the model:

import numpy as np
import dl_numpy as DL
import utilities
batch_size = 20
num_epochs = 200
samples_per_class = 100
num_classes = 3
hidden_units = 100
data,target = utilities.genSpiralData(samples_per_class,num_classes)
model = utilities.Model()
model.add(DL.Linear(2,hidden_units))
model.add(DL.ReLU())
model.add(DL.Linear(hidden_units,num_classes))
optim = DL.SGD(model.parameters,lr=1.0,weight_decay=0.001,momentum=.9)
loss_fn = DL.SoftmaxWithLoss()
model.fit(data,target,batch_size,num_epochs,optim,loss_fn)
predicted_labels = np.argmax(model.predict(data),axis=1)
accuracy = np.sum(predicted_labels==target)/len(target)
print("Model Accuracy = {}".format(accuracy))
utilities.plot2DDataWithDecisionBoundary(data,target,model)
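The script relies on utilities.genSpiralData, which is not listed in this article. For readers who want to experiment without the full project, here is a hypothetical stand-in that follows a common construction for this toy dataset: each class forms one arm of a spiral, with a little noise added to the angle. The function name, the noise and seed parameters, and the constants are assumptions, not the project's actual code.

```python
import numpy as np


def gen_spiral_data(samples_per_class, num_classes, noise=0.2, seed=0):
    """Return (data, target): 2-D points on spiral arms and their integer class labels."""
    rng = np.random.default_rng(seed)
    n = samples_per_class
    data = np.zeros((n * num_classes, 2))
    target = np.zeros(n * num_classes, dtype=int)
    for c in range(num_classes):
        rows = slice(c * n, (c + 1) * n)
        radius = np.linspace(0.0, 1.0, n)
        # Each class covers its own range of angles, jittered by Gaussian noise.
        angle = np.linspace(c * 4.0, (c + 1) * 4.0, n) + noise * rng.standard_normal(n)
        data[rows] = np.column_stack((radius * np.sin(angle), radius * np.cos(angle)))
        target[rows] = c
    return data, target


data, target = gen_spiral_data(100, 3)
print(data.shape, target.shape)
```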
The image below shows the same data and the decision boundaries of the trained model.
Figure: Data and decision boundaries of the trained model
Summary
As deep learning models grow more complex, the respective libraries tend to grow in both capabilities and the amount of code needed to implement them. But the most basic functionality of such libraries can still be implemented in a relatively compact form. Although the library we created can be used for end-to-end training of simple networks, it is still limited in many ways. It lacks the capabilities that make deep learning frameworks usable in areas such as computer vision, speech, and text processing, to name just a few.
I believe that anyone can fork the project whose code we examined here and, as an exercise, add whatever they would like to see in it. Here are some mechanisms you can try to implement yourself:
- Operators: convolution, subsampling.
- Optimizers: Adam, RMSProp.
- Regularizers: BatchNorm, DropOut.
I hope this material has given you at least a glimpse of what is happening under the hood of deep learning libraries.
Dear readers! What deep learning libraries do you use?