Regularization? Orthogonalization! Improving compact networks


While other companies are discussing how to manage teams remotely, we at Smart Engines continue to share our technology stack with you. Today we talk about optimizing neural networks. Building a recognition system based on neural networks that runs quickly on smartphones and other mobile devices is extremely difficult, and keeping its quality high is even harder. In this article we describe a simple regularization method that we use at Smart Engines to improve the quality of "mobile" networks with a small number of parameters. The idea of the method is to gradually reduce the linear dependence between the filters of the convolutional layers during training, so that each neuron works more efficiently and the generalization ability of the model improves. To do this, we represent the filters as one-dimensional vectors and orthogonalize the pair with the largest projection onto each other.

Most modern neural networks are designed with the assumption that they will run somewhere on a remote server, and that the data for processing will be sent there from a client on a PC or mobile device. However, this approach is unacceptable when the security of personal data is at stake, data you do not want to hand over to anyone (for example, a photo of a passport or of a bank card submitted for recognition). Fortunately for us, today's mobile devices are powerful enough to run neural networks, so sending data to third parties can be avoided. The catch is that these networks must be small and require few operations, so as not to test the user's patience. Such constraints limit the maximum achievable quality, and how to improve lightweight networks without sacrificing runtime is an open question. Reflecting on this, we came up with a new regularization method aimed at compact networks, which consists in orthogonalizing the convolutional filters.

This post is a short version of the report "Convolutional neural network weights regularization via orthogonalization", presented in November 2019 at the international conference ICMV 2019 in Amsterdam, the Netherlands.

The idea of regularization using orthogonalization


Since the proposed method is a form of regularization, let us briefly recall what regularization is. It consists in imposing certain constraints on the model based on our ideas about how the task should be solved, which increases the generalization ability of the network. For example, L1 regularization pushes some of the weights to zero and makes the network sparse, L2 keeps the coefficients small, Dropout removes the dependence on individual neurons, and so on. These methods are an integral part of training many modern networks, especially ones with a large number of parameters: regularization does a fairly good job of fighting overfitting.

Now back to our method. Let us say right away that we primarily consider the problem of image classification with a convolutional neural network. The assumption that led us to orthogonalization is the following: if the network is severely limited in the resources it has for learning patterns in the data, then every neuron in it must be made to work as efficiently as possible and to perform the function strictly assigned to it. In other words, it should capture features that no other neuron is able to detect. We solve this problem by gradually reducing the linear dependence between the neurons' weight vectors during training. To do this, we modified the classical orthogonalization algorithm, adapting it to the realities of the training process.

Convolution filter orthogonalization


Define the filters of a convolutional layer as a set of vectors $\{f_i^c\}_{i=1}^{N}$, where $c$ is the index of the convolutional layer and $N$ is the number of filters in it. After the weights have been updated during backpropagation of the error, in each individual convolutional layer we look for the pair of vectors with the maximum projection length onto each other:

$$(a, b) = \underset{g \ne k}{\arg\max} \left\lVert \mathrm{proj}_{f_k} f_g \right\rVert$$

The projection of the vector $f_g$ onto $f_k$ is calculated as $\mathrm{proj}_{f_k} f_g = \frac{\langle f_g, f_k \rangle}{\langle f_k, f_k \rangle} f_k$. Then, in order to orthogonalize the filters $f_a$ and $f_b$, we replace the first step of the Gram–Schmidt algorithm with the following formula:

$$f_a \leftarrow f_a - \eta \, w_{ort} \, \mathrm{proj}_{f_b} f_a,$$
where $\eta$ is the learning rate and $w_{ort}$ is the orthogonalization coefficient, whose value lies in the interval [0.0, 1.0]. The orthogonalization coefficient is introduced because an "instant" orthogonalization of the filters severely disrupts the learning process, undoing the systematic weight changes of past iterations. Small values of $w_{ort}$ preserve the training dynamics and smoothly reduce the linear dependence between the filters of each layer separately. Note one more important point of the method: in a single iteration we modify only one of the two vectors, so as not to interfere with the optimization algorithm.
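
The article does not include the implementation itself, but a minimal NumPy sketch of one such iteration might look like this (the function name, the (N, D) one-flattened-filter-per-row layout, and the small epsilon are our own assumptions):

```python
import numpy as np

def orthogonalization_step(filters, lr, w_ort):
    """One iteration: find the pair of filters with the largest mutual
    projection and nudge one of them towards orthogonality.

    filters -- 2-D float array of shape (N, D), one flattened filter per row
    lr      -- learning rate (eta)
    w_ort   -- orthogonalization coefficient from [0.0, 1.0]
    """
    dots = filters @ filters.T                             # <f_g, f_k> for all pairs
    norms_sq = np.diag(dots) + 1e-12                       # <f_k, f_k>, epsilon for safety
    proj_len = np.abs(dots) / np.sqrt(norms_sq)[None, :]   # |proj of row g onto row k|
    np.fill_diagonal(proj_len, -np.inf)                    # ignore a filter's projection onto itself
    a, b = np.unravel_index(np.argmax(proj_len), proj_len.shape)
    # Modified first Gram-Schmidt step: only filter a is changed in this iteration.
    filters[a] -= (lr * w_ort * dots[a, b] / norms_sq[b]) * filters[b]
    return filters
```

During training, such a step would be applied in every convolutional layer right after each regular weight update.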


Fig. Visualization of one iteration.

We consider orthogonalization of convolutional filters only, since in modern neural networks convolutional layers make up a large part of the architecture. However, the algorithm is easily generalized to the weights of neurons in fully connected layers.
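
For example, bringing the weights to the vector form used above might look like this (the shapes and the (filters, channels, height, width) weight layout are illustrative assumptions; frameworks store weights differently):

```python
import numpy as np

# Hypothetical layer weights, shapes chosen only for illustration.
conv_weights = np.random.randn(16, 8, 3, 3)   # (N filters, input channels, kH, kW)
fc_weights = np.random.randn(10, 128)         # (N neurons, inputs per neuron)

# Convolutional filters: flatten each filter into a one-dimensional vector.
conv_vectors = conv_weights.reshape(conv_weights.shape[0], -1)   # shape (16, 72)

# Fully connected weights are already one weight vector per neuron.
fc_vectors = fc_weights
```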

Experiments


Let us move from theory to practice. For the experiments we chose the two most popular datasets for evaluating neural networks in computer vision: MNIST (classification of handwritten digits) and CIFAR10 (photos of 10 classes: ships, trucks, horses, etc.).

Since we expect orthogonalization to be useful primarily for compact networks, we took a LeNet-like architecture in three modifications that differ in the number of filters in the convolutional layers. The architecture of our base network, which for convenience we will call LeNet 1.0, is shown in Table 1. The LeNet 2.0 and LeNet 3.5 architectures derived from it have 2 and 3.5 times more filters in the convolutional layers, respectively.

When choosing the activation function, we settled on ReLU not only because it is the most popular and computationally efficient function (remember, we are still talking about fast networks). The point is that activation functions that are not piecewise linear negate the effect of orthogonalization: for example, the hyperbolic tangent strongly distorts the input vectors, since it is highly nonlinear near its saturation regions.

Table 1. LeNet 1.0 network architecture used in the experiments.

| # | Layer type | Parameters | Activation function |
|---|------------|------------|---------------------|
| 1 | conv | 8 filters 3x3, stride 1x1, no padding | ReLU |
| 2 | conv | 16 filters 5x5, stride 2x2, padding 2x2 | ReLU |
| 3 | conv | 16 filters 3x3, stride 1x1, padding 1x1 | ReLU |
| 4 | conv | 32 filters 5x5, stride 2x2, padding 2x2 | ReLU |
| 5 | conv | 32 filters 3x3, stride 1x1, padding 1x1 | ReLU |
| 6 | conv | 32 filters 3x3, stride 1x1, padding 1x1 | ReLU |
| 7 | fully connected | 10 neurons | Softmax |
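
The article does not specify the framework used for training; purely as an illustration, the architecture in Table 1 maps onto PyTorch roughly as follows (the layer names, the use of nn.LazyLinear, and the explicit Softmax layer are our choices):

```python
import torch.nn as nn

def lenet_1_0(in_channels: int) -> nn.Sequential:
    """Sketch of LeNet 1.0 from Table 1; in_channels is 1 for MNIST and 3 for CIFAR10."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 8, kernel_size=3, stride=1, padding=0), nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(10),   # the flattened size depends on the input resolution
        nn.Softmax(dim=1),   # in practice usually folded into the loss function
    )
```

With these layer settings the flattened feature map has 32·7·7 = 1568 values for 28x28 MNIST inputs and 32·8·8 = 2048 values for 32x32 CIFAR10 inputs, which is why the final layer is left lazy in this sketch.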

We tried three values of the orthogonalization coefficient $w_{ort}$: 0.01, 0.05 and 0.1. Every experiment was run 10 times and the results were averaged (the standard deviation (std) of the error rate is given in the results tables). We also computed by how many percent the number of errors decreased relative to the baseline (benefit).
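
For example, for LeNet 1.0 on MNIST with $w_{ort} = 0.05$ (Table 2 below), the benefit is computed as

$$\text{benefit} = \frac{0.402\% - 0.36\%}{0.402\%} \cdot 100\% \approx 10.45\%.$$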

The experimental results confirmed that the fewer parameters the network has, the greater the improvement from orthogonalization (Tables 2 and 3). We also obtained an interesting result: applying orthogonalization to "heavy" networks degrades their quality.

Table 2. Experimental results for MNIST.

| MNIST | LeNet 1.0 (52k params) | | | LeNet 2.0 (179k params) | | | LeNet 3.5 (378k params) | | |
| | error rate | std | benefit | error rate | std | benefit | error rate | std | benefit |
| baseline | 0.402% | 0.033 | - | 0.366% | 0.026 | - | 0.361% | 0.028 | - |
| w_ort = 0.01 | 0.379% | 0.027 | 5.72% | 0.355% | 0.01 | 3.01% | 0.359% | 0.026 | 0.55% |
| w_ort = 0.05 | 0.36% | 0.022 | 10.45% | 0.354% | 0.018 | 3.28% | 0.356% | 0.034 | 1.39% |
| w_ort = 0.1 | 0.368% | 0.015 | 8.46% | 0.353% | 0.024 | 3.55% | 0.353% | 0.018 | 2.22% |

Table 3. Experimental results for CIFAR10.

| CIFAR10 | LeNet 1.0 (52k params) | | | LeNet 2.0 (179k params) | | | LeNet 3.5 (378k params) | | |
| | error rate | std | benefit | error rate | std | benefit | error rate | std | benefit |
| baseline | 22.09% | 0.65 | - | 18.49% | 1.01 | - | 17.08% | 0.47 | - |
| w_ort = 0.01 | 21.56% | 0.86 | 2.38% | 18.14% | 0.65 | 1.89% | 17.33% | 0.49 | -1.46% |
| w_ort = 0.05 | 21.59% | 0.48 | 2.24% | 18.30% | 0.57 | 1.03% | 17.59% | 0.31 | -3.02% |
| w_ort = 0.1 | 21.54% | 0.41 | 2.48% | 18.15% | 0.53 | 1.85% | 17.53% | 0.4 | -2.63% |

However, LeNet-style networks are rarely used nowadays; more modern models are the norm. Therefore, we also experimented with a ResNet model slimmed down in the number of filters and consisting of 25 convolutional layers. The first 7 layers contained 4 filters each, the next 12 had 8 filters each, and the last 6 had 16 filters each. The total number of trainable parameters of this model was 21 thousand. The result was similar: orthogonalization improves the quality of the network.


Fig. Comparison of ResNet learning dynamics on MNIST with and without orthogonalization.

Despite the quality improvements achieved, to be fully confident that the proposed method works correctly we need to look at the changes in the filters themselves. To do this, the maximum filter projection length in ResNet layers 2, 12 and 25 was recorded for every training epoch. The resulting dynamics are shown in the graphs below. The key observation is that the linear dependence of the filters decreases in all layers.


Fig. Dynamics of changes in the maximum projection length of filters in a convolutional layer using ResNet as an example.
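
A self-contained helper for this kind of monitoring might look as follows (again only a sketch with our own naming, assuming the same one-flattened-filter-per-row layout as above):

```python
import numpy as np

def max_projection_length(filters):
    """Largest pairwise projection length among flattened filters of shape (N, D);
    this is the per-layer value tracked over training epochs in the graphs above."""
    dots = filters @ filters.T
    norms = np.linalg.norm(filters, axis=1) + 1e-12
    proj_len = np.abs(dots) / norms[None, :]
    np.fill_diagonal(proj_len, -np.inf)   # exclude a filter's projection onto itself
    return float(proj_len.max())
```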

Regularization based on orthogonalization is extremely easy to implement: in Python with NumPy it takes fewer than 10 lines of code. At the same time, it does not slow down training and is compatible with other regularization methods.

Conclusion


Despite its simplicity, orthogonalization helps improve the quality of "lightweight" networks that are subject to restrictions on size and execution speed. With the spread of mobile technologies, such restrictions are becoming increasingly common: the neural network has to run not somewhere in the cloud, but directly on a device with a weak processor and little memory. Training such networks runs counter to current trends in neural network research, which actively relies on ensembles of models with millions of trainable parameters that no smartphone can handle. That is why, when solving industrial problems, it is so important to invent and develop methods for improving the quality of simple and fast networks.

List of sources used


Alexander V. Gayer, Alexander V. Sheshkus, “Convolutional neural network weights regularization via orthogonalization,” Proc. SPIE 11433, Twelfth International Conference on Machine Vision (ICMV 2019), 1143326 (January 31, 2020); https://doi.org/10.1117/12.2559346
