Object Detection. Recognize and Rule. Part 2


In a previous post, I started exploring two-stage Object Detection models and covered the most basic, and hence the first, of them: R-CNN. Today we look at the other models of this family: Fast R-CNN and Faster R-CNN. Let's go!

Fast R-CNN


Since R-CNN is slow and not very efficient, the same authors quickly proposed an improvement: the Fast R-CNN network.

The image-processing pipeline changed and now looks like this:

  1. Extract a feature map for the entire image (rather than for each hypothesis separately);
  2. Search for hypotheses (as in R-CNN, based on Selective Search);
  3. Map the hypothesis coordinates onto the feature map and extract a fixed-size representation for each hypothesis (the RoI layer);
  4. Classify each hypothesis and refine its bounding-box coordinates (with a SoftMax layer instead of SVM classifiers).

RoI layer


In the original R-CNN concept, each proposed hypothesis is processed by the CNN individually - this approach became the bottleneck. To solve this problem, the Region of Interest (RoI) layer was developed. This layer allows the whole image to be processed by the neural network once, yielding a feature map that is then used to process each hypothesis.

The main task of the RoI layer is to map the coordinates of a hypothesis (its bounding box) to the corresponding coordinates of the feature map. Taking a "slice" of the feature map, the RoI layer feeds it to the fully connected layer for the subsequent class prediction and coordinate corrections (see the following sections).

A logical question arises: how can hypotheses of different sizes and aspect ratios be fed into a fully connected layer? This is what the RoI layer is for: it converts a region of size $I_h \times I_w$ into one of size $O_h \times O_w$. To do this, the original region is divided into a grid of $O_h \times O_w$ cells (each of size approximately $\frac{I_h}{O_h} \times \frac{I_w}{O_w}$), and the maximum value is taken from each cell.

Suppose we have a 5 × 5 feature map and the desired hypothesis on this map has the coordinates (1, 1, 4, 5) (the first two coordinates are the upper-left corner, the last two the lower-right). The subsequent fully connected layer expects an input of dimension 4 × 1 (i.e., a flattened 2 × 2 matrix). Then we divide the hypothesis into unequal blocks of different dimensions (the Pooling stage) and take the maximum value in each of them (producing the Output stage).
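To make the mechanics concrete, here is a minimal NumPy sketch of RoI max pooling, assuming the simple even grid split described above (real implementations differ in how they round cell boundaries):

```python
import numpy as np

def roi_max_pool(feature_map, box, out_h, out_w):
    """Max-pool the RoI `box` = (x1, y1, x2, y2) of `feature_map`
    into a fixed (out_h, out_w) grid, as the RoI layer does."""
    x1, y1, x2, y2 = box
    roi = feature_map[y1:y2, x1:x2]          # slice of the feature map
    h, w = roi.shape
    # Split the RoI into a roughly even out_h x out_w grid of cells;
    # cells may differ in size when h, w are not evenly divisible.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w), feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = roi[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

# The 5x5 example from the text: hypothesis (1, 1, 4, 5) pooled to 2x2.
fmap = np.arange(25, dtype=float).reshape(5, 5)
print(roi_max_pool(fmap, box=(1, 1, 4, 5), out_h=2, out_w=2))
```

On the 5 × 5 example this produces a 2 × 2 output, which is then flattened into the 4 × 1 vector the fully connected layer expects.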


Thus, it becomes possible to process the whole image once, and then work with each hypothesis on the basis of the resulting feature map.

Total:

  • Input: coordinates of the hypothesis and a map of features of the original image;
  • Output: vector representation of the hypothesis.

Fully connected layer and its outputs


In the previous version, R-CNN used separate SVM classifiers; in this implementation they are replaced with a single SoftMax output of dimension $N_c + 1$ (the $N_c$ object classes plus background). The authors note that the accuracy loss from this replacement is less than 1%.

The output of the regressors is processed using NMS (Non-Maximum Suppression).
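For reference, here is a minimal NumPy sketch of greedy NMS; the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions for illustration, not the exact implementation from the papers:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every box that overlaps it by more than `iou_thresh`, repeat."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```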

Total:

  • Input: vector representation of the hypothesis;
  • Output: probabilities of hypothesis belonging to classes and corrections to the coordinates of the bounding box.

Multi-task loss


For joint training of the network on both tasks - bounding-box regression and classification - a combined loss function is used:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v)$$


Here:

  • $\lambda$ adjusts the balance between the two terms (the authors used $\lambda = 1$);
  • $u$ is the correct class;
  • $L_{cls}$ is the classification loss: $L_{cls}(p, u) = -\log p_u$;
  • $L_{loc}$ is a Smooth L1 function and measures the difference between $v = (v_x, v_y, v_w, v_h)$ and $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$:

    $$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$


    Here $x$ denotes the difference between the prediction and the target value, $t^u_i - v_i$. Such a function combines the advantages of the L1 and L2 losses: it is robust to large values of $x$ (unlike L2) and does not over-penalize small errors (behaving like L2 near zero).
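As an illustration, here is a small NumPy sketch of the multi-task loss above; the argument names are mine, and the summation over the four box corrections follows the formula rather than any particular implementation:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero (like L2), linear for |x| >= 1 (like L1)."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p[u] + lam * [u >= 1] * sum(smooth_l1(t^u - v)).
    `p` - class probabilities (SoftMax output), `u` - true class (0 = background),
    `t_u` - predicted box corrections for class u, `v` - target corrections."""
    l_cls = -np.log(p[u])
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```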


Training


For better convergence, the authors used the following approach to forming batches:

  1. The number of hypotheses per batch, $R$, is chosen.
  2. $N$ images are selected at random.
  3. From each of the $N$ images, $R/N$ hypotheses are taken (i.e., evenly across the images).

The $R$ hypotheses include both positive (25% of the whole batch) and negative (75% of the whole batch) examples. Hypotheses that overlap the ground-truth location of an object by more than 0.5 IoU are considered positive. Negative ones are taken according to a Hard Negative Mining rule - the most erroneous examples, those with IoU in the range [0.1, 0.5).

Moreover, the authors report that with the parameters $N = 2$ and $R = 128$ the network learns many times faster than with $N = 128$ and $R = 128$ (i.e., one hypothesis per image).
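A sketch of this batch-forming scheme might look as follows; the `img.rois` field holding (hypothesis, IoU) pairs from Selective Search is a hypothetical structure introduced purely for illustration:

```python
import random

def sample_batch(images, rois_per_image=64, pos_fraction=0.25, n_images=2):
    """Form a Fast R-CNN batch: N images, R = n_images * rois_per_image RoIs,
    ~25% positives (IoU > 0.5), the rest negatives (IoU in [0.1, 0.5))."""
    batch = []
    for img in random.sample(images, n_images):
        # `img.rois` is an assumed structure: (hypothesis, IoU with ground truth)
        pos = [r for r, iou in img.rois if iou > 0.5]
        neg = [r for r, iou in img.rois if 0.1 <= iou < 0.5]  # hard negatives
        n_pos = min(int(rois_per_image * pos_fraction), len(pos))
        batch += random.sample(pos, n_pos)
        # A real implementation must also handle too few negatives.
        batch += random.sample(neg, rois_per_image - n_pos)
    return batch
```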

Faster R-CNN


A further logical improvement is to eliminate the dependence on the Selective Search algorithm. To do this, the entire system is represented as a composition of two modules: hypothesis generation and hypothesis processing. The first module is implemented with the Region Proposal Network (RPN), and the second is the same as in Fast R-CNN (starting with the RoI layer).

Accordingly, the image-processing pipeline changed once more and now looks like this:

  1. Extracting a feature map from the image using a neural network;
  2. Generating hypotheses based on the obtained feature map: approximate coordinates and the presence of an object of any class;
  3. Mapping the hypothesis coordinates onto the feature map from step 1 using the RoI layer;
  4. Classifying the hypotheses (now determining the specific class) and further refining the coordinates (which, in fact, may not even be applied).

The main improvement is precisely in how the hypotheses are generated: there is now a separate small neural network for this, called the Region Proposal Network.


Region Proposal Network


The ultimate goal of this module is to fully replace the Selective Search algorithm. To run fast, it shares weights with the network that extracts features, so the RPN input is the feature map produced by that network. The authors of the original article use the VGG16 network for feature extraction, taking its last convolutional layer, conv5_3, as the output. Such a network has the following receptive-field characteristics:

  • Effective stride ($S_0$): 16
  • Receptive field size ($r_0$): 196

This means that the feature map is 16 times smaller than the original image (with 512 channels), and each value in its cells is affected by the pixels of the original image lying in a 196 × 196 square. Thus, with the standard VGG16 input of 224 × 224, almost the entire image affects the value of the central cell of the 14 × 14 feature map! Based on the resulting feature map, the RPN produces $k$ hypotheses for each cell (in the original implementation $k = 9$) of different sizes and aspect ratios. For the standard input size this gives 14 × 14 × 9 = 1764 hypotheses!

Let us consider the algorithm of the RPN module in more detail:


  1. The input is the feature map of size $c \times \frac{H}{16} \times \frac{W}{16}$ produced by the feature extractor.
  2. A 3 × 3 convolution slides over this map (the "mini-network"). It further enlarges the receptive field ($P_0 = 106$, $r_0 = 228$), so each output value aggregates a 228 × 228 region of the original image.

  • Each cell $(i, j)$ thus yields a vector of dimension $c$ (512 in this case).

  3. This vector is fed into two parallel 1 × 1 convolutional layers that map $c$ channels to $\hat{c}$:
    1. The classification layer (cls), with $\hat{c} = 2k$: object/no-object scores (2 values per anchor).
    2. The regression layer (reg), with $\hat{c} = 4k$: corrections to the anchor coordinates.

    Note that the obtained vectors can be reshaped into $k \times 2$ and $k \times 4$ matrices, where the $i$-th row holds the values for a particular hypothesis.
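Putting the above together, a minimal PyTorch sketch of such an RPN head could look like this (the layer names and the ReLU after the 3 × 3 convolution are assumptions of this sketch):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """RPN head as described above: a 3x3 conv over the feature map,
    then two parallel 1x1 convs - `cls` with 2k channels (object /
    not object per anchor) and `reg` with 4k channels (corrections)."""
    def __init__(self, c=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(c, 2 * k, kernel_size=1)   # c_hat = 2k
        self.reg = nn.Conv2d(c, 4 * k, kernel_size=1)   # c_hat = 4k

    def forward(self, fmap):                 # fmap: (B, c, H/16, W/16)
        x = torch.relu(self.conv(fmap))
        return self.cls(x), self.reg(x)      # (B, 2k, h, w), (B, 4k, h, w)

# For a 224x224 VGG16 input the map is 14x14, i.e. 14*14*9 = 1764 anchors.
head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 14, 14))
```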

A logical question arises: how can the absolute coordinates of hypotheses be determined from the vector fed into the reg layer? The answer is simple: they cannot. To determine the coordinates correctly, we need the so-called anchors and corrections to their coordinates.

An anchor is a rectangle of one of several aspect ratios (1:1, 2:1, 1:2) and sizes (128 × 128, 256 × 256, 512 × 512). The center of the anchor is the center of the cell $(i, j)$ of the feature map. For example, take the cell (7, 7): its center is (7.5, 7.5), which corresponds to the coordinates (120, 120) of the original image (16 × 7.5). These coordinates are matched with rectangles of three aspect ratios and three sizes (3 × 3 = 9 in total). The reg layer then produces corrections relative to these coordinates, adjusting the location and shape of the bounding box.
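A small sketch of generating the nine anchors for a given cell, under the common assumption that the anchor area stays roughly size² while the aspect ratio varies (exact size handling differs between implementations):

```python
import numpy as np

def anchors_at(i, j, stride=16,
               sizes=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """The 9 anchors for feature-map cell (i, j): centered at the cell
    center mapped back to the image, with 3 sizes x 3 aspect ratios."""
    cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
    boxes = []
    for s in sizes:
        for r in ratios:
            # keep the anchor area ~ s*s while varying the aspect ratio w:h = r
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

# Cell (7, 7) -> center (120, 120) of the original image, as in the text.
print(anchors_at(7, 7)[:3])
```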

Total:

  • Input: map of features of the original image;
  • Output: hypotheses containing an object.

Loss function


For RPN training, the following class assignment for anchors is used:

  • Positive: all anchors that have an IoU above 0.7 with a ground-truth box, or the largest IoU among all anchors (this applies when no anchor overlaps by more than 0.7).
  • Negative: all anchors with an IoU below 0.3.
  • All other anchors do not participate in training (they are effectively neutral).

So the class $p_i^*$ of an anchor is assigned according to the following rule:

$$p_i^* = \begin{cases} 1, & \text{if } IoU > 0.7 \\ 0, & \text{if } IoU < 0.3 \\ \text{ignored}, & \text{otherwise} \end{cases}$$
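A sketch of this labeling rule, assuming a precomputed IoU matrix between anchors and ground-truth boxes:

```python
import numpy as np

def label_anchors(ious):
    """`ious[i, g]` - IoU of anchor i with ground-truth box g.
    Returns 1 (positive), 0 (negative) or -1 (ignored) per anchor."""
    best = ious.max(axis=1)
    labels = np.full(len(ious), -1)          # -1 = does not participate
    labels[best < 0.3] = 0                   # negatives
    labels[best > 0.7] = 1                   # positives
    # each ground-truth box also makes its best-overlapping anchor positive
    labels[ious.argmax(axis=0)] = 1
    return labels
```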


With such notation, the following function is minimized:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{loc}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)$$


Here:

  • $i$ is the anchor index;
  • $p_i$ is the predicted probability that the $i$-th anchor contains an object;
  • $p_i^*$ is the correct class label (defined above);
  • $t_i$ holds the 4 predicted coordinate corrections;
  • $t_i^*$ holds the expected (ground-truth) coordinate corrections;
  • $L_{cls}(p_i, p_i^*)$ is the binary log loss;
  • $L_{reg}(t_i, t_i^*)$ is the Smooth L1 loss; it is activated only when $p_i^* = 1$, i.e. when the anchor contains some object;
  • $\{p_i\}$ and $\{t_i\}$ are the outputs of the classification and regression layers, respectively;
  • $\lambda$ is a coefficient balancing classification and regression.

The two parts of the combined loss are normalized by $N_{cls}$ and $N_{loc}$, respectively. The authors set $N_{cls}$ equal to the mini-batch size (256) and $N_{loc}$ to the number of anchor locations.

For bounding-box regression, the corrections are parameterized and computed as follows:

$$t_x = (x - x_a)/w_a, \qquad t_x^* = (x^* - x_a)/w_a$$
$$t_y = (y - y_a)/h_a, \qquad t_y^* = (y^* - y_a)/h_a$$
$$t_w = \log(w/w_a), \qquad t_w^* = \log(w^*/w_a)$$
$$t_h = \log(h/h_a), \qquad t_h^* = \log(h^*/h_a)$$


Here $x$, $y$, $w$, and $h$ denote the center coordinates, width, and height of the box. The variables $x$, $x^*$, and $x_a$ denote the prediction, the ground truth, and the anchor value, respectively (and similarly for $y$, $w$, and $h$).
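The parameterization and its inverse are straightforward to express in code; here is a sketch assuming boxes given in (center x, center y, width, height) format:

```python
import numpy as np

def encode(box, anchor):
    """Corrections t = (tx, ty, tw, th) of `box` relative to `anchor`;
    both are given as (cx, cy, w, h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Inverse transform: apply corrections `t` to `anchor`."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([xa + tx * wa, ya + ty * ha,
                     wa * np.exp(tw), ha * np.exp(th)])
```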

Training on the full list of anchors would be biased towards the negative class (there are many more such hypotheses). Therefore, the mini-batch is formed with a 1:1 ratio of positive to negative anchors. If there are not enough positive anchors, the mini-batch is padded with negative ones.
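A sketch of this sampling scheme, taking the per-anchor labels from the rule above (1 = positive, 0 = negative, -1 = ignored); the error handling a real implementation needs is omitted:

```python
import numpy as np

def sample_rpn_minibatch(labels, batch_size=256):
    """Pick up to batch_size/2 positive anchors; pad with negatives."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(batch_size // 2, len(pos))
    chosen_pos = np.random.choice(pos, n_pos, replace=False)
    chosen_neg = np.random.choice(neg, batch_size - n_pos, replace=False)
    return np.concatenate([chosen_pos, chosen_neg])
```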

General network training


The main objective is to share weights between the two modules - this increases the overall speed. Since training two interdependent modules simultaneously is impossible (or rather, difficult), the authors of the article use an iterative approach:

  1. Train the RPN. The convolutional layers are initialized with weights obtained by pre-training on ImageNet, and the network is fine-tuned on the task of proposing regions containing an object of any class (determining the specific class is the job of the Fast R-CNN part).
  2. Train the Fast R-CNN network. As in step 1, it is initialized with ImageNet weights and trained on the object hypotheses produced by the RPN from step 1. This time the training objective is to refine the coordinates and determine the specific class of each object.
  3. Using the weights from step 2, train only the RPN part (the feature-extractor layers feeding the RPN are frozen and do not change at all; a sketch of this freezing follows below).
  4. Using the weights from step 3 (i.e., the fine-tuned RPN), train the layers belonging to Fast R-CNN (the remaining weights - those that come earlier or belong to the RPN - are frozen).

With such iterative training, the entire network ends up built on shared weights. Training could continue in this fashion, but the authors note that it yields no major changes in the metrics.
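The freezing used in steps 3 and 4 boils down to toggling gradient tracking on the relevant sub-networks; below is a minimal PyTorch sketch with stand-in modules (the layer layout here is hypothetical, not the actual architecture):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze a sub-network by toggling gradient tracking."""
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical layout: a shared backbone plus the two heads.
backbone = nn.Sequential(nn.Conv2d(3, 512, 3), nn.ReLU())
rpn_head = nn.Conv2d(512, 9 * (2 + 4), 1)      # stand-in for the RPN
frcnn_head = nn.Linear(512, 21)                # stand-in for Fast R-CNN

# Step 3 of the scheme: freeze the shared weights, fine-tune only the RPN.
set_trainable(backbone, False)
set_trainable(rpn_head, True)
set_trainable(frcnn_head, False)
```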

Prediction process


When the network is used for prediction, an image is propagated through it as follows:

  1. The image enters the neural network, producing a feature map.
  2. Each cell of the feature map is processed by the RPN, yielding corrections to the anchor positions and the probability that an object of any class is present.
  3. The resulting predicted boxes are then mapped onto the feature map via the RoI layer for further processing by the Fast R-CNN part.
  4. At the output, we obtain the specific classes of the objects and their exact positions in the image.

Summary of differences


Here is a brief comparison of the models (each model inherits the basic ideas of its predecessor):

R-CNN:

  • Using Selective Search as a hypothesis generator.
  • Using an SVM for classification and ridge regression for bounding-box refinement (and the two cannot run in parallel).
  • Running a neural network to process each hypothesis individually.
  • Low speed.

Fast R-CNN:

  • The neural network runs only once per image - all hypotheses are processed on the basis of a single feature map.
  • Smart handling of hypotheses of different sizes thanks to the RoI layer.
  • Replacing the SVM classifiers with a SoftMax layer.
  • Classification and regression can run in parallel.

Faster R-CNN:

  • Hypothesis generation is performed by a separate trainable (differentiable) module.
  • The image-processing pipeline changed with the advent of the RPN module.
  • The fastest of the three models.
  • It remains one of the most accurate to this day.

Conclusion


In conclusion, we can say that the development of the R-CNN family moved from a collection of disparate algorithms, each solving its own subtask, towards a single end-to-end solution. Such unification tends to make almost any approach more accurate and more efficient, and Object Detection is no exception.

Bibliography


  1. R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." In CVPR, 2014. arXiv:1311.2524
  2. R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Region-based convolutional networks for accurate object detection and segmentation." TPAMI, 2015.
  3. R. Girshick. "Fast R-CNN." In IEEE International Conference on Computer Vision (ICCV), 2015.
  4. S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Neural Information Processing Systems (NIPS), 2015.
