FarSee-Net article review - a new approach to real-time semantic segmentation

In this paper, the authors propose Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP), an architecture for real-time semantic segmentation. The new CF-ASPP module, combined with super-resolution, improves the latency-accuracy trade-off. The review was prepared by Andrey Lukyanenko, lead developer at MTS.

Real-time semantic segmentation is essential for many tasks performed on limited hardware. One of the main difficulties is handling objects of different sizes and exploiting contextual information. In this paper, the authors propose the Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP) architecture.

A common approach today is to aggressively downsample the image in the early stages and then recover a full-resolution mask by upsampling. The authors propose using super-resolution techniques instead of simple upsampling.

Together, the new module and the use of super-resolution improve the latency-accuracy trade-off.

In the authors' terminology, the pretrained feature-extraction network is called the front-end network, and the rest of the model is the back-end network.

Justification for improvements

Since the same object can appear at different sizes in different images, it is important to use contextual information effectively, especially for small and narrow objects. Context aggregation across multiple scales is typically performed on the front-end's features. But these modules usually operate at deep levels of the network, where the number of channels is high, so even convolutional layers with a 3 × 3 kernel require considerable computing resources. The authors therefore propose their own, more efficient module.

Another back-end problem in semantic segmentation is that the feature maps coming out of the front-end have a much smaller spatial size. On top of that, many approaches feed in downscaled images to increase speed, shrinking the maps even further. The authors suggest supervising training with the mask at the original size: super-resolution makes it possible to efficiently recover a high-resolution mask from a low-resolution one.

The essence of the improvements

Any pretrained network can be used as the front-end, for example VGG, ResNet, or MobileNet.

The main contribution is the back-end:

Cascaded Factorized ASPP

Atrous convolutions are widely used in semantic segmentation. They differ from standard convolutions in that, for a dilation rate r, r - 1 zeros are inserted between the filter weights. This significantly enlarges the receptive field of each filter without increasing computational cost. But since atrous convolutions are applied to large feature maps, they are still computationally expensive.
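As a minimal PyTorch sketch (channel counts and input size are illustrative, not the paper's values), a dilated 3 × 3 convolution uses exactly the same number of weights as a standard one while covering a larger window:

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution sees a 3x3 window; with dilation=2 the same
# nine weights are spread over a 5x5 window (r - 1 = 1 zero between taps).
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 32, 32)
print(standard(x).shape, atrous(x).shape)  # both keep the 32x32 spatial size

# Parameter counts are identical - dilation adds no weights.
n_std = sum(p.numel() for p in standard.parameters())
n_atr = sum(p.numel() for p in atrous.parameters())
print(n_std == n_atr)  # True
```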

The authors suggest decomposing a 3 × 3 atrous convolution into two parts: a point-wise (1 × 1) convolution that reduces the number of channels, followed by a depth-wise atrous convolution that keeps the computational overhead low. As a result, roughly 8.8 times fewer computations are required.
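A minimal sketch of one such factorized branch; the 512 input and 128 output channels are assumptions for illustration, but with them the arithmetic reproduces the quoted ~8.8× saving:

```python
import torch
import torch.nn as nn

def factorized_atrous(in_ch, mid_ch, dilation):
    # 1x1 point-wise convolution reduces channels, then a depth-wise
    # (groups == channels) atrous 3x3 convolution aggregates context cheaply.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=dilation,
                  dilation=dilation, groups=mid_ch, bias=False),
    )

branch = factorized_atrous(512, 128, dilation=6)
x = torch.randn(1, 512, 16, 16)
print(branch(x).shape)  # spatial size preserved, 128 output channels

# Multiply-adds per output position vs. a plain 3x3 atrous convolution:
full = 3 * 3 * 512 * 128          # standard 3x3, 512 -> 128 channels
fact = 512 * 128 + 3 * 3 * 128    # point-wise + depth-wise
print(full / fact)                # ~8.8x fewer operations
```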

In addition, the ASPP module is applied twice, in cascade. On the one hand, the model captures context at more scales; on the other, the second ASPP receives smaller feature maps, so the network slows down only slightly while accuracy increases.
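A hypothetical sketch of the cascade idea (dilation rates and channel widths are assumptions, not the paper's exact configuration): each stage runs several factorized branches in parallel and concatenates them, and the second stage consumes the first stage's much thinner output.

```python
import torch
import torch.nn as nn

class FASPP(nn.Module):
    """One factorized ASPP stage: parallel (point-wise + depth-wise atrous)
    branches whose concatenated outputs form the stage's result."""
    def __init__(self, in_ch, mid_ch, dilations=(2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d,
                          groups=mid_ch, bias=False),
            )
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

# Cascade: the second stage sees 192 channels instead of the raw 512
# front-end channels, so it is cheap despite adding more context scales.
stage1 = FASPP(512, 64)   # 512 -> 3 * 64 = 192 channels
stage2 = FASPP(192, 64)
y = stage2(stage1(torch.randn(1, 512, 16, 16)))
print(y.shape)
```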

Feature Space Super-resolution

After the front-end, the spatial size is greatly reduced, and a high-resolution result must be produced from this reduced representation. The authors use a super-resolution approach for this.

At the training stage, a downscaled image is used as input, while the ground truth is the mask at the original resolution.

In the back-end module, upsampling is done with sub-pixel convolution, a technique commonly used in super-resolution.
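Sub-pixel convolution corresponds to PyTorch's `nn.PixelShuffle`: a regular convolution produces r² × C channels, which are then rearranged into an r-times larger map. A small sketch (the upscaling factor and channel counts are illustrative; 19 is the number of Cityscapes evaluation classes):

```python
import torch
import torch.nn as nn

r = 4              # upscaling factor (illustrative)
n_classes = 19     # Cityscapes evaluation classes

# Convolution emits n_classes * r^2 channels; PixelShuffle rearranges each
# group of r^2 channels into an r x r spatial block of one output channel.
upsample = nn.Sequential(
    nn.Conv2d(128, n_classes * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)

x = torch.randn(1, 128, 32, 64)   # low-resolution back-end features
print(upsample(x).shape)          # 4x larger in each spatial dimension
```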

Experiments

The experiments were run on Cityscapes. The code was written in PyTorch 1.1 with cuDNN v7.0, and inference was measured on an Nvidia Titan X (Maxwell). ResNet-18 served as the pretrained front-end, with features taken from the last layer before average pooling and from the conv3_x layer.
Training used SGD, 400 epochs, and heavy augmentation.

Ablation Study on Network Structure

Four configurations were tested:

  1. Front-end: ResNet-18; back-end: ASPP; decoder: DeepLabV3+.
  2. Front-end: ResNet-18; back-end: a single F-ASPP; decoder: DeepLabV3+.
  3. Front-end: ResNet-18; back-end: CF-ASPP (without feature-space super-resolution).
  4. The full approach.

Comparison with other approaches

The quality is genuinely high, and the inference speed is close to the best among the compared methods.
