MASK-RCNN for finding roofs from drone images


In a white, white city, on a white, white street, there stood white, white houses... And how quickly can you find all the roofs of the houses in this photo?

Lately one increasingly hears about government plans to conduct a complete inventory of real estate in order to bring cadastral data up to date. A simple first-pass solution is to calculate the roof area of capital buildings from aerial photographs and compare it against the cadastral records. Unfortunately, manual search and measurement takes a lot of time, and since houses are continuously demolished and built, the calculation has to be repeated again and again. This immediately suggests that the process can be automated with machine learning algorithms, in particular computer vision. In this article I will talk about how we at NORBIT solved this problem and what difficulties we encountered.

Spoiler: we did it. The ML service we developed is based on a deep learning model built on convolutional neural networks. The service accepts images from unmanned aerial vehicles as input and produces a GeoJSON file with the markup of the detected capital construction objects, georeferenced to geographic coordinates.

As a result, it looks like this:


Problems


Let's start with the technical problems we encountered:

  • there is a significant difference between winter and summer aerial photographs (a model trained only on summer photographs is completely unable to find roofs in winter).

And drones sometimes bring these photos:


I would also like to mention problems that we could have faced but did not:

  • we did not have to perform inference within a limited time (for example, right during the flight), which immediately removed all potential performance problems;
  • our customer, the Shakhty company, supplied us with high-quality, high-resolution images right away (lenses with a focal length of 21 mm at an altitude of 250 m give about 5 cm/px); we could also rely on their expertise in geolocating objects on maps, and they were able to set specific requirements for future UAV flights, which ultimately greatly reduced the likelihood of encountering tiles very unlike anything in the training set.

The first solution: outlining roofs with bounding boxes


A few words about what tools we used to create the solution.

  • Anaconda is a convenient package management system for Python and R.
  • Tensorflow is an open source machine learning software library developed by Google.
  • Keras is a high-level neural network API that runs on top of frameworks such as TensorFlow and Theano.
  • OpenCV is an open-source library of computer vision, image processing, and general-purpose numerical algorithms.
  • Flask is a framework for creating web applications in the Python programming language.

We used Ubuntu 18.04 as the OS. NVIDIA GPU drivers on Ubuntu work without any fuss, so installing them usually comes down to a single command:

> sudo apt install nvidia-cuda-toolkit

Tile Preparation


The first task we faced was to split the flyover images into tiles (2048x2048 px). We could have written our own script, but then we would have had to take care of preserving the geographic location of each tile. It was easier to use a ready-made solution, for example GeoServer, an open-source application for publishing geodata on a server. In addition, GeoServer solved another problem for us: conveniently displaying the result of automatic markup on a map. This can be done locally, for example in qGIS, but for a distributed team and for demonstrations a web resource is more convenient.

To perform tiling, you need to specify the required scale and size in the settings.


For translations between coordinate systems, we used the pyproj library:

from pyproj import Proj, transform

class Converter:
    P3857 = Proj(init='epsg:3857')  # Web Mercator, used by the tile layer
    P4326 = Proj(init='epsg:4326')  # WGS84 latitude/longitude (GPS)
...
    def from_3857_to_GPS(self, point):
        x, y = point
        return transform(self.P3857, self.P4326, x, y)
    def from_GPS_to_3857(self, point):
        x, y = point
        return transform(self.P4326, self.P3857, x, y)
...
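
A minimal usage sketch of this helper (the coordinate values are made up purely for illustration):

converter = Converter()

lon, lat = converter.from_3857_to_GPS((4187591.0, 7509137.0))  # Web Mercator -> WGS84
x, y = converter.from_GPS_to_3857((lon, lat))                   # and back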

As a result, it was possible to easily combine all the polygons into one large layer and overlay it on the base map. 


To install GeoServer, you must complete the following steps.
  1. Install Java 8.
  2. Download GeoServer.
  3. Unpack the archive, for example, into /usr/share/geoserver.
  4. Add an environment variable pointing to the installation:

    echo "export GEOSERVER_HOME=/usr/share/geoserver" >> ~/.profile
  5. Create a group for GeoServer:

    sudo groupadd geoserver
  6. Add the users who will work with GeoServer to this group:

    sudo usermod -a -G geoserver <user_name>
  7. Make the group the owner of the installation directory:

    sudo chown -R :geoserver /usr/share/geoserver/
  8. Grant the group read, write, and execute permissions:

    sudo chmod -R g+rwx /usr/share/geoserver/
  9. Start GeoServer:

    cd geoserver/bin && sh startup.sh

GeoServer is not the only application that allows us to solve our problem. As an alternative, for example, you can consider ArcGIS for Server, but this product is proprietary, so we did not use it.

Next, in each tile we had to find all the visible roofs. The first approach was to use the object_detection API from the Tensorflow models/research repository. This way, object classes in images can be found and localized with a rectangular selection (bounding box). 

Training data markup 


Obviously, training the model requires a labeled dataset. By a lucky coincidence, in addition to the flyover images, a dataset of 50 thousand roofs had been preserved in our archives from the good old days, when all training datasets were still freely available everywhere.

The exact size of the training sample required to obtain acceptable model accuracy is rather difficult to predict in advance. It can vary depending on the quality of the images, how much they differ from one another, and the conditions in which the model will be used in production. We have had cases when 200 samples were enough, and cases when even 50 thousand labeled samples were not. When labeled images are in short supply, we usually add augmentation: rotations, mirror reflections, color grading, and so on.
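
The article does not name a specific augmentation library; a sketch with albumentations, covering the rotations, flips, and color shifts mentioned above, could look like this:

import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.HueSaturationValue(p=0.3),
])

# image is an RGB tile, mask is its per-pixel roof markup; both are transformed consistently
augmented = augment(image=image, mask=mask)
image_aug, mask_aug = augmented["image"], augmented["mask"]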

There are now many services for labeling images, both open-source tools that you can install on your own computer or server and commercial solutions that include the work of external assessors, for example Yandex.Toloka. In this project we used the simplest one, VGG Image Annotator. As an alternative, you can try coco-annotator or label-studio. We usually use the latter for labeling text and audio files.


To train on markup produced by different annotators, you usually need a small conversion of the fields (an example for VGG).
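
For instance, turning VGG Image Annotator (VIA 2.x) polygon regions into a binary training mask could look roughly like this (the JSON layout follows the VIA export format; file and key names are made up):

import json
import numpy as np
import cv2

def via_regions_to_mask(via_json_path, image_key, height, width):
    """Rasterize the polygon regions of one image from a VIA 2.x export into a binary mask."""
    with open(via_json_path) as f:
        annotations = json.load(f)
    mask = np.zeros((height, width), dtype=np.uint8)
    for region in annotations[image_key]["regions"]:
        shape = region["shape_attributes"]
        points = np.stack([shape["all_points_x"], shape["all_points_y"]], axis=1)
        cv2.fillPoly(mask, [points.astype(np.int32)], 255)
    return mask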

In order to correctly calculate the area of a roof that falls inside a rectangular selection, several conditions must be met:

  • the roof must be rectangular (or close to it), otherwise the area of the rectangular selection will differ noticeably from the true roof area. An example:


  • the sides of the roof must be parallel to the borders of the tile, otherwise the rectangular selection will capture a lot of extra area. An example:


To solve the second problem, you could try to train a separate model that would determine the correct rotation angle of a tile before marking it, but everything turned out to be a little simpler. People themselves strive to reduce entropy, so man-made structures are aligned with one another, especially in dense development. Seen from above, fences, walkways, garden beds, greenhouses, and gazebos in a localized area will be parallel or perpendicular to the edges of the roofs. All that remains is to find all the distinct lines and compute the most common angle of inclination to the vertical. OpenCV has a great tool for this: HoughLinesP. 

...

lines = cv2.HoughLinesP(edges, 1, np.pi/180, 50, minLineLength=minLineLength, maxLineGap=5)
if lines is not None:
    length = image.shape[0]
    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        angles.append(angle)
    parts_angles.append(angles)
    median_angle = np.median(angles)
...

# split the tile into count_crops x count_crops parts and collect the line angles in each of them

for x in range(0, image.shape[0]-1, image.shape[0] // count_crops):
    for y in range(0, image.shape[1]-1, image.shape[1] // count_crops):
        get_line(image[x:x+image.shape[0]//count_crops, y:y+image.shape[1]//count_crops, :])
...

# median angle over all parts (negative angles are mapped to the equivalent positive rotation)

np.median([a if a > 0 else 90 + a for a in np.array(parts_angles).flatten()])

After finding the angle, we rotate the image using an affine transformation:


h, w = image.shape[:2]
image_center = (w/2, h/2)

if size is None:
    radians = math.radians(angle)
    sin = math.sin(radians)
    cos = math.cos(radians)
    size = (int((h * abs(sin)) + (w * abs(cos))), int((h * abs(cos)) + (w * abs(sin))))
    rotation_matrix = cv2.getRotationMatrix2D(image_center, angle, 1)
    rotation_matrix[0, 2] += ((size[0] / 2) - image_center[0])
    rotation_matrix[1, 2] += ((size[1] / 2) - image_center[1])
else:
    rotation_matrix = cv2.getRotationMatrix2D(image_center, angle, 1)

cv2.warpAffine(image, rotation_matrix, size)

The full example code is here. Here's what it looks like:



The method of rotating tiles and marking with rectangles works faster than marking with masks, and almost all roofs are found, but in production it is used only as an auxiliary method, due to several drawbacks:

  • many flyovers contain a large number of non-rectangular roofs, which requires too much manual work to refine the area;
  • sometimes houses with different orientations end up on the same tile;
  • sometimes there are many false lines on a tile, which ultimately leads to a wrong rotation. It looks like this:



The final solution based on Mask-RCNN


The second attempt was to detect and select roofs with per-pixel masks, and then automatically trace the contours of the found masks and create vector polygons.  

There are already plenty of materials on the principles of operation, the types, and the applications of convolutional neural networks, including in Russian, so we will not go into them in this article. Let us dwell only on one specific implementation, Mask-RCNN, an architecture for localizing and outlining the contours of objects in images. There are other excellent solutions with their own advantages and disadvantages, for example UNet, but we achieved better quality with Mask-RCNN.

The architecture went through several stages of development. The first version, R-CNN, appeared in 2014. It works by selecting small regions in the image and estimating, for each of them, the probability that the target object is present there. R-CNN did an excellent job, but its speed left much to be desired. The logical development was the Fast R-CNN and Faster R-CNN networks, which improved the way the image is traversed and significantly increased the speed. The output of Faster R-CNN is a markup with a rectangular selection indicating the boundaries of the object, which is not always enough to solve the problem. 

Mask R-CNN also adds a pixel-by-pixel mask overlay to get the exact outline of the object.

The bounding boxes and masks are clearly visible in the model's output (a filter by minimum building area is enabled):


The operation of this network can be roughly divided into 4 stages:

  • feature extraction, standard for all convolutional neural networks: lines, bends, contrasting boundaries, and so on;
  • the Region Proposal Network (RPN) scans small fragments of the image, called anchors, and determines whether a given anchor contains features characteristic of the target class (in our case, a roof);
  • Region of Interest Classification and Bounding Box: at this stage the network, based on the results of the previous stage, tries to select large rectangular regions in the photograph that presumably contain the target object;
  • Segmentation Masks: at this stage the mask of the desired object is obtained within the rectangular region given by the bounding box.

In addition, the network turned out to be very flexible in configuration, and we were able to rebuild it to process images with additional information layers.
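
The article does not state which implementation was used; a minimal configuration and inference sketch, assuming the popular matterport/Mask_RCNN Keras implementation, could look like this (the weights file name and threshold are illustrative):

from mrcnn.config import Config
from mrcnn import model as modellib

class RoofConfig(Config):
    NAME = "roof"
    NUM_CLASSES = 1 + 1              # background + roof
    IMAGES_PER_GPU = 1
    DETECTION_MIN_CONFIDENCE = 0.7   # confidence threshold, value chosen for illustration
    # in recent versions IMAGE_CHANNEL_COUNT can be increased to feed additional layers

# inference on a single tile
model = modellib.MaskRCNN(mode="inference", config=RoofConfig(), model_dir="logs")
model.load_weights("mask_rcnn_roof.h5", by_name=True)   # hypothetical weights file
result = model.detect([tile], verbose=0)[0]             # dict with 'rois', 'masks', 'class_ids', 'scores'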

Using RGB images alone did not allow us to reach the required recognition accuracy (the model missed entire buildings, and the average error in calculating the roof area was 15%), so we fed the model additional useful data, for example height maps obtained by photogrammetry. 
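
The article does not describe exactly how the extra layers were attached; one straightforward way is to stack the height map as an additional image channel, as in this sketch (function and variable names are illustrative):

import numpy as np

def add_height_channel(rgb_tile, height_map):
    """Stack a photogrammetric height map onto an RGB tile as a fourth channel.

    rgb_tile: (H, W, 3) uint8 image; height_map: (H, W) float array aligned to the same tile.
    """
    h = height_map - height_map.min()
    h = (255.0 * h / max(float(h.max()), 1e-6)).astype(np.uint8)  # bring heights to 0..255
    return np.dstack([rgb_tile, h])  # (H, W, 4)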


Metrics used to evaluate model quality


When determining the quality of models, we most often used the Intersection over Union (IoU) metric: the area of the intersection of the predicted and true polygons divided by the area of their union.


Sample code for calculating IoU using the shapely.geometry library:

from shapely.geometry import Polygon

true_polygon = Polygon([(2, 2), (2, 6), (5, 6), (5, 2)])
predicted_polygon = Polygon([(3, 3), (3, 7), (6, 7), (6, 3)])
print(true_polygon.intersection(predicted_polygon).area / true_polygon.union(predicted_polygon).area)

>>> 0.3333333333333333

Tracking the training process is convenient with Tensorboard, a metric monitoring tool that lets you watch model quality in real time and compare it with other models; a minimal sketch of enabling it is shown after the metric list below.


Tensorboard provides data on many different metrics. The most interesting for us are:

  • val_mrcnn_bbox_loss - shows how well the model localizes objects (i.e. fits the bounding box);
  • val_mrcnn_mask_loss - shows how well the model segments objects (i.e. fits the mask).
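
In Keras, enabling this logging usually comes down to adding a TensorBoard callback; a minimal sketch (the log directory name is arbitrary):

from tensorflow.keras.callbacks import TensorBoard

# write logs to a per-run directory and point `tensorboard --logdir logs` at it
tb_callback = TensorBoard(log_dir="logs/roof_run_01")
# the callback is then passed to the training call, e.g. model.fit(..., callbacks=[tb_callback])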

Model training and validation


When training, we used the standard practice of randomly dividing the dataset into 3 parts: training, validation, and test. During training, the quality of the model is evaluated on the validation sample, and upon completion it passes a final check on the test data, which was hidden from it during training. 
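
A sketch of such a split, assuming scikit-learn is available (the 70/15/15 proportions are illustrative, not the ones used in the project):

from sklearn.model_selection import train_test_split

# `samples` is a list of (image_path, annotation_path) pairs
train, rest = train_test_split(samples, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)  # 70% / 15% / 15%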

We did our first training runs on a small set of summer images and, deciding to check how well the model would do in winter, we expectedly got a disappointing result. Using different models for different seasons is, of course, a perfectly good way out, but it would entail a number of inconveniences, so we decided to try to make the model universal. By experimenting with different layer configurations, and also by freezing the weights of individual layers, we found the optimal training strategy: alternately feeding summer and winter images to the input.
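
The article does not show the exact training schedule; in the matterport implementation assumed above, training only the head layers while keeping the backbone frozen looks roughly like this:

# train only the head layers first (backbone weights stay frozen),
# feeding in a dataset that alternates summer and winter tiles
model.train(train_dataset, val_dataset,
            learning_rate=config.LEARNING_RATE,
            epochs=30,                # number of epochs is illustrative
            layers="heads")           # "heads" / "3+" / "all" control which layers are unfrozen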

Creating a background service for recognition


Now that we have a working model, we can turn the recognition script into a background API service that takes an image as input and returns JSON with the polygons of the roofs it found. This does not directly affect the solution of the problem, but it may be useful to someone. 
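
A minimal sketch of such a service with Flask (the endpoint name and the run_model helper are hypothetical; run_model is assumed to wrap the Mask R-CNN inference and polygon tracing):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/detect", methods=["POST"])
def detect():
    image_file = request.files["image"]           # the uploaded tile
    features = run_model(image_file)              # hypothetical helper: inference + polygon tracing
    return jsonify({"type": "FeatureCollection", "features": features})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)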

Ubuntu uses systemd, and the example will be given for this system. The code of the service itself can be viewed here. Units created by the administrator live in the /etc/systemd/system directory, where we will create our service file and then edit it:

cd /etc/systemd/system

sudo touch my_srv.service

sudo vim my_srv.service

The systemd unit consists of three sections:

  • [Unit] - describes the startup order and conditions (for example, you can tell the process to wait for a certain service to start and only then start itself);
  • [Service] - describes startup parameters;
  • [Install] - describes the behavior of the service when adding it to startup.

As a result, our file will look like this:

[Unit]
Description=my_test_unit

[Service]
WorkingDirectory=/home/user/test_project
User=root
ExecStart=/home/user/test_project/venv/bin/python3 /home/user/test_project/script.py

[Install]
WantedBy=multi-user.target

Now reload the systemd configuration and run our service:

sudo systemctl daemon-reload
sudo systemctl start my_srv.service

This is a simple example of a background process; systemd supports many other parameters for flexibly configuring the behavior of a service, but nothing more complicated is required for our task.

Conclusions


The main result of the project was the ability to automatically detect inconsistencies in the actual development and information contained in the cadastral data.

Evaluating the model on the test data gave the following values: 91% of roofs were found, and the accuracy of the roof outline polygons was 94%.

We achieved acceptable model quality on both summer and winter flyovers, but recognition quality may drop in pictures taken immediately after a snowfall.

Now even the Sydney Opera House will not slip away from the eyes of our model. 


We plan to deploy this service with a trained model on our demo stand. If you would like to try the service on your own photos, send a request to ai@norbit.ru.
