How we recognize personal protective equipment

You have probably wondered all your life how to train a neural network to recognize people in helmets and orange vests. No? We will tell you anyway.

We are Tatyana Voronova and Elvira Dyaminova. We do data analysis at Center 2M and work a lot with real factories and enterprises. Safety violations cost them multimillion-dollar losses and injure employees, so it would be good to detect such violations systematically and as early as possible, ideally automatically. That is how we came to work on recognizing personal protective equipment (PPE) on video and identifying people or equipment in danger zones.

Most of the requests we receive are for detecting helmets (more precisely, their absence) and workwear. We have gained experience with such tasks and can now describe the problems we ran into and how we solved them.

Since the terms of cooperation do not allow us to publish footage from customers' sites, we illustrate the article with images from the Internet, where people in helmets often smile and look great. Unfortunately, public images do not cover every peculiarity of the tasks we face in reality. In particular, in real life people in helmets smile less often, and the problem of bald workers (more on that later) is barely represented on the Internet.

Image from the Internet (size 1920x1280):

PPE recognition can be reduced to one of two classical computer vision problems: image classification and object detection. In practice, it turned out to be better not to commit to one of these approaches, but to choose the most suitable one for each case and to combine them flexibly. For example, we can first determine where people are in the image, then classify the cropped figures as “in workwear” or “without”, and detect the presence of a helmet in a second pass.
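To make the combination of the two approaches more concrete, here is a minimal sketch of such a pipeline. It is only an illustration under assumptions: the names `crop` and `check_frame` are hypothetical, the frame is represented as a plain list of pixel rows, and the three models are passed in as ordinary callables rather than tied to any particular framework.

```python
def crop(frame, box):
    """Cut an (x_min, y_min, x_max, y_max) pixel box out of a frame.

    Assumption: the frame is a list of rows, each row a list of pixels.
    """
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in frame[y_min:y_max]]


def check_frame(frame, detect_people, classify_workwear, detect_helmet):
    """Run the two-stage check on one frame.

    detect_people(frame)      -> list of person boxes (hypothetical callable)
    classify_workwear(person) -> True if the crop shows workwear
    detect_helmet(person)     -> True if a helmet is found on the crop
    """
    reports = []
    for box in detect_people(frame):
        person = crop(frame, box)  # stage two works on the crop only
        reports.append({
            "box": box,
            "workwear": classify_workwear(person),
            "helmet": detect_helmet(person),
        })
    return reports
```

Passing the models in as callables keeps the sketch independent of the concrete networks, so a classifier can be swapped for a detector (or the other way round) for each kind of PPE.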

On pre-cropped figures of people, classifying the presence of helmets and workwear looks like this (shown on the original picture):

Result of the workwear and helmet classification models

On the same cropped figures, this time using detection for helmets.

Result of the workwear classification model and the helmet detection model:

Stage one: human detection

Detection quality for small objects (helmets / glasses / gloves) on large frames is mediocre. It is much easier for a computer, as for a person, to first understand where the people are and only then figure out what they are wearing. So it all starts with finding the people in the frame.

Through experiments, we found that the Faster R-CNN neural network with Inception v2 as the feature extractor is well suited for detecting people. TensorFlow already has pre-trained neural networks for object detection.

For us, Faster R-CNN Inception v2 (trained on the COCO dataset) is the baseline that we try first when solving such problems.

First we detect people in the frame (and then look for PPE on the people found):

Note that we have enlarged the person's bounding box along the y axis:

In this photograph the worker was shot in good light and against a contrasting background (with images found on the Internet, this happens all the time), so the person's bounding box came out well. In our practice, however, the detection model frequently (especially in poor visibility) cuts off the person's helmet, after which looking for it in the cropped image is useless. For this reason, we enlarge the predicted bounding box by 15% along the y axis before moving on to the second stage.
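The enlargement can be sketched in a few lines. This is an illustration under assumptions, not our exact code: the function name `expand_box_y` is hypothetical, boxes are taken to be `(x_min, y_min, x_max, y_max)` in pixels, and the extra 15% of height is split evenly between the top and the bottom of the box (the split could just as well favor the top, where the helmet is).

```python
def expand_box_y(box, frame_height, ratio=0.15):
    """Grow a pixel box vertically by `ratio` of its own height.

    Assumption: half of the growth goes above the box and half below.
    The result is clamped so it never leaves the frame.
    """
    x_min, y_min, x_max, y_max = box
    pad = (y_max - y_min) * ratio / 2.0
    new_y_min = max(0.0, y_min - pad)
    new_y_max = min(float(frame_height), y_max + pad)
    return (x_min, new_y_min, x_max, new_y_max)


# Example: a 400 px tall person box in a 1280 px tall frame
# grows by 30 px at the top and 30 px at the bottom.
expanded = expand_box_y((100, 200, 300, 600), frame_height=1280)
```

Clamping matters for people standing near the top or bottom edge of the frame, where the enlarged box would otherwise point outside the image.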

When detecting people, we run into small but unpleasant problems. First, when two people walk or stand one behind the other, they are often detected as one person. Second, a static object that the model can mistake for a person, such as a hydrant, sometimes enters the camera's field of view. These problems can be solved in various ways. Our way, for example, was to come to terms with them, since overall the model suits us in terms of speed and quality.

A more fundamental problem is that industrial premises with a “danger zone” are often huge, so the people in the frames are very small. Our baseline, Faster R-CNN Inception v2, performed poorly in such cases, and in the end we tried Faster R-CNN NAS. The results were impressive: people were recognized well even at a distance, but the model was much slower than the baseline. If you have sufficient resources and need high accuracy, you can use Faster R-CNN NAS.

Stage two: catching the malicious violators

Depending on the task, the following are often used:

Image Classification Model - Inception v3
Object Detection Model - Faster R-CNN Inception v2

Classification of workwear and helmets

We tested different neural network architectures for image classification and eventually settled on Inception v3, taking advantage of the fact that it can work with variable image sizes. We already had many cropped pictures of people, so it was easy to compute the median height and width. Based on these values, we resize images to 150x400 for training the classifiers.
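Computing the median crop size is simple enough to show directly. The snippet below is a minimal sketch: the function name `median_crop_size` and the sample sizes are made up for illustration, and sizes are assumed to be `(width, height)` pairs in pixels.

```python
from statistics import median


def median_crop_size(sizes):
    """Return the (median width, median height) of a list of crop sizes.

    Assumption: `sizes` is a non-empty list of (width, height) tuples.
    """
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    return int(round(median(widths))), int(round(median(heights)))


# Hypothetical sizes of three cropped person images.
target_size = median_crop_size([(140, 390), (150, 400), (170, 420)])
```

The median is less sensitive than the mean to a few unusually large or tiny crops, which is why it is a reasonable basis for a single input size.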

To train the network to recognize PPE, we first need to collect a dataset of labeled examples. This process has subtleties that you only come to understand with experience. For example, it is better to remove from the dataset people who are cropped above the hips. This brings the dataset closer to real conditions, since in surveillance video people are visible at full height most of the time. Occlusions do happen, of course, but full silhouettes are much more typical of the target data.

Examples from our workwear dataset:

We did not invent any special metric; we use recall and precision.
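For reference, both metrics can be computed for one class directly from the labels. This is a generic sketch, not our evaluation code: the function name `precision_recall` and the sample labels are illustrative.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for the class `positive`.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    Returns 0.0 for a metric whose denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: the violation class is "not_helmet".
y_true = ["helmet", "helmet", "not_helmet", "not_helmet", "helmet"]
y_pred = ["helmet", "not_helmet", "not_helmet", "helmet", "helmet"]
violation_precision, violation_recall = precision_recall(y_true, y_pred, "not_helmet")
```

Computing the metrics separately for the violation class is useful because a missed violation and a false alarm have very different costs for the customer.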

Model for classifying the presence / absence of workwear:

Results on a validation sample

PPE detection

The classification model works faster than the object detection model, but because safety glasses and gloves are small in the image, it is difficult to build a good classifier for such PPE. Therefore, we trained the Faster R-CNN neural network on a dataset with six classes:

glasses / not_glasses
gloves / not_gloves
helmet / not_helmet

Data collection and markup

The main problems were related to the helmet dataset. It was a fascinating journey: we went through bald people, people with helmets in their hands, and even bald people with helmets in their hands.

Since at the very beginning we did not have many frames from real conditions, we assembled the dataset as best we could: we filmed ourselves and took images from the Internet or from construction sites. A little later we began to receive a lot of video from various enterprises, so we started enriching the dataset only with frames from real conditions. At some point the number of labeled images exceeded 5,000 and adding new examples stopped improving quality, so we revised our approach to labeling.

We will describe the stages of improving the helmet dataset using images from the Internet as examples, so the camera angle and quality do not quite match what we had.

In addition to the images cropped above the hips mentioned earlier, we removed images in which more than half of the helmet is cropped, to avoid confusion with caps.

We also found that when a person holds a helmet in their hands, the model often saw no violation: is there a helmet? There is. So we removed from the training dataset all frames in which a person holds a helmet with a hand, even if the helmet is on the head at that moment.

In general, we tried to remove images with an overexposed background or taken in dark rooms, and then minimized the number of photos we had taken ourselves, leaving mostly footage from production sites. As a result, the dataset shrank by half.

In addition, we enriched the dataset with bald people (otherwise the model always sees them as wearing helmets, even when they are not) and with blondes with bob haircuts, on whom the detector also finds a helmet at certain angles.

After removing unsuitable images, we moved on to the labeling itself (for object detection). It turned out to be not so simple: the quality of the final detector largely depends on exactly which area of the image is labeled as a “helmet” or “gloves”. Initially, we labeled helmets and goggles without including the face, and gloves together with part of the arm. With experience, we gradually improved the approach by looking at false positives and false negatives, such as cases where people hold helmets in their hands or where something round on something long turns out to be a “glove”. Now, when labeling helmets and glasses, we try to include the face down to the tip of the nose, and when labeling gloves we, on the contrary, limit the box to the hand.

After all this work on the dataset, we got the following results.

Model for detecting the presence / absence of PPE using helmets as an example:
Results on a validation sample before the start of “global work” on the dataset

Final results on the validation sample

Recall for helmets dropped slightly, but the metrics for detecting violations improved, which is what we wanted to achieve.

Model for classifying the presence / absence of helmets:
Results on a validation sample before the start of “global work” on the dataset

Final results on the validation sample

Note that we do not distinguish between safety goggles and prescription glasses: both go under the same “glasses” label. Light-colored gloves can also be mistaken for a bare hand. We tried to cover as wide a range of helmet and workwear colors as possible in our datasets, but we also added the simplest and most reliable technique: when gloves need to be detected, we tell customers that bright colors help increase accuracy.

At the moment we have general-purpose models that we use for the initial demo to a customer. However, it is impossible to create one model that fits everyone: we have to adapt to each customer, identify and account for new nuances, and enrich the datasets or build them anew to meet specific requirements.

Bonus

Typically, customers want to process as many cameras as possible using as few resources as possible. Batching is, of course, a good thing, but additional tricks to optimize the process are welcome too.

For example, together with colleagues from the Moscow IBM Client Center, we hypothesized that placing several cropped people together on one image before helmet detection would increase the number of cameras per server with an insignificant loss in accuracy.

As a basis, we took a 1000x600 canvas on which the people are placed. Two layout options were considered initially:

Fixed width and height (200x600), with this approach, there are 5 people on the frame.
Fixed width and height (125x600), 8 people.

We chose these options because with fixed sizes we know exactly how many people are in the image, which lets us predict the load. During development, however, we also considered a third option:

Fixed height and proportional width (*** x600), different number of persons.

We expected that larger crops with preserved proportions would give better results than the other layout options. The number of people per canvas was roughly 3 to 5.
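The layout arithmetic behind these options can be sketched as follows. This is an illustration under assumptions rather than our production code: the function names are hypothetical, only the geometry is computed (no actual image pasting), and in the proportional option crops are packed greedily from left to right until the next one no longer fits.

```python
def fixed_slots(canvas_w=1000, canvas_h=600, slot_w=200):
    """Return the (x_min, y_min, x_max, y_max) box of every fixed-size slot."""
    count = canvas_w // slot_w
    return [(i * slot_w, 0, (i + 1) * slot_w, canvas_h) for i in range(count)]


def slot_for_detection(box, slot_w=200):
    """Map a detection on the canvas back to a slot (person) index.

    Assumption: the detection belongs to the slot containing its center.
    """
    x_min, _, x_max, _ = box
    return int(((x_min + x_max) / 2) // slot_w)


def pack_proportional(sizes, canvas_w=1000, canvas_h=600):
    """Scale (width, height) crops to the canvas height and pack them in a row.

    Returns (x_offset, scaled_width) for each crop that fits, in order.
    """
    placed, x = [], 0
    for w, h in sizes:
        scaled_w = max(1, round(w * canvas_h / h))
        if x + scaled_w > canvas_w:
            break  # the remaining crops go to the next canvas
        placed.append((x, scaled_w))
        x += scaled_w
    return placed


five_slots = fixed_slots(slot_w=200)   # 200x600 layout
eight_slots = fixed_slots(slot_w=125)  # 125x600 layout
```

With fixed slots, mapping a helmet detection back to a person is a single integer division, while the proportional layout needs the stored offsets and gives a variable number of people per canvas.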

In the end, the option with fixed width and height (200x600) turned out to be the best of those considered. Of course, this method is not suitable for detecting glasses and gloves, because these objects are small, but for detecting helmets or their absence it showed good results.

For example, in a validation sample:

Authors: Tatyana Voronova (tvoronova), Elvira Dyaminova (elviraa)