How we count people using computer vision

image
Photos from open sources

Crowds of people create problems in many fields (retail, public services, banking, real-estate development). Customers need to aggregate and monitor information about the number of people in many locations: service offices, administrative buildings, construction sites, and so on.

People-counting tasks have ready-made solutions, for example cameras with built-in analytics. In many cases, however, it is important to reuse a large number of cameras already installed across different sites. Besides, a solution tailored to the specifics of a particular customer serves that customer better.

Our names are Tatyana Voronova and Elvira Dyaminova; we do data analysis at Center 2M. Although people counting seems to be one of the simplest problems currently discussed in computer vision, even here, when it comes to practice (deployment), many complex and non-trivial subtasks have to be solved. The purpose of this article is to show the complexity of, and the basic approaches to, computer vision problems using one of the basic tasks as an example. For future posts we want to bring in colleagues (devops engineers and video analytics project managers) to talk about the computing resources involved, speed measurements, the nuances of communicating with customers, and project implementation stories. Here we will focus on some of the data analysis methods we use.

Let's start with the following task statement: display the number of people in the queue at a service office. If the queue, by the customer company's internal rules, is deemed critical, an internal scenario kicks in:
  • notification of the need to open an additional entrance / cash desk;
  • manager call;
  • informing about the need to redirect flows of people to other, less busy cash desks.

Thus, our work will save customers a lot of nerves.

Machine Learning Models Used


People silhouettes detection


Initially, we decided to use an already trained people (silhouette) detection model, since such tasks have fairly good existing solutions.

The TensorFlow ecosystem, for instance, offers a large number of pre-trained models.

After conducting the tests, we first settled on two architectures: Faster R-CNN and YOLO v2. Later, after the new version appeared, we added YOLO v3.

Descriptions of the models are available online.

An example recognition result for YOLO v2 (here and below, images are taken from free sources; we cannot publish frames from customer cameras):

image

An example of a recognition result for Faster R-CNN:

image

The advantage of YOLO is speed: the model responds faster, and in some tasks this matters. In practice, however, we found that when a pre-trained model cannot be used and retraining on your own specialized training set is required, Faster R-CNN is the better choice. When the camera is installed far from people (silhouette height under 100 pixels at a resolution of 1920 by 1080), or when personal protective equipment (helmets, fasteners, protective clothing elements) must also be recognized, the quality YOLO v2 achieved after training on our own dataset (up to 10 thousand labeled objects) did not satisfy us.

YOLO v3 showed acceptable results; however, speed tests did not reveal a significant advantage over Faster R-CNN. In addition, we found ways to increase recognition speed: batching (group processing of images) and selective frame analysis (more on this below).
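As an illustration of the selective-analysis idea (the function name and the budget-splitting heuristic are ours, not from the production system), deciding how many frames per second to take from each stream given a total processing budget might look like this:

```python
def frames_to_analyze(camera_fps, n_cameras, budget_fps):
    """Pick how many frames per second to analyze from each camera so
    that the total load across all streams stays within budget_fps.
    Hypothetical heuristic: split the budget evenly, analyze at least
    one frame per second per camera, never more than the camera emits."""
    per_camera = max(1, budget_fps // n_cameras)
    return min(camera_fps, per_camera)
```

With a budget of 100 analyzed frames per second, 10 cameras at 25 fps would each be sampled at 10 fps; with 100 cameras, each drops to 1 fps.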

For all model types, we improved accuracy by post-processing the results: we removed outlier values and took the most common value over a set of consecutive frames. One second from one camera usually corresponds to 25-50 frames. Of course, to improve performance (as the number of cameras grows) we do not analyze every frame; it is often acceptable to give a final answer over an interval of several seconds, that is, based on several frames. This decision can be made dynamically, taking into account the total number of cameras (video streams to process) and the available computing power.
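A minimal sketch of this post-processing (the function name and the deviation threshold are our own choices): drop frames whose count deviates strongly from the interval's median, then take the most common remaining value.

```python
from collections import Counter
from statistics import median

def smooth_count(frame_counts, max_dev=3):
    """Post-process per-frame people counts from one camera:
    counts deviating from the median by more than max_dev are
    treated as outliers and removed, then the mode of the
    remaining counts is returned (ties resolved upward)."""
    m = median(frame_counts)
    kept = [c for c in frame_counts if abs(c - m) <= max_dev] or frame_counts
    freq = Counter(kept)
    best = max(freq.values())
    return max(v for v, n in freq.items() if n == best)
```

For example, a burst of frames [7, 7, 8, 7, 30, 7, 8] (where 30 is a detector glitch) smooths to 7.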

An example of using the Faster R-CNN model, trained on our own dataset:

image

We are now running tests with the SSD-300 model. We hope it will give a performance gain while maintaining acceptable recognition quality.

Creating your own training dataset


When we need to create our own training set, we follow this procedure:
  • we collect video clips with the required objects: customers' videos and publicly available videos (uploaded clips, surveillance camera footage);
  • we cut and filter video fragments so that the resulting dataset is balanced across the various recognition objects;
  • we distribute frames among annotators to label the required objects, using a markup tool;
  • we selectively check the annotators' results;
  • if necessary, we perform augmentation: usually rotations, reflections, and sharpness changes (forming an extended labeled dataset).
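For detection data, the labels must be transformed along with the frames. A toy example of one such augmentation (horizontal reflection; the helper name is ours), assuming boxes in [x_min, y_min, x_max, y_max] pixel coordinates:

```python
import numpy as np

def hflip_sample(image, boxes):
    """Mirror a frame and its bounding boxes horizontally.
    image: H x W x C array; boxes: [x_min, y_min, x_max, y_max] lists."""
    h, w = image.shape[:2]
    flipped = np.fliplr(image)                # reflect pixels left-right
    new_boxes = [[w - x2, y1, w - x1, y2]     # mirror x coordinates
                 for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes
```

Rotations and sharpness changes work the same way: the geometric ones remap the box coordinates, the photometric ones leave them untouched.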

Using detection zones


One of the problems with counting people in a queue is the overlap of the visibility areas of several cameras. More than one camera may be installed in a room, so it is important to store the image overlap area: when a person enters the field of view of several cameras, they must be counted once.

In some situations, people need to be detected only in a certain area of the room (near the service windows) or of an industrial site (near equipment).

For obvious reasons, it is wrong to simply check whether the bounding box (rectangle) enclosing the whole person falls into the zone (polygon). Instead, the bottom third (or half) of the rectangle is divided into points, or nodes (a 10 by 10 grid of nodes is used), and individual selected nodes are checked for falling inside the zone. The "significant" nodes are chosen by the system administrator based on the geometry of the room (defaults are used when no room-specific setting is entered).

image
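A sketch of this check, assuming pixel coordinates with y growing downward (the ray-casting polygon test is standard; the grid size and hit threshold are shown with illustrative defaults):

```python
def point_in_polygon(x, y, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):                      # edge crosses the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def box_in_zone(box, zone, grid=10, min_hits=1):
    """Lay a grid x grid lattice of nodes over the bottom third of a
    detection box and count how many nodes fall inside the zone polygon."""
    x1, y1, x2, y2 = box
    top = y2 - (y2 - y1) / 3                          # bottom third only
    hits = sum(
        point_in_polygon(x1 + (x2 - x1) * i / (grid - 1),
                         top + (y2 - top) * j / (grid - 1), zone)
        for i in range(grid) for j in range(grid))
    return hits >= min_hits
```

Restricting the lattice to selected "significant" nodes, as described above, is a matter of masking out grid positions before the sum.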

In addition, we are testing the Mask R-CNN architecture for our tasks. It determines the silhouette outline, which will make it possible to move away from bounding boxes when analyzing the intersection with a zone.

image

Another approach: head detection (model training)


Quality is not always achieved by choosing a model, enlarging or changing the training set, and other purely ML methods. Sometimes a decisive improvement can only be obtained by changing the very formulation of the problem, as in our case. In queues people crowd together and therefore overlap each other, so silhouette recognition quality alone is often insufficient for real conditions.

Take the image below. Let's overlook the fact that it was taken on a phone and that its tilt angle does not match that of CCTV cameras. There are 18 people in the frame, and the silhouette detection model found 11:

image

To improve the results, we moved from detecting silhouettes to detecting heads. For this, a Faster R-CNN model was trained on a dataset taken from link (the dataset includes frames with varying numbers of people, including large crowds, and people of different races and ages).

In addition, we enriched the dataset with frames from the customer's camera footage, by about a third (mainly because the original dataset had few heads in hats). A tutorial proved useful for training the model ourselves.

The main problems we encountered were image quality and object scale. Heads come in different sizes (as can be seen in the image above), and the customer's camera frames had a resolution of 640x480; because of this, odd objects (hoods, Christmas balls, chair backs) are sometimes detected as heads.

For example, in the training dataset, we have labeled heads:

image: these are labeled heads in the dataset;

image: and this is the back of a chair, but the model wants to believe it is a head.

Overall, however, this model copes quite well where people are massed together. In the frame above, our model found 15 people:

image

Thus, in this image the model missed only three heads, which were heavily occluded by other objects.

To improve model quality further, the current cameras can be replaced with higher-resolution ones, and additional training data can be collected and labeled.

Nevertheless, keep in mind that with a small number of people, detection by silhouettes rather than heads works better, since a silhouette is harder to occlude completely or to confuse with foreign objects. With a crowd, however, there is no alternative, so for counting people in a queue we decided to run two models in parallel, one for heads and one for silhouettes, and combine their answers.

An example recognition result with both silhouettes and heads:

image

Accuracy rating


For testing, we selected frames that did not participate in training (a dataset with varying numbers of people per frame, at different angles and scales). To assess model quality, we used recall and precision.

Recall (completeness) shows what proportion of the objects that actually belong to the positive class we predicted correctly.

Precision (accuracy) shows what proportion of the objects we recognized as positive-class objects actually belong to it.
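In code, with detections matched to ground truth giving true positives (TP), false positives (FP), and missed objects (FN), the two metrics are:

```python
def precision_recall(tp, fp, fn):
    """Precision: share of predicted positives that are correct.
    Recall: share of actual positives that were found."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall
```

For instance, 8 correctly detected heads, 2 false alarms, and 4 misses give a precision of 0.8 and a recall of about 0.67.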

Metrics on frames from cameras at test sites (images from these rooms were in the dataset):

image

On frames from new cameras (rooms that were not in the dataset):

image

When the customer needed a single number combining precision and recall, we provided their harmonic mean, the F-measure:

F1 = 2 · precision · recall / (precision + recall)

Reporting


An important part of the service is statistics. Along with individual frames (with the detected people highlighted and counted), customers want the results as ready-made reports (dashboards) with average and maximum occupancy over various time intervals. The results are often of interest as graphs and charts showing the distribution of the number of people over time.

For example, in our solution the number of people in a frame is computed with both models (silhouettes and heads) and the maximum is taken. If there are several cameras in a room, the image overlap zone (pre-set via the interface) is stored, and a person who enters the field of view of several cameras is counted once.
Next, the queue-length value is formed over several consecutive frames, that is, over an interval Δt. Within an hour, values for several such intervals are produced for each room.

The interval size and the number of intervals are chosen based on the number of rooms and the available computing power. For each interval, an array of people-count values is formed.

The most common value (the mode) is selected; if several values occur with the same frequency, the maximum of them is taken.

The resulting value is the number of people in the queue at the time t immediately following the interval in question. Over an hour, this yields a set of values at time instants t_1, t_2, ..., t_n.

Then, over t_1, t_2, ..., t_n, the maximum and average numbers of people are calculated; these values appear in the report as the peak and average load for the given hour.
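The aggregation described above can be sketched end to end (function names are ours; the per-frame combination of the two models and the interval collapsing follow the rules stated in the text):

```python
from collections import Counter

def frame_count(silhouettes, heads):
    """Per-frame answer: the larger of the two models' counts."""
    return max(silhouettes, heads)

def interval_value(frame_counts):
    """Collapse per-frame counts for one interval dt into a single number:
    the mode, with ties broken toward the larger count."""
    freq = Counter(frame_counts)
    best = max(freq.values())
    return max(v for v, n in freq.items() if n == best)

def hourly_load(interval_values):
    """Peak and average load for the hour from per-interval values."""
    return max(interval_values), sum(interval_values) / len(interval_values)
```

So an interval with per-frame counts [3, 4, 4, 3] yields 4 (tie resolved upward), and interval values [2, 5, 3] report a peak of 5 and an average of about 3.3 for the hour.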

Diagram of the distribution of people over time, maximum load (a simple example):

image

Diagram of the distribution of people over time, average load (a simple example):

image

Crowds


In conclusion, for completeness, we would like to mention very large crowds, such as those in stadiums or in places of heavy foot traffic.

Such tasks are about estimating crowd size: for a crowd of 300 people, an answer of 312 or 270 is considered acceptable.

In practice, we have not had to solve such problems with video analytics (at an organized event it is easier to give each person a badge). We did, however, run tests. Separate methods exist for such tasks; see the overview of the methods.

We reproduced the result of a model from the overview (pre-trained CSRNet):

image

The camera angle matters for this model's settings: with a fixed shooting location, the result will be better than on diverse images. Generally speaking, the model can be retrained, so quality can be improved during operation once real video from the installed cameras is available.
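CSRNet belongs to the density-map family of crowd counters: the network predicts a per-pixel density whose integral over the frame is the head count. A minimal sketch of that final counting step (the density map itself would come from the network):

```python
import numpy as np

def crowd_count(density_map):
    """Density-map crowd counting: the network outputs a per-pixel
    density; summing (integrating) it over the frame yields the
    estimated number of people."""
    return float(np.sum(density_map))
```

This is why an answer of 312 for a real crowd of 300 is a normal outcome: the estimate is a continuous integral, not a discrete detection count.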

Authors of the article: Tatyana Voronova (tvoronova), Elvira Dyaminova (elviraa)
