Some facts about cascading classifiers that are rarely given serious consideration in scientific articles.


Hi, Habr! Today we'll talk about recognition again, namely about a recognizer model as simple as the cascading classifier. This is the cascade used in the popular Viola and Jones method, which has already been written about many times on Habr (for example, here, here and here). The sad thing is that, despite the abundance of articles, nobody has seriously studied cascading classifiers, neither on Habr nor in the scientific community. Although the cascading classifier looks simple, it hides plenty of pitfalls and interesting features. So we are eager to share our knowledge with you. If you are interested, welcome under the cut.

The cascading classifier is a very simple model. It consists of several successive levels, each of which can be viewed as a binary classifier. The sample under study is fed to the input of the first level and then "travels" from level to level. If at some level the classifier labels the sample as negative, it is no longer analyzed by the remaining levels of the cascade. This simple model became popular after the publication of the Viola and Jones method [1], where, as stated, it was used to achieve high performance. But is that really all it is for? And is the cascading classifier really that simple? Let's figure it out!
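To make this control flow concrete, here is a minimal Python sketch of cascade evaluation (a sketch under our own assumptions: the stages list and its score method are illustrative, not an API of any particular library):

```python
# A minimal sketch (our illustration, not code from [1]): each level is
# assumed to expose a score(x) method, with score <= 0 meaning "negative".

def cascade_predict(stages, x):
    """Return True only if every level of the cascade accepts the sample x."""
    for stage in stages:
        if stage.score(x) <= 0:
            return False  # early exit: remaining levels are never evaluated
    return True
```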

We will build today's article in a format that is new for us: we will select several common statements, examine each of them in detail, and even refute some of them. So let's get started!

The cascading classifier in the Viola and Jones method is simply used to speed up the operation of the object detector


In the original article [1] on the very first page there is such a phrase:
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focussing attention on promising regions of the image.

Indeed, the original Viola and Jones method is designed to find objects in an image. The detection problem is solved by the sliding window method: a binary classifier is applied at every point of the image under study, at several scales. The imbalance of the data at the detection stage ("empty" regions without the desired object outnumber regions with objects by a factor of millions or even billions) prompted the use of a cascade, a mechanism that quickly cuts off "empty" regions.
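For clarity, here is what the detection loop itself might look like (a sketch under our own assumptions: image is a 2D grayscale array, cascade_predict is the evaluation routine sketched above, and the window size, stride and scales are illustrative parameters):

```python
# An illustrative sliding-window detection loop; most windows are "empty",
# so cascade_predict rejects them after the first level or two.

def detect(image, stages, base_win=24, stride=4, scales=(1.0, 1.5, 2.25)):
    """Apply the cascade to every window position at several scales."""
    detections = []
    rows, cols = image.shape  # image: 2D numpy-style grayscale array
    for scale in scales:
        win = int(base_win * scale)
        for y in range(0, rows - win + 1, stride):
            for x in range(0, cols - win + 1, stride):
                window = image[y:y + win, x:x + win]
                if cascade_predict(stages, window):
                    detections.append((x, y, win))
    return detections
```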

But so far we have been talking about an already trained classifier. Now let's turn to the training procedure. It turns out that exactly the same imbalance arises there: the number of negative examples is millions or even billions of times greater than the number of positive examples. But thanks to the cascade architecture, each new level is trained by AdaBoost not on the entire negative training set, but only on a limited set of false positives of the previous levels of the cascade! This makes it possible to run the AdaBoost training machinery on a balanced and limited sample, as sketched below.
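Here is a minimal sketch of this bootstrapping scheme (again under our own assumptions: train_adaboost_level is a hypothetical stand-in for a strong-classifier trainer, e.g. AdaBoost over Haar-like features):

```python
# A sketch of level-by-level training with bootstrapped negatives.

def train_cascade(positives, negative_stream, num_levels, negs_per_level):
    stages = []
    for _ in range(num_levels):
        # Keep only negatives that the current cascade wrongly accepts:
        # these are exactly the false positives of the previous levels.
        hard_negatives = []
        for x in negative_stream():
            if all(stage.score(x) > 0 for stage in stages):
                hard_negatives.append(x)
            if len(hard_negatives) == negs_per_level:
                break
        stages.append(train_adaboost_level(positives, hard_negatives))
    return stages
```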

As you can see, the cascading classifier in the Viola and Jones method pays off twice:

  1. It makes the classifier easy to train, naturally avoiding the problem of an "infinite" training set;
  2. It quickly cuts off "empty" regions during object detection, which yields high average performance.

Well, let's continue to study the classic cascade and turn to the issue of performance.

So, the cascading classifier is just a tool for acceleration


Let us return once more to the question of the cascade's purpose, but now from a different angle. If you look at the cascading classifier mathematically, you can see that the cascade is a conjunctive form of strong classifiers (each of which can be represented as a linear composition of features):

$$\mathrm{Cascade}(x) = \prod_{i=1}^{N}\left[S_i(x) > 0\right], \qquad S(x) = \left[\sum_{t=1}^{T}\alpha_t h_t(x) > 0\right],$$

where $[\,\cdot\,]$ is the indicator function.

Given a limited number of available features (which, in practice, when chasing performance, is a perfectly normal situation), the conjunctive form of strong classifiers has greater expressive power than a single linear classifier. This is easy to see with a simple example. Let the feature space consist of two features, and let the positive and negative objects, expressed in the coordinates of these features, be located as shown in the figure below (green objects are positive, red ones are negative). Clearly, no single linear classifier can correctly separate this sample, but a cascade of four levels is guaranteed to cope with the task.
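Here is a toy Python illustration of exactly this effect (our own example: positives fill a square, negatives surround it, and four linear levels carve the square out):

```python
# Positives fill the square [-1, 1] x [-1, 1], negatives lie outside it.
# No single linear classifier separates the two classes, but a conjunction
# of four linear levels, one per side of the square, separates them exactly.

def linear_stage(w0, w1, b):
    return lambda p: w0 * p[0] + w1 * p[1] + b  # S(x) = w0*x0 + w1*x1 + b

stages = [
    linear_stage( 1,  0, 1),   # accepts points with x0 > -1
    linear_stage(-1,  0, 1),   # accepts points with x0 <  1
    linear_stage( 0,  1, 1),   # accepts points with x1 > -1
    linear_stage( 0, -1, 1),   # accepts points with x1 <  1
]

def cascade(p):
    return all(s(p) > 0 for s in stages)  # conjunctive form of four levels

print(cascade((0.0, 0.0)))  # True: the point is inside the square
print(cascade((2.0, 0.0)))  # False: rejected by the second level
```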


Thus, in addition to increasing performance, a cascading classifier also provides greater expressive power than a single linear classifier under conditions of a limited number of features.

The cascading classifier provides consistently high performance and can easily be used in real-time recognition systems


As we said above, the cascade scheme achieves high performance through the rapid screening of "empty" regions, whose number is several orders of magnitude greater than the number of regions containing the object. The processing time of an "empty" region differs from the processing time of a region with an object by several times, roughly in proportion to the length of the cascade (the number of its levels).

Since the number of regions containing the object differs from image to image, the processing time of each frame also differs. Because there are far fewer regions with an object in a frame than regions without one, the per-frame difference in processing time amounts not to tens of times but to tens of percent, which is nevertheless significant in industrial recognition systems.

Thus, the running time of the cascading classifier can vary significantly from picture to picture. Hence, when seriously measuring the performance of classifiers, one should measure the running time both on average and in the worst case, and always be prepared for such timing "inconsistencies" when using cascading classifiers.
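For instance, a measurement routine might look like this (a sketch; frames and process_frame are placeholders for a real video sequence and detector):

```python
import time

# Measure both the average and the worst-case per-frame processing time.

def benchmark(frames, process_frame):
    timings = []
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)
        timings.append(time.perf_counter() - start)
    average, worst = sum(timings) / len(timings), max(timings)
    return average, worst
```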

In our practice, we have already faced serious problems caused by a significant gap between the average and worst-case running time of a cascade. As part of a toll road automation project, we solved the problem of recognizing the vehicle type, where one of the main subtasks was counting wheelsets. Naturally, we used the Viola and Jones method to detect wheels in individual frames. Due to the great variability of the wheels (see the figure below), the trained cascade turned out to be quite long (20 levels). We observed firsthand the unpleasant effects of varying per-frame processing time, which seriously hindered building a real-time recognition system.


We then developed the idea of the classic cascading classifier into a full-fledged decision tree and devised a dedicated training technology for such a tree (recall that the algorithm had to avoid the problems associated with the "endless" training set). Details of this algorithm can be found in our scientific work [3]. Here we report only a few facts (a sketch of how such a tree is evaluated follows the list):

  1. The longest path in the trained tree consists of 6 strong classifiers (the scheme of the trained tree classifier is shown in the figure below).
  2. The trained tree classifier achieved better quality than the previously trained cascade. This is logical: the expressive power of the tree-like cascade (a conjunctive-disjunctive form) is higher than that of the plain cascade (a conjunctive form).
  3. The trained tree classifier seriously outperformed the cascade in the worst case while losing almost nothing on average.
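To give an idea of the shape of such a classifier, here is a schematic of tree evaluation (our own simplification; the actual structure and training procedure are described in [3]):

```python
# Every inner node holds a strong classifier; its answer selects the branch
# to descend into, and the final decision is read off at a leaf.

class TreeNode:
    def __init__(self, stage=None, accept=None, reject=None, verdict=False):
        self.stage = stage      # strong classifier; None marks a leaf
        self.accept = accept    # subtree taken when the stage accepts x
        self.reject = reject    # subtree taken when the stage rejects x
        self.verdict = verdict  # final answer stored in a leaf

def tree_predict(root, x):
    node = root
    while node.stage is not None:
        node = node.accept if node.stage.score(x) > 0 else node.reject
    return node.verdict
```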




The table below presents a numerical comparison of the cascade and tree classifiers.

                        Sensitivity   Specificity   Time on average, μs   Time at worst, ms
Cascading classifier    93.55%        99.98%        58159                 67.432
Tree classifier         94.02%        99.99%        58717                 63.552

Thus, if you really want to use cascading classifiers in real-time recognition systems, always remember the difference between their average and worst-case running times.

Cascading classifier training technology is so obvious that there is nothing to seriously think about


Oh, this is probably one of the most painful topics related to cascading classifiers. The point is that in all the articles we have come across, the cascade training process is described poorly and superficially, without a proper justification of the effectiveness of the resulting cascade. The cascade training algorithm usually looks something like this:

  1. Choose the target false positive rate ($F$) for the entire cascade.
  2. Choose the true detection rate ($d$) and the false positive rate ($f < F$) for each individual level.
  3. Set aside a validation sample to honestly evaluate the quality indicators of the final classifier.
  4. Train each new level of the cascade (which, we recall, is trained on all available positive examples plus the false positives of the current cascade) so that its indicators $d_i$ and $f_i$ are no worse than the targets, that is, $d_i > d$ and $f_i < f$. Incidentally, the process of ensuring these indicators is of considerable interest in itself.
  5. Add the newly trained level to the cascade and evaluate the quality indicators on the validation sample. If the false positive rate is less than the target $F$, finish training. Otherwise, go to step 4 to train a new level of the cascade (a sketch of this loop follows the list).
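A minimal sketch of this loop (the helpers false_positive_rate, collect_false_positives and train_level_to_targets are hypothetical stand-ins for validation measurement, negative bootstrapping, and per-level training with thresholds tuned to reach the d and f targets):

```python
# The classic cascade training loop, written out schematically.

def train_classic_cascade(positives, negative_stream, validation,
                          target_F, level_d, level_f):
    stages = []
    while false_positive_rate(stages, validation) > target_F:
        hard_negatives = collect_false_positives(stages, negative_stream)
        level = train_level_to_targets(positives, hard_negatives,
                                       min_d=level_d, max_f=level_f)
        stages.append(level)
    return stages
```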


If the above algorithm has trained $K$ levels of the cascade, the average complexity $N$ and the overall detection rate $D$ of the cascade can be estimated as follows:

$$N = n_1 + \sum_{i=2}^{K}\left(n_i \prod_{j=2}^{i} p_j\right), \qquad D = \prod_{i=1}^{K} d_i,$$

where $n_i$ is the complexity of the $i$-th level of the cascade, $p_i$ is the probability of evaluating the $i$-th level, and $d_i$ is the detection rate of the $i$-th level.
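These estimates are trivial to compute from per-level statistics (an illustrative snippet; n, p and d are per-level lists indexed from 0, so n[0] corresponds to n_1, and p[0] is unused because the first level is always evaluated):

```python
# Direct computation of the two estimates above.

def average_complexity(n, p):
    N, reach = n[0], 1.0
    for i in range(1, len(n)):
        reach *= p[i]       # probability that level i+1 is reached at all
        N += n[i] * reach
    return N

def detection_rate(d):
    D = 1.0
    for d_i in d:
        D *= d_i            # the overall rate is the product of level rates
    return D
```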

As you can see, the complexity of the cascade does not figure anywhere in the training algorithm presented above, so it certainly cannot be called efficient in terms of performance. At the same time, we know for certain that scientists all over the world are firmly convinced of the importance of training algorithms for efficient cascades. As confirmation, here is a quote from the article by Paul Viola and Michael Jones [4]:
The overall training process involves two types of tradeoffs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which
– the number of classifier stages,
– the number of features, $n_i$, of each stage,
– the threshold of each stage
are traded off in order to minimize the expected number of features $N$ given a target for $F$ and $D$. Unfortunately finding this optimum is a tremendously difficult problem.

Interestingly, our review of the relevant literature, conducted in 2016, showed that at that time there were no performance-efficient training algorithms for cascading classifiers. Incidentally, that is when we at Smart Engines tackled this problem. We considered the following functional, which depends on the detection errors of the first kind ($E_1$), the detection errors of the second kind ($E_2$), and the average complexity ($N$):

$$F(E_1, E_2, N) = \beta_1 E_1 + \beta_2 E_2 + \beta_3 N.$$


The parameters $\beta_1$, $\beta_2$ and $\beta_3$ set the relative costs of the errors of the first and second kind and of the complexity of the resulting detector. Then, in the process of training the cascading classifier, a greedy algorithm was used to enumerate the parameters at each level in order to obtain cascades that minimize the chosen functional (a schematic sketch is given below). A detailed description of the developed algorithm is beyond the scope of this article, but we are always ready to share it with you, our reader, via the bibliographic reference [5].
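Schematically, the greedy search can be pictured as follows (our own reconstruction of the idea, not the exact algorithm of [5]; train_level and evaluate are hypothetical helpers):

```python
# At every level, candidate level parameters are enumerated greedily and the
# candidate giving the smallest value of F = b1*E1 + b2*E2 + b3*N is kept.

def greedy_train(num_levels, candidate_params, train_level, evaluate,
                 b1, b2, b3):
    stages = []
    for _ in range(num_levels):
        best_level, best_cost = None, float("inf")
        for params in candidate_params:
            level = train_level(stages, params)      # hypothetical trainer
            E1, E2, N = evaluate(stages + [level])   # errors and complexity
            cost = b1 * E1 + b2 * E2 + b3 * N
            if cost < best_cost:
                best_level, best_cost = level, cost
        stages.append(best_level)
    return stages
```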

Conclusion


To summarize, using the cascading classifier model as an example, we want to draw the following conclusions:

  1. It is rare to find a scientific work in which all the necessary details are thoroughly described.
  2. It is very useful to re-read scientific articles while seriously thinking about the properties and limitations of the models presented in them, even if at first glance everything in the article seems fully "chewed over".
  3. There will always be room for worthy scientific research, even if the model under study was proposed 20 years ago.

We will be very pleased if the material presented in this article proves useful, and we are, of course, always ready for a fruitful discussion in the comments.

Thank you.

List of sources used
  1. Viola P., Jones M. J. Robust real-time object detection // International Journal of Computer Vision. – 2001.
  2. Bourdev L., Brandt J. Robust object detection via soft cascade // 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). – IEEE, 2005. – Vol. 2. – P. 236-243.
  3. Minkina A. et al. Generalization of the Viola-Jones method as a decision tree of strong classifiers for real-time object recognition in video stream // Seventh International Conference on Machine Vision (ICMV 2014). – International Society for Optics and Photonics, 2015. – Vol. 9445. – P. 944517.
  4. Viola P., Jones M. J. Robust real-time face detection // International Journal of Computer Vision. – 2004. – Vol. 57. – No. 2. – P. 137-154.
  5. . . . - «» // . – 2016. – . 30. – №. 3. – . 241-248.

