When to stop the video sequence recognition process?

Hi Habr! Today we would like to tell you about a very interesting problem that we have been dealing with since the very start of our research on document recognition in video streams: the problem of finding the optimal stopping time.



Fig. 1. Images of text fields of ID documents on frames of a video sequence.

As many people know, the main product of Smart Engines is an identity document recognition system, Smart IDReader, which runs on servers, desktops, and mobile devices. In many cases, recognition on a mobile phone has to happen offline, we often cannot control the shooting conditions and quality, and the cost of a recognition error, especially when recognizing identity documents, is very high. At the same time, we have an important advantage: to increase recognition accuracy we can use not just one image but a whole sequence of captured frames (see Fig. 1).

When using multiple images of text fields, two important questions arise. The first is how to combine the observations. Should we merge the source frames into a single, higher-"quality" image, or pick one best frame (but then what guarantees that a good frame will appear at all)? Or perhaps recognize the field on each frame and then select the most "confident" result (by what criterion?), or combine the frame-by-frame recognition results, and so on. We follow the latter approach (frame-by-frame recognition with interframe combination of results), although the optimal approach may differ depending on the recognition engine used and on other parameters of the task; a minimal sketch of the idea is given below.
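To make the frame-by-frame approach concrete, here is a minimal Python sketch of interframe combination by per-character voting. The function name and the assumption that all per-frame results are already aligned to the same length are ours for illustration; a production combiner also has to handle alignment and per-character confidences.

```python
from collections import Counter

def combine_by_voting(per_frame_results):
    """Naive interframe combination: majority vote per character position.
    Assumes all frame-by-frame results are pre-aligned to equal length."""
    combined = []
    for position_chars in zip(*per_frame_results):
        votes = Counter(position_chars)
        combined.append(votes.most_common(1)[0][0])
    return "".join(combined)

# A single misrecognized character is outvoted by the other frames:
# combine_by_voting(["SM1TH", "SMITH", "SMITN"]) -> "SMITH"
```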

The second question, which arises independently of the first, is when to stop the observation process. In other words, by what criterion do we decide that the capture process can be stopped and the result accumulated by the current moment declared final? How do we compare such criteria with each other, and is there an optimal one?

It is this problem of finding the right moment to stop the observation process that we discuss in this article.

What we want to achieve


Recognition of a text line in a video sequence, where the result improves in one way or another after each new observation is captured, can be viewed as an anytime algorithm: an algorithm with a sequentially improving result whose computation can be stopped at any time. A convenient tool for visualizing the quality of such algorithms is the "expected performance profile": a graph of the quality of the result, measured in one form or another, against the computation time.


Fig. 2. Performance profiles of text line recognition in a video sequence (lower is better). The black dotted line is the frame-by-frame quality, the black solid line is the result of interframe combination. The orange line is what we want from a "good" stopping rule.

Fig. 2 shows the performance profiles for text line recognition: the dependence of the average error (in terms of the Levenshtein distance to the correct answer) on the number of processed frames. The black graphs were obtained using Tesseract v4.0.0 on the text fields of the MIDV-500 dataset. It can be seen that interframe combination of the recognition results achieves a much lower error (which, in general, is not surprising).
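For reference, such a profile is easy to compute from per-sequence error trajectories. A sketch (the layout of the `errors` array is our assumption):

```python
import numpy as np

def expected_performance_profile(errors):
    """errors[s][k]: error (e.g. Levenshtein distance to the ground truth)
    of the combined result for sequence s after k + 1 processed frames.
    Returns the average error after each number of processed frames."""
    return np.asarray(errors, dtype=float).mean(axis=0)
```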

What do we want from a "good" stopping rule? Since we quite reasonably expect that the longer we continue the process, the better the result will be on average, it would be great if the rule kept observing "longer" on those video sequences where there is a chance to reduce the error, and stopped earlier on those where the result is already good or there is no chance to improve it. This way, for the same average quality of the combined result, fewer frames are processed on average; and conversely, for the same average number of frames, the average result is better. In other words, it is important to understand that a stopping rule is not only about minimizing time, it is also about maximizing quality for the same (average) time.

Let us look for the stopping rule in the following form: after processing the next frame and obtaining the combined recognition result, we compute some characteristic and threshold it; if, say, it is below the threshold, we stop, otherwise we continue. Then, for a fixed stopping rule with a varying threshold, we again obtain a performance profile, with the only difference that the horizontal axis now shows the average number of processed frames (see the orange graph in Fig. 2). The lower this graph lies, the more effective we consider the stopping rule to be. The original profile of the "combined result" can be viewed as the performance profile of the trivial stopping rule, which simply thresholds the number of processed frames.
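Here is a sketch of how such a profile for a thresholded stopping rule can be computed offline (the data layout and names are our assumptions): for each threshold we replay every sequence, stop as soon as the statistic falls below the threshold, and record one (average frames, average error) point.

```python
def stopping_rule_profile(sequences, thresholds):
    """sequences[s]: list of (error, statistic) pairs, one per processed
    frame, precomputed for sequence s. Returns one profile point
    (mean frames used, mean final error) per threshold."""
    profile = []
    for t in thresholds:
        frames_used, final_errors = [], []
        for seq in sequences:
            for n, (error, stat) in enumerate(seq, start=1):
                if stat < t or n == len(seq):
                    break  # the rule fires, or the sequence runs out
            frames_used.append(n)
            final_errors.append(error)
        profile.append((sum(frames_used) / len(sequences),
                        sum(final_errors) / len(sequences)))
    return profile
```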

What the theory says


In mathematical statistics, problems of finding the optimal stopping time are well known and have long been studied. Perhaps the most famous problems of this class are the fussy bride problem, also known as the secretary problem (studied, for example, by Boris Abramovich Berezovsky), the house-selling problem, and others. Without going deep into the theory, let us briefly pose recognition in a video sequence as an optimal stopping problem.

We denote the error of the combined result at the $n$-th step of the process as $\epsilon(R_n)$. We assume that if we stop at the $n$-th step, we incur a loss of the form $L_n = \epsilon(R_n) + c \cdot n$, where $c$ is some predetermined relative cost of each observation. The problem of finding the optimal stopping time can then be formulated as a search for a random variable $N$, the stopping time, whose distribution depends on the input observations (in some literature $N$ is called a Markov moment), that minimizes the expected loss $\mathrm{E}(L_N)$.
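As a worked example, the expected loss of any stopping rule can be estimated by Monte Carlo over recorded error trajectories. This is an offline evaluation sketch and the rule interface is our assumption (at runtime the true errors are, of course, unknown):

```python
def estimated_expected_loss(error_trajectories, should_stop, c):
    """Estimate E(L_N) = E(eps(R_N) + c * N) for the stopping rule
    should_stop(observations_so_far) -> bool over a set of recorded
    per-frame error trajectories (one list of errors per sequence)."""
    total = 0.0
    for errors in error_trajectories:
        for n in range(1, len(errors) + 1):
            if should_stop(errors[:n]) or n == len(errors):
                total += errors[n - 1] + c * n  # realized loss L_N
                break
    return total / len(error_trajectories)
```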

Monotone problems


Under certain conditions the optimal stopping rule in such problems can be expressed explicitly. An example is the class of so-called monotone problems. The monotonicity criterion is as follows: if at some step $n$ the loss $L_n$ does not exceed the expected loss at the next step, then the same holds at all subsequent steps. In other words, the event $L_n \le \mathrm{E}(L_{n+1} \mid n \text{ frames received})$ implies the event $L_{n+1} \le \mathrm{E}(L_{n+2} \mid n+1 \text{ frames received})$. For monotone problems, the so-called myopic ("short-sighted") stopping rule is optimal: stop at the very first step at which the condition $L_n \le \mathrm{E}(L_{n+1} \mid n \text{ frames received})$ is satisfied.

Suppose our problem is monotone. Writing the myopic rule in terms of our loss function $L_n$, we get that we need to stop as soon as the following condition is satisfied:

$$\epsilon(R_n) - \mathrm{E}(\epsilon(R_{n+1}) \mid n \text{ frames received}) \le c.$$

This, of course, is great, but to implement such a rule we need to be able to estimate at runtime not only the "correctness" of the current result but also the expected correctness of the next one, which is far from simple (not to mention that we boldly demanded monotonicity from the problem). Can we somehow apply this rule without directly estimating the "correctness" of the result? We can try to bound the left side of the inequality from above.

How can the myopic rule be used?


Let us assume that the function $\epsilon(R_n)$ is the distance $\rho(R_n, R^*)$ from the combined recognition result $R_n$ to the "correct answer" $R^*$, and that, like any self-respecting distance, it satisfies the triangle inequality. The Levenshtein distance mentioned above satisfies the triangle inequality, as does the simple "right/wrong" function (if we assume that $\rho(R_n, R^*)$ is zero for the correct answer and a constant for a wrong one). By the triangle inequality, the left side of our stopping criterion does not exceed $\mathrm{E}(\rho(R_n, R_{n+1}) \mid n \text{ frames received})$, the expected distance from the current combined result to the next one.

Let us also require of our interframe combination algorithm that, on average, the distance between adjacent combined results does not increase over time (that is, we treat the sequence of combined results as a "converging" process, although not necessarily one converging to the correct answer). Then, if the expected distance from the current result to the next one becomes less than the threshold $c$, two things happen at once. First, the myopic stopping criterion is satisfied (since its left side is bounded from above by this very distance). Second, the problem becomes monotone: at the next step the expected distance to the next answer will not grow, which means it will remain below the threshold $c$, and the myopic criterion will be satisfied again.

This means that if we expect the distances between adjacent combined results not to increase on average, then we should stop by thresholding the expected distance from the current result to the next one, thereby approximating the optimal rule. It should be understood that such a rule is no longer optimal (since the "real" optimal rule could stop earlier), but at least we will not stop earlier than necessary.

There are several ways to estimate the expected distance from the current combined result to the next one. For example, if the Levenshtein distance is used as the metric over results, then even simply the distance between the two most recent combined results is a good approximation. Another possible approach is to simulate a possible next observation (for example, based on the ones already obtained), perform "idle" combinations, and compute the average distance to the obtained predictions.
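Both estimates fit in a few lines of Python. Below is a sketch of the second approach (the combiner interface and the resampling scheme are our assumptions): the next observation is modeled by resampling one of the already obtained frame results, an "idle" combination is performed, and the distances to the hypothetical next combined result are averaged; we stop as soon as this estimate drops below the threshold $c$.

```python
import random

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance; it is a metric,
    so it satisfies the triangle inequality used above."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def expected_next_distance(frame_results, combined, combine, n_samples=10):
    """Estimate E(rho(R_n, R_{n+1})): resample an already seen frame result
    as a model of the next observation, run an 'idle' combination with an
    assumed combine(current, observation) interface, and average the
    distances to the hypothetical R_{n+1}."""
    total = 0
    for _ in range(n_samples):
        simulated_next = random.choice(frame_results)
        hypothetical = combine(combined, simulated_next)
        total += levenshtein(combined, hypothetical)
    return total / n_samples

# Stopping rule: stop as soon as
#   expected_next_distance(frame_results, combined, combine) < c
```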


Fig. 3. Comparison of performance profiles for several stopping rules.

Fig. 3 shows the performance profiles for several stopping rules. $N_K$ is the "trivial" rule mentioned above, with a threshold cut-off on the number of processed observations. $N_{CX}$ and $N_{CR}$ are threshold cut-offs on the maximum cluster size of identical frame-by-frame ($N_{CX}$) and combined ($N_{CR}$) results. $N$ is the rule described in this article, with the expected distance to the next result estimated using modeling and "idle" combinations.

Instead of a conclusion


The aim of this article was to show that document recognition systems working on frames of a video sequence give rise to many interesting problems, not only from computer vision proper but also from other fields. Although we presented the problem of finding the optimal stopping time only in its simplest form, the most interesting part begins when, for example, we want the model to account not just for the number of frames but for the real execution time. We hope you found it interesting, and thank you for your attention!

The publication is based on the article: K. Bulatov, N. Razumnyi, V. V. Arlazarov, On optimal stopping strategies for text recognition in a video stream, IJDAR 22, 303-314 (2019), doi: 10.1007/s10032-019-00333-0.

If the reader is interested in the theory of optimal stopping, in particular in the proof of the optimality of the myopic rule for monotone problems, we highly recommend Thomas Ferguson's (UCLA) published course Optimal Stopping and Applications.

Some other interesting sources on this subject:
Chow Y.S., Robbins H., Siegmund D. Great Expectations: The Theory of Optimal Stopping. Houghton Mifflin, 1971, 139 p.
Berezovsky B.A., Gnedin A.V. The Best Choice Problem. Nauka, 1984, 200 p. (in Russian)
Ferguson T.S., Hardwick J.P. Stopping rules for proofreading. Journal of Applied Probability, Vol. 26, No. 2, 1989, pp. 303-313.
Ferguson T.S. Mathematical Statistics: A Decision Theoretic Approach. Academic Press, 1967, 396 p.
