How we use computer vision algorithms: video processing in a mobile browser using OpenCV.js

Everything needed to identify a person online already exists, but these capabilities are still rarely used. We were perhaps among the first to implement the scenario that is optimal for the user: open the site from a smartphone, take a photo of your driver’s license or passport, and send the data to the system.

Let's consider how computer vision algorithms help recognize documents in a video stream directly in a mobile browser. In this article, we at SimbirSoft share our experience of using OpenCV.js for this task: what difficulties can arise, and how to ensure speed and a “smooth” UX without stutters.




What was the task


The business scenario for the algorithm is as follows: a user who opens the site from a mobile phone should be able to photograph their documents and send them to the system for further processing. This can be part of an identity verification process when signing up for a service.

A web application is preferable to a mobile application in this scenario because of its accessibility and the reduced time needed to complete the operation. A web page requires no installation and is ready to work immediately after loading. The user can proceed to the action they need - submitting an application - right after receiving the link, without being distracted by extra steps. From a business perspective, these factors increase the conversion rate and the commercial effectiveness of the process.

From an architectural point of view, the algorithm is only required to detect the boundaries of the document and crop the excess background from the image. Identity verification, authentication, and fraud checks will be implemented by other components. However, it is advisable to carry out at least minimal checks to avoid sending business cards, blank paper rectangles, and other obviously irrelevant images for processing.

Requirements


Our project imposed the following additional requirements on the algorithm:

  • real-time operation: the video stream from the camera should not “lag” while the algorithm is running;
  • robustness to a wide range of contrast and background texture: low-contrast and high-contrast, homogeneous and heterogeneous backgrounds;
  • support for a wide range of smartphone models, including budget models released several years ago.

Finally, the project had no dataset for training machine learning algorithms, and there was no way to collect and label one. We only had a few test samples found via Google.

Given this problem statement, we decided to build on classical computer vision algorithms from the OpenCV library. An alternative was to use machine learning algorithms and neural networks, but it was discarded in the early stages of work because of the performance requirements: with that approach, real-time frame processing could not be guaranteed on all target devices.

General approach and algorithm structure


The main idea of the algorithm is a reference frame to which the user must align the document. It serves several purposes at once. First, it guarantees an image size sufficient for further processing of the document. Second, as we will see later, it can be used as one of the filters for candidate document borders. Third, it can be used to capture and crop the image if the borders of the document could not be found.



Fig. 1. The general structure of the algorithm

The general structure of the algorithm is shown in Fig. 1. Frames from the video stream are processed in a loop, with a timeout between iterations to maintain the desired FPS - we settled on 30 frames per second. This avoids “lag” and reduces the processor load and the power consumption of the device.
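A minimal sketch of such a loop is shown below. The processFrame() function and the shape of its result are our illustrative assumptions, not the project's actual code; each iteration measures its own duration and schedules the next one so that iterations stay at least 1000/30 ms apart.

const TARGET_FPS = 30;
const FRAME_INTERVAL_MS = 1000 / TARGET_FPS;

// Hypothetical per-frame entry point: runs one iteration of the algorithm.
declare function processFrame(video: HTMLVideoElement): { documentDetected: boolean };

function processingLoop(video: HTMLVideoElement): void {
	const started = performance.now();
	const result = processFrame(video);
	if (result.documentDetected) {
		return; // stop the loop and hand the cropped image over
	}
	// Schedule the next iteration so the loop does not exceed the target FPS.
	const elapsed = performance.now() - started;
	setTimeout(() => processingLoop(video), Math.max(0, FRAME_INTERVAL_MS - elapsed));
}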

Each frame undergoes preprocessing, during which two main operations are performed. First, a copy of the frame with a fixed size of 640x480 is created; all further steps of the algorithm work with this copy. The original image is kept as well - the detected document will be cut out of it, which preserves the quality of the final image. Second, the copy is converted to grayscale. The algorithm ignores the color of the document, since it can vary from country to country and even between regions within one country - US driver’s licenses are an example.
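A sketch of this preprocessing step might look as follows, assuming RGBA frames grabbed from a canvas (the function name is ours):

function preprocess(frame: cv.Mat): cv.Mat {
	// Downscale a working copy to the fixed 640x480 size.
	const small = new cv.Mat();
	cv.resize(frame, small, new cv.Size(640, 480), 0, 0, cv.INTER_AREA);
	// Convert the copy to grayscale: the document's color is ignored.
	const gray = new cv.Mat();
	cv.cvtColor(small, gray, cv.COLOR_RGBA2GRAY);
	small.delete();
	return gray;
}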

The first step in detecting a document is searching for a face in the image. This heuristic rules out business cards and other obviously irrelevant images. The search is performed with OpenCV's standard CascadeClassifier.detectMultiScale() and the pre-trained haarcascade_frontalface_default cascade. The minimum and maximum sizes of detected faces are limited, which reduces computational cost and additionally constrains the scale of the document in the image. A face is considered detected when it lies in the left - or, for passports, lower-left - part of the area inside the reference frame (Fig. 2). This is an additional measure to ensure the correct alignment of the document in the image.
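A sketch of this check, assuming the cascade file has been preloaded into the Emscripten file system (the size bounds, region shape, and function names are illustrative):

const faceCascade = new cv.CascadeClassifier();
faceCascade.load('haarcascade_frontalface_default.xml'); // file must already be in the wasm FS

function detectFaceInExpectedRegion(gray: cv.Mat, expected: cv.Rect): boolean {
	const faces = new cv.RectVector();
	// Bounding the admissible face size cuts computation and constrains document scale.
	const minFace = new cv.Size(60, 60);
	const maxFace = new cv.Size(240, 240);
	faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, minFace, maxFace);
	let found = false;
	for (let i = 0; i < faces.size(); i++) {
		const f = faces.get(i);
		const cx = f.x + f.width / 2;
		const cy = f.y + f.height / 2;
		// The face center must fall inside the expected area of the reference frame.
		if (cx >= expected.x && cx <= expected.x + expected.width &&
		    cy >= expected.y && cy <= expected.y + expected.height) {
			found = true;
			break;
		}
	}
	faces.delete();
	return found;
}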

The examples in this article do not contain personal data.



Fig. 2. The area of the expected position of the face in the image. The reference frame is shown in red, the borders of the expected face area are shown in green.

After face detection, we proceed to border detection. findContours() is often used here. However, this approach works well only in high-contrast cases, for example, for a sheet of paper lying on a dark desk. If the contrast is lower, the lighting is worse, or someone is holding the sheet in their hands and covering part of the border with their fingers, the detected contours break up into separate components, “lose” significant sections, or are not detected at all.

Therefore, we took a different approach. After binarization, we first pass the image through the Canny() edge filter, and then search the resulting picture for lines using the Hough transform, HoughLines(). The threshold parameter is set fairly high right away - equal to 30 - to filter out short and otherwise irrelevant segments.
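In OpenCV.js this pair of calls might look like this (the Canny thresholds are illustrative; the Hough threshold of 30 is the value mentioned above):

function findLines(gray: cv.Mat): cv.Mat {
	const edges = new cv.Mat();
	cv.Canny(gray, edges, 50, 150); // illustrative thresholds
	// Each row of the output Mat holds one (rho, theta) pair.
	const lines = new cv.Mat();
	cv.HoughLines(edges, lines, 1, Math.PI / 180, 30, 0, 0, 0, Math.PI);
	edges.delete();
	return lines;
}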

The resulting set of lines is filtered further, leaving only the lines close to the reference frame. To do this, we first convert the equations of the frame lines into points in the polar coordinate system (rho, theta) - theta is always 0 or pi/2, and rho is unique for each line. After that, we keep only those Hough lines that lie in the vicinity of these reference points - by the Euclidean metric, taking the difference in the scale of the values into account.
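A sketch of such filtering under our assumptions: the weight that brings theta (radians) to the scale of rho (pixels) and the distance threshold are illustrative, as is grouping by nearest frame side.

interface PolarLine { rho: number; theta: number; }

function filterLines(lines: cv.Mat, refLines: PolarLine[], maxDist: number): PolarLine[][] {
	const thetaWeight = 100; // illustrative: scales theta up to rho's range
	const groups: PolarLine[][] = refLines.map(() => []);
	for (let i = 0; i < lines.rows; i++) {
		const line = { rho: lines.data32F[i * 2], theta: lines.data32F[i * 2 + 1] };
		refLines.forEach((ref, k) => {
			// Euclidean distance in (rho, theta) with the scale difference compensated.
			const d = Math.hypot(line.rho - ref.rho, (line.theta - ref.theta) * thetaWeight);
			if (d < maxDist) groups[k].push(line); // keep lines near this frame side
		});
	}
	return groups; // one group of candidate lines per side of the reference frame
}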

We distribute the filtered lines into four groups corresponding to the four lines of the reference frame, find the pairwise intersections of lines from different groups, average them, and obtain the coordinates of four points - the corners of the detected document (Fig. 3).
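The intersection of two lines given in polar form follows from solving a 2x2 linear system; a sketch (PolarLine is the interface from the previous fragment):

// A line in polar form satisfies x*cos(theta) + y*sin(theta) = rho.
function intersect(a: PolarLine, b: PolarLine): { x: number; y: number } | null {
	const det = Math.cos(a.theta) * Math.sin(b.theta)
	          - Math.sin(a.theta) * Math.cos(b.theta);
	if (Math.abs(det) < 1e-9) return null; // near-parallel lines: no intersection
	const x = (a.rho * Math.sin(b.theta) - b.rho * Math.sin(a.theta)) / det;
	const y = (b.rho * Math.cos(a.theta) - a.rho * Math.cos(b.theta)) / det;
	return { x, y };
}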



Fig. 3. Filtering lines and determining document corners. Green lines show the filtering result, yellow dots are the detected corners of the document.

Next, we need to make sure of the frame quality. To do this, we verify that the frame has remained still for some time: we subtract the frame from the beginning of the period from the current one using absdiff() and compare the result with a threshold. Before subtraction, we additionally smooth both images with the Gaussian filter GaussianBlur() to reduce the influence of noise and other random factors. We also evaluate the focus of the frame by computing its Laplacian with Laplacian(), estimating its variance, and comparing the obtained value with a threshold.
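The stillness check might be sketched like this (the kernel size and the threshold are illustrative):

function isFrameStill(current: cv.Mat, periodStart: cv.Mat): boolean {
	const a = new cv.Mat();
	const b = new cv.Mat();
	// Smooth both frames to suppress sensor noise before comparison.
	cv.GaussianBlur(current, a, new cv.Size(5, 5), 0, 0, cv.BORDER_DEFAULT);
	cv.GaussianBlur(periodStart, b, new cv.Size(5, 5), 0, 0, cv.BORDER_DEFAULT);
	const diff = new cv.Mat();
	cv.absdiff(a, b, diff);
	// A mean absolute difference below the threshold means the scene is still.
	const meanDiff = cv.mean(diff)[0];
	a.delete(); b.delete(); diff.delete();
	return meanDiff < 2.0; // illustrative threshold
}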

If all the checks pass, we can proceed to the final part. We recalculate the detected corner coordinates into the coordinate system of the original, non-downscaled image and cut out the resulting region using the roi() method. The document has been detected successfully.
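A sketch of this final step under our assumptions (the corners were found on the 640x480 copy; the crop uses the bounding rectangle of the four corners):

function cropDocument(original: cv.Mat, corners: { x: number; y: number }[]): cv.Mat {
	// Per-axis scale factors from the 640x480 working copy back to the original.
	const sx = original.cols / 640;
	const sy = original.rows / 480;
	const xs = corners.map(c => c.x * sx);
	const ys = corners.map(c => c.y * sy);
	const rect = new cv.Rect(
		Math.max(0, Math.round(Math.min(...xs))),
		Math.max(0, Math.round(Math.min(...ys))),
		Math.round(Math.max(...xs) - Math.min(...xs)),
		Math.round(Math.max(...ys) - Math.min(...ys)));
	return original.roi(rect); // a view into the original, full-quality image
}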

Implementation Features


During development, the main components of the algorithm were prototyped in a Python script. After that, the algorithm was ported to opencv.js and JavaScript, and then to wasm. This approach was dictated by convenience at every stage. In Python, it was easier for our team to experiment with different variants of the algorithm and do rough parameter tuning. Porting to JavaScript made it possible to test the algorithm on the target platform, on various devices and browsers; based on the results of those checks, the parameters were fine-tuned. Finally, rewriting critical sections of the code in wasm gave us an additional performance boost.

During the migration, a number of differences in the OpenCV API were discovered, which resulted in minor changes to the implementation. For example, in Python the variance of the Laplacian is computed simply as Laplacian().var(). In OpenCV.js there is no NumPy, and no alternative implementation of the var() method is provided. The solution: obtain the standard deviation with the meanStdDev() function and square it (Listing 1).

private isImageBlurry(image: cv.Mat): boolean {
	// Apply the Laplacian operator to the frame.
	const laplacian = new cv.Mat();
	cv.Laplacian(image, laplacian, cv.CV_64F);
	// meanStdDev() yields the standard deviation; its square is the variance.
	const mean = new cv.Mat();
	const stdDev = new cv.Mat();
	cv.meanStdDev(laplacian, mean, stdDev);
	const variance = Math.pow(stdDev.data64F[0], 2);
	// OpenCV.js Mats live in wasm memory and must be freed manually.
	laplacian.delete(); mean.delete(); stdDev.delete();
	return variance < this.laplacianVarianceThreshold;
}

Listing 1. Assessing image focus via the variance of the Laplacian in OpenCV.js (TypeScript)

Another feature was the need to reduce the size of the library. In its original form, OpenCV.js weighs 7.9 MB, and downloading it over the network slows down the initialization of the algorithm. The solution is to “trim” unused modules when building the library, which can significantly reduce the size of the output file: we managed to get it down to 1.8 MB. The list of components included in the build is configured in the file platforms/js/opencv_js.config.py (Listing 2).

white_list = makeWhiteList([core, imgproc, objdetect, video, dnn, features2d, photo, aruco, calib3d])

Listing 2. The default white list of OpenCV modules included in the JavaScript build

Finally, an important contribution to the required performance came from moving the algorithm into a Web Worker. This step, together with the FPS cap, allowed us to get rid of video stream “lag” while the algorithm runs, which had a positive effect on the UX.
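The main-thread side of such a setup might be sketched as follows (the worker file name and message shapes are our assumptions):

const worker = new Worker('detector.worker.js'); // hypothetical worker that runs OpenCV.js

function sendFrame(video: HTMLVideoElement, canvas: HTMLCanvasElement): void {
	const ctx = canvas.getContext('2d')!;
	ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
	const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
	// Transfer the pixel buffer instead of copying it between threads.
	worker.postMessage({ type: 'frame', frame }, [frame.data.buffer]);
}

worker.onmessage = (e: MessageEvent) => {
	if (e.data.type === 'document') {
		// e.data.image holds the cropped document; the video stream never blocks.
	}
};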

Results


Examples of capturing and cropping images are shown in Fig. 4. The highest-quality cropping is achieved on a dark uniform background, and the lowest quality on a light inhomogeneous background. This is an expected effect, related to the gradients produced by different backgrounds and used to detect the borders of the document. Against a dark background the gradients are larger than against a light one, and a uniform background gives less variability in gradient values. This leads to reliable border detection and, as a result, better cropping.




Fig. 4. Examples of cropping documents using an algorithm

Conclusion


This article has presented an algorithm for detecting documents in frames of a video stream that is suitable for use in mobile browsers, and discussed the features of its implementation with the opencv.js library. The algorithm produces output images of documents in a quality sufficient for further use by authentication, identity verification, and other algorithms. The speed of the resulting implementation provides a “smooth” UX without “lag” or frame loss.

Thank you for your attention! We hope you find this article useful.
