Using ML Algorithms to Classify Multipage Documents: VTB Experience

As part of their corporate lending pipelines, banks request originals of various documents from companies. Scans of these documents often arrive as a single multi-page file, a "stream". For convenience, streams need to be segmented into separate documents (single-page or multi-page) and classified. Below, we discuss how we applied machine learning algorithms to classify already segmented documents.



The type of a document is determined by both textual and visual information. For example, a passport or a work record book is easy to distinguish visually, without analyzing the text inside. Moreover, text recognition quality on such documents is rather low unless specialized solutions are used, so the visual component carries much more relevant information for classification. A lease agreement and a company charter, on the other hand, may look visually similar, while the textual information they contain differs significantly. As a result, the document classification task comes down to a data fusion model that must combine two sources of unstructured data: the visual representation of the document and the results of text recognition.

Note that in banking, document classification is also used in retail lending pipelines on scans or photographs of documents, for sorting accumulated document archives, for filtering customer reviews to improve service quality, for sorting payment documents, for additional filtering of news feeds, and so on.

BERT Model


To solve our problem, we used the BERT (Bidirectional Encoder Representations from Transformers) model, a language model based on a multilayer bidirectional Transformer encoder. The Transformer receives a sequence of tokens (codes of words or word parts) as input and, after internal transformations, produces an encoded representation of this sequence: a set of embeddings. These embeddings can then be used to solve various downstream tasks.


Transformer Model Architecture

In more detail: the input token sequence is summed with positional encodings of the tokens and encodings of the segments (sentences) in which the tokens occur. For each input sequence, the Transformer generates a context-sensitive representation (a set of embeddings for the entire sequence) based on the adaptive "attention" mechanism. Each output embedding encodes the "attention" of some tokens relative to others.


When encoding the word "it", part of the "attention" mechanism focuses on "The Animal" and bakes part of its representation into the encoding of "it" (from The Illustrated Transformer blog)

The BERT model is built in two stages: pre-training and fine-tuning. During pre-training, the model solves two tasks: MLM (Masked Language Model) and NSP (Next Sentence Prediction). In the MLM task, a certain fraction of tokens in the input sequence is randomly masked, and the task is to restore the values of the masked tokens. The NSP task is binary classification on pairs of sentences: predict whether the second sentence is a logical continuation of the first. During fine-tuning, the pre-trained Transformer is retrained on specific downstream tasks. Transformer-based fine-tuning has proven itself in many NLP (Natural Language Processing) tasks: chatbots, machine translation, text analysis, and others.


Transformer scheme for an automatic French-to-English translator (from The Illustrated Transformer blog)
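The MLM masking used in pre-training can be sketched in a few lines. This is an illustrative sketch only: the [MASK] token id below is a made-up placeholder, and the masking rate and substitution rules in real pipelines follow the BERT recipe.

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id; the real id depends on the vocabulary

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace a random fraction of token ids with the [MASK] id (MLM pre-training)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tid)   # the model must recover this original token
        else:
            masked.append(tid)
            labels.append(-100)  # a common convention for "ignore in the loss"
    return masked, labels
```

The loss during pre-training is then computed only at the masked positions, which is why unmasked positions carry the ignore label.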

Before the BERT model appeared, page-scan classification relied on: convolutional features from images (obtained with convolutional neural networks, CNNs), frequency-based text features (TF-IDF), topic-based text features (LDA topics), convolutional text features (1-D convolutions), word embeddings (Word2Vec, GloVe), and their combinations.

These earlier methods already give good quality. But the closer quality gets to its maximum, the harder it becomes to improve. As we show later, when we already had quality close to the maximum, the BERT model helped push it even higher.

Since we work mainly with Russian texts, we used a BERT model pre-trained on several corpora of Russian texts (RuBERT, Russian, cased from DeepPavlov).

Our dataset 


Description


The document collection on which we solved the classification problem consists of scans of corporate documents accumulated by VTB Bank over many years. Multi-page corporate documents were segmented semi-automatically from the scanned stream, and their pages were classified using commercial solutions.

Most scans are black and white, and a small proportion are color (mainly because of color stamps and seals).

Business unit customers identified 10 main categories of documents (about 30,000 already segmented multi-page documents, ~129,000 pages). The documents had to be cleaned manually because of segmentation errors. We also introduced an "Other" category, combining all remaining, less significant document categories (about 300 categories, ~43,000 already segmented multi-page documents, ~128,000 pages). As a result, we build a classifier with 11 classes. We also added about 18,000 images from the ImageNet dataset to the "Other" category (as a foolproofing measure).

The main 10 categories are:

  1. Lease contract
  2. Extract from the register of participants
  3. Company Charter
  4. Certificate of registration with the tax authority
  5. Questionnaire for legal entities
  6. Russian passport
  7. Incorporation sheet
  8. Certificate of state registration of legal entity
  9. Orders / Directives
  10. Decisions / Protocols

The "Other" category included various other identity documents (foreign passports, migration cards, etc.), other certificates, individual entrepreneur questionnaires, statements, acts, powers of attorney, questionnaires, arbitration court decisions, ImageNet images, and more.
About 81% of the already segmented multi-page documents went into train, 9% into dev, and 10% into test. For the purity of the experiment, the split was made so that the pages of any segmented multi-page document fell entirely into one part: either train, dev, or test.
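Splitting so that all pages of a document land in the same part can be done by splitting at the document level first. A minimal pure-Python sketch (the 81/9/10 fractions follow the article; the function name and interface are our own):

```python
import random

def split_by_document(page_doc_ids, fractions=(0.81, 0.09, 0.10), seed=42):
    """Assign every page of a document to the same split (train/dev/test).

    page_doc_ids: list mapping each page to the id of its document."""
    docs = sorted(set(page_doc_ids))
    random.Random(seed).shuffle(docs)
    n_train = int(fractions[0] * len(docs))
    n_dev = int(fractions[1] * len(docs))
    train = set(docs[:n_train])
    dev = set(docs[n_train:n_train + n_dev])
    # every page inherits the split of its parent document
    return ["train" if d in train else "dev" if d in dev else "test"
            for d in page_doc_ids]
```

Because the split is decided per document id, no document's pages can leak across train, dev, and test.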

Certified and Stitched Pages


Corporate clients often provide not originals but copies of documents, many of which are certified by a notary or by company executives. In addition, multi-page documents are often stitched together, with the stitching date written on them, and certified again on the last page.

As a result, our dataset contains many multi-page documents whose last scan (page) carries seals, dates, and other information related to the stitching or the certifying parties, but not to the class of the document. Below are the last pages of two different multi-page documents segmented from the stream; they are almost impossible to classify correctly without looking at the remaining pages.


Identical last pages of documents of various classes

Scan quality


Although documents are usually scanned at bank offices (on good copying equipment), customers often bring copies that have been re-scanned repeatedly. The quality of such copies suffers greatly: the scans contain a lot of noise and artifacts, which can come from poor toner quality, from holograms and textures present on many documents, and for other reasons.

Orientation


The dataset contains many documents with the wrong scan orientation; this is especially true for ID cards and text documents created in landscape mode. Mostly, though, the text is rotated by a multiple of 90 degrees (±5 degrees). When extracting text, we additionally determined the "correct" orientation of the image so that most of the text was oriented vertically.

Baseline


Since scanning usually starts from the first page, the first page typically contains enough information to determine the class, and many multi-page documents are well distinguished by their first page alone.

Therefore, we will build our baseline classifier for multi-page documents only on their first pages. 

Note that although we do not consider the problem of segmenting multi-page streams (PSS, Page Stream Segmentation) in this article, if we trained our classifier on all pages of the documents rather than only the first, we could easily turn it into a PSS solution via binary classification: for each page in the stream, predict one of two classes, "new document" or "same document".

Preprocessing


Since many scan images are large, which affects processing speed, we initially compress all scans so that both image dimensions (width and height) are at most 2000 pixels.
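The resizing step can be sketched with Pillow. The function and constant names are our own, and the choice of the LANCZOS resampling filter is an assumption:

```python
from PIL import Image

MAX_SIDE = 2000  # both width and height capped at 2000 px, as in the article

def shrink_scan(img: Image.Image, max_side: int = MAX_SIDE) -> Image.Image:
    """Downscale an image so that neither side exceeds max_side, keeping aspect ratio."""
    w, h = img.size
    scale = max(w, h) / max_side
    if scale <= 1:
        return img  # already small enough, leave untouched
    return img.resize((int(w / scale), int(h / scale)), Image.LANCZOS)
```

Capping the longer side (rather than resizing to a fixed square) preserves the page's aspect ratio, so text lines are not distorted before OCR.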

To extract text from the images, we used the free Tesseract 4.0 package from Google. Version 4.0 (and higher) handles noise fairly well (unlike earlier versions), so we did not denoise the texts; we only determined the "correct" orientation before extracting text from a scan image, for which we also used dedicated functions in Tesseract 4.0.

Convolutional classifier in pictures


From each document we obtained convolutional features using a pre-trained convolutional neural network (ResNet34). For this, we took the outputs of the last convolutional layer: a vector of 512 convolutional features. Before being run through the network, the train scan images underwent some augmentation to counter overfitting.
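After the last convolutional block, ResNet-style networks collapse each of the 512 channel maps into a single number via global average pooling, which is where the 512-dimensional feature vector comes from. A numpy sketch, with random activations standing in for real ResNet34 outputs:

```python
import numpy as np

def global_avg_pool(feature_map: np.ndarray) -> np.ndarray:
    """Collapse a (channels, H, W) conv feature map into one value per channel."""
    return feature_map.mean(axis=(1, 2))

# stand-in for the last conv block's output of ResNet34 (512 channels on a 7x7 grid)
fmap = np.random.default_rng(0).standard_normal((512, 7, 7))
vec = global_avg_pool(fmap)  # 512-dimensional document-page descriptor
```

The resulting per-page vectors are then fed to a classical classifier (logistic regression or boosting), as described below.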

As the classifier on convolutional features, we tried logistic regression and gradient boosting, with parameters tuned on dev.

The best convolutional classifier, logistic regression, reached about 76.1% accuracy on test.

This approach let us classify scans that look clearly different from one another. But to be run through the network, images were compressed to its input size (ResNet34 takes 224x224 pixels), which keeps classification quality low: the fine print of documents becomes unreadable, and the classifier can only "catch on" to convolutional features derived from large fonts, objects with a distinctive layout on the page, and so on. Such a classifier does not account for the substance of the text.


A scan of the first page of a lease agreement and the first page of a company charter are visually well distinguishable

But we are solving a corporate document classification problem, where many document types contain mostly textual information and are hard to distinguish visually: there is little to "catch on" to beyond the "elongated blobs" of text lines under identical document headers:


Reduced copies of certificate scans from two different categories are visually almost indistinguishable

We assumed that text features would improve quality, so we add them; more precisely, we build a text classifier for the baseline model.

Text classifier


For the baseline model, we build a text classifier only on TF-IDF (Term Frequency - Inverse Document Frequency) features of the texts extracted from the scans. Before building the TF-IDF term matrix, the texts were lowercased; punctuation and stop words were removed; words were spell-checked and reduced to their base form by lemmatization (the Pymystem3 package).

As the classifier, we again tried logistic regression and gradient boosting, with parameters tuned on dev. Since the term matrices are large and very sparse, logistic regression performed well, reaching 85.4% accuracy on test.
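The TF-IDF baseline corresponds to a standard vectorizer-plus-logistic-regression pipeline. A toy sketch with invented stand-ins for the lemmatized page texts (the real preprocessing, as described above, also includes spell-checking, stop-word removal, and lemmatization):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# invented stand-ins for lemmatized texts extracted from first-page scans
texts = [
    "lease agreement premises rent term",
    "charter company general meeting shareholders",
    "lease rent payment premises",
    "charter authorized capital shareholders",
]
labels = ["lease", "charter", "lease", "charter"]

# TF-IDF term matrix feeding a logistic regression, as in the baseline
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["rent of premises under lease"])
```

Sparse TF-IDF matrices suit linear models well, which matches the article's observation that logistic regression outperformed boosting here.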

Ensemble of classifiers


To build the ensemble, we blended the convolutional and text classifiers with weights selected on the dev sample. That is, for each scan S we take, with weight α, the probability vector Y_CNN (11-dimensional, one entry per category) produced by the convolutional classifier; we likewise take the 11-dimensional probability vector Y_TF-IDF produced by the text classifier, with weight 1 - α, and sum these weighted vectors to get the output of the blended baseline classifier:

Y_CNN+TF-IDF(S) = α Y_CNN + (1 - α) Y_TF-IDF

As a result, the blended classifier reached 90.2% accuracy on test.
Classifier results: convolutional (Y_CNN), text based on TF-IDF (Y_TF-IDF), and their ensemble (Y_CNN+TF-IDF):

  • Y_CNN - 76.1%
  • Y_TF-IDF - 85.4%
  • Y_CNN+TF-IDF - 90.2%
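The blending formula and the selection of the weight on the dev sample can be sketched as follows (pick_alpha and its grid of candidate weights are our own illustrative choices):

```python
import numpy as np

def blend(p_cnn: np.ndarray, p_tfidf: np.ndarray, alpha: float) -> np.ndarray:
    """Y = alpha * Y_CNN + (1 - alpha) * Y_TF-IDF; rows are per-scan class probabilities."""
    return alpha * p_cnn + (1 - alpha) * p_tfidf

def pick_alpha(p_cnn, p_tfidf, y_dev, grid=np.linspace(0, 1, 21)):
    """Choose the blend weight that maximizes accuracy on the dev sample."""
    accs = [(blend(p_cnn, p_tfidf, a).argmax(axis=1) == y_dev).mean() for a in grid]
    return grid[int(np.argmax(accs))]
```

Since both inputs are probability vectors, any convex combination of them is again a valid probability vector, so the blend can be argmax-ed directly.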

Two-step classification


When analyzing the results of the ensemble, it turned out that it often makes mistakes on scans from the "Russian passport" category, classifying passports as "Other", since that category contains many identity documents. Moreover, their scans, like passport scans, are often of poor quality, which hinders accurate classification.
Therefore, we decided to carry out the classification in two steps.

Step 1


We moved all identity documents from the "Other" category into the "Passport of the Russian Federation" category, in accordance with the initial split into train, dev, and test.

The main 10 categories:

  1. Lease contract
  2. Extract from the register of participants
  3. Company Charter
  4. Certificate of registration with the tax authority
  5. Questionnaire for legal entities
  6. Passport of the Russian Federation + various ID cards (foreign passports, migration cards, etc.)
  7. Incorporation sheet
  8. Certificate of state registration of legal entity
  9. Orders / Directives
  10. Decisions / Protocols

Category "Other":

  • Other certificates
  • Individual entrepreneur questionnaires
  • Statements
  • Acts
  • Power of attorney
  • Questionnaires
  • Decisions of the arbitration court, etc.

We trained an ensemble of classifiers on this modified sample. 

Step 2 


As the second step, we ran a binary classification within category 6 obtained in the first step: "Passport of the Russian Federation" (class 1) versus "various identity documents" (class 0). To do this, we analogously trained convolutional and text classifiers (logistic regression in both models) and weighted their outputs to obtain an ensemble.
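The two-step routing amounts to invoking the binary passport-vs-ID model only when step 1 predicts category 6. A schematic sketch with stub classifiers (all names and return values are illustrative, not the article's actual interfaces):

```python
def two_step_predict(scan, step1, step2, passport_class=6):
    """Step 1: 11-way classifier with ID documents merged into class 6.
    Step 2: binary classifier separating RF passports (1) from other IDs (0)."""
    label = step1(scan)
    if label != passport_class:
        return label  # steps other than category 6 are final
    return "passport_rf" if step2(scan) == 1 else "other_id"
```

Only scans routed into the merged category pay the cost of the second model, and each model sees a more homogeneous class structure.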

The overall two-step classification quality reached 95.7% accuracy on test. This quality meets the requirements of our business customers (threshold: 95%).

BERT Features


We built a two-step classification similar to the one above, but at each step, instead of TF-IDF features, we used text embeddings of pages obtained from the RuBERT model. For each page, the text was tokenized, and the sequence of the first 256 tokens was fed to the input of the RuBERT model (padded with [PAD] up to 512, i.e., to the model's input size). 

For greater efficiency, before computing text embeddings we further pre-trained the model on the Masked Language Model (MLM) task using the texts from our dataset, similar to what the authors of BERT did: when feeding a token sequence to the RuBERT input, we replaced a certain fraction of the tokens with the [MASK] token. For the purity of the experiment, this pre-training was carried out only on texts from train. Token sequences were taken from all pages of the segmented documents, not just the first, and the starting point of each sequence was chosen at random within the tokenized text.
At the embedding stage, the average vector of the RuBERT model's outputs was taken as the text embedding of the page.
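Averaging the encoder outputs into a single page vector is simple mean pooling. A numpy sketch, with random numbers standing in for RuBERT hidden states (the 256x768 shape matches the 256-token input and the BERT-base hidden size):

```python
import numpy as np

def page_embedding(token_outputs: np.ndarray) -> np.ndarray:
    """Average per-token encoder outputs into one page-level vector."""
    return token_outputs.mean(axis=0)

# stand-in for RuBERT outputs: 256 tokens, each a 768-dimensional hidden state
outs = np.random.default_rng(1).standard_normal((256, 768))
emb = page_embedding(outs)  # one 768-dimensional embedding per page
```

These fixed-size page embeddings then replace the TF-IDF features in the same ensemble scheme.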

Pre-training improved the two-step classification: with text embeddings from the RuBERT model, quality rose to 96.3% accuracy on test. Note that the closer accuracy is to 100%, the harder it is to improve. So the resulting gain of 0.6 percentage points can be considered significant.

Increasing the input token sequence length to 512 (the BERT model's input size) did not produce a noticeable gain.

What we got


The final scheme of the model:



The quality of all tested models:

  • Y_CNN - 76.1%
  • Y_TF-IDF - 85.4%
  • Y_CNN+TF-IDF - 90.2%
  • Y_CNN+TF-IDF+2steps - 95.7%
  • Y_CNN+RuBERT+2steps - 96.3%

where Y_CNN is the convolutional classifier and Y_TF-IDF is the text classifier on TF-IDF features. 

Y_CNN+TF-IDF is the ensemble of classifiers (Y_CNN+TF-IDF(S) = α Y_CNN + (1 - α) Y_TF-IDF, α = 0.45).

Y_CNN+TF-IDF+2steps is the two-step classification: 1) identity documents are moved into the "Passports of the Russian Federation + ID documents" category, and an ensemble of classifiers is built on the resulting 11-class sample; 2) within that category, a two-class ensemble is built: class 1, Passport of the Russian Federation; class 0, ID documents.

Y_CNN+RuBERT+2steps is the same two-step classification, with text embeddings from the RuBERT model pre-trained on our dataset taken instead of TF-IDF features.
