Automated detection of modifications in contract document images using an N-gram model



Everyone knows that you should sign a document only after you have read it. Those who violate this simple rule are sometimes surprised by consequences that could have been avoided by examining the document before signing, including whatever is written in small print. Tricks hidden in service providers' contracts are a staple of jokes and movies. For example, in the movie Bedazzled, the protagonist terminated a very unfavorable deal with the devil, despite not knowing the termination terms described in article 147, paragraph 3, part 3 of the contract. A similar situation is sometimes possible in real life with service providers. On the Internet you can find descriptions of curious cases where a bank client changed the terms of an agreement in his favor, and this came as a surprise to the bank. In today's article, we will talk about an algorithm, extremely useful for banks and other credit organizations, that automatically detects modifications made to images of contract documents. Details under the cut!

Nowadays, many organizations that attract a large number of new customers offer to download a contract template from their website so that the customer can prepare the contract on their own. The printed, completed, and signed contract is then submitted to the second party for signing. Naturally, the second party checks the contracts prepared by potential customers, for example, by manually examining the submitted documents. The number of such documents can be quite large, so several attentive employees are involved in checking them. However, when checking several dozen documents with identical meaning per day, even a careful employee can miss errors. This explains the cases of fraud that go undetected during manual verification.

We will talk about automating the routine verification of the large flow of contract documents described above using optical recognition technologies, which have been the subject of Smart Engines articles on Habr more than once (for example, once and twice).

Contract documents belong to the class of business documents that are created for circulation in office and document management systems. A distinctive feature of business documents is the limited vocabulary they use and the way they are laid out. This stems from the desire to unify document forms so as to simplify the understanding of business documents, first of all by a person.

The template, or form, of the document is described in advance and consists of static text and fields for entering information. Consider two common classes of templates: fixed templates and flexible templates. A fixed template does not allow modification of the static text, for example, when the PDF format is used. A flexible template allows modification of the static text, for example, templates in Microsoft Office formats. Accordingly, we will distinguish between fixed and flexible documents.

There are known methods for automated comparison of an image (scan or photo) of a signed document with its prototype [1]. They check for the following possible modifications of the content:

  • replacement of one or more characters in a word;
  • replacement of one word with another;
  • addition of a character, word, or group of words;
  • deletion of a character, word, or group of words.

Modifications to the design of the document are also possible:

  • changing the style of words (size, typeface, weight);
  • changing the margins of the document;
  • changing the number of paragraphs;
  • changing the fields.

Modification of a fixed template is intentional falsification, because there is no other way to explain the desire to change the protected text. Modification of a flexible template can be a falsification, an accidental typo, or the result of improved formatting.

Next, we describe models and methods for detecting fraud in copies of business documents printed using both fixed and flexible templates.

The basis for comparing the test image (copy) and the reference image (original) is the images of words found by some method. The image of a word is represented by a description (descriptor); the most obvious descriptor is the recognized characters of the word. A word W is defined as a text feature point W = (T(W), B(W)), where T(W) is the core of the text feature point, that is, a sequence of characters of the word drawn from a particular alphabet, or a sequence of character cells with confidence ratings for the alphabet characters; B(W) is the frame of the text feature point, consisting of the border coordinates B_x1(W), B_y1(W), B_x2(W), B_y2(W), which can be normalized to a certain range. In addition, F(W) denotes the attributes of the text feature point (for example, typeface and font modification).
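For concreteness, a text feature point can be modeled with a small data structure. This is a minimal sketch; all names (TextFeaturePoint, core, frame, features) are illustrative and not taken from any real library:

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class TextFeaturePoint:
        """A word W = (T(W), B(W)) extracted from a document image."""
        core: str                                  # T(W): recognized character sequence
        frame: Tuple[float, float, float, float]   # B(W): (x1, y1, x2, y2), possibly normalized
        features: Dict[str, object] = field(default_factory=dict)  # F(W): e.g. typeface, boldness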

A text feature point is similar to a “graphic” feature point in an image, that is, a point that satisfies several conditions:

  • it differs from the points in its neighborhood;
  • noise immunity;
  • resistance to certain transformations (for example, to affine transformations or scaling) [2].

The properties of feature points are:

  • repeatability – a point found in one image of a scene should be found in the same place in another image of the same scene;
  • distinctiveness/informativeness – the neighborhoods of points should differ enough for the points to be distinguished and matched;
  • locality – a point should occupy a small area of the image, which reduces sensitivity to occlusions;
  • quantity – the number of detected points should be large enough to detect even small objects;
  • accuracy – detected points should be localized accurately, both in position and in scale;
  • efficiency – point detection should be fast enough for time-critical applications.

It is assumed that a text feature point differs from the neighboring text feature points in its vicinity. If by neighborhood we mean a text line, then most words in business documents differ from their neighbors in the line. Several identical words placed on the same line will not be text feature points. However, if by neighborhood we mean one or two adjacent words, then two identical words placed on the same line and distinguished by their neighboring words will be text feature points. Feature points are compared using a similarity measure d, which should take values close to zero when comparing two points corresponding to the same place in the image, and large values when comparing points from different places in the image. In this paper, two cores of text feature points are compared using the Levenshtein distance ρ_Lev [3] and its modifications. A threshold d(W) for comparing the word T(W) with other words is calculated in advance. If ρ_Lev(W, W_r) < d(W), then the word W_r and the text feature point W are considered identical; otherwise, they are different.
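The word comparison just described can be sketched in a few lines of Python. A textbook dynamic-programming Levenshtein implementation is used here; the per-word threshold d(W) is assumed to be computed in advance:

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between strings a and b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def same_point(reference_word: str, recognized_word: str, threshold: int) -> bool:
        """Words are considered identical if rho_Lev(W, W_r) < d(W)."""
        return levenshtein(reference_word, recognized_word) < threshold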

A feature point descriptor is an identifier used when matching feature points. The descriptor is expected to be invariant with respect to image transformations when feature points are matched.

The procedure for extracting feature points from an image is called a detector. The detector of text feature points is a recognition procedure using some OCR engine that extracts feature point descriptors from a document image. The feature point properties listed above hold for text feature points to the extent that modern OCR can compensate for various types of image distortion. The uniqueness of text feature point descriptors is determined by the structure of documents (the unambiguous splitting of a document into constellations: sections, paragraphs, and lines) and by the properties of natural language (two adjacent words rarely coincide within a document). Various relations between text feature points (above/below and left/right relations, or the geometric distance between frames) allow points to be combined into constellations using clustering algorithms.

Ideally, OCR extracts all text feature points from the copy image and the document template without errors. This makes it possible to form constellations, in particular lines. Comparing a copy with a reference consists in establishing an unambiguous correspondence between all or some of the text feature points of the reference and the set of text feature points of the copy. The process of establishing correspondence between points or constellations of points is called coordination.

Coordination of fixed documents includes:

  • for each point of the reference, searching for a corresponding point in the copy;
  • for each point of the copy, searching for a corresponding point in the reference;
  • for each static line of the reference, searching for corresponding points in the copy;
  • for each static line of the copy, searching for corresponding points in the reference;
  • verifying the identity of the images in each coordinated pair.

Any inconsistency found is a potential modification. Of course, an inconsistency may also be caused by detector (OCR) errors or by distortions of the document image. The problem statement is to find all modifications in the copy of the document.
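A rough sketch of the first two coordination steps, matching points of the reference with points of the copy. The greedy nearest-candidate strategy below is purely illustrative (the article does not specify the matching algorithm) and reuses levenshtein() from the previous sketch:

    def coordinate(reference, copy, thresholds):
        """Match reference points to copy points; leftovers are potential modifications."""
        unmatched_copy = set(range(len(copy)))
        pairs, unmatched_ref = [], []
        for i, ref_pt in enumerate(reference):
            best_j, best_d = None, thresholds[i]   # per-word threshold d(W)
            for j in unmatched_copy:
                d = levenshtein(ref_pt.core, copy[j].core)
                if d < best_d:
                    best_j, best_d = j, d
            if best_j is None:
                unmatched_ref.append(i)            # possible deletion or heavy modification
            else:
                pairs.append((i, best_j))
                unmatched_copy.discard(best_j)
        # points remaining in unmatched_copy are possible insertions
        return pairs, unmatched_ref, sorted(unmatched_copy)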

Coordination of flexible documents involves establishing a correspondence between all words of the static text. However, unlike fixed documents, no correspondence between lines of the static text of a flexible document is assumed. In flexible documents, legitimate changes that do not alter the meaning of the text are possible, such as a change of font, a change of line boundaries, or line wrapping. Such modifications can push lines onto another page, so multi-page flexible documents should be compared across the entire sequence of pages.

In the general case, without knowledge of the document structure, all words of the test and reference documents must be coordinated. A definite disadvantage of full word coordination is that inevitable recognition errors, especially in photos (see the example of a distorted text image fragment in the figure below), are interpreted as modifications. The person responsible for verification would be forced to spend extra time checking false modifications.



With full coordination of the words of the copy and the original, besides false recognition errors, there may be other insignificant differences. The point is that, from the point of view of the user of a copy-to-original comparison program, not all words are equally valuable. What is valuable is the subset of words on a document page that determines the essential terms of the contract. It is assumed that the fraudster's task is to make modifications that, in court or in pre-trial proceedings, could cause damage to the organization that signed the contract with the fraudster. It is hardly possible to give a formal definition of such significant words; they are determined by experts. Moreover, some words become significant only in combination with neighboring words. For example, the particle “not” in combination with the adjacent word “guarantees” is significant. Modifying the word “contract” into the word “non-contract” is insignificant, since in court proceedings it cannot benefit a fraudster.

Thus, another formulation of the problem is possible, one that uses knowledge of both the document structure and the placement of the words essential for verification. In this formulation, the document model consists of paragraphs and text lines. Each text line and each paragraph is represented by a set of text feature points whose sequence is unique for the given paragraph or line. Lines and paragraphs may also contain words that are not unique, that is, repeated or even located next to each other. In special cases, the distances between unique words may be known, determined by the number of intermediate characters or by the geometric distance between the word images.

A simple N-gram word model has proven effective. The N-gram model is used in various tasks, such as text compression or coding. In natural language text processing, N-grams are useful for finding and correcting errors (we have already written about this before).

To search for keywords, word N-grams of the following forms are used:

n_2(w_i) = (w_i, r_1(w_i))
n_3(w_i) = (w_i, r_1(w_i), r_2(w_i))
n_2(w_i) = (l_1(w_i), w_i)
n_3(w_i) = (l_1(w_i), w_i, r_1(w_i))
n_4(w_i) = (l_1(w_i), w_i, r_1(w_i), r_2(w_i))
n_3(w_i) = (l_2(w_i), l_1(w_i), w_i)
n_4(w_i) = (l_2(w_i), l_1(w_i), w_i, r_1(w_i))
n_5(w_i) = (l_2(w_i), l_1(w_i), w_i, r_1(w_i), r_2(w_i))

where r_k(w_i) and l_q(w_i) are the words to the right and to the left of the central word w_i. The allowable distances ρ_BT(w_i, r_1(w_i)), ρ_BT(r_1(w_i), r_2(w_i)), ρ_BT(l_1(w_i), w_i), ρ_BT(l_2(w_i), l_1(w_i)) between adjacent words are also known. The index k in the N-gram notation n_k(w_i) is called the length of the N-gram.
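Forming such N-grams is straightforward. In this sketch the shape of an N-gram is encoded by a tuple of offsets relative to the central word w_i (the example sentence is, of course, invented):

    def ngram(words, i, offsets):
        """Words at the given offsets around center i, or None near paragraph edges."""
        if any(not 0 <= i + off < len(words) for off in offsets):
            return None
        return tuple(words[i + off] for off in offsets)

    words = "the contractor does not guarantee uninterrupted service".split()
    print(ngram(words, 3, (-1, 0, 1)))   # n_3: ('does', 'not', 'guarantee')
    print(ngram(words, 3, (0, 1, 2)))    # n_3: ('not', 'guarantee', 'uninterrupted')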

A paragraph model consists of an ordered sequence of N-grams n_1(w_1), n_2(w_2), …, n_m(w_m) with predefined word tuples n_i(w_i) and known distances between pairs {n_(j-1)(w_(j-1)), n_j(w_j)}. Note that some N-grams are unique within a paragraph, while others may be repeated. To ensure uniqueness, N-grams of various lengths can be used: bigrams, trigrams, tetragrams, and pentagrams.

When building a paragraph model, N-grams are formed so as to maximize the number of unique N-grams. Compared with individual keywords, the use of N-grams ensures uniqueness for most paragraphs of contract documents, primarily because of the significant vocabulary limitation of static text mentioned above.

It makes sense to train and optimize the parameters on real datasets. Note that even in real datasets we will not see actual modifications, first of all because dataset owners keep such data confidential. We had to introduce the modifications ourselves.

The trigram search algorithm comes down to selecting several consecutive words. Of course, a set of text feature points must be formed first. To do this, we took the following steps (a rough sketch in code follows the list):

  • conversion to grayscale (MinImage library);
  • angle normalization (deskewing) of the image using methods based on the fast Hough transform [4] (Smart IDReader API);
  • extraction of word boundaries using the morphological operations of erosion and dilation (MinImage library);
  • recognition of characters within the boundaries of the found words (Smart IDReader API).
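The MinImage library and the Smart IDReader API are proprietary, so here is a rough equivalent of the grayscale and morphology steps using OpenCV (deskewing and recognition are omitted; the kernel size is a tunable assumption that depends on resolution and font size):

    import cv2

    img = cv2.imread("contract_page.png", cv2.IMREAD_GRAYSCALE)   # grayscale image
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Dilation with a wide kernel merges the characters of one word into a single blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blobs = cv2.dilate(binary, kernel, iterations=1)

    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    word_boxes = [cv2.boundingRect(c) for c in contours]          # (x, y, w, h) per word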

The paragraph was represented as one long line.

Ideal (reference) words and recognized words of a paragraph were compared using a modified Levenshtein distance. Algorithms for computing the Levenshtein distance are well known; they find not only the number of edit operations but also the operations themselves.

The modified Levenshtein distance worked as follows. First, an individual threshold was chosen for comparing each particular word with other words. To refuse identification of pairs of distinct words that differ in only one or two characters, and for identifiers of the form “IDENTIFIER196”, “IDENTIFIER296”, “IDENTIFIER199”, another rule was applied: for such words, segments were designated that had to remain unchanged. That is, at the beginning of a word of the form “IDENTIFIERddd” a large number of errors was allowed, but identification was forbidden if edit operations were found in the last 3 characters of the word.

Another modification compensated for OCR substitutions of some characters with similar-looking ones. Formally, for the Latin alphabet, substitutions such as B↔8, D↔O, or 1↔I are errors; however, reducing the cost of such substitutions can improve the accuracy of word identification. The substitution costs for characters with similar shapes were chosen during training.
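Both modifications can be sketched as follows. The cost 0.25 for look-alike substitutions is an illustrative placeholder (in the article the costs are learned during training), and the protected suffix is checked here by simple equality, a simplified stand-in for inspecting the edit script itself:

    LOOKALIKES = {frozenset(p) for p in (("B", "8"), ("D", "O"), ("1", "I"))}

    def sub_cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        return 0.25 if frozenset((a, b)) in LOOKALIKES else 1.0   # cheap look-alike swap

    def weighted_levenshtein(a: str, b: str) -> float:
        prev = [float(j) for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            cur = [float(i)]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0,
                               prev[j - 1] + sub_cost(ca, cb)))
            prev = cur
        return prev[-1]

    def identify(reference: str, recognized: str, threshold: float,
                 protected_tail: int = 0) -> bool:
        """Reject identification outright if the protected suffix differs at all."""
        if protected_tail and reference[-protected_tail:] != recognized[-protected_tail:]:
            return False
        return weighted_levenshtein(reference, recognized) < threshold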
Based on several distances from the center and neighbors of the N-gram to the selected candidates, a heuristic estimate of the binding of the N-gram as a whole is formed.
The model parameters (thresholds, N-gram lengths) were chosen during training so as to minimize the number of N-gram binding errors and maximize the number of correctly bound N-grams.

After binding the N-grams to the words of the paragraph, the following checks can be carried out:

  • the presence of all expected N-grams;
  • the presence of all unique N-grams in exactly one copy;
  • the correct order of the N-grams;
  • the distances between adjacent N-grams.

Failure of any of these checks means that a modification of an important keyword has been found.
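These four checks can be sketched as follows; bindings maps each expected N-gram to the list of paragraph positions where it was bound, and expected_gaps holds the known distances between adjacent N-grams (all names and the tolerance are illustrative):

    def check_bindings(expected, bindings, expected_gaps, gap_tolerance=2):
        """Run the four N-gram checks; every issue is a potential modification."""
        issues = []
        for ng in expected:                        # expected N-grams assumed unique
            hits = bindings.get(ng, [])
            if not hits:
                issues.append(("missing", ng))     # expected N-gram not found
            elif len(hits) > 1:
                issues.append(("duplicated", ng))  # unique N-gram found more than once
        positions = [bindings[ng][0] for ng in expected if bindings.get(ng)]
        if positions != sorted(positions):
            issues.append(("order violated", None))
        for (a, b), gap in expected_gaps.items():
            if bindings.get(a) and bindings.get(b):
                if abs((bindings[b][0] - bindings[a][0]) - gap) > gap_tolerance:
                    issues.append(("distance violated", (a, b)))
        return issues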

The described method was tested on a dataset of 161 images of an “Agreement”-type document scanned at resolutions from 100 to 300 dpi. We investigated a model of 33 keywords. Some of the keywords in the dataset images were deliberately deleted or modified: 740 word deletions and 140 word modifications were made. OCR Smart IDReader [5] was used for recognition.

The quality of the algorithm was evaluated using the precision and recall criteria, defined in terms of the following counts:

  • the number of modified words found (tp);
  • the number of correct words classified as modifications (fp);
  • the number of modified words not found (fn);
  • the number of correct words classified as correct (tn).
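For reference, these counts are combined in the standard way: Precision = tp / (tp + fp) and Recall = tp / (tp + fn). For example, in the first row of the table below, Precision = 216 / (216 + 414) ≈ 0.34 and Recall = 216 / (216 + 0) = 1.00.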

The results are presented in the table below, which shows the characteristics calculated for several values of the threshold d(w_i) used to assess the correctness of a word compared with the reference word.

d(w_i)       tp    fp    tn      fn   Precision   Recall
1            216   414   7380    0    0.34        1.00
2            216   90    10620   0    0.70        1.00
3 and more   216   54    10980   0    0.80        1.00

Note that with OCR Smart IDReader, all modified words were found. The method's errors are associated with recognition errors, primarily due to scanning defects (the presence of overexposed areas).

It is easy to see that the described method has a limitation related to the accuracy of finding word boundaries. The scanning defects mentioned above led to a small number of word boundary search errors (about 1-1.5% for some keywords). To remove this limitation, we propose an additional way to search for words. For an undetected N-gram, the subset of words of the recognized paragraph in which this N-gram was expected is selected. Spaces are removed from the selected subset of words, forming a string of characters. The words of the N-gram are concatenated, forming a substring to search for. Then a substring search is performed, for example, using a modified bitap algorithm with a modified Levenshtein distance. This reduces the number of N-gram check errors caused by word boundary search errors by a factor of 2-3.
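The fallback search can be sketched with a plain dynamic-programming approximate substring match (used here in place of the bitap/shift-or algorithm mentioned above; the example strings are invented):

    def fuzzy_find(pattern, text, max_errors):
        """Best end position and cost of an approximate occurrence of pattern in text."""
        prev = [0] * (len(text) + 1)        # empty pattern matches before any position
        for i, pc in enumerate(pattern, 1):
            cur = [i]
            for j, tc in enumerate(text, 1):
                cur.append(min(prev[j] + 1,                 # pattern char absent from text
                               cur[j - 1] + 1,              # extra text char inside match
                               prev[j - 1] + (pc != tc)))   # match or substitute
            prev = cur
        best_end = min(range(len(text) + 1), key=lambda j: prev[j])
        return (best_end, prev[best_end]) if prev[best_end] <= max_errors else (None, None)

    paragraph = "thecontractornotguaranteesuninterruptedservice"
    print(fuzzy_find("notguarantees", paragraph, max_errors=2))   # -> (26, 0)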

Brief conclusion


We have described a tool for detecting forgeries in contract documents. Of course, this tool does not completely solve the problem, and manual checks of the allegedly modified words it finds are still required. The method reliably automates the search for modifications and significantly reduces the number of routine manual checks. The main difficulty in developing the described method was obtaining real datasets with forgeries.

Bibliography
  1. Sidere N. et al. A dataset for forgery detection and spotting in document images // 2017 Seventh International Conference on Emerging Security Technologies (EST). – IEEE, 2017. – P. 26-31.
  2. Bertrand R. et al. A conditional random field model for font forgery detection // 2015 13th International Conference on Document Analysis and Recognition (ICDAR). – IEEE, 2015. – P. 576-580.
  3. Levenshtein V. I. Binary codes capable of correcting deletions, insertions, and reversals // Doklady Akademii Nauk SSSR. – 1965. – Vol. 163. – No. 4. – P. 845-848.
  4. Bezmaternykh P. V., Nikolaev D. P. A document skew detection method using fast Hough transform // Twelfth International Conference on Machine Vision (ICMV 2019). – International Society for Optics and Photonics, 2020. – Vol. 11433. – P. 114330J.
  5. Bulatov K. et al. Smart IDReader: Document recognition in video stream // 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). – IEEE, 2017. – Vol. 6. – P. 39-44.

