TextRadar fuzzy search algorithm. Add-ons (Part 2)

In a previous publication, The TextRadar Fuzzy Search Algorithm: The Main Approaches (Part 1), the main approaches were considered. When solving practical problems, situations were identified in which the basic methodology alone does not provide the required search quality. As a result, add-ons (algorithm options) were created; they are discussed below.

Fragments of one search-string word are located in several data-string words


Let the search string be ABCD and the data string be ABCE DFG.

image

Application of the basic algorithm will give the following result:

image

When such collisions occur, the extra groups are deleted. Choosing which groups to delete is a separate, nontrivial question; as a result, the most significant groups should remain. In the example above the answer is obvious: the smaller of the groups should be deleted.

image
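The resolution step above can be sketched as follows. This is a minimal illustration, not the original implementation: the tuple layout and function name are assumptions, and the rule applied is simply "keep the largest group per search-string word".

```python
# A group here is an illustrative tuple:
# (search_word, data_word_index, matched_fragment, position_in_data_word).

def resolve_collisions(groups):
    """When fragments of one search-string word are matched in several
    data-string words, keep only the largest group per search word."""
    best = {}
    for g in groups:
        search_word = g[0]
        if search_word not in best or len(g[2]) > len(best[search_word][2]):
            best[search_word] = g
    return list(best.values())

# Search string "ABCD" against data string "ABCE DFG": the basic pass
# finds "ABC" in the first data word and "D" in the second.
groups = [("ABCD", 0, "ABC", 0), ("ABCD", 1, "D", 0)]
print(resolve_collisions(groups))  # the smaller group "D" is removed
```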

The starting fragment of a search-string word is located inside a data-string word


Consider the example of finding the string ABC in the string XYZAB.

image

The basic algorithm will produce the following:

image

As a rule, such a result has no practical value and should be discarded.
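The discard rule can be sketched as a simple predicate over match positions; the function name and argument layout are illustrative assumptions, not the original code.

```python
def keep_group(pos_in_search_word, pos_in_data_word):
    """Drop matches where the starting fragment of a search word is found
    strictly inside a data word (not at the data word's beginning)."""
    return not (pos_in_search_word == 0 and pos_in_data_word > 0)

# Searching for "ABC" in "XYZAB": the basic pass matches "AB"
# (the start of the search word) at offset 3 inside the data word.
print(keep_group(0, 3))  # False: the match is discarded
print(keep_group(0, 0))  # True: a match at the word start is kept
```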

Insufficient coverage of a data-string word by the fragments found


If we search the string ABCDEF for the word ABXYZ

image

we get:

image

This result is also of no value and should be discarded.
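A coverage filter for this case might look like the sketch below; the threshold value of 0.5 is an assumption chosen for illustration, not a value stated in the article.

```python
def coverage_ok(data_word, fragments, min_coverage=0.5):
    """Keep a result only if the found fragments cover at least
    min_coverage of the data-string word's characters."""
    covered = sum(len(f) for f in fragments)
    return covered / len(data_word) >= min_coverage

# Searching "ABCDEF" for the word "ABXYZ" matches only the fragment "AB",
# covering 2 of the 6 characters of the data word.
print(coverage_ok("ABCDEF", ["AB"]))         # False: result discarded
print(coverage_ok("ABCDEF", ["ABC", "EF"]))  # True: 5/6 coverage
```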

Supergroups


Given the search string ABCXEF and the data string ABCEF ABCDEF.

image

From the point of view of the basic algorithm, both words of the data string are equivalent, but the first will take priority (if, other things being equal, the first word encountered is selected), and the search result will then be as follows:

image

Let us introduce the concept of a supergroup: a union of word groups located on the same diagonal (or shift; see the previous publication, which covers the basics of the algorithm). We then raise the priority of the groups belonging to a supergroup through a per-group parameter, the supergroup size, and assume that if a group is not part of any supergroup, its supergroup size equals the size of the group itself. For our example, this gives the following result:

image
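Supergroup sizing can be sketched as grouping by shift (diagonal); the tuple layout is an illustrative assumption. Groups that share a shift are united, and each group's supergroup size is the total matched length on its diagonal, falling back to the group's own length when it stands alone.

```python
from collections import defaultdict

# A group here is an illustrative tuple:
# (position_in_search_word, position_in_data_word, length).

def supergroup_sizes(groups):
    """For each group, return the total length on its diagonal
    (shift = pos_in_data - pos_in_search)."""
    diagonal_total = defaultdict(int)
    for pos_search, pos_data, length in groups:
        diagonal_total[pos_data - pos_search] += length
    return [diagonal_total[pos_data - pos_search]
            for pos_search, pos_data, length in groups]

# Search string "ABCXEF" vs the data word "ABCDEF":
# "ABC" (shift 0) and "EF" (shift 0) share a diagonal.
print(supergroup_sizes([(0, 0, 3), (4, 4, 2)]))  # [5, 5]

# Vs the data word "ABCEF": "ABC" (shift 0) and "EF" (shift -1) do not.
print(supergroup_sizes([(0, 0, 3), (4, 3, 2)]))  # [3, 2]
```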

Overweighting of long words


When searching for a phrase containing a long word and a short word, the found fragments of the long word can “obscure” the fragments of the short word. This is due to the quadratic function used in calculating the relevance coefficient.

image

Moreover, from the point of view of search quality, a longer word is not always more meaningful.

Consider an example. Suppose you need to find the string ABCDEFG XYZ in a text consisting of two pages:

1. ABCDEFG ... QWE

image

image

2. ABCDEFO ... XYZ

image

image

The numerator of the group composition coefficient (the denominator has no qualitative effect on the result; see the formula above) is 49 for the first page and 36 + 9 = 45 for the second. That is, if we disregard the influence of the length factor, the search result on the first page will be more relevant, which does not meet expectations: in some cases the result on page 2 will be more valuable.

One solution is to limit the maximum length of groups. In our example, if the maximum group length is capped at 6, we get 36 for the first page and 45 for the second, which gives the expected result: the relevance of the second page is higher than that of the first.

Another way to address the mismatch between the significance of the words of the search phrase and their lengths is to calculate the relevance of the search phrase as the average of the independently calculated relevances of its words. But this raises the opposite problem: the significance of short words must be reduced. It can be solved in a similar way, by setting a threshold on the length of the words that participate in forming the overall relevance, this time a minimum value.
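Both mitigations can be sketched as follows. The quadratic numerator and the cap value 6 follow the worked example above; the minimum word-length threshold and all function names are illustrative assumptions.

```python
def capped_numerator(group_lengths, max_len=6):
    """Sum of squared group lengths, with each length capped at max_len."""
    return sum(min(n, max_len) ** 2 for n in group_lengths)

# Page 1 matches one group of length 7; page 2 matches groups of 6 and 3.
print(capped_numerator([7]))     # 36
print(capped_numerator([6, 3]))  # 45: page 2 now outranks page 1

def phrase_relevance(word_relevances, word_lengths, min_word_len=3):
    """Average the independently computed word relevances, ignoring words
    shorter than the minimum length threshold."""
    kept = [r for r, n in zip(word_relevances, word_lengths)
            if n >= min_word_len]
    return sum(kept) / len(kept) if kept else 0.0
```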

Repeated occurrences of the desired fragments


As follows from the problem statement, the task is to find the text most relevant to the search string, so repeated occurrences of the desired fragments in the text do not affect the result: the first suitable fragment is used as the search result, and the rest are discarded during processing. Consider the example of finding the string ABC in the string ABCD ABCE ABCF ABCG.

image

Application of the basic algorithm will give the following result:

image

From the point of view of finding the most relevant page this is correct, but when forming a detailed presentation of the search results, all suitable fragments must be found and highlighted. To do this, the search procedure can be repeated on the displayed page, successively disabling the fragments found in previous iterations. In our example, this allows us to find all suitable occurrences of the desired string.

image
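The repeated-search idea can be sketched with exact matching standing in for the full fuzzy procedure; the masking character and function name are illustrative assumptions. Each iteration finds the next remaining occurrence and disables it before the next pass.

```python
def find_all_occurrences(needle, haystack):
    """Repeat the search, masking out each found fragment so later
    iterations skip it, until no occurrence remains."""
    positions = []
    masked = haystack
    while True:
        pos = masked.find(needle)
        if pos == -1:
            break
        positions.append(pos)
        # Disable the found fragment for subsequent iterations.
        masked = (masked[:pos] + "\0" * len(needle)
                  + masked[pos + len(needle):])
    return positions

print(find_all_occurrences("ABC", "ABCD ABCE ABCF ABCG"))  # [0, 5, 10, 15]
```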

Data processing speed


On-the-fly data processing imposes certain speed requirements: the search should not take too long.

Limiting the minimum size of processed groups and disabling search options

To increase the search speed, you can limit the minimum length of processed groups.

In practice, the following approach is applied: first, a “rough” pass over the entire search data array is made, with a restriction on the minimum group size and with some options disabled, to identify the first portion of the data (the optimal portion size is determined by the context of the problem being solved); then a second, more thorough pass is made over the pages of that portion only, without the group-size limit and with all the necessary options enabled.
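The two-pass scheme can be sketched as below. The scoring function is a toy stand-in (greedy common-substring matching with a quadratic contribution), and the portion size and all names are assumptions for illustration, not the original implementation.

```python
def score(search, page, min_group_len=1):
    """Toy relevance: greedily match the longest remaining prefix of the
    search string found in the page, summing squared group lengths and
    counting only groups of at least min_group_len characters."""
    total, i = 0, 0
    while i < len(search):
        n = len(search) - i
        while n >= min_group_len and search[i:i + n] not in page:
            n -= 1
        if n >= min_group_len:
            total += n * n
            i += n
        else:
            i += 1
    return total

def two_pass_search(search, pages, portion=2):
    """Coarse pass (min group length 3) selects a portion of candidate
    pages; the thorough pass rescores only those pages."""
    coarse = sorted(pages, key=lambda p: score(search, p, 3), reverse=True)
    candidates = coarse[:portion]
    return max(candidates, key=lambda p: score(search, p, 1))
```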

Parallel data processing

Another way to increase search speed is parallel data processing. With a large search database, an acceptable total processing time can be achieved by increasing the number of parallel processes, which naturally requires more powerful hardware.
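Splitting the page scoring across workers can be sketched as follows. The scoring function is a toy stand-in for the real relevance computation, and a thread pool is used only to keep the example short; CPU-bound work in Python would normally use processes instead because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_score(args):
    search, page = args
    # Toy stand-in: count search-string characters present in the page.
    return sum(1 for ch in search if ch in page)

def parallel_best_page(search, pages, workers=4):
    """Score all pages in parallel and return the most relevant one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(toy_score, ((search, p) for p in pages)))
    return pages[scores.index(max(scores))]

pages = ["QWERTY", "ABCDEF", "ABX"]
print(parallel_best_page("ABC", pages))  # ABCDEF
```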

Results


Applying the described approaches allowed us to significantly improve search quality and to reduce the dependence of its results on various kinds of typos (extra or missing characters, transpositions, etc.), and, more importantly, to do so regardless of where these typos occur: in the search string itself or in the text being searched.

The results of applying the described approaches can be seen on the demo stand deployed at textradar.ru. Using the text of Leo Tolstoy's novel “War and Peace” as an example, one can compare the search results of the basic and advanced versions of the algorithm.

image

The demo project with the advanced functionality (C# ASP.NET MVC) can be downloaded or viewed here.
