TextRadar fuzzy search algorithm. Add-ons (Part 2)

In a previous publication, The TextRadar Fuzzy Search Algorithm: The Main Approaches (Part 1), the main approaches were considered. When solving practical problems, situations were identified in which the basic methodology alone does not provide the required search quality. As a result, add-ons (algorithm options) were created; they are discussed below.

Fragments of one search-string word are located in several data-string words


Let the search string be ABCD and the data string be ABCE DFG.

image

Application of the basic algorithm will give the following result:

image

When such collisions occur, the extra groups are deleted. Choosing which groups to delete is a separate, nontrivial question; as a result, the most significant groups should remain. In the example above the answer is obvious: the smaller of the groups should be deleted.

image
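The resolution step above can be sketched as follows. This is a minimal illustration, not the original implementation: the tuple layout and function name are assumptions, and the rule applied is simply "keep the largest group per search-string word".

```python
# A group here is an illustrative tuple:
# (search_word, data_word_index, matched_fragment, position_in_data_word).

def resolve_collisions(groups):
    """When fragments of one search-string word are matched in several
    data-string words, keep only the largest group per search word."""
    best = {}
    for g in groups:
        search_word = g[0]
        if search_word not in best or len(g[2]) > len(best[search_word][2]):
            best[search_word] = g
    return list(best.values())

# Search string "ABCD" against data string "ABCE DFG": the basic pass
# finds "ABC" in the first data word and "D" in the second.
groups = [("ABCD", 0, "ABC", 0), ("ABCD", 1, "D", 0)]
print(resolve_collisions(groups))  # the smaller group "D" is removed
```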

The starting fragment of a search-string word is located inside a data-string word


Consider the example of finding the string ABC in the string XYZAB.

image

The basic algorithm will produce the following:

image

As a rule, such a result has no practical value and should be discarded.
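The discard rule can be sketched as a simple predicate over match positions; the function name and argument layout are illustrative assumptions, not the original code.

```python
def keep_group(pos_in_search_word, pos_in_data_word):
    """Drop matches where the starting fragment of a search word is found
    strictly inside a data word (not at the data word's beginning)."""
    return not (pos_in_search_word == 0 and pos_in_data_word > 0)

# Searching for "ABC" in "XYZAB": the basic pass matches "AB"
# (the start of the search word) at offset 3 inside the data word.
print(keep_group(0, 3))  # False: the match is discarded
print(keep_group(0, 0))  # True: a match at the word start is kept
```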

Insufficient coverage of a data-string word by the fragments found


If we search the string ABCDEF for the word ABXYZ

image

we get:

image

This result is also of no value and should be discarded.
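A coverage filter for this case might look like the sketch below; the threshold value of 0.5 is an assumption chosen for illustration, not a value stated in the article.

```python
def coverage_ok(data_word, fragments, min_coverage=0.5):
    """Keep a result only if the found fragments cover at least
    min_coverage of the data-string word's characters."""
    covered = sum(len(f) for f in fragments)
    return covered / len(data_word) >= min_coverage

# Searching "ABCDEF" for the word "ABXYZ" matches only the fragment "AB",
# covering 2 of the 6 characters of the data word.
print(coverage_ok("ABCDEF", ["AB"]))         # False: result discarded
print(coverage_ok("ABCDEF", ["ABC", "EF"]))  # True: 5/6 coverage
```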

Supergroups


Given the search string ABCXEF and the data string ABCEF ABCDEF.

image

From the point of view of the basic algorithm, both words of the data string are equivalent, but the first will take priority (if, other things being equal, the first word encountered is selected), and the search result will then be as follows:

image

Let us introduce the concept of a supergroup: a union of word groups located on the same diagonal (or shift; see the previous publication, which covers the basics of the algorithm). We then raise the priority of the groups belonging to a supergroup through a per-group parameter, the supergroup size, and assume that if a group is not part of any supergroup, its supergroup size equals the size of the group itself. For our example, this gives the following result:

image
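Supergroup sizing can be sketched as grouping by shift (diagonal); the tuple layout is an illustrative assumption. Groups that share a shift are united, and each group's supergroup size is the total matched length on its diagonal, falling back to the group's own length when it stands alone.

```python
from collections import defaultdict

# A group here is an illustrative tuple:
# (position_in_search_word, position_in_data_word, length).

def supergroup_sizes(groups):
    """For each group, return the total length on its diagonal
    (shift = pos_in_data - pos_in_search)."""
    diagonal_total = defaultdict(int)
    for pos_search, pos_data, length in groups:
        diagonal_total[pos_data - pos_search] += length
    return [diagonal_total[pos_data - pos_search]
            for pos_search, pos_data, length in groups]

# Search string "ABCXEF" vs the data word "ABCDEF":
# "ABC" (shift 0) and "EF" (shift 0) share a diagonal.
print(supergroup_sizes([(0, 0, 3), (4, 4, 2)]))  # [5, 5]

# Vs the data word "ABCEF": "ABC" (shift 0) and "EF" (shift -1) do not.
print(supergroup_sizes([(0, 0, 3), (4, 3, 2)]))  # [3, 2]
```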

Overweighting of long words


When searching for a phrase containing a long word and a short word, the found fragments of the long word can “obscure” the fragments of the short word. This is due to the quadratic function used in calculating the relevance coefficient.

image

Moreover, from the point of view of search quality, a longer word is not always more meaningful.

Consider an example. Suppose you need to find the string ABCDEFG XYZ in a text consisting of two pages:

1. ABCDEFG ... QWE

image

image

2. ABCDEFO ... XYZ

image

image

The numerator of the group composition coefficient (the denominator has no qualitative effect on the result; see the formula above) is 49 for the first page and 36 + 9 = 45 for the second. That is, if we disregard the influence of the length factor, the search result on the first page will be more relevant, which does not meet expectations: in some cases the result on page 2 will be more valuable.

One solution is to limit the maximum length of groups. In our example, if the maximum group length is capped at 6, we get 36 for the first page and 45 for the second, which gives the expected result: the relevance of the second page is higher than that of the first.

Another way to address the mismatch between the significance of the words of the search phrase and their lengths is to calculate the relevance of the search phrase as the average of the independently calculated relevances of its words. But this raises the opposite problem: the significance of short words must be reduced. It can be solved in a similar way, by setting a threshold on the length of the words that participate in forming the overall relevance, this time a minimum value.
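Both mitigations can be sketched as follows. The quadratic numerator and the cap value 6 follow the worked example above; the minimum word-length threshold and all function names are illustrative assumptions.

```python
def capped_numerator(group_lengths, max_len=6):
    """Sum of squared group lengths, with each length capped at max_len."""
    return sum(min(n, max_len) ** 2 for n in group_lengths)

# Page 1 matches one group of length 7; page 2 matches groups of 6 and 3.
print(capped_numerator([7]))     # 36
print(capped_numerator([6, 3]))  # 45: page 2 now outranks page 1

def phrase_relevance(word_relevances, word_lengths, min_word_len=3):
    """Average the independently computed word relevances, ignoring words
    shorter than the minimum length threshold."""
    kept = [r for r, n in zip(word_relevances, word_lengths)
            if n >= min_word_len]
    return sum(kept) / len(kept) if kept else 0.0
```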

Repeated occurrences of the desired fragments


As follows from the problem statement, the task is to find the text most relevant to the search string, so repeated occurrences of the desired fragments in the text do not affect the result: the first suitable fragment is used as the search result, and the rest are discarded during processing. Consider the example of finding the string ABC in the string ABCD ABCE ABCF ABCG.

image

Application of the basic algorithm will give the following result:

image

From the point of view of finding the most relevant page this is correct, but when forming a detailed presentation of the search results, all suitable fragments must be found and highlighted. To do this, the search procedure can be repeated on the displayed page, successively disabling the fragments found in previous iterations. In our example, this allows us to find all suitable occurrences of the desired string.

image
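The repeated-search idea can be sketched with exact matching standing in for the full fuzzy procedure; the masking character and function name are illustrative assumptions. Each iteration finds the next remaining occurrence and disables it before the next pass.

```python
def find_all_occurrences(needle, haystack):
    """Repeat the search, masking out each found fragment so later
    iterations skip it, until no occurrence remains."""
    positions = []
    masked = haystack
    while True:
        pos = masked.find(needle)
        if pos == -1:
            break
        positions.append(pos)
        # Disable the found fragment for subsequent iterations.
        masked = (masked[:pos] + "\0" * len(needle)
                  + masked[pos + len(needle):])
    return positions

print(find_all_occurrences("ABC", "ABCD ABCE ABCF ABCG"))  # [0, 5, 10, 15]
```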

Data processing speed


On-the-fly data processing imposes certain speed requirements: the search should not take too long.

Limiting the minimum size of processed groups and disabling search options

To increase the search speed, you can limit the minimum length of processed groups.

In practice, the following approach is applied: first, a “rough” pass over the entire search data array is made, with a restriction on the minimum group size and with some options disabled, to identify the first portion of the data (the optimal portion size is determined by the context of the problem being solved); then a second, more thorough pass is made over the pages of that portion only, without the group-size limit and with all the necessary options enabled.
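The two-pass scheme can be sketched as below. The scoring function is a toy stand-in (greedy common-substring matching with a quadratic contribution), and the portion size and all names are assumptions for illustration, not the original implementation.

```python
def score(search, page, min_group_len=1):
    """Toy relevance: greedily match the longest remaining prefix of the
    search string found in the page, summing squared group lengths and
    counting only groups of at least min_group_len characters."""
    total, i = 0, 0
    while i < len(search):
        n = len(search) - i
        while n >= min_group_len and search[i:i + n] not in page:
            n -= 1
        if n >= min_group_len:
            total += n * n
            i += n
        else:
            i += 1
    return total

def two_pass_search(search, pages, portion=2):
    """Coarse pass (min group length 3) selects a portion of candidate
    pages; the thorough pass rescores only those pages."""
    coarse = sorted(pages, key=lambda p: score(search, p, 3), reverse=True)
    candidates = coarse[:portion]
    return max(candidates, key=lambda p: score(search, p, 1))
```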

Parallel data processing

Another way to increase search speed is parallel data processing. With a large search database, an acceptable total processing time can be achieved by increasing the number of parallel processes, which naturally requires more powerful hardware.
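Splitting the page scoring across workers can be sketched as follows. The scoring function is a toy stand-in for the real relevance computation, and a thread pool is used only to keep the example short; CPU-bound work in Python would normally use processes instead because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_score(args):
    search, page = args
    # Toy stand-in: count search-string characters present in the page.
    return sum(1 for ch in search if ch in page)

def parallel_best_page(search, pages, workers=4):
    """Score all pages in parallel and return the most relevant one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(toy_score, ((search, p) for p in pages)))
    return pages[scores.index(max(scores))]

pages = ["QWERTY", "ABCDEF", "ABX"]
print(parallel_best_page("ABC", pages))  # ABCDEF
```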

Results


Applying the described approaches allowed us to significantly improve search quality and to reduce the dependence of its results on various kinds of typos (extra or missing characters, transpositions, etc.), and, more importantly, to do so regardless of where these typos occur: in the search string itself or in the text being searched.

The results of applying the described approaches can be seen on the demo stand deployed at textradar.ru. Using the text of Leo Tolstoy's novel “War and Peace” as an example, one can compare the search results of the basic and advanced versions of the algorithm.

image

The demo project with the advanced functionality (C# ASP.NET MVC) can be downloaded or viewed here.
