How we teach Yandex to answer questions and save users 20 thousand hours a day



When we type a query into the search bar, we are looking for information, not links. Often all we need is a short sentence or a well-known fact. For example, [ the formula for the volume of a truncated pyramid ] is the same on every site; no links are needed, just give the answer.

Factual (informational) answers no longer surprise anyone, but few people know how they are built, how they differ from one another, and what has changed in this area recently. My name is Anton Ivanov. Today, together with my colleague Mikhail Ageev (dminer), I'll tell the story of answers in search and share some details we haven't talked about before. I hope it will be useful.

The history of the Internet is a history of making the search for information simpler. Once upon a time, people looked for answers in online catalogs, where links to sites were grouped by topic. Then search engines appeared and learned to find sites by keywords. The demand for quick answers pushed the technology further: keyword search gradually evolved into search by meaning, where the answer can be found on a page with zero keyword overlap with the query. But even then you still had to click on the links. People have always dreamed of more.

First facts


By now it is hard to remember exactly how factual answers at Yandex began. We can say that the solution was a special widget format (a "sorcerer", as we call such blocks) that shows a short text answer without any interactivity (unlike the responses to queries such as [ my ip address ] or [ aqua color ]). Implementing such a format is not hard. The main question is different: where do the answers come from?



We started with the technically simplest approach. Dedicated people (assessors) analyzed the most popular queries and picked those for which a short answer could be found. A classic example of such a query is [ how many paws a fly has ].



This approach could only cover the most popular queries, while the long tail of other queries was left out. We partially solved this problem with crowdsourcing.

A few years ago, Toloka workers (tolokers) began helping us replenish the database of factual answers. Frequent queries were uploaded to the platform, and tolokers saw the task: "Is it true that an exhaustive short answer can be given to this query? If so, give it." Of course, other tolokers checked the quality of those answers, and we caught the remaining mistakes with the help of our search quality monitoring. By the way, tolokers also helped us learn that factual answers with a picture usually please users more than plain text.

The help of tolokers is significant, but even they cannot cover the long tail of low-frequency queries. There are simply too many of them for any manual labeling: not tens of thousands, but millions! To solve this problem, our experience with search ranking came in handy.

Fact snippet


When you search for something in Yandex, you see not just 10 links but also a title, a description, an icon, and other data for each result.

Let's focus on the description. Our search generates it automatically. To pick the best text fragment, a lightweight CatBoost model is used that estimates how close a fragment of text is to the query. It turns out that these descriptions sometimes already contain factual answers. It would be strange not to take advantage of this, but it is not that simple.
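
A minimal sketch of what such fragment scoring could look like (this is not Yandex's actual pipeline; the feature set and the model file name are illustrative assumptions):

```python
# Hypothetical sketch: scoring candidate text fragments against a query
# with a lightweight CatBoost ranker.
from catboost import CatBoostRanker

def fragment_features(query: str, fragment: str) -> list:
    q_words = set(query.lower().split())
    f_words = set(fragment.lower().split())
    overlap = len(q_words & f_words)
    return [
        overlap / max(len(q_words), 1),  # share of query words covered
        overlap,                         # raw word overlap
        len(fragment),                   # fragment length
    ]

model = CatBoostRanker()
model.load_model("fragment_ranker.cbm")  # assumed pre-trained ranker

def best_fragment(query: str, fragments: list) -> str:
    scores = model.predict([fragment_features(query, f) for f in fragments])
    return fragments[int(scores.argmax())]
```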

It may seem that the task is simply to pick the "most factual" description among all the descriptions of the pages found for a query, but this approach does not work well. The reason is that an informative page description does not always coincide with a good answer to a person's direct question. So our Fact Snippet technology builds fact candidates in parallel with page descriptions, but with different parameters, so that the result looks like an answer. Then, among these candidates, we need to choose the highest-quality one.

We have already told Habr readers about the "Palekh" and "Korolev" search algorithms and the DSSM approach. Back then, the task came down to finding texts close in meaning when ranking pages. Essentially, we compared two vectors: the query vector and the vector of the document text. The closer these vectors are in a multidimensional space, the closer the meanings of the texts. To choose the highest-quality facts, we did the same thing. Our neural network model, trained on the answers we already know, builds answer vectors for the pages found in the search and compares them with the query vector. This is how we get the best answer.
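
The selection step itself is easy to illustrate with cosine similarity; here embed-style vectors are assumed to come from the internal DSSM-like model, which is not shown:

```python
# Illustration only: choosing the best answer by cosine similarity between
# a query vector and candidate answer vectors.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_best_answer(query_vec: np.ndarray, candidates: list) -> str:
    # candidates: (answer_text, answer_vector) pairs built from found pages
    best_text, _ = max(candidates, key=lambda c: cosine(query_vec, c[1]))
    return best_text
```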

Of course, it makes no sense to answer every single query this way: most queries do not require a factual answer at all. So we use another model to filter out "non-factual" queries.

Fact Snippet 2.0


Everything we discussed above concerned "classical" factual answers: short, exhaustive, as in an encyclopedia. For a long time this was the only direction. But over time we saw more and more clearly that dividing queries by whether an exhaustive answer exists is, on the one hand, rather shaky, and on the other, opaque to the user, who simply wants to solve their problem faster. We needed to go beyond the usual facts. That is how the Fact Snippet 2.0 project appeared.



To put it simply, Fact Snippet 2.0 is the same Fact Snippet, but without the requirement to find an "exhaustive answer". In reality, things are somewhat more complicated.

Let me remind you that Fact Snippet works in two stages. At the first stage, a lightweight model estimates how "factual" the query is: does it expect a factual answer or not? If it does, then at the second stage we look for an answer, and it appears in the search results. For Fact Snippet 2.0 we adapted both stages to find answers to a wider range of questions. These answers do not claim encyclopedic completeness, but they are still useful.
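
Schematically, the two-stage flow looks like this (both models are placeholders passed in as plain callables, not real Yandex components):

```python
# A schematic of the two-stage flow described above.
FACTUALITY_THRESHOLD = 0.5   # assumed tuning parameter

def answer_query(query: str, factuality_score, find_best_answer):
    # Stage 1: a lightweight model scores how "factual" the query is.
    if factuality_score(query) < FACTUALITY_THRESHOLD:
        return None                     # most queries stop here
    # Stage 2: a heavier model picks the best answer among the found pages.
    return find_best_answer(query)      # shown in search results if not None
```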

It is possible to select a paragraph of text for almost any query, but it is not always necessary. Sometimes the found texts are not relevant enough to the query. Sometimes we already have good answers from other sources, and we need to decide which one to show. For example, why offer an organization's address as text if we can show an interactive map, a phone number, and reviews? We solve this with a blender classifier, which Andrei Styskin has already described to Habr readers. The answer also must not be rude or insulting. Almost every such reasonable restriction has its own classifier, and making them all work in runtime within a split second is a quest of its own.
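
A hypothetical sketch of how such a choice could be wired up; the scoring model and the per-restriction guard classifiers are placeholders, not the real ones:

```python
# Blender-style choice between answer candidates from different sources
# (text fact, organization card, and so on).
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

@dataclass
class Candidate:
    source: str   # e.g. "fact_snippet_2", "org_map", "unit_converter"
    payload: Any  # whatever this source wants to render

def choose_answer(query: str,
                  candidates: List[Candidate],
                  blender_score: Callable[[str, Candidate], float],
                  guards: List[Callable[[str, Candidate], bool]]
                  ) -> Optional[Candidate]:
    # Drop candidates rejected by any guard (rudeness, offensiveness, ...).
    safe = [c for c in candidates if all(g(query, c) for g in guards)]
    if not safe:
        return None
    # Let the blender pick the candidate it predicts to be most useful.
    return max(safe, key=lambda c: blender_score(query, c))
```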

Query reformulations


This covered another part of the long tail, but many "unique" queries were still left out. A significant share of them are just different formulations of queries we already know. For example, [ when does a pike change its teeth ] and [ at what time does a pike change its teeth ] are almost the same thing.



To solve this problem, we came up with a mechanism that understands on the fly that an incoming query is an alias of (means the same as) another query for which we already have an answer. This is simpler and faster than generating two factual answers independently.

We take all the queries for which we have answers, convert them into vectors, and put them into a k-NN index (more precisely, into its optimized version, HNSW, which allows much faster search). Then, for a query that has no answer by exact match, we build its vector and look for the top N most similar queries in our k-NN index.
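
One way to set this up with the open-source hnswlib library; the internal index is different, but the idea is the same (vector sizes and values here are placeholders):

```python
# Approximate nearest-neighbour search over vectors of already-answered queries.
import hnswlib
import numpy as np

dim = 128                                                            # assumed vector size
answered_vectors = np.random.rand(100_000, dim).astype(np.float32)   # placeholder data

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(answered_vectors), M=16, ef_construction=200)
index.add_items(answered_vectors, ids=np.arange(len(answered_vectors)))
index.set_ef(50)                     # search-time speed/accuracy trade-off

def top_n_similar(query_vec: np.ndarray, n: int = 10):
    # Returns ids and distances of the n closest already-answered queries.
    labels, distances = index.knn_query(query_vec, k=n)
    return labels[0], distances[0]
```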

Next, we go through this top and run a CatBoost classifier over each triple:

- the user's query;
- the query from the k-NN index;
- the answer to the query from the k-NN index.

If the classifier's verdict is positive, the query is considered an alias of the query from the k-NN index, and we can return the already known answer.

The main creative part of this design is writing factors (features) for the classifier. We tried a lot of different ideas here. Among the strongest factors (a small sketch follows the list below):

- query vectors;
- Levenshtein distances;
- word-by-word embeddings;
- factors based on which sorcerers (widgets) are triggered for each of the queries;
- distances between query words.
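
A hedged sketch of the alias check: a few of the factors above (Levenshtein distance, vector similarities) feed a CatBoost classifier over the triple (user query, known query, known answer). The model file and the embed() function are assumptions:

```python
import numpy as np
import Levenshtein                        # pip install python-Levenshtein
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.load_model("alias_classifier.cbm")  # assumed pre-trained binary model

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_alias(user_query: str, known_query: str, known_answer: str, embed) -> bool:
    features = [
        Levenshtein.distance(user_query, known_query),
        cosine(embed(user_query), embed(known_query)),
        cosine(embed(user_query), embed(known_answer)),
        abs(len(user_query.split()) - len(known_query.split())),
    ]
    return bool(model.predict([features])[0])
```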

Separately, I want to mention a trick involving the BERT neural network. We have rather strict time limits on the alias search: a few milliseconds at most. Running BERT within that time at a load of several thousand RPS is impossible on our current resources. So we used our BERT model to collect a huge number (hundreds of millions) of artificial labels and trained a simpler DSSM neural network on them, which runs very fast in runtime. As a result, we got a strong factor at the cost of some accuracy.
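
A schematic of such a distillation, not the production code: a heavy offline model (BERT) scores a large set of query pairs, and a small fast student network is trained to reproduce those scores. Sizes and the data loader are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastStudent(nn.Module):
    """A tiny DSSM-like two-tower encoder over bag-of-token inputs."""
    def __init__(self, vocab_size: int = 100_000, dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)    # averages token embeddings

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        va = F.normalize(self.embed(tokens_a), dim=-1)
        vb = F.normalize(self.embed(tokens_b), dim=-1)
        return (va * vb).sum(-1)                         # cosine-like similarity

def distill(student: FastStudent, teacher_batches, epochs: int = 1):
    # teacher_batches yields (tokens_a, tokens_b, teacher_score), where
    # teacher_score was produced offline by the heavy BERT model.
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(epochs):
        for tokens_a, tokens_b, teacher_score in teacher_batches:
            opt.zero_grad()
            loss = F.mse_loss(student(tokens_a, tokens_b), teacher_score)
            loss.backward()
            opt.step()
```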

The semantic proximity of queries can also be determined in other ways. For example, if two queries differ by a single word, check how their search results differ (look at the number of matching links in the top). If you repeat this many millions of times and average the results, you get a pretty good estimate of how much the meaning of a query changes when one word is swapped for another. After that, you can put all this data into one structure (for example, a trie) and compute query proximity via a generalized Levenshtein distance. The approach can be extended from single words to pairs of words, but then the trie grows much larger because of the combinatorial explosion of data.
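
An illustration of such a word-level "generalized Levenshtein" distance, where the cost of substituting one word for another comes from a precomputed table (here a plain dict; in practice something like a trie built from millions of averaged comparisons of search results):

```python
def generalized_levenshtein(q1: str, q2: str, sub_cost: dict) -> float:
    a, b = q1.split(), q2.split()
    # dp[i][j] = cost of turning the first i words of a into the first j of b
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = float(i)
    for j in range(1, len(b) + 1):
        dp[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # 0 for identical words, a learned cost if we have statistics
            # for this pair, and 1 otherwise
            repl = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), 1.0)
            dp[i][j] = min(dp[i - 1][j] + 1.0,       # delete a word
                           dp[i][j - 1] + 1.0,       # insert a word
                           dp[i - 1][j - 1] + repl)  # substitute a word
    return dp[-1][-1]

# Example: a low learned cost for a near-synonymous word pair
costs = {("paws", "legs"): 0.1}
print(generalized_levenshtein("how many paws a fly has",
                              "how many legs a fly has", costs))  # -> 0.1
```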

What's next


According to our estimates, factual and informational answers save users 20 thousand hours every day, because they do not have to click through the links in the search results (and that is not counting the time they would have spent finding the answer on the sites themselves). This is good, but there is always room to grow. For example, right now we use text found on the Internet for answers, but a ready-made piece of text cannot always be found in one place or in the right form. Neural networks can solve this problem: generate an answer so that it matches the query and contains nothing superfluous. This is our search neurosummarization project, which, I hope, we will talk about next time.
