How to help people find an organization without spending a week on it



When people type the name of a car repair shop, clinic, or store into Yandex search, they want to find information about it, such as opening hours or a phone number. Whether a person solves their problem quickly or wastes time and nerves depends on how accurate and up to date this data is.

My name is Alexander, and I represent the Geopoisk and Yandex.Directory team, whose data is used by more than 46 million people a month. Today I will briefly describe how we reduced the time it takes to update data in Yandex search from several days to a few hours, and sometimes minutes. You will also find out who Ricardo Milos is and what problems he caused us.



The Directory is a database of organizations. Any company or person can add information there: the address, opening hours, phone number, and so on, and Yandex will pass it on to users. Directory data is used in Search, Alice, Maps, Taxi, Navigator, and even in our caller ID, which we have already written about on Habr.

All would be well, but data goes stale: organizations close, move, change phone numbers, and so on. We can track changes and make edits ourselves, but today we will talk about the edits that users and companies send us through forms and other feedback mechanisms. We receive several thousand such edits per day, and we cannot simply publish them as is.

Edits contain errors, whether from carelessness or malicious intent, and the malicious ones are especially numerous. Some people distort competitors' data or mark their organizations as closed. Others are ordinary vandals who add profanity and other nonsense to company names and descriptions.



So if we published edits as is, users would suffer. That is why we check everything. Call center operators phone the organization and confirm the changes. Field staff visit companies and verify the data in person. But these methods are not fast enough for such a large stream of edits, so we also built a robot.

We use an automatic classifier of edits, the Auto Moderator. It is a machine learning model built on our CatBoost technology and trained on examples of good and bad edits. Fortunately, we have plenty of such data.

When an edit arrives, the Auto Moderator takes dozens of factors into account (for example, the history of the user's previous edits) and decides whether to approve the edit, reject it, or send it to a person for review. The Auto Moderator can check the Directory database to make sure nobody is trying to create a duplicate, look at the organization's website for new information, or even call the organization, introduce itself as Snezhana, and confirm the changes.
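The three-way decision described above can be illustrated with a small sketch. It is not our production code: the `Edit` fields, the `score_edit` stand-in (a smoothed share of the sender's previously approved edits instead of a trained CatBoost model), and both thresholds are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values are not given in this post.
APPROVE_THRESHOLD = 0.9
REJECT_THRESHOLD = 0.1


@dataclass
class Edit:
    org_id: int
    field_name: str
    new_value: str
    sender_approved_edits: int  # hypothetical factor: sender's past good edits
    sender_rejected_edits: int  # hypothetical factor: sender's past bad edits


def score_edit(edit: Edit) -> float:
    """Stand-in for the trained classifier.

    Returns a Laplace-smoothed share of the sender's approved edits,
    so a sender with no history gets a neutral score of 0.5.
    """
    good = edit.sender_approved_edits + 1
    total = edit.sender_approved_edits + edit.sender_rejected_edits + 2
    return good / total


def route(score: float) -> str:
    """Map a score to one of three verdicts."""
    if score >= APPROVE_THRESHOLD:
        return "approve"
    if score <= REJECT_THRESHOLD:
        return "reject"
    # The model is not confident either way: hand the edit to a person.
    return "manual_review"


trusted = Edit(1, "phone", "+7 495 000-00-00", 18, 0)
newcomer = Edit(1, "name", "Ricardo Milos", 0, 0)
print(route(score_edit(trusted)), route(score_edit(newcomer)))
```

The middle band between the two thresholds is what keeps people in the loop: the narrower it is, the fewer edits go to manual review, at the cost of more automatic mistakes.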

Here is one example. In 2018, a wave of “renaming” of schools, monuments, and other organizations swept through mapping services and directories: on the maps they were named after Ricardo Milos (there is an article on TJ about this flash mob). That is how we got acquainted with a then-popular meme about a Brazilian stripper (not that we wanted to, but nobody asked us). It was the combination of the Auto Moderator and other verification mechanisms that helped us defend the real names.

So the automatic classifier reduced the time it takes to update the data, but we did not stop there. Even with the Auto Moderator's help, edits could take several days to reach users of our services. That is a long time. To shorten it, we had to solve two technical problems.

Previously, the Auto Moderator was a batch process: it ran on a schedule and needed substantial resources for local computation (it worked with tables of tens of millions of records). We changed that.

Now it is a service that receives each edit and information about its sender in real time. The Auto Moderator then computes the factors and delivers a verdict. We used to wait hours for verdicts on edits. Now it takes minutes.
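To make the difference from a scheduled batch job concrete, here is a minimal sketch of a worker that handles each edit as soon as it arrives. The in-process queue, the `moderate` placeholder, and the dictionary keys are all assumptions for illustration; the real service and its transport are not described in this post.

```python
import queue
import threading


def moderate(edit: dict) -> str:
    """Placeholder verdict logic (hypothetical): reject blank values,
    send everything else to manual review."""
    if not edit["new_value"].strip():
        return "reject"
    return "manual_review"


def worker(incoming: queue.Queue, verdicts: list) -> None:
    """Process edits one by one as they arrive, instead of waiting
    for a scheduled run over the whole table."""
    while True:
        edit = incoming.get()
        if edit is None:  # sentinel: no more edits in this demo
            break
        verdicts.append((edit["org_id"], moderate(edit)))


def run_demo(edits: list) -> list:
    incoming: queue.Queue = queue.Queue()
    verdicts: list = []
    thread = threading.Thread(target=worker, args=(incoming, verdicts))
    thread.start()
    for edit in edits:
        incoming.put(edit)  # each edit is picked up as soon as it is queued
    incoming.put(None)
    thread.join()
    return verdicts


print(run_demo([
    {"org_id": 1, "new_value": "+7 495 000-00-00"},
    {"org_id": 2, "new_value": "   "},
]))
```

With a single worker the verdicts come out in arrival order. The latency of one verdict no longer depends on the schedule or on the size of the whole table, only on how long one edit takes to process.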

But this does not mean that changes reach the user in minutes. This is where the second problem awaited us.

A change lands in the Directory database, but it takes time to propagate to the services. For example, Search has to update its search index to reflect changes from the Directory. To get around this, we built a mechanism for storing object states. Simply put, we can now replace the phone number in the organization's answer shown in Search without rebuilding the search index. When building search results, Search knows which objects are outdated and can pull in more recent information. Of course, there are still cases where a data change affects the organization's ranking, and those cannot be handled without rebuilding the index.
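The idea of patching an answer at serving time can be sketched as follows. This is only an illustration under assumptions: the `FreshStateStore` class, the integer `version` field, and the document layout are invented for the example and do not describe the real storage.

```python
from typing import Optional


class FreshStateStore:
    """Hypothetical key-value store with the latest known state per organization."""

    def __init__(self) -> None:
        self._states: dict = {}

    def put(self, org_id: int, version: int, fields: dict) -> None:
        # Keep only the newest state; ignore late-arriving older ones.
        current = self._states.get(org_id)
        if current is None or version > current[0]:
            self._states[org_id] = (version, dict(fields))

    def get(self, org_id: int) -> Optional[tuple]:
        return self._states.get(org_id)


def render_answer(indexed_doc: dict, store: FreshStateStore) -> dict:
    """Overlay fresher fields on top of a document taken from the index.

    The index itself is untouched, so ranking still uses the indexed data.
    """
    fresh = store.get(indexed_doc["org_id"])
    if fresh is None or fresh[0] <= indexed_doc["version"]:
        return dict(indexed_doc)  # the index is already up to date
    version, fields = fresh
    merged = dict(indexed_doc)
    merged.update(fields)
    merged["version"] = version
    return merged


store = FreshStateStore()
store.put(42, 2, {"phone": "+7 495 222-22-22"})
indexed = {"org_id": 42, "version": 1, "name": "Cafe", "phone": "+7 495 111-11-11"}
print(render_answer(indexed, store)["phone"])
```

The overlay changes only what is displayed. Anything that feeds ranking is still read from the index, which matches the limitation mentioned above.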



So, after these improvements, we reduced the average time it takes to update data about organizations in Yandex services from several days to hours, and sometimes minutes. I would like to believe you have noticed.

Today I squeezed a rather long story into a short overview post. Tell us which aspects or decisions you would like to read about in more detail. We welcome your feedback and requests, and we will keep working on the Directory and telling Habr readers about its news.
