Search knowledge graph: building from multiple sources



I want to talk about what a knowledge graph is and about one of the ways to build it from several thematic sources.

A large number of search queries contain a single entity: the object the user is asking about. These can be queries about people, films, TV series, musical or geographical objects. When the user issues such a query, the results page can show an additional information card, in the hope that the information in it will be of interest. Cards enliven the results page and make it clearer at a glance. With information cards we let people see that they are using an intelligent service: the search engine understood which object they had in mind. Moreover, this intelligence can be extended by answering the user's question right on the results page. For example, in response to "what to see in Prague", we can immediately show the sights of that city.

There are times when the user wants to learn something about an object but knows nothing about it except the fact of its existence. Or the user wants to continue researching a topic. Especially for these cases, the card shows what the object is known for, along with possible ways to continue the research in the "See also" block.

Knowledge graph


All these semantic technologies are based on a knowledge graph: a graph whose nodes are objects of the real world and whose edges are typed relations between objects (for example, the place-of-birth relation differs from the date-of-birth relation). An example of a small knowledge graph is shown in Fig. 1.


Fig. 1. Example of a knowledge graph

There are open knowledge graphs that can be downloaded and used. Examples and some statistics on the best-known open knowledge graphs are presented in Table 1.

Table 1. Statistics on open knowledge graphs

Knowledge graph | Number of records | Number of objects | Update frequency
Freebase | over 3 billion | 49 million | not updated
Wikidata | 748 million | 18 million | once a week (incremental: once a day)
DBpedia | 411 million | 4 million | last updated in August 2019

The problem with open graphs is that not all areas of knowledge are covered deeply enough. Take Russian TV series: they are not very popular worldwide, but they matter to us because our users ask about them. This means we need to find information about these series somewhere and bring it to the SERP (Search Engine Results Page). The second problem is infrequent updates: when an event happens, we want to surface that information on the results pages as soon as possible.

There are at least three possible solutions to these problems.

First: do nothing, accept the situation, and move on. Second: carefully add information to the open knowledge graphs by hand, and then use the data from there. And the third option: automatically merge knowledge from a thematic source into the graph. For films and TV series there are quite a few such sources: KinoPoisk, IMDb, Kino Mail.ru. Moreover, objects in knowledge graphs usually carry links to popular thematic resources.

We chose the third approach. Before solving the problem itself, some preparation is needed. The data in the sources comes in different formats: in Wikidata it is JSON, in KinoPoisk it is HTML. They need to be converted to a common format. We convert everything to N-Triples, because that format is convenient to process in parallel, which really matters here, since we work with big data. The unpacked JSON dump of Wikidata takes about 720 GB, and the KinoPoisk HTML pages take 230 GB, so almost all tasks are solved on a cluster using the MapReduce paradigm.
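As an illustration of the conversion step, here is a minimal sketch that turns a JSON-like record into N-Triples lines. The URI prefixes and field names are invented for the example and do not reflect the actual Wikidata or KinoPoisk schemas.

```python
# Minimal sketch: convert a flat JSON-like record to N-Triples lines.
# All URIs here are hypothetical placeholders.

def escape_literal(value: str) -> str:
    """Escape backslashes and quotes as required by N-Triples literals."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def record_to_ntriples(subject_uri: str, record: dict) -> list[str]:
    """Emit one '<s> <p> "o" .' line per (predicate, literal) pair."""
    lines = []
    for predicate_uri, value in record.items():
        lines.append(
            f'<{subject_uri}> <{predicate_uri}> "{escape_literal(str(value))}" .'
        )
    return lines

triples = record_to_ntriples(
    "http://example.org/film/42",
    {"http://example.org/prop/name": "The Bridge",
     "http://example.org/prop/year": 2011},
)
for line in triples:
    print(line)
```

Because each object becomes an independent, self-contained set of lines, such a conversion parallelizes naturally in a MapReduce mapper.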

Gluing duplicates


One of the hardest tasks when merging knowledge graphs is gluing duplicates. Suppose we have two graphs, each containing an object that corresponds to the same real-world object. We want to glue them together so that the resulting graph has a single object. The first solution we came up with was quite simple: glue together all objects with the same name. That is how we learned that there are at least two famous people named Brad Pitt, a little fewer than forty films and series called "The Bridge", and about as many Ivanovs with a Wikipedia article.

Obviously, such a simple approach does not work; something more sophisticated is required. So we came up with a three-part solution. The first part is a generator of candidate pairs for linking. The second is a classifier that answers the question of whether a pair of objects should be glued together. And the third is an additional gluing step, where we glue what the previous steps failed to glue for some reason.

The generator is needed to reduce the number of candidate pairs for linking (trying to link every object with every other is too expensive). We therefore restrict ourselves to a small set of candidates and run the classifier only on that set. The basic idea is that the names of objects that may be linked should be similar in some way. We decided to start with an exact match of substrings, but allowing words to be rearranged. This means that even if two names differ only in word order (for example, first and last name swapped), the objects will still be matched. We use both Russian names and the original foreign ones, as well as nicknames, aliases, and synonyms collected from Wikipedia redirects.
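The word-rearrangement idea can be sketched as follows: key every known name of an object by its sorted, lowercased word set, so that names differing only in word order land in the same bucket. The graph contents and identifiers below are illustrative, not real dump data.

```python
from collections import defaultdict

def name_key(name: str) -> tuple:
    """Canonical key: sorted, lowercased words, so word order does not matter."""
    return tuple(sorted(name.lower().split()))

def generate_candidates(graph_a: dict, graph_b: dict):
    """Yield candidate pairs (id_a, id_b) whose names share a key.

    graph_a / graph_b map object ids to lists of known names
    (original name, Russian name, aliases, redirect synonyms)."""
    buckets = defaultdict(lambda: (set(), set()))
    for obj_id, names in graph_a.items():
        for name in names:
            buckets[name_key(name)][0].add(obj_id)
    for obj_id, names in graph_b.items():
        for name in names:
            buckets[name_key(name)][1].add(obj_id)
    for left, right in buckets.values():
        for a in left:
            for b in right:
                yield (a, b)

wikidata = {"Q35332": ["Brad Pitt"]}
kinopoisk = {"kp_25584": ["Pitt Brad"], "kp_1": ["The Bridge"]}
# the swapped word order still produces the pair ("Q35332", "kp_25584")
print(sorted(set(generate_candidates(wikidata, kinopoisk))))
```

Only pairs that fall into a shared bucket ever reach the classifier, which is what keeps the quadratic all-vs-all comparison out of the pipeline.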

Before training the model, we need to get labeled data from somewhere. We were lucky: everything we needed was already in the open knowledge graphs. For example, Wikidata contains a little under 200 thousand links from its objects to KinoPoisk films and series, and a little under 100 thousand links to KinoPoisk people. We train on this data.

All the features are based on the idea that the same object should have a similar context in the two graphs. A similar idea is used to solve the Entity Linking problem. By context we do not mean the neighboring objects themselves, but their string descriptions. More formally, the context of an object in a graph is the set of all literal strings adjacent to it, plus the names and types of the objects incident to it. In Fig. 2 the strings belonging to the object's context are highlighted in bold.
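A toy graph of triples makes this definition concrete. This is a simplified sketch under stated assumptions: the triple list, identifiers, and type labels are invented, and the graph is represented as plain (subject, relation, value) tuples rather than any real storage format.

```python
def object_context(obj, triples, names, types):
    """Collect the string context of obj in a graph of (subject, relation, value) triples:
    - literal values adjacent to obj,
    - names of objects incident to obj (in either direction),
    - type names of those neighbors and of obj itself.
    """
    context = set()
    for subj, rel, value in triples:
        neighbor = None
        if subj == obj:
            if value in names:          # object-valued triple: value is an object id
                neighbor = value
            else:                       # literal-valued triple: value is a string
                context.add(value)
        elif value == obj:
            neighbor = subj
        if neighbor is not None:
            context.add(names[neighbor])
            context.update(types.get(neighbor, ()))
    context.update(types.get(obj, ()))
    return context

triples = [
    ("messi", "name", "Lionel Messi"),
    ("messi", "birthday", "June 24, 1987"),
    ("messi", "football_club", "barca"),
]
names = {"messi": "Lionel Messi", "barca": "FC Barcelona"}
types = {"messi": {"person"}, "barca": {"football club"}}
print(object_context("messi", triples, names, types))
```

The resulting set contains the literals "Lionel Messi" and "June 24, 1987", the neighbor name "FC Barcelona", and the type labels, which is exactly the kind of bag of strings the features below are computed over.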


Fig. 2. Object context

There are several ways to work with strings. First, we can compare strings as a whole, counting only complete matches. Second, we can treat a string as a bag of words, splitting it into words and sacrificing their order. We use both approaches.

Strings can either be used all together at once or grouped by the types of relations that connect them to the object. Absolute and relative match counts are computed per group as well.

Let's take two objects: Lionel Messi and Lionel Richie (Table 2).

Table 2. Some relations for the Lionel Messi and Lionel Richie objects

Relation name | Lionel Messi | Lionel Richie
name | "Lionel Messi" | "Lionel Richie"
birthday | "June 24, 1987" | "June 20, 1949"
football_club | "FC Barcelona" | —

The names do not match as whole strings, so the complete-match feature for the name is zero. But the number of matching words in the name is 1, because the original first name "Lionel" occurs in both names. The Jaccard coefficient (the size of the intersection of the word sets divided by the size of their union) for the name is 0.2:

J(Messi, Richie) = |{«Lionel»}| / |{«Месси», «Ричи», «Lionel», «Messi», «Richie»}| = 1/5 = 0.2


We did not realize right away that if an object lacks a given relation, there is no point in hastily writing a zero for that feature; it is better to write a missing value. After all, when we merge a thematic graph into the large graph, some relation is often absent on one side, and we need to distinguish the case where two relation values did not match from the case where one of the objects simply does not have the relation at all. Following this logic, the number of matching words for the football_club relation is a missing value. The number of complete string matches across all relations is zero, and the number of matching words is two ("Lionel" and "June").
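These per-relation features can be sketched like this. It is a simplified illustration: the relation names and values follow Table 2 (English strings only), and a MISSING sentinel marks an absent relation instead of a zero.

```python
MISSING = None  # sentinel: the relation is absent on one side, not "no overlap"

def jaccard(words_a: set, words_b: set) -> float:
    """Size of intersection divided by size of union of two word sets."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def relation_features(obj_a: dict, obj_b: dict) -> dict:
    """For every relation present in either object, compute the pair
    (complete string match, Jaccard over words), or MISSING values
    when one side lacks the relation."""
    features = {}
    for rel in obj_a.keys() | obj_b.keys():
        if rel not in obj_a or rel not in obj_b:
            features[rel] = (MISSING, MISSING)
        else:
            a, b = obj_a[rel], obj_b[rel]
            features[rel] = (float(a == b), jaccard(set(a.split()), set(b.split())))
    return features

messi = {"name": "Lionel Messi", "birthday": "June 24, 1987",
         "football_club": "FC Barcelona"}
richie = {"name": "Lionel Richie", "birthday": "June 20, 1949"}
print(relation_features(messi, richie))
```

With only these English strings the birthday Jaccard comes out to 0.2 ("June" shared, five words in the union) and football_club is reported as missing rather than zero, matching the logic above.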

The ranking of the most important features used by the model is shown in Fig. 3.


Fig. 3. The most important features used by the model

The best feature of our model is the total relative word match. Behind this convoluted name hides the sum of the Jaccard coefficients over all relations. Interestingly, if we discard machine learning from this task entirely and keep just a reasonable cutoff on the value of this feature, the model's quality by F1-measure drops by only 20%. If we take the features described above and train XGBoost on them out of the box (250 trees of depth 4), we get fairly good metrics.

Table 3. Quality metrics of the model on test data

Metric name | Value
Precision | 0.961531
Recall | 0.963220
F1 | 0.962375
AUC | 0.999589
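The non-ML cutoff baseline mentioned above amounts to summing the Jaccard values over relations present on both sides and comparing the sum to a threshold. In this sketch the threshold 0.5 is an arbitrary illustration, not the tuned production value, and missing relations (None) are simply skipped.

```python
def total_relative_word_match(per_relation_jaccard: dict) -> float:
    """Sum of Jaccard coefficients over relations present on both sides.
    None marks a relation missing on one side and is excluded from the sum."""
    return sum(v for v in per_relation_jaccard.values() if v is not None)

def is_duplicate(per_relation_jaccard: dict, threshold: float = 0.5) -> bool:
    # threshold is a made-up illustration, not the production cutoff
    return total_relative_word_match(per_relation_jaccard) >= threshold

print(is_duplicate({"name": 1.0, "birthday": 0.6}))   # strong overlap on two relations
print(is_duplicate({"name": 0.2, "birthday": None}))  # weak match, birthday missing
```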

But there is a problem: these metrics do not reveal certain objects that we fail to glue. These are the so-called "large objects", which have many relations. Cities, countries, occupations: such objects are connected to a large number of other objects. Fig. 4 shows how this might look on the SERP. At first glance everything seems harmless:


Fig. 4. The problem of "large" objects

We have duplicates in the card. That may seem like no big deal, but the problem runs much deeper: data consistency suffers. The reason is that our training data contains no such objects at all. Here a different approach is needed, one that turns the large number of relations of these objects to our advantage. Fig. 5 shows a diagram of this approach. Suppose the classifier has linked two objects, X and Y. They belong to different graphs, and each is connected to other objects in its own graph: X is connected to objects A, B, and C via relations r1, r2, and r3, while Y is connected to objects D, E, and F via relations r1, r2, and r4. But now we know that X and Y are actually the same real-world object.


Fig. 5. Additional gluing

Let's put forward two hypotheses:

  • H1: P(D|A) = p = P(D|¬A), i.e. the objects D and A are not related in any way, and D occurs independently of A.
  • H2: P(D|A) = p1 ≠ p2 = P(D|¬A), i.e. there is some kind of relationship between D and A.

Then we can apply the likelihood-ratio test and accept one of the two hypotheses. This lets us glue large objects together.
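Under a binomial model, this test can be sketched as follows: compare the likelihood of the data under a single shared rate p (H1) against separate rates p1 and p2 (H2), and reject independence when the statistic exceeds the standard chi-square critical value for one degree of freedom. The counts in the example are invented for illustration.

```python
import math

def xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(0) = 0."""
    return x * math.log(y) if x > 0 else 0.0

def binom_loglik(k: int, n: int, p: float) -> float:
    """Log-likelihood of k successes in n trials under success probability p."""
    return xlogy(k, p) + xlogy(n - k, 1 - p)

def llr_statistic(k1: int, n1: int, k2: int, n2: int) -> float:
    """Likelihood-ratio statistic for H1 (shared p) vs H2 (separate p1, p2).

    k1/n1: how often D occurs among objects linked to A;
    k2/n2: how often D occurs among objects not linked to A."""
    p = (k1 + k2) / (n1 + n2)            # H1: D occurs independently of A
    p1, p2 = k1 / n1, k2 / n2            # H2: D depends on A
    return 2 * (binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
                - binom_loglik(k1, n1, p) - binom_loglik(k2, n2, p))

CHI2_95 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

print(llr_statistic(80, 100, 10, 100) > CHI2_95)  # rates differ a lot: accept H2
print(llr_statistic(45, 100, 45, 100) > CHI2_95)  # identical rates: keep H1
```

When H2 is accepted for a relation r shared by X's and Y's neighborhoods, that is evidence the corresponding neighbors describe the same object, which is what allows gluing the large objects.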

The final solution turned out to be quite general, so duplicates from other thematic sources can be glued in the same way, steadily increasing the coverage and depth of topics in the knowledge graph.

Conclusion


This article considered one possible solution to the problem of gluing identical objects across different knowledge graphs. Once the knowledge graph is built, we can show information cards on the search results page, use the knowledge from the graph to answer factoid queries, or come up with other interesting applications to please users.

I hope it was interesting to read a little about how all this works from the inside. If anything is unclear, please ask questions; I will try to answer them.
