How we learned to recommend films and why you shouldn't rely only on ratings



Imagine that you want to spend the evening watching a movie but don't know which one to choose. Yandex users often find themselves in the same situation, so our team builds recommendations that appear in Search and on Air. It may seem simple: take user ratings, use them to train a model that finds films the user is likely to rate 5 stars, and get a ready-made list of films. But this approach does not work. Why? That's what I'll talk about today.

A bit of history. Back in 2006, Netflix launched the Netflix Prize machine learning contest. In case you forgot, at that time the company was not yet streaming over the Internet but renting out films on DVD. Even then, though, it was important for Netflix to predict a user's rating in order to recommend something to them. The goal of the competition: predict viewers' ratings 10% better (by the RMSE metric) than Cinematch, Netflix's own recommendation system. It was one of the first mass competitions of its kind. Interest was fueled by a huge dataset (more than 100 million ratings) as well as a $1 million prize.

The competition ended in 2009. Two teams, BellKor's Pragmatic Chaos and The Ensemble, reached the finish line with the same result (RMSE = 0.8567), but The Ensemble took second place because they submitted their solution 20 minutes later than their competitors. The results and the papers can be found here. But the most interesting part is something else. If you believe the stories repeated at specialized conferences, the winning algorithms never made it into production in the video streaming service that launched soon after. I can't speak to the reasons behind other people's decisions, but I will tell you why we did the same.

Personal rating


For quite some time now, Yandex users have been able to rate the movies they've watched, and not only on KinoPoisk but also directly in search results. Over time, we have accumulated hundreds of millions of ratings from tens of millions of people. At some point, we decided to use this data to help users understand how much they would like a movie. In essence, we were solving the same problem as in the Netflix Prize contest: predicting what rating the user would give a film. KinoPoisk and IMDb ratings, built from aggregated user scores, already existed at that time. Since we were building a personal rating, we decided to use a separate percentage scale to avoid visual similarity and confusion.



By the way, mapping the point scale to percentages is a separate, non-obvious headache, so I will briefly cover it. It would seem that for a 10-point scale you could just take 10% per point and be done with it. But no. The reason is psychology and habit. For example, from a user's point of view, a rating of 8/10 is much better than 80%. How do you reconcile this? With crowdsourcing! That's what we did: we launched a task in Toloka in which the performers described their expectations of films with a given score or personal-rating percentage. Based on this markup, we selected a function that converts the predicted score from points to percentages so that user expectations are preserved.
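To make this concrete, here is a minimal sketch (not our production code) of how such a mapping could be fitted: a monotone function is fitted to crowdsourced (score, percentage) pairs so that a predicted point score can be displayed as a personal-rating percentage. The data points and the use of scikit-learn's isotonic regression are illustrative assumptions.

```python
# A minimal sketch of fitting a monotone score-to-percent mapping from
# crowdsourced markup. The data points below are made up for illustration;
# the real mapping was selected from Toloka annotations.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical aggregated markup: for each point score, the percentage that
# assessors judged to carry the same expectation of the film.
points = np.array([2.0, 4.0, 6.0, 7.0, 8.0, 9.0, 10.0])
percents = np.array([10.0, 25.0, 45.0, 60.0, 72.0, 85.0, 99.0])

# Fit a monotone mapping so that a predicted point score can be shown as a
# personal-rating percentage without breaking user expectations.
mapping = IsotonicRegression(y_min=0.0, y_max=100.0, out_of_bounds="clip")
mapping.fit(points, percents)

predicted_score = 8.3                       # model prediction on the 10-point scale
print(mapping.predict([predicted_score]))   # roughly 76% with this toy markup
```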

Sample tasks in Toloka


Predicting expectations for a movie is useful, but it would be nice to build recommendations as well, that is, to immediately show the person a list of good films. At this point, many of you might think: "Let's just sort the films the user has not watched by personal rating." We thought so too at first. But then we had to solve two problems.

Tool problem


When a user searches for a particular movie (or wants to rent a specific DVD), the service must predict the rating of that particular movie. This is exactly the problem solved in the Netflix Prize contest, where the RMSE metric was used. But recommendations solve a different problem: you need not to guess the rating but to find a movie that will actually be watched. And RMSE does a poor job of capturing this. For example, the penalty for predicting a rating of 2 instead of 1 is exactly the same as for predicting 5 instead of 4, yet our system should never recommend movies the user would rate a 2! List-based metrics, such as Precision@K, Recall@K, MRR, or NDCG, are much better suited to this task. I can't help but tell you a little more about them (if you are not interested in metrics, just skip the next paragraph).

Let's start with the MRR metric (mean reciprocal rank). We look at the position in the ranking of the film the user interacted with (for example, watched or rated highly) during the test period. MRR is the reciprocal of that position, averaged over users: $MRR = \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{1}{rank_u}$. Unlike RMSE, this metric evaluates the whole list, but unfortunately it only looks at the first guessed element. However, it is easy to modify the metric to get rid of this drawback: sum the reciprocal positions of all the films the user interacted with. This metric is called Average Reciprocal Hit Rank, and it takes into account every guessed film in the output. Note that position $k$ in the results receives a weight of $1/k$ for a guessed movie and a weight of 0 otherwise. Often $1/\log(k)$ is used instead of $1/k$: it corresponds better to the probability that the user will scroll down to position $k$. The result is the DCG metric (discounted cumulative gain): $DCG = \frac{1}{|U|} \sum_{u=1}^{|U|} \sum_{pos=1}^{N} \frac{gain_{pos}}{\max(1, \log(pos))}$. But different users contribute differently to the metric: for some we guessed all the films, for others none at all. Therefore, this metric is usually normalized: divide each user's DCG by the DCG of the best possible ranking for that user. The resulting metric is called NDCG (normalized discounted cumulative gain) and is widely used to assess ranking quality.
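For readers who prefer code, here is a rough Python sketch of these metrics for a single user's ranked list, assuming a binary gain (1 if the user interacted with the film, 0 otherwise); averaging over users happens outside these functions.

```python
# Sketch of the list metrics above for one user. `ranked` is the list of
# recommended films in order; `relevant` holds the films the user actually
# interacted with in the test period. Positions are 1-based.
import math

def reciprocal_rank(ranked, relevant):
    """MRR contribution of one user: 1 / position of the first hit."""
    for pos, film in enumerate(ranked, start=1):
        if film in relevant:
            return 1.0 / pos
    return 0.0

def average_reciprocal_hit_rank(ranked, relevant):
    """Sum of reciprocal positions over all hits, not just the first one."""
    return sum(1.0 / pos
               for pos, film in enumerate(ranked, start=1)
               if film in relevant)

def dcg(ranked, relevant):
    """Discounted cumulative gain with the 1 / max(1, log(pos)) discount."""
    return sum(1.0 / max(1.0, math.log(pos))
               for pos, film in enumerate(ranked, start=1)
               if film in relevant)

def ndcg(ranked, relevant):
    """DCG normalized by the DCG of an ideal ranking for this user."""
    ideal_order = sorted(ranked, key=lambda film: film not in relevant)
    ideal = dcg(ideal_order, relevant)
    return dcg(ranked, relevant) / ideal if ideal > 0 else 0.0

ranked = ["film_a", "film_b", "film_c", "film_d"]
relevant = {"film_b", "film_d"}
print(reciprocal_rank(ranked, relevant))              # 0.5
print(average_reciprocal_hit_rank(ranked, relevant))  # 0.5 + 0.25 = 0.75
print(ndcg(ranked, relevant))                         # ~0.86
```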

So, each task has its own metric. But the next problem is not so obvious.

Problem of choice


It's hard to describe, but I'll try. It turns out that people give their highest marks not to the films they usually watch. Rare cinematic masterpieces, classics, and arthouse get the top ratings, but that does not stop people from happily watching a good comedy, a new action movie, or a spectacular space opera in the evening after work. Add to this that Yandex users have not rated every film they have ever watched somewhere. And if we focus only on the highest ratings, we risk getting a recommendation feed that looks logical, whose quality users may even acknowledge, but from which they will not actually watch anything.

For example, this is what my movie feed might look like if we ranked it by personal rating and knew nothing about my past viewing. Great movies. But I don't feel like rewatching them today.



It turns out that, with ratings so sparse and highly rated films in short supply, it's worth looking at more than just the rating. Fine, then let's train the model to predict whether the recommended film will be watched rather than what rating it will get. That seems logical, because this is what the user wants. What could go wrong? The problem is that the feed will fill up with films, each of which is perfectly suitable for an easy evening, but whose personal rating is low. Users will, of course, notice that there are no "masterpieces" in the feed, trust in the recommendations will be undermined, and they will not watch even what they would have watched otherwise.

As a result, we came to understand that a balance between the two extremes is needed. We have to train the model so that both the likelihood of a view and the way a person perceives the recommendation are taken into account.

How our recommendations work


Our system is part of Search, so we need to build recommendations very quickly: the service's response time must be under 100 milliseconds. Therefore, we try to perform as many heavy operations as possible offline, at the data preparation stage. All films and users in the recommendation system are represented by profiles (not to be confused with accounts), which include object keys, counters, and embeddings (in other words, vectors in a certain space). Film profiles are prepared every day on YT (pronounced "Yt") and loaded into the RAM of the machines that answer requests. With users, everything is a little more complicated.

Every day we also build the main user profile on YT and upload it to a Yandex storage system, from which a profile can be fetched in a couple of tens of milliseconds. But this data quickly becomes outdated if a person is actively watching and rating videos, and it is not good if the recommendations start to lag. Therefore, we read the stream of user events and build the dynamic part of the profile from it. When a person issues a query, we merge the profile from the storage with the dynamic profile and get a single profile that lags behind reality by only a few seconds.
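Schematically, the merge could look like the sketch below; the field names and merge rules are illustrative, not the actual production schema.

```python
# A simplified sketch of assembling a user profile at request time: the daily
# snapshot from storage is merged with the dynamic part built from the recent
# event stream. Field names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    object_keys: set = field(default_factory=set)    # e.g. ids of watched films
    counters: dict = field(default_factory=dict)     # e.g. event counts
    embeddings: dict = field(default_factory=dict)   # embedding name -> vector

def merge_profiles(daily: UserProfile, dynamic: UserProfile) -> UserProfile:
    merged = UserProfile(
        object_keys=daily.object_keys | dynamic.object_keys,
        counters=dict(daily.counters),
        embeddings={**daily.embeddings, **dynamic.embeddings},  # fresher wins
    )
    for key, value in dynamic.counters.items():
        merged.counters[key] = merged.counters.get(key, 0) + value
    return merged
```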

All of that happens offline (that is, in advance); now let's move on to runtime. Here the recommendation system consists of two steps. Ranking the entire database of films would take too long, so in the first step we simply select several hundred candidates, that is, films that may be of interest to the viewer. These include both popular films and films close to the user in one of the embedding spaces. There are several algorithms for quickly finding nearest neighbors; we use HNSW (Hierarchical Navigable Small World). With it, we find the movies closest to the user in just a few milliseconds.
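As an illustration of this candidate-selection step, here is a minimal sketch using the open-source hnswlib package (named only as an example of an HNSW implementation, not necessarily the exact library we use), with random vectors standing in for the real film and user embeddings.

```python
# Candidate selection sketch: build an HNSW index over film embeddings
# offline, then find the films closest to the user embedding at query time.
import hnswlib
import numpy as np

dim = 400
film_embeddings = np.random.rand(100_000, dim).astype(np.float32)  # placeholder
film_ids = np.arange(len(film_embeddings))

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(film_embeddings), ef_construction=200, M=16)
index.add_items(film_embeddings, film_ids)
index.set_ef(100)  # query-time accuracy/speed trade-off

user_embedding = np.random.rand(dim).astype(np.float32)  # placeholder
candidate_ids, distances = index.knn_query(user_embedding, k=300)
```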

In the second step, we extract features (sometimes also called factors) for the movie, the user, and the user/movie pair, and rank the candidates with CatBoost. Let me remind you: we already realized that we need to focus not only on views but also on other quality characteristics of the recommendations, so for ranking we arrived at a combination of several CatBoost models trained on different targets.
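A hedged sketch of what such a combination might look like: one model predicts the probability of a view, another the personal rating, and a weighted blend orders the candidates. The data, feature shapes, and blending weight below are synthetic and purely illustrative.

```python
# Ranking candidates with a blend of CatBoost models trained on different
# targets. Everything here is synthetic; the real feature set and blending
# scheme are more involved.
import numpy as np
from catboost import CatBoostClassifier, CatBoostRegressor

rng = np.random.default_rng(0)
X_train = rng.random((1000, 20))             # user/film/pair features
y_watched = rng.integers(0, 2, size=1000)    # 1 if the film was watched
y_rating = rng.integers(0, 101, size=1000)   # personal rating, 0-100

watch_model = CatBoostClassifier(iterations=200, verbose=False).fit(X_train, y_watched)
rating_model = CatBoostRegressor(iterations=200, verbose=False).fit(X_train, y_rating)

def score(features, alpha=0.6):
    """Blend view probability with the predicted personal rating."""
    p_watch = watch_model.predict_proba(features)[:, 1]
    rating = rating_model.predict(features) / 100.0
    return alpha * p_watch + (1 - alpha) * rating

X_candidates = rng.random((300, 20))         # features of the selected candidates
order = np.argsort(-score(X_candidates))     # best candidates first
```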

To find candidates, we use embeddings from several matrix factorizations: from the classic version of ALS, which predicts the rating, to more complex variations based on SVD++. The ranking features include simple user and film event counters, CTRs for various events, and more complex pre-trained models; for example, the ALS prediction is itself used as a feature. One of the most useful models is the Recommender DSSM neural network, which deserves a bit more detail.
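To give an idea of the classic rating-predicting ALS mentioned above, here is a toy dense implementation. Real systems work on sparse data and are heavily optimized, so treat this only as a sketch of the idea that the learned user/film vectors double as embeddings, while their dot product serves as a rating prediction and hence a ranking feature.

```python
# Toy alternating least squares for explicit ratings. `ratings` is a dense
# user x film matrix, `mask` is 1 where a rating exists.
import numpy as np

def als(ratings, mask, dim=32, reg=0.1, iterations=10):
    n_users, n_films = ratings.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, dim))   # user embeddings
    V = rng.normal(scale=0.1, size=(n_films, dim))   # film embeddings
    eye = reg * np.eye(dim)
    for _ in range(iterations):
        for u in range(n_users):                     # fix V, solve for U[u]
            idx = mask[u] > 0
            U[u] = np.linalg.solve(V[idx].T @ V[idx] + eye,
                                   V[idx].T @ ratings[u, idx])
        for i in range(n_films):                     # fix U, solve for V[i]
            idx = mask[:, i] > 0
            V[i] = np.linalg.solve(U[idx].T @ U[idx] + eye,
                                   U[idx].T @ ratings[idx, i])
    return U, V

ratings = np.array([[5.0, 0.0, 3.0],
                    [4.0, 2.0, 0.0]])
mask = (ratings > 0).astype(float)
U, V = als(ratings, mask, dim=2, iterations=5)
print(U[0] @ V[2])   # predicted rating of film 2 for user 0
```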

Recommender DSSM


DSSM is a two-tower neural network. Each tower builds its own embedding, then the cosine similarity between the two embeddings is computed; this number is the network's output. In other words, the network learns to evaluate how close the objects in the left and right towers are. Similar neural networks are used, for example, in web search to find documents relevant to a query. In the search task, the query is fed into one tower and a document into the other. In our network, the role of the query is played by the user, and films act as the documents.

The film tower builds an embedding from data about the film: its title, description, genre, country, actors, and so on. This part of the network is quite similar to the search one. For the viewer, however, we want to use their history. To do this, we aggregate the embeddings of the films from the history with a decay by the time elapsed since each event. Then we apply several network layers on top of the aggregated embedding and end up with an embedding of size 400.
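Below is a simplified PyTorch sketch of such a two-tower network under the assumptions just described: the film tower maps film features to a 400-dimensional embedding, the user tower aggregates the embeddings of films from the history with an exponential decay in event age, and the output is the cosine similarity. Layer sizes, the decay rate, and the pooling scheme are illustrative guesses, not the production architecture.

```python
# Two-tower sketch: film tower + user tower with time-decayed history pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 400

class FilmTower(nn.Module):
    def __init__(self, film_feature_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(film_feature_dim, 512), nn.ReLU(),
            nn.Linear(512, EMB_DIM),
        )

    def forward(self, film_features):           # title/genre/actor features
        return self.net(film_features)

class UserTower(nn.Module):
    def __init__(self, decay=0.1):
        super().__init__()
        self.decay = decay
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM, 512), nn.ReLU(),
            nn.Linear(512, EMB_DIM),
        )

    def forward(self, history_embeddings, ages):
        # history_embeddings: [batch, history_len, EMB_DIM]
        # ages: [batch, history_len], time since each event
        weights = torch.exp(-self.decay * ages).unsqueeze(-1)
        pooled = (history_embeddings * weights).sum(dim=1)
        return self.net(pooled)

def similarity(user_emb, film_emb):
    return F.cosine_similarity(user_emb, film_emb, dim=-1)

films, users = FilmTower(film_feature_dim=128), UserTower()
film_emb = films(torch.rand(8, 128))
user_emb = users(torch.rand(8, 50, EMB_DIM), torch.rand(8, 50))
print(similarity(user_emb, film_emb).shape)      # torch.Size([8])
```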



Taking the user's entire history into account in the embedding right away would greatly slow down training. Therefore, we resort to a trick and train the network in two stages. First, we train a simpler InnerDSSM, which receives only the last 50 events from the user's history as input, without splitting them by event type (views, ratings, and so on). Then we retrain the resulting InnerDSSM on the user's entire history, now split by event type. The result is OuterDSSM, which is the one used in runtime.

Applying the network head-on at runtime would require quite a lot of computing resources. Therefore, we store the embeddings from the movie tower in a database, and the embeddings built from the user's history are updated in near real time. Thus, while processing a request, we only need to apply a small part of OuterDSSM and compute the cosines, which does not take much time.
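In code, the runtime part then boils down to something like the following: film embeddings come precomputed from storage, the user embedding is kept fresh, and scoring a few hundred candidates is a single batch of cosine similarities. Shapes here are illustrative.

```python
# Runtime scoring sketch: cosine similarity between one user embedding and the
# precomputed embeddings of the candidate films.
import numpy as np

def cosine_scores(user_embedding, candidate_embeddings):
    user = user_embedding / np.linalg.norm(user_embedding)
    cands = candidate_embeddings / np.linalg.norm(
        candidate_embeddings, axis=1, keepdims=True)
    return cands @ user

candidate_embeddings = np.random.rand(300, 400)  # loaded from the database
user_embedding = np.random.rand(400)             # updated near real time
scores = cosine_scores(user_embedding, candidate_embeddings)
```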

Conclusion


Our recommendations are now available in Search (for example, for the query [what to watch]) and in the Yandex.Air service, and an adapted version of this technology is also used in Yandex.Station. But this does not mean we can relax. We constantly train new models, use more and more data, and try new training approaches and new quality metrics. In my opinion, the older the field, the harder it is to move it forward. But that is exactly what makes it interesting for specialists.
