Smart reply stickers



Hello, Habr! Today we relaunched ICQ. The key features of the new messenger are based on artificial intelligence technologies: Smart Reply, which suggests quick sticker and text replies to an incoming message; sticker suggestions for typed phrases; voice message recognition; and others.

In this article I will talk about one of them: Smart Reply. The feature saves users time, since they only need to tap the sticker they like from the suggestions. It also encourages the use of a wider variety of stickers and makes communication more emotionally expressive.

Data preparation


This NLP task was solved with machine learning and neural networks. Training was performed on specially prepared data from public chats. The training examples were pairs consisting of a dialogue fragment and the sticker that one of the users sent in response to the interlocutor's last message. To better capture context, a dialogue fragment is made up of the interlocutor's last messages glued together, plus the user's own messages preceding them. The number of messages is a parameter worth experimenting with; for example, Google ML Kit uses a context of 10 messages [1].
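For illustration, here is a minimal sketch of how such context-sticker pairs could be assembled; the message format and the separator token name are assumptions for the example, not the production code:

```python
SEP = "<sep>"  # hypothetical separator token between the two speakers

def build_context(messages, max_messages=10):
    """Glue the last `max_messages` messages into a single context string.

    `messages` is a list of (sender, text) tuples in chronological order;
    the sticker in the training pair answers the last message.
    """
    parts, prev_sender = [], None
    for sender, text in messages[-max_messages:]:
        if prev_sender is not None and sender != prev_sender:
            parts.append(SEP)            # mark the switch between speakers
        parts.append(text)
        prev_sender = sender
    return " ".join(parts)

# One training example: the user replied to the last message with sticker 42.
dialog = [("B", "we could go to the movies"), ("A", "which one?"), ("A", "the new one")]
training_pair = (build_context(dialog), 42)
```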

One of the problems I had to deal with is that sticker usage frequency drops sharply after the most popular stickers, while the total number of stickers exceeds a million. The figure below shows the usage frequencies of the 4000 most popular stickers in descending order. For training, these 4000 popular stickers were taken, with some of the most frequent ones removed in order to reduce the imbalance of the training set.


Usage frequencies of the 4000 most popular stickers, in descending order.

The texts were normalized: numbers, repeated letters, single characters and punctuation were removed (except for question marks and exclamation marks, which matter for meaning), and everything was lowercased. The domain is such that chat users pay little attention to grammar.
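As an illustration, here is a rough sketch of such normalization in Python; the exact rules used in production are not specified in the article, so the regular expressions below are assumptions:

```python
import re

def normalize(text: str) -> str:
    text = text.lower()                                # lowercase everything
    text = re.sub(r"\d+", " ", text)                   # drop numbers
    text = re.sub(r"[^\w\s?!]", " ", text)             # drop punctuation except ? and !
    text = re.sub(r"(\w)\1{2,}", r"\1", text)          # collapse long letter repeats
    text = re.sub(r"\b\w\b", " ", text)                # drop single characters
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hiiii!!! See you at 7 :)"))           # -> "hi!!! see you at"
```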

Model selection


For prediction, a Siamese DSSM model is used, by analogy with the Smart Reply model in Mail.ru email (a similar architecture is described in [2]), but with certain differences. The advantage of the Siamese model is its efficiency at inference time, since the part corresponding to the stickers can be computed in advance. To answer emails, the original model represents text as a bag of word n-grams, which is justified for mail: the text can be long, and we need to capture its general gist and suggest a fairly standard short answer. In the case of chats and stickers, the text is short, and individual words matter more. Therefore, it was decided to use individual word embeddings as features and add an LSTM layer on top of them. A similar idea of using an LSTM layer for short texts was applied, for example, for text replies in the Google Allo messenger [3] and in DeepMoji, a model that predicts emoji for short messages [4]. The figure below shows the model schematically.


Model architecture.

In the figure, the initial layers are the embeddings of the incoming tokenized words (left) and of the stickers (right). For word tokenization, a dictionary of the 100K most popular words was used. A special dedicated token separates messages from different users. Training is end-to-end. After computing the embeddings for the word sequence, we pass them to an LSTM layer, whose state is then fed into fully connected layers with tanh activation. As a result, the left encoder outputs a vector representing the text, and the right encoder outputs a vector representing the sticker. The dot product of the two vectors measures how well the text and the sticker fit together. The dimensionality of the embeddings, of the LSTM internal state vector, and of the output vectors was set to 300.
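For illustration, here is a simplified sketch of the two encoders in PyTorch. The framework choice, the single dense layer per tower and other details are assumptions; only the dimensionality of 300 comes from the text above.

```python
import torch
import torch.nn as nn

EMB_DIM = HIDDEN_DIM = OUT_DIM = 300

class TextEncoder(nn.Module):
    """Left tower: word embeddings -> LSTM -> dense layer with tanh."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM, padding_idx=0)
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(HIDDEN_DIM, OUT_DIM), nn.Tanh())

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.emb(token_ids)
        _, (h, _) = self.lstm(x)                   # final hidden state
        return self.fc(h[-1])                      # (batch, OUT_DIM)

class StickerEncoder(nn.Module):
    """Right tower: sticker embedding -> dense layer with tanh."""
    def __init__(self, num_stickers: int):
        super().__init__()
        self.emb = nn.Embedding(num_stickers, EMB_DIM)
        self.fc = nn.Sequential(nn.Linear(EMB_DIM, OUT_DIM), nn.Tanh())

    def forward(self, sticker_ids):                # (batch,)
        return self.fc(self.emb(sticker_ids))      # (batch, OUT_DIM)

def score(text_vec, sticker_vec):
    # dot product between text and sticker vectors
    return (text_vec * sticker_vec).sum(dim=-1)
```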

The objective function for training looks like this:

J = -(1/K) * Σ_{i=1..K} [ S(x_i, y_i) - log Σ_{j=1..K} exp(S(x_i, y_j)) ]

where K is the batch size,

S(x_i, y_i) is the dot product of the output vectors for a positive text-sticker pair,

S(x_i, y_j) is the dot product of the vectors for a negative text-sticker pair. Negative examples were generated by randomly shuffling the original correct pairs. Since for popular universal stickers a shuffle can easily produce another correct pair, an additional check within the batch was used at the later training stage so that positive and negative pairs with the same sticker did not share similar texts. In the experiments it worked better to use fewer than K negative examples. Given how noisy the data is, training with large batches also worked better. Making the model more complex and adding an attention layer did not give a noticeable improvement in accuracy, which rather points to limitations related to the data and their quality.
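This objective can be implemented compactly. Below is a minimal PyTorch sketch of such an in-batch softmax loss, assuming every other sticker in the batch serves as a negative; as noted above, the production setup used fewer than K negatives and an extra check against accidental duplicate positives.

```python
import torch
import torch.nn.functional as F

def batch_softmax_loss(text_vecs, sticker_vecs):
    """text_vecs, sticker_vecs: (K, dim) tensors for K positive (text, sticker) pairs."""
    scores = text_vecs @ sticker_vecs.t()          # S(x_i, y_j) for all pairs, shape (K, K)
    targets = torch.arange(scores.size(0))         # the positive for x_i sits on the diagonal
    return F.cross_entropy(scores, targets)        # -1/K * sum_i log softmax_i
```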

Because a dictionary of whole words was chosen rather than character n-grams, the model loses robustness to typos. An experiment was run with fastText embeddings trained directly inside the network, which reduces the impact of misspellings. In that setup training went worse, although adding an attention layer helped noticeably. After weighing the quality metrics and other factors, it was decided to stick with the simpler model. The typo problem is then handled by running a spellchecker only when a word is not in the dictionary; in the normal case the spellchecker is not applied, since some misspelled words are part of the style of informal chatting.
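Schematically, the fallback looks like this; `vocab` and `spellcheck` here are hypothetical placeholders, not the production code:

```python
def tokens_to_ids(words, vocab, spellcheck):
    """Map words to dictionary ids, correcting only out-of-vocabulary words.

    `vocab` is a dict word -> id with an "<unk>" entry; `spellcheck` is any
    callable returning a corrected word.
    """
    ids = []
    for w in words:
        if w not in vocab:
            w = spellcheck(w)                     # only OOV words get corrected
        ids.append(vocab.get(w, vocab["<unk>"]))
    return ids
```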

Getting answers


How does the model work at the inference stage?

The sticker vectors produced by the right encoder are computed in advance. At request processing time, only the left encoder is used to obtain a vector for the incoming text. Next we need a list of stickers sorted by descending dot product between the text vector and the sticker vectors. This can be done directly by multiplying against all sticker vectors, or with nearest-neighbor search algorithms for this metric; for example, [2] proposes hierarchical quantization for Maximum Inner Product Search (MIPS). We applied the HNSW search algorithm, which gave a significant speedup over exhaustive search.
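For example, with the hnswlib library (one possible HNSW implementation; the article does not say which one was used), the index over sticker vectors can be built once and queried at request time:

```python
import numpy as np
import hnswlib

DIM, NUM_STICKERS = 300, 4000
sticker_vecs = np.random.rand(NUM_STICKERS, DIM).astype(np.float32)  # placeholder for right-encoder outputs

# build the index once, offline
index = hnswlib.Index(space="ip", dim=DIM)           # inner-product "distance"
index.init_index(max_elements=NUM_STICKERS, ef_construction=200, M=16)
index.add_items(sticker_vecs, np.arange(NUM_STICKERS))
index.set_ef(50)                                     # search-time accuracy/speed trade-off

# at request time, encode the text and query the index
text_vec = np.random.rand(1, DIM).astype(np.float32)  # placeholder for left-encoder output
labels, _ = index.knn_query(text_vec, k=10)            # top-10 sticker candidates
```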

Making answers diverse


The next stage is diversification of the suggested stickers, since the top-scoring stickers are often all alike.

Three suggested stickers for the phrase "Hello": without diversification and with diversification.

Several diversification options were tested. The simplest is to take the top-N stickers subject to a score threshold, keep the top sticker, and pick the other two so that they are as far from each other as possible, while limiting how much they may differ from the top sticker. This approach can be combined with a manual labeling of stickers by emotional tone (positive, negative, neutral), choosing stickers with different tones from the top N when available.
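A sketch of this simple rule is below; the thresholds, the Euclidean distance, and the exact form of the "not too far from the top sticker" constraint are illustrative assumptions, one possible reading of the description above:

```python
import numpy as np

def diversify(candidate_ids, candidate_vecs, scores, max_dist_to_top=2.0):
    """Pick 3 stickers: the top one plus two that are far from each other.

    Inputs are the top-N candidates already sorted by descending score.
    """
    top_vec = candidate_vecs[0]
    # keep extra candidates that are not too far from the top sticker
    allowed = [i for i in range(1, len(candidate_ids))
               if np.linalg.norm(candidate_vecs[i] - top_vec) <= max_dist_to_top]
    # among them, find the pair with the maximum mutual distance
    best_pair, best_dist = None, -1.0
    for k, a in enumerate(allowed):
        for b in allowed[k + 1:]:
            d = np.linalg.norm(candidate_vecs[a] - candidate_vecs[b])
            if d > best_dist:
                best_pair, best_dist = (a, b), d
    chosen = [0] + (list(best_pair) if best_pair else allowed[:2])
    return [candidate_ids[i] for i in chosen]
```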

Another option is to cluster stickers by their embeddings and return at most one sticker per cluster. For clustering, the UMAP + HDBSCAN combination was used. UMAP is a recent and effective dimensionality-reduction algorithm that outperforms the well-established t-SNE. The embeddings were reduced to two dimensions, and then the HDBSCAN clustering algorithm was applied; about 100 clusters were identified. The task is not fully solvable automatically: with various settings it was possible to cluster up to 70% of the stickers, after which manual revision and verification are still required. So we settled on the simpler options described above, since they gave good results.


Clustering of stickers by their embeddings.
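A sketch of such a clustering pipeline with the umap-learn and hdbscan libraries; the hyperparameters here are illustrative, not the ones actually used:

```python
import numpy as np
import umap          # umap-learn
import hdbscan

sticker_vecs = np.random.rand(4000, 300).astype(np.float32)   # placeholder sticker embeddings

# reduce the embeddings to 2D, then cluster the 2D points
emb_2d = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(sticker_vecs)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(emb_2d)  # -1 marks noise points

# at serving time, keep at most one sticker per cluster label
```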

Results


As a result, we got a simple and effective smart reply model for stickers that shows very good answer quality. In tests on 1000 different phrases, from simple to fairly complex, respondents rated the top sticker as a fully suitable answer in more than 75% of cases. On a test of 100 simpler and more popular phrases, the result is even better: the top sticker was rated fully suitable in 93% of cases.

Examples of reply stickers suggested by the model.

What are the disadvantages?

Because of the imbalance in the training dataset, some words have an outsized influence. For example, the word "love" in certain contexts often triggers various "romantic" stickers, since the training data is biased in that direction. Introducing extra weights for frequent words and stickers, as well as phrase augmentation, did not fully solve the problem, but it partially improved the situation.

We are not resting on our laurels: work on improving the quality of our models continues, with experiments on modifying the architecture and on the way data is prepared and used.

Literature


  1. Google updates ML Kit with Smart Reply API for third-party Android and iOS apps. 9to5google.com/2019/04/05/ml-kit-smart-reply
  2. Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient Natural Language Response Suggestion for Smart Reply. arXiv:1705.00652v1, 2017.
  3. Pranav Khaitan. Chat Smarter with Allo. ai.googleblog.com/2016/05/chat-smarter-with-allo.html
  4. Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv:1708.00524v2, 2017.
