VKCup 2020 Stage I. A Long Way



Today we will look at the VKCup 2020 competition and the task of its first qualifying stage. The folks from the Singer House (VKontakte's headquarters) promised that every stage would be difficult but interesting. And what more, really, could one ask for?

As you know, the striving for excellence, for honing one's skills to an edge worthy of Damascus steel, is inherent in every specialist in any field. Examples abound: from painters' colorful canvases, each of which consumed a part of a life (the best part of it, I would even say) that could have been idled away at social events, to keygens, those software activators, each of which packs 8-bit music, an activation algorithm, and animation, carefully and lovingly tamped into a file of only a few tens of kilobytes. Art, whichever way you look at it.

And no, this is not yet another article about self-development, honestly. Here is my point: in each of us lives the desire to solve non-trivial problems in our field, and moreover, to solve them better than most of those around us. It's only natural. And Machine Learning competitions, it seems to me, are a great outlet for this. Sometimes the spirit of competition needs to be given free rein.

We will consider VKCup 2020, VKontakte's data-analysis championship, namely its three stages, with the final in St. Petersburg. Let's start with the first qualifier. Time is short, and Petersburg, cultural capital though it may be, does not forgive lateness or sloppiness. So let us deign to attend to our affairs without delay, sir; a cabman won't be kept waiting, and the way is long.



I think many of you have already encountered NLP tasks. Here, the task of the first qualifying stage, which selected the 256 people who advanced to the second, was the analysis of questions from the Clover quiz game, namely predicting which questions were proposed by professional editors and which by ordinary users.

The dataset is about 40,000 rows, of which 10,000 form the test set used to evaluate the model's quality, and the remaining 30,000 form the training set. The evaluation metric is ROC AUC, the area under the ROC curve.
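For reference, this metric is standard and available in scikit-learn; a toy illustration:

```python
from sklearn.metrics import roc_auc_score

# Toy example: 1 = question written by an editor, 0 = by a user.
y_true = [1, 0, 1, 1, 0, 0]
# Predicted probabilities that each question is editorial.
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]

# Every editorial question is ranked above every user question,
# so the area under the ROC curve is the maximum possible, 1.0.
print(roc_auc_score(y_true, y_score))
```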

The data looked as follows,



where ID is the question identifier, Answer is the binary target to predict (1 means the question was proposed by an editor, 0 by a user), and Question is the question text itself.

It might seem that the task is quite simple, but that is far from the case. Standard approaches such as Word2Vec, BERT, and other fashionable NLP models yield relatively low quality, around 0.7, which would not even get you into the top hundred. So, remembering that the devil, like LEGO figures, lives in the details, let's examine the dataset more closely and build a solution that is noticeably more effective.

To begin, during the initial EDA you can notice that users often use atypical characters, write words in ALL CAPS, sometimes fail to capitalize the first word of a sentence, forget question marks, and use odd quotation marks. They also copy editors' questions and pass them off as their own. And there are other interesting quirks.
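A minimal sketch of the kind of check that surfaces such patterns, assuming the data has been loaded into a pandas DataFrame with the Question and Answer columns described above (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file name

# Share of questions ending with a question mark, split by author type.
df["ends_with_qmark"] = df["Question"].str.strip().str.endswith("?")
print(df.groupby("Answer")["ends_with_qmark"].mean())

# Share of questions containing straight or unusual quote characters.
df["has_quotes"] = df["Question"].str.contains(r'["«»„“”]')
print(df.groupby("Answer")["has_quotes"].mean())
```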

Let's start with the baseline. Using the TF-IDF (Term Frequency - Inverse Document Frequency) measure, we obtain a sparse matrix, which we then feed into gradient boosting.
Just one definition: TF-IDF is a statistical measure used to assess the importance of a word in the context of a document that is part of a collection of documents, or corpus. A word's weight is proportional to how frequently the word is used in the document and inversely proportional to how frequently it is used across all documents of the collection.
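In the standard formulation (library implementations differ in smoothing and normalization details), the weight of a term $t$ in a document $d$ from a corpus $D$ is

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{\, d' \in D : t \in d' \,\}|},$$

where $\mathrm{tf}(t, d)$ counts the occurrences of $t$ in $d$ and the denominator counts how many documents contain $t$ at all.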

To reduce overfitting, we will cap the maximum number of tokens and compose the initial feature matrix as a combination of two smaller ones: one part is a TF-IDF matrix whose tokens are character sequences separated by spaces, in effect words, and the other part is a matrix whose tokens are character sequences of a fixed length, i.e. character n-grams.
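A minimal sketch of this baseline, assuming scikit-learn vectorizers and LightGBM as the boosting implementation (the article does not name a specific library, and the token limits here are illustrative):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from lightgbm import LGBMClassifier

df = pd.read_csv("train.csv")  # hypothetical path
questions, answers = df["Question"], df["Answer"]

# Part one: word-level tokens, i.e. character sequences separated by spaces.
word_tfidf = TfidfVectorizer(analyzer="word", max_features=30_000)
# Part two: character n-grams of a fixed length range.
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=30_000)

X_word = word_tfidf.fit_transform(questions)
X_char = char_tfidf.fit_transform(questions)

# Stack the two sparse matrices side by side and feed them into boosting.
X = hstack([X_word, X_char]).tocsr()
model = LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X, answers)
```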



It looks pretty good; however, it can and should be better. This is the first part of the pipeline. Here "P" stands for "Public", the score the model achieved on the open part of the test sample, and "F" stands for "Final", the result on the hidden part of the data, published after the end of the competition.

Next, we'll add some data to the dataset, which the rules, generally speaking, do not prohibit. And what is not prohibited is, accordingly, permitted. On the Internet you can find a number of datasets with quiz questions; we download them and label them as user questions.
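A sketch of what such augmentation could look like; the external file and its column name are hypothetical:

```python
import pandas as pd

train = pd.read_csv("train.csv")                       # hypothetical path
external = pd.read_csv("external_quiz_questions.csv")  # hypothetical scraped dataset

# Label every external question as user-written (Answer = 0) and append it.
extra = pd.DataFrame({"Question": external["question"], "Answer": 0})
train = pd.concat([train[["Question", "Answer"]], extra], ignore_index=True)
```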



This idea is a double-edged sword, and the questions themselves a bone of contention. If you take questions from some specialist trivia book, the model will experience dissonance: on the one hand, their writing style is close to editorial; on the other, cunning users have long been spamming them into the private messages of the Clover game's public page, passing them off as their own. Therefore we will not use them; we will spare the model a kind of bipolar disorder.

As I said before, users are somewhat careless in formatting their questions, so we will capture this with good old regular expressions. We will write regular expressions that check for the presence of a question mark, identify different types of quotation marks, and search for capital letters, and we will train a second model on these features. Moreover, the previous model's predictions will serve as an additional feature. This will help correct the predictions of the previous model.
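A sketch of such features; the exact patterns the author used are not given in the article, so these are illustrative:

```python
import re
import pandas as pd

def regex_features(q: str) -> dict:
    return {
        "has_qmark": int(q.rstrip().endswith("?")),
        "starts_upper": int(bool(re.match(r"[A-ZА-ЯЁ]", q))),
        "caps_words": len(re.findall(r"\b[A-ZА-ЯЁ]{2,}\b", q)),  # words in ALL CAPS
        "odd_quotes": int(bool(re.search(r'["„“”]', q))),         # non-«» quote types
        "odd_chars": len(re.findall(r"[^\w\s?,.!«»:;()-]", q)),   # atypical symbols
    }

# Toy questions; in practice this runs over the whole Question column.
questions = ["Какой город является столицей Австралии?",
             'что тяжелее КГ ваты или "кг" железа']
features = pd.DataFrame([regex_features(q) for q in questions])

# The previous model's prediction becomes one more column for the second level.
features["level1_pred"] = [0.92, 0.11]  # toy values; really the TF-IDF model output
print(features)
```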

Thus, a two-level model was implemented, where the results of boosting on the sparse TF-IDF matrix were adjusted by boosting on tabular features obtained with regular expressions.
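Put together, the two-level scheme could look like the following sketch, reusing X and answers from the baseline sketch and assuming features has been computed over the full training set; out-of-fold predictions keep the second level from seeing the first level's training labels, and LightGBM remains an assumption:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_predict

# Level 1: boosting on the sparse TF-IDF matrix X.
level1 = LGBMClassifier(n_estimators=500, learning_rate=0.05)
oof = cross_val_predict(level1, X, answers, cv=5, method="predict_proba")[:, 1]

# Level 2: boosting on the tabular regex features plus the level-1 prediction.
X2 = features.copy()
X2["level1_pred"] = oof
level2 = LGBMClassifier(n_estimators=200, learning_rate=0.05)
level2.fit(X2, answers)
```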



This was enough for 5th place out of 318 on the private leaderboard. Some things could have been refined and optimized, but time was running out. One way or another, it is enough to move on, and excessive perfectionism only breeds anxiety. In any case, see you later; love ML, stay calm, and stay tuned. It only gets more interesting from here...

→ The code in the GitHub repository
