VK Cup Competition, ML Track: 4th Place. How?

[Cover image: behavior of the metric's log-ratio component over the range [0, 1]]

In this contest, held as part of the VK Cup 2020 qualifying round (ML track), the task was to predict what share of a potential audience would see ads shown on several advertising platforms a given number of times (1, 2 or 3) in the future.

This was not a classic competition where you submit final predictions for a known test set: instead, predictions were made on completely unknown data fed to a model packaged in Docker and run on the competition site.

In general, such a setup levels the playing field: it stops those who like to peek at the test set, enrich the training data with it, and fit the model to the test distribution. Everyone was on an equal footing, since nobody knew what might be in the data: "junk" records, sporadic outliers, invalid delimiters, and so on. At the same time, all these nuances force you to think about exception handling.

In this contest I took an unassuming 4th place, and I want to tell how it came about.

Data Overview


The initial data were presented in the following form:

  1. users.tsv — reference data on the users themselves: a set of socio-demographic attributes for each user.
  2. history.tsv — the history of ad impressions for user–site pairs: at what hour and at what price (cpm) a user saw an ad on a given platform.
  3. validate.tsv — a validation file for training the model: for each ad it contained the time interval, the price, and the audience it was shown to (the platforms and the users). Users and sites were given as strings of the form (1,5,7,3,14,6).
  4. validate_answers.tsv — an answer file with three columns: what share of the audience (values from 0 to 1) saw the ad 1, 2 and 3 times. Accordingly, these sequences are non-increasing.

The goal of the competition: for new data from the future (in the format of the validate.tsv file), predict three sets of values — what share of the audience will see an ad 1, 2 and 3 times.

More details about the task are available on the competition website.

Predictors


The final predictors I used fall into two groups:

  • predictors based on the history and its comparison with the new data
  • predictors built only on the data from the future

For the first group, basic statistics were computed from the history file for grouped user–site pairs and then aggregated up to the audience–site pairs in the validation and test files; a sketch of this aggregation follows below. Predictor selection then proceeded in several ways: by the frequency with which predictors were used in the splits and in the model itself, and by validation, both top-down and bottom-up. Despite the different selection schemes, they all converged to roughly the same set of predictors; in the end there were seven of them.
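A minimal sketch of that history aggregation, assuming hypothetical column names (user_id, publisher, hour, cpm) since the exact schema is not reproduced here:

```r
library(dplyr)

# Step 1: basic statistics per user-site pair from the history file
history <- read.delim("history.tsv")          # columns assumed: user_id, publisher, hour, cpm
pair_stats <- history %>%
  group_by(user_id, publisher) %>%
  summarise(n_shows   = n(),                  # how often the pair appears in history
            mean_cpm  = mean(cpm),
            mean_hour = mean(hour),
            .groups = "drop")

# Step 2: aggregate pair statistics up to one row of validate.tsv,
# whose users/publishers come as strings like "(1,5,7,3,14,6)"
parse_ids <- function(s) as.integer(strsplit(gsub("[()]", "", s), ",")[[1]])

aggregate_row <- function(user_str, pub_str) {
  pairs <- pair_stats %>%
    filter(user_id %in% parse_ids(user_str),
           publisher %in% parse_ids(pub_str))
  c(sum_shows     = sum(pairs$n_shows),
    mean_cpm      = mean(pairs$mean_cpm),
    n_known_pairs = nrow(pairs))
}
```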

The predictors of the second group (surprisingly, there were also seven of them) are much easier to interpret:

1. Delta — the time difference, i.e., the length of the interval during which the ad is shown. Is it logical? It is: the longer the interval, the more likely the views. Of course, there is no direct dependence, but physically it should be that way; moreover, taken on its own, this is one of the strongest predictors.

2. Delta2 — the same time difference, but converted to days (i.e., integer division by 24). This turns a linear dependence into a piecewise one. The idea is simple: we stop distinguishing individual hours, while very long intervals (whole days) set their own trend.

3. cpm — the price itself; similarly, the higher the price, the more likely a view. Again, of course, there is no direct dependence, but in interplay with the history-based predictors the dependence is clearly visible.

4–7. The sin and cos of the start and end times of the ad, with time mapped onto the 24-hour clock. Unlike linear time, these functions take into account intervals that cross over midnight. Adding these predictors immediately gave an improvement of 1.5 percentage points. (A sketch of these features is given right after this list.)
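A minimal sketch of the second group of predictors, assuming the hypothetical column names hour_start and hour_end for the display interval (in hours):

```r
library(dplyr)

# Second group of predictors, computed per row of validate.tsv / the test stream.
# hour_start, hour_end and cpm are assumed column names.
make_time_features <- function(df) {
  df %>%
    mutate(
      delta  = hour_end - hour_start,          # interval length in hours
      delta2 = delta %/% 24,                   # the same interval in whole days
      # map the hour of day onto the 24-hour circle so that intervals
      # crossing midnight are handled naturally
      sin_start = sin(2 * pi * (hour_start %% 24) / 24),
      cos_start = cos(2 * pi * (hour_start %% 24) / 24),
      sin_end   = sin(2 * pi * (hour_end %% 24) / 24),
      cos_end   = cos(2 * pi * (hour_end %% 24) / 24)
    )
}
```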

Metric and response


The organizers used the Smoothed Mean Log Accuracy Ratio metric (hereinafter SMLAR).

[Image: formula of the SMLAR metric]

Here the raw response is the share of the audience that viewed the ad 1, 2, 3 times, i.e., values in the range [0, 1].

By the way, the cover image shows the behavior of this metric — or rather, not the whole metric, but its core part (the MAE over the logarithms of the offset prediction-to-truth ratios) for all combinations of prediction and true value over the entire range [0, 1].

If you look carefully at the metric's formula: on the one hand, it roughly corresponds to the geometric mean of the ratios between the (offset) predictions and true values, which is clearly better than an arithmetic-mean metric (it yields a lower final score). On the other hand, if we drop the outer exponential, which at small values behaves almost like its argument, the metric turns into an MAE on the logarithm of the offset response. So, to build an ideologically correct model, one should either use the offset response together with a loss function that contains a logarithm explicitly, or, conversely, first take the logarithm of the offset response and use a linear loss (MAE, eps). But given my model (in which the loss function cannot be specified explicitly), I simply chose the optimal response transformation based on validation results.
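The exact formula is in the image above; as a rough reconstruction from the verbal description (the exponent of the mean absolute log-ratio of offset prediction and truth, with the 0.005 offset for shares), the metric could look something like the sketch below. Treat the constants and the final "− 1" as assumptions, not the official definition:

```r
# Hedged reconstruction of SMLAR from its description:
# exp( mean( | log((pred + eps) / (true + eps)) | ) ) - 1, with eps = 0.005.
# The official formula is in the organizers' statement; this is only a sketch.
smlar <- function(pred, true, eps = 0.005) {
  exp(mean(abs(log((pred + eps) / (true + eps))))) - 1
}
```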

I considered the following response options: the original shares; the logarithm of the shares; the transition to the absolute number of users and its logarithm with different offsets (here I tried to use one unified offset when moving to absolute values: the offset 0.005 is specified for the shares, while the audience sizes varied from 300 to 2500, so the offset should lie somewhere between 1 and 12, but I only checked the values 1 and 10); and the square root of the absolute number of people who viewed the ad.

[Image: comparison of the two models — sorted true and predicted responses (top), per-observation errors (middle), and the cumulative metric (bottom)]

The picture above compares two models trained on different responses: the original audience shares and the absolute number of people.

The top chart shows the sorted true response values (the shares for the first view) and the predictions of both models. It is immediately clear that most response values are quite small: the median is about 5%, and that is only for the first view (for the second view the median is already below 1%, and for the third it is almost 0% — and with this metric, small values and errors on them are very painful). The chart also shows clearly that the model on absolute values is qualitatively better: the spread of its estimates is minimal, and although the deviations at small values are barely visible on the plot, it is exactly the errors at these small values that most strongly affect the final score. The same is visible on the cover image: a very sharp curvature at small values, especially close to zero.

The middle chart shows the error of each sorted prediction; large errors at small values, decreasing as the response grows, are clearly visible.

The bottom chart plots the target metric itself, accumulated as a running total over all sorted values. What are the conclusions? First, the chosen response strongly affects the model's results (more on that below). Second, pay special attention to small values, especially those close to zero: models will not always predict a clean zero, so corrections are needed. Errors at large values matter much less: first, they are relatively small, and second, the relative error at large values is small and makes a minimal total contribution to the metric.

After numerous experiments, the winner by a clear margin was the square root of the absolute number of users. On individual targets (1, 2, 3 views) models with a logarithm of the absolute values sometimes won as well — this is due to the clear prevalence of zeros in the responses, so a logarithm with some offset did better there. But on average, a simple root without any offset gave good, stable results, so I chose not to complicate the solution and settled on a single unified transform: just the square root of the number of people.

Why does switching to people improve the result so much compared to shares (almost 2×)?

Apparently, the point is that by switching to people — multiplying the share by the audience size, which is the same as dividing all the predictors by that audience — we move into a dimension "per one person". Given that my model is essentially a regression, the final estimate becomes a kind of weighted probability estimate with respect to each predictor. It is possible that if only part of the predictors had been normalized by the audience (for example, the sums over all pairs from the first group), this normalization would have brought the dimensions of all predictors into a single reference frame (per person), and the final regression response would have been nothing more than a weighted sum of each predictor's contribution (characterizing one person) to the overall probability of a view — perhaps the result would have been even better. But at the time of the contest I did not approach it from this side and worked exclusively with the transformed response. A small sketch of the chosen transform is given below.
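A minimal sketch of the chosen response transform and its inverse, assuming an audience_size column (the exact name is an assumption):

```r
# Forward transform: from a share in [0, 1] to sqrt(number of people).
# audience_size is an assumed column holding the audience of each row.
to_model_response <- function(share, audience_size) sqrt(share * audience_size)

# Inverse transform: from a model prediction back to a share.
to_share <- function(pred, audience_size) pmax(pred, 0)^2 / audience_size
```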

Model


In fact, this section should have come earlier, because it was this model that drove the choice of response type and of the predictors used with it (the model was tuned to the data), and one way or another an acceptable result of about 15% could be reached with different predictor sets. But I wanted the choice of specific predictors to have at least some justification on average, so predictor combinations were selected on validation.

I used a model from the family of regression model trees, namely Cubist (a model from 1992!) and its implementation in the R package of the same name. More precisely, the final result is the geometric mean of two sets of models, each consisting of 3 separate models arranged in a cascade: the prediction of the first model (for 1 view) was used as a predictor for the second and third models, and the final prediction for the second view as a predictor for the third model. The two cascades differed slightly in predictors and intermediate adjustments, and their geometric mean was used out of common sense (well, and validation, together with the public leaderboard). The idea is simple: as I wrote above, special attention goes to zero and near-zero predictions, and the geometric mean does exactly that — it zeroes out the combined prediction if one of the two is already zero (and this is logical: if one of the models showed zero, let it stay zero rather than pulling the prediction away from zero).

Thanks to the cascade, the model indirectly "understood" (being a family of regressions) that each subsequent response clings to the previously predicted one, while the remaining predictors only correct it — and it should be no larger than the previous one. I also tried three independent models predicting each response separately; the result was weaker because of the abundance of zeros in the second and third views: a family of regressions cannot quite reach 0 on its own, but when we add a "guide" — the previous estimate, which is already 0 or close to it — the regressions land in the neighborhood of that value and only adjust the response for the second and third view. A sketch of such a cascade is shown below.
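A minimal sketch of one such cascade with the Cubist package, assuming X is a data frame of the predictors described above and y1, y2, y3 are the transformed responses (square roots of people) for 1, 2 and 3 views; the blend of two cascades is shown at the end:

```r
library(Cubist)

# One cascade: the prediction for 1 view feeds the model for 2 views,
# and the prediction for 2 views feeds the model for 3 views.
fit_cascade <- function(X, y1, y2, y3) {
  m1 <- cubist(x = X, y = y1)
  p1 <- predict(m1, X)
  m2 <- cubist(x = cbind(X, p1 = p1), y = y2)
  p2 <- predict(m2, cbind(X, p1 = p1))
  m3 <- cubist(x = cbind(X, p1 = p1, p2 = p2), y = y3)
  list(m1 = m1, m2 = m2, m3 = m3)
}

predict_cascade <- function(models, X_new) {
  p1 <- predict(models$m1, X_new)
  p2 <- predict(models$m2, cbind(X_new, p1 = p1))
  p3 <- predict(models$m3, cbind(X_new, p1 = p1, p2 = p2))
  cbind(p1, p2, p3)
}

# Final prediction: geometric mean of two slightly different cascades,
# which keeps a zero if either cascade already predicts zero.
blend <- function(pred_a, pred_b) sqrt(pmax(pred_a, 0) * pmax(pred_b, 0))
```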

What is good about this model?

When I saw the task I immediately remembered this model, because in one of my previous competitions, on a comparable problem (linear relations plus corrections to them), it was also among the best. And here the data are fairly linear; there is an obvious relation between the numbers of views (the second is no larger than the first, the third no larger than the second); there is little data — only 1008 observations; there is a small number of predictors and, presumably, some piecewise-linear dependencies. In addition, the model is very fast — training took a few seconds — so it was convenient to test many hypotheses with it. And it has essentially no hyperparameters to overfit on (apart from neighbors, a parameter that corrects the forecast using nearby training points, which I did not use).

How does this model form a prediction for a single tree?
In brief: the algorithm splits the data into rule-based partitions — up to 100 rules (in practice around 10–20 were built here) — and fits its own linear regression inside each rule; for a new observation, the rules whose conditions it satisfies are found, their linear models are evaluated, and the prediction is their (smoothed) combination.
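For illustration, a minimal usage sketch of the R Cubist package (X, y and X_new are placeholders for the predictors, the transformed response and new data):

```r
library(Cubist)

m <- cubist(x = X, y = y)             # fast to fit; rules with linear models inside
summary(m)                            # prints the rules and their regression equations
pred <- predict(m, newdata = X_new)   # plain rule-based prediction
# predict(m, newdata = X_new, neighbors = 5) would add the instance-based
# correction mentioned above; it was not used in the final solution.
```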


Adjustments


In addition, small adjustments of the predictions were used. When converting from the absolute number of people back to shares, very small values (slightly above 0) and values above 1 could potentially appear. For values above 1 the adjustment hardly mattered (there were probably few such overshoots, and if any, not significant ones), but for small values it was fairly critical. The reasoning was: if I predict, say, 1 person (or 0.5 — I did no rounding), then with a maximum audience of 2500 (completely unknown for the test data, but that is what the known training data show) this corresponds to 0.0004 (incidentally, 0.0004 is also the minimum value in the train), which means that somewhere around this value lower predictions should be turned into 0. And since my models are built in a chain, the next model and its predictions depend on that predicted zero, so this mattered quite a lot.

Selecting the threshold on validation did not make much sense (the model adjusts to that data anyway, and I know its distribution), so I peeked at the public leaderboard for a few candidate values; in the end, for one of the models I kept a nice round threshold of 0.0005, and for the other the theoretical 0.0004.

The adjustment from above was simpler: values greater than 0.95 were clipped to 0.95. The 0.95 was chosen from the maximum share in the known data with a large margin (the maximum in the train is 0.93); this adjustment had practically no effect on the public (apparently there are only isolated overshoots in the public part) and was left purely as insurance for the private. Another correction concerned zeros: if the prediction for the first view is zero, then, regardless of what the second- and third-view models say, their predictions are also set to 0. This changed little — somewhere in the second decimal place (the model almost always did this by itself: each value no greater than the previous one and tending to zero) — but it was also kept as insurance for the private. A sketch of these adjustments is given below.
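A minimal sketch of these adjustments on the share scale; the thresholds come from the text above, the argument names are placeholders, and the final non-increasing clamp is an optional extra (the cascade usually ensures it by itself):

```r
# Post-processing of share predictions (p1, p2, p3 are shares for 1, 2, 3 views).
adjust_predictions <- function(p1, p2, p3,
                               low  = 0.0004,   # ~1 person out of a 2500 audience
                               high = 0.95) {   # max share in train is 0.93
  clip <- function(p) {
    p[p < low] <- 0        # turn tiny positive predictions into a clean zero
    pmin(p, high)          # cap overshoots above the maximum seen in train
  }
  p1 <- clip(p1); p2 <- clip(p2); p3 <- clip(p3)

  # zero propagation: if nobody is predicted to see the ad once,
  # nobody can see it twice or three times
  p2[p1 == 0] <- 0
  p3[p1 == 0] <- 0

  # optionally keep the non-increasing property (1 view >= 2 views >= 3 views)
  p2 <- pmin(p2, p1)
  p3 <- pmin(p3, p2)
  list(p1 = p1, p2 = p2, p3 = p3)
}
```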

Results


The results depended strongly on the response type and the selected predictors. For example, predicting shares — or, a bit better, their logarithm — with an appropriately re-selected set of predictors gave about 16%, while switching to absolute values and re-selecting predictors again started off at about 15%, so that became my baseline.

Incidentally, those results were already enough to stay in the top five, but it was interesting to push further.

So what dramatically improved on those 15%?

Simply adding the hours — just the raw start and end times — immediately gave 13.97%; replacing them with sine-cosine features improved it to 13.44%; and the further improvement to 13.25% came from rounding small values to zero and taking the geometric mean of the two model sets, which was already more of a tuning to the test (public) — and because of it I did overfit the public a little.

In this competition you had to choose a single submission as final. Looking at my account now, I see that the chosen solution turned out to be nearly my best on the private leaderboard as well (my place did not change; my best private score is better by less than 0.02 percentage points). Submissions without that rounding of the answer were slightly worse on private — 13.6% — so there was no serious overfitting to the public, but this post-tuning did not play a big role either.

To sum up, the main drivers of the result were: predictors selected for the chosen response of the Cubist model, the cascade of models (1 → 2 → 3), and the temporal predictors (sin, cos).

Conclusion


Although the winners of the top five used various models, including modern ones (1st place — SVR, 2nd place — CatBoost, 3rd place — a neural net, 5th place — LightGBM, though with much more complex predictors), I took 4th place with one of the oldest, classic models from 1992 (even the ideas behind SVR appeared later), on fairly simple and obvious predictors. This once again confirms that it is not always enough to just run a model on auto-generated predictors (such approaches ended up much lower in the final standings, around 20%); the common sense behind the predictors, the transformation of the response and the choice of the loss function (where there is one) play a significant role.

All in all, the competition turned out interesting and creative, with useful takeaways.

I hope the task at the final (on-site) stage of the competition will be no less interesting.
