New models of data search and analysis. WSDM 2020 through the eyes of the Yandex.Toloka team

International scientific conferences help you track industry trends, learn about cutting-edge work by leading companies and universities, and showcase your own results. Of course, this only applies to times when the world is not plunged into the abyss of a pandemic.

Before countries went into lockdown, the Yandex.Toloka team managed to attend the WSDM conference (pronounced "wisdom") to run a crowdsourcing tutorial, present our paper, and talk with colleagues.

My name is Alexei Drutsa, and I head the department for crowdsourcing efficiency and development and platform management at Yandex. I do theoretical and applied research in areas related to discrete algorithms, auction theory, machine learning, data analysis, and computational mathematics. Over my career I have published more than 20 scientific papers, including at the NIPS, KDD, WWW, WSDM, SIGIR, and CIKM conferences. In this post I will share my impressions of WSDM and give a short overview of the most interesting talks.


Conference poster

What kind of conference?


WSDM is one of the major research conferences on data mining and analysis. This year's edition, the thirteenth, was held February 3-7 in Houston, Texas.

Some statistics: the conference was attended by about 700 people. Authors submitted 615 scientific papers for presentation, of which the organizers accepted 91, including our work on collecting crowdsourced data. Of the 20 tutorial proposals, the WSDM organizers accepted 9, including the one from Yandex.

The core of the conference was the poster session. At scientific events like this it is the main way to present work: authors of accepted papers prepare posters with comprehensive information about their study and answer questions from interested colleagues (more about the format). In addition to the poster session, participants could present their results in three formats:

  • a 5-minute talk on the work (46 participants received this opportunity);
  • a 60-second lightning talk summarizing the gist of the work (offered to 45 participants);
  • a demo showing a tool in action.

Among the works published at the conference was a paper from our team. It is also about crowdsourcing, but it deals with another source of crowdsourced data: labels collected through CAPTCHAs.


Poster of our article

Collecting labels through CAPTCHAs is a long-known method used by many companies. It works like this: suspicious users are asked to type the text from two images. The first image is a control image for which we already know the correct answer. The second contains text unknown to us, and we want the user to transcribe it. If the person correctly enters the text from the first, control image, we consider them reliable enough and record their answer for the second.

This is a very convenient, scalable, and free way to collect labels. But there is a problem: CAPTCHAs are usually shown to suspicious users, some of whom are bots. When such bots transcribe the images, we often get similar, consistent errors. People, unlike bots, rarely make identical mistakes.

Companies using this labeling method typically treat the answer given by the majority of users as correct. But given the high probability of bots making identical errors, this scheme can produce incorrect data.
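To see why identical bot errors break majority voting, here is a minimal Python sketch with made-up answers (the function and data are illustrative, not taken from our paper):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent transcription for one unknown image."""
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# Hypothetical answers for one unknown CAPTCHA image.
# Three bots (say, the same OCR engine) make the identical mistake "c1oud",
# while humans make varied, uncorrelated errors around the true word "cloud".
answers = ["c1oud", "c1oud", "c1oud", "cloud", "cloud", "clouds"]
print(majority_vote(answers))  # → "c1oud": the bot error wins the vote
```

The humans collectively outnumber the bots, but their answers are split across variants, so the coordinated wrong answer ends up on top.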

We trained an ML model that predicts, from features of the CAPTCHA input, which answer is most likely to be correct. The full text of the paper is available here.

What about the tutorial?


On the very first day of the conference we ran a hands-on tutorial built on Yandex.Toloka. My colleagues have already written about our service on Habré; a detailed description is available here. In short, Toloka is a crowdsourcing platform that helps get many kinds of tasks done. With Toloka you can transcribe audio recordings, run focus groups, moderate comments, or label images, using the resulting data for machine learning.

Ours was the only WSDM tutorial that ran for a full day.


Before the tutorial

We talked about how to solve problems with crowdsourcing. To label data efficiently with this way of organizing work, it is not enough to simply hand people a task: you need to decompose it properly, word it clearly, and set up supporting processes such as quality control. Some of the material we shared with conference participants is available in our published video course, which illustrates the basics of crowdsourcing with the example of segmenting objects in images.
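One common quality-control process is to mix control tasks with known answers into the stream and filter workers by their accuracy on them. Here is a minimal sketch of the idea; all names and data are hypothetical, not part of the Toloka API:

```python
def control_accuracy(answers, gold):
    """Accuracy on the control tasks this worker actually answered."""
    judged = [task for task in answers if task in gold]
    if not judged:
        return 0.0
    return sum(answers[task] == gold[task] for task in judged) / len(judged)

def trusted_workers(all_answers, gold, threshold=0.8):
    """Keep only workers whose control-task accuracy meets the threshold."""
    return {worker for worker, answers in all_answers.items()
            if control_accuracy(answers, gold) >= threshold}

# Hypothetical data: gold holds the known answers for control tasks,
# while "t9" is a regular task with no known answer.
gold = {"c1": "cat", "c2": "dog"}
all_answers = {
    "alice": {"c1": "cat", "c2": "dog", "t9": "fox"},  # 2/2 on controls
    "bob":   {"c1": "cow", "c2": "dog", "t9": "owl"},  # 1/2 on controls
}
print(trusted_workers(all_answers, gold))  # → {'alice'}
```

In a real project the answers of untrusted workers would be discarded or re-collected rather than merely flagged.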


Tutorial Program

For the conference we designed a special pipeline that included classification, data collection on the web, result verification, and side-by-side comparisons. It consisted of four stages. Tutorial participants played the role of owners of an online clothing store. They took a photo, selected a clothing item on it (for example, boots), and asked tolokers to find the most similar products in the store's database. These products were then ranked by similarity by other tolokers.


Pipeline stages
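The four stages above can be sketched as a chain of functions. Everything below is a schematic stand-in with made-up data, not the real Toloka API:

```python
# A schematic sketch of the four-stage tutorial pipeline.

def classify(image, region):
    """Stage 1: workers classify the selected clothing item."""
    return "boots"  # stand-in for an aggregated crowd answer

def collect_candidates(category, store_db):
    """Stage 2: workers search the store database for similar products."""
    return [p for p in store_db if p["category"] == category]

def verify(candidates):
    """Stage 3: a second group of workers accepts or rejects each match."""
    return [p for p in candidates if p["looks_similar"]]

def rank_side_by_side(candidates):
    """Stage 4: pairwise comparisons yield a similarity ranking."""
    return sorted(candidates, key=lambda p: p["similarity"], reverse=True)

store_db = [
    {"id": 1, "category": "boots", "looks_similar": True,  "similarity": 0.9},
    {"id": 2, "category": "boots", "looks_similar": False, "similarity": 0.4},
    {"id": 3, "category": "hats",  "looks_similar": True,  "similarity": 0.8},
]
category = classify("photo.jpg", region=(10, 10, 80, 80))
ranked = rank_side_by_side(verify(collect_candidates(category, store_db)))
print([p["id"] for p in ranked])  # → [1]: only the accepted similar boots remain
```

In the tutorial each of these steps was performed by real crowd workers; here the crowd answers are replaced by hard-coded fields so the data flow is visible.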

At the end of the day, once the results were in, all participants received feedback and practical tips to help make each project more effective.

For example, in the real world some of the steps in our pipeline could be automated via the API using the available data. But at the conference it was important for us to show how each stage can be handled with crowdsourcing, efficiently and at scale.


What else can be done to get better results and spend less money

Almost all tutorial participants completed it end to end, reaching the very last steps. They learned how to assemble datasets of similar products for an online store using crowdsourcing. The pipeline we covered in the tutorial is fairly universal: it can be used not only in e-commerce but in any industry where similar objects need to be suggested.

What did other companies talk about?



A full list of published works can be found on the conference website.

We noticed a large number of works on recommender systems, search, and e-commerce. In our view, most teams did not propose new scientific theory but presented the results of deploying particular technologies in a product. There were many talks on neural-network-based solutions, with the authors describing which libraries they used.

Here are a few posters that caught our attention, with comments:

• Crowd Worker Strategies in Relevance Judgment Tasks


Poster for Crowd Worker Strategies in Relevance Judgment Tasks

This work caught our attention because of its topic. The authors examine how crowdsourcing workers' experience affects their behavior: clicks within tasks, use of hotkeys, and completion time.


The difference in task completion time between more and less experienced workers

In their experiment, the authors found that after completing just two tasks on the crowdsourcing platform, less experienced workers reached execution speeds comparable to experienced ones.

The general conclusion: if there are mechanisms for controlling task quality, worker experience does not greatly affect the final quality of the data.

• Predicting Human Mobility via Attentive Convolutional Network


Poster for Predicting Human Mobility via Attentive Convolutional Network

This article is about predicting a user's route, that is, the point where they will be in the future. Most such prediction methods work with GPS coordinates, while the authors of this work focused on geotags in social networks.

The authors treat user trajectories as images and apply convolutional filters to them; sequential patterns in these images serve as signals. An attention mechanism is added to the network to account for long-term preferences.

The authors ran experiments on three datasets and concluded that their model outperforms existing GPS-coordinate-based models.

• Metrics, User Models, and Satisfaction

The authors studied how metrics describing the behavior of users of a search engine are related to their satisfaction.


Poster for Metrics, User Models, and Satisfaction

They confirmed that metrics whose user models reflect typical behavior also tend to correlate well with user satisfaction ratings.
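As a toy illustration of this kind of analysis (the numbers below are made up and are not from the paper), one can correlate a behavior-based metric with explicit per-session satisfaction ratings:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical per-session values of some click-model metric
# and the satisfaction ratings the same users reported.
metric_values = [0.2, 0.5, 0.4, 0.8, 0.9, 0.7]
satisfaction = [1, 2, 2, 4, 5, 4]
print(round(pearson(metric_values, satisfaction), 2))  # → 0.98
```

A metric whose user model matches real behavior should, as in this toy case, track the ratings closely.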

• Hierarchical User Profiling for E-commerce Recommender Systems


Poster for Hierarchical User Profiling for E-commerce Recommender Systems

The authors of the paper tackle recommendation at different levels of granularity.

Their proposed hierarchical user-profiling structure models users' multi-level interests with Pyramid Recurrent Neural Networks, which consist of a micro layer, an item layer, and several recurrent category layers.

What is the result?


This conference will be useful to specialists who work on improving search.

Before attending WSDM, or any other conference, we advise carefully studying the program and the accepted papers. This will help you avoid wandering bewildered between posters, workshops, and talks, and instead talk to the authors of the projects that interest you.

And remember that all the papers are available online, so you can study them yourself. That, by the way, is a great way to spend your free time.
