Pavel Klemenkov, NVIDIA: We are trying to narrow the gap between what a data scientist can do and what he needs to be able to do.

The second intake of students for Ozon Masters, the master's program in data science and business intelligence, has started. To make it easier to decide whether to apply and take the online test, we asked the program's teachers what to expect from the training and from working with data.

Pavel Klemenkov, Chief Data Scientist at NVIDIA and teacher of the Big Data and Data Engineering course, talked about why mathematicians should write code and study at Ozon Masters for two years.

- Are there many companies that use data science algorithms?

- Quite a lot, actually. Many large companies that have really big data are either starting to work with it efficiently or have been doing so for a long time. Admittedly, half of the market uses data that fits into an Excel spreadsheet or can be processed on one large server, but it would be wrong to say that only a few businesses can work with data.

- Tell me a little about projects that use data science.

- For example, while working at Rambler, we built an advertising system based on the principles of RTB (Real-Time Bidding). We needed to build many models that would optimize ad buying or, for example, predict the probability of a click, a conversion, and so on. At the same time, an advertising auction generates a lot of data: logs of site requests to potential ad buyers, ad impression logs, click logs. That adds up to tens of terabytes of data per day.

Moreover, for these tasks we observed an interesting phenomenon: the more data you feed into training, the higher the model's quality. Usually, beyond a certain amount of data the forecast quality stops improving, and to improve accuracy further you need a fundamentally different model, a different approach to preparing the data and features, and so on. Here we kept adding data and the quality kept growing.
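The effect described above is usually checked with a learning curve: train the same model on growing slices of the data and watch a held-out metric. Below is a minimal sketch of that measurement, not the Rambler system. The data is simulated, the segment names and click rates are made up, and the "model" is a deliberately simple smoothed click-rate estimate per segment, so unlike the case in the interview its quality will eventually level off.

```python
import math
import random


def simulate_clicks(n, ctr_by_segment, rng):
    """Draw n (segment, clicked) rows from made-up per-segment click rates."""
    segments = list(ctr_by_segment)
    rows = []
    for _ in range(n):
        segment = rng.choice(segments)
        clicked = 1 if rng.random() < ctr_by_segment[segment] else 0
        rows.append((segment, clicked))
    return rows


def fit_ctr(rows):
    """Estimate a smoothed click rate per segment: (clicks + 1) / (shows + 2)."""
    shows, clicks = {}, {}
    for segment, clicked in rows:
        shows[segment] = shows.get(segment, 0) + 1
        clicks[segment] = clicks.get(segment, 0) + clicked
    return {s: (clicks[s] + 1) / (shows[s] + 2) for s in shows}


def log_loss(model, rows, default=0.5):
    """Average negative log-likelihood on held-out rows (lower is better)."""
    total = 0.0
    for segment, clicked in rows:
        p = model.get(segment, default)  # unseen segment: fall back to 0.5
        total -= math.log(p if clicked else 1.0 - p)
    return total / len(rows)


def learning_curve(train_sizes, ctr_by_segment, holdout_size=20_000, seed=0):
    """Return [(train_size, holdout_log_loss), ...] for growing training sets."""
    rng = random.Random(seed)
    holdout = simulate_clicks(holdout_size, ctr_by_segment, rng)
    train = simulate_clicks(max(train_sizes), ctr_by_segment, rng)
    return [(n, log_loss(fit_ctr(train[:n]), holdout)) for n in train_sizes]


# Hypothetical setup: 200 audience segments with click rates between 1% and 30%.
_rng = random.Random(42)
ctr_by_segment = {f"segment_{i}": _rng.uniform(0.01, 0.30) for i in range(200)}

curve = learning_curve([200, 2_000, 20_000, 200_000], ctr_by_segment)
for n, loss in curve:
    print(f"{n:>7} training rows -> holdout log loss {loss:.4f}")
```

If the loss is still falling at the largest size you can afford, more data is likely to help. If it has flattened, that is the usual signal to change the model or the features instead.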

This is a typical case where analysts had to work with large data sets just to run an experiment, and could not get by with a small sample that fits on a comfortable MacBook. We also needed distributed models, because otherwise it was impossible to train them. As computer vision moves into production, such examples are becoming more common, since images are a large amount of data, and millions of images are needed to train a large model.

The questions immediately arise: how to store all this information, how to process it efficiently, how to use distributed learning algorithms. The focus shifts from pure mathematics towards engineering. Even if you do not write production code, you need to be able to work with engineering tools to run an experiment.

- How has the approach to data science vacancies changed in recent years?

- Big data has stopped being hype and become a reality. Hard drives are cheap enough that you can collect all the data there is, so that later there will be enough of it to test any hypothesis. As a result, knowledge of big data tools is in high demand, and more and more vacancies for data engineers are appearing.

In my understanding, the result of a data scientist's work is not an experiment but a product that has reached production. From this point of view, before the hype around big data the process was simpler: engineers did machine learning to solve specific problems, and there were no problems with bringing the algorithms to production.

- What does it take to remain a sought-after specialist?

- Many people have now come to data science who have studied mathematics and machine learning theory and taken part in data analysis competitions, where a ready-made infrastructure is provided: the data is cleaned, the metrics are defined, and there is no requirement for the solution to be reproducible and fast.

As a result, people come to work poorly prepared for the realities of business, and a gap forms between beginners and experienced developers.

As tools that let you assemble your own model from ready-made modules develop (Microsoft, Google and many others already have such solutions), along with machine learning automation, this gap will become even more pronounced. In the future, the profession will need serious researchers who come up with new algorithms, and employees with advanced engineering skills who will implement models and automate processes. The Ozon Masters data engineering course is aimed precisely at developing engineering skills and the ability to use distributed machine learning algorithms on big data. We are trying to narrow the gap between what a data scientist can do and what he should be able to do in practice.

- Why should a mathematician who already has a degree go to study at a business?

- The Russian data science community has come to understand that skill and experience convert into money very quickly. As soon as a specialist gains practical experience, their market value starts to grow very fast, and the most skilled people are very expensive. This is true at the current stage of the market's development.

Most of a data scientist's work is to dig into the data, understand what is in it, and consult with the people who are responsible for the business processes and generate this data. Only then is the data used to build models. To start working with big data, engineering skills are extremely important: they make it much easier to avoid the sharp corners, of which there are many in data science.

A typical story: you wrote a SQL query that runs on Hive on top of big data. The query takes ten minutes to process, in the worst case an hour or two, and often, when you get the output, you realize that you forgot to account for some factor or additional information. You have to resubmit the query and wait those minutes and hours again. If you are a genius of efficiency, you will pick up another task in the meantime, but, as practice shows, we have few geniuses of efficiency, and people just wait. Therefore, in the course we will devote a lot of time to working efficiently, so that from the start you write queries that run not for two hours but for a few minutes. This skill multiplies productivity, and with it the value of a specialist.
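The interview does not name specific techniques, so the following is only an illustrative assumption. One common reason a query runs for hours instead of minutes is that it scans the whole table rather than only the partitions it needs (Hive tables are often partitioned, for example by date). The sketch below imitates that difference in plain Python on a made-up event log: both functions return the same answer, but they read very different numbers of rows. The table layout, field names and dates are hypothetical.

```python
from collections import defaultdict


def build_partitioned_table(rows):
    """Group rows by their 'dt' value, mimicking a table partitioned by date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["dt"]].append(row)
    return dict(partitions)


def count_clicks_full_scan(partitions, dt):
    """Read every partition and filter afterwards; returns (clicks, rows_read)."""
    clicks = 0
    rows_read = 0
    for part_rows in partitions.values():
        for row in part_rows:
            rows_read += 1
            if row["dt"] == dt and row["event"] == "click":
                clicks += 1
    return clicks, rows_read


def count_clicks_pruned(partitions, dt):
    """Read only the partition for the requested date; returns (clicks, rows_read)."""
    part_rows = partitions.get(dt, [])
    clicks = sum(1 for row in part_rows if row["event"] == "click")
    return clicks, len(part_rows)


# Made-up log: 30 days, 1,000 events per day, every 10th event is a click.
rows = [
    {"dt": f"2020-06-{day:02d}", "event": "click" if i % 10 == 0 else "impression"}
    for day in range(1, 31)
    for i in range(1000)
]
table = build_partitioned_table(rows)

full = count_clicks_full_scan(table, "2020-06-15")
pruned = count_clicks_pruned(table, "2020-06-15")
print("full scan:", full)
print("pruned   :", pruned)
```

On a real cluster the same idea shows up as filtering on the partition column, so that the engine can skip the data it does not need.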

- How is Ozon Masters different from other courses?

- Ozon employees teach at Ozon Masters, and the assignments are based on real business cases that companies actually solve. Besides the lack of engineering skills, a person who learned data science at university has another problem: a business task is formulated in the language of business, and its goal is quite simple: to make more money. A mathematician knows well how to optimize mathematical metrics, but finding a metric that correlates with a business metric is difficult. You need to understand that you are solving a business problem and, together with the business, formulate metrics that can be optimized mathematically. This skill is acquired through real cases, and Ozon provides them.

And even setting the cases aside, many of the school's teachers are practitioners who solve business problems in real companies. As a result, the approach to teaching is more practical. At least in my course, I will try to shift the focus to how to use the tools, what approaches exist, and so on. Together with the students, we will come to understand that each task has its own tool, and each tool has its area of applicability.
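As a rough illustration of the earlier point about finding a mathematical metric that tracks a business metric, one simple check is to look at past experiments and see which offline metric moved together with the business result. This is a sketch under stated assumptions, not a method from the course: the metric names and all numbers below are made up, and a correlation over a handful of experiments is only a crude first signal.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_proxy_metric(offline_metrics, business_metric):
    """Pick the offline metric most strongly correlated with the business metric."""
    scores = {
        name: pearson(values, business_metric)
        for name, values in offline_metrics.items()
    }
    # Use the absolute value: a strong negative correlation is also informative.
    best_name = max(scores, key=lambda name: abs(scores[name]))
    return best_name, scores


# Hypothetical results of five past experiments (all numbers are invented).
revenue_uplift_pct = [0.5, 1.0, 1.5, 2.0, 2.5]
offline_metrics = {
    "auc": [0.70, 0.73, 0.74, 0.77, 0.78],
    "accuracy": [0.91, 0.90, 0.92, 0.90, 0.91],
}

best, scores = best_proxy_metric(offline_metrics, revenue_uplift_pct)
print("best proxy metric:", best)
for name, value in scores.items():
    print(f"{name}: correlation with revenue uplift = {value:.2f}")
```

In this invented data the first metric rises with revenue while the second barely moves, which is the kind of evidence that would justify optimizing the first one.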

- The best-known training program in data analysis is, of course, ShAD (the Yandex School of Data Analysis). How exactly does Ozon Masters differ from it?

- It is clear that ShAD and Ozon Masters, in addition to their educational function, each solve a local problem of training staff. Top ShAD graduates are recruited primarily to Yandex, but the catch is that Yandex, because of its specifics (it grew big at a time when there were not enough good tools for working with big data), has its own infrastructure and data tools, which means you will have to master them. Ozon Masters has a different message: if you have successfully completed the program and Ozon or one of the other 99% of companies invites you to work, it will be much easier to start benefiting the business. The skill set acquired at Ozon Masters will be enough to simply start working.

- The course lasts two years. Why does it take so much time?

- Good question. It is long because, in terms of content and the level of the teachers, it is a full-fledged master's program that takes a lot of time to master, including homework.

From the point of view of my course, expecting a student to spend 2-3 hours a week on assignments is normal. First, assignments run on the training cluster, and any shared cluster implies that several people use it at the same time. That is, you have to wait for your job to start running, and some resources may be taken away and handed over to a higher-priority queue. Second, any work with big data is time-consuming in itself.

