Podcast: what awaits beginner scientists in the field of machine learning

As promised, we are publishing a complete transcript of the first episode of our podcast (you can listen to it in Apple Podcasts). Our guest was Andrei Filchenkov, Ph.D. in Physics and Mathematics, associate professor of the Faculty of Information Technologies and Programming and head of the machine learning group at the International Computer Science Laboratory.

^{Photo by Samuel Bourke (Unsplash.com)}

Dmitry: Last year you received the Ilya Segalovich Prize, and you said that you work in different areas with no common vector besides machine learning. Please comment on this approach. Is such freedom decisive for you?

Andrei Filchenkov: When I came to work at ITMO University, our Laboratory of Computer Technologies, which had already become a scientific center, had nothing directly related to machine learning. So we began to experiment. That was in 2014.

Dmitry: Did you formulate the tasks yourselves?

Andrei: We set them and we solved them. Machine learning is arranged in such a way that ideas flow from one area to another, and models flow from one task to another. It is quite difficult to limit yourself to one thing. And we are not the only ones who work this way. Researchers in machine learning almost never sit strictly in one niche; there is always a lot of related research around.

Another thing is that initially we did not have a vector around which to build our work. Now we go deeper into specific topics and generally limit the number of niches in which we work. First of all, these are automated machine learning (AutoML), everything related to image and video processing and generation, and social network analysis.

Two relatively new topics appeared last year: machine learning for code analysis and machine learning for routing. We have been working on routing since 2016, but only now have we gone deep into the topic. We used to take on a wide variety of tasks, but that approach affects the results. Now we are trying to focus more on quality than on diversity.

Dmitry: So you now formulate specific problems for those who work in the laboratory and act as a theoretical mentor for a large number of specialists?

Andrei: Yes.

Dmitry: To a layman it may seem that this activity is all about very fundamental things, formulas and scientific papers. But the people in the lab, as we can see, work on social network analysis and recommendation engines, and analyze Trump's Twitter. Could you tell us about projects like these that are understandable at a practical level?

Andrei: Our activity can be divided into two categories. The first is algorithms that are applicable to many things, for example, feature selection algorithms. They are needed for many tasks: analyzing DNA sequences, processing medical data, and so on. The second category is tasks from a specific subject area, where we try to apply different algorithms. There are more tasks in this category; they are understandable and diverse, and not always purely scientific.
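To make the idea of a feature selection algorithm concrete, here is a minimal sketch of one of the simplest variants, a filter that ranks features by their correlation with the target. It is only an illustration, not the method used in the laboratory; the function names `pearson` and `select_top_k` and the toy data are assumptions made for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

def select_top_k(rows, target, k):
    """Return the indices of the k features most correlated with the target."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    scores.sort(reverse=True)  # strongest correlation first
    return sorted(j for _, j in scores[:k])

# Hypothetical toy data: feature 0 tracks the target,
# feature 1 is constant, feature 2 is close to noise.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.7],
]
target = [1.1, 1.9, 3.2, 3.9]
selected = select_top_k(rows, target, k=1)
```

The same filter applies unchanged whether the columns are medical measurements or features derived from DNA sequences, which is what makes this kind of algorithm reusable across tasks. Real feature selection methods are usually more elaborate, for example by accounting for redundancy between features.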

^{In the photo: Andrei Filchenkov}

In addition to the laboratory, we have a shared environment of startups and companies around us. They come with industrial orders, so quite practical tasks arise. Recently, we worked with the TVC channel. They have a huge number of documents and articles, and they are very interested in monitoring how a given phenomenon is presented in the press and how that changes over time: how often, say, Elon Musk appears in the press and in what context. We wrote a search engine for them that quickly runs rather complex queries over a fairly large database.

Dmitry: So it was more effective for them to come to you and commission custom development than to go to the market for a finished “boxed” product?

Andrei: Yes. They have a specific task. There are no such solutions on the market.

We also do quite interesting things with VK, where one of our graduate students works. Advertising is one of the main sources of revenue for the site, as for most such companies. You need to understand whom to show an advertisement to so that it is effective. The task looks like this: there are a large number of advertisers, and we need to make sure that their ads get impressions and that the people who come are relevant.

There are a number of studies related to medicine, for example, analyzing skin from a photograph to identify important medical markers. There are two topics. One is identifying whether a mole is just a mole or a sign that something serious may be going on. The second is understanding what causes particular spots and rashes (these can be symptoms of a large number of skin and sexually transmitted diseases).

Dmitry: And this could make basic diagnostics more accessible?

Andrei: Yes. But even though the symptoms are visible because they are on the skin, classifying them is not so simple. We automate this task so that you can install an application on your phone, take a picture and find out whether there is anything suspicious. We do not make decisions here, but if the application tells a person that something seems to be wrong, it is better to go to a doctor for a consultation.

Transfer of solutions and the development of the field of machine learning

Timecode (audio version): 09:07

Dmitry: Have you had situations where you took solutions that had been tested in one area and applied them in another, as an experiment or because a task was giving you trouble?

Andrei: Usually we try to use automated machine learning. Roughly speaking, we have a black box into which we drop tasks and get baseline solutions. They are, of course, usually worse than what you can then do by hand, but this is a baseline you can measure yourself against. If we talk about specific solutions, for example in image processing, transfer works well here. The basic approach is to take a pre-trained VGG and fine-tune it on your data. Building baseline solutions from trained, publicly available networks is a standard step. And it often turns out that, in the current setup, nothing better can be done.

Dmitry: In this case, do you end up changing anything architectural on the hardware side, or is it only about software solutions and algorithms?

Andrei: Usually we only talk about algorithms. There were few problems for which we had to adapt to hardware. The most striking one is what we did with Huawei to recognize characters in photos. There were rather strict restrictions because the model had to run on a phone, where there is not much memory and there are limits on the operations you can use, so what we had done in TensorFlow could not be applied as is. We had to get creative, because there were good solutions that could not be ported, and it took a lot of effort. But more often we build a solution that runs fast on a server, with no restrictions on anything else.

Dmitry: I ask because the question often arises of what the efficiency of such a system depends on. What problems are you currently facing in terms of the evolution of algorithms and theoretical approaches in this area? Has the field reached some limit, or is it developing rapidly now?

Andrei: The question is rather complicated, but interesting. For a long time, everything really was limited by hardware. There were quite complex algorithms, but they did not always work well. Now a lot is connected with neural networks, because the computing power needed to run them has appeared. Although at some point computing power became less important, we always keep in mind that there will be certain costs. If we talk about classical algorithms, there are no particular problems with capacity. The question is how to find the best solutions quickly and efficiently.

Initially, since capacity was limited, it was important that the person solving the problem understood how the algorithms work, what resources they require, and whether they are applicable to the task. There was a period when this mattered less, as capacity became quite affordable.

Dmitry: That is, there was a significant leap in computing power and capabilities. We took our finger off the pulse a little. Now you are putting it back.

Andrei: Yes, over the last couple of years automated machine learning has been on the rise. And there is a view that all the expertise now present in machine learning can be replaced by a hundredfold increase in computing power. Automated machine learning actually tries to find a solution for your problem. And this is a matter of capacity, because it is a search: not an exhaustive one, but still very complex, long and large-scale, requiring a huge amount of computing resources. As a result, we again come up against the fact that expertise is being replaced by resources and their availability.
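The point that AutoML is a search whose quality depends on the available compute can be sketched with plain random search over configurations. This is a deliberately simplified illustration and not the laboratory's AutoML system: `validation_error` is a hypothetical stand-in for training and evaluating a model, and the search space (a learning rate and a depth) is an assumption made for the example.

```python
import random

def validation_error(config):
    """Hypothetical stand-in for training a model and measuring its
    validation error; in a real system this step dominates the cost."""
    lr, depth = config["learning_rate"], config["depth"]
    return (lr - 0.1) ** 2 + 0.01 * (depth - 6) ** 2

def random_search(n_trials, seed=0):
    """Sample configurations at random and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(n_trials):
        config = {
            "learning_rate": rng.uniform(0.001, 1.0),
            "depth": rng.randint(1, 12),
        }
        error = validation_error(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error

# The same search with a small and a large compute budget.
small_budget = random_search(n_trials=10)
large_budget = random_search(n_trials=1000)
```

Because both runs share a seed, the larger budget revisits the same first ten configurations and then keeps going, so its best error can only be equal or lower. No expertise about the task is encoded anywhere; the improvement comes purely from spending more trials, which is the trade-off between expertise and resources described above.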

^{Photo by Franki Chamaki (Unsplash.com)}

I do not believe that we will ever find ourselves in a situation where an expert is not needed, because there are still many tasks that cannot simply be automated (setting the task, understanding business processes) until we come up with powerful artificial intelligence systems. At a minimum, that will not happen very soon. So the need for experts will remain, but the requirements for computing power are once again becoming one of the key factors.

Dmitry: But at the same time, you still have to do something to “catch up” with the growth of computing power in terms of the efficiency of the algorithms themselves. Companies may have huge capacities, but it seems to me that each task has a certain timeframe in which it must be solved and remains relevant.

Andrei: Yes. There is a desired quality, and there is a time within which it must be achieved. We are limited on both sides. If there is enough time, the quality is almost always achievable. Although, of course, each task has fundamental limits on the quality that can be obtained. Roughly speaking, we will not be able to predict a person's character from the position of the moon any better than a coin toss. But if the required quality is below that ceiling, we can reach it given some amount of time. Sometimes we can say that in the allotted time we found a fairly good solution, and a better one cannot be obtained in that time. There are always these two “coordinates”.

The study of new results, their understanding and adaptation

Timecode (audio version): 16:41

Dmitry: In terms of the professional development of those who work in your laboratory, how would you weigh basic knowledge and theoretical training against what has to be studied along the way: analyzing scientific papers, breakthroughs and changes in the area? How many of these changes arrive in a given week or month? How difficult is it to bring them into the projects you are working on now?

Andrei: This is a very good question. It all depends on the area. For example, one of my graduate students, who is automating clustering, is defending his thesis right now. In clustering in general, there is not a very large stream of new results. On the other hand, he does AutoML for clustering, and there it is not so easy to keep track of the stream of new results. Moreover, it is difficult to limit yourself to one area.

As I said at the very beginning, in machine learning everything is strongly connected. Ideas that originally arose in image processing flow elsewhere in a couple of steps. Text processing is the closest candidate. One of the most important success factors is simply to be aware of what is happening not only in your niche, but in machine learning in general.

^{Photo by Nick Hillier (Unsplash.com)}

This is difficult to do because a lot of results are generated. There are tens of thousands of participants at major conferences and a huge number of papers, up to hundreds a week on arXiv. It is difficult even to flip through that amount. But it is important to watch the milestones.

To do this, you do not need to scan arXiv or read conference proceedings from cover to cover, but it is very important to subscribe to machine learning communities and keep track of what is written there.

Initially, there were many different results in language processing. As soon as Word2vec appeared, everyone began to move to embeddings. There are a lot of them. There were several basic ones, but particular tasks had their own, and you had to keep track of how they were being improved. Then Google rolled out BERT, and it gave a very big gain. At some point it turned out that if you understood how BERT works and you were processing English, you really did not need anything else.

You did not have to worry about the successes of new versions of fastText and the embeddings based on it. BERT now has a lot of modifications. It is still the base, like Word2vec in its time. But many modifications are already coming out, at a rate of several papers per month or even at least one paper per week.

And this dynamic changes all the time. There are moments when everything converges and you can catch up, then it starts to bloom in different colors again, and you need to follow it. The problem is that this happens more and more often, that is, the speed at which new results appear is only increasing. How to keep track of this is one of the fundamental life problems I face.

Dmitry: Do such bursts of activity appear more around large companies that create their own frameworks, or does the scientific community contribute something too?

Andrei: Everyone is involved to varying degrees. I have to admit that companies have more results than universities, though it depends on the area. A typical scenario is that prominent scientists work not only at universities but also at companies. Most often companies hire those who continue to do research. As a result, a company accumulates huge intellectual potential. Universities can rarely provide comparable computing power and datasets, so company affiliations appear on papers more and more often.

Dmitry: Even on articles, despite the fact that the company has the ability to close most of its developments and research from the public? That makes some sense. But the university does not have such an opportunity, by its nature it is obliged to publish something.

Andrei: Indeed, there may be a bias. Universities publish in any case, because that is the nature of their work. The flow of papers from universities is larger, because there are many universities and their people publish, while not every company welcomes publication. But breakthrough results are more often found at companies.

This is how they capitalize on their fame and competencies and attract new scientists, explaining that research can be done with them, and very cool research at that. For example, they build a new model, which may be very theoretical. But more often than not, it is of great practical importance, first of all for the company itself. Such models may be released as open source, but, as with Google, these will be earlier versions that the company itself considers outdated.

Open development, preparation of publications and requirements for them

Timecode (audio version): 23:48

Dmitry: How important are open source developments in your work? How do they affect it? In the example with clustering automation, you said that you take open source developments and apply them on some fundamental basis that you can study and work through thoroughly, that is, you go the other way and do not tie yourself to rapid changes in a narrow area.

Andrei: Published code is very important. The standard question almost always arises: does what was published actually work? Then there are questions of comparing results. When a person creates something, he needs to prove that it is better than what came before. Unfortunately, quite often people compare against earlier work using the figures published in the paper. This is not quite correct. There is no guarantee that the person is reproducing the experiment the way the others ran it. So it is important to take the different solutions and run them yourself, check how they work, and make sure that you did better in the same experimental setup.

You can only compare correctly if you have the code.
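What "the same experimental setup" means in practice can be sketched as follows: both the earlier method and the new one are run by you, on identical data splits produced from a fixed seed, instead of comparing your numbers with figures copied from a paper. The two toy methods, the data and all function names here are assumptions made for this illustration.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Shuffle indices once with a fixed seed and cut them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_baseline(train_x, train_y, test_x):
    """Baseline: always predict the most frequent training label."""
    label = max(set(train_y), key=train_y.count)
    return [label for _ in test_x]

def nearest_neighbor(train_x, train_y, test_x):
    """Candidate: 1-nearest-neighbor on a single numeric feature."""
    predictions = []
    for x in test_x:
        closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        predictions.append(train_y[closest])
    return predictions

def cross_validate(method, xs, ys, folds):
    """Mean accuracy of `method` over the given folds."""
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        preds = method([xs[i] for i in train], [ys[i] for i in train],
                       [xs[i] for i in fold])
        correct = sum(p == ys[i] for p, i in zip(preds, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# Hypothetical one-dimensional data: the label is 1 when x >= 5.
xs = [float(i) for i in range(10)]
ys = [int(x >= 5) for x in xs]
folds = k_fold_indices(len(xs), k=5, seed=42)  # identical folds for both methods

baseline_acc = cross_validate(majority_baseline, xs, ys, folds)
candidate_acc = cross_validate(nearest_neighbor, xs, ys, folds)
```

Since the folds are built once and passed to both calls, any difference between `baseline_acc` and `candidate_acc` comes from the methods themselves and not from a different split or preprocessing. On this toy data the nearest-neighbor candidate comes out ahead of the majority baseline.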

^{Photo by Headway (Unsplash.com)}

Secondly, if a person is not developing anything new to compete with but simply wants to apply a method, it is good when it is available on GitHub and can be used. But even when something is published, documentation is not always available. Work is rarely published with good documentation. Sometimes the code is not uploaded at all, and you have to implement everything from the description in the paper. This is a problem. It creates difficulties with the reproducibility of machine learning results, and that is almost the main problem in our field.

It is easy to publish overfitted results, and hard to check them if there is no code. We compared our own AutoML with existing solutions. Fortunately, there are published libraries that are actively used. There are also newer solutions whose code has not been published. It was clear that they were worse: what they described was a special case of our solution.

Dmitry: Is it possible to envisage some approaches that will eliminate this problem, at least within the framework of your work and guarantee the possibility of reproducing the results?

Andrei: You can link to open repositories and provide documentation. I try to do this in all my papers. Unfortunately, it does not always work out. It happens to almost everyone: a person wrote the code, it works, but he is embarrassed to share it with others because he is ashamed of some parts of it. Still, I try to spread the idea that when you write a paper, you need to supplement it with a link to GitHub. The repository with the code must be prepared in advance.

Dmitry: And then you need to spend time documenting it.

Andrei: At least minimally. It is necessary so that what is written in the paper helps to understand what is going on, even if the code is not documented at all. Although that is difficult: the larger the project, the less the paper helps.

Work of laboratory specialists and frequency of output

Timecode (audio version): 28:37

Dmitry: If you evaluate the time of preparation of a scientific article and a practical project, in what time frame can a person realize something in your field of activity? Conventionally, every six months or a year he publishes an article, and some large project takes, say, two years.

Andrei: The significant problem is that it strongly depends on the hypothesis itself. It is really difficult to answer this question. I can plan a project so that it takes two months or so that it takes two years, and similar results will be achieved in both cases.

This does not mean that you can take the two-month plan and simply do twelve times less every day. It is like the saying about nine mothers, who cannot deliver a baby in one month. We are tied to computing power, but many things can be parallelized. How much can be completed depends on what the project is about.

It is similar with preparing papers. There are cases where a model trains for an hour, and cases where it trains for a day. In AutoML, experiments can take a week or more. It happens that experiments are not run on reasonable tasks because they would take too long to compute. That is why in many works everything is computed on something like CIFAR. Nobody tries to touch more reasonable, more realistic datasets, because then the paper would come out in two years, and the model would already be out of date.

The second thing that affects the state of affairs is what you already know about the subject area. If you understand what models exist, the experiment is set up, and you are testing a specific hypothesis, then everything is fine. But usually the task can change along the way, though not too much.

^{Photo by Science in HD (Unsplash.com)}

But now it has shifted a little in some direction, and you have to prove the novelty of what you proposed. You have to look at what has been done before and compare.

It turns out that your solution is new, but not the best. You try to improve it, and the result changes. You have to prove that the result is new. And meanwhile someone has bolted reinforcement learning on somewhere, and now you have to look in that direction as well.

This process can stretch out a lot. There are long-running papers that take a long time to write because things keep changing a little. Another point is the venue where you plan to publish. Good venues require you to justify why your result is good. Sometimes this explanation is the most time-consuming part. That is, the model works, and understanding why it works is a separate task, even harder than coming up with the model.

Finally, there is the writing itself. Unfortunately, all the worthwhile venues for contributing to science are English-language. This imposes its own limitations. Writing papers in English is not so easy. If a lab employee is writing his first paper, it will take him a very long time. If it is his twentieth, he will do it quickly. Moreover, if I do not actively push a person, he will write for a couple of months; if I do, it still takes a few weeks. Of course, it all depends on how complex the subject is.

The choice of area for development and prerequisites

Timecode (audio version): 33:04

Dmitry: How does the initial choice of direction affect development? Say a person is just starting to do all this at a serious level and takes on too wide an area. Workarounds start to pile up with the periodic changes and gains in results from related fields, and the person gets stuck in it as in a swamp. How do you focus at the initial stages?

Andrei: Such stories arise if a person is aware of a wide variety of techniques that he can use. If he is just starting, most likely there will be no such problems.

Dmitry: Because the basic foundation we talked about only matches existing results, and the person does not bring anything new?

Andrei: Yes, you could say that. Being well-read is important here. The more papers you read, the easier it is to write them. The language of papers is a language of its own. It is scientific English. Moreover, scientific English in machine learning differs from scientific English in biology. If a person reads a lot of papers, it becomes easier for him to formulate his thoughts and structure a paper.

Dmitry: Tell us about the features that you expect to see in newcomers - those who are just preparing to join you.

Andrei: First of all, the ability to program. The field is applied, and you cannot get anywhere without it. If we did research in a purely fundamental field, we could do without it, and mathematical training would be enough. But everything we do assumes that it will be implemented, for example, in Python. Programming skill is crucial.

^{Photo by Hitesh Choudhary (Unsplash.com)}

The second is core competence in machine learning. It is important that a person understands as a whole what machine learning is and how it works, even if he is working on his own narrow task. For example, he has taken up reinforcement learning, but he still needs to understand how machine learning is structured in general and what overfitting is. It is also important to know what models and ideas exist in order to try to apply them in your own work. An even more important factor is motivation, the desire to dive in, and inner interest. If a person is well-read, understands the field and has ideas of his own, he will be better able to write papers and position himself in our field.

Dmitry: What do you manage to do in addition to managing the laboratory and taking part in the work of your students? One way or another, you have to dig in, understand, suggest solutions and correct errors. Do you have any personal side projects? Maybe your day looks special in some way? Or does all your effort go into your main activity, and is it also your hobby?

Andrei: No two of my days are alike. For example, in November I spent more time away from St. Petersburg than in it. I cannot say that I have some consistent, regular activity at particular times. I try to write down what I want to do in a month: for example, read a certain number of papers, talk to my graduate students, and distribute tasks depending on where I am, what I am doing and what factors affect it.

I have wanted to do science since childhood, and I like devoting all my time to it. True, managing a laboratory is not quite science. Organizational activity and scientific activity are two different things. Scientists are usually not particularly interested in organization and bureaucracy. If a person manages well, he has no time for deep immersion in a problem. Unfortunately, I cannot combine the two effectively: either I dig into management and fall out of context, or I start on research and the organizational process goes off the rails. So now I am trying to reconsider the approach. Our laboratory has a manager, Ivan Smetannikov, and we have administrative staff, but there are still not enough people.

Leadership is a complex and stressful process, and it takes a lot of time. It is a necessary evil if you want to do science and solve more than one specific problem. I could be a senior researcher and deal with one or a few tasks, but this way I can deal with a large number of topics, and that is always more interesting: to dive in, dig around, and come up with solutions. I find it more interesting to come up with a solution than to implement and verify it. Other people do that, but it is not a delegation of routine.

In a study, you can almost never say: “Do this!” Rather, they say: "Try to do this and think about this." You can set a person a direction, give clues, he has absolutely no routine work. If you say: “Try to solve this problem”, then someone will succeed, but someone will not. It is not always clear where to start and where to look. Somewhere I just direct, somewhere I set a task, somewhere I almost roll out a solution.

Dmitry: And do you select those who will be involved in various areas?

Andrei: We discuss with everyone what he wants to do and what his research will be about.

Dmitry: Such a personal approach?

Andrei: Of course. Much research in machine learning is quite individual. Clearly, people collaborate in groups and do things together, but forcing a person to do something is an ineffective approach. Sometimes a person has no preferences, and he can be invited to work on something. He may get drawn in. But if he does not, he has no intrinsic motivation; he will not read, he will not develop. He will be a good performer, but that is not enough to do good research. Personal interest is needed. Therefore, the research has to be matched to the person.

Dmitry: What sources or literature do you recommend for those who want to move from programming in its broad sense to the topic of machine learning?

Andrei: The main recommendation is the Coursera courses by Andrew Ng. There are deeper courses from the Higher School of Economics. It is important that there is practice. Machine learning is about how algorithms actually work with data. It is far from pure math, because the data determines what you need to do with it. You can feel this while learning; there are many pitfalls and specifics in the code. I can recommend these books: Deep Learning by Sergei Nikolenko, the more classic Hastie, Tibshirani and Friedman, or the completely classic Bishop. Of course, it is a little outdated, but you can get a good idea from it.

The podcast is produced and hosted by dmitrykabanov.

Additional reading:

Startups of ITMO University (our English-language materials on Habré):
