3 traps that beginner Data Scientists fall into

This is what can happen if you are not good at math.





Hello! This is Petr Lukyanchenko, author and lead instructor of the online courses "Mathematics for Data Science" at OTUS. In class we love to illustrate everything with cases, so here, too, I will introduce every problem that beginners run into with an example.

Story No. 1. [...]



In our story, the trainee prepared the data incorrectly because he did not understand what kind of dependence to assume. This is the most common and most dangerous mistake that newcomers to data analysis make.

In every class we stress two things:

  1. Any analysis should begin with a hypothesis.
  2. The hypothesis may be wrong. Making a mistake is not scary; what matters is to notice it in time, correct it, and continue the analysis.

Formulating hypotheses that are then tested on data is what beginners, interns, and junior specialists in Data Science find hardest. As a rule, they know statistics quite well but lack experience, so they often blindly trust that a good metric value means the result is valid. Because of this, newcomers are often driven by the desire to obtain a high correlation value. But a high correlation by itself does not guarantee a real dependence!

Spurious correlations (regressions) are usually very funny. Take any two parameters: if each of them has a trend component, the estimated correlation can come out close to one, even though the parameters themselves may have no relationship at all.

For example, someone studying glaciers in Greenland decides to check how the amount of precipitation in Thailand during the monsoon season affects the rate of ice melting. Over the chosen period both variables increase, that is, both have a trend component: precipitation in Thailand rises at the same time as the hot period begins and the glaciers melt faster. If you compute the correlation "head-on", it will be close to one, which appears to indicate a direct relationship between the two quantities. So before any analysis you must first work with the data: remove the trend component, i.e. detrend the series, for instance by taking the daily increments. These Δx variables are then used to compute the correlation. This is a very simple step that nevertheless significantly improves the quality of the analysis.
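The effect is easy to reproduce on synthetic data. The sketch below is only an illustration: the two series are randomly generated stand-ins (not real precipitation or glacier measurements), the slopes and noise levels are arbitrary assumptions, and taking first differences with `np.diff` is just one simple way to detrend.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = np.arange(n)

# Two synthetic series that share nothing except an upward trend.
# Slopes and noise levels are arbitrary, chosen only for illustration.
x = 0.05 * t + rng.normal(0.0, 1.0, n)
y = 0.03 * t + rng.normal(0.0, 1.0, n)

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# "Head-on" correlation of the raw levels: dominated by the common trend.
corr_levels = pearson(x, y)

# Correlation of the increments (delta x, delta y): the trend is removed.
corr_increments = pearson(np.diff(x), np.diff(y))

print(f"levels:     {corr_levels:.2f}")
print(f"increments: {corr_increments:.2f}")
```

On the raw levels the coefficient is close to one; on the increments it drops to roughly zero, which is what you would expect from two series generated independently of each other.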

Story No. 2. [...]

Choosing the wrong time period for calibration, one in which external factors are not taken into account, is the most common reason why a model that worked at first becomes useless.


Loading data into a model as if into a black box


Over several years of rapid growth in Data Science, humanity has accumulated impressive libraries of models and data-processing methods. This is great: they can be used to solve routine problems, and many specialists rely on them, experienced ones as well as beginners. The danger lies in taking a ready-made model, simply feeding data into it, and accepting whatever prediction comes out. An experienced specialist always uses mathematical tools to test the method and adapt it to the task.

Beginners at first find it difficult to reconstruct the empirical distribution from the available data. And even if a novice successfully picks a suitable method from a library, or a senior colleague helps set up the model, another danger lies in wait: at any moment the behavior of the data may change, or the internal process behind the time series may change. The model then loses accuracy, the whole forecast becomes less effective, and the model has to be recalibrated quickly. To catch this and adjust the model, you need a command of statistical methods and an understanding of the principle the model works on.
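One simple way to catch such a change is to watch the model's errors over time. The sketch below is a minimal illustration under stated assumptions, not a production monitor: the function name `needs_recalibration`, the window sizes, and the tolerance factor are all hypothetical choices, and the data is synthetic with the slope deliberately changed after observation 200.

```python
import numpy as np

def needs_recalibration(errors, baseline_window=100, recent_window=30, tolerance=1.5):
    """Return True when the recent mean absolute error exceeds
    the baseline mean absolute error by more than `tolerance` times."""
    abs_err = np.abs(np.asarray(errors, dtype=float))
    if abs_err.size < baseline_window + recent_window:
        return False  # not enough history to compare two windows
    baseline = abs_err[:baseline_window].mean()
    recent = abs_err[-recent_window:].mean()
    return bool(recent > tolerance * baseline)

# Synthetic process: y = 2x + noise, then the slope shifts to 3.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0.0, 10.0, n)
slope = np.where(np.arange(n) < 200, 2.0, 3.0)
y = slope * x + rng.normal(0.0, 0.5, n)

# Calibrate a straight line on the first 100 observations only.
coef = np.polyfit(x[:100], y[:100], 1)
errors = y - np.polyval(coef, x)

before_shift = needs_recalibration(errors[:200])  # process unchanged so far
after_shift = needs_recalibration(errors)         # includes the new regime
print(before_shift, after_shift)
```

A fixed ratio of mean absolute errors is the crudest possible criterion; in practice the threshold and windows would be chosen from the statistical properties of the errors, which is exactly where knowing the underlying method helps.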

Even if the method is already implemented in Python and sits somewhere "in the box", you should derive it by hand at least once to understand how it works. Then, when you meet this method in a project and need to adapt it, you will already know which steps in the chain have to change.
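As an illustration of what "by hand" can look like, here is a sketch that uses ordinary least squares as the example method (my choice for illustration; the article does not name a specific one). It solves the normal equations XᵀXβ = Xᵀy directly and compares the result with NumPy's built-in `np.linalg.lstsq`. The helper name `ols_by_hand` and the synthetic data are assumptions.

```python
import numpy as np

def ols_by_hand(X, y):
    """Least-squares coefficients from the normal equations.
    Returns [intercept, b1, ..., bk] for a 2-D feature matrix X."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    # Solve (X1^T X1) beta = X1^T y rather than inverting the matrix.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Synthetic data with known coefficients: y = 1.5 + 2*x1 - 0.5*x2 + noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.1, 200)

beta_manual = ols_by_hand(X, y)

# The same fit through the library routine, for comparison.
X1 = np.column_stack([np.ones(len(X)), X])
beta_library = np.linalg.lstsq(X1, y, rcond=None)[0]

print(beta_manual)
print(beta_library)
```

Once the normal equations are written out like this, it becomes clear where a modification such as a regularization term or observation weights would enter.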

Story No. 3. Imagine you have a data matrix of 10,000 rows by 10,000 columns. Multiplying each pair of elements takes about 30 milliseconds, so your algorithm will spend more than an hour processing the data! And what if the matrix is a billion by a billion? Or if you need to run many such algorithms?

Raw Matrices


Newcomers often do not process or prepare matrices before analysis, and the work ends up costing them extra time and effort. To simplify and speed up work with matrices, specialists use tools from linear algebra. It works like this: the data matrix is projected onto a low-rank subspace, which temporarily reduces its dimension.
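One common way to build such a projection is the truncated singular value decomposition (SVD). The sketch below is an illustration under assumptions: the helper name `project_to_rank` is hypothetical, the matrix is synthetic (rank 5 plus small noise), and the rank `k` is taken as known, whereas in practice it is chosen from the decay of the singular values.

```python
import numpy as np

def project_to_rank(A, k):
    """Project the rows of A onto the subspace spanned by the top-k
    right singular vectors. Returns (scores, components)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    components = Vt[:k]          # shape (k, n_cols): basis of the subspace
    scores = A @ components.T    # shape (n_rows, k): reduced representation
    return scores, components

# Synthetic 200 x 100 matrix that is approximately rank 5.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100))
A += 0.01 * rng.normal(size=A.shape)

scores, components = project_to_rank(A, k=5)

# Map back to the original space to see how much information was lost.
A_approx = scores @ components
rel_error = np.linalg.norm(A - A_approx) / np.linalg.norm(A)

print(A.shape, "->", scores.shape)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Downstream computations can then run on the 200 × 5 matrix of scores instead of the full 200 × 100 matrix, and results are mapped back through `components` when needed.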

You can learn how to do all this in our online courses "Mathematics for Data Science". The Basic level starts from the school curriculum and focuses on the mathematical component. Choose the Advanced level if you once studied higher mathematics, even a very long time ago, or already have experience in Data Science. At the Advanced level we examine data analysis methods for different tasks. At the end of the course, students complete a project: they try to implement one of the methods by hand to understand how it is built and to modify one of its parts. The entrance test will help you determine your level.

The theory and practical skills you will master in class are needed first of all by mid-level specialists, but they will also be useful at the start of a career. We surveyed our partner employers in the field of Data Science and found that more than half of them are ready to hire an intern who knows mathematics, even one who does not know how to work with Python libraries.

Also, if you work in Data Science or are just considering it, I invite you to subscribe to the Data Street Telegram channel, where I share my experience and collect useful materials from the world of mathematics, data analysis, and machine learning. I will be glad to see you at the OTUS courses!

You can learn more about the courses and take the entrance test to check your knowledge by following the links below:

