Data Engineer and Data Scientist: what's the difference?

The professions of Data Scientist and Data Engineer are often confused. Every company works with data in its own way, pursues its own analysis goals, and has its own idea of which specialist should handle which part of the work, so the requirements differ from company to company.

We look at how these specialists differ, what business problems they solve, what skills they have, and how much they earn. The material turned out to be large, so we split it into two publications.

In the first article, Elena Gerasimova, head of the Data Science and Analytics department at Netology, explains the difference between a Data Scientist and a Data Engineer and what tools they work with.

How the roles of engineers and scientists differ

A data engineer is a specialist who, on the one hand, develops, tests, and maintains the infrastructure for working with data: databases, storage, and large-scale processing systems. On the other hand, this is the person who cleans and “combs” the data for analysts and data scientists, that is, builds data processing pipelines.

A Data Scientist builds and trains predictive (and other) models using machine learning algorithms and neural networks, helping businesses find hidden patterns, predict events, and optimize key business processes.

The main difference between a Data Scientist and a Data Engineer is that they usually have different goals. Both work to make data accessible and high quality. But a Data Scientist looks for answers to questions and tests hypotheses within a data ecosystem (for example, one based on Hadoop), while a Data Engineer builds the pipeline that serves the machine learning algorithm the data scientist wrote, running on a Spark cluster within the same ecosystem.

A data engineer brings value to the business by working as part of a team. The engineer's task is to act as an important link between different participants, from developers to the business consumers of reports, and to increase the productivity of analysts, from marketing and product to BI.

A Data Scientist, by contrast, is actively involved in the company's strategy: extracting insights, making decisions, implementing automation algorithms, modeling, and generating value from data.

Working with data follows the GIGO principle (garbage in, garbage out): if analysts and data scientists work with unprepared and potentially incorrect data, the results will be incorrect even with the most sophisticated analysis algorithms.

Data engineers solve this problem by building pipelines that process, clean, and transform data, so that data scientists can work with high-quality data.

There are many tools on the market for working with data, covering every stage from the moment data appears to its output on a dashboard for the board of directors. It is important that the engineer chooses a tool not because it is fashionable but because it will really help the other participants in the work.

For example, if a company needs to connect BI and ETL, that is, to load data and refresh reports, this is the typical legacy foundation a Data Engineer will have to deal with (ideally with an architect on the team as well).
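As a rough illustration of what such a cleaning step can look like, here is a minimal sketch in plain Python. The record schema (user_id, amount, country) and the validation rules are assumptions made up for this example, not something prescribed by the article; a real pipeline would follow the rules of the specific data source.

```python
from typing import Iterable, Iterator, Optional


def parse_record(raw: dict) -> Optional[dict]:
    """Return a cleaned record, or None if the row cannot be repaired."""
    user_id = str(raw.get("user_id", "")).strip()
    if not user_id:
        return None  # a row without an identifier is unusable
    try:
        # Accept both "10.5" and "10,5" as decimal notation.
        amount = float(str(raw.get("amount", "")).replace(",", "."))
    except ValueError:
        return None  # non-numeric amount
    if amount < 0:
        return None  # negative amounts are treated as errors in this sketch
    country = str(raw.get("country", "")).strip().upper() or "UNKNOWN"
    return {"user_id": user_id, "amount": amount, "country": country}


def clean(rows: Iterable[dict]) -> Iterator[dict]:
    """Clean rows and drop duplicates (same user_id and amount)."""
    seen = set()
    for raw in rows:
        record = parse_record(raw)
        if record is None:
            continue
        key = (record["user_id"], record["amount"])
        if key in seen:
            continue
        seen.add(key)
        yield record


raw_rows = [
    {"user_id": " 42 ", "amount": "10,5", "country": "ru"},
    {"user_id": "42", "amount": "10.5", "country": "RU"},  # duplicate after cleaning
    {"user_id": "", "amount": "3", "country": "de"},       # missing id
    {"user_id": "7", "amount": "n/a", "country": "us"},    # bad amount
    {"user_id": "8", "amount": "2", "country": ""},        # missing country
]
cleaned = list(clean(raw_rows))
```

Of the five raw rows, only two survive: the garbage is filtered out before it reaches an analyst or a model, which is the practical meaning of GIGO.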

Data Engineer Responsibilities

Development, construction and maintenance of data infrastructure.
Error handling and the creation of reliable data processing pipelines.
Bringing unstructured data from various dynamic sources into the form analysts need for their work.

There is another specialization within the Data Engineer track: ML engineer. In short, these engineers specialize in bringing machine learning models to production deployment and use. A model received from a data scientist is often part of a research study and may not work in production.

Data Scientist Responsibilities

Extracting features from data for use with machine learning algorithms.
Using various machine learning tools to predict and classify patterns in data.
Improving the performance and accuracy of machine learning algorithms through fine-tuning and optimization.
Forming “strong” hypotheses, in line with the company's strategy, that need to be tested.
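The first responsibilities on this list (feature extraction, then training a model that classifies) can be sketched in a few lines. This is a toy illustration only: the order schema, the "large order" label, and the hand-written logistic regression are assumptions for the example, and in practice a data scientist would normally reach for an established machine learning library instead.

```python
import math
from datetime import datetime


def extract_features(order: dict) -> list:
    """Turn a raw order into numeric features (hypothetical schema)."""
    ts = datetime.fromisoformat(order["created_at"])
    return [
        1.0,                                # bias term
        order["amount"] / 100.0,            # scaled order amount
        1.0 if ts.weekday() >= 5 else 0.0,  # weekend flag
    ]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: list, x: list) -> float:
    """Probability of the positive class for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def train(X: list, y: list, lr: float = 0.5, epochs: int = 2000) -> list:
    """Fit logistic regression weights with batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Toy training set: label 1 marks a "large" order (made-up data).
orders = [
    ({"amount": 20, "created_at": "2020-03-02T10:00:00"}, 0),
    ({"amount": 30, "created_at": "2020-03-07T12:00:00"}, 0),
    ({"amount": 40, "created_at": "2020-03-03T09:30:00"}, 0),
    ({"amount": 160, "created_at": "2020-03-01T18:00:00"}, 1),
    ({"amount": 180, "created_at": "2020-03-04T11:00:00"}, 1),
    ({"amount": 200, "created_at": "2020-03-05T15:45:00"}, 1),
]
X = [extract_features(order) for order, _ in orders]
y = [label for _, label in orders]
weights = train(X, y)

small = extract_features({"amount": 25, "created_at": "2020-03-09T10:00:00"})
large = extract_features({"amount": 190, "created_at": "2020-03-10T10:00:00"})
```

Calling predict(weights, small) and predict(weights, large) gives a low and a high probability respectively. The quality of those predictions depends entirely on the features fed in, which is why feature extraction comes first on the list.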

Tools of the Data Engineer and the Data Scientist

Today, expectations of data professionals have changed. Previously, engineers assembled large SQL queries, wrote MapReduce jobs by hand, and processed data with tools such as Informatica ETL, Pentaho ETL, and Talend.

In 2020, a specialist cannot do without knowledge of Python and modern tools for running computations (for example, Airflow), or without understanding the principles of working with cloud platforms (using them to save on hardware while observing security principles).

SAP, Oracle, MySQL, and Redis are traditional tools for a data engineer in large companies. They are good, but the cost of licenses is so high that learning to work with them makes sense only in industrial projects. At the same time, there is a free alternative, Postgres, which is suitable for more than just training.

Historically, Java and Scala were frequently in demand, although as languages and technologies evolve they are fading into the background.

Nevertheless, hardcore big data (Hadoop, Spark, and the rest of the zoo) is no longer a prerequisite for a data engineer, but rather a set of tools for tasks that traditional ETL cannot solve.

The trend is toward services that let you use tools without knowing the language they are written in (for example, Hadoop without knowledge of Java), as well as ready-made services for processing streaming data (voice recognition or image recognition on video).

Industrial solutions from SAS and SPSS are popular, while Tableau, RapidMiner, Stata, and Julia are also widely used by data scientists for local tasks.

Analysts and data scientists gained the ability to build pipelines themselves only a couple of years ago: for example, it is now possible to send data to PostgreSQL-based storage with relatively simple scripts.

Typically, the use of pipelines and integrated data structures remains the responsibility of data engineers. But today the trend toward T-shaped specialists, with broad competencies in related fields, is stronger than ever, because the tools are constantly being simplified.
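Tools such as Airflow describe a pipeline as a graph of tasks with dependencies and run the tasks in a valid order. The sketch below shows only that core idea using Python's standard library (graphlib, available from Python 3.9); it is not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "clean_orders": {"extract_orders"},
    "join": {"clean_orders", "extract_users"},
    "build_report": {"join"},
}


def run_pipeline(deps: dict, tasks: dict) -> list:
    """Run every task after all of its dependencies; return the order used."""
    executed = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # call the task's function
        executed.append(name)
    return executed


# Placeholder task bodies: each one just records that it ran.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}
order = run_pipeline(dependencies, tasks)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is where most of its practical value lies.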
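A "relatively simple script" of this kind might look like the sketch below. To keep it self-contained it uses SQLite from Python's standard library as a stand-in for PostgreSQL; the table and column names are invented for the example. With a PostgreSQL driver the overall shape would be similar, though connection setup and placeholder syntax differ (psycopg2, for instance, uses %s rather than ?).

```python
import sqlite3

# Rows prepared upstream (made-up sample data).
rows = [
    ("42", 10.5, "RU"),
    ("8", 2.0, "UNKNOWN"),
    ("15", 4.5, "RU"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders ("
    "user_id TEXT NOT NULL, amount REAL NOT NULL, country TEXT NOT NULL)"
)

# The connection as a context manager commits on success
# and rolls back if an exception is raised inside the block.
with conn:
    conn.executemany(
        "INSERT INTO orders (user_id, amount, country) VALUES (?, ?, ?)",
        rows,
    )

# A simple aggregate an analyst might run on the loaded data.
total_by_country = dict(
    conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country")
)
```

Parameterized queries (the ? placeholders) keep the script safe from malformed input, a habit worth keeping even in small one-off loaders.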

Why Data Engineer and Data Scientist Work Together

Working closely with engineers, a Data Scientist can focus on the research part and create ready-to-use machine learning algorithms.
Engineers, in turn, focus on scalability and data reuse, and make sure that the data input and output pipelines in each project are consistent with the global architecture.

This separation of duties ensures coherence between teams of specialists working on different machine learning projects.

Collaboration helps create new products efficiently. Speed and quality come from a balance between building a service for everyone (a global storage or dashboard integration) and meeting each specific need or project (a highly specialized pipeline, connecting external sources).

Working closely with data scientists and analysts helps engineers develop analytical and research skills and write better code. Knowledge sharing between users of data warehouses and data lakes improves, which makes projects more flexible and yields more sustainable long-term results.

In companies that aim to develop a culture of working with data and to build business processes on it, the Data Scientist and the Data Engineer complement each other and create a complete data analysis system.

In the next article, we'll talk about what kind of education Data Engineers and Data Scientists should have, what skills they need to develop, and how the market works.

From the editors of Netology

If you are considering a career as a Data Engineer or Data Scientist, we invite you to look at our course programs:

Profession: Data Engineer.
Profession: Data Scientist.
