Why we need DevOps for ML data



Deploying machine learning (ML) in production is hard; in fact, it is an order of magnitude harder than deploying conventional software. As a result, most ML projects never see the light of day, or production: most organizations simply give up on using ML to improve their products and serve their customers.

From what we have seen, the fundamental obstacle preventing most teams from building and deploying ML in production at the expected scale is that we still have not managed to bring DevOps practices to machine learning. Existing MLOps solutions partially cover the process of building and deploying ML models, but they lack support for one of the hardest aspects of ML: data.

In this article, we discuss why the industry needs DevOps solutions for ML data, and how the unique difficulties of ML data hold back efforts to put ML into production. The article describes a gap in the current ML infrastructure ecosystem and proposes filling it with Tecton, a centralized data platform for machine learning. Click here to read an article from my co-founder Mike for more details on launching Tecton.

Tecton was founded by a team of engineers who built internal ML platforms at companies such as Uber, Google, Facebook, Twitter, Airbnb, AdRoll, and Quora. These companies' significant investments in ML made it possible to develop processes and tools that put ML to extensive use across their organizations and products. The lessons in this article, as well as the Tecton platform itself, are largely based on our team's experience deploying ML in production over the past few years.

Remember when software releases were long and painful?


The way software was developed and deployed twenty years ago has a lot in common with how ML applications are developed today: feedback loops were incredibly long, and by the time you got to release, your initial requirements and design were already obsolete. Then, at the end of the noughties, a set of best practices for software development emerged in the form of DevOps, providing methods for managing the development life cycle and enabling continuous, rapid improvement.

The DevOps approach lets engineers work against a well-defined, shared code base. Once an incremental change is ready for deployment, the engineer commits it to the version control system. The continuous integration and delivery (CI/CD) process picks up the latest changes, runs unit tests, builds documentation, runs integration tests, and then, in a controlled manner, releases the changes to production or prepares a release for distribution.


Fig. 1: typical DevOps process

Key features of DevOps:

  • Programmers own their code from start to finish. They are empowered and fully responsible for every line of code in production. This sense of ownership generally improves the quality of the code, as well as the availability and reliability of programs.
  • Teams can iterate quickly and are not constrained by the months-long cycles of the waterfall model. Instead, they can test new features with real users almost immediately.
  • Performance and reliability issues are detected and diagnosed quickly. If performance metrics drop immediately after the latest deployment, an automatic rollback is triggered, and the code changes that went out with that deployment are the most likely cause of the drop.

These days, many development teams treat this integrated approach as a given.

... Meanwhile, deploying ML is still long and painful


Unlike software development, data science has no well-defined, fully automated process for getting to production quickly. Creating an ML application and deploying it into a product involves several steps:


Fig. 2: data scientists have to coordinate their work across several teams in different areas

  • Discovering and accessing source data. At most companies, data scientists spend up to 80% of their time searching for the source data needed to model their problem. This often requires cross-functional coordination with data engineers and compliance teams.
  • Data preparation and feature engineering. The source data has to be cleaned and transformed into the features and labels the model learns from. This is exploratory, iterative work that often requires help from data engineers.
  • Model training and productionizing the data pipelines. The model is trained on the prepared data, and the transformations behind its features have to be rebuilt as production-grade pipelines, usually with the help of data engineers.
  • Deployment and integration of the model. This step usually involves integration with a service that uses the model for forecasting. For example, an online retailer’s mobile application that uses a recommendation model to predict product offers.
  • Setting up monitoring. Once again, engineers' help is required to make sure the ML model and the data pipelines keep working correctly.

As a result, ML teams face the same problems that programmers faced twenty years ago:

  • Data scientists do not own the full life cycle of their models and features. They have to rely on others to deploy their changes and keep them running in production.
  • Iteration cycles are slow. Every change travels through handoffs between data science, data engineering, ML engineering, and DevOps teams, so even small improvements can take weeks or months to reach production.


Fig. 3

  • Problems in production are hard to detect. Without proper monitoring, data scientists often learn about broken data or degraded predictions only after the damage has already reached the product.

DevOps for ML models ≠ DevOps for ML data


MLOps platforms such as SageMaker and Kubeflow are heading in the right direction: they help companies simplify putting ML into production, and through them we can watch MLOps bring DevOps principles and tools to ML. Getting started requires a fairly substantial upfront investment, but once properly integrated, these platforms expand what data scientists can do when training, managing, and productionizing ML models.

Unfortunately, most MLOps tools focus on the workflow around the model itself (training, deployment, management), and that leaves significant gaps for real ML applications. ML applications are defined by code, models, and data. Their success depends on being able to create high-quality ML data and deliver it to production quickly and reliably... otherwise it is just another case of "garbage in, garbage out". The following diagram, adapted from Google's work on technical debt in ML systems, shows the "data-centric" and "model-centric" elements of ML systems. Today's MLOps platforms help with many of the "model-centric" elements, but with only a few of the "data-centric" ones, or none at all:


Fig. 4: model-centric and data-centric elements of ML systems. Today, the model-centric elements are largely covered by MLOps systems.

The next sections walk through some of the hardest challenges we have encountered in simplifying ML production. They are not exhaustive, but they illustrate the problems that come up while managing the ML data life cycle (features and labels):

  • Accessing the right source data
  • Building features and labels from source data
  • Combining features into training data
  • Computing and serving features in production
  • Monitoring features in production

A quick reminder before we dive in: an ML feature is data that serves as an input signal to a model making a prediction. For example, a food delivery service wants to show the expected delivery time in its application. To do that, it needs to predict how long it will take to prepare a particular dish, at a particular restaurant, at a particular time. A convenient signal for such a forecast, a proxy for how busy the restaurant is, is the trailing count of orders received in the last 30 minutes. The feature is computed from the stream of raw order events:
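To make the example concrete, here is a minimal sketch of how such a feature could be computed from the order event stream. The class and field names (TrailingOrderCount, restaurant_id, created_at) are illustrative assumptions, not taken from any real system.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

class TrailingOrderCount:
    """Maintains, per restaurant, the count of orders received in a trailing window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)):
        self.window = window
        self.events = defaultdict(deque)  # restaurant_id -> timestamps of recent orders

    def add_order(self, restaurant_id: str, created_at: datetime) -> None:
        """Ingest one order event from the stream."""
        self.events[restaurant_id].append(created_at)

    def feature_value(self, restaurant_id: str, now: datetime) -> int:
        """Return the 'orders in the last 30 minutes' feature value as of `now`."""
        q = self.events[restaurant_id]
        while q and q[0] < now - self.window:
            q.popleft()  # drop events that have fallen out of the window
        return len(q)

counter = TrailingOrderCount()
now = datetime(2020, 6, 1, 12, 0)
counter.add_order("pizza_place_42", now - timedelta(minutes=40))
counter.add_order("pizza_place_42", now - timedelta(minutes=10))
counter.add_order("pizza_place_42", now - timedelta(minutes=5))
print(counter.feature_value("pizza_place_42", now))  # -> 2 (the 40-minute-old order is excluded)
```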


Fig. 5: raw data is turned into feature values by a feature transformation

Data Challenge #1: Accessing the Right Source Data


To build any feature or model, a data scientist first needs to find the right data source and gain access to it. There are several obstacles along the way:

  • Data discovery: data scientists need to know where the source data lives. Data cataloging systems (such as Lyft's Amundsen) are a great solution, but they are not yet universally adopted. Often the necessary data simply does not exist yet and must first be created or cataloged.
  • Access approval: data scientists often have to run a gauntlet of approvals to get permission to use the data that would solve their problem.
  • Compliance constraints: some source data is sensitive (for example, personal or financial data), and using it requires additional review, anonymization, or other safeguards before a data scientist can work with it.

Data Challenge #2: Building Features and Labels from Source Data


Source data can come from many different systems, each with its own important properties that shape the kinds of features that can be extracted from it. These properties include the types of transformations the data source supports, the freshness of its data, and the amount of historical data available:


Fig. 6: different data sources support different types of transformations and provide access to different amounts of history at different levels of freshness.

These properties matter because the type of data source determines the kinds of features a data scientist can derive from it:

  • Data warehouses and data lakes (e.g., Snowflake or Redshift) hold large amounts of historical data and support complex batch transformations, but the data is relatively stale. They suit features such as "a restaurant's average order total over the past six months".
  • Transactional data stores (e.g., MongoDB or MySQL) hold more recent data, but usually only a limited history, and support only relatively simple queries. They suit features such as "the number of orders a customer placed in the last 24 hours".
  • Data streams (e.g., Kafka) deliver events in near real time, but typically retain only a short window of history. They enable fresh features such as "the number of orders a restaurant received in the last 30 minutes".
  • Prediction request data is raw event data that arrives in real time, right before an ML prediction is made: for example, the query a user has just typed into the search bar. Even though such data is limited, it is as fresh as possible and often carries a highly predictive signal. It arrives with the prediction request and can be used in real-time computations, such as scoring the similarity between a user's search query and the documents in a search index.

Looking ahead: combining data from sources with complementary characteristics is what makes really good features possible. That approach, however, requires implementing and managing more advanced feature transformations.
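As a hypothetical illustration of that point, the sketch below assembles one feature vector from a precomputed warehouse feature, a streaming feature, and a feature computed at request time. The function names and returned values are placeholders, not a real API.

```python
from datetime import datetime

def get_warehouse_feature(restaurant_id: str) -> float:
    """Batch feature from the warehouse, e.g. average prep time over the past 6 months."""
    return 14.2  # placeholder value

def get_stream_feature(restaurant_id: str) -> int:
    """Streaming feature, e.g. orders received in the last 30 minutes."""
    return 7  # placeholder value

def build_feature_vector(restaurant_id: str, request_time: datetime) -> dict:
    """Combine batch, streaming, and request-time signals into one feature vector."""
    is_dinner_rush = 18 <= request_time.hour < 21  # computed from the request itself
    return {
        "avg_prep_time_6m": get_warehouse_feature(restaurant_id),
        "orders_last_30m": get_stream_feature(restaurant_id),
        "is_dinner_rush": float(is_dinner_rush),
    }

print(build_feature_vector("pizza_place_42", datetime(2020, 6, 1, 19, 15)))
```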

Data Challenge #3: Combining Features into Training Data


Building training or test data sets requires joining together the data for the relevant features. Many details have to be tracked along the way, and they can have a critical effect on the model. The two most insidious are:

  • Data leakage: data scientists need to make sure the model is trained only on legitimate information and that no unwanted information "leaks" into the training data. Leaks can come from the test set, from ground-truth labels, from the future, or from data that bypasses important preparation steps (such as anonymization).
  • Point-in-time correctness: each training example must be built from the feature values as they were at the time of the corresponding event, not as they are now (so-called "time travel"). Getting this right across many features and data sources is hard, and data scientists often end up writing complex, error-prone backfill logic to do it; a minimal sketch of such a point-in-time join is shown below.
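Below is a minimal sketch of such a point-in-time join using pandas' merge_asof. The column names and values are made up; the pattern is what matters: each label is matched only with feature values observed before its timestamp, so nothing from the future leaks into training.

```python
import pandas as pd

labels = pd.DataFrame({
    "restaurant_id": ["r1", "r1", "r2"],
    "event_time": pd.to_datetime(["2020-06-01 12:00", "2020-06-01 13:00", "2020-06-01 12:30"]),
    "delivery_minutes": [34, 41, 25],  # the label we want to predict
})
features = pd.DataFrame({
    "restaurant_id": ["r1", "r1", "r2"],
    "feature_time": pd.to_datetime(["2020-06-01 11:50", "2020-06-01 12:50", "2020-06-01 12:20"]),
    "orders_last_30m": [5, 9, 2],
})

# For each label, take the most recent feature value recorded *before* the event.
training_data = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="restaurant_id",
    direction="backward",
)
print(training_data[["restaurant_id", "event_time", "orders_last_30m", "delivery_minutes"]])
```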

Data Challenge #4: Computing and Serving Features in Production


Once a model is serving real-time traffic, it must be constantly supplied with fresh feature data to produce correct and relevant predictions, often at large scale and with minimal latency.

How should this data get to the model? Straight from the source? Fetching and transforming data from a warehouse can take minutes, hours, or even days, which is far too slow for real-time serving and therefore impossible in most cases.



In such cases, feature computation and feature consumption must be decoupled. ETL processes are needed to precompute features and ship them to a production store optimized for serving. These processes introduce new difficulties and new maintenance costs:



Finding the right tradeoff between freshness and cost: decoupling feature computation from consumption makes freshness a first-order concern. Feature pipelines can be run more often to produce fresher data, but at a higher cost. The right tradeoff differs per feature and use case. For example, a feature that aggregates the order count over a trailing thirty-minute window is worth updating much more often than a similar feature with a two-week trailing window.
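To make the decoupled pattern concrete, here is a minimal sketch of a scheduled job that precomputes the 30-minute order count and writes it to an online store. SQLite stands in for the warehouse and a plain dict for the low-latency store; the table, column, and feature names are assumptions made for the example.

```python
import sqlite3
from datetime import datetime

online_store: dict[str, dict] = {}  # stand-in for a low-latency store (Redis, DynamoDB, ...)

def precompute_features(warehouse: sqlite3.Connection, run_time: datetime) -> None:
    """Batch ETL job, e.g. triggered every 10 minutes by a scheduler."""
    rows = warehouse.execute(
        "SELECT restaurant_id, COUNT(*) FROM orders "
        "WHERE created_at >= datetime(?, '-30 minutes') AND created_at <= ? "
        "GROUP BY restaurant_id",
        (run_time.isoformat(sep=" "), run_time.isoformat(sep=" ")),
    )
    for restaurant_id, order_count in rows:
        online_store[restaurant_id] = {"orders_last_30m": order_count, "as_of": run_time.isoformat()}

def get_online_features(restaurant_id: str) -> dict:
    """Low-latency read path used at prediction time."""
    return online_store.get(restaurant_id, {"orders_last_30m": 0, "as_of": None})

# Tiny end-to-end run with an in-memory "warehouse".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (restaurant_id TEXT, created_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("r1", "2020-06-01 11:40:00"), ("r1", "2020-06-01 11:55:00"), ("r2", "2020-06-01 10:00:00"),
])
precompute_features(conn, datetime(2020, 6, 1, 12, 0))
print(get_online_features("r1"))  # 2 orders in the last 30 minutes
```

Running the job more often makes the feature fresher at a proportionally higher compute cost, which is exactly the tradeoff described above.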



Integrating feature pipelines: building powerful features requires pulling data from several different sources, which makes the problems discussed earlier harder than when working with a single source. Orchestrating these pipelines and merging their outputs into a single feature vector takes serious data engineering work.



Preventing training/serving skew: discrepancies between the training pipeline and the serving pipeline lead to training/serving skew. Such skew is hard to detect, and its presence can make a model's predictions unusable. A model behaves erratically when it makes predictions on data generated differently from the data it was trained on. Skew and how to handle it deserve a series of articles of their own, but two typical risks are worth highlighting:

  • Inconsistent feature logic: the training pipeline and the serving pipeline implement the same feature (or its edge cases) differently. What happens with nulls? How are missing values imputed? Even small discrepancies between the two code paths quietly degrade the model.


Fig. 7

  • Feature staleness: feature values served in production lag behind reality (for example, they are refreshed only every 10 minutes). If the training data is built from exact, perfectly up-to-date values, the model learns to rely on a freshness that is never actually available at serving time, and its predictions suffer.


Fig. 8: the graph shows the trailing order count: (1) the feature values served for prediction, refreshed every 10 minutes; (2) training data that incorrectly reflects the true values far more precisely than the features actually served in production
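One way to surface this kind of skew early is to routinely compare feature values logged at serving time against the values used to train the model. The sketch below is an illustrative check with arbitrary thresholds, not a production monitoring system.

```python
import statistics

def skew_report(training_values: list, serving_values: list, name: str) -> dict:
    """Compare simple statistics of a feature between training data and serving logs."""
    def summarize(values):
        present = [v for v in values if v is not None]
        return {
            "null_rate": 1 - len(present) / len(values),
            "mean": statistics.fmean(present) if present else float("nan"),
        }

    train, serve = summarize(training_values), summarize(serving_values)
    suspicious = (
        abs(train["mean"] - serve["mean"]) > 0.25 * abs(train["mean"])  # mean shifted by >25%
        or abs(train["null_rate"] - serve["null_rate"]) > 0.05          # null rate shifted by >5 pts
    )
    return {"feature": name, "training": train, "serving": serve, "suspicious": suspicious}

# Training saw busy restaurants; serving returns much lower (and missing) counts -> flagged.
print(skew_report([5, 7, 6, 8], [2, 1, None, 3], "orders_last_30m"))
```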

Data Challenge #5: Monitoring Features in Production


No matter how carefully you navigate the problems above, something will eventually break. When an ML system breaks, it is almost always because of a "data integrity violation". The term covers many different causes, each of which needs to be monitored (a sketch of such checks follows the list below). Examples of data integrity violations:

  • Broken upstream data: an upstream system changes its schema, drops a field, or stops emitting events. The feature keeps being served, but its values are silently wrong, and nobody notices until model quality degrades.
  • Stale feature data: a feature pipeline fails or falls behind, so production keeps serving old values. Depending on the feature (for example, a 30-minute order count), even a short delay can make the values useless.
  • Drift: the statistical properties of the data change over time (new products, new user behavior, seasonality), so features and models that were valid yesterday gradually stop reflecting reality.
  • Unclear responsibility for data quality: when a feature draws its source data from several different upstream sources, who is ultimately responsible for the feature's quality? The data scientist who created the feature? The data scientist who trained the model? The owner of the data feed? The engineer who integrated the model into production? When responsibilities are unclear, problems stay unresolved for far too long.
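As a hypothetical illustration, the sketch below shows the kind of scheduled integrity check that can catch stale or null feature values before they silently degrade a model. The store layout and freshness thresholds are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Maximum acceptable age per feature (illustrative freshness thresholds).
MAX_AGE = {"orders_last_30m": timedelta(minutes=15), "avg_prep_time_6m": timedelta(days=2)}

def integrity_alerts(feature_rows: dict, now: datetime) -> list[str]:
    """Scan the online store and report null or stale feature values."""
    alerts = []
    for name, row in feature_rows.items():
        if row["value"] is None:
            alerts.append(f"{name}: null value being served to production")
        age = now - row["as_of"]
        if name in MAX_AGE and age > MAX_AGE[name]:
            alerts.append(f"{name}: value is {age} old, exceeding its freshness threshold")
    return alerts

now = datetime(2020, 6, 1, 12, 0)
rows = {
    "orders_last_30m": {"value": 7, "as_of": now - timedelta(minutes=40)},   # stale
    "avg_prep_time_6m": {"value": None, "as_of": now - timedelta(hours=6)},  # null
}
for alert in integrity_alerts(rows, now):
    print(alert)
```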

These challenges add up to an almost insurmountable obstacle course, even for the most advanced data science and ML engineering teams. Solving them requires something better than the status quo at most companies, where one-off, custom-built solutions address only a subset of the problems.

Introducing Tecton: A Data Platform for Machine Learning


At Tecton, we are building a data platform for machine learning to help with the most common and most difficult data problems in ML.

At a high level, the Tecton platform provides:

  1. Feature pipelines for turning your source data into features and labels
  2. A feature store for storing historical feature and label data
  3. A feature server for serving the latest feature values in production
  4. An SDK for retrieving training data and manipulating feature pipelines
  5. A web UI for monitoring and tracking features, labels, and data sets
  6. A monitoring engine for detecting data quality or drift problems and sending alerts



Fig. 9: as a central data platform for ML, Tecton serves features both in development environments and in production.

The platform allows ML teams to bring DevOps practices to ML data:



  • Plan: Tecton features are stored in a central repository, so data scientists can share, discover, and reuse each other's work.
  • Code: Tecton lets users define simple, flexible feature transformation pipelines.
  • Build: Tecton compiles feature definitions into high-performance data processing jobs.
  • Test: Tecton supports functional and integration testing of features.
  • Release: Tecton integrates tightly with git. All feature definitions are version controlled and easy to reproduce.
  • Deploy: Tecton runs feature pipelines on existing data processing infrastructure (for example, Spark) and manages their deployment.
  • Operate: Tecton serves feature values to models in production and provides consistent training data to data scientists.
  • Monitor: Tecton monitors features in production and alerts on data quality and drift problems.

Of course, ML data without an ML model does not give you a working ML application. That is why Tecton provides flexible APIs and integrates with existing ML platforms. We started with Databricks, SageMaker, and Kubeflow, and we continue to integrate with complementary parts of the ecosystem.
