Migrating from a monolithic Data Lake to a distributed Data Mesh

Hello, Habr! I present to you a translation of the article "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" by Zhamak Dehghani (all images are taken from that article).

All large companies are now trying to build huge centralized data warehouses, or even bigger clustered Data Lakes (as a rule, on Hadoop). But I do not know a single example of such a data platform being built successfully. Everywhere it is pain and suffering, both for those who build the data platform and for its users. In the article below the author, Zhamak Dehghani, proposes a completely new approach to building a data platform: a fourth-generation data platform architecture called Data Mesh. The original article in English is very long and frankly hard to read; the translation also turned out rather long, and the text is not simple: long sentences and rather dry vocabulary. I did not reformulate the author's thoughts, in order to preserve the accuracy of the wording. But I strongly recommend that you work through this difficult text and read the article. For those who work with data it will be very useful and very interesting.

Evgeny Cherny

Many companies are investing in the next generation of Data Lake with the hope of simplifying company-wide data access and providing business insights and the ability to make high-quality decisions automatically. But current approaches to building data platforms have similar problems that do not allow us to achieve our goals. To solve these problems, we need to abandon the paradigm of a centralized Data Lake (or its predecessor, the data warehouse). And move on to a paradigm based on a modern distributed architecture: consider business domains as a first-level priority, apply platform thinking to create an infrastructure with the ability to self-service and perceive data as a product.


Content

  • The current architecture of the data platform in a large company
    • Problematic architectural approaches
      • Centralized and monolithic
      • Highly coupled pipeline decomposition
      • Disparate and highly specialized teams
  • Next-Generation Data Platform Architecture
    • Data and distributed domain driven architecture
      • Source domain datasets
      • Consumer domain datasets
      • Distributed data pipelines implemented within their domains
    • Data and Product Thinking
      • Domain Data as a Product (discoverable, addressable, trustworthy, self-describing, interoperable, secure)
      • Cross-functional business domain data team
    • Centralized data infrastructure as a platform
  • Paradigm shift towards Data Mesh

Building a data-driven organization remains one of the main strategic goals of many of the companies I work with. My clients are well aware of the benefits of making decisions based on quality data: providing the highest quality customer service, hyper personalization, reducing operating costs and time due to optimization, providing employees with analysis and business intelligence tools. They invest heavily in building modern data platforms. But despite growing efforts and investment in building such platforms, many organizations consider the results mediocre.

Organizations face many difficulties in transforming into a data-driven company: migrating away from legacy systems developed over decades, resistance from the existing culture, and competition between different business priorities. Be that as it may, I want to share with you an architectural approach that takes into account the reasons why many data platform initiatives fail. I will show how we can adapt and apply the lessons of the past decade of building distributed architectures to the data field. I call this new architectural approach Data Mesh.

Before reading further, I ask you to try to set aside the assumptions instilled by the current paradigm of traditional data platform architecture. Be open to the possibility of moving from centralized Data Lakes to a deliberately distributed Data Mesh architecture. Accept that data is inherently distributed and ubiquitous.

The current architecture of the data platform in a large company


Let's talk about the centralized, monolithic and domain-agnostic data platform, also known as the Data Lake.

Almost every client I work with is either planning or already building their third-generation data platform, having recognized the mistakes of the previous generations:

  • First generation: proprietary enterprise data warehouses and business intelligence platforms. These were expensive solutions that left companies with an equally large amount of technical debt: thousands of unmaintainable ETL jobs, tables and reports that only a small group of specialists understands, which undermines the positive impact this functionality could have on the business.
  • Second generation: Big Data ecosystems with a Data Lake as the silver bullet. A complex big data ecosystem with long-running batch jobs, supported by a central team of highly specialized data engineers and, at best, used for R&D analytics.

Third-generation data platforms are more or less similar to the previous generation, but with a shift towards:

  1. streaming to provide real-time data availability, with an architecture like Kappa,
  2. combining batch and streaming processing to transform data, using frameworks such as Apache Beam (a brief sketch follows this list),
  3. use of cloud services for data storage and processing, and cloud Machine Learning platforms.
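
To make point 2 more concrete, here is a minimal sketch assuming the Apache Beam Python SDK: the same transform code runs over a bounded, in-memory collection here, and swapping the source for an unbounded one (for example a Kafka topic) would turn it into a streaming pipeline without changing the business logic. The event fields and window size are illustrative, not taken from the article.

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Illustrative play events; in a streaming setup these would arrive continuously.
play_events = [
    {"song_id": "s1", "user_id": "u1", "ts": 0},
    {"song_id": "s1", "user_id": "u2", "ts": 30},
    {"song_id": "s2", "user_id": "u1", "ts": 95},
]

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(play_events)               # bounded source; a Kafka reader here = streaming
        | "Stamp" >> beam.Map(lambda e: TimestampedValue(e, e["ts"]))
        | "Window" >> beam.WindowInto(FixedWindows(60))    # fixed 60-second windows
        | "KeyBySong" >> beam.Map(lambda e: (e["song_id"], 1))
        | "CountPlays" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)                       # ("s1", 2) for the first window, etc.
    )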

The third-generation data platform addresses some of the gaps of the previous generations, such as the lack of real-time data analytics, and also reduces the cost of managing big data infrastructure. However, many of the underlying characteristics that led to the failures of the previous generations remain.

Figure 1: Three Generations of Data Platforms

Problematic architectural approaches


To uncover the fundamental limitations shared by all generations of data platforms, let's look at their architecture and characteristics. In this article I will use the business of streaming Internet media (such as Spotify, SoundCloud or Apple iTunes) as an example to explain some of the concepts.

Centralized and monolithic


From a height of 10,000 meters, the architecture of the data platform looks like Figure 2 below.
Figure 2: The monolithic data platform, viewed from a height of 10,000 meters

The central part of the architecture is responsible for:

  • Ingesting data from all corners of the enterprise: the operational and transactional systems of the business domains as well as external data providers. In our media streaming example this includes media player performance, how users interact with the players, the songs they play, the artists they follow, the labels and artists the business has onboarded, financial transactions with artists, and external market research data such as customer demographics.
  • Cleansing, enriching and transforming the source data into trustworthy data that can address the needs of a diverse set of consumers. In our example, one of the transformations turns click streams of user interactions into meaningful sessions enriched with user details.
  • Serving the prepared datasets to a variety of consumers with diverse needs: exploratory analytics, machine learning, BI reports. In our example, the platform can serve near real-time information about player errors around the globe through a distributed log interface such as Kafka, or static aggregated views of a particular artist's plays to drive payment calculations for artists and labels.

By default, the generally accepted (if unspoken) convention is that the monolithic data platform stores and owns the data belonging to different business domains: 'play events', 'sales KPIs', 'artists', 'albums', 'labels', 'audio', 'podcasts', 'music events', etc. — data from a large number of disparate domains.

Despite the fact that over the past decade we have successfully applied the concept of Domain Driven Design (and its key Bounded Context pattern) to the design of our information systems, we have largely ignored these concepts in the design of data platforms. We have moved from data ownership at the business domain level to data ownership regardless of business domains. We are proud to have created the biggest monolith of all: the Big Data Platform.

Figure 3: A centralized data platform with no clear boundaries between data from different business domains and no ownership of that data by the business domains

Such a centralized model can work for small organizations that have simple business domains and limited data consumption needs. But it is not suitable for large companies with large and complex business domains, a large number of data sources and diverse consumer needs for working with data.

There are two weak links in the architecture and structure of a centralized data platform, which often lead to failure in the process of building it:

  • A large number of sources and large volumes of data. As more and more data becomes available everywhere, the ability to ingest and harmonize all of it in one place, under the control of one central platform, diminishes. The central team cannot keep up with understanding and loading data from every new source in the organization, and data scientists and analysts still end up hunting for the data they need.
  • A growing number of data consumers and use cases. Organizations need to experiment and iterate quickly, which means an ever-growing number of scenarios for consuming the data and an ever-growing number of transformations: aggregates, projections and slices of the data. The long lead time required to satisfy each new consumer's needs becomes a constant point of organizational friction.

Here I need to clarify that I am not advocating fragmented, disparate data hidden in the depths of legacy systems — data that is difficult to discover, understand and use. Nor do I support the numerous disparate data warehouses within one organization that are the result of years of accumulated technical debt. But I argue that the answer to this inaccessible, fragmented data is not a centralized data platform with a centralized team that stores and owns the data of all business domains.

This approach does not scale in large organizations, as shown above.

Highly coupled pipeline decomposition


Figure 4: Architectural decomposition of the data platform

The second problem with the traditional data platform architecture is how we decompose it. Descending to 3,000 meters above the data platform architecture, we find an architectural decomposition around the functions of ingesting, cleansing, aggregating, serving data, and so on. As described in the previous section, connecting new sources and new consumers requires the platform to grow. Architects must find a way to scale the system by breaking it down into architectural quanta. An architectural quantum, as described in the book "Building Evolutionary Architectures", is an independently deployable component with high functional cohesion that includes all the structural elements necessary for the system to function correctly. The motivation for dividing a system into architectural quanta is primarily to create independent teams, each of which builds and maintains its own architectural quantum (functional subsystem). This allows work to be parallelized, increasing speed and operational scalability.

Influenced by previous generations of data platforms, architects divide the platform into a series of data processing steps. This is a pipeline that implements data processing: loading, preparing, aggregating, providing access / unloading, etc.

Although this partitioning provides a certain level of scaling, it also has an inherent limitation that slows down the delivery of new functionality on the platform: high coupling between the steps of the pipeline, which does not give individual teams the independence they need.

Let's return to our streaming media example. Streaming media platforms on the Internet have a strong domain structure around the type of media they offer. They often start with "songs" and "albums", then expand to "music events", "podcasts", "radio shows", "films", etc. Enabling one new feature, for example visibility of the podcast play rate, requires changes in all components of the pipeline. Teams have to develop new services for loading, cleansing and preparing data (including aggregation) just to add visibility of the podcast play rate. This requires synchronizing releases across the various functional teams. Many data platforms use configuration-driven ingestion tools that easily handle tasks like simply adding new sources or extending existing ones, but this does not remove the need for end-to-end release management across all stages of the data processing pipeline. To give users access to any new data, the smallest architectural unit that has to change is the entire pipeline. And this significantly limits our ability to increase the speed and scale of data platform development in response to new data sources and new consumers.

Disparate and highly specialized teams


The third problem with modern data platforms is how we structure the teams that create and maintain the platform. When we go low enough over the architecture of a traditional data platform, we will see a group of narrowly specialized data engineers separated from those organizational units in which data is created or used for decision-making. Data platform engineers are singled out in separate teams only on the basis of their technical competencies and experience with big data technologies. Business knowledge of the corresponding subject areas (business domains) is absent in such teams.

Figure 5: Scattered narrowly specialized data platform teams

Personally, I do not envy the life of a data platform engineer. They have to receive data from teams that have no incentive to provide correct, high-quality data. They lack an understanding of the business meaning of the data they have to ingest. And they must prepare data for analytical and operational needs without a clear understanding of how that data will ultimately be used and without access to the domain experts who consume it.

It should be noted that we have encountered a similar problem of team separation before, and we managed to find a successful solution to it.


In our multimedia streaming example, we have a "media player" team that owns data on how users interact with the player: the songs users listen to, the purchases they make, the audio quality of what they listen to, etc. On the other side are the teams that consume this data: the song recommendations team, the sales monitoring team, the artist payments team, and so on. And between them sits a long-suffering data platform team that, at the cost of great effort, receives data from one side and provides access to it (after preliminary processing) to all the consumers.

In reality, we end up with disengaged data source teams and frustrated data consumer teams fighting for a place at the top of the data platform team's backlog.

We have created an architecture and organizational structure that does not provide the necessary scalability and is not able to achieve the goals of building a data-driven organization.

Next-Generation Data Platform Architecture


So what is the solution to the problems we discussed above? In my opinion, a paradigm shift is needed: a paradigm shift at the intersection of methods that have played an important role in building modern scalable distributed architectures, that the technology industry as a whole has adopted at an accelerated pace, and that have delivered successful results.

I believe that the next enterprise data platform architecture lies in the convergence of distributed domain driven architecture, self-service platform design, and product thinking applied to data.

Figure 6: The paradigm shift for the next-generation data platform

I understand that this may sound like a lot of buzzwords in one sentence, but each of these components has had an incredibly positive impact on changing the technical foundations of our information systems. Let's see how we can apply each of these disciplines to the data world in order to move away from the current paradigm that has been transferred from many years of building data warehouses of previous generations.

Data and distributed domain driven architecture


Decomposition and ownership of data based on business domain orientation


Eric Evans' book, Domain-Driven Design , has had a profound impact on contemporary architectural thinking and, therefore, organizational modeling. The new microservice architecture decomposed information systems into distributed services that are built within the boundaries of specific business domains. This fundamentally changed the way teams are formed: from now on, a team can independently and autonomously own its microservices.

Interestingly, we have ignored the concept of business domains in the field of data. The closest that data platform architecture has come to applying Domain Driven Design is the emergence of business domain events in information systems and their loading into monolithic data platforms. However, once the data is loaded into the centralized storage, the concept of different teams owning the data of different business domains is lost.

To decentralize a monolithic data platform, you need to change the way you think about data, its location and ownership. Instead of transferring data to a Data Lake or platform, domains should store and maintain their data sets in a user-friendly way.

In our example, instead of loading data from the media player into a centralized repository for further processing by the central team, why not store and process these datasets within the domain itself and give any other team access to them? The physical location where these datasets are stored can be implemented technically however you wish; of course, you can use centralized infrastructure, but the media player data itself remains under the ownership and support of the team of the domain in which it is generated. Similarly, in our example, the song recommendations domain can create datasets in the format best suited to its use (for example, as graph structures) based on the media player data. If other teams find this format convenient and useful, they can also access it.

This, of course, implies that we can duplicate data in different domains when we change their format to one that suits a particular consumer.

All this requires a shift in our thinking: from centrally ingesting data (via ETL or streaming) to each domain serving its own data, with this process scaled out across all domains. The architectural quantum in a domain-oriented data platform is the business domain, not a data loading or transformation stage.

Figure 7: Decomposition of an architecture based on business domains and data-owning teams.

Source domain datasets


Some business domains align well with the data sources (information systems). In the ideal case, an information system and the team maintaining it are responsible not only for adding and supporting business functionality, but also for providing datasets describing the facts and reality of the corresponding business domain. However, at the scale of a large organization there is, as a rule, no one-to-one correspondence between a business domain and an information system. Usually, for each domain there are several information systems that automate different business processes of that domain and, accordingly, store data related to it. For such domains there is a need to integrate and aggregate the disparate data in order to obtain datasets that are consistent and aligned across the entire business domain.

The best format for storing the facts that describe a business domain is Domain Events. They can be stored as a distributed, timestamped event log, to which authorized consumers can be granted access.

In addition to these logs, source domains should also provide periodic snapshots of the key datasets of their domain, aggregated over a time interval that best reflects the rate of change in that domain (typically a day / week / month / quarter, etc.).
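
As a small illustration of these two forms, here is a sketch in Python; the field names and the daily aggregation rule are assumptions made for the example, not part of the original article.

import json
from collections import Counter
from dataclasses import dataclass, asdict
from datetime import date, datetime, timezone

@dataclass(frozen=True)
class PlayEvent:
    """An immutable fact of the 'play events' source domain."""
    user_id: str
    song_id: str
    played_at: str  # event timestamp, ISO 8601, UTC

# One domain event, serialized as it would be appended to a distributed log
# (for example a Kafka topic owned by the source domain).
event = PlayEvent("u1", "s1", datetime.now(timezone.utc).isoformat())
log_record = json.dumps(asdict(event))

def daily_snapshot(events: list[PlayEvent], day: date) -> dict[str, int]:
    """A periodic snapshot derived from the event log: plays per song for one day."""
    return Counter(e.song_id for e in events if e.played_at.startswith(day.isoformat()))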

Please note that business domain data sets prepared for consumers should be separated from internal data sets of sources (which information systems use for their work). They should be stored in a physically different place suitable for working with big data. Next, it will be described how to create such a data warehouse and service infrastructure for it.

Source domain datasets made available for consumption are the most foundational elements of the entire architecture. They are not transformed or tailored to a particular consumer; they are raw, unprocessed data.

Consumer domain datasets


Other domains are closely aligned with data consumers. The datasets of such a domain are built to fit a related set of usage scenarios. They differ from source domain datasets: this is not raw data but data that has passed through several stages of transformation, with a structure and representation tailored to the specific cases in which it is used — in other words, an analogue of the specialized data marts of a centralized warehouse. It should be possible to rebuild such consumer domain datasets quickly from the raw data.

Distributed data pipelines implemented within their domains


The ownership of data in our new architecture is delegated from the central platform to teams within business domains, but the need for data cleaning, preparation and aggregation (using the data pipeline) does not disappear. Therefore, the implementation of its own data pipeline becomes an internal task of the business domain team. As a result, we get our own domain data pipelines distributed across all domains.

For example, source domains should include data cleansing, de-duplication and enrichment in their pipelines, so that other domains can use this data without additional processing. Each such dataset must meet its Service Level Objectives for data quality.
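
As a rough sketch of what such a domain-internal step might look like (plain Python, with made-up cleansing rules and an illustrative completeness threshold):

def clean(events):
    """De-duplicate play events and drop records missing a mandatory field."""
    seen, cleaned = set(), []
    for e in events:
        key = (e.get("user_id"), e.get("song_id"), e.get("played_at"))
        if key in seen or not e.get("song_id"):
            continue
        seen.add(key)
        cleaned.append(e)
    return cleaned

def meets_quality_slo(raw, cleaned, min_retained=0.99):
    """Example SLO check: at least 99% of raw events must survive cleansing."""
    return len(cleaned) / max(len(raw), 1) >= min_retained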

Similarly, the stages of the centralized pipeline that built specialized data marts move into the consumer domains' own pipelines, which build the consumer domain datasets.

Figure 8: Distributed data processing pipelines implemented within their domains

It might seem that such a model will lead to a large duplication of efforts in each domain to create its own implementation of a data processing pipeline. We will talk about this issue in the “centralized data infrastructure as a platform” section.

Data and Product Thinking


Transferring ownership of data, and responsibility for developing and maintaining data pipelines, to the business domains may raise serious concerns about the availability and usability of such distributed datasets. This is where product thinking applied to data comes in handy.

Domain Data as a Product


Over the past ten years, product thinking has deeply penetrated the development of organizations' information systems and seriously transformed how that development is approached. Domain teams deliver new capabilities as APIs that developers across the organization use as building blocks to create higher-order functionality of higher value. Teams strive to create the best possible experience for the users of their APIs: clear and detailed documentation that is easy to access, test environments, and carefully tracked quality indicators.

For a distributed data platform to be successful, the data teams of the business domains must apply product thinking to the datasets they provide: treat the data they prepare as a product, and its consumers (analysts, data scientists, data engineers, ML specialists, etc.) as their customers.

Figure 9: Characteristics of Domain Datasets as Products

Consider our example of streaming media content over the Internet. One of its most important business domains is play history: who listened to which songs, where and when. This domain has several key data consumers within the organization. One needs the data in near real-time to study the user experience and to detect any playback problems and errors promptly. Others are interested in historical snapshots aggregated by day or month. So our domain provides the data in two formats: play events in streaming form (a topic in Kafka or something similar) and aggregated play events in batch form (a file, a table in Hive, etc.).

To provide the best user experience for consumers, business domain data products must have the following key features.

Ease of discovery (discoverable)


It is necessary to ensure that any data product can be easily found. The most common implementation of this requirement is a registry: a catalog of all available data products with the necessary meta-information (owners, sources of origin, sample datasets, update frequency, dataset structure, etc.). Such a centralized service allows data consumers to easily find the datasets they are interested in. Every data product from every business domain must be registered in this central data catalog.
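
The shape of such a catalog entry might look roughly like the sketch below; the registry API, address scheme and field names are hypothetical, used only to illustrate the kind of metadata a domain team would publish.

# Metadata a domain team publishes so its data product can be discovered.
product_entry = {
    "name": "play-events",
    "domain": "media-player",
    "owner": "media-player-data-team@example.com",
    "description": "Cleansed play events: who played which song, where and when",
    "address": "datamesh://media-player/play-events/v1",   # see 'addressable' below
    "schema_ref": "schemas/play_event_v1.json",
    "update_frequency": "near real-time",
    "sample": [{"user_id": "u1", "song_id": "s1", "played_at": "2019-05-01T12:00:00Z"}],
}

# catalog.register(product_entry)  # hypothetical call into the platform's catalog service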

Please note the shift from a single centralized platform that owns all the data to distributed data products of different business domains, registered in a single data catalog.

Unique address (addressable)


Each data product must have a unique address (following a global convention) that allows consumers to access it programmatically. Organizations can adopt different conventions for the naming of data products and their location, depending on the available methods of physical data storage and the formats of the data itself. In a distributed, decentralized architecture such common conventions are necessary: standards for dataset addresses remove friction when finding and accessing data products.
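
One possible convention is sketched below; the URI scheme and the resolution rule are purely illustrative, not a standard proposed by the article.

def resolve(address: str) -> str:
    """Map a logical data product address to its physical storage location."""
    # Convention assumed here: datamesh://<domain>/<product>/<version>
    domain, product, version = address.removeprefix("datamesh://").split("/")
    return f"s3://data-products/{domain}/{product}/{version}/"

print(resolve("datamesh://media-player/play-events/v1"))
# -> s3://data-products/media-player/play-events/v1/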

Data quality


No one will use a product they cannot trust. In current-generation data platforms it is common to load and publish data that contains errors and does not reflect the full business truth, i.e. data that cannot be trusted. This is exactly where a large share of ETL jobs is concentrated: cleaning data after it has been loaded.

The new architecture requires the owners of data products to commit to SLOs (Service Level Objectives) for the accuracy, reliability and timeliness of the data. To ensure acceptable quality, methods such as data cleansing and automated data integrity testing must be applied at the point where the data product is created. Data lineage information in the metadata of each data product gives consumers additional confidence in the product itself and its suitability for their specific needs.

The target value (or acceptable range) of a data quality indicator varies between the data products of different business domains. For example, the 'play events' domain can provide two different products: one in near real-time with a lower level of accuracy (possibly with missing or duplicate events), and a second with a longer delay and a higher level of data quality. Each data product defines and maintains the target level of integrity and reliability of its data as a set of SLOs (Service Level Objectives).
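
Such SLOs can be expressed declaratively; the sketch below uses invented targets for the two 'play events' products purely as an illustration.

# Illustrative quality targets for the two products of the 'play events' domain.
slos = {
    "play-events-stream": {"max_delay_seconds": 60,    "min_completeness": 0.95},
    "play-events-daily":  {"max_delay_seconds": 86400, "min_completeness": 0.999},
}

def within_slo(product: str, delay_seconds: float, completeness: float) -> bool:
    """Check a measured delay and completeness against the product's declared SLO."""
    target = slos[product]
    return (delay_seconds <= target["max_delay_seconds"]
            and completeness >= target["min_completeness"])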

Clear description of semantics and data syntax


Quality products are easy to use. Making data products as easy as possible for analysts, engineers and data scientists to use requires well-described data semantics and syntax, ideally accompanied by sample datasets as examples.

Data interoperability and organization-wide standards


One of the main problems in a distributed domain driven data architecture is the need to integrate data from different domains. The key to easy and efficient data integration between domains is to define and follow rules and standards. Such standards should be defined at the organization level. Standardization is required in the field of determining acceptable data types and rules for their application, conventions on the names and addresses of data products, metadata formats, etc.

For entities that can be stored in different forms and with different sets of attributes in different domains, it is necessary to apply the practice of Master Data Management: assign them global identifiers and harmonize the set of attributes and, most importantly, the attribute values across all domains.

Ensuring the interoperability of data for effective integration, and defining organization-level standards for storing and presenting data products, is one of the fundamental principles for building such distributed systems.

Data Security and Access Control


Ensuring secure access to data is mandatory regardless of whether the architecture is centralized or not. In a world of decentralized, business-domain-oriented data products, access control can (and should) be applied with a finer degree of granularity, for each data product. Data access policies can be defined centrally but enforced separately for each data product. A convenient way to implement access control to datasets is to use an enterprise identity management system and role-based access control.
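
A toy sketch of this "centrally defined, locally enforced" idea is shown below; the roles, product names and policy shape are assumptions made for the example.

# Policies defined in one place...
policies = {
    "media-player/play-events": {"read": {"analyst", "recommendations-service"}},
    "finance/artist-payments":  {"read": {"finance-analyst"}},
}

def can_read(product: str, caller_roles: set[str]) -> bool:
    """...but enforced at each data product when a caller asks for access."""
    allowed = policies.get(product, {}).get("read", set())
    return bool(allowed & caller_roles)

print(can_read("media-player/play-events", {"analyst"}))  # True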

Next, a single infrastructure will be described, which allows you to easily and automatically implement the above features for each data product.

Cross-functional business domain data team


Teams providing data in the form of data products should include the following roles: data product owner and data engineer.

The data product owner is responsible for the vision, roadmap and life cycle of their products. They measure the satisfaction of their customers and continuously measure and improve the quality of their business domain's data. They also fill and prioritize the backlog of their data products against the requirements of data consumers.

Data product owners must also define key metrics and performance indicators (KPIs) for their products. For example, the time it takes a new user to become familiar with and start using a data product can be one such metric.

In order to create and maintain their own data pipelines inside a business domain, the team must include data engineers. A good side effect of this will be the spread of the relevant skills within the business domain. According to my observations, at present some data engineers, although competent in their tools and technologies, lack knowledge of standard software development practices when it comes to creating data products — first of all DevOps practices such as continuous delivery and automated testing. On the other hand, software developers who build information systems often do not have enough experience with the technologies and tools for working with data as a product. Combining them into cross-functional teams within the business domain will produce specialists with a broader profile. We observed something similar during the rise of DevOps, when new types of engineers appeared, such as SRE.

Figure 10: Cross-functional business domain data team

Centralized data infrastructure as a platform


One of the sensitive aspects of a distributed, domain-driven data platform architecture is the duplication in every domain of the effort and skills needed to operate the infrastructure and technology stack used in data pipelines. Fortunately, building common infrastructure as a platform is a problem that IT has learned to solve well (though not yet in the field of working with data).

The data infrastructure team must own and provide as a service the tools necessary for business domains to collect, process and store their data products.

Figure 11: Data Infrastructure as a Platform

The data infrastructure as a platform should be free of any domain-specific concepts or business logic. It should also hide the complexity of its implementation from users and provide as much of its functionality as possible in self-service mode. Here is a list of some of the capabilities that a centralized data infrastructure as a platform should provide:

  • Scalable data storage in various formats
  • Data encryption (as well as hashing, anonymization, etc.)
  • Data product versioning
  • Storing data product schemas
  • Data access control
  • Logging
  • Orchestration of data processing flows / jobs
  • In-memory caching
  • Storing metadata and data lineage
  • Monitoring, alerting, logging
  • Calculation of quality metrics for data products
  • Data catalog maintenance
  • Standards and policies, with the ability to enforce compliance
  • Data product addressing
  • CI / CD pipelines for data products

When building a centralized data infrastructure, it is necessary to ensure that creating a data product on top of it takes as little time as possible. Maximum automation of the key functionality is therefore very important: for example, the ability to ingest data through simple configuration, automatic registration of data products in the data catalog, and so on. Using cloud infrastructure can reduce operating costs and increase the speed of providing access to data infrastructure on demand.
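
In that spirit, a domain team's interaction with the platform could be reduced to a short declarative specification, along the lines of the sketch below; the spec format and the provisioning call are hypothetical, meant only to show the self-service style described above.

# A declarative data product spec a domain team might hand to the platform,
# which would then provision storage, pipelines, catalog registration and access control.
product_spec = {
    "domain": "media-player",
    "name": "play-events",
    "outputs": [
        {"type": "stream", "format": "json", "retention_days": 7},
        {"type": "batch", "format": "parquet", "schedule": "daily"},
    ],
    "quality": {"min_completeness": 0.99},
    "access": {"read": ["analyst", "recommendations-service"]},
    "register_in_catalog": True,
}

# platform.provision(product_spec)  # hypothetical self-service entry point owned by the platform team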

Paradigm shift towards Data Mesh


It was a long read! Let's briefly summarize everything written above. We examined some key characteristics of today's data platforms: centralized and monolithic, with complex data pipelines (hundreds and thousands of tightly coupled jobs) and disparate, highly specialized teams. Then we discussed the new Data Mesh approach: distributed, business-domain-oriented data products managed by cross-functional teams (with data product owners and data engineers), hosted on a common data infrastructure provided as a platform.

Data Mesh is a distributed architecture with centralized governance and standards that ensure data interoperability, and with a centralized self-service infrastructure. I hope it is quite obvious to the reader that such an architecture is very far from a set of disconnected stores of inaccessible data developed independently by different departments.

Figure 12: Data Mesh Architecture from 10,000 Meters

You may ask: where do the Data Lake or the Data Warehouse fit into this architecture? They are simply individual nodes (domains) in this distributed architecture. There is a high probability that in such an architecture we will no longer need a Data Lake at all: after all, we will be able to explore the original data of the different business domains, packaged as data products.

Accordingly, Data Lake is no longer the central element of the entire architecture. But we will continue to use the technologies and tools used to build Data Lake, either to create a common data infrastructure, or for the internal implementation of our data-products.

This actually brings us back to where it all began: James Dixon, when he coined the term in 2010, intended a Data Lake to be used for a single business domain, while several data domains would form a "Water Garden".

The main paradigm shift is to treat business domain data products as the first-order concern, and tools and technologies as a second-order concern (an implementation detail). It moves the mental model away from a centralized Data Lake towards an ecosystem of data products that integrate with each other seamlessly and efficiently.

A few words about reporting and visualization (BI tools, etc.). The same principle applies to them: in this architecture they are separate nodes, i.e. independent data products within a business domain, oriented primarily towards the consumer rather than the data source.

I admit that although I see my clients applying the Data Mesh principles successfully, scaling these principles across large organizations still has a long way to go. But technology is clearly not the limitation here. All the tools we use today can serve the distribution and ownership of data products by different teams just as well. In particular, the move towards standardizing batch and stream processing jobs, and the use of tools such as Apache Beam or Google Cloud Dataflow, makes it easy to process a variety of datasets with unique addresses.

Data catalog platforms, such as Google Cloud Data Catalog, provide discoverability, access control and centralized management of the datasets of distributed business domains. The wide range of cloud platforms gives business domains a choice of storage suited to their data products.

The need for a paradigm shift is obvious, and all the necessary technologies and tools for it exist. Business executives and data professionals must acknowledge that the current Big Data paradigm and the approach of one big Data Lake platform will only repeat the failures of the past, just with new cloud technologies and tools.

Let's move from a centralized monolithic data platform to an ecosystem of data products.

