The data dichotomy: rethinking the relationship between data and services

Hello everyone! We have great news: in June OTUS launches the "Software Architect" course again, and as usual we are sharing some useful material with you.




If you encountered the whole microservices story without any context, you could be forgiven for finding it a bit strange. Splitting an application into fragments connected by a network inevitably means adding complex failure modes to the resulting distributed system.

Although this approach does mean splitting a system into many independent services, the end goal is far more than just running those services on different machines. The point is interaction with the outside world, which is itself distributed in its essence: not in the technical sense, but in the sense of an ecosystem made up of many people, teams, and programs, each of which must somehow do its job.

Companies, for example, are sets of distributed systems that together work toward a common goal. We ignored this fact for decades, trying to achieve integration by transferring files over FTP or using enterprise integration tools, while staying focused on our own isolated goals. With the advent of services, everything changed. Services helped us look beyond the horizon and see a world of interdependent programs working together. But to work successfully, we have to recognize and design for two fundamentally different worlds: the outside world, where we live in an ecosystem of many other services, and our personal, inner world, where we rule alone.



This distributed world is different from the one we grew up in and are used to. The principles behind traditional monolithic architecture simply do not hold up here. So getting these systems right is about more than drawing a class diagram on a whiteboard or building a cool proof of concept; it is about such a system working successfully over the long haul. Fortunately, services have been around for quite a while, even if they used to look different. The lessons of SOA are still relevant, even when seasoned with Docker, Kubernetes, and slightly scruffier hipster beards.

So today we will look at how the rules have changed, why we need to rethink our approach to services and the data they pass to each other, and why we need completely different tools for that.

Encapsulation won't always be your friend


Microservices can operate independently of each other, and it is this property that gives them most of their value. The same property lets services scale and grow: not so much scaling to quadrillions of users or petabytes of data (although they can help there too), but scaling in terms of people, since teams and organizations grow continuously.



However, independence is a double-edged sword. A single service on its own can be spun up easily and naturally. But if a feature implemented inside one service requires another service to change, we end up having to modify both services almost simultaneously. In a monolith this is easy: you make the change and ship it in the next release. Synchronizing independent services is much harder, and the coordination between teams and release cycles destroys agility.



The standard approach is simply to avoid these annoying cross-cutting changes by dividing functionality cleanly between services. A single sign-on service is a good example here. It has a clearly defined role that sets it apart from other services. That clear separation means that even as the requirements of the surrounding services change rapidly, the single sign-on service itself is unlikely to change. It lives within a strictly bounded context.



The problem is that in the real world, business services cannot maintain such a clean separation of roles. Business services inevitably work with data that comes from other, similar services. If you are in online retail, the order stream, the product catalog, and the user information become a requirement for many of your services. Each of those services needs access to this data to do its job.


Most business services use the same data stream, so their work is always intertwined.

This brings us to an important point worth talking about: while services work well for infrastructure components that operate largely in isolation, most business services are far more intertwined.

Data dichotomy


Service-oriented approaches have been around for a while, but they still offer little guidance on how to share large amounts of data between services.

The core problem is that data and services are inseparable. On the one hand, encapsulation encourages us to hide data so that services stay decoupled from one another, which makes it easier for them to grow and change. On the other hand, we need to be able to freely work with and query shared data just like any other data set, to get on with the job immediately, as freely as in any other information system.

Information systems, however, have little to do with encapsulation. In fact, quite the opposite: databases do everything they can to expose the data they store. They come with a powerful declarative interface that lets you shape the data however you need. That is exactly what you want at the exploratory stage, but not for managing the ever-growing complexity of an evolving service.



And here the dilemma arises. A contradiction. A dichotomy. Information systems are about exposing data; services are about hiding it.

These two forces are fundamental. They underpin much of what we do, constantly competing for dominance in the systems we build.

As service systems grow and evolve, we see the effects of the data dichotomy play out in two ways. Either a service interface grows, offering an ever-wider range of functions until it starts to look like a rather fancy home-grown database, or frustration sets in and we implement some way to extract and move whole data sets in bulk from service to service.



Creating something that looks like a fancy home-grown database leads, in turn, to a whole set of problems. We won't go into detail here about why a shared database is dangerous; suffice it to say that it imposes significant, costly engineering and operational difficulties on the company trying to run it.

Worse, data volume amplifies the problems of service boundaries. The more shared data sits inside a service, the more complicated its interface becomes and the harder it is to combine data sets coming from different services.

The alternative, extracting and moving entire data sets, has its own problems. A common approach is simply to extract the whole data set and keep a local copy of it in each consuming service.



The problem is that different services interpret the data they consume differently. The data is always close at hand; it gets modified and processed locally, and pretty soon it no longer has much in common with the data at the source.


The more mutable the copies, the more the data will vary over time.

Even worse, this data is hard to fix retrospectively (MDM can genuinely help here). In fact, some of the most intractable technology problems a business faces stem from heterogeneous data multiplying from application to application.

To solve the problem of shared data, we need to think differently. Shared data should become a first-class citizen of the architectures we build. Pat Helland calls such data "data on the outside", and this is a crucial distinction. We need encapsulation so as not to expose a service's internal structure, but we also need to make it easy for services to access shared data so they can do their jobs properly.



The problem is that none of today's approaches really helps, since neither service interfaces, nor messaging, nor a shared database offers a good way of working with data on the outside. Service interfaces are poorly suited to exchanging data at any real scale. Messaging moves data but keeps no history of it, so data gets corrupted over time. Shared databases concentrate everything in one place, which stifles progress. We inevitably get stuck in a cycle of data inadequacy:


The cycle of data inadequacy

Streams: a decentralized approach to data and services


Ideally, we need to change the way services work with shared data. At this point either approach runs into the dichotomy described above, and there is no magic dust we can sprinkle over it to make it go away. However, we can reframe the problem and arrive at a compromise.

The compromise involves a degree of centralization. We can use a distributed log, because it gives us reliable, scalable streams. We then want services to be able to join and operate on these shared streams, but we want to avoid complicated, centralized God Services doing that processing for them. So the best option is to embed stream processing into each consuming service. That way, services can combine data sets from different sources and work with them however they need.

One way to achieve this is to use a streaming platform. There are many options, but today we will look at Kafka, since its Stateful Stream Processing lets us tackle the problem effectively.
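To make this concrete, here is a minimal sketch of what embedding a stream processor into a single service might look like with Kafka Streams (Java). The topic name "orders", the application id "order-service", the output topic, and the filtering logic are hypothetical, chosen only for illustration; the point is that the processing lives inside the consuming service rather than in a central component.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrderServiceProcessor {
    public static void main(String[] args) {
        // Configuration for the stream processor embedded in this one service.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-service");          // hypothetical service name
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // The service subscribes to a shared stream; what it does with it stays private.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");                // hypothetical shared topic
        orders.filter((orderId, order) -> order.contains("\"status\":\"CREATED\""))
              .to("order-service.created-orders");                                // service-local output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}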



Using a distributed log lets us follow the well-trodden path of messaging and event-driven architecture. This approach is generally considered to give better scaling and decoupling than request-response, because it puts flow control in the hands of the receiver rather than the sender. But everything in life has a price, and here the price is a broker. For large systems the trade-off is worth it (which cannot be said of your average web application).
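As a simple illustration of the event-driven side, a service publishes the fact that something happened to a topic on the broker and moves on; it does not know or care who consumes the event or when. A minimal sketch with the plain Kafka producer API; the topic name, key, and event payload are hypothetical:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class OrderEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The sender only records the fact; receivers pull it at their own pace.
            String key = "order-42";                                              // hypothetical order id
            String event = "{\"orderId\":\"order-42\",\"status\":\"CREATED\"}";
            producer.send(new ProducerRecord<>("orders", key, event));            // hypothetical topic
        }
    }
}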

If the broker is a distributed log rather than a traditional messaging system, you get additional features. The transport scales almost as linearly as a distributed file system. Data can be kept in the log for long periods, so we get not just messaging but storage as well: scalable storage without the fear of mutable shared state.
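To hint at how "messaging plus storage" looks in practice: in Kafka, retention is just topic configuration. The sketch below, assuming Kafka's AdminClient and hypothetical topic names and sizing, creates one compacted topic (the latest value per key is kept indefinitely) and one topic with unlimited retention (the full history is kept).

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SharedStreamTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Compacted topic: the log keeps the latest state of every product key.
            NewTopic products = new NewTopic("products", 12, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));

            // Event topic with unlimited retention: the log keeps the full order history.
            NewTopic orders = new NewTopic("orders", 12, (short) 3)
                    .configs(Map.of("retention.ms", "-1"));

            admin.createTopics(List.of(products, orders)).all().get();
        }
    }
}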

You can then use stateful stream processing to add declarative, database-like tooling to the consuming services. This is a very important point. While the data lives in shared streams that all services can access, the joins and processing a service performs are private to it, isolated within its own strictly bounded context.


Tame the data dichotomy by separating out an immutable stream of state, then add this capability to each service using Stateful Stream Processing.

So if your service needs to work with orders, the product catalog, or stock, it has full access to them: you alone decide which data to combine, where to process it, and how it should change over time. Even though the data is shared, working with it is completely decentralized. It happens inside each individual service, in a world where you make the rules.
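Here is a sketch of what such a private, service-local combination might look like in Kafka Streams, assuming hypothetical "orders" and "products" topics with string keys and JSON string values (and default string serdes configured as in the earlier sketch); the extractProductId helper is also hypothetical.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderEnrichment {

    // Builds this service's own view: the shared order stream joined with the shared product catalog.
    public static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> orders = builder.stream("orders");     // shared stream of order events
        KTable<String, String> products = builder.table("products");   // latest state per product key

        // Re-key orders by product id so they line up with the catalog table,
        // then join; the resulting view belongs to this service alone.
        orders.selectKey((orderId, orderJson) -> extractProductId(orderJson))
              .join(products, (orderJson, productJson) -> orderJson + " | " + productJson)
              .to("order-service.enriched-orders");                    // service-local output topic
    }

    // Hypothetical helper: pulls the product id out of the order payload.
    private static String extractProductId(String orderJson) {
        int start = orderJson.indexOf("\"productId\":\"") + "\"productId\":\"".length();
        return orderJson.substring(start, orderJson.indexOf('"', start));
    }
}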


Share data without compromising its integrity. Encapsulate the function, not the data source, in every service that needs it.

Sometimes data does need to be moved in bulk. A service may need a local historical data set in the database engine of its choice. The trick is that you can guarantee that, when needed, a copy can be rebuilt from the source by going back to the distributed log. Kafka's connectors do a great job of this.
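As a sketch of the "rebuild from the log" idea, here is what restoring a local copy with the plain Kafka consumer API might look like; in practice you would more likely use Kafka Connect or Kafka Streams for this. The group id, the stop condition, and the last-write-wins map are illustrative assumptions.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class LocalViewRebuilder {

    // Replays a topic from the beginning and keeps the latest value per key,
    // reconstructing the current state of the data set as a local map.
    public static Map<String, String> rebuild(String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "local-view-rebuilder");        // hypothetical, fresh consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");           // start from the very beginning of the log

        Map<String, String> localCopy = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of(topic));
            ConsumerRecords<String, String> records;
            // Stop at the first empty poll; good enough for a one-off rebuild sketch.
            while (!(records = consumer.poll(Duration.ofSeconds(2))).isEmpty()) {
                for (ConsumerRecord<String, String> record : records) {
                    localCopy.put(record.key(), record.value());
                }
            }
        }
        return localCopy;
    }
}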

So, the approach considered today has several advantages:

  • Data is used in the form of shared streams that can be kept in the log for a long time, while the machinery for working with that shared data is wired into each individual bounded context, which lets services work quickly and easily. This is how the data dichotomy is balanced.

As you can see, this is more than just REST. We get a set of tools that lets us work with shared data in a decentralized way.

This article has not covered every aspect. We still need to work out how to balance the request-response paradigm against the event-driven one, but we will deal with that next time. There are topics that deserve a closer look, for example why Stateful Stream Processing is so good; we will cover that in the third article. And there are other powerful constructs we can draw on, such as Exactly Once Processing. It changes the rules of the game for distributed business systems, because it provides the transactional guarantees of XA in a scalable form. That will be the subject of the fourth article. Finally, we will need to go over the details of implementing these principles.



For now, just remember this: the data dichotomy is a force we face whenever we build business services, and we must keep it in mind. The trick is to turn things on their head and start treating shared data as a first-class citizen. Stateful Stream Processing offers a unique compromise here. It avoids the centralized "God Components" that hold back progress. Moreover, it takes the speed, scalability, and fault tolerance of streaming data pipelines and adds them to every service. So we can focus on a shared stream that any service can tap into to work with its data. This makes services more scalable, interchangeable, and autonomous, so they will not just look good on whiteboards and in proof-of-concept tests, they will work and evolve for decades.





