Logging and request tracing: best practices. A Yandex report

Yandex.Market has a large microservice architecture. A browser request for the Market's main page gives rise to dozens of nested requests to different services (backends) developed by different people. In such a system, it can be difficult to understand why a request failed or took a long time to process.


Anatoly Ostrovsky (megatolya) explains how his team solved this problem and shares practices that are specific to the Market but generally relevant for any large service. His report is based on his own experience of launching a new marketplace in a fairly short time. For several years Tolya led the interface development team in the Market, and he has now moved to the self-driving vehicles division.

- All our marketplaces are built on common principles. Together they form one big system. But if we talk about the frontend, then, from the user's point of view, the applications are completely different.



At the same time, our frontends talk to many backends. Sometimes these backends are similar to each other (different instances of the same application), and sometimes they are unique to a particular service (a dedicated billing backend, for example). The structure of such a system can be considered a classic microservice architecture.

There is always one problem in a large service: it is difficult to understand what exactly is happening in it. For example, what happens at the moment a user runs into a payment error. Suppose it happened on their side yesterday, and today we need to figure out what went wrong.



Backend 2 can start "failing" either for a specific product, or at a specific time, or for specific users. We need to be able to respond to any of these situations.



We have many backends, and, as I said, they in turn go to backends of their own. If you draw this as a graph, it turns out rather tangled. In real life there may be hundreds of microservices; imagine how many connections there are between them.

There are several steps to diving into this topic. I will briefly cover each of them.



First of all, agree with your backend colleagues on a common request identification scheme - it makes requests easier to find later. Next, you need to be able to reproduce a problem quickly: suppose a payment error occurred - try to quickly understand how it happened and in which backend. Store logs not only in files but also in a database, so that you can run aggregations over them. And, of course, an important part of the process is charts and monitoring. Now, each of these in order.

Unified request identification system




This is one of the simplest tools for understanding what is happening with the service. Agree with your colleagues that, for example, your frontend generates some request id (the requestId variable in the picture) and then passes it to all the backend endpoints it calls. The backend itself does not invent anything new: it takes the requestId that arrived and forwards it further in its own requests to its backends. At the same time, it can add its own prefix, so that among entries with the same requestId this particular backend is easy to find.
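As an illustration, here is a minimal sketch of this idea for a Node.js/Express-style frontend. The header name x-request-id, the market_front prefix and the backend URL are made up for the example; the real names depend on what you agree on with your colleagues.

```typescript
import express from "express";
import { randomUUID } from "crypto";

const app = express();

// Accept an incoming requestId or generate a new one for this request.
app.use((req, res, next) => {
  res.locals.requestId = req.header("x-request-id") ?? randomUUID();
  next();
});

app.get("/product/:id", async (req, res) => {
  const requestId = res.locals.requestId as string;
  // When calling our own backends, forward the id and add our prefix so this
  // particular hop is easy to spot among entries with the same base id.
  const backendResponse = await fetch("https://backend.example/v1/card", {
    headers: { "x-request-id": `market_front/${requestId}` },
  });
  res.status(backendResponse.status).json(await backendResponse.json());
});

app.listen(3000);
```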



Thus, when you grep your logs and see, for example, that a backend responded with a 500, there are two options: you either hand this requestId to your colleagues so they can look for it in their logs, or you look into it yourself.



All our logs are marked with such an id, not only to understand what happened and at what moment, but also to keep the context of the request. A request is asynchronous and may add something to the log later, and if those entries are not tied to the requestId, nothing good will come of it.
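One way to keep that context is to create a small request-scoped logger once per request and pass it around, so that even log lines written later by asynchronous work carry the same id. A rough sketch, with an illustrative logger interface:

```typescript
// Request-scoped logger: every line carries the requestId, so entries written
// later by asynchronous work keep their context.
interface Logger {
  info(msg: string): void;
  error(msg: string): void;
}

function createRequestLogger(requestId: string): Logger {
  const write = (level: string, msg: string) =>
    console.log(`${new Date().toISOString()} [${requestId}] ${level} ${msg}`);
  return {
    info: (msg) => write("INFO", msg),
    error: (msg) => write("ERROR", msg),
  };
}

// Usage: async callbacks that fire long after the handler returned
// still log under the same requestId.
async function handleCheckout(requestId: string): Promise<void> {
  const log = createRequestLogger(requestId);
  log.info("payment started");
  setTimeout(() => log.info("payment callback received"), 1000);
}
```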

cURL


To reproduce a problem, we use the cURL utility - a console tool that makes network requests over HTTP and HTTPS. cURL supports many other protocols as well, but for web development it is easiest to think of it as a tool for working with HTTP(S) requests.



To get acquainted with the cURL command, open any site, go to the Network tab of the developer tools, and copy any request as cURL. The result is a rather long line like this:



If you look into it, there is nothing scary about it. Let's take it apart.



Here is a request to market.yandex.ru.



A User-Agent header has been added here, and it already takes up quite a lot of space.



The rest is essentially cookies. Yandex sets quite a lot of them, and in serialized form they look rather intimidating. Apart from that, there is nothing else in the request.

So what is cURL useful for? If you copied this command and ran it, you would see the same market.yandex.ru page that I saw; only the computer it runs on would be different. Of course, differences in IP addresses could produce some side effects, but in general these would be the same requests, and we would reproduce the same scenario.



In order not to compose such cURL commands by hand each time, you can use the npm package format-curl.



It takes the usual request parameters - in this case only headers and url, but it also understands query, body, and so on - and returns a string with the cURL command.
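The package does this out of the box. Purely to illustrate the idea (this is not format-curl's real API), a hand-rolled helper might look roughly like this; the cookie-stripping flag anticipates the warning about session cookies below.

```typescript
// Hand-rolled illustration of the idea behind format-curl: turn request
// parameters into a cURL command that can be written to the dev logs.
interface CurlOptions {
  url: string;
  method?: string;
  headers?: Record<string, string>;
  body?: string;
  stripCookies?: boolean; // never log session cookies outside test benches
}

function toCurl({ url, method = "GET", headers = {}, body, stripCookies }: CurlOptions): string {
  const parts = [`curl -X ${method} '${url}'`];
  for (const [name, value] of Object.entries(headers)) {
    if (stripCookies && name.toLowerCase() === "cookie") continue;
    parts.push(`-H '${name}: ${value}'`);
  }
  if (body !== undefined) parts.push(`--data '${body}'`);
  return parts.join(" \\\n  ");
}

// Example: what could be written to a dev-environment log next to a backend call.
console.log(
  toCurl({
    url: "https://backend.example/v1/cart",
    headers: { "x-request-id": "market_front/abc-123", cookie: "Session_id=..." },
    stripCookies: true,
  })
);
```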



Therefore, all our logs in the dev environment also contain cURL requests.



We even logged backend cURL requests directly in the browser, to immediately see how we call our backends without looking at the browser console.



Please note that such cURL commands include session cookies - and that is bad. If you sent me your cURL command for market.yandex.ru, I could log in to your Market account and to any other Yandex service under your login. Therefore, we do not store such commands anywhere and log them only on test benches, for our own use - this data must not leak.

ClickHouse


Next I will talk about structured logs. Here I will use a specific database, ClickHouse, but you can pick any. ClickHouse is a columnar DBMS: it is convenient for selecting from huge amounts of data and for ingesting data in large chunks. It is good in that you can save a large chunk of the log into it and then, for example, run some aggregation over a billion records.



The ClickHouse select in this example is plain SQL. Here we count today's requests grouped by status code.



As a result we get, say, 180 thousand responses with status 200 and seven with status 500; the remaining status codes are not interesting to us here. But how can we put this to interesting use?
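For illustration, such a query can be issued from a Node.js script through ClickHouse's HTTP interface; the table name access_log and its columns date and status are made up here.

```typescript
// Sketch: count today's requests grouped by status code, via ClickHouse's
// HTTP interface. Table and column names are invented for the example.
const query = `
  SELECT status, count() AS hits
  FROM access_log
  WHERE date = today()
  GROUP BY status
  ORDER BY hits DESC
  FORMAT JSON
`;

async function countByStatus(): Promise<void> {
  const res = await fetch("http://clickhouse.example:8123/", {
    method: "POST",
    body: query,
  });
  const { data } = await res.json();
  console.log(data); // rows like { status: 200, hits: ... }
}

countByStatus();
```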



We can say that the ratio of 200 responses to the total number of responses is a Service Level Indicator (SLI): it answers the question of how well our application works in terms of status codes. Simple as it is, it already tells us something.



Based on our indicator, we can set a first objective (SLO): say, that 99% of our requests should be OK. Then we can check whether we have met it. If we had not, we could dig into the most recent requests that returned 500, or specifically into the critical ones.



For example, payment errors are critical for us, but in this case the query returns zero of them - lucky us :)
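Putting the indicator and the objective together, a minimal sketch of such a check might look like this (same made-up access_log table as above; countIf is a standard ClickHouse aggregate function, and the 0.99 threshold is the example objective):

```typescript
// Sketch: compute today's SLI (share of 200 responses) and compare it to the SLO.
const sliQuery = `
  SELECT countIf(status = 200) / count() AS sli
  FROM access_log
  WHERE date = today()
  FORMAT JSON
`;

const SLO = 0.99; // example objective: 99% of requests should be OK

async function checkSli(): Promise<void> {
  const res = await fetch("http://clickhouse.example:8123/", {
    method: "POST",
    body: sliQuery,
  });
  const { data } = (await res.json()) as { data: Array<{ sli: number }> };
  if (data[0].sli < SLO) {
    console.error(`SLO violated: ${data[0].sli} < ${SLO}`);
  }
}

checkSli();
```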



How do you make sure that your logs end up in this convenient form and can be queried via SQL?



This is a topic for a separate big talk, and it all depends on your infrastructure. But there are essentially two ways. The first: write log entries straight from the application, at runtime, into the database. We do it the second way: we tail the log file and ship it in chunks either to the database or to an intermediate store.

For us this works in a rather layered fashion: we send logs from a specific instance to a remote log storage server.
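A rough sketch of that second approach: tail the application's log file and periodically ship whatever has been appended to a remote store. The file path and the ingest endpoint are invented for the example.

```typescript
import { createReadStream, promises as fs } from "fs";

const LOG_FILE = "/var/log/app/access.log";             // made-up path
const LOG_STORE = "http://log-storage.example/ingest";  // made-up endpoint
let offset = 0; // how many bytes we have already shipped

// Read whatever has been appended to the file since the last flush.
async function readNewLines(): Promise<string[]> {
  const { size } = await fs.stat(LOG_FILE);
  if (size <= offset) return []; // nothing new (or the file was rotated)
  const stream = createReadStream(LOG_FILE, { start: offset, end: size - 1 });
  const chunks: Buffer[] = [];
  for await (const chunk of stream) chunks.push(chunk as Buffer);
  offset = size;
  return Buffer.concat(chunks).toString("utf8").split("\n").filter(Boolean);
}

// Every few seconds, send the accumulated lines to remote storage in one batch.
setInterval(async () => {
  const lines = await readNewLines();
  if (lines.length === 0) return;
  await fetch(LOG_STORE, { method: "POST", body: lines.join("\n") });
}, 5000);
```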

Request Trace


There is no formally established concept of "request tracing"; the term was, in effect, coined by Google.



If you search the Internet for "request tracing", you will see the traceroute command. Perhaps it is somewhat similar to request tracing.



It is even a console program; here I ran it for bringly.ru (a service that we developed last spring). It helps you understand which machines and servers the request passes through before it reaches the balancer, which then responds either with markup or with something else.



Here is our balancer. By request tracing we mean not this, but everything that happens after the balancer: how the request lives on inside our Node.js application.



We see this as a timeline, with time on the horizontal axis and the sequence of requests on the vertical one. In this case we have a request for a product card: you can see that the first call goes to our authorization backend. After it completes, three long bars follow - requests to our main backend for the cart, the product card, and similar products. So we have a trace.



Here you see the same authorization request, and then a call to a backend. In this case the backend does not behave very optimally: it makes many sequential queries to its database. Such a request could probably be optimized.



And here is an example of a trace where we did not go to any backend at all and immediately returned a 500. How is such a trace useful to us? We do not have to bother colleagues: we have the id of this request, so we can look in the logs ourselves and figure out what happened.



Here is the opposite situation: the backend reported that something went wrong and at the same time wrote in the meta-information exactly what happened - a stack trace showed up.



How do you build the same tool for yourself?

The most important thing here is the database. If you do even the simplest "INSERT INTO" of service events into a database, you can later at least use SQL to find the events you need. And if necessary, you can build an interface on top of it.
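As a sketch of that "simplest INSERT INTO": record every backend call as one row (request id, backend name, start time, duration, status), and a trace can later be rebuilt with a plain SELECT filtered by the request id. The table and column names are made up, and the insert here goes through ClickHouse's HTTP interface, though any database would do.

```typescript
interface TraceEvent {
  requestId: string;
  backend: string;    // which backend we called
  startedAt: string;  // ISO timestamp when the call began
  durationMs: number;
  status: number;     // HTTP status we got back
}

// One row per backend call; a trace is then
// SELECT ... FROM trace_events WHERE request_id = ? ORDER BY started_at.
async function recordTraceEvent(e: TraceEvent): Promise<void> {
  const row = JSON.stringify({
    request_id: e.requestId,
    backend: e.backend,
    started_at: e.startedAt,
    duration_ms: e.durationMs,
    status: e.status,
  });
  await fetch(
    "http://clickhouse.example:8123/?query=" +
      encodeURIComponent("INSERT INTO trace_events FORMAT JSONEachRow"),
    { method: "POST", body: row }
  );
}

// Wrap a backend call to measure it and record the event.
async function tracedFetch(requestId: string, backend: string, url: string) {
  const startedAt = new Date().toISOString();
  const t0 = Date.now();
  const res = await fetch(url, { headers: { "x-request-id": requestId } });
  await recordTraceEvent({
    requestId,
    backend,
    startedAt,
    durationMs: Date.now() - t0,
    status: res.status,
  });
  return res;
}
```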

Graphs


This is a very interesting topic, but of course I will not dwell on it in detail today :)



Let's talk about charts in the context of logging. We have a lot of charts; we look at them when we roll something out - and here, in the timings, there is a spike like this.



Charts help you see visually that something is wrong; after that you still need to look at the logs and understand what exactly. In this case, a spike right after a release means, at the very least, that the release should be rolled back immediately.

Monitoring


An even more important part, and a deeper level of immersion in this topic, is monitoring. By monitoring I mean, first, automatic checks on charts and, second, any automated rules for tracking something.



We monitor the ratio of 500 responses to the total number of responses per minute. We also monitor 400 responses, the presence of load on the service, a health check endpoint that pings an endpoint on each of the backends, and so on.
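A sketch of one such rule as a small script that runs every minute: compute the share of 500s over the last minute and "page" someone if it exceeds a threshold. The table, the threshold, and the notify() hook are all illustrative.

```typescript
// Monitoring rule sketch: share of 500s among the last minute's responses.
const THRESHOLD = 0.01; // alert if more than 1% of responses were 500s

async function check500Ratio(): Promise<void> {
  const query = `
    SELECT countIf(status >= 500) / count() AS ratio
    FROM access_log
    WHERE timestamp >= now() - INTERVAL 1 MINUTE
    FORMAT JSON
  `;
  const res = await fetch("http://clickhouse.example:8123/", {
    method: "POST",
    body: query,
  });
  const { data } = (await res.json()) as { data: Array<{ ratio: number }> };
  if (data[0] && data[0].ratio > THRESHOLD) {
    await notify(`500 ratio is ${data[0].ratio} over the last minute`);
  }
}

// Whatever the team uses for paging: a Telegram bot, a phone call, etc.
async function notify(message: string): Promise<void> {
  console.error("ALERT:", message);
}

setInterval(check500Ratio, 60_000);
```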



In addition, we have monitoring dashboards that we put on screens near our workstations, so we immediately see which checks have "turned red". For example, here is one of the main dashboards, showing the frontend and our main backend. You can see that one of the checks on the backend has lit up. This means that at that moment the person responsible for that service will receive a message in Telegram, or perhaps even a phone call - it depends on the monitoring settings.

Summary


A shared requestId will make it easier to find problems in a service that consists of several applications. A correct cURL command will let you reproduce problems more precisely and see, for example, exactly what data you send to the backend. Structured logs let you define SLIs, and they are more convenient to work with than plain text logs. And do not forget to watch the charts and set up monitoring.

I recommend reading the Site Reliability Engineering book from Google if you are interested in infrastructure things.
