SRE Observability: Namespaces and Metric Structure



Structured metric namespaces are important for quickly accessing information during an incident. Plan metric names and dimensions carefully so they support a wide range of queries and future extensions. One effective way to build a flexible metric model is to think of metrics as a tree.

This provides several advantages: viewing subsets of the data, defining a metric in terms of its children, and establishing relationships between metrics.

The Mail.ru Cloud Solutions team has translated an article that discusses the properties of metric namespaces: they let you gradually increase query detail, drill down into subsets of data, and view a metric in terms of the metrics it is composed of. Many of these concepts will be familiar, as they are implemented in cloud-native monitoring solutions such as Prometheus and DogStatsD.

Metric Namespaces and Their Structure


Metric namespaces are the conceptual spaces in which metrics live. They are often limited to a database or account:


A metric namespace also implies a structure for the metrics within it. Choosing the right names and structure opens up a number of significant advantages.

The namespace in the diagram above has no explicit structure. Each metric is separate and independent; the metrics have nothing in common except that they live in the same namespace. In such an unstructured namespace, every metric has to be used individually: to see the HTTP request rate for a service, you must address that service's metric directly, as service_N_http_requests_total.

Suppose we want to see the total number of requests across all services. What happens in the example above if we add a new service?


If the total is calculated by summing the requests to service_1 and service_2, then adding service_3 leaves the computed total unchanged, and it is now wrong. To get the correct total, you have to update the aggregation rule to also add service_3_http_requests_total. Take a look at the request-rate graph below:
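In PromQL terms, an unstructured namespace forces a hard-coded sum that must be edited every time a service appears (the metric names here follow the article's illustrative naming, not a real system):

```promql
# Must be updated by hand whenever service_3, service_4, ... are deployed
service_1_http_requests_total + service_2_http_requests_total
```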


Metrics tree


An alternative to a structureless namespace is to adopt an explicit structure, using the metric name as the namespace. In the diagram below, this structure appears as a tree:


In Prometheus and Datadog, a metric structure is created using labels and tags, respectively. Labels allow you to build the tree dynamically: whenever a new service is added, it automatically attaches to the root metric.

In Prometheus, the request rate across all services can be viewed with a single query, as in the picture below:


With a structured namespace, you can dynamically calculate the sum of requests across the entire node. In this case, Prometheus calculates the per-second request rate for each service as a separate series and sums them.
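Assuming a shared metric name with a service label (which the article's tree implies; the exact names are illustrative), such a query might look like:

```promql
# Per-second request rate, summed over every value of the service label;
# new services are picked up automatically, with no query changes
sum(rate(http_requests_total[1m]))
```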

Defining Metrics in Terms of Their Children


When using the metrics tree, each metric dimension (the “service” label) contains the individual request rate of a particular service. Using the metric namespace, you can get not only the total request rate but also the request rate for each service:


Using the namespace, you can select and visualize not only the aggregate metric but also parts of it, grouped by some attribute. In the picture above, the request rates of the individual services are visible; their sum gives the request rate for the whole node.
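Under the same illustrative naming as above, the per-service breakdown is the same query grouped by the label:

```promql
# One series per service; summing these series yields the node total
sum by (service) (rate(http_requests_total[1m]))
```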

Narrowing the query: subsets of data


Namespaces also support narrowing queries to retrieve specific subsets of data. For example, let's ask the question: “What is the p99 latency (99% of requests complete faster than this value) of all successful HTTP requests on servers in a canary deployment?”


The tree above models the concept of a namespace and, optionally, how metrics are stored on disk. A well-defined metric namespace lets you slice metrics by any dimension.
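Assuming latency is recorded as a Prometheus histogram with code and deployment labels (these names are assumptions for illustration, not something the article specifies), the question above maps to a query of this shape:

```promql
# p99 latency of successful requests on canary servers
histogram_quantile(
  0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{code=~"2..", deployment="canary"}[5m])
  )
)
```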

The picture below shows a graph of p99 and p50 from the tree of metrics above:


With a more specific query, you can answer, for example, the following question: “What is the p99 latency of all successful HTTP requests on canary servers, broken down by server?”


Below is a visualization of a metric with a selection by machine_id:


Since the metric has a well-defined structure, we can select the data we need from the top-level metric by specifying selection criteria, in our case machine_id.
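Under the same assumed histogram and label names as before, breaking the result down by server is a matter of adding machine_id to the grouping:

```promql
# p99 latency on canary servers, one series per machine_id
histogram_quantile(
  0.99,
  sum by (le, machine_id) (
    rate(http_request_duration_seconds_bucket{code=~"2..", deployment="canary"}[5m])
  )
)
```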

Ratios


Ratios are another way to structure data by relating metrics to one another. They are a very powerful mechanism and the basis for calculating availability and error-rate SLOs (indicators popularized by Google's SRE practice).

Ratios let the end user explicitly relate metrics, establishing a metric structure. These relationships are most often expressed as percentages: availability can be calculated as successful requests divided by total requests, and the error rate as errors divided by total requests. Another example of a ratio is how often one particular state occurs out of several possible states.

Let's illustrate this. Suppose an application serves requests, and each request ends in one of two states: the data came from the cache (cache_hit: true) or from the primary source (cache_hit: false). To see the cache hit ratio, the data should be structured as follows:
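With a label-based structure, both outcomes live under one metric name. A sketch in Prometheus text exposition format (the metric name and sample values here are illustrative):

```promql
# One metric, two label values
cache_requests_total{cache_hit="true"}  80
cache_requests_total{cache_hit="false"} 80
```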


The graph below shows the rate of cache hits and misses. Each request either hits or misses the cache; in total there are about 160 requests per second:


The following graph shows the cache hit ratio relative to the total number of requests. The hit ratio is 0.5 (50%):


Any two metrics can be related this way. In Datadog and Prometheus, the relationship is expressed as a simple arithmetic operation.
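Using the illustrative cache metric from above, the hit ratio is literally one division in PromQL:

```promql
# Cache hit ratio: hits divided by all requests
sum(rate(cache_requests_total{cache_hit="true"}[1m]))
  /
sum(rate(cache_requests_total[1m]))
```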

Questions answered by data


It is important to think through the questions the data should answer. In the very first example, the unstructured namespace cannot answer the question “How many requests per second do all instances handle?”, while the namespace tree can.

Another common case is namespacing client metrics by service name rather than by the name of the client library. Adding the client library name to the namespace makes it possible to answer the question: “What is the total number of requests from all clients?”

Google's four golden signals suggest generally useful questions. Each question is posed in a general form and then refined:

  1. How many requests do all clients make in total?
  2. How many requests does each client make?
  3. How many requests does each client make to each node?
  4. What is the percentage of successful server requests for each RPC?

The same strategy applies to latency, error rates, and resource saturation.
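Under the same kind of label structure as earlier (a client label and a node label, both assumed for illustration), the first three questions map to progressively narrower queries:

```promql
sum(rate(client_requests_total[1m]))                     # all clients in total
sum by (client) (rate(client_requests_total[1m]))        # per client
sum by (client, node) (rate(client_requests_total[1m]))  # per client, per node
```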

Generic tagged metrics


Below are best practices for query optimization and data storage drawn from the Datadog and Prometheus documentation.

To get a global view that still supports drilling down into specific segments, start with a common top-level namespace and add tags and labels, beginning with general ones and then adding more specific ones. In doing so, keep in mind the recommendation below.

Beware of cardinality


Both Datadog and Prometheus recommend limiting label and tag cardinality. To quote the Prometheus manual:



Each labelset is an additional time series that has RAM, CPU, disk, and network costs. Usually the overhead is negligible, but in scenarios with lots of metrics and hundreds of labelsets across hundreds of servers, this can add up quickly.

As a general guideline, try to keep the cardinality of your metrics below 10. Metrics that exceed that should be a handful across your whole system; the vast majority of your metrics should have no labels at all.

If you have a metric whose cardinality is over 100, or could grow that large, investigate alternatives: reduce the number of dimensions, or move the analysis out of monitoring and into a general-purpose processing system.

To get a feel for the numbers, consider node_exporter, which exposes metrics for every mounted filesystem. Every node will have on the order of ten time series for, say, node_filesystem_avail. With 10,000 nodes you end up with roughly 100,000 time series for node_filesystem_avail, which is fine for Prometheus to handle.

If you now add per-user filesystem quotas, you will quickly reach tens of millions of time series with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there is an opportunity cost: you can no longer afford other, potentially more useful, metrics in this monitoring system.

Start without tags and add more tags over time as needed.
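Before adding a label, it can help to check how many series a metric already produces. A sketch using standard PromQL against any Prometheus server:

```promql
# Number of distinct time series behind one metric name
count(node_filesystem_avail)

# Ten highest-cardinality metric names on the whole server
topk(10, count by (__name__) ({__name__=~".+"}))
```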

Convenient user-level monitoring is often better achieved through distributed tracing, which has its own metric space and best practices.

Conclusion


It is important to understand which questions your metric structure can answer: the wrong structure makes answers hard to obtain. Structuring the metric space is not complicated, but it requires planning up front to get the most out of the data.

When problems arise, the ability to manually expand a metric and see all of its states is critical; the namespace must not get in the way of this.

Good luck!

What else to read:

  1. Simple Caching Methods in GitLab CI: A Picture Guide.
  2. Top 10 Kubernetes Tricks and Tips.
  3. Our telegram channel on digital transformation.
