Thanos - Scalable Prometheus

This translation was prepared specifically for students of the "DevOps Practices and Tools" course.




Fabian Reinartz is a software engineer and Go enthusiast who likes solving hard problems. He is a Prometheus maintainer and co-founder of Kubernetes SIG instrumentation. In the past he worked as a production engineer at SoundCloud and led the monitoring team at CoreOS. He now works at Google.

Bartek Plotka is an infrastructure engineer at Improbable. He is passionate about new technologies and distributed systems problems, with low-level programming experience from Intel, contributions to Mesos, and production SRE experience at Improbable. His interests include Golang and open source.


Looking at our flagship SpatialOS product, you can guess that Improbable needs a highly dynamic global cloud infrastructure with dozens of Kubernetes clusters. We were early adopters of the Prometheus monitoring system. Prometheus is capable of tracking millions of metrics in real time and comes with a powerful query language for retrieving the information you need.

The simplicity and reliability of Prometheus are among its main advantages. However, past a certain scale, we ran into several drawbacks. To solve these problems, we developed Thanos, an open source project created by Improbable to seamlessly transform existing Prometheus clusters into a single monitoring system with unlimited storage of historical data. Thanos is available on GitHub here.


Our goals with Thanos


At a certain scale, problems arise that go beyond the capabilities of vanilla Prometheus. How do we store petabytes of historical data reliably and economically? Can this be done without sacrificing query response time? Can all metrics living on different Prometheus servers be reached with a single API request? Can the replicated data collected by Prometheus HA setups somehow be combined?

To address these issues, we created Thanos. The following sections describe how we approached them and explain the goals we pursued.

Querying data from multiple instances of Prometheus (global query)


Prometheus offers a functional approach to sharding. Even a single Prometheus server provides enough scalability to free users from the complexity of horizontal sharding in almost all use cases.

Although this is an excellent deployment model, you often need to access data on different Prometheus servers through a single API or UI, that is, a global view. Of course, it is possible to display several queries in one Grafana panel, but each query can only be executed against a single Prometheus server. With Thanos, on the other hand, you can query and aggregate data from multiple Prometheus servers, since they are all accessible from a single endpoint.
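To make this concrete, here is a minimal Go sketch of issuing one such global query. It assumes a Thanos Querier reachable at a hypothetical address (thanos-querier.example:9090) and a hypothetical "cluster" external label distinguishing the Prometheus servers; since the Querier speaks the standard Prometheus HTTP API (described below), a plain HTTP call to /api/v1/query is all that is needed.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
    )

    func main() {
        // Assumed address of a Thanos Querier; it serves the standard Prometheus HTTP API.
        base := "http://thanos-querier.example:9090/api/v1/query"

        // One PromQL query, aggregated across every Prometheus server behind the Querier.
        // The "cluster" label is a hypothetical external label distinguishing the clusters.
        params := url.Values{}
        params.Set("query", `sum by (cluster) (rate(http_requests_total[5m]))`)

        resp, err := http.Get(base + "?" + params.Encode())
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(body)) // JSON result merged from all connected Prometheus servers
    }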

Previously, to get a global view at Improbable, we organized our Prometheus instances into a multi-level Hierarchical Federation. This meant creating one Prometheus meta-server that collects a portion of the metrics from each leaf server. This approach proved problematic: it complicated the configuration, added an additional potential point of failure, and required complex rules to expose only the right data on the federated endpoint. In addition, federation of this kind does not give a real global view, since not all data is accessible from a single API request.

Closely related to this is a single view of the data collected by high-availability (HA) Prometheus servers. The Prometheus HA model simply collects the data twice, independently, which could hardly be simpler. However, a combined and deduplicated view of both streams would be much more convenient.

Of course, there is a need for highly available Prometheus servers. At Improbable, we take every minute of monitoring data seriously, but having a single Prometheus instance per cluster is a single point of failure. Any configuration error or hardware failure can potentially result in the loss of important data. Even a simple deployment can lead to minor gaps in metric collection, since a restart can take significantly longer than the scrape interval.

Reliable storage of historical data


Cheap, fast and long-term metric storage is our dream (shared by most Prometheus users). At Improbable, we were forced to set the metric retention period to nine days (on Prometheus 1.8). This places obvious limits on how far back we can look.

Prometheus 2.0 is better in this regard, as the number of time series no longer affects overall server performance (see the KubeCon keynote about Prometheus 2). However, Prometheus stores data on local disk. Although highly efficient data compression significantly reduces local SSD usage, there is ultimately a limit to how much historical data can be stored.

At Improbable, we also care about reliability, simplicity and cost. Large local disks are harder to operate and back up. They are more expensive and require more backup tooling, which adds unnecessary complexity.

Downsampling


As soon as we started working with historical data, we realized that there are fundamental big-O difficulties that make queries slower and slower as we work with weeks, months and years of data.

A standard solution to this problem is downsampling, that is, reducing the resolution of the signal. With downsampling we can "zoom out" to a larger time range while keeping the same number of samples, which keeps queries responsive.
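The idea can be illustrated with a small, deliberately naive Go sketch that collapses raw samples into fixed windows and keeps one averaged sample per window. This is not how Thanos implements downsampling (it stores several aggregates per window, as described later); it only conveys the basic trade of resolution for a bounded number of samples.

    package main

    import "fmt"

    // Sample is a single (timestamp, value) pair; timestamps are in milliseconds.
    type Sample struct {
        T int64
        V float64
    }

    // downsampleAvg collapses all samples that fall into the same window of
    // stepMs milliseconds into a single averaged sample.
    func downsampleAvg(in []Sample, stepMs int64) []Sample {
        var out []Sample
        var bucketStart int64 = -1
        var sum float64
        var count int
        for _, s := range in {
            start := s.T - s.T%stepMs
            if start != bucketStart {
                if count > 0 {
                    out = append(out, Sample{T: bucketStart, V: sum / float64(count)})
                }
                bucketStart, sum, count = start, 0, 0
            }
            sum += s.V
            count++
        }
        if count > 0 {
            out = append(out, Sample{T: bucketStart, V: sum / float64(count)})
        }
        return out
    }

    func main() {
        raw := []Sample{{0, 1}, {15000, 2}, {30000, 3}, {45000, 6}, {300000, 10}}
        fmt.Println(downsampleAvg(raw, 300000)) // collapse to five-minute windows
    }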

Downsampling old data is an inevitable requirement of any long-term storage solution and goes beyond vanilla Prometheus.

Additional goals


One of the initial goals of the Thanos project was to seamlessly integrate with any existing Prometheus installations. The second goal was simple operation with a minimum entry barrier. Any dependencies should be easily satisfied for both small and large users, which also implies a low base cost.

Thanos Architecture


Having listed our goals in the previous section, let's work through them and see how Thanos solves these problems.

Global view


To get a global view on top of existing Prometheus instances, we need to link a single query entry point to all servers. This is exactly what the Thanos Sidecar component does. It is deployed next to each Prometheus server and acts as a proxy, serving local Prometheus data through the gRPC Store API, which allows time series data to be selected by label set and time range.
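Conceptually, the Store API boils down to "give me all series matching these label matchers within this time range". The Go sketch below is illustrative only, not the actual protobuf definitions from the Thanos repository; names and fields are simplified.

    package store

    // LabelMatcher selects series by a label, e.g. {Name: "job", Value: "prometheus"}.
    type LabelMatcher struct {
        Name  string
        Value string
    }

    // SeriesRequest describes what a Querier asks a Store API server for:
    // a time range plus a set of label matchers.
    type SeriesRequest struct {
        MinTime  int64 // milliseconds
        MaxTime  int64 // milliseconds
        Matchers []LabelMatcher
    }

    // Series is one matching time series: its label set and compressed chunks of samples.
    type Series struct {
        Labels map[string]string
        Chunks [][]byte
    }

    // StoreAPI is implemented by the Sidecar (proxying local Prometheus data) and,
    // as described later, by the Store Gateway and the Ruler as well.
    type StoreAPI interface {
        Series(req SeriesRequest) ([]Series, error)
    }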

On the other side sits the Querier, a stateless, horizontally scalable component that does little more than answer PromQL queries through the standard Prometheus HTTP API. The Querier, Sidecar and other Thanos components communicate via a gossip protocol.



  1. When it receives a query, the Querier connects to the relevant Store API servers, that is, to our Sidecars, and fetches the time series data from the corresponding Prometheus servers.
  2. It then merges the responses and evaluates the PromQL query against them. The Querier can merge both disjoint data and duplicated data from Prometheus HA servers.

This solves the bulk of our puzzle: combining data from isolated Prometheus servers into a single view. In fact, Thanos can be used for this capability alone. No changes to existing Prometheus servers are required!
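A rough sketch of the deduplication idea: if the two replicas of an HA pair differ only in a replica label, dropping that label makes their series collapse into one. The label name "replica" here is an assumption; the real Querier makes it configurable and also has to merge the overlapping samples, which is omitted.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // Labels maps label names to values for one time series.
    type Labels map[string]string

    // dedupKey builds a key that ignores the HA replica label, so that the two
    // copies of a series scraped by an HA pair collapse into a single series.
    func dedupKey(ls Labels) string {
        keys := make([]string, 0, len(ls))
        for k := range ls {
            if k == "replica" { // assumed replica label name
                continue
            }
            keys = append(keys, k)
        }
        sort.Strings(keys)
        parts := make([]string, 0, len(keys))
        for _, k := range keys {
            parts = append(parts, k+"="+ls[k])
        }
        return strings.Join(parts, ",")
    }

    func main() {
        a := Labels{"job": "api", "cluster": "eu1", "replica": "0"}
        b := Labels{"job": "api", "cluster": "eu1", "replica": "1"}
        fmt.Println(dedupKey(a) == dedupKey(b)) // true: the HA copies are treated as one series
    }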

Unlimited retention!


However, sooner or later we will want to keep data beyond Prometheus's normal retention period. For storing historical data we chose object storage. It is widely available in every cloud as well as in on-premises data centers, and it is very economical. In addition, almost any object storage is accessible through the well-known S3 API.

Prometheus writes data from RAM to disk roughly every two hours. A persisted block contains all the data for a fixed time window and is immutable. This is very convenient: Thanos Sidecar can simply watch the Prometheus data directory and, as new blocks appear, upload them to an object storage bucket.
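The following Go sketch shows the general shape of that behaviour under stated assumptions: it periodically scans a TSDB directory and hands unseen block directories to an uploader. The uploadBlock callback is hypothetical; the real Sidecar uses an object storage client and does considerably more bookkeeping.

    package main

    import (
        "context"
        "fmt"
        "os"
        "path/filepath"
        "time"
    )

    // watchAndUpload periodically scans the Prometheus data directory and passes
    // every block directory it has not seen before to uploadBlock.
    func watchAndUpload(ctx context.Context, tsdbDir string, uploadBlock func(dir string) error) error {
        seen := map[string]bool{}
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            entries, err := os.ReadDir(tsdbDir)
            if err != nil {
                return err
            }
            for _, e := range entries {
                // Each finished, immutable block lives in its own directory.
                if !e.IsDir() || seen[e.Name()] {
                    continue
                }
                if err := uploadBlock(filepath.Join(tsdbDir, e.Name())); err != nil {
                    return err
                }
                seen[e.Name()] = true
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-ticker.C:
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        // Dummy uploader that only prints; a real one would write to S3 or GCS.
        _ = watchAndUpload(ctx, "/prometheus/data", func(dir string) error {
            fmt.Println("would upload block:", dir)
            return nil
        })
    }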



Uploading to object storage immediately after a block is written to disk also keeps the "scraper" (Prometheus plus Thanos Sidecar) simple, which simplifies operations, cost and system design.

As you can see, backing up data is very simple. But what about querying data in object storage?

The Thanos Store component acts as a proxy that retrieves data from object storage. Like Thanos Sidecar, it participates in the gossip cluster and implements the Store API. This way, an existing Querier can treat it just like a Sidecar, as another source of time series data; no special configuration is required.



Time series data blocks consist of several large files. Downloading them on demand would be quite inefficient, and caching them locally would require huge amounts of memory and disk space.

Instead, the Store Gateway knows how to handle the Prometheus storage format. Thanks to a clever query planner and caching only the necessary index parts of the blocks, complex queries can be reduced to a minimal number of HTTP requests to object storage files. This reduces the number of requests by four to six orders of magnitude and achieves response times that are, in general, hard to distinguish from queries against data on a local SSD.



As shown in the diagram above, this significantly reduces the cost of an individual data request to object storage by exploiting the Prometheus storage format, which places related data side by side. With this approach, many individual requests can be combined into a minimal number of bulk operations.

Compaction and downsampling


Once a new block of time series data has been successfully uploaded to object storage, we treat it as "historical" data, which immediately becomes available through the Store Gateway.

However, after a while, the blocks from a single source (a Prometheus with a Sidecar) accumulate and no longer exploit the full potential of indexing. To solve this problem, we introduced another component called the Compactor. It simply applies Prometheus's local compaction mechanism to the historical data in object storage and can be run as a simple periodic batch job.



Thanks to efficient compression, querying the store over a long time period is not a problem in terms of data size. However, the potential cost of unpacking a billion values and running them through a query engine would inevitably cause query execution time to explode. On the other hand, since there are hundreds of data points behind every screen pixel, it becomes impossible to even render the data at full resolution. This makes downsampling not only feasible but also free of any noticeable loss of accuracy.



To downsample data, the Compactor continuously aggregates series down to five-minute and one-hour resolutions. For each raw chunk, encoded with TSDB XOR compression, several types of aggregates are stored, such as min, max or sum per chunk. This allows the Querier to automatically pick the aggregate that is appropriate for a given PromQL query.
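Below is an illustrative Go sketch of what a downsampled chunk might carry and how an aggregate could be picked per query function. The exact set of aggregates and their encoding in Thanos differ from this; the sketch only shows why pre-computed aggregates let downsampled data still answer min, max, sum or avg style queries correctly.

    package main

    import "fmt"

    // AggrChunk is an illustrative view of a downsampled chunk for one resolution
    // window: instead of raw samples it stores pre-computed aggregates.
    type AggrChunk struct {
        MinTime, MaxTime int64 // window covered by this chunk, in milliseconds
        Count            float64
        Sum              float64
        Min              float64
        Max              float64
    }

    // pick returns the aggregate matching a requested function name.
    func (c AggrChunk) pick(fn string) float64 {
        switch fn {
        case "min":
            return c.Min
        case "max":
            return c.Max
        case "sum":
            return c.Sum
        case "avg":
            return c.Sum / c.Count
        default:
            return c.Sum
        }
    }

    func main() {
        c := AggrChunk{MinTime: 0, MaxTime: 300000, Count: 20, Sum: 90, Min: 1, Max: 12}
        fmt.Println(c.pick("avg"), c.pick("max")) // 4.5 12
    }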

No special configuration is needed to use the reduced-resolution data. The Querier automatically switches between the different resolutions and the raw data as the user zooms in and out. If desired, the user can control this directly through the "step" parameter of the request.
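For example, a range query over several months can simply ask for a coarse step, in the spirit of the sketch below; the Querier address and the one-hour step are assumptions for illustration, and query_range is part of the standard Prometheus HTTP API that the Querier serves.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
        "strconv"
        "time"
    )

    func main() {
        // Assumed Thanos Querier address.
        base := "http://thanos-querier.example:9090/api/v1/query_range"

        end := time.Now()
        start := end.AddDate(0, -6, 0) // six months back

        params := url.Values{}
        params.Set("query", `sum(rate(http_requests_total[5m]))`)
        params.Set("start", strconv.FormatInt(start.Unix(), 10))
        params.Set("end", strconv.FormatInt(end.Unix(), 10))
        params.Set("step", "3600") // a coarse one-hour step over a long range

        resp, err := http.Get(base + "?" + params.Encode())
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }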

Since storing a gigabyte costs little, Thanos by default keeps the raw data alongside the five-minute and one-hour resolution data. The original data does not have to be deleted.

Recording rules


Even with Thanos, recording rules remain an essential part of the monitoring stack. They reduce the complexity, latency and cost of queries. They also give users a convenient way to obtain aggregated metric data. Thanos builds on vanilla Prometheus instances, so it is perfectly fine to keep recording rules and alerting rules on an existing Prometheus server. However, in some cases this may not be enough:

  • Global alerts and rules (for example, an alert when a service is down in more than two of three clusters).
  • Rules over data that lives outside a server's local storage.
  • The desire to keep all rules and alerts in one place.



For all of these cases, Thanos includes a separate component called the Ruler, which evaluates rules and alerts through Thanos Queriers. By exposing the well-known StoreAPI itself, it lets Querier nodes access the freshly computed metrics. Later, they are also stored in object storage and made available through the Store Gateway.

The power of Thanos


Thanos is flexible enough to be adapted to your requirements. This is particularly useful when migrating from plain Prometheus. Let's recap what we have learned about the Thanos components with a quick example. Here is how to move your vanilla Prometheus into the world of "unlimited metric storage":



  1. Add a Thanos Sidecar to your Prometheus servers, for example as an adjacent container in the Kubernetes pod (see the sketch after this list).
  2. Deploy several Thanos Querier replicas to view the data. At this stage it is easy to set up gossip between the scrapers (Prometheus plus Sidecar) and the Queriers. Use the 'thanos_cluster_members' metric to verify that the components see each other.
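A minimal sketch of step 1 using the Kubernetes Go types: Prometheus and the Thanos Sidecar run as two containers in one pod and share the TSDB volume. Image names and command-line flags here are illustrative assumptions; check the Thanos documentation for the exact flags of the version you deploy.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // podWithSidecar builds a pod in which the Thanos Sidecar sits next to
    // Prometheus and reads the same TSDB volume.
    func podWithSidecar() corev1.Pod {
        tsdb := corev1.VolumeMount{Name: "prometheus-data", MountPath: "/prometheus"}
        return corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "prometheus-with-thanos"},
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{
                    {
                        Name:         "prometheus",
                        Image:        "prom/prometheus", // assumed image
                        Args:         []string{"--storage.tsdb.path=/prometheus"},
                        VolumeMounts: []corev1.VolumeMount{tsdb},
                    },
                    {
                        Name:  "thanos-sidecar",
                        Image: "improbable/thanos", // assumed image
                        Args: []string{
                            "sidecar",
                            "--prometheus.url=http://localhost:9090", // reach Prometheus over localhost
                            "--tsdb.path=/prometheus",                // same volume as Prometheus
                        },
                        VolumeMounts: []corev1.VolumeMount{tsdb},
                    },
                },
                Volumes: []corev1.Volume{{
                    Name:         "prometheus-data",
                    VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
                }},
            },
        }
    }

    func main() {
        pod := podWithSidecar()
        fmt.Println(pod.Name, len(pod.Spec.Containers), "containers")
    }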

These two steps alone are enough to provide a global view and seamless deduplication of data from potential Prometheus HA replicas! Just point your dashboards at the Querier HTTP endpoint or use the Thanos UI directly.

However, if you need to back up metrics and long-term storage, you will need to perform three more steps:

  1. Create an AWS S3 or GCS bucket and configure the Sidecar to copy data to it. You can now minimize local storage.
  2. Deploy the Store Gateway and join it to the existing gossip cluster. Now queries can reach the backed-up data!
  3. Deploy the Compactor to improve query performance over long time ranges with compaction and downsampling.

If you want to know more, feel free to look at our example Kubernetes manifests and the getting started guide!

In just five steps, we have turned Prometheus into a reliable monitoring system with a global view, unlimited retention and potential high availability of metrics.

Pull requests: we need you!


Thanos has been an open source project from the start. Seamless integration with Prometheus and the ability to use only a subset of Thanos components make it an excellent choice for scaling a monitoring system without extra effort.

We always welcome GitHub pull requests and issues. Do not hesitate to reach out via GitHub Issues or the improbable-eng Slack channel #thanos if you have questions or feedback, or want to share your experience! And if you like what we do at Improbable, feel free to contact us: we are always hiring!





