How to use Prometheus to detect anomalies in GitLab


One of the basic features of the Prometheus query language is real-time aggregation of time series. The Prometheus query language can also be used to detect anomalies in time series data.

The Mail.ru Cloud Solutions team has translated an article by an engineer on the GitLab infrastructure team; it contains code examples you can try on your own systems.

What is anomaly detection for?


There are four main reasons why anomaly detection matters for GitLab:

  1. Incident diagnostics: we can detect which services have gone outside their normal bounds, reducing the mean time to detection (MTTD) of an incident and, accordingly, offering a faster resolution.
  2. Detecting performance degradation: for example, if a regression is introduced into a service that causes it to call another service too often, we can quickly find and fix the problem.
  3. Detecting and stopping abuse: GitLab provides continuous integration and delivery (GitLab CI/CD) and hosting (GitLab Pages), and a small number of users may try to abuse these mechanisms.
  4. Security: anomaly detection is important for spotting unusual trends in GitLab time series.

For these and other reasons, the author of the article set out to figure out how to configure anomaly detection on GitLab time series using Prometheus queries and rules.

What is the correct aggregation level?


First, the time series must be correctly aggregated. In the example below, we use the standard http_requests_total counter, although many other metrics would work just as well.

http_requests_total{
 job="apiserver",
 method="GET",
 controller="ProjectsController",
 status_code="200",
 environment="prod"
}

This example metric has several labels: method, controller, status_code, and environment, plus labels added by Prometheus itself, such as job and instance.

Now you need to choose the right level of data aggregation. Both over-aggregation and under-aggregation matter for anomaly detection. If the data is aggregated too much, there are two potential problems:

  1. You can miss an anomaly, because the aggregation hides problems that occur in subsets of your data.
  2. If you do find an anomaly, it is hard to attribute it to a specific part of your system without additional effort.

If the data is not aggregated enough, the number of false positives increases, and valid variations in the data may be misinterpreted as errors.

In our experience, the right level of aggregation is the service level: we keep the job and environment labels and discard all the other labels.

The aggregation we will use throughout this article is job:http_requests:rate5m: a five-minute request rate, calculated per job and environment.

- record: job:http_requests:rate5m
  expr: >
    sum without(instance, method, controller, status_code) (
      rate(http_requests_total[5m])
    )
# --> job:http_requests:rate5m{job="apiserver", environment="prod"}  21321
# --> job:http_requests:rate5m{job="gitserver", environment="prod"}  2212
# --> job:http_requests:rate5m{job="webserver", environment="prod"}  53091

The example above shows that subsets of the http_requests_total series are selected per job and environment, and their per-second rate over five minutes is calculated.

Using Z-score to detect anomalies


The basic principles of statistics can be applied to detect anomalies.

If you know the mean and standard deviation of the Prometheus series, then you can use any sample in the series to calculate the z-score. 

A z-score is a measure of how far an observed or measured value lies from the mean, expressed in standard deviations.

That is, a z-score of 0 means the value is identical to the mean of a normally distributed dataset, and a z-score of 1 means the value is one standard deviation away from the mean.
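
In formula form, for a sample x from a series with mean μ and standard deviation σ, the z-score is:

z = (x - μ) / σ

As a quick illustration (with made-up numbers, not GitLab data): if the weekly average is 21,000 RPS and the standard deviation is 300 RPS, then a sample of 21,900 RPS has a z-score of (21,900 - 21,000) / 300 = 3.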

We assume that the underlying data has a normal distribution, which means that 99.7% of samples have a z-score between -3 and +3. The farther a z-score is from zero, the less likely such a value is.

We apply this property to detect anomalies in the Prometheus data series:

  1. We calculate the mean and standard deviation for the metric using data with a large sample size. For this example, we use one week of data. If we assume that the recording rules are evaluated once a minute, then over a week we accumulate a little over 10,000 samples.

    # Long-term average value for the series
    - record: job:http_requests:rate5m:avg_over_time_1w
      expr: avg_over_time(job:http_requests:rate5m[1w])

    # Long-term standard deviation for the series
    - record: job:http_requests:rate5m:stddev_over_time_1w
      expr: stddev_over_time(job:http_requests:rate5m[1w])

  2. Once we have the mean and standard deviation for the aggregation, we can calculate the z-score in a Prometheus query.

    # Z-score for the aggregation
    (
      job:http_requests:rate5m -
      job:http_requests:rate5m:avg_over_time_1w
    ) / job:http_requests:rate5m:stddev_over_time_1w


Based on the statistical properties of the normal distribution, we can assume that any value with a z-score outside the range of -3 to +3 is an anomaly, and create an alert on such values. For example, we can receive an alert when our aggregation stays outside this range for more than five minutes.
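
As an illustration, a minimal alerting rule along these lines might look like the sketch below; it reuses the recording rules defined above, while the alert name and severity label are hypothetical.

- alert: RequestRateZScoreOutOfBounds
  # Sketch: fire when the request rate stays more than three standard
  # deviations away from its weekly average for five minutes
  expr: >
    abs(
      (
        job:http_requests:rate5m - job:http_requests:rate5m:avg_over_time_1w
      ) / job:http_requests:rate5m:stddev_over_time_1w
    ) > 3
  for: 5m
  labels:
    severity: warning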


Graph of requests per second to the GitLab Pages service over 48 hours; the z-score range from -3 to +3 is highlighted in green

The z-score can be difficult to interpret on a graph, since it has no unit of measurement. But anomalies on this graph are easy to identify: everything outside the green zone, which marks the corridor of values with a z-score between -3 and +3, is an anomalous value.

What to do if the data distribution is not normal


Our approach assumes that the data is normally distributed; otherwise, the calculated z-score will be misleading.

There are many statistical techniques for testing whether data is normally distributed, but the simplest option is to check that the minimum and maximum z-scores of your data lie roughly in the range from -4.0 to +4.0.

Two Prometheus queries showing minimum and maximum z-scores:

(
max_over_time(job:http_requests:rate5m[1w]) - avg_over_time(job:http_requests:rate5m[1w])
) / stddev_over_time(job:http_requests:rate5m[1w])
# --> {job="apiserver", environment="prod"}  4.01
# --> {job="gitserver", environment="prod"}  3.96
# --> {job="webserver", environment="prod"}  2.96

(
 min_over_time(job:http_requests:rate5m[1w]) - avg_over_time(job:http_requests:rate5m[1w])
) / stddev_over_time(job:http_requests:rate5m[1w])
# --> {job="apiserver", environment="prod"}  -3.8
# --> {job="gitserver", environment="prod"}  -4.1
# --> {job="webserver", environment="prod"}  -3.2

If your results land in a range like -20 to +20, the spread is too large and the z-score results will be distorted. Also remember to work with aggregated series. Metrics that typically do not have a normal distribution include error rates, latency, queue lengths, and so on. Many of these metrics work better with fixed alert thresholds anyway.
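
For example, a fixed-threshold alert could look like the following sketch; the job:http_requests_errors:rate5m series and the threshold of 0.5 errors per second are hypothetical and purely illustrative.

- alert: HighErrorRate
  # Sketch: a static threshold on a hypothetical aggregated error-rate series
  expr: job:http_requests_errors:rate5m > 0.5
  for: 10m
  labels:
    severity: warning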

Statistical seasonality anomaly detection


Although the z-score calculation works well for normally distributed time series data, there is a second method that can give even more accurate anomaly detection results: using statistical seasonality.

Seasonality is a property of a time series metric whereby the metric undergoes regular and predictable changes that repeat every cycle.


Chart of requests per second (RPS) from Monday to Sunday over four consecutive weeks

The chart above shows RPS (requests per second) over seven days, from Monday to Sunday, for four consecutive weeks. This seven-day range is called the "offset", that is, the pattern that will be used for the measurement.

Each week on the chart is shown in a different color. The seasonality of the data is visible in the consistency of the trends on the graph: every Monday morning we observe a similar increase in RPS, and every Friday evening a similar decrease.

By using the seasonality of our time series data, we can make more accurate predictions and therefore detect anomalies more accurately.

How to use seasonality


To account for seasonality, we need several different statistical calculations in Prometheus.

First, we make a prediction by adding a one-week growth trend to the values from the previous week. The growth trend is calculated by subtracting the previous week's rolling one-week average from the current rolling one-week average.

- record: job:http_requests:rate5m_prediction
  expr: |
    job:http_requests:rate5m offset 1w            # Value from the same period last week
    + job:http_requests:rate5m:avg_over_time_1w   # One-week growth trend:
    - job:http_requests:rate5m:avg_over_time_1w offset 1w

The first iteration turns out to be somewhat "narrow": we use this week's and the previous week's five-minute windows to get our forecast.

In the second iteration, we broaden the window by taking the average over a four-hour period from the previous week and comparing it with the current week.

Thus, if we try to predict the metric value at eight in the morning on Monday, instead of the same five-minute window the week before, we take the average value of the metric from six to ten in the morning of the previous Monday.

- record: job:http_requests:rate5m_prediction
  expr: |
    avg_over_time(job:http_requests:rate5m[4h] offset 166h) # Smoothed value from last period
    + job:http_requests:rate5m:avg_over_time_1w             # Add the one-week growth trend:
    - job:http_requests:rate5m:avg_over_time_1w offset 1w

The query uses an offset of 166 hours, which is two hours less than a full week (7 * 24 = 168): we want a four-hour window centered on the current time of day one week ago, so the offset must be two hours less than a full week.
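
To make the arithmetic concrete, here is the same range selector with the times written out; the evaluation time is chosen purely for illustration.

# Evaluated at 08:00 on a Monday, this selects the interval
# [now - 170h, now - 166h], i.e. 06:00 to 10:00 on the previous Monday:
# a four-hour window centered exactly one week (168h) back.
avg_over_time(job:http_requests:rate5m[4h] offset 166h)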


Actual RPS (yellow) and predicted RPS (blue) over two weeks

A comparison of the actual RPS with our forecast shows that our calculations were fairly accurate. However, this method has a drawback.
 
For example, on Wednesday, May 1, GitLab was used less than on a typical Wednesday, because that day was a public holiday. Since the estimated growth trend depends on how the system was used during the previous week, our forecast for the following week, Wednesday, May 8, predicted a lower RPS than the actual value.

This error can be corrected by making three forecasts based on the three consecutive weeks before Wednesday, May 8, that is, on the three previous Wednesdays. The query stays the same, only the offset changes (see the sketch below).
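
The three per-week predictions could be written as separate recording rules like the sketch below; the rule names with the _1w, _2w and _3w suffixes are illustrative, and the offsets follow the same pattern of 166 hours plus whole weeks (166h, 334h, 502h).

# Sketch: one prediction per prior week (rule names are illustrative)
- record: job:http_requests:rate5m_prediction_1w
  expr: >
    avg_over_time(job:http_requests:rate5m[4h] offset 166h)
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 1w

- record: job:http_requests:rate5m_prediction_2w
  expr: >
    avg_over_time(job:http_requests:rate5m[4h] offset 334h)
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 2w

- record: job:http_requests:rate5m_prediction_3w
  expr: >
    avg_over_time(job:http_requests:rate5m[4h] offset 502h)
    + job:http_requests:rate5m:avg_over_time_1w
    - job:http_requests:rate5m:avg_over_time_1w offset 3w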


The chart shows three forecasts for the three weeks before May 8, compared with the actual RPS for Wednesday, May 8. You can see that two of the forecasts are quite accurate, but the forecast based on the week of May 1 is still off.

In addition, we do not want three forecasts, we want a single forecast. Taking the average is not an option, as it would be skewed by our distorted RPS data from May 1. Instead, we want the median. Prometheus does not have a median aggregation, but we can use a quantile aggregation with a quantile of 0.5, which is the median.

The only problem is that we are trying to include three series in the aggregation, and these three series are actually the same series over three different weeks. In other words, they all have the same labels, so combining them is tricky.

To avoid confusion, we create a label called offset and use the label_replace function to add this label to each of the three week-specific series. Then, in the quantile aggregation, we drop this label, and that gives us the median of the three values.

- record: job:http_requests:rate5m_prediction
  expr: >
    quantile(0.5,
      label_replace(
        avg_over_time(job:http_requests:rate5m[4h] offset 166h)
        + job:http_requests:rate5m:avg_over_time_1w
        - job:http_requests:rate5m:avg_over_time_1w offset 1w
        , "offset", "1w", "", "")
      or
      label_replace(
        avg_over_time(job:http_requests:rate5m[4h] offset 334h)
        + job:http_requests:rate5m:avg_over_time_1w
        - job:http_requests:rate5m:avg_over_time_1w offset 2w
        , "offset", "2w", "", "")
      or
      label_replace(
        avg_over_time(job:http_requests:rate5m[4h] offset 502h)
        + job:http_requests:rate5m:avg_over_time_1w
        - job:http_requests:rate5m:avg_over_time_1w offset 3w
        , "offset", "3w", "", "")
    )
    without (offset)

Now our forecast, taken as the median of the three predictions, has become more accurate.


Median forecast versus actual RPS

How to tell whether the forecast is really accurate


To check the accuracy of the forecast, we can return to the z-score. This time it measures the deviation of a sample from its forecast, in standard deviations. The more standard deviations a sample is from the forecast, the more likely it is that the value is an outlier.


The deviation from the forecast stays in the z-score range from -1.5 to +1.5

We can change our Grafana chart to use the seasonal forecast rather than a weekly moving average. The range of normal values for a given time of day is shaded green; anything outside the green zone is considered an outlier. In this case, the outlier occurred on Sunday afternoon, when our cloud provider had some network problems.

A good rule of thumb is to use a z-score threshold of ±2 for seasonal forecasts.
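
For illustration, the quantity plotted on such a chart can be expressed with a query like the one below, reusing the recording rules defined earlier; the green band then corresponds to values between -2 and +2.

# Seasonal z-score: deviation of the actual rate from the seasonal
# forecast, in units of the weekly standard deviation
(
  job:http_requests:rate5m - job:http_requests:rate5m_prediction
) / job:http_requests:rate5m:stddev_over_time_1w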

How to set up alerts with Prometheus


If you want to be alerted on anomalous events, you can set up a fairly simple Prometheus alerting rule that checks whether the metric's z-score falls outside the range of -2 to +2.

- alert: RequestRateOutsideNormalRange
  expr: >
   abs(
     (
       job:http_requests:rate5m - job:http_requests:rate5m_prediction
     ) / job:http_requests:rate5m:stddev_over_time_1w
   ) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Requests for job {{ $labels.job }} are outside of expected operating parameters

At GitLab, we use a custom routing rule that sends this alert to Slack when an anomaly is detected, but does not page the on-call team.
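
A hedged sketch of what such routing might look like in Alertmanager is shown below; the receiver names and the Slack channel are hypothetical, and this is not GitLab's actual configuration.

route:
  receiver: on-call-pager              # default receiver for everything else
  routes:
    - match:
        alertname: RequestRateOutsideNormalRange
      receiver: slack-anomalies        # notify Slack, do not page
receivers:
  - name: on-call-pager
    # the paging integration (for example, pagerduty_configs) would go here
  - name: slack-anomalies
    slack_configs:
      - channel: '#anomaly-reports'
        # requires an api_url here or a global slack_api_url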

How to detect anomalies in GitLab using Prometheus


  1. Prometheus can be used to detect certain types of anomalies.
  2. Proper aggregation is the key to finding anomalies.
  3. Z-scoring is effective if your data has a normal distribution.
  4. Statistical seasonality is a powerful mechanism for detecting anomalies.

Translated with the support of Mail.ru Cloud Solutions.

Also useful:

  1. Simple Caching Methods in GitLab CI: A Picture Guide.
  2. Ready-made and customized GitLab CE tools in the MCS marketplace.
  3. Our Telegram channel on digital transformation.
