Integrated Server Load Metrics Evaluation

While working at one of the largest banks in the country, I faced the task of evaluating how efficiently resources were being used across approximately 16 thousand servers. The task was formulated very simply: develop a methodology for evaluating server load metrics over a period. Ideally, the load of a server over a period should be described by one or several (no more than 8) numbers.

A few words about the specifics of using virtual servers


Large organizations (especially banks) have a motley zoo of legacy applications deployed on different servers using a variety of virtualization technologies. A private cloud is a promising technology, but in reality large organizations will continue to use various virtualization platforms to deploy a variety of applications for a long time to come.

As virtualization platforms evolve, there comes a point when no one in the company can tell how efficiently the resources are being used. Even the most advanced monitoring tools cannot answer this question because of the variety of server usage scenarios. For example, a department may have a reporting server that is fully loaded only for a limited period of time, say 3-4 hours at the end of the month. In real life, no one allocates resources to such servers dynamically: this is difficult both technically and organizationally. Instead, resources are allocated for the maximum periodic load, even though it occurs infrequently.

In short, in large organizations, virtual farm resources are used extremely inefficiently.

Below I propose a methodology that makes it easy to justify increasing or decreasing the resources allocated to a virtual server, regardless of the usage scenario.

Methodology


To assess resource load, statistics from various counters must be collected, and various metrics are then computed from them. Conventionally, counters can be divided into two types by their rate of change: “fast” and “slow”. A good example of a “fast” counter is CPU load (% CPU); an example of a “slow” counter is the percentage of free hard disk space (% FreeSpace).
Evaluating a slow counter comes down to calculating the extreme (minimum or maximum) value of the metric over the period. This approach (for example, when assessing free disk space) lets you estimate the free resource and, if necessary, allocate additional volumes or reduce the current ones.
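In ClickHouse terms (the storage described in the Solution section below), such an estimate is a single aggregation. A minimal sketch, assuming a hypothetical row-per-reading table counters(server, counter, ts, value):

```sql
-- Minimum free disk space per server over one month
-- (table and column names are illustrative).
SELECT
    server,
    min(value) AS min_free_space_pct
FROM counters
WHERE counter = '% FreeSpace'
  AND ts >= toDateTime('2019-09-01 00:00:00')
  AND ts <  toDateTime('2019-10-01 00:00:00')
GROUP BY server
ORDER BY min_free_space_pct ASC
```

Servers at the top of this list are candidates for additional volumes; servers with a consistently high minimum are candidates for reduction.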

For fast counters a different approach is used. The disadvantages of using simple integral metrics (average, maximum, minimum and median) to assess the dynamics of such counters are well described here. A common disadvantage is the loss of information about elevated loads (medium and peak). If we take the maximum value over the period as the integral metric, then the presence of outliers (for example, an instantaneous CPU spike to 100% when a program starts) will not give objective information.

In the article it is proposed to use the 0.9 quantile to estimate a fast metric (this is the value below which the observed value lies in 90% of the samples). With a uniform server load this metric gives an adequate estimate of the average CPU load. But this approach has the same drawback: the loss of information about elevated loads (medium and peak).
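For reference, all of these integral metrics are one-line aggregations in ClickHouse. A sketch using the same hypothetical counters table (quantile() is ClickHouse's approximate quantile function; quantileExact() also exists):

```sql
-- Integral metrics of % CPU for one server over one week;
-- each collapses the whole period into a single number.
SELECT
    avg(value)           AS avg_cpu,
    max(value)           AS max_cpu,     -- sensitive to momentary outliers
    median(value)        AS median_cpu,
    quantile(0.9)(value) AS q90_cpu
FROM counters
WHERE counter = '% CPU'
  AND server = 'srv-001'                 -- illustrative server name
  AND ts >= toDateTime('2019-09-02 00:00:00')
  AND ts <  toDateTime('2019-09-09 00:00:00')
```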

Below, as an illustration, are the weekly and daily charts of the % CPU counter. The maximum counter value on the charts is 100%.





The graph shows a burst of load during the indicated period that lasts about 3 hours. A variety of metrics were calculated for this counter over the week. Figure 2 shows that the median (green line, value 5%), the average (yellow, 12%) and the 0.9 quantile (red, 27%) all smooth out the load burst, and information about it is lost.

As a development of the quantile idea, I propose a sliding quantile. This is an analogue of the moving average, but with the 0.9 quantile used as the window function. Moreover, we will use two sliding quantiles to estimate the counter level: a fast one with a short period (1 hour) and a slow one with a long period (24 hours). The fast quantile filters out instantaneous spikes and provides information about peak load; the slow quantile makes it possible to estimate the average load.

As you can see from the graphs, the sliding 0.9 quantiles are dynamic characteristics (brown is the fast one, purple is the slow one). For simplicity, it is proposed to use the following metrics to evaluate the state of a counter:

  • the maximum value of the quantile with a 1-hour period, which shows the maximum sustained server load over the period,
  • the average value of the quantile with a 24-hour period, which shows the average server load over the period.

On the graph, the maximum value of the fast quantile is the black line at 85%; the average value of the slow quantile is the pink line at 30%.
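As a sketch of how these two numbers can be computed, the sliding windows can be approximated with tumbling (non-overlapping) ones: take the 0.9 quantile per hour and per day, then the maximum of the former and the average of the latter. This loses some smoothness compared to a true sliding window (the workaround for which is shown in the Solution section), but it conveys the idea; table and column names are illustrative:

```sql
-- Fast metric: maximum of hourly 0.9 quantiles (maximum sustained load).
SELECT server, max(q90_1h) AS max_fast_q90
FROM
(
    SELECT server, toStartOfHour(ts) AS hour, quantile(0.9)(value) AS q90_1h
    FROM counters
    WHERE counter = '% CPU'
    GROUP BY server, hour
)
GROUP BY server;

-- Slow metric: average of daily 0.9 quantiles (average load).
SELECT server, avg(q90_24h) AS avg_slow_q90
FROM
(
    SELECT server, toStartOfDay(ts) AS day, quantile(0.9)(value) AS q90_24h
    FROM counters
    WHERE counter = '% CPU'
    GROUP BY server, day
)
GROUP BY server;
```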

Thus, when analyzing server resource load (using the % CPU counter), taking the monthly average (12%) as the metric could lead to the erroneous decision to reduce the allocated resources. The double fast/slow sliding quantile metric (85% and 30%) shows that the allocated resources are sufficient, but there is no surplus either.

Solution


The implementation of the resource usage efficiency assessment was decomposed into 3 tasks:

  1. data collection
  2. development of assessment methodology
  3. methodology implementation in current architecture

Task 2 was examined above; below we will talk a little about the third task.

Data was collected into a ClickHouse database. This columnar DBMS is ideally suited for storing time-series data. This was discussed in detail at the ClickHouse Meetup on September 5, 2019. A comparison of ClickHouse with other time-series DBMSs can be found here.
As a result of data collection, we formed several tables in which the data was organized row by row (each counter reading was written as a separate row). And, of course, there were problems with the raw data.
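The exact schema is not essential to the methodology; a minimal sketch of such a row-per-reading table (illustrative names) might look like this:

```sql
-- One row per counter reading.
CREATE TABLE counters
(
    server  String,     -- server name
    counter String,     -- counter name, e.g. '% CPU'
    ts      DateTime,   -- reading timestamp
    value   Float64     -- counter value
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (counter, server, ts);
```

Ordering by (counter, server, ts) keeps each counter's time series physically contiguous, which is what the queries below scan.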

The first problem is the uneven intervals between counter records. For example, with a standard recording period of 5 minutes, there were occasional gaps, and the next record came more than 5 minutes (up to 20 minutes) after the previous one.

The second problem is that data for the same counter sometimes arrived two or more times (with different values) with the same timestamp.

And the third problem: ClickHouse has no window functions.

The first problem can be solved with ASOF JOIN. The idea is quite simple: for each counter of each server, create a table with an evenly spaced time grid. ASOF JOIN then fills the values in the new table with the nearest values from the raw data table (fill options similar to ffill and bfill can be configured).
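A sketch of this regularization for a single counter, with an illustrative raw table raw_cpu(server, ts, value): the grid is generated with numbers(), and ASOF LEFT JOIN picks the nearest reading at or before each grid point (i.e. ffill):

```sql
SELECT
    g.server,
    g.ts,
    r.value                     -- nearest raw reading at or before g.ts
FROM
(
    -- Uniform 5-minute grid for every server: 30 days = 8640 steps.
    SELECT server, ts
    FROM (SELECT DISTINCT server FROM raw_cpu) AS s
    CROSS JOIN
    (
        SELECT toDateTime('2019-09-01 00:00:00') + number * 300 AS ts
        FROM numbers(8640)
    ) AS t
) AS g
ASOF LEFT JOIN raw_cpu AS r
    ON g.server = r.server AND g.ts >= r.ts
```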

The second problem is solved by aggregation, choosing the maximum value for each timestamp.
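In SQL this is a plain GROUP BY over the duplicated key (same illustrative names):

```sql
-- Collapse duplicate (server, ts) readings, keeping the maximum value.
SELECT server, ts, max(value) AS value
FROM raw_cpu
GROUP BY server, ts
```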

To solve the third problem, several options were considered. First, a Python script was rejected due to insufficient performance. The second option, copying the raw data to an MSSQL database, calculating the metrics there and copying the results back, seemed too complicated to implement. MSSQL does have window functions, but it lacks the aggregate function we need. One could go further and write a custom SQL CLR function, but this option was also rejected due to excessive complexity.

A working solution turned out to be an SQL script for ClickHouse. A sketch of this script is given below. For simplicity, I consider calculating only the fast quantile for a single counter across multiple servers. The solution does not look very simple or very convenient, but it works.
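The original script is not reproduced here, so what follows is a minimal sketch of the idea under the assumptions above: the data is already on a uniform 5-minute grid (so a 1-hour window is 12 points), and the regularized table is called cpu_uniform(server, ts, value). groupArray collects each server's series into arrays, and arrayReduce applies the quantile over a trailing arraySlice, standing in for the missing window function:

```sql
SELECT
    server,
    ts_arr[i] AS ts,
    -- 0.9 quantile over the last 12 points (1 hour) ending at ts.
    arrayReduce(
        'quantile(0.9)',
        arraySlice(val_arr, greatest(1, i - 11), least(i, 12))
    ) AS q90_fast
FROM
(
    SELECT
        server,
        groupArray(ts)    AS ts_arr,    -- readings collected in ts order
        groupArray(value) AS val_arr
    FROM
    (
        SELECT server, ts, value
        FROM cpu_uniform                -- regularized table from the previous step
        ORDER BY server, ts
    )
    GROUP BY server
)
ARRAY JOIN arrayEnumerate(ts_arr) AS i
```

The slow quantile is the same query with a 288-point window (24 hours of 5-minute steps).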

As a result, a test report was created in PowerBI to demonstrate the methodology.





Conclusion


In conclusion, I would like to speculate a little on how the solution might develop. If you look at it from the point of view of data warehousing, it solves the task of building a data warehouse (Data Warehouse) from a layer of raw data (Staging Area). One can argue about the architecture, but for ClickHouse, as a columnar database, normalization is not critical (and may even be harmful).

Further development of the storage lies in creating aggregate tables (day / week / month) with different lifetimes (TTL). This will avoid excessive growth of the storage.
The next step may be to use data for predictive analytics.
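A sketch of such a daily rollup with a TTL (illustrative schema; ClickHouse drops expired data automatically):

```sql
CREATE TABLE cpu_metrics_daily
(
    server       String,
    day          Date,
    q90_fast_max Float64,   -- max of hourly 0.9 quantiles within the day
    q90_slow     Float64    -- 0.9 quantile over the whole day
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (server, day)
TTL day + INTERVAL 12 MONTH;  -- keep daily aggregates for a year
```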

PS

The code and data for testing are posted here.
