How to tidy up an overloaded server?

The material we are publishing in translation today is dedicated to finding bottlenecks in server performance, fixing problems, improving system performance, and preventing performance degradation. On the way to solving the problems of an overloaded server, it proposes the following four steps:

  1. Assess the situation: determine the server's performance bottleneck.
  2. Stabilize the server: apply urgent measures to improve the situation.
  3. Improve the system: expand and optimize its capabilities.
  4. Monitor the server: use automated tools to prevent problems from occurring.



1. Assessment of the situation


When traffic overloads the server, the processor, network, memory, or disk I/O can become the performance bottleneck. Determining what exactly causes the problem lets you focus your efforts where they matter most. Let us consider how to analyze the most important server subsystems.

  • Processor. Sustained CPU usage above 80% should be investigated. Server performance often degrades once CPU utilization reaches 80-90%, and it becomes severe as usage approaches 100%. The cost of serving a single request is negligible, but doing so at scale during a traffic spike can overwhelm the server. Offloading work to other infrastructure, reducing expensive operations, and limiting the number of incoming requests all help reduce CPU load.
  • Network. During peak traffic, the network throughput required to serve user requests can exceed the available capacity. Depending on the hosting provider, some sites may also run into data transfer limits. Reducing the size and number of data transfers to and from the server removes this bottleneck.
  • Memory. When the system does not have enough memory, data is pushed out to disk, which is much slower to access, and the whole application slows down. If memory is exhausted completely, out-of-memory (Out Of Memory, OOM) errors occur. Adjusting memory allocation, fixing memory leaks, and adding memory can eliminate this bottleneck.
  • Disk I/O. The rate at which data can be read from or written to disk is limited by the disk itself. If disk I/O is the bottleneck, caching more data in memory can alleviate the problem (at the cost of higher memory usage). If that does not help, the disks may need to be upgraded.

What we discuss below is aimed at solving processor and network problems, since during periods of peak traffic most projects suffer from exactly these issues.

You can start troubleshooting server problems by using the top command. If available, you can also turn to historical data from the hosting provider and to data collected by monitoring systems.
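For illustration, here is a minimal sketch of how the same four subsystems that top reports on can be sampled programmatically. It assumes the third-party psutil package is installed; the 80% threshold echoes the rough guideline above and is not a hard rule.

```python
# Minimal snapshot of the four subsystems discussed above.
# Assumes the third-party "psutil" package is installed (pip install psutil).
import psutil

def snapshot():
    cpu = psutil.cpu_percent(interval=1)   # % CPU averaged over a 1-second sample
    mem = psutil.virtual_memory()          # RAM usage
    disk = psutil.disk_io_counters()       # cumulative disk reads/writes
    net = psutil.net_io_counters()         # cumulative bytes sent/received

    print(f"CPU:     {cpu:.1f}%")
    print(f"Memory:  {mem.percent:.1f}% of {mem.total / 2**30:.1f} GiB used")
    print(f"Disk:    {disk.read_bytes / 2**20:.0f} MiB read, "
          f"{disk.write_bytes / 2**20:.0f} MiB written")
    print(f"Network: {net.bytes_sent / 2**20:.0f} MiB sent, "
          f"{net.bytes_recv / 2**20:.0f} MiB received")

    if cpu > 80:
        print("Warning: CPU usage above 80%, investigate further")

if __name__ == "__main__":
    snapshot()
```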

2. Server stabilization


An overloaded server can quickly lead to cascading failures in other parts of the system. That is why it is important, once it becomes known that the server is overloaded, to stabilize it first and only then investigate the situation with a view to making more serious improvements.

▍ Limit the request processing rate


Limiting the rate of request processing protects the infrastructure by capping the number of incoming requests. This is very important when server performance drops: as the server response time grows, users tend to refresh the page aggressively, which further increases the load on the server.

Although rejecting a request is a simple and effective measure, it is better to reduce the load on the server by limiting the number of requests that reach it with the help of some external system. This can be, for example, a load balancer, a reverse proxy server, or a CDN. Below are links to instructions for working with several systems of this kind:


Here is material on reducing server load using various approaches to rate limiting.
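To make the idea concrete, here is a minimal token-bucket rate limiter sketched in Python. It is a toy, in-process example rather than a substitute for rate limiting in a load balancer, reverse proxy, or CDN, and the rate and capacity values are arbitrary assumptions.

```python
import time

class TokenBucket:
    """Naive in-process token bucket: allow roughly `rate` requests per second,
    with short bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # reject the request, e.g. with HTTP 429

# Usage: allow about 100 requests per second with bursts of up to 20.
bucket = TokenBucket(rate=100, capacity=20)
if not bucket.allow():
    print("Too many requests, responding with 429")
```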

▍HTTP caching


Look for ways to improve caching of content. If the resource can be given to the user from the HTTP cache (from the browser cache or from the CDN), then it does not need to be requested from the server, which reduces the load on the server.

HTTP headers like Cache-Control, Expires, and ETag indicate how a particular resource should be cached. Auditing and correcting these headers can help improve caching.

Although you can resort to service workers for caching, they use a separate cache. Service worker caching is an aid to the core browser caching system, not a replacement for it. Therefore, when fixing the problems of an overloaded server, efforts should be focused on optimizing HTTP caching.

Diagnostics


Launch Lighthouse and look at the Serve static assets with an efficient cache policy audit to see a list of resources with short and medium caching times (Time To Live, TTL). Review the resources listed and consider increasing their TTL. Here are rough caching lifetimes applicable to various resources:

  • Static resources need to be cached for a long period (1 year).
  • Dynamic resources need to be cached for a short time (3 hours).

Cache setting


Write the desired caching time, expressed in seconds, into the max-age directive of the Cache-Control header. Here are instructions for setting this header on different systems:


Note that the max-age directive is just one of many directives that affect caching. There are other directives and other headers that influence cache behavior. To understand this topic better, it is recommended that you read this HTTP caching guide.
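As a sketch of how this looks in application code, here is a hypothetical Flask app that sets the max-age directive. The route names are invented, and the one-year and three-hour lifetimes simply mirror the guideline values given above.

```python
# Hypothetical Flask app illustrating the Cache-Control: max-age directive.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/static-asset")
def static_asset():
    resp = make_response("/* versioned CSS or JS bundle */")
    # Static, versioned resources: cache for one year (in seconds).
    resp.headers["Cache-Control"] = "public, max-age=31536000"
    return resp

@app.route("/dynamic-page")
def dynamic_page():
    resp = make_response("<html>...</html>")
    # Dynamic content: cache for a short period, here three hours.
    resp.headers["Cache-Control"] = "public, max-age=10800"
    return resp
```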

▍ Graceful degradation


Graceful degradation is a strategy of temporarily limiting functionality in order to shed excess load from the server. This concept can be applied in many different ways: for example, serving clients a static text page instead of the full application, disabling search, or returning fewer search results than usual. The main focus should be on disabling resource-intensive features that can be switched off without seriously affecting the core functionality of the application.
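Here is a minimal sketch of what such a switch might look like in application code: a "degraded mode" flag that turns off an expensive feature. The flag and the search functions are hypothetical and only illustrate the idea.

```python
# Hypothetical "degraded mode" switch: when the server is overloaded,
# an expensive feature is disabled and a cheap fallback is returned instead.
DEGRADED_MODE = True   # in practice this would come from configuration or an env variable

def expensive_search_backend(query: str) -> dict:
    ...  # placeholder for the real, resource-intensive implementation
    return {"results": ["..."]}

def search(query: str) -> dict:
    if DEGRADED_MODE:
        # Cheap fallback: skip the expensive search backend entirely.
        return {"results": [], "message": "Search is temporarily unavailable"}
    return expensive_search_backend(query)
```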

3. System improvement


▍Use of CDN


The task of serving static resources can be transferred from the server to the Content Delivery Network (CDN). This will reduce the load on the server.

The main function of a CDN is to deliver content to users quickly through a large network of servers located close to them. In addition, some CDNs offer extra performance-related features, among them data compression, load balancing, and optimization of media files.

CDN setup


The advantages of a CDN come from the operator owning a large fleet of servers distributed around the world, which is why running your own CDN service rarely makes sense. A typical CDN setup is a fairly quick procedure, taking about half an hour, and consists of updating DNS records so that they point to the CDN.

CDN Optimization: Case Study


To identify resources that are not served via a CDN (but should be delivered to users through one), you can use WebPageTest. On the results page, click the rectangle labeled Effective use of CDN and view the list of resources that should be served by a CDN.


WebPageTest Results

Problem solving


If resources are not cached using the CDN, find out if the following conditions are true:


▍ Scaling computing resources


The decision to scale computing resources should be made with caution. Although scaling often solves certain problems, doing it at the wrong time can needlessly complicate the system and increase the cost of supporting it.

Diagnostics


A high Time To First Byte (TTFB) may be a sign that the server is approaching its limits. You can find TTFB information in the Reduce server response times (TTFB) section of the Lighthouse report.

For a deeper study of the situation, use a monitoring tool and analyze processor usage. If the current or forecasted CPU load exceeds 80%, you should think about increasing server capacity.
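Besides Lighthouse, TTFB can be roughly estimated from a script. Below is a small sketch using the third-party requests library; response.elapsed measures the time until the response headers arrive, which approximates TTFB. The URL and the 600 ms threshold are arbitrary assumptions.

```python
# Rough TTFB estimate using the third-party "requests" library.
import requests

url = "https://example.com/"               # substitute your own URL
response = requests.get(url, stream=True)  # stream=True avoids downloading the body
# elapsed covers the time from sending the request until the headers were parsed.
ttfb_ms = response.elapsed.total_seconds() * 1000

print(f"Approximate TTFB for {url}: {ttfb_ms:.0f} ms")
if ttfb_ms > 600:                          # arbitrary threshold for illustration
    print("High TTFB: the server may be approaching its limits")
```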

Problem solving


Adding a load balancer to the system allows you to distribute traffic between multiple servers. A load balancer sits in front of the server pool and routes traffic to the appropriate servers. Cloud providers offer load balancers (GCP, AWS, Azure), but you can also set up your own using HAProxy or NGINX. Once the load balancer is in place, additional servers can be added to the system.

In addition to load balancing, most cloud providers offer automatic scaling of computing power (GCP, AWS, Azure). Autoscaling builds on load balancing: during periods of high load additional resources are allocated, and during periods of low load unneeded resources are released. Even so, autoscaling is not a universal solution: it takes time to spin up servers automatically, and autoscaling configurations require careful tuning. Therefore, before adopting a complex autoscaling setup, it is worth trying a relatively simple configuration with a load balancer.
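For intuition only, here is a toy sketch of the round-robin idea a load balancer applies: incoming requests are handed to backend servers in turn. A real deployment would use HAProxy, NGINX, or a cloud load balancer rather than code like this; the backend addresses are made up.

```python
from itertools import cycle

# Hypothetical pool of backend servers sitting behind the balancer.
backends = cycle(["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"])

def pick_backend() -> str:
    """Round-robin: each call returns the next backend in the pool."""
    return next(backends)

# Every incoming request would be forwarded to the server returned here.
for request_id in range(5):
    print(f"request {request_id} -> {pick_backend()}")
```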

▍Using data compression


Text resources must be compressed using the gzip or brotli algorithm. In some cases, compression can help reduce the size of such resources by about 70%.
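As a quick sketch of the effect, here is how a text resource can be compressed with gzip from the standard library and, if the third-party brotli package is installed, with Brotli. The sample HTML is invented, and the actual savings depend on the content.

```python
import gzip

html = ("<html><body>" + "<p>Hello, overloaded server!</p>" * 500 + "</body></html>").encode()

gz = gzip.compress(html)
print(f"original: {len(html)} bytes, gzip: {len(gz)} bytes "
      f"({100 * (1 - len(gz) / len(html)):.0f}% smaller)")

# Brotli usually compresses text a little better; needs the third-party "brotli" package.
try:
    import brotli
    br = brotli.compress(html)
    print(f"brotli: {len(br)} bytes ({100 * (1 - len(br) / len(html)):.0f}% smaller)")
except ImportError:
    pass
```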

Diagnostics


To find resources that need compression, you can use the Enable text compression indicator from the Lighthouse report.

Problem solving


To enable compression, you need to edit the server settings. Here are the details about this:


▍Optimization of images and other media materials


Images account for the bulk of the content of most websites. Optimizing them can significantly reduce the overall size of site materials, and such optimization can be done fairly quickly.

Diagnostics


The Lighthouse report contains several audits that point to potential image optimizations. You can also use the browser's developer tools to find large images; such images are usually good candidates for optimization.

Here is a list of Lighthouse report audits that you should pay attention to when exploring the possibility of image optimization:


If you’re using Chrome’s developer tools to help you optimize your images, you can follow these steps:

  • Record the network activity of the page.
  • Click Img to filter out non-image resources.
  • Click the Size column to sort the image files by size.

Problem solving


First, let's talk about what should be done if you have little time.

In such a situation, pay attention to large images and to images that are downloaded more often than others. Having found them, optimize them manually with a tool like Squoosh. Hero images (large banner photos) are usually good candidates for optimization.

Here's what you need to pay attention to when optimizing images (a code sketch follows this list):

  • Size: Images should not be larger than necessary.
  • Compression: for JPEG images, a quality setting of 80-85 typically reduces file size by 30-40% with little visible loss of quality.
  • Format: use JPEG rather than PNG for photos, and MP4 rather than GIF for animated content.
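The same ideas can be scripted for batch work. Below is a minimal sketch using the third-party Pillow library that shrinks an image to a maximum width and re-saves it as a JPEG at quality 85; the file names and the 1600-pixel limit are arbitrary assumptions.

```python
# Minimal image-optimization sketch using the third-party Pillow library
# (pip install Pillow). File names and the width limit are arbitrary.
from PIL import Image

MAX_WIDTH = 1600   # do not serve images wider than they will ever be displayed

img = Image.open("hero-original.png")

# Downscale if the image is wider than necessary, preserving the aspect ratio.
if img.width > MAX_WIDTH:
    new_height = round(img.height * MAX_WIDTH / img.width)
    img = img.resize((MAX_WIDTH, new_height))

# Photos compress far better as JPEG than PNG; quality 80-85 is usually enough.
img.convert("RGB").save("hero-optimized.jpg", "JPEG", quality=85, optimize=True)
```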

Now a few words about how to approach image optimization for those who have a little more time.

If images make up a significant share of a site's materials, consider using a specialized image CDN to serve them. Such services take the burden of working with images off the main server. Setting up a project to use such a CDN is simple, but it requires updating the existing image links so that they point to the CDN. Here is material on using specialized CDN services designed for images.

▍Minification of JavaScript and CSS


Code minification allows you to reduce its size by removing unnecessary characters.

Diagnostics


Take a look at the Minify CSS and Minify JavaScript metrics in the Lighthouse report to identify resources that need minification.

Problem solving


If you don't have much time, focus on minifying JavaScript code. On most sites there is more JavaScript than CSS, so this will give better results. Here is material about minifying JavaScript, and here is material about minifying CSS.
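Minification is normally handled by the build pipeline (tools like terser, esbuild, or cssnano), but as a small illustration, here is a sketch that minifies a JavaScript file with the third-party jsmin package, assuming it is installed. The file names are made up.

```python
# Minification sketch using the third-party "jsmin" package (pip install jsmin).
# In real projects this step is usually performed by the build pipeline.
from jsmin import jsmin

with open("app.js", encoding="utf-8") as f:
    source = f.read()

minified = jsmin(source)   # strips comments and unnecessary whitespace

with open("app.min.js", "w", encoding="utf-8") as f:
    f.write(minified)

print(f"{len(source)} -> {len(minified)} characters "
      f"({100 * (1 - len(minified) / len(source)):.0f}% smaller)")
```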

4. Server monitoring


Server monitoring tools collect data and visualize it on dashboards, and they can notify you of various events related to server performance. Using these tools helps prevent and mitigate server performance issues.

When setting up a monitoring system, strive for the greatest possible simplicity. Excessive data collection and overly frequent notifications are counterproductive: the wider the range of data collected and the more often it is gathered, the more expensive it is to collect and store; and if the person responsible for the server is bombarded with messages about minor events, they will eventually start ignoring those messages.

Notifications should be based on metrics that consistently and accurately describe problems. Server response time (latency) is especially good for this: it catches a large number of problem situations and is directly related to how users perceive the server. Notifications based on low-level metrics, such as CPU utilization, can be a useful supplement, but they reveal only a small share of possible problems. In addition, notifications should be based not on averages but on the 95th-99th percentiles; otherwise, analyzing averages makes it easy to miss problems that do not affect all users.
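As a tiny illustration of why percentiles matter, here is a sketch that computes the average, p95, and p99 of a set of response times using only the standard library. The sample numbers are invented.

```python
import statistics

# Invented response times in milliseconds: most requests are fast,
# but a small share of users experience very slow responses.
latencies_ms = [120] * 95 + [2500] * 5

mean = statistics.fmean(latencies_ms)
# quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
cuts = statistics.quantiles(latencies_ms, n=100)
p95, p99 = cuts[94], cuts[98]

print(f"mean: {mean:.0f} ms, p95: {p95:.0f} ms, p99: {p99:.0f} ms")
# The mean looks acceptable, while p95 and p99 reveal that some users
# are waiting more than two seconds.
```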

Monitoring setup


All major cloud providers offer their own monitoring tools (GCP, AWS, Azure). Netdata is also worth mentioning as an excellent free, open-source alternative to provider tools. Whichever you use, you will need to install an agent application on each server you want to monitor. After completing the setup, be sure to configure notifications. Here are instructions for setting up different monitoring tools:


Summary


Today we talked about how to identify and fix server performance issues. We would like to believe that your servers will run stably and the advice in this material will never be needed. But if something does go wrong, we hope you find something here that helps you deal with the problem as quickly as possible.

Dear readers! What do you do in a situation where the server on which your project is running starts to slow down?

