Load-testing a regional public services portal with puppeteer

Hello, Habr! You may also be watching with curiosity the modern replay of the American Wild West land rush, where whoever arrives first and plants a flag stakes the claim, or perhaps you are even taking part in it. In the modern version you have to be the first to apply for a public service, whether to receive payments for children or to get a pass to leave the house. With all this going on, I would like to share our team's experience in testing and helping to prepare a regional services portal for the "Enrollment in the first grade" service. The situation closely resembles the Habr effect and, I think, was close to what happened a couple of days ago with the federal portal gosuslugi.ru, only on a regional scale.

We went through this challenge in Khabarovsk in January this year and recently helped prepare a similar service for issuing hunting permits in another region. Below is some of that experience, which may let you look at preparing regional public services portals for peak periods from a different angle.

And for starters, a photo of a black bus that stood on duty at a school around the clock for three days, and in which the author of these lines confirmed his place in the parents' queue three years ago. At least our parents got organized. Keeping watch in winter in Khabarovsk at -30 degrees is a dubious pleasure.

image

Several teams worked on preparing for the first-grade enrollment peak, because each of them is involved in it one way or another: the data center operator, the operator and the developer of the regional portal's information system, the operator and the developer of the integrated education information system, and technical support for users of the regional portal. We at iondv perform the last task, independently monitoring the health of the portal and supporting users.

Our role in the preparation was to organize testing and to give recommendations on the nginx caching configuration; we also prepared instructions for users with the recommended "behavior".

For those who do not know about the problem of enrolling in the 1st grade
, . . , , - ( , — ), , - , , . , , , – , . .

1- , . 0:00 , 10:00 26 . , – . 10:00 , — , - .

. . 2017 . . , , . .

A service for an in-demand resource as a technical problem


The problem with services like "enrollment in the 1st grade" is that the load (the probability density of requests) tends toward a "delta function" (δ-function, Dirac function), which is clearly visible on graphs as peaks. At that moment the number of requests grows many times over within a short period.

δ-function and peak query statistics

Our experience says that the main task of preparation is not to add resources. The task is to minimize the potential number of requests per second, spread them over some period, and prepare the system for the remaining load. To do that you have to find and speed up the bottlenecks; this gives the greatest effect according to the theory of constraints (Goldratt's principles). Otherwise it is the bottleneck that will fail. The whole system should be paced by the bottleneck: the "drum-buffer-rope" principle.

It is physically impossible to pour the entire contents of a 10-minute hourglass through in 1 minute; obviously the glass will break. The same applies to providing services. Nobody is surprised by queues and scandals at the MFC (multifunctional service center) when people come for services, yet everyone is surprised when the portal goes down.

Queuing theory offers different ways of handling load:

  • you can put users on hold, i.e. grow the queue;
  • you can simply refuse to serve those who came later, for example until the earlier requests are processed;
  • you can try to increase capacity endlessly.

The practical answer is somewhere in between. For a service that hands out a limited resource provided by a region or the state, speed is not the only thing that matters; first of all, social justice must be preserved, i.e. equal conditions for everyone. At the same time, a user who has not received what they need initiates a new request. Requests then grow like an avalanche and form the "dog-pile effect" attack pattern (also known as cache stampede or hit-miss storm): the user has already abandoned the request and started a new one, while the previous one is still waiting in the queue to be processed.

This process is amplified by the fact that whole families take part in the submission: dad and mom fill out applications at the same time and often submit several times to be safe. On top of that, they often do it in several tabs and several browsers. Therefore it usually makes sense to multiply the expected peak load by 2-3 relative to the number of people who actually apply for such services.

A digression on fairness
, «», . ? «» , . . — , . « ». . .

Organizing the provision of the service


We calculated the expected number of applicants from a combination of the total number of applications submitted last year and per-minute data from other regions. Typically the peak of applications falls within minutes 5 to 10, partly because the portals barely respond during the first three to five minutes, and partly because users take from 1 to 5 minutes to fill out the form (do not be surprised: many fill it out on a phone even in such "nervous" conditions).

An approximate calculation model for a nominal 1000 applications per hour looks like this:

  • the peak falls between minutes 5 and 10 from the start, and by the Pareto rule 80% of applications are filed in that window;
  • so we plan for roughly 160 applications per minute, or about 3 applications per second.
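The arithmetic behind this model can be written down as a small sketch. The function and parameter names are illustrative and not taken from the project; the optional `duplicationFactor` corresponds to the 2-3x multiplier for families, tabs and browsers mentioned earlier.

```javascript
// Rough peak-load estimate. All names here are hypothetical.
function estimatePeakLoad({ applicationsPerHour, peakSharePercent, peakMinutes, duplicationFactor = 1 }) {
  // Applications expected inside the peak window, including duplicate submissions
  const peakApplications = (applicationsPerHour * peakSharePercent / 100) * duplicationFactor;
  const perMinute = peakApplications / peakMinutes;
  const perSecond = perMinute / 60;
  return { peakApplications, perMinute, perSecond };
}

// 1000 applications per hour, 80% of them within a 5-minute window
const estimate = estimatePeakLoad({ applicationsPerHour: 1000, peakSharePercent: 80, peakMinutes: 5 });
console.log(estimate.perMinute);            // 160
console.log(Math.ceil(estimate.perSecond)); // 3
```

With `duplicationFactor: 3` the same call gives 480 applications per minute, which is the figure the system would have to be sized for under the pessimistic assumption.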

In fact, the first application was submitted one minute and 45 seconds after the start, and the peak began at minute 4.

To reduce the load on ESIA (the federal identification and authentication system) and on the portal from creating authorization sessions, the instructions asked users to log in in advance, and the session lifetime was extended. In fact, 50% of users had logged in an hour before the start, and ~90% half an hour before. We had previously seen users start logging into the portal 10 minutes before the service opened, and authorization became unstable. It is hard to say why. Perhaps the reason is that maintenance is carried out at night Moscow time, when the working day in Khabarovsk is just beginning.

A digression on the instructions and organizational measures
.

, « » . .. . , - ..

, , , . - . , , .

It is impossible to remove the "delta function" of form reloads at 00:00. The whole point of the procedure is that the service appears at a given time. But you can try to reduce the number of browser requests on all expected user routes and thereby leave only the necessary load on the system: the form, the dynamic directories, and the submission of applications.

The nginx settings themselves are fairly standard. What matters more is choosing limits that the system can withstand, that is, starting to queue requests when the server is about to reach the limit of its capacity.

Most importantly, we forced caching (proxy_cache) and increased the "expires" lifetime in nginx for all static paths and, where possible, for dynamic pages that carry no sessions. A common caching mistake, by the way, is writing responses (sometimes even static files) that carry someone else's session into the cache; the usual way out is to strip those cookies from the headers if the server cannot separate the data types.

For the user, the browser refreshes pages from files loaded from disk or memory cache. And even when the browser does fetch them from the server, they are served from the nginx cache. The directories themselves are, of course, cached inside the system.

image

This reduced the number of potential requests from 89 to 14 and the volume from 2.1 MB (for 1000 users refreshing the page, that is a potential peak of 4-8 Gbit/s) to 38 KB (we all remember webpack, but for enterprise platforms that is not always easy to do). After the run it turned out that some of the form's directories and the dynamic classifiers not used at the peak moment should also have been cached in nginx, not only in the system, with a forced lifetime. And as the load grows, it generally makes sense to serve a fully static main page that routes users to the desired service, or to create a separate resource for the service.
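The bandwidth figures can be checked with a back-of-the-envelope sketch. The article does not say over what time window the 1000 refreshes are spread; a window of 2 to 4 seconds is an assumption made here because it roughly reproduces the 4-8 Gbit/s estimate. Sizes are taken as decimal megabytes and kilobytes, and the function name is illustrative.

```javascript
// Peak bandwidth if `users` clients each download `pageBytes` within `windowSeconds`.
// The window length is an assumption, not a figure from the article.
function peakGbitPerSecond({ pageBytes, users, windowSeconds }) {
  const totalBits = pageBytes * 8 * users;
  return totalBits / windowSeconds / 1e9;
}

const scenarios = [
  { label: 'before caching (2.1 MB)', pageBytes: 2.1e6 },
  { label: 'after caching (38 KB)', pageBytes: 38e3 },
];

for (const { label, pageBytes } of scenarios) {
  for (const windowSeconds of [2, 4]) {
    const gbit = peakGbitPerSecond({ pageBytes, users: 1000, windowSeconds });
    console.log(`${label}, ${windowSeconds} s window: ${gbit.toFixed(3)} Gbit/s`);
  }
}
```

Under these assumptions the uncached page needs roughly 4.2 to 8.4 Gbit/s, while the cached variant stays below 0.2 Gbit/s, which is why trimming the page mattered more than adding servers.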

To reduce the load from submissions, drafts and automatic filling of the child's data were disabled. Users enter data at different speeds, so nobody starts with a form that is completely ready for submission, and this avoids a delta function of submissions with all 1000 arriving in one minute. Social justice is maintained as well, although, of course, complaints do appear.

I will not describe the optimization of the system itself: load testing identified bottlenecks, mainly in DBMS queries, and the indexes and the queries themselves were optimized.

Probably the most important optimization is simplifying the form. What affects speed the most in a form?

  • — , , . — 5-10 ( iPhone ) 5- 375 / (1 10 , application/x-www-form-urlencoded – 20 ), 100 625 /. 100/ — . , « ». — ? , . , ?
  • Complex directories. The load is usually driven up by the FIAS or KLADR address directory. The problem here is size: FIAS takes up to 40 GB in the database, and searching it takes time. It is only tenths of a second, but multiplied by 1000 simultaneous requests it will load any system. Without special preparation, possibly as a separate web service on a separate resource, it is hard to withstand the load, which is why a plain text field is often used for the address.

Well, let's move on to the tests.

Load testing in preparation


Testing was done with puppeteer, by emulating user actions in the Chromium browser. Yandex.Tank and JMeter are repelled by the attack protection, because they generate many requests of the same type. In addition, such tests poorly match the profile of real requests when the system's behavior changes under load. The servers also cache requests, and some processes (for example, authorization) are hard to reproduce in these tools. By the way, we have a talk from one of the devDV seminars on using puppeteer for testing, including load testing: link to the video.

To begin with, we compiled a user behavior profile and divided the procedure into key stages:

  1. mass authorization in ESIA,
  2. simultaneous refresh of the service form,
  3. mass submission of applications.

For each of the stages we did a separate test.

Last year there were difficulties at the ESIA authorization stage, but testing it at full scale is hard: the system is external, and its attack protection and authorization bans kick in. Nevertheless, a test profile can be built to check precisely the bottlenecks of the system under test; usually these are the number of simultaneously authorized sessions and the planned number of authorizations per minute, which can be regulated through recommendations to users.
In the test, a wrapper for running several threads is important; we use 'puppeteer-cluster'. The harder part is usually handling exceptions and changes in the portal's behavior under load: layout elements often turn up that pop up twice, or elements fail to appear because some data did not load as expected. Users will see all of these errors and reload the page, which means they will create additional load. There are two ways out: implement exception handling in the test, or fix the portal.
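As an illustration of the first option, exception handling can be pulled into a small retry wrapper around each test step. This is a minimal sketch, not the project's code: the name `runStep` and its options are hypothetical, and the retry count is an assumption.

```javascript
// Runs one test step, retrying when the portal misbehaves under load.
// `action` receives the attempt number; `onError` can be used to log,
// take a screenshot or reload the page before the next attempt.
async function runStep(name, action, { retries = 2, onError = () => {} } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      await onError(err, attempt);
    }
  }
  throw new Error(`${name} failed after ${retries + 1} attempts: ${lastError.message}`);
}

// Demo with a stand-in action that fails once and then succeeds.
let calls = 0;
runStep('flaky step', async () => {
  calls += 1;
  if (calls === 1) throw new Error('element not found');
  return 'ok';
}).then((result) => console.log(result, 'after', calls, 'calls')); // prints "ok after 2 calls"
```

In a real puppeteer test the action would be something like `() => page.waitForSelector(selector)` and `onError` could call `page.reload()`, which also mimics what an impatient user does.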

The test itself is simple. Below is a fragment that goes from clicking the "Login" button on the services portal to entering credentials in ESIA.

// Wait until the portal shows that authorization is available
await page.waitForSelector(AUTH_AVAIL, {timeout: OPT_ELEM_WAIT_TIME});
const needAuth = await page.$(ELEM_AUTH_IN);
if (!needAuth) throw new Error('Authorization element not found');

// Click "Login" and wait for the redirect to the ESIA login page
await page.waitForSelector(AUTH_BUT, OPT_ELEMENT_VISIBLE);
await page.click(AUTH_BUT);
await waitNewUrl(page, 'https://esia.gosuslugi.ru/idp/rlogin?cc=bp', OPT_PAGE_WAIT_TIME);
await page.waitForSelector('#mobileOrEmail', OPT_ELEMENT_VISIBLE);

// ESIA may remember the previous user; strip spaces, dashes and parentheses before comparing
const REMEMBERED_USER = '#authnFrm > div.login-slils-box > div > div.detected > div.left > div.this-user';
let text = await elemGetText(page, REMEMBERED_USER);
if (text) text = text.replace(/[ \-()]/g, '');
if (text && text.indexOf(user) === -1) {
  // A different user is remembered: switch to the "another user" form
  await page.click('div.click-to-another > a');
  await page.waitForSelector(REMEMBERED_USER, OPT_ELEMENT_INVISIBLE);
}

// Enter credentials and submit
await page.waitForSelector('#password', OPT_ELEMENT_VISIBLE);
await page.type('#mobileOrEmail', user);
await page.type('#password', pwd);
await page.click('#loginByPwdButton');
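
The fragment relies on the helpers `waitNewUrl` and `elemGetText`, which the article does not show. Below is one possible way to write them, purely as a hypothetical sketch: the real signatures may differ, and it is assumed here that the third argument of `waitNewUrl` is an options object with a `timeout` in milliseconds. Only `page.url()`, `page.$()` and `page.evaluate()` from the puppeteer API are used.

```javascript
// Hypothetical helper: polls the page URL until it starts with the expected address.
async function waitNewUrl(page, expectedUrl, { timeout = 30000, pollMs = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (!page.url().startsWith(expectedUrl)) {
    if (Date.now() > deadline) {
      throw new Error(`Timed out waiting for ${expectedUrl}, still at ${page.url()}`);
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}

// Hypothetical helper: returns the trimmed text of an element, or null if it is absent.
async function elemGetText(page, selector) {
  const handle = await page.$(selector);
  if (!handle) return null;
  return page.evaluate((el) => el.textContent.trim(), handle);
}
```

Returning `null` for a missing element lets the caller treat "no remembered user" as a normal case rather than an exception, which matches the `if (text)` checks in the fragment.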

Checking the refresh of the application form by users waiting for "enrollment to open". The reload test is essentially one step, but it is important to check what kind of error comes back: a network problem, an nginx error, a server error, or a form that does not meet the criteria. The difficulty is to generate the maximum number of requests in the shortest time without hitting the protection limits (they can be relaxed for the duration of the tests; on the other hand, this is also a check of the network and server infrastructure settings and of the WAF).

Such puppeteer tests need a lot of resources. In practice it turned out that you need at least 2 cores on the load generator for every core of the front-end subsystem, plus a very wide channel. But when rented in the cloud this is quite affordable. We used Yandex.Cloud.

In the test, ESIA authorization is performed first, separately for each thread. Then a separate browser is launched for each thread, and a given number of refreshes is performed within one instance. After that, the instance restarts. The check may include a typical path, for example the main page and then the service form. More often, though, it is enough to fully refresh the service form and check that the required directory is available and the application can be submitted, exactly as in the instructions for users.
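
The error types listed above can be separated with a small classifier. This is a heuristic sketch with hypothetical names: which status codes come from the proxy and which from the application depends on the actual nginx and backend configuration, so the mapping below is an assumption.

```javascript
// Classifies the outcome of one page reload.
// error:     exception thrown before any HTTP response arrived (timeout, reset, DNS)
// status:    HTTP status code of the main document
// formFound: whether the expected form markup was present in the page
function classifyResult({ error = null, status = 0, formFound = false }) {
  if (error) return 'network';
  // Assumption: rate limiting and upstream failures are reported by the proxy layer
  if ([429, 502, 503, 504].includes(status)) return 'proxy';
  if (status >= 500) return 'server';
  if (status >= 400) return 'client';
  return formFound ? 'ok' : 'form-invalid';
}

// Per-category counters for the test report
const counters = {};
function record(result) {
  const kind = classifyResult(result);
  counters[kind] = (counters[kind] || 0) + 1;
  return kind;
}

record({ status: 200, formFound: true });
record({ status: 503 });
record({ error: new Error('net::ERR_CONNECTION_RESET') });
console.log(counters); // { ok: 1, proxy: 1, network: 1 }
```

In puppeteer the status could be taken from the response returned by `page.reload()`, and `formFound` from a `page.$()` lookup of the form selector.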

image

A fragment of the test that opens the main page and then refreshes the form.

try {
  await page.setViewport(PUP_OPT);
  // Open the portal, then restore this worker's authorized session from saved cookies
  await page.goto(BASE_URL);
  await page.setCookie(...cookies[worker.id]);
  await page.goto(`${BASE_URL}/nd/lk/form/dnv.htm`);
  rdyRefresh++;
} catch (err) {
  console.error(`# Error opening the form ${data}: ${err.message}`);
  getErr++;
  await page.screenshot({path: filename});
}
// The first load is counted above, hence AMOUNT_REFRESH - 1 reloads
for (let i = 0; i < AMOUNT_REFRESH - 1; i++) {
  const filenameIter = path.join(BASE_DIR, PIC_DIR, `${data}-${i}.png`);
  try {
    await page.reload({waitUntil: ["networkidle0", "domcontentloaded"]});
    rdyRefresh++;
  } catch (err) {
    // Ignore errors caused by the browser instance itself going away
    if (!err.message.includes('Navigation failed because browser')) {
      console.error(`# Error refreshing the form ${data}-${i}: ${err.message}`);
      getErr++;
      await page.screenshot({path: filenameIter});
    }
  }
}

For the load from submitting applications, the full cycle was implemented: reloading the form and checking the entry of all data.

A fragment:

for (let i = 0; i < AMOUNT_RESEND; i++) {
  const filename = path.join(BASE_DIR, PIC_DIR, `${data}-${i}.png`);
  // Step 1: reload the form
  try {
    await page.goto('https://uslugi27.ru/nd/lk/form/dnv.htm');
  } catch (err) {
    console.error(`# Error at step 1, opening the form ${data}-${i}: ${err.message}`);
    await page.screenshot({path: filename});
    getErr++;
    continue;
  }
  // Step 2: fill in the form fields
  try {
    const FORM_PREF = '#createForm > div:nth-child(4) > ';
    await clickDelayed(page, `${FORM_PREF}fieldset.petgroup.ungroupped-attrs > div > div:nth-child(4) > div.col-md-9.attr-data`);
    // <…>
    await page.type(`${FORM_PREF}fieldset:nth-child(2) > div > div:nth-child(1) > div.col-md-9.attr-data > input`, '');
    // <…>
  } catch (err) {
    console.error(`# Error filling in the form ${data}-${i}: ${err.message}`);
    await page.screenshot({path: filename});
    continue;
  }
  // Step 3: go to the next page, tick the confirmations and submit
  try {
    await page.click('#createForm > div.col_100.controls > button.btn.btn-primary.pull-right.next');
    await clickDelayed(page, `#createForm > div:nth-child(5) > fieldset > div > div:nth-child(1) > div > div`);
    await page.click('#createForm > div:nth-child(5) > fieldset > div > div:nth-child(2) > div > div');
    await page.click('#createForm > div.col_100.controls > button.btn.btn-success.pull-right.submit');
  } catch (err) {
    console.error(`# Error submitting the application ${data}-${i}: ${err.message}`);
    await page.screenshot({path: filename});
    sendErr++;
    continue;
  }
}
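
The `clickDelayed` helper used in this fragment is not shown in the article. One plausible reading of its name is "wait for the element, pause briefly, then click"; the sketch below implements exactly that and should be treated as a guess. The option names and default values are assumptions.

```javascript
// Hypothetical helper: waits until the element is visible, pauses, then clicks it.
// The pause gives client-side scripts time to attach handlers under load.
async function clickDelayed(page, selector, { delayMs = 200, timeout = 10000 } = {}) {
  await page.waitForSelector(selector, { visible: true, timeout });
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  await page.click(selector);
}
```

A fixed pause is the simplest option; waiting for a specific DOM condition would be more precise, but it requires knowing how the portal's form signals that it is ready.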

By the way, the test can be sped up if the data is entered not from puppeteer via await page.type but by moving this logic into the browser itself. But then catching errors becomes harder. Like so:

document.querySelector('#createForm > div:nth-child(4) > fieldset.petgroup.ungroupped-attrs > div > div:nth-child(4) > div.col-md-9.attr-data').click();
 document.querySelector('#createForm > div:nth-child(4) > fieldset:nth-child(2) > div > div:nth-child(1) > div.col-md-9.attr-data > input').value = '';
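
One way to keep some error visibility when filling the form inside the browser is to return the list of selectors that were not found. The sketch below is an illustration under assumptions: `fillInBrowser` is a hypothetical name, and dispatching an `input` event is only needed if the form listens for it, which the article does not say.

```javascript
// Fills several fields in one round trip to the browser and reports missing ones.
// `fields` is an array of { selector, value } pairs.
async function fillInBrowser(page, fields) {
  return page.evaluate((items) => {
    const missing = [];
    for (const { selector, value } of items) {
      const el = document.querySelector(selector);
      if (!el) {
        missing.push(selector);
        continue;
      }
      el.value = value;
      // Notify any client-side handlers that the value has changed
      el.dispatchEvent(new Event('input', { bubbles: true }));
    }
    return missing;
  }, fields);
}
```

The caller can then decide whether a non-empty result means a broken page (count it as an error and take a screenshot) or an optional field that can be skipped.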

During the tests we performed several thousand ESIA authorizations and sent about 16 thousand applications. How the production education information system was restored after that many applications, do not even ask. That is a completely different story.

The main visible result of all this was that local media now have nothing to report on first-grade enrollment days. The service has dropped out of the news.

In parallel, we built a Grafana dashboard for monitoring the form: the number of applications, the number of requests, Yandex.Metrica data, and so on. But we will leave that topic for next time.

Finally, I would like to congratulate everyone involved in improving the quality of state and municipal services in electronic form. This endless preparatory work was not in vain: in April and May the number of applications submitted increased significantly.
