Load optimization on a Highload project with ElasticSearch

Hello, Habr! My name is Maxim Vasiliev, and I work as an analyst and project manager at FINCH. Today I would like to tell you how, with the help of ElasticSearch, we were able to process 15 million queries in 6 minutes and optimize the daily load on the site of one of our clients. Unfortunately, we will have to do without names, since we are under an NDA; we hope this will not hurt the content of the article. Let's go.

How the project is arranged


On the backend, we build the services that power our client's websites and mobile applications. The general structure can be seen in the diagram:

[Diagram: general structure of the backend services]

Along the way, we process a large number of transactions: purchases, payments, operations with user balances. We keep extensive logs for all of them, and we also import and export this data to and from external systems.

There are also reverse flows, where we receive data from the client and pass it on to the user. In addition, there are processes for handling payments and bonus programs.

Short background


Initially, we used PostgreSQL as our only data store. Its standard DBMS advantages: transactions, a rich query language, and a wide range of integration tools, combined with good performance, satisfied our needs for a long time.

We stored absolutely all data in Postgres: from transactions to news. But the number of users grew, and with it the number of requests.

To give a sense of scale: in 2017 the desktop site alone had 131 million sessions per year, in 2018 it was 125 million, and in 2019 about 130 million again. Add another 100-200 million from the mobile version of the site and the mobile application, and you get a huge number of requests.

As the project grew, Postgres stopped coping with the load and we could no longer keep up: a large number of different query patterns appeared, and we could not create enough indexes to cover them all.

We realized we needed another data store that would cover these needs and take the load off PostgreSQL. We considered Elasticsearch and MongoDB as options. The latter lost on the following points:

  1. Indexing slows down as the volume of data in the indexes grows; in Elastic, indexing speed does not depend on the amount of data.
  2. No full-text search.

So we chose Elastic and started preparing for the transition.

Switching to Elastic


1. We started the transition with the point-of-sale search service. Our client has about 70,000 points of sale in total, and several types of search are needed on the site and in the application:

  • Text search by city name.
  • Geo-search within a given radius of some point, for example when a user wants to see which points of sale are closest to their home.
  • Search within a given square: the user draws a square on the map and is shown all the points inside that area (a sketch of both geo queries follows this list).
  • Search with additional filters: points of sale differ from each other in assortment.
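
To make the geo searches more concrete, here is a minimal sketch of what the radius and square searches can look like in the Elasticsearch query DSL, sent as JSON over HTTP from Java. The points_of_sale index name, the location field (assumed to be mapped as geo_point) and the coordinates are invented for the example and are not the project's real schema.

```java
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

public class PointOfSaleSearchExample {

    public static void main(String[] args) {
        // Radius search: everything within 5 km of the user's position
        // (the "location" field is assumed to be mapped as geo_point).
        String radiusQuery = """
                {
                  "query": {
                    "bool": {
                      "filter": {
                        "geo_distance": {
                          "distance": "5km",
                          "location": { "lat": 55.75, "lon": 37.61 }
                        }
                      }
                    }
                  }
                }
                """;

        // "Square" search: everything inside a bounding box drawn on the map.
        String boxQuery = """
                {
                  "query": {
                    "bool": {
                      "filter": {
                        "geo_bounding_box": {
                          "location": {
                            "top_left":     { "lat": 55.80, "lon": 37.50 },
                            "bottom_right": { "lat": 55.70, "lon": 37.70 }
                          }
                        }
                      }
                    }
                  }
                }
                """;

        WebClient elastic = WebClient.create("http://localhost:9200");
        for (String query : new String[]{radiusQuery, boxQuery}) {
            String response = elastic.post()
                    .uri("/points_of_sale/_search")
                    .contentType(MediaType.APPLICATION_JSON)
                    .bodyValue(query)
                    .retrieve()
                    .bodyToMono(String.class)
                    .block();
            System.out.println(response);
        }
    }
}
```

The same request shape works for the text and filter searches; only the query body changes.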

As for how this is organized: the source data for both the map and the news lives in Postgres, and snapshots of that original data are built in Elastic. The problem was that Postgres could not cope with searching by all of these criteria at once. Not only were there many indexes, they also overlapped, so the Postgres query planner got confused and could not decide which index to use.

2. Next in line was the news section. Publications appear on the site every day, and so that the user does not get lost in the flow of information, the data must be sorted before it is delivered. This is where search comes in: on the site you can search by text match and apply additional filters at the same time, since those are also implemented through Elastic.
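
As an illustration, a news query that combines a full-text match with an extra filter might look like the following sketch. The news index and the title, body, category and published_at fields are assumptions made for the example, not the project's real mapping.

```java
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

public class NewsSearchExample {

    public static void main(String[] args) {
        // Full-text match on two fields plus a term filter, newest first.
        String query = """
                {
                  "query": {
                    "bool": {
                      "must": {
                        "multi_match": { "query": "bonus program", "fields": ["title", "body"] }
                      },
                      "filter": {
                        "term": { "category": "promotions" }
                      }
                    }
                  },
                  "sort": [ { "published_at": "desc" } ]
                }
                """;

        String response = WebClient.create("http://localhost:9200")
                .post()
                .uri("/news/_search")
                .contentType(MediaType.APPLICATION_JSON)
                .bodyValue(query)
                .retrieve()
                .bodyToMono(String.class)
                .block();

        System.out.println(response);
    }
}
```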

3. Then we moved transaction processing. Users can buy a specific product on the site and take part in a prize draw. After such purchases, we process a large amount of data, especially on weekends and holidays. For comparison, if on an ordinary day the number of purchases is somewhere between 1.5 and 2 million, on holidays the figure can reach 53 million.

At the same time, the data needs to be processed in as little time as possible: users do not like waiting several days for a result. Such deadlines were unreachable with Postgres: we often ran into locks, and while we were processing all the requests, users could not check whether they had received prizes. That is not great for the business, so we moved the processing to Elasticsearch.

Periodicity


Updates are now event-driven and happen under the following conditions:

  1. Points of sale. As soon as data arrives from an external source, we immediately start the update.
  2. News. As soon as a news item is edited on the site, it is automatically sent to Elastic (a minimal sketch of this follows the list).
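
A minimal sketch of the news case: as soon as an item is saved, the application pushes it to Elastic over HTTP. The news index, the document shape and the method name are hypothetical, and the naive string formatting is only for brevity; a real client would build the JSON with a proper library.

```java
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

public class NewsIndexUpdater {

    private final WebClient elastic = WebClient.create("http://localhost:9200");

    // Called from the application code right after a news item is saved or edited.
    public void onNewsSaved(long id, String title, String body) {
        String document = """
                { "title": "%s", "body": "%s" }
                """.formatted(title, body);

        elastic.put()
                .uri("/news/_doc/{id}", id)         // index (or overwrite) the document by its id
                .contentType(MediaType.APPLICATION_JSON)
                .bodyValue(document)
                .retrieve()
                .bodyToMono(String.class)
                .subscribe();                        // fire-and-forget, asynchronous
    }
}
```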

Here it is worth mentioning another advantage of Elastic. In Postgres, when you send a request, you have to wait until it has honestly processed every record. With Elastic, you can send 10 thousand records and start working immediately, without waiting for the records to be distributed across all shards. Of course, a particular shard or replica may not see the data right away, but it becomes available very soon.
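
For reference, this is roughly what such a batched write looks like at the protocol level: the _bulk endpoint accepts newline-delimited JSON, with one action line per document. The prizes index and the documents are invented for the example.

```java
public class BulkBodyExample {

    public static void main(String[] args) {
        // The _bulk endpoint takes newline-delimited JSON: an action line per document,
        // followed by the document itself.
        String bulkBody = """
                {"index":{"_index":"prizes","_id":"1"}}
                {"user":1,"prize":"mug"}
                {"index":{"_index":"prizes","_id":"2"}}
                {"user":2,"prize":"t-shirt"}
                """;

        // POST this body to http://<es-host>:9200/_bulk with
        // Content-Type: application/x-ndjson. The call returns as soon as the documents
        // are accepted; they become visible to search after the next index refresh
        // (1 second by default), which is the near-real-time behaviour described above.
        System.out.println(bulkBody);
    }
}
```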

Integration Methods


There are 2 ways to integrate with Elastic:

  1. Through the native transport client over TCP. This driver is gradually dying out: it is no longer actively supported and has a very inconvenient syntax, so we barely use it and are trying to abandon it completely.
  2. Via the HTTP interface, where you can use both JSON requests and Lucene syntax (Lucene being the text search engine that Elastic is built on). With this option we can batch JSON requests over HTTP, and this is the option we try to use.

Thanks to the HTTP interface, we can use libraries that provide an asynchronous HTTP client. We can combine batching with the asynchronous API, which ultimately gives high performance and helped us a lot during the days of the big promotion (more on that below).
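
Below is a sketch of how batching and the asynchronous API can be combined with Spring WebClient and Reactor. The batch size of 1,000 and the concurrency of 10 mirror the figures in the comparison that follows; the index name and documents are invented, and this is not the project's actual client.

```java
import java.util.List;
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class ReactiveBulkLoader {

    private static final WebClient ELASTIC = WebClient.create("http://localhost:9200");

    public static void main(String[] args) {
        // Pretend these are the JSON documents to index (prize winners, transactions, ...).
        List<String> documents = List.of(
                "{\"user\":1,\"prize\":\"mug\"}",
                "{\"user\":2,\"prize\":\"t-shirt\"}");

        Flux.fromIterable(documents)
                .buffer(1000)                                // batches of 1,000 elements
                .flatMap(ReactiveBulkLoader::sendBulk, 10)   // at most 10 requests in flight
                .blockLast();                                // wait for completion in this demo
    }

    private static Mono<String> sendBulk(List<String> batch) {
        // Build the same newline-delimited _bulk body as in the previous sketch.
        StringBuilder body = new StringBuilder();
        for (String doc : batch) {
            body.append("{\"index\":{\"_index\":\"prizes\"}}\n").append(doc).append('\n');
        }
        return ELASTIC.post()
                .uri("/_bulk")
                .contentType(MediaType.parseMediaType("application/x-ndjson"))
                .bodyValue(body.toString())
                .retrieve()
                .bodyToMono(String.class);
    }
}
```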

A few numbers to compare:

  • Saving prize winners to Postgres in 20 threads, without grouping into batches: 460,713 records in 42 seconds
  • Elastic + reactive client, 10 threads + batches of 1,000 elements: 596,749 records in 11 seconds
  • Elastic + reactive client, 10 threads + batches of 1,000 elements: 23,801,684 records in 4 minutes

We have now written an HTTP request manager that builds the JSON body either as a batch or as a single request and sends it through any HTTP client, regardless of the library. You can also choose whether to send requests synchronously or asynchronously.
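
The article does not show the manager itself, but one possible shape for it, matching the description above, could be an interface along these lines (all names hypothetical):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// A hypothetical outline of the request manager described above: it only builds the
// JSON body (bulk or single) and leaves the actual HTTP call to whatever client is plugged in.
public interface ElasticRequestManager {

    /** Build the body for a single-document index request. */
    String buildSingle(String index, String documentJson);

    /** Build a newline-delimited _bulk body for a batch of documents. */
    String buildBulk(String index, List<String> documentsJson);

    /** Send synchronously and wait for the response. */
    String sendBlocking(String path, String body);

    /** Send asynchronously; the caller decides what to do with the result. */
    CompletableFuture<String> sendAsync(String path, String body);
}
```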

In some integrations we still use the official transport client, but that is just a matter of upcoming refactoring. For the processing itself we use a custom client built on top of Spring WebClient.


Big promotion


Once a year, the project runs a large campaign for its users. This is true highload, since at that time we work with tens of millions of users simultaneously.

Peak loads usually occur during the holidays, but this promotion is on a completely different level. The year before last, on the day of the promotion, we sold 27,580,890 units of goods. It took more than half an hour to process the data, which was inconvenient for users. They did receive their prizes for participating, but it became clear that the process had to be sped up.

At the beginning of 2019 we decided that we needed ElasticSearch. Over the course of the year we set up the processing of incoming data in Elastic and its delivery through the API of the mobile application and the site. As a result, during the next year's campaign we processed 15,131,783 records in 6 minutes.

Since a lot of people want to buy goods and take part in the promotional prize draws, this is a temporary measure. Right now we send the current information to Elastic, but in the future we plan to move archived data from past months into Postgres as permanent storage, so as not to clog the Elastic index, which has its own limitations.

Conclusions


At the moment we have moved to Elastic all the services we wanted to move, and have paused there for now. We now build an index in Elastic on top of the main persistent storage in Postgres, and it is the index that takes on the user load.

In the future we plan to move more services whenever we see that queries against the data become too diverse and involve searching over an unbounded set of columns. That is no longer a job for Postgres.

If a feature needs full-text search, or if there are many different search criteria, we already know that it should go into Elastic.

⌘⌘⌘


Thanks for reading. If your company also uses ElasticSearch and has its own implementation stories, do tell us about them. It will be interesting to hear how others do it :-)
