Scaling search: Elasticsearch, its advantages, and basic installation requirements

Good afternoon. My name is Roman Larchikov, and I am a technical support engineer at Docsvision. This article is for readers interested in the technical details of how search scaling is implemented and in getting acquainted with Elasticsearch. It covers the reasons for using ES, its system requirements, and its advantages over the search built into MS SQL Server.

If you would like a more general overview of how we scaled search in the latest version of our platform, my colleagues covered it in the webinar "Docsvision ECM. Search Scaling ElasticSearch".

Why Elasticsearch?


To begin with, Docsvision 5.5 differs fundamentally from previous versions of the platform in its modular architecture. We therefore needed to make almost unlimited scaling of the system possible without losing speed. In particular, the indexing service had to scale while keeping its high throughput.

For this reason, Docsvision 5.5 introduced support for external (satellite) databases. There is no longer any need to keep all data in a single database that grows daily under intensive use, which complicates maintenance, lengthens recovery after a crash, and slows the database down overall.

Using one database for everything is bad practice. The ability to move data into separate external databases, implemented in Docsvision 5.5, lets you restructure the database properly. For search, this means that index data can be stored outside the main database and no longer affects its size.

Ease of configuration, flexibility, reliability, scalability, and fast online indexing and search: all of this describes Elasticsearch.

Elasticsearch is document-oriented. After indexing, we search, sort, and filter documents rather than rows of columnar data. This is a different approach to data retrieval, and it is what allows Elasticsearch to perform complex full-text searches.

Documents are represented as JSON objects. JSON serialization (translating a data structure into a sequence of bytes) is supported by most programming languages and has become a standard format for NoSQL stores.
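To make the JSON representation concrete, here is a minimal Python sketch that serializes a document and restores it. The card and all of its field names are hypothetical and are not taken from Docsvision; only the standard json module is used.

```python
import json

# Hypothetical document card; every field name here is illustrative only.
card = {
    "id": "doc-001",
    "title": "Supply contract",
    "author": "J. Smith",
    "created": "2019-10-01",
    "tags": ["contract", "supply"],
}

# Serialize the structure to a JSON string, then parse it back.
serialized = json.dumps(card, ensure_ascii=False)
restored = json.loads(serialized)

print(serialized)
print(restored == card)  # True: the round trip preserves the structure
```

A string like the one printed here is the kind of body a client would send to ES when indexing a document.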

1. Introducing Elasticsearch


1.1. What is it?


Elasticsearch is an open-source, scalable full-text search engine built on the Lucene library and written in Java. All the advantages of the engine are described on the official website.

It is intended for complex searches over a collection of documents or files. In Elasticsearch, the counterpart of a table is called an index, and the process of loading documents is called indexing.
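The terms above map onto the ES REST API roughly as follows. This sketch only builds the requests and does not send them. The index name cards, the field names, and both helper functions are assumptions made for illustration, and the _doc path form is the one used by ES 6.x and 7.x, so check it against the version you deploy.

```python
import json


def build_index_request(index, doc_id, document):
    """Return (method, path, body) for indexing one document (hypothetical helper)."""
    return "PUT", f"/{index}/_doc/{doc_id}", json.dumps(document)


def build_search_request(index, text, author):
    """Return (method, path, body) for a full-text search with a filter and a sort."""
    body = {
        "query": {
            "bool": {
                # Full-text match against the analyzed "title" field.
                "must": [{"match": {"title": text}}],
                # Exact-value filter; assumes "author" is mapped as a keyword field.
                "filter": [{"term": {"author": author}}],
            }
        },
        # Newest documents first; assumes "created" is mapped as a date field.
        "sort": [{"created": {"order": "desc"}}],
    }
    return "POST", f"/{index}/_search", json.dumps(body)


method, path, body = build_index_request("cards", "doc-001", {"title": "Supply contract"})
print(method, path, body)
print(*build_search_request("cards", "contract", "J. Smith"))
```

Keeping request construction separate from transport makes the bodies easy to inspect before they are sent with whichever client library you choose.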

It can be regarded both as a non-relational store of JSON documents and as a search engine based on Lucene full-text search. Official clients are available for Java, .NET (C#), Python, Groovy, JavaScript, PHP, Perl, and Ruby.

ES is developed by Elastic together with the related projects that make up the Elastic Stack: Elasticsearch, Logstash, Beats, and Kibana.

Within the stack, Elasticsearch (hereinafter ES, for brevity) is responsible for storing and searching data.

1.2. Advantages of the Elasticsearch Search Engine Compared to the MS SQL Search Engine


Docsvision 5.5 lets you choose which search engine to work with. In this article I focus on Elasticsearch and its advantages over the search built into MS SQL Server.

Main advantages:

  • The indexing service can access external data stores, and the data is indexed and displayed in search results correctly. With the SQL Server search engine, it was possible to search only the data stored in the main database.
  • ES is an open source project, and many global companies use it to search huge data sets.
  • ES , , .
  • (). ES, SQL Server .
  • ES β€” , .
  • ES (, , ), .
  • , , , .
  • ES . .
  • . , , , , , .
  • , , , , .
  • ES ( ). Docsvision, ES , SQL .
  • β€” SQL. , , , . , .. , . , , .. . ES .
  • The database does not grow when languages are added. With full-text search in MS SQL, unlike ES, indexed data increases the size of the main database, especially if indexing is configured for several languages, for example Russian / neutral / English. In that case the indexing tables grow several times larger than they would if only one indexing language, for example neutral, had been chosen.

2. Required software and system requirements for installing Elasticsearch


ES can be deployed not only on high-performance servers but also on a laptop. For a production environment, however, you should use a dedicated server and follow the recommendations below.

2.1. RAM


The most critical resource for ES is RAM. It is the resource most likely to run out first. The minimum acceptable size is 8 GB, and the recommended size is 16 to 64 GB. More is acceptable if there is a real need for it.

Sorting and aggregation can consume a large amount of memory, so it is important to have enough of it in reserve. A machine with 64 GB of RAM is ideal, but machines with 32 GB and 16 GB are also common. A machine with 8 GB or less tends to be counterproductive (you may end up needing many such "small" machines). Using more than 64 GB also has its own caveats.
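As a rough illustration of how RAM translates into ES settings, the sketch below applies a rule of thumb from the Elasticsearch documentation, not from this article: give about half of the machine's RAM to the JVM heap, and keep the heap below roughly 32 GB so that the JVM can still use compressed object pointers. The function name and the 31 GB ceiling are assumptions chosen for the example.

```python
def suggested_heap_gb(total_ram_gb, ceiling_gb=31):
    """Suggest a JVM heap size: half of RAM, capped at an assumed ceiling.

    The rest of the RAM is left to the operating system's file cache,
    which Lucene relies on heavily.
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb // 2, ceiling_gb)


for ram in (8, 16, 32, 64, 128):
    print(f"{ram} GB RAM -> about {suggested_heap_gb(ram)} GB heap")
```

This is one reason machines far larger than 64 GB bring less benefit than their size suggests: beyond the ceiling, extra memory goes to the file cache rather than to the heap.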

2.2. CPU


ES is generally not demanding on the processor, so the choice of CPU matters less than the other resources.

Still, one rule is worth following: choose a modern processor with several cores. Typical cluster nodes are machines with two to eight cores.

If you have to choose between faster processors and processors with more cores, choose more cores. The additional parallelism that the extra cores provide will pay off more than a slightly higher clock frequency.

2.3. Disk


Disks are also an important resource for fast ES operation. They matter for any cluster and doubly so for clusters with large volumes of indexed data.

Disks are the slowest subsystem in a server, so a write-heavy cluster can easily saturate them, and they become the server's bottleneck. If you can use solid-state drives, do so: they are far superior to any rotating media. Nodes backed by SSDs show a noticeable gain in both query and indexing performance.

If you intend to use hard disk drives (HDDs), choose high-performance server disks with a spindle speed of 15,000 rpm.

RAID 0 is an effective way to increase disk speed for both spinning disks and SSDs. There is no need for RAID levels with mirroring or parity, because ES provides high availability itself through replicas.
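Since redundancy comes from replicas rather than from RAID, it helps to see where replicas are configured and what they cost in disk space. The sketch below builds an index settings body using the ES settings number_of_shards and number_of_replicas. The helper functions and the example values (3 primary shards, 1 replica) are assumptions made for illustration.

```python
import json


def index_settings(primary_shards, replicas):
    """Build the settings body for creating an index (hypothetical helper)."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(primary_shards, replicas):
    """Each primary shard is stored once plus once per replica."""
    return primary_shards * (1 + replicas)


settings = index_settings(primary_shards=3, replicas=1)
print(json.dumps(settings, indent=2))
print(total_shard_copies(3, 1))  # 6
```

With one replica, every shard exists on two nodes, so the index takes about twice the disk space. That is the price of surviving the loss of a node without mirroring at the RAID level.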

2.4. I/O Scheduler


If you are using solid-state drives, make sure that the operating system's I/O scheduler is configured correctly. When data is written to disk, the I/O scheduler decides when that data is actually sent to the device. In most cases the default scheduler is cfq (Completely Fair Queuing).

This scheduler allocates time slices to each process and then optimizes the delivery of these queues to disk. It is tuned for HDDs: because of the rotating platters, it is more efficient to write data in an order that depends on its physical location.

However, this is inefficient for solid-state drives, which have no rotating platters. Use the deadline or noop scheduler instead. The deadline scheduler optimizes based on how long writes have been pending, while noop is a simple FIFO queue.

This simple change of scheduler can significantly improve write throughput.
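To check which scheduler is active on Linux, you can read /sys/block/<device>/queue/scheduler, where the kernel marks the active scheduler with square brackets. The following is a minimal sketch. The function names and the default device name sda are assumptions, and the set of available schedulers depends on the kernel version.

```python
from pathlib import Path


def active_scheduler(scheduler_line):
    """Return the scheduler name enclosed in [brackets], or None if absent."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device="sda"):
    """Read the active I/O scheduler for a block device (Linux only)."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())


# Parsing a sample line, as the kernel would report it for a cfq default:
print(active_scheduler("noop deadline [cfq]"))  # cfq
```

Switching the scheduler is normally done by writing the desired name to the same file as root. Because that write changes system behavior, the sketch is deliberately limited to reading.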

2.5. Network


A fast and reliable network is important for performance in a distributed system. Low latency ensures that nodes can easily exchange data, and high throughput helps move and recover data.

Modern data center networks (1 GbE, 10 GbE) are sufficient for the vast majority of clusters.

Clusters that span multiple data centers should be avoided, even if the data centers are located in close proximity. Definitely avoid clusters that span large geographic distances.

ES clusters assume that all nodes are equal, including in how quickly they can reach one another. Long delays between nodes tend to exacerbate problems in distributed systems and make them harder to debug and resolve.

2.6. General recommendations


It is worth preferring "medium" and "large" machines and avoiding low-performance ones, where the overhead of simply running ES takes up a noticeable share of the resources. At the same time, truly huge machines should also be avoided: they often lead to unbalanced resource use (for example, all memory is used while the CPU sits idle) and add logistical complexity if several nodes have to run on one machine.

3. Conclusion


Now that we have covered what Elasticsearch is, its key advantages, and its installation requirements, we can move on to installing ES and configuring it for Docsvision.

For how to install and configure it, and how to check that indexing works in Docsvision, see my colleague's publication here.

