About the 1C server cluster

A cluster is a type of parallel or distributed system that:
1. consists of several interconnected computers;
2. is used as a single, unified computing resource.

Gregory F. Pfister, “In search of clusters”.


Given: there is a business application (for example, an ERP system) that thousands (possibly tens of thousands) of users work with simultaneously.

It is required:
  1. Make the application scalable, so that as the number of users grows, the required application performance can be provided by adding hardware resources.
  2. Make the application resistant to failure of system components (both software and hardware), loss of connection between components and other possible problems.
  3. Use system resources as efficiently as possible and provide the desired application performance.
  4. Make the system easy to deploy and administer.

To solve these problems, we use the cluster architecture in the 1C: Enterprise platform.

We did not immediately reach the desired result.

In this article, we will talk about what kinds of clusters exist, how we chose the type of cluster that suits us, how our cluster evolved from version to version, and which approaches eventually allowed us to build a system that serves tens of thousands of concurrent users.


As Gregory Pfister, the author of the epigraph to this article, wrote in his book “In Search of Clusters”, clusters were not invented by any particular hardware or software vendor, but by customers who lacked the power of a single computer or needed redundancy. According to Pfister, this happened back in the 1960s.
The following main types of clusters are traditionally distinguished:

  1. Failover Clusters (HA, High Availability Clusters)
  2. Load balancing clusters (LBC)
  3. Computing clusters (High performance computing clusters, HPC)
  4. Distributed computing (grid) systems are sometimes classed as clusters, but this is not quite right: a grid system is not a cluster in the full sense, since it is not used as a single, unified computing resource. Grid systems are often grouped together with HPC clusters, but this is not entirely correct.

To solve our problems (resistance to failure of system components and efficient use of resources), we needed to combine the functionality of a high-availability cluster and a load-balancing cluster. We did not arrive at this design right away, but approached it evolutionarily, step by step.

For those who are not familiar with it, I will briefly describe how 1C business applications are organized. They are written in a domain-specific language tailored to the automation of business accounting tasks. To run applications written in this language, the 1C: Enterprise platform runtime must be installed on the computer.

1C: Enterprise 8.0


The first version of the 1C application server (not yet a cluster) appeared in platform version 8.0. Before that, 1C worked in a client-server model: the data was stored in a file DBMS or in MS SQL, and the business logic ran exclusively on the client. In version 8.0, a transition was made to the three-tier architecture "client - application server - DBMS".

The 1C server in platform 8.0 was a COM+ server capable of executing 1C application code. Using COM+ gave us a ready-made transport that allowed client applications to communicate with the server over the network. Much of the architecture of both the client-server interaction and the application objects available to the 1C developer was designed with COM+ in mind. At that time, fault tolerance was not built into the architecture, and a server crash caused all clients to shut down. When the server application crashed, COM+ restarted it on the first client call, and clients started their work over from the beginning - from connecting to the server. At that time, all clients were served by a single process.

1C: Enterprise 8.1


In the next version, we wanted:

  • Provide fault tolerance, so that crashes and errors affecting some users do not lead to crashes and errors for other users.
  • Get rid of the COM+ technology. COM+ worked only on Windows, and at that time the possibility of running under Linux was already becoming relevant.

At the same time, we did not want to develop a new version of the platform from scratch - that would have been too resource-intensive. We wanted to make the most of what we had already built and to maintain compatibility with applications developed for version 8.0.

So the first cluster appeared in version 8.1. We implemented our own remote procedure call protocol (on top of TCP), which to the end client looked almost like COM+ (meaning we hardly had to rewrite the code responsible for client-server calls). At the same time, we made the server, implemented in C++, platform-independent, able to run on both Windows and Linux.
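To give a feel for the general idea (this is not the actual 1C protocol, and every name, host, and port below is made up), a client-side proxy that turns local-looking calls into requests over TCP can be sketched like this:

    import json
    import socket

    class RemoteServerProxy:
        """Client-side stub: turns attribute calls into length-prefixed
        JSON messages sent to the application server over TCP."""

        def __init__(self, host: str, port: int):
            self._sock = socket.create_connection((host, port))

        def __getattr__(self, method_name):
            def remote_call(*args):
                payload = json.dumps({"method": method_name, "args": args}).encode()
                self._sock.sendall(len(payload).to_bytes(4, "big") + payload)
                size = int.from_bytes(self._recv_exact(4), "big")
                return json.loads(self._recv_exact(size))
            return remote_call

        def _recv_exact(self, n: int) -> bytes:
            buf = b""
            while len(buf) < n:
                chunk = self._sock.recv(n - len(buf))
                if not chunk:
                    raise ConnectionError("server closed the connection")
                buf += chunk
            return buf

    # Usage: looks like a local object call, but executes on the server.
    # server = RemoteServerProxy("app-server", 9000)   # hypothetical address
    # result = server.CalculateReport("SalesByRegion")

The point of the design is exactly this transparency: the calling code stays almost unchanged while the transport underneath is replaced.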

The monolithic server of version 8.0 was replaced by three types of processes: a worker process that serves clients, and two service processes that support the operation of the cluster:

  • rphost is a worker process that serves clients and executes application code. A cluster can have more than one worker process, and different worker processes can run on different physical servers; this is how scalability is achieved.
  • ragent is the server agent process, which starts all the other types of processes and maintains the list of clusters located on the given server.
  • rmngr is the cluster manager, which controls the operation of the entire cluster (no application code runs in it).

[Diagram: how these three processes operate in the cluster]
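As a purely conceptual sketch of this division of responsibilities (hypothetical command names, flags, and ports - not the real ragent implementation), an agent process that launches and watches over the other processes might look like this:

    import subprocess
    import time

    # Conceptual sketch only: hypothetical commands and arguments,
    # not the actual platform command lines.
    PROCESS_CMDS = {
        "rmngr":    ["rmngr", "--port", "1541"],
        "rphost-0": ["rphost", "--port", "1560"],
        "rphost-1": ["rphost", "--port", "1561"],
    }

    def supervise():
        """Start the cluster manager and worker processes, restart any that exit."""
        procs = {name: subprocess.Popen(cmd) for name, cmd in PROCESS_CMDS.items()}
        while True:
            for name, proc in procs.items():
                if proc.poll() is not None:          # the process has exited
                    print(f"{name} exited, restarting")
                    procs[name] = subprocess.Popen(PROCESS_CMDS[name])
            time.sleep(1)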

During a session, a client worked with one worker process; if that worker process crashed, the sessions of all the clients it was serving were terminated abnormally. The remaining clients continued to work.

1C: Enterprise 8.2


In version 8.2, we wanted 1C applications to be able to run not only in the native (executable) client but also in a browser (without modifying the application code). This raised, among other things, the task of decoupling the current state of the application from the current connection to the rphost worker process, making the worker process stateless. As a result, the concepts of a session and of session data appeared, and session data had to be stored outside the worker process (precisely because the worker process was stateless). A session data service was developed to store and cache session information. Other services also appeared: a managed transactional lock service, a full-text search service, and so on.
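The idea of keeping session state outside the worker process can be sketched with a minimal in-memory service (a deliberately simplified illustration, not the actual platform implementation; all names are hypothetical):

    import threading

    class SessionDataService:
        """Stores session state outside the worker processes, so any worker
        can pick up a client's session on the next call."""

        def __init__(self):
            self._lock = threading.Lock()
            self._sessions = {}          # session_id -> dict of session data

        def put(self, session_id: str, key: str, value) -> None:
            with self._lock:
                self._sessions.setdefault(session_id, {})[key] = value

        def get(self, session_id: str, key: str, default=None):
            with self._lock:
                return self._sessions.get(session_id, {}).get(key, default)

    # A stateless worker fetches what it needs at the start of each call:
    # def handle_call(service, session_id, request):
    #     user = service.get(session_id, "current_user")
    #     ...  # execute application code, then write any changes back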

This version also introduced several important innovations - improved fault tolerance, load balancing, and a cluster redundancy mechanism.

Fault tolerance


Since the worker process became stateless and all the data necessary for work was stored outside the current client-worker connection, if a worker process crashed, the client switched to another, "live" worker process the next time it accessed the server. In most cases, such a switch was invisible to the client.

The mechanism works like this. If a client's call to a worker process could not be completed for some reason, the client side, having received a call error, is able to repeat the call, re-establishing a connection either to the same worker process or to another one. But a call cannot always be repeated. A failed call means that we sent the call to the server but did not receive the result. When we try to repeat it, as part of the second call we evaluate what happened to the previous call on the server (information about this is stored on the server in the session data). If the first call managed to leave a trace there (commit a transaction, save session data, etc.), it simply cannot be repeated: that would lead to data inconsistency. In that case the client receives a message about an unrecoverable error, and the client application has to be restarted. If the call did not leave a trace (and this is the most common situation, because many calls do not change data at all - reports, displaying data in a form, and so on - and those that do change data leave no trace until the transaction is committed or the changed session data is sent to the manager), it can be repeated without any risk of data inconsistency. If a worker process crashed or a network connection was interrupted, such a call is repeated, and this "disaster" goes completely unnoticed by the client application.
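A condensed sketch of this retry logic is shown below; the helper names (pick_worker, previous_call_took_effect, call_id) are hypothetical stand-ins for what the platform decides internally based on session data:

    class UnrecoverableCallError(Exception):
        """The previous call already took effect on the server; retrying it
        would risk data inconsistency, so the client has to restart."""

    def call_with_retry(cluster, session_id, request):
        worker = cluster.pick_worker(session_id)
        try:
            return worker.execute(session_id, request)
        except ConnectionError:
            # The call failed mid-flight: reconnect (possibly to another worker)
            # and ask the server whether the first attempt left a trace.
            worker = cluster.pick_worker(session_id)
            if cluster.previous_call_took_effect(session_id, request.call_id):
                # Transaction committed / session data changed: cannot safely retry.
                raise UnrecoverableCallError()
            # No trace on the server (e.g. a report or form rendering): safe to retry.
            return worker.execute(session_id, request)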

Load balancing


The task of load balancing in our case is as follows: a new client enters the system (or an already connected client makes another call). We need to choose which server and which worker process to send the client's call to in order to give the client the best performance.

This is a standard task for a load-balanced cluster. There are several typical algorithms for solving it, for example:

  • Round Robin
  • Weighted Round Robin
  • Least Connections
  • Least Response Time

For our cluster, we chose an algorithm that is essentially close to Least Response Time. We have a mechanism that collects statistics on the performance of the worker processes on all servers in the cluster. It makes a reference call to each server process in the cluster; the reference call exercises a subset of the functions of the disk subsystem, memory and processor, and measures how quickly such a call completes. The result of these measurements, averaged over the last 10 minutes, is the criterion for deciding which server in the cluster is the most productive and the preferred destination for client connections during a given period of time. Client requests are distributed so as to put more load on the most productive server - load the one that can carry it.
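Roughly, the selection criterion can be sketched like this (hypothetical names; per the description above, the real mechanism benchmarks disk, memory, and CPU with a reference call and averages the results over the last 10 minutes):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 600  # average the benchmark results over the last 10 minutes

    class ClusterBalancer:
        def __init__(self, servers):
            self.servers = servers                 # list of server handles
            self.samples = defaultdict(deque)      # server -> (timestamp, seconds)

        def benchmark(self):
            """Make a reference call to every server and record how long it took."""
            now = time.time()
            for server in self.servers:
                started = time.time()
                server.reference_call()            # hypothetical: touches disk, memory, CPU
                self.samples[server].append((now, time.time() - started))
                # drop samples that fall outside the averaging window
                while self.samples[server] and self.samples[server][0][0] < now - WINDOW_SECONDS:
                    self.samples[server].popleft()

        def most_productive_server(self):
            """The server with the lowest average reference-call time wins new clients."""
            def avg(server):
                s = self.samples[server]
                return sum(d for _, d in s) / len(s) if s else float("inf")
            return min(self.servers, key=avg)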

The request from the new client is addressed to the most productive server at the moment.

A request from an existing client is, in most cases, addressed to the same server and the same worker process that handled its previous request. An extensive set of server-side data is associated with an active client, and transferring it between processes (and even more so between servers) is quite expensive (although we can do that too).

A request from an existing client is transferred to another worker process in two cases (a sketch of this routing rule follows the list):

  1. The process is gone: the worker process the client previously interacted with is no longer available (the process has crashed, the server has become unreachable, etc.).
  2. The load is out of balance: the worker process the client was interacting with is overloaded, while the cluster has a noticeably less loaded worker process available. In that case the client, together with its server-side data, is moved to the less loaded worker process (i.e. the expensive transfer mentioned above is performed).
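A condensed sketch of this routing rule, with hypothetical helpers (is_alive, is_overloaded, migrate_session_data) standing in for the platform's internal checks:

    def route_call(client, balancer):
        """Route an existing client's call: stay on the previous worker process
        when possible, move only if it is gone or the load is out of balance."""
        previous = client.last_worker

        if previous is not None and previous.is_alive():
            if not previous.is_overloaded():
                return previous                      # the common, cheap case
            # Rebalancing: moving the client's server-side data is expensive,
            # so it is only done when a less loaded worker actually exists.
            better = balancer.least_loaded_worker()
            if better is not previous:
                better.migrate_session_data(client.session_id, source=previous)
                return better
            return previous

        # The previous worker crashed or its server is unreachable.
        return balancer.most_productive_worker()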

Cluster redundancy


We decided to increase cluster resiliency by resorting to an active/passive scheme. It became possible to configure two clusters - a working one and a standby one. If the primary cluster was unavailable (because of network problems or, for example, scheduled maintenance), client calls were redirected to the standby cluster.
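On the client side this behaves roughly as follows (a schematic sketch with a made-up API, not the actual 8.2 configuration mechanism):

    def connect_with_failover(primary_cluster, backup_cluster, credentials):
        """Active/passive scheme: try the working cluster first and fall back
        to the standby cluster if the primary is unreachable."""
        for cluster in (primary_cluster, backup_cluster):
            try:
                return cluster.connect(credentials)
            except ConnectionError:
                continue   # primary unavailable (network problem, maintenance, ...)
        raise ConnectionError("both the working and the standby cluster are unavailable")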

However, this design was quite difficult to configure. The administrator had to manually assemble two groups of servers into clusters and configure them. Sometimes administrators made mistakes and set conflicting settings, since there was no centralized mechanism for validating the configuration. Nevertheless, this approach did increase the fault tolerance of the system.

1C: Enterprise 8.3


In version 8.3, we substantially rewrote the server-side code responsible for fault tolerance. We abandoned the active/passive cluster scheme because of the complexity of its configuration. Only one fault-tolerant cluster remained in the system, consisting of any number of servers - this is closer to an active/active scheme, in which requests to a failed node are distributed among the remaining working nodes. Thanks to this, the cluster became easier to configure. A number of operations that increase fault tolerance and improve load balancing were automated. Among the important innovations:

  • « »: , , . , «» .
  • , , .
  • , , , , , :


The main idea behind these developments is to simplify the administrator's work, allowing the cluster to be configured in terms the administrator is familiar with - at the level of servers, without going any lower - and to minimize the amount of "manual control" over the cluster's operation by giving the cluster the mechanisms to handle most routine tasks and possible problems "on autopilot".


Three links of fault tolerance


As you know, even if the individual components of a system are reliable, problems can arise at the points where the components interact with each other. We wanted to minimize the number of points critical to the system's operation. An important additional consideration was to minimize changes to the application-level mechanisms of the platform and to avoid requiring changes in application solutions. In version 8.3, three links ensure fault tolerance "at the joints":

  1. , HTTP(S), -. - -. , HTTP -, ( HTTP) libcurl .
  2. , .
  3. - . 1, - . - . , . , – . - (, , , ) – .
  4. , rmngr. 20 ( ) — , .. . 1: .



Thanks to the fault tolerance mechanism, applications created on the 1C: Enterprise platform successfully survive various types of failures of production servers in the cluster, while most of the clients continue to work without restarting.

There are situations when we cannot repeat a call, or when a server crash catches the platform at a very unfortunate moment - for example, in the middle of a transaction - and it is not clear what to do. We try to ensure statistically good client survival when servers in the cluster fail. Typically, the average loss of clients per server failure is a few percent, and all "lost" clients can continue working in the cluster after restarting the client application.

The reliability of the 1C server cluster increased significantly in version 8.3. For quite a while now, it has not been uncommon to see 1C deployments where the number of simultaneously working users reaches several thousand. There are implementations with 5,000 and even 10,000 concurrent users - for example, the deployment at Beeline, where the 1C: Trade Management application serves all Beeline sales outlets in Russia, or the deployment at the freight carrier Business Lines, where an application created in-house by the developers of the Business Lines IT department on the 1C: Enterprise platform serves the full cycle of cargo transportation. Our internal cluster load tests simulate the simultaneous operation of up to 20,000 users.

In conclusion, I would like to briefly list what else is useful in our cluster (the list is incomplete):

  • , . , , .
  • – , , , ( – , ..) . , , (, ) , ERP, , ERP.
  • – , (, , ..). , , , , , .
