Databases in an IIoT platform: how Mail.ru Cloud Solutions handles petabytes of data from a multitude of devices


Hi, I’m Andrey Sergeev, head of the IoT solutions development group at Mail.ru Cloud Solutions. As we all know, there is no such thing as a universal database, especially when you need to build an Internet of Things platform that processes millions of sensor events per second in near real time.

Our product, Mail.ru IoT Platform, started as a prototype built on Tarantool. I’ll describe the path we took, the problems we ran into and how we solved them, and show the current architecture of our industrial Internet of Things platform. In this article we cover:

  • our database requirements, the idea of a universal solution and the CAP theorem;
  • whether the “database + application server in one” approach is a silver bullet;
  • the evolution of the platform and the databases used in it;
  • why we ended up with three Tarantools and what each of them does.

This article is based on my talk at the @Databases Meetup by Mail.ru Cloud Solutions; a video of the talk is below:



Mail.ru IoT Platform


Our product, Mail.ru IoT Platform, is a scalable, hardware-independent platform for building industrial Internet of Things solutions. It can collect data from hundreds of thousands of devices simultaneously and process the stream in near real-time mode, including with custom rules - scripts in Python or Lua.

The platform can store an unlimited amount of raw data from sources; it includes a set of ready-made components for data visualization and analysis, built-in tools for predictive analytics, and facilities for building applications on top of the platform.


This is how the Mail.ru IoT Platform is structured.

At the moment the platform is available for on-premise installation at the customer’s facilities; this year we plan to release it as a service in the public cloud.

Tarantool prototype: how it all began


Our platform began as a pilot project: a prototype built on a single Tarantool instance. Its main functions were to receive a data stream from an OPC server, process incoming events in real time with Lua scripts, monitor key indicators based on them, and generate events and alerts for upstream systems.


Diagram of the Tarantool prototype

This prototype even ran in production for several months at a well-cluster site - an offshore oil production platform in the Persian Gulf, off the coast of Iraq. It monitored key indicators and fed data to the visualization system and the event log. The pilot was deemed successful, but, as often happens with prototypes, it never went beyond the pilot stage and was shelved until it landed in our hands.

Our goals in developing an IoT platform


Along with the prototype, we were given the task of building a full-fledged, scalable and fault-tolerant IoT platform that could later be launched as a service in the public cloud.

We needed to build a platform with the following requirements:

  1. Connect hundreds of thousands of devices simultaneously.
  2. Receive millions of events per second.
  3. Stream processing in near real-time mode.
  4. Storage of raw data for several years.
  5. Tools for both streaming analytics and historical analytics.
  6. Support for deployment in multiple data centers for maximum disaster tolerance.

Pros and cons of the platform prototype


At the time we started active development, the prototype looked like this:

  • Tarantool used as both a database and an application server;
  • all data stored in Tarantool’s memory;
  • a Lua application inside the same Tarantool that receives data, processes it and calls user scripts on incoming data.

This approach to building applications has its advantages:

  1. Code and data live in one place, which lets you operate on the data directly in the application’s memory and removes the overhead of network round trips typical of traditional applications.
  2. Tarantool uses a JIT (Just-in-Time) compiler for Lua, which compiles Lua code to machine code at run time, so simple Lua scripts run at a speed comparable to C code (40,000 RPS from a single core - and that is not the limit!).
  3. Tarantool is built on cooperative multitasking: each stored procedure call runs in its own fiber, an analogue of a coroutine, which gives an extra performance boost for tasks with I/O, such as network calls (see the sketch after this list).
  4. Efficient use of resources: few tools can handle 40,000 requests per second from a single CPU core.
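
To make items 2 and 3 more concrete, here is a minimal sketch of what such a stored procedure might look like in Tarantool Lua. The space layout, the threshold and the handle_event/notify_upstream names are our own illustration, not the prototype’s actual code.

```lua
local fiber = require('fiber')
local log   = require('log')

box.cfg{listen = 3301}

-- a space holding the latest reading per sensor (data lives next to the code)
box.schema.space.create('readings', {if_not_exists = true})
box.space.readings:create_index('primary',
    {parts = {1, 'string'}, if_not_exists = true})

-- hypothetical alerting hook: a real system would call an upstream service here
local function notify_upstream(sensor_id, value)
    log.info('ALERT: sensor %s reported %f', sensor_id, value)
end

-- stored procedure invoked by the connector for every incoming event;
-- it works on data directly in memory, with no network round trip
function handle_event(sensor_id, value, ts)
    box.space.readings:replace({sensor_id, value, ts})
    if value > 100 then
        -- run the slow part in its own fiber so the caller is not blocked
        fiber.create(notify_upstream, sensor_id, value)
    end
    return true
end
```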

But there are significant disadvantages:

  1. We need to store raw data from devices for several years, but we do not have hundreds of petabytes of memory for Tarantool.
  2. A direct consequence of the first advantage: all of the platform’s code consists of stored procedures in the database, which means that any update of the platform’s code base is a database update, and that is very painful.
  3. Difficulties with scaling. A modern server has 24-32 CPU cores, while Tarantool executes transactions in a single thread, so to use the hardware fully you have to run many Tarantool instances side by side, which complicates operation.
  4. Difficulties with development. Relatively few engineers know Tarantool and Lua well, and debugging LuaJIT code is not trivial.

Conclusion: Tarantool is a good choice for building an MVP, but not for a full-fledged, scalable, easily maintained and fault-tolerant IoT platform that has to receive, process and store data from hundreds of thousands of devices.

The main pains of the prototype that we wanted to get rid of


First of all, we wanted to cure two pains of our prototype:

  1. Move away from the “database + application server” concept. We wanted to update application code independently of the data store.
  2. Simplify dynamic scaling under load. We wanted easy, independent horizontal scaling of as many functions as possible.

To solve these problems, we chose a bold and hitherto untested approach: a microservice architecture with services split into stateless applications and stateful databases.

To make the stateless services even easier to operate and scale horizontally, we containerized them and adopted Kubernetes.


With the stateless services sorted out, it remained to decide what to do with the data.

Basic database requirements for the IoT platform


Initially we did not want to overcomplicate things, so we planned to store all of the platform’s data in a single universal database. Having analyzed our goals, we arrived at the following list of requirements for that universal database:

  1. ACID transactions - needed to store meta-information reliably.
  2. Strict consistency - all nodes must return the same data.
  3. High availability - the data must be accessible in near real-time.
  4. Horizontal scaling for reads and writes - to keep up with the growing stream of events.
  5. Fault tolerance - the platform must survive the loss of individual nodes.
  6. Storage of huge volumes of raw data (petabytes!) for several years.
  7. High performance - millions of incoming events per second (millions!).
  8. SQL - for analytics on historical data.

Checking the requirements against the CAP theorem


Before going through every database on the market to check it against our requirements, we decided to sanity-check the requirements themselves with a fairly well-known tool: the CAP theorem.

The CAP theorem states that a distributed system can have at most two of the following three properties:

  1. Consistency - at any given moment the data on all computing nodes do not contradict each other.
  2. Availability - every request to the distributed system receives a correct response, but without a guarantee that the responses of all nodes match.
  3. Partition tolerance - even if the connection between nodes is lost, they continue to work independently of each other.


For example, the classic CA system is a PostgreSQL Master-Slave cluster with synchronous replication, and the classic AP system is Cassandra.

Let us return to our requirements and classify them using the CAP theorem:

  1. ACID transactions and strict consistency (or at least not eventual consistency) give us C.
  2. Horizontal scaling for reads and writes plus high availability give us A (multi-master).
  3. Fault tolerance gives us P: if one data center goes down, the platform must not die.


Conclusion: the universal database we need would have to possess all three properties of the CAP theorem, which means there is no universal database for all of our requirements.

Choosing a database for the data with which the IoT platform works


Since a universal database could not be chosen, we decided to identify the types of data the platform works with and pick a database for each type.

To a first approximation, we divided the data into two types:

  1. Meta-information - the model of the world: devices, settings, rules; essentially all data except what the end devices transmit.
  2. Raw data from devices - sensor readings, telemetry and service information from devices. In effect these are time series, where each individual message contains a value and a timestamp.

Choosing a database for metadata


Database requirements for metadata. Metadata is inherently relational. It is small in volume and rarely modified, but it is important data that must not be lost, so consistency matters even with asynchronous replication, as do ACID transactions and horizontal read scaling.

There is relatively little such data and it changes relatively rarely, so we can sacrifice horizontal write scaling and even accept that the database may be unavailable for writes in the event of a failure. In terms of the CAP theorem, we need a CA system.

What would fit in the general case. With the problem stated this way, any classic relational database with clustering and asynchronous replication, such as PostgreSQL or MySQL, would suit us.

Features of our platform. We also needed support for trees with specific requirements. The prototype had a feature borrowed from real-time database (RTDB) systems: modeling the world with a tag tree. Tag trees let you combine all of a client’s devices into one tree structure, which makes it much easier to manage and display a large number of devices.


This is how devices are displayed as a tree structure.

Such a tree lets you link end devices to their environment: for example, devices that are physically located in one room can be put into one subtree, which makes working with them much easier later on. This is a convenient feature; moreover, we wanted to move into the niche of real-time database systems, where such functionality is effectively an industry standard.

For the full implementation of tag trees, a potential database must meet the following requirements:

  1. Support for trees with arbitrary width and depth.
  2. Modification of tree elements in ACID transactions.
  3. High performance when traversing a tree.

Classic relational databases handle small trees reasonably well, but they cope much worse with trees of arbitrary width and depth.

Possible solution. Use two databases: a graph database for storing the tree and a relational database for storing the rest of the meta-information.

This approach has several big disadvantages at once:

  1. To ensure consistency between two databases, you need to add an external transaction coordinator.
  2. This design is difficult to maintain and not very reliable.
  3. In the end we get two databases instead of one, while the graph database is needed only to support limited functionality.


A possible but not very good solution with two databases


Our solution for storing metadata. We thought about it some more and remembered that this functionality was originally implemented in the Tarantool-based prototype, and it had turned out very well.

Before continuing, I would like to give a non-standard definition of Tarantool: Tarantool is not a database, but a set of primitives for building a database for your specific case.

Primitives available out of the box (a small sketch of them in use follows the list):

  • Spaces - analogues of database tables, for storing data.
  • Full-fledged ACID transactions.
  • Asynchronous replication based on WAL logs.
  • A sharding tool with support for automatic resharding.
  • Super-fast LuaJIT for stored procedures.
  • A large standard library.
  • The LuaRocks package manager with even more packages.
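
As a small illustration of a few of these primitives working together, here is a sketch (the spaces and field layout are invented for the example): two spaces and one ACID transaction that either applies entirely or not at all.

```lua
box.cfg{}

-- devices: {id, room_id}; rooms: {id, device_count}
local devices = box.schema.space.create('devices', {if_not_exists = true})
devices:create_index('id', {parts = {1, 'unsigned'}, if_not_exists = true})

local rooms = box.schema.space.create('rooms', {if_not_exists = true})
rooms:create_index('id', {parts = {1, 'unsigned'}, if_not_exists = true})

-- move a device into a room: both writes happen atomically or neither does
function move_device(device_id, room_id)
    box.atomic(function()
        devices:update(device_id, {{'=', 2, room_id}})
        rooms:update(room_id, {{'+', 2, 1}})
    end)
end
```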

Our CA solution was a relational-plus-graph database built on Tarantool. From Tarantool primitives we assembled the metadata store of our dreams:

  • Spaces for data storage.
  • ACID transactions - available out of the box.
  • Asynchronous replication - available out of the box.
  • Relations - implemented in stored procedures.
  • Trees - also implemented in stored procedures (see the sketch below).
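
Here is a rough sketch of the tree part of that store, assuming an adjacency-list layout ({id, parent_id, name}) invented for the example: one space, a secondary index on the parent field, and a stored procedure that collects a whole subtree.

```lua
box.cfg{}

local tags = box.schema.space.create('tags', {if_not_exists = true})
tags:create_index('id',     {parts = {1, 'unsigned'}, if_not_exists = true})
tags:create_index('parent', {parts = {2, 'unsigned'}, unique = false,
                             if_not_exists = true})

-- collect every node of the subtree rooted at root_id with an iterative DFS
function subtree(root_id)
    local result, stack = {}, {root_id}
    while #stack > 0 do
        local id = table.remove(stack)
        table.insert(result, tags.index.id:get(id))
        for _, child in tags.index.parent:pairs(id) do
            table.insert(stack, child[1])
        end
    end
    return result
end
```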

Our cluster installation is classic for such systems: one master for writes and several replicas with asynchronous replication for read scaling.
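
In Tarantool terms such a cluster is mostly configuration; a sketch of what the two roles might look like (host names and credentials are placeholders):

```lua
-- on the master (accepts writes)
box.cfg{
    listen      = 3301,
    read_only   = false,
    replication = {'replicator:secret@tnt-master:3301'},
}

-- on every read replica (asynchronous replication, serves reads only)
box.cfg{
    listen      = 3301,
    read_only   = true,
    replication = {'replicator:secret@tnt-master:3301'},
}
```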

The result: a fast, scalable hybrid of a relational and graph database. A single Tarantool instance is capable of handling thousands of read requests, including those with active tree traversal.

Choosing a database for data from devices


Database requirements for data from devices. This data is written frequently and is large in volume: millions of devices, several years of storage, petabytes of information - both incoming messages and stored data. High availability is important, because it is sensor readings that both user rules and our internal services mostly operate on.

For this database, horizontal scaling for reads and writes, availability and fault tolerance matter, as does the availability of ready-made analytical tools for this data set, preferably SQL-based. At the same time we can sacrifice consistency and ACID transactions.

That is, in terms of the CAP theorem, we need an AP system.

Additional requirements. We had several additional requirements for the place where these gigantic volumes of data would be stored:

  1. Time series - sensor data is time series, so we wanted a specialized database.
  2. Open source - the advantages of open source code need no comment.
  3. A free cluster - paid-only clustering is a common affliction of fashionable new databases.
  4. Good compression - given the volume of data and its overall uniformity, we wanted the stored data to compress efficiently.
  5. Successful production experience - we wanted to start with a database that someone was already running at loads close to ours, to minimize risk.

Our solution. ClickHouse fit our requirements perfectly: a column-oriented database well suited to time series, with replication, multi-master, sharding, SQL support and free clustering. On top of that, Mail.ru has many years of successful experience operating one of the largest ClickHouse clusters by storage volume.

But however good ClickHouse is, we ran into problems with it.

Problems with the device-data database and how we solved them


The problem with write performance. A problem with writing the large incoming data stream surfaced right away. The data has to reach the analytical database as quickly as possible, so that the rules analyzing the event stream in real time can look at the history of a particular device and decide whether or not to raise an alert.

Solution to the problem. ClickHouse copes poorly with frequent single-row inserts, but works well with large batches of data: it easily handles writing batches of millions of rows. We decided to buffer the incoming stream and then insert the data in batches.
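
A minimal sketch of this buffering idea, assuming a Lua/Tarantool service, a ClickHouse table named raw_events and ClickHouse’s standard HTTP interface (all names are ours; the real ingestion services are more elaborate and handle errors and retries):

```lua
local fiber = require('fiber')
local http  = require('http.client').new()

local buffer = {}

-- called for every incoming event: append a TSV row to the in-memory buffer
function enqueue(sensor_id, ts, value)
    table.insert(buffer, string.format('%s\t%d\t%f', sensor_id, ts, value))
end

-- background fiber: once a second flush everything as one big INSERT batch
fiber.create(function()
    while true do
        fiber.sleep(1)
        if #buffer > 0 then
            local batch = table.concat(buffer, '\n')
            buffer = {}
            http:post('http://clickhouse:8123/' ..
                      '?query=INSERT%20INTO%20raw_events%20FORMAT%20TabSeparated',
                      batch)
        end
    end
end)
```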


This is how we dealt with poor write performance.

The write problem was solved, but it cost us a noticeable delay of several seconds between data entering the system and appearing in the database.

And that delay is critical for the various algorithms that react to sensor data in real time.

Read performance issue. Stream analytics for real-time data processing constantly needs information from the database - tens of thousands of small queries. On average one ClickHouse node handles about a hundred analytical queries at a time; it was designed for infrequent, heavy analytical queries over large volumes of data. This is clearly unsuitable for computing trends over the data stream from hundreds of thousands of sensors.


ClickHouse does not cope well with a large number of small queries.

Solving the problem. We decided to put a cache in front of ClickHouse holding the hottest, most frequently requested data for the last 24 hours.

The last 24 hours of data is far less than a year’s worth, but it is still a substantial volume, so here too we need an AP system with horizontal scaling for reads and writes, this time focused on performance both when writing single events and when reading many of them. The cache also needs high availability, time-series analytics, persistence and built-in TTL.

In essence, we needed a fast ClickHouse that could even keep everything in memory for speed. We did not find a suitable solution on the market, so we decided to build one from Tarantool primitives:

  1. Persistence - out of the box (WAL logs + snapshots).
  2. Performance - out of the box, all data is in memory.
  3. Scaling - out of the box, replication + sharding.
  4. High availability - out of the box.
  5. Time-series analytics tools (grouping, aggregation, etc.) - implemented in stored procedures.
  6. TTL - implemented in stored procedures with a single background fiber (coroutine); see the sketch after this list.
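
As an illustration of item 6, here is a sketch of a TTL implemented with one background fiber. The space layout (sensor_id, value, timestamp) is invented for the example; a production sweep would also yield periodically and batch the deletions.

```lua
local fiber = require('fiber')

box.cfg{}

-- hot cache: {sensor_id, value, timestamp}
local hot = box.schema.space.create('hot_data', {if_not_exists = true})
hot:create_index('pk', {parts = {{1, 'string'}, {3, 'unsigned'}},
                        if_not_exists = true})

local TTL = 24 * 60 * 60  -- keep only the last 24 hours

fiber.create(function()
    while true do
        local deadline = os.time() - TTL
        local expired = {}
        for _, tuple in hot:pairs() do
            if tuple[3] < deadline then
                table.insert(expired, {tuple[1], tuple[3]})
            end
        end
        for _, key in ipairs(expired) do
            hot:delete(key)
        end
        fiber.sleep(60)  -- sweep once a minute
    end
end)
```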

The result is a convenient and high-performance solution: a single instance handles about 10,000 read requests per second, including analytical queries, with peaks of up to tens of thousands of queries.

Here is the resulting architecture:


Final architecture: ClickHouse as the analytical database and a Tarantool cache that stores the last 24 hours of data

A new data type, state, and where to store it


We had picked specialized databases for all our data, but as the platform evolved a new data type appeared: state. State holds the current state of devices and sensors, as well as various global variables for stream analytics rules.

For example, there is a light bulb in a room. It can be either on or off, and you always need access to its current state, including from the rules. Another example is a variable in stream rules, say some kind of counter. This kind of data needs frequent writes and fast access, but takes up relatively little space.

The meta-information store is poorly suited to this kind of data, because state can change frequently, and there the write ceiling is a single master. The long-term and operational stores also fit poorly: a state may have last changed three years ago (so it is no longer in the 24-hour cache), yet we need fast read access to it.

That is, for the database storing state, horizontal scaling for reads and writes, high availability and fault tolerance are important, while consistency is needed only at the level of individual values/documents. Global consistency and ACID transactions can be sacrificed.

Any key-value or document database would do: a sharded Redis cluster, MongoDB, or, once again, Tarantool.

Tarantool Pros:

  1. This is the most popular way of using Tarantool.
  2. Horizontal scaling - asynchronous replication plus sharding.
  3. Consistency at the document level - out of the box (a small sketch follows this list).
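
A sketch of what state storage might look like on such a key-value space (the key scheme and document shapes are our own illustration):

```lua
box.cfg{}

local state = box.schema.space.create('state', {if_not_exists = true})
state:create_index('key', {parts = {1, 'string'}, if_not_exists = true})

-- document-level consistency: each replace is atomic for its own key
function set_state(key, doc)
    box.space.state:replace({key, doc})
end

function get_state(key)
    local t = box.space.state:get(key)
    return t and t[2] or nil
end

-- e.g. the light bulb from the example above and a counter used by a stream rule
set_state('room-42:lamp', {on = true})
set_state('rule-7:counter', {value = 0})
```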

As a result, we now run three Tarantools used for completely different purposes: storing meta-information, caching device data for fast reads, and storing state data.

How to choose a database for the IoT platform


  1. A universal database does not exist.
  2. Each type of data has its own database that is most suitable for it.
  3. Sometimes the database you need may not be available on the market.
  4. Tarantool is suitable as the basis for a specialized database.

This talk was first given at the @Databases Meetup by Mail.ru Cloud Solutions. Watch the videos of other talks and subscribe to Telegram announcements of events in the Around Kubernetes at Mail.ru Group channel.


What else to read:

  1. What database to choose for the project so as not to choose again.
  2. More than Ceph: block storage in the MCS cloud.

