Data Platform for regulatory reporting

Producing banking regulatory reporting is a complex process with high requirements for accuracy, reliability, and depth of the information disclosed. Traditionally, organizations automate reporting with classic data storage and processing systems. At the same time, the number of tasks grows rapidly every year that require not only analyzing large volumes of heterogeneous data, but doing so at the speed the business demands.

The combination of these factors has led to a change in data management processes. The Data Platform is an approach that rethinks the traditional concept of the classic data warehouse (DWH) using Big Data technologies and the approaches used in building Data Lake platforms. The Data Platform properly accounts for such important factors as growth in the number of users, time-to-customer requirements (the ability to deliver changes quickly), and the cost of the resulting solution, including its further scaling and evolution.

In particular, we propose to consider the experience of automating reporting under RAS, tax reporting, and reporting to Rosfinmonitoring at the National Clearing Centre (hereinafter, NCC).
The architecture capable of implementing the solution against the requirements below was chosen with great care. The competition included both classic solutions and several Big Data solutions based on Hortonworks and on Oracle Appliance.

The main requirements for the solution were:

  • Automate the construction of regulatory reporting;
  • Increase severalfold the speed of data collection and processing and the construction of final reports (the direct requirement: building all reporting within one day);
  • Offload the core banking system (ABS) by moving reporting processes off the general ledger;
  • Choose the best solution in terms of price;
  • , , ;
  • , .

A decision was made in favor of introducing the Neoflex Reporting Big Data Edition product, built on the open-source Hortonworks Hadoop platform.



The DBMS of the source systems is Oracle; sources also include flat files of various formats and images (for tax monitoring purposes), and some information is downloaded via REST APIs. Thus the task involves working with both structured and unstructured data.

Let us consider in more detail the storage areas of the Hadoop cluster:

Operational Data Store (ODS) - data is stored "as is": in the same form and format as defined by the source system. To keep history for a number of necessary entities, an additional archive data layer (ADS) is implemented.
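The interplay of ODS and ADS can be sketched as follows - a minimal Python model, with illustrative table and field names that are assumptions, not the project's actual schema:

```python
from datetime import date

def load_snapshot(ods, ads, table, rows, load_date):
    """Load a full snapshot of a source table "as is".

    ODS keeps only the latest copy of each table; ADS (the archive
    layer) keeps every snapshot partitioned by load date, so history
    is preserved for the entities that need it.
    """
    ods[table] = list(rows)                            # overwrite: current state only
    ads.setdefault(table, {})[load_date] = list(rows)  # append a dated partition

ods, ads = {}, {}
load_snapshot(ods, ads, "accounts", [{"id": 1, "bal": 100}], date(2019, 6, 1))
load_snapshot(ods, ads, "accounts", [{"id": 1, "bal": 150}], date(2019, 6, 2))
```

After the second load, ODS holds only the June 2 state, while ADS still holds both dated partitions.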

CDC (Change Data Capture): why we abandoned delta capture

Initially, capturing and loading only the changed records from the sources was considered. However, applying record-level deltas fits poorly with Hadoop.

The main reasons for rejecting incremental (delta) loading were:

  • HDFS is append-only: files cannot be updated in place, so applying record-level changes requires rewriting data;
  • deltas from different sources have to be reconciled and merged, which complicates the loading process;
  • a single lost or broken delta corrupts the accumulated state;
  • a CDC tool captures "before" and "after" images of changed records, which still have to be merged into the target.

As a result, the following approach was adopted:

  • full snapshots of the sources are loaded into ODS AS IS; storage in Hadoop is cheap enough to make full reloads affordable;
  • ODS keeps the current snapshot, while history is accumulated in the archive layer (ADS);
  • data portfolios in PDS are rebuilt one-to-one from the current snapshot.
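One practical consequence of keeping full snapshots is that a delta, if ever needed, can be derived after the fact by diffing two snapshots - no CDC tool required. A minimal sketch (field names are illustrative assumptions):

```python
def snapshot_delta(old, new, key="id"):
    """Derive inserts, updates, and deletes by comparing two full
    snapshots of the same table, keyed by a primary-key field."""
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    inserts = [r for k, r in new_by_key.items() if k not in old_by_key]
    deletes = [r for k, r in old_by_key.items() if k not in new_by_key]
    updates = [r for k, r in new_by_key.items()
               if k in old_by_key and old_by_key[k] != r]
    return inserts, updates, deletes

day1 = [{"id": 1, "bal": 100}, {"id": 2, "bal": 70}]
day2 = [{"id": 1, "bal": 150}, {"id": 3, "bal": 20}]
ins, upd, dels = snapshot_delta(day1, day2)
```

Here record 3 is an insert, record 1 an update, and record 2 a delete.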


Portfolio Data Store (PDS) is an area in which critical data is prepared and stored in a unified, centralized format, with increased requirements applied not only to data quality but also to structure, syntax, and semantics. Examples of such data include customer registers, transactions, and balance sheets.

ETL processes are developed in Spark SQL using Datagram. It belongs to the class of "accelerator" solutions: it simplifies development through visual design and the description of data transformations in familiar SQL syntax, while the job code itself is generated automatically in Scala. Thus the development complexity is comparable to building ETL on more traditional and familiar tools such as Informatica or IBM InfoSphere DataStage, and no additional training or involvement of experts with special knowledge of Big Data technologies and languages is required.
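A transformation of the kind described above is ordinary SQL. The sketch below uses Python's built-in sqlite3 as a self-contained stand-in for Spark SQL; the table and the aggregation are illustrative, not a real Datagram job:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (client_id INT, amount REAL);
INSERT INTO transactions VALUES (1, 100.0), (1, 50.0), (2, 70.0);
""")

# The same kind of aggregation a visually designed Spark SQL step would express:
rows = conn.execute("""
SELECT client_id, SUM(amount) AS total
FROM transactions
GROUP BY client_id
ORDER BY client_id
""").fetchall()
# rows == [(1, 150.0), (2, 70.0)]
```

In Datagram the same SQL would be attached to a visual transformation node and compiled into a Scala Spark job.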

At the next stage, reporting forms are calculated. The calculation results are placed in data marts in the Oracle DBMS, where interactive reports are built on Oracle APEX. At first glance, it might seem counterintuitive to use commercial Oracle alongside open-source Big Data technologies. The decision to use Oracle and APEX was based on the following factors:

  • Lack of an alternative BI solution compatible with a freely distributed DBMS and meeting the NCC business requirements for on-screen and printed regulatory reporting forms;
  • Oracle is already used in the DWHs that act as source systems for the Hadoop cluster;
  • The existence of the flexible Neoflex Reporting platform on Oracle, which covers the majority of regulatory reports and integrates easily with the Big Data technology stack.

The Data Platform stores all data from the source systems, unlike a classic DWH, where data is stored to solve specific problems. At the same time, only useful, necessary data is described, prepared, and managed in the Data Platform: if certain data is used on an ongoing basis, it is classified by a number of attributes, placed into separate segments (portfolios, in our case), and managed according to the characteristics of those portfolios. In a classic DWH, by contrast, all data loaded into the system is prepared, regardless of whether it will be used further.

Therefore, when expanding to a new class of tasks, a classic DWH often faces what is effectively a new implementation project with the corresponding T2C, while in the Data Platform all the data is already in the system and can be used at any time without preliminary preparation. For example, data is pulled from ODS, quickly processed, "bolted on" to a specific task, and delivered to the end user. If this direct use shows that the functionality is correct and applicable going forward, the full process is launched: target transformations are built, data portfolios are prepared or enriched, the data mart layer is populated, and full-fledged interactive reports or extracts are built.
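The "fast path" described above - a one-off selection straight from raw ODS data, with no target transformations or marts built first - can be sketched like this (record fields are hypothetical):

```python
def adhoc_report(ods_rows, predicate, fields):
    """Quickly bolt a one-off selection onto raw ODS data: filter the
    rows with a predicate and keep only the fields the user asked for."""
    return [{f: r[f] for f in fields} for r in ods_rows if predicate(r)]

ods_rows = [
    {"id": 1, "type": "fx",   "amount": 100.0},
    {"id": 2, "type": "repo", "amount": 250.0},
]
report = adhoc_report(ods_rows, lambda r: r["type"] == "repo", ["id", "amount"])
```

Only if such an extract proves useful on an ongoing basis is it promoted into the full PDS-and-mart pipeline.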

The project is still in progress; however, we can already note a number of achievements and intermediate results:

  1. The basic functionality of the platform has been implemented:

    • user access is integrated with LDAP;
    • data loading is under way: 35 … in HDFS, 15 … (50 thousand …);
    • images are stored in HDFS as part of the Big Data landscape;
    • data portfolios (PDS) are built in the Hadoop cluster.
  2. Report calculation has been transferred to Hadoop;
  3. The open-source stack, i.e. Hadoop and Spark, has proved itself in practice;
  4. Datagram has proved itself as the tool for ETL development.


— …, Big Data Solutions, Neoflex
