The development of DATA VAULT and the transition to BUSINESS DATA VAULT

In a previous article, I talked about the basics of DATA VAULT, described the basic elements of DATA VAULT and their purpose. This cannot be considered the topic of DATA VAULT exhausted, it is necessary to talk about the next stages of the evolution of DATA VAULT.

And in this article I will concentrate on the development of DATA VAULT and the transition to BUSINESS DATA VAULT or simply BUSINESS VAULT.

Reasons for the appearance of BUSINESS DATA VAULT


It should be noted that DATA VAULT having certain strengths is not without drawbacks. One of these drawbacks is the difficulty in writing analytical queries. Requests have a significant number of JOINs, the code is long and cumbersome. Also, the data falling into DATA VAULT are not subject to any conversion, therefore, from the point of view of business, DATA VAULT in its pure form does not have unconditional value.

To eliminate these shortcomings, the DATA VAULT methodology has been expanded by such elements as:

  • PIT (point in time) tables;
  • BRIDGE tables;
  • PREDEFINED DERIVATIONS.

Let's take a closer look at the purpose of these elements.

Pit tables


As a rule, one business object (HUB) can include data with different update rates, for example, if we are talking about data characterizing a person, we can say that information about a phone number, address or email has a higher update rate than say, name, passport details, marital status or gender.

Therefore, when determining the satellites, it should be borne in mind the frequency of their updates. Why is it important?

If you store attributes with different refresh rates in one table, you will have to add a row to the table each time you update the most frequently changed attribute. As a consequence, an increase in disk space, an increase in query execution time.

Now that we have divided the satellites according to the update frequency, and we can upload data to them independently, it should be possible to obtain relevant data. Better without using unnecessary JOINs.

I will explain, for example, it is required to obtain up-to-date (by the date of the last update) information from satellites having different update frequencies. To do this, you need to not only make a JOIN, but also create several sub-queries (for each satellite containing information) with a choice of the maximum update date MAX (Update Date). With each new JOIN, such code grows, and very quickly becomes difficult to understand.

The PIT table is designed to simplify such queries; PIT tables are filled at the same time as new data is written to DATA VAULT. PIT table:

image

Thus, we have information on the relevance of data on all satellites at each moment in time. Using JOINs for the PIT table, we can completely exclude nested queries, naturally with the condition that the PIT is filled every day and without gaps. Even if there are gaps in the PIT, the actual data can be obtained only using one sub-request to the PIT itself. One subquery will work faster than subqueries for each satellite.

BRIDGE


BRIDGE tables are also used to simplify analytic queries. However, the difference from PIT is a means of simplifying and speeding up requests between different hubs, links and their satellites.

The table contains all the necessary keys for all satellites that are often used in queries. In addition, if necessary, hashed business keys can be supplemented with keys in text form if key names are needed for analysis.

The fact is that without using BRIDGE, in the process of obtaining data located in satellites belonging to different hubs, it will be necessary to produce JOINs not only of the satellites themselves, but also links connecting hubs.

The presence or absence of BRIDGE is determined by the storage configuration, the need to optimize the speed of query execution. It’s hard to come up with a universal example of BRIGE.

PREDEFINED DERIVATIONS


Another type of objects that brings us closer to BUSINESS DATA VAULT are tables containing pre-calculated indicators. Such tables are really important for business; they contain information aggregated according to the given rules and make it relatively easy to access it.

Architecturally, PREDEFINED DERIVATIONS are nothing more than just another satellite of a certain hub. It, like a regular satellite, contains a business key and the date the record was formed in the satellite. On this, however, the similarities end. The further composition of the attributes of such a “specialized” satellite is determined by business users on the basis of the most popular, pre-calculated indicators.

For example, a hub containing information about an employee may include a satellite with indicators such as:

  • Minimum wage;
  • Maximum salary;
  • Average salary;
  • Cumulative total of accrued salaries, etc.

It is logical to include PREDEFINED DERIVATIONS in the PIT table of the same hub, then you can easily get slices of employee data for a specific date.

FINDINGS


As practice shows, the use of DATA VAULT by business users is somewhat difficult for several reasons:

  • The request code is complex and cumbersome;
  • The abundance of JOINs affects query performance;
  • Writing analytic queries requires outstanding knowledge of the structure of the repository.

To simplify data access, DATA VAULT extends with additional objects:

  • PIT (point in time) tables;
  • BRIDGE tables;
  • PREDEFINED DERIVATIONS.

In the next article, I plan to tell, in my opinion, the most interesting thing for those who work with BI. I will present ways of creating tables - facts and tables - measurements based on DATA VAULT.

Materials of the article are based:

  • On the publication of Kent Graziano, which in addition to a detailed description contains diagrams of the model;
  • Book: “Building a Scalable Data Warehouse with DATA VAULT 2.0”;
  • Data Vault Fundamentals article .

All Articles