🎐 🖐🏾 😮 Denormalization of ERP database systems and its impact on software development: open a tavern on Tortuga 👨🏻‍⚕️ 👩🏾‍🎤 🗿

Hello! My name is Andrey Semenov, I am a senior analyst at Sportmaster. In this post I want to raise the issue of denormalization of ERP database systems. We will consider the general conditions, as well as a specific example - let's say it will be a wonderful monopoly tavern for pirates and sailors. In which pirates and sailors need to be served differently, because the ideas about the beautiful and consumer patterns of these good masters are significantly different.

How to make everyone happy? How not to go crazy designing and maintaining such a system? What to do if not only familiar pirates and sailors begin to come to the tavern? Everything is under the cut. But let's go in order.

1. Limitations and Assumptions

All of the above applies only to relational databases. Well-illuminated, including on the Internet, the effects of denormalization in the form of anomalies of modification, deletion and insertion are not considered. Beyond the scope of the publication, there remain cases where denormalization is a commonplace, with classic examples: series and passport number, date and time, and so on.

The post uses intuitive and practically applicable definitions of normal forms, without reference to mathematical terms. In the form in which they can be applied to the examination of real business processes (BP) and the design of industrial software.

There is an opinion that the design of data warehouses, reporting tools and integration agreements (which use tabular presentation of information) differs from the design of ERP database systems in that ease of consumption and the application of conscious denormalization to achieve it can take precedence over integrity protection data. I share this opinion, and the description below applies exclusively to master data models and transactional data of ERP systems.

The explanation of normal forms is given by an example that is understandable at the household level to most readers. However, as an illustrative illustration, in paragraphs 4-5, the emphasized “invented” task was deliberately used. If you don’t do this and take some kind of textbook example, for example, the same model for storing an order from Section 2, you may find yourself in a situation where the reader’s attention will be shifted from the proposed decomposition of the process into a model, to personal experience and perception of how processes and models of data storage in IP should be built. In other words, take two qualified IT analysts, let one provide services to logisticians transporting passengers, and the other to logisticians transporting machines for the production of microchips. Ask them, without discussing pre-automated BP, to create a data model for storing information about the railway flight.

There is a non-zero probability that in the proposed models you will find not only a noticeably different set of attributes, but also dissimilar sets of entities, because each analyst will rely on his usual processes and tasks. And to say in such a situation which model is “correct” is impossible, because there is no evaluation criterion.

2. Normal forms

The first normal form of the database requires the atomicity of all attributes.
In particular, if object A has non-key attributes a and b, such that c = f (a, b) and in the table describing object A you store the value of attribute c, then the first normal form is violated in the database. For example, if the order specification specifies a quantity whose units of measurement depend on the type of product: in one case it can be pieces, in the other liters, in the third package consisting of pieces (in the model above Good_count_WR), then the atomicity of the attributes is violated in the database. In this case, to say what the bush of the tables should be like for the order specification, you need a targeted description of the process of working in the IP, and since the processes can be different, there can be many “correct” versions.

The second normal form of the databaserequires compliance with the first form and its own table for each entity related to the process of working in IP. If in one table there are dependencies c = f1 (a) and d = f2 (b) and there is no dependence c = f3 (b), then the second normal form is violated in the table. In the example above, in the “Order” table, there is no relationship between the order and the address. Change the name of the street or city and you will not get any influence on the essential attributes of the order.

The third normal form of the databaserequires compliance with the second normal form and the absence of functional dependencies between attributes of different entities. This rule can be formulated as follows: "everything that can be calculated must be calculated." In other words, if there are two objects A and B. In the table that stores the attributes of object A, attribute C is displayed, object B has attribute b such that c = f4 (b) exists, then the third normal form is violated. In the example below, the attribute “Number of pieces” (Total_count_WR) in the order record clearly claims to violate the third normal form

3. My approach to applying normalization

1. Only the target automated business process can provide analytics with criteria for identifying entities and attributes when creating a data storage model. Creating a process model is a prerequisite for creating a normal data model.

2. The achievement of the third normal form in the strict sense may be impractical in the actual practice of creating ERP systems when part or all of the following conditions are met:

automated processes are rarely subject to change,
research and development deadlines are tight,
requirements for data integrity are relatively low (potential errors in industrial software do not lead to the loss of money or customers by the software customer)
etc.

Under the described conditions, the costs of identification, a description of the life cycle of some objects and their attributes may not be justified from the point of view of economic efficiency.

3. Any consequences of the denormalization of the data model in the already created IP can be stopped by a thorough preliminary study of the code and testing.

4. Denormalization is a way to transfer labor costs from the stage of researching data sources and designing a business process to the development stage, from the implementation period to the development period of the system.

5. It is advisable to strive for the third normal form of the database if:

The direction of change in automated business processes is difficult to predict
There is a permeable division of labor within the implementation and / or development team
The systems included in the integration circuit are developing according to their own plans.
Data inconsistency can lead to loss of customers or money by the company

6. The design of the data model should be carried out by the analyst only in connection with the models of the target business process and the process in IP. If a developer is engaged in designing a data model, he will have to dive into the subject area to such an extent that, in particular, he needs to understand the difference between attribute values - a necessary condition for distinguishing atomic attributes. Thus, taking on unusual functions.

4 Task for illustration

Let's say you have a small robotic tavern in the port. Your market segment: sailors and pirates who call at the port and need a rest. For sailors, you sell tea with thyme, and for pirates, rum and bone combs for combing your beard. The service in the tavern itself is provided by a hostess robot and a bartender robot. Thanks to high quality and low prices, you have crowded out everyone with a competitor, so that everyone leaving the ship comes to your tavern, which is the only one in the port.

The complex of tavern information systems consists of the following software:

Client early warning system that recognizes its category by characteristic features
Management system for hostess robots and bartender robots
Warehouse and point of delivery management system
Supplier Relationship Management System (SMSS)

Process:

The early warning system detects people leaving the ship. If a person is clean-shaven, she defines him as a sailor, if a beard is found in a person, then he is defined as a pirate.

Entering the tavern, the guest hears a greeting from the hostess robot according to his category, for example: “Ho-ho-ho, dear pirate, go to table No. ...” The

guest goes to the indicated table, where the robot-bartender has already prepared for him goods according to category. The bartender robot transmits information to the warehouse system that the next portion of delivery should be increased, the warehouse IS, based on the remaining storage, forms an application for purchase in the control system.

Let your internal IT develop an early warning system, an external contractor specially designed for your business to create a bar robot management program. And systems for warehouse management and supplier relationships are customized boxed solutions from the market.

5. Examples of denormalization and its impact on software development

When designing a business process, interviewed subject matter experts unanimously stated that pirates around the world drink rum and comb their beards with bone crests, and sailors drink tea with thyme and are always shaved smoothly.

A directory of customer types appears from two values: 1 - pirates, 2 - sailors, common to the entire information circuit of the company.

The customer notification system immediately saves the image processing result as the identifier (ID) of the recognized customer and its type: sailor or pirate.

Recognized Object ID	Customer category
100500	Pirate
100501	Pirate
100502	Sailor

Once again, we note that

1. Our sailors are actually shaved people
2. Our pirates are actually bearded people

What problems in this case need to be addressed so that our structure strives for a third normal form:

attribute atomic violation - Client category
mixing of the analyzed fact and conclusion in one table
fixed functional relationship between attributes of different entities.

In normalized form, we would get two tables:

recognition result in the form of a set of established features,

Recognized Object ID	Facial hair
100500	Yes
100501	Yes
100502	No

the result of determining the type of client as an application of the logic embedded in the IS for the interpretation of established signs

Recognized Object ID	Identification ID	Customer category
100500	100001	Pirate
100501	100002	Pirate
100502	100003	Sailor

How can a normalized storage organization facilitate the development of an IP complex? Let's say you suddenly have new customers. Let it be Japanese pirates who may not have a beard, but they walk with a parrot on their shoulders, and environmental pirates, you can easily recognize them by the blue profile of Greta on their left chest.

Environmental pirates, of course, cannot use bone crests and require an analogue from recycled marine plastic.

You need to rework the algorithms of the programs in accordance with the new introductory. If the normalization rules were met, then you would only have to add inputs for some process branches and create new branches only for those cases and in those IPs where the hairline on the face is important. But, since the rules were not followed, you will have to analyze the entire code, in the whole circuit, where the values of the client types directory are used and unambiguously establish that in one case the algorithm should take into account the professional activities of the client, and in the other physical features.

In a view that tends to normalize, we would get two tables with operational data and two directories:

recognition result in the form of a set of established features,


100510	1	1	1
100511	0	0	1
100512		1	0

( , )

Does the denormalization detected mean that the systems cannot be modified under new conditions? Of course not. If you imagine that all IPs were created by one team with zero staff turnover, the development is well documented and the information in the team is transmitted without loss, then the required changes can be products with neglectingly low effort. But if we return to the initial conditions of the task, only 1.5 keyboards and another 0.5 for registration of procurement procedures will be erased only for printing protocols of joint discussions.

In the above example, all three normal forms are violated, let's try to violate them individually.

Violation of the first normal form:

Suppose goods are delivered to your warehouse from suppliers' warehouses at your own expense using one 1.5-ton gazelle that belongs to your tavern. The size of your orders is so small relative to the turnover of suppliers that they are always carried out one to one without waiting for production. Do you need separate tables for such a PSU: vehicles, types of vehicles, do you need to separate the plan and the fact in your orders to departed suppliers?

Just imagine how many “extra” connections your programmers will have to write if you use the model below to develop a program.

Suppose we made a decision that the proposed structure is unnecessarily complicated, for our case, separating the plan and the fact in the order record is redundant information, and the generated order specification is overwritten by the results of acceptance of the arrived goods, rare re-sorting and arrival of goods of inadequate quality are settled outside the IP.
And then one day you see how the entire tavern hall is filled with indignant and unkempt pirates. What happened?

It turned out that along with the growth of your business, consumption also grew. A management decision was once made that if a gazelle was overloaded in volume and / or weight, which was extremely rare, the supplier prioritized loading in favor of drinks.

Undelivered goods fell into the next order and departed on a new flight, the presence of an irreducible balance in the warehouse at the tavern made it possible not to notice punctured cases.

The last competitor closed at the port, and the punctured gazelle overload case, circumvented by prioritization based on the assumption of the sufficiency of the minimum balance and periodic underloading of the vehicle, became common practice. The created system will ideally work out in accordance with the algorithms laid down in it and will be deprived of any opportunity to track the systematic non-fulfillment of planned orders. Only a damaged reputation and dissatisfied customers will be able to detect the problem.

An attentive reader must have noticed that the ordered quantity in the order specification (T_ORDER_SPEC) in section 2 and section 5 may or may not meet the requirement of the first normal form. It all depends on whether, for a selected assortment of goods, essentially different units of measurement can fall into the same field.

Violation of the second normal form:

As your needs grow, you get a couple more vehicles of different sizes. In the above context, the creation of a vehicle directory was considered redundant, as a result, all data manipulation algorithms that serve the needs of delivery and warehouse perceive the movement of goods from the supplier to the warehouse as an exclusively 1.5-ton gazelle flight. So, along with the purchase of new vehicles, you still create a directory of vehicles, but when finalizing you will have to analyze the entire code that refers to the movement of cargo to find out if references to the characteristics of the car from which business started.

Violation of the third normal form:

At some point, you begin to create a loyalty program, a regular customer record appears. Why, for example, spend time creating material representations that store aggregated sales data for an individual customer for use in reporting and transferring to analytical systems, if at the start of the loyalty program everything that interests the customer can be placed on the customer’s own records? And, indeed, there is no sense at first glance. But every time your business connects, for example, new sales channels, there should be someone among your analysts who remembers that there is such an aggregation attribute.

When designing each new process, for example, selling on the Internet, selling through distributors connected to a common loyalty system, someone should keep in mind that all new processes should ensure data integrity at the code level. For an industrial database with a thousand tables, this seems like an impossible task.

An experienced developer, of course, knows how to stop all the problems mentioned above, but, in my opinion, the task of an experienced analyst is not to bring them to their attention.

I want to express my gratitude for the valuable feedback during the preparation of the publication to the leading developer Evgeny Yarukhin.

Literature

https://habr.com/en/post/254773/
Connolly Thomas, Begg Carolyn. Database. Design, implementation and maintenance. Theory and practice