🧜🏾 📢 👨🏾‍💼 A modest guide to database schemas 🥠 😡 👩🏼‍✈️

Geometry of Flowers by Mookiezoolook

For applications that will scale according to traffic and complexity, it is extremely important to initially design a competent database scheme. If you make a bad choice, you will have to spend a lot of effort so that this bad template does not spread to services and backend controllers and, finally, to the frontend.

But how to evaluate which circuit is better? And what does “better” mean when we talk about database architecture? The Mail.ru Cloud Solutions teaminvites you to follow the recommendations of Mike Alcha , a software development consultant. It seems to us that he rather succinctly summarized some principles of competent architecture.

Director: " I think we should build an SQL database . " Developer (does he even understand what he’s talking about, or just saw some kind of advertisement in a business magazine? ..): “ What color do you want the database? ". Director: “ Perhaps lilac has the most memory . ”

A few basic tips

So, it’s important to strive for two main things :

When breaking information into tables, all information is stored.
Redundancy of storage is minimal.

As for the second point: do we want to reduce redundancy just because of a problem with storage size? No, we do this mainly because the presence of redundant data leads to inconsistency problems if you do not update all fields representing the same information during the update.

Here are some guidelines to get closer to good architecture :

Use at least third normal form (in which each non-key attribute "must provide information about the key, the complete key, and nothing but the key", according to the wording of Bill Kent).
Create the last line of defense in the form of restrictions.
Never store whole addresses in one field.
Never store the first and last name in one field.
Set conventions for table and field names and stick to them.

- What are you working on?

“ Optimizing this SQL query.” It slows down, and users begin to complain.

- And obscene language in the comments is required for optimization?

- If you saw the original code, you would not ask.

Let's consider these recommendations in more detail.

1. Use at least a third normal form

The database architecture can be divided into the following categories:

The first normal form.
The second normal form.
Third normal form.
The normal form of Boyce-Codd.

These categories represent a classification by quality. We will briefly review all categories and see why at least a third normal form is needed.

First normal form

For the first normal form, each value of each column of each table in the database must be atomic. What does atomic mean? In short, the atomic value is a “single thing”.

For example, we have a table like this:

first_name	last_name	age	areas
Jhon	Doe	27	{“Website Design”, “Clientele Research”}
Mary	Jane	33	{“Long-term strategic planning”, “Recruitment”}
Tom	Smith	35	{"Marketing"}

Here, the areas column contains values that are not atomic. For example, in a John Doe line, the field stores two entities: Web site design and Client research.

So this table is not in the first normal form.

To bring it to this form, only one value should be stored in each field .

Second normal form

In the second normal form, no column that is not part of the primary key (or which can act as part of another primary key) cannot be derived from the smaller part of the primary key .

What does it mean?

Suppose you have such a base architecture (I emphasized the fields corresponding to the primary key in this table):

employee_id	project_id	Hours	employee_name	project_name
1	1	10	John	“Website design”
2	1	twenty	Mary	“Website design”

In this project, the employee name can be directly inferred from employeee_id, because the idea is that the employee name is uniquely determined by its identifier.

Similarly, the project name is uniquely identified by the project_id identifier.

Thus, we have two columns that can be deduced from the primary key part.

Each of these examples would be enough to throw this table out of the second normal form.

Another conclusion is that if the table was in the first normal form and all primary keys are single columns, then the table is already in the second normal form.

Third normal form

For the table to correspond to the third normal form, it must be in the second normal form, while there should not be any attributes (columns) in it, except the primary one, which are transitively dependent on the primary key.

What does it mean?

Let's say you have the following architecture (which is far from ideal):

employee_name	employee_id	age	department_number	department_name
John	1	27	123	"Marketing"
Mary	2	33	456	"Operational"
Tom	3	35	123	"Marketing"

In this table, department_number can be inferred from employee_id, and department_name can be inferred from department_number. So department_name is transitively dependent on employee_id!

If there is such a transitive dependency: employee_id → department_number → department_name, then this table is not in the third normal form.

What problems arise because of this ?

If the name of the department can be derived from its number, then storing this field for each employee introduces excessive redundancy.

Imagine that the marketing department changes its name to "Marketing and Sales." To maintain consistency, you will have to update the cell in each row of the table for each employee in this department! In the third normal form, this would not have happened.

In addition, here is what happens if Mary decides to leave the company: we must delete her row from the table, but if she was the only employee in the operations department, then the department will also have to be deleted.

All of these problems can be completely avoided in a third normal form.

Mom's exploits . Her daughter's name is Help! I'm forced to fake passports

2. Create the last line of defense in the form of restrictions

The database you are working with is more than just a group of tables. Certain functionality is built into it. Many of these features help ensure data quality and accuracy.

Restrictions set the rules, what values can be entered in the database fields.

When defining relationships in a database, be sure to set foreign key constraints.

Be sure to specify what should happen when deleting and updating a row associated with other rows in other tables (ON DELETE and ON UPDATE rules).

Be sure to use NOT NULL for all fields that should never be nullified. It may make sense to set a check on the backend, but remember that crashes always happen, so adding this kind of restriction will not hurt.

Set CHECK check limits to make sure that the table values are in the acceptable range, for example, the price of a product always has a positive value.

An interesting fact : in April 2020, precisely such a restriction in software prevented trading on the Moscow MICEX because the price of WTI oil futures fell below zero. Unlike the Moscow stock exchange, the New York Mercantile Exchange NYMEX updated the software a week before the incident , so it was able to successfully conduct transactions at a negative price, that is, with a surcharge to the buyer from the seller - approx. trans.

All PostgreSQL limitations can be found here .

3. Never store whole addresses in one field

If your application or website has a form with one field where the user enters his address, then it smells bad. It is very likely that in this case you will also have one field in the database to store the address as a simple string.

But what to do if you need to combine customer purchases by city in order to see which city which product is more popular? can you do it?

It will be very hard!

Since the full address is stored as a string in the database field, you will first have to figure out how much of this string is the city! And this is an almost impossible task, given all the possible address formats in this field.

Therefore, be sure to break the universal "Address" field into specific fields: street, house number, city, region, zip code, and so on.

Another Address Problem - Anonymous Fields

Here is an illustration from Michaels Blach’s book, The Copper Bullet to Improve Software Quality:

What potential problems are visible here? Can you easily distinguish the city of Chicago from the streets of Chicago? Probably not.

Therefore, remember to always give clear column names to each unit of information.

How to write a resume

- Do you have experience in SQL?

- No (No).

- So write: NoSQL expert.

4. Never store the first and last name in one field

Similar to the situation with addresses: the number of variations of the name and surname is too large to clearly distinguish between them.

Of course, you can separate the name from the last name, if there is a space between them.

For example, “Mike Alche” → the name “Mike” and the surname “Alche”.

But what if the user entered a middle name? Or does he have a double surname? But what if there is a middle name and a double surname?

How to determine where is the name and where is the last name to split the string? Mistakes are inevitable.

A way to avoid many problems is to create separate fields (in forms) for the first_name and last_name usernames. In this way, you allow users to share their own names and can store data in a consistent way.

Note: I am not saying that spaces are forbidden in the fields of the database. For example, for names like Juan Martin Del Potro, the first part of Juan Martin is in the first_name field, and Del Potro is in the last_name field. Of course, this is not perfect . You can optionally have the columns middle_name and second_last_name. Look in more detail about possible variations of names and surnames in the list “ Misconceptions of programmers about names ” and the article “ Misconceptions of programmers about names - with examples ”. You have to agree on some kind of compromise between accuracy and practicality.

5. Set conventions for table and field names and stick to them

It's pretty annoying to work with data that looks like user.firstName, user.lst_name, user.birthDate and so on.

I would advise you to establish underscore naming rules, because not all SQL engines handle uppercase letters the same way, and enclosing everything in quotation marks is very tedious.

Choose the same as calling the tables - in the plural or singular (for example, users in the plural or user in the singular). I like the singular more, but all the backend frameworks seem to be plural by default. You have to follow the pattern and use the plural.

What else to read :

A modest guide to database schemas