Truth first, or why systems should be designed around the database

Hello, Habr!

We continue to explore the topic of Java and Spring, including at the database level. Today we offer a look at why, when designing large applications, it is the database structure, and not the Java code, that should determine how the design is done, and what the exceptions to this rule are.

In this rather belated article, I will explain why I believe that in almost all cases the data model in an application should be designed "database first", and not "around the capabilities of Java" (or whatever client language you work with). If you choose the second approach, you embark on a long journey of pain and suffering as soon as your project begins to grow.

This article was prompted by a question asked on Stack Overflow.

There are also interesting discussions on reddit in /r/java and /r/programming.

Code generation


It never ceases to surprise me that there is a small but vocal group of users who, upon getting acquainted with jOOQ, are outraged by the fact that jOOQ relies heavily on source code generation. Nobody stops you from using jOOQ as you see fit, and nobody forces you to use code generation. But by default (as described in the manual), working with jOOQ goes like this: you start with a (legacy) database schema, reverse engineer it with the jOOQ code generator to obtain a set of classes representing your tables, and then write type-safe queries against those tables:

	for (Record2<String, String> record : DSL.using(configuration)
//   ^^^^^^^^^^^^^^^^^^^^^^^ type information derived from the
//   generated code referenced in the SELECT clause below

       .select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
//            vvvvv ^^^^^^^^^^^  ^^^^^^^^^^^^^^^ generated names
       .from(ACTOR)
       .orderBy(1, 2)) {
    // ...
}

The code is generated either manually outside the build, or automatically with each build. For example, such regeneration may follow immediately after a Flyway database migration, which can also be run manually or automatically.
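
For illustration, this is roughly what such a pipeline step can look like when driven from plain Java instead of a build plugin. This is only a minimal sketch, not the only way to wire it up: it assumes Flyway 5+ and jOOQ 3.11+ APIs, a hypothetical H2 development database, and a jooq-codegen.xml file containing an ordinary jOOQ generator configuration.

import java.nio.file.Files;
import java.nio.file.Path;

import org.flywaydb.core.Flyway;
import org.jooq.codegen.GenerationTool;

public class MigrateThenRegenerate {
    public static void main(String[] args) throws Exception {
        // 1. Apply pending Flyway migrations to the (placeholder) dev database
        Flyway.configure()
              .dataSource("jdbc:h2:~/bookstore", "sa", "")
              .load()
              .migrate();

        // 2. Reverse engineer the freshly migrated schema with the jOOQ
        //    code generator, configured in jooq-codegen.xml
        GenerationTool.generate(Files.readString(Path.of("jooq-codegen.xml")));
    }
}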

Source code generation


Various philosophies, advantages, and disadvantages are associated with these two approaches to code generation - manual and automatic - which I am not going to discuss in detail in this article. But, in general, the whole point of generated code is that it reproduces in Java the "truth" that we take for granted, either inside our system or outside it. In a sense, compilers that generate bytecode, machine code, or some other kind of code from source do the same thing: we get a representation of our "truth" in another language, regardless of the specific reasons.

There are many such code generators. For example, XJC can generate Java code from XSD or WSDL files. The principle is always the same:

  • There is some truth (internal or external) - for example, a specification, a data model, etc.
  • We need a local representation of this truth in our programming language.

Moreover, generating such a representation is almost always advisable - to avoid redundancy.

Type Providers and Annotation Processing


Note: another, more modern and specific approach to code generation for jOOQ involves type providers, as they are implemented in F#. In this case, the code is generated by the compiler, effectively at compilation time, and essentially never exists in source form. Java has similar, albeit less elegant, tools: annotation processors such as Lombok.
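
To make the annotation processing variant concrete, here is a minimal Lombok sketch (BookDto is a hypothetical class): the boilerplate is emitted by the annotation processor straight into the compiled class file and never appears as source.

import lombok.Data;

// @Data makes Lombok generate getters, setters, equals(), hashCode()
// and toString() at compile time - callers use them as if hand-written:
//     BookDto dto = new BookDto();
//     dto.setTitle("1984");
@Data
public class BookDto {
    private int id;
    private String title;
}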

In a certain sense, the same thing happens here as in the first case, except that:

  • You do not see the generated code (perhaps some find that less repulsive?)
  • You must make sure that the types can be provided, i.e., that the "truth" is always available. This is easy with Lombok, which annotates the "truth" itself. It is harder with database models, which rely on an always-available live connection.

Is code generation really necessary?


Besides the tricky question of how best to run code generation - manually or automatically - I must mention that there are people who believe code generation is not needed at all. The justification for this view that I have encountered most often is that setting up a build pipeline then becomes hard. Yes, it really is hard. There are additional infrastructure costs. If you are just starting to work with a given product (be it jOOQ, or JAXB, or Hibernate, etc.), setting up a working environment takes time that you would rather spend learning the API itself and extracting value from it.

If the cost of understanding how the generator works is too high, then the API's vendor has in fact done a poor job on the code generator's usability (and later it turns out that its user configuration is complicated, too). Ease of use should be the highest priority for any such API. But this is just one argument against code generation. The alternative is to write the local representation of the internal or external truth entirely by hand.

Many will say that they have no time for all this. Deadlines loom for their Super Product. We'll tidy up the build pipelines someday later; we'll manage. I will answer them:


[Image: the "too busy to improve" cartoon - a cart on square wheels whose pushers have no time for round ones. Original by Alan O'Rourke, Audience Stack]

But in Hibernate / JPA it is so simple to write code "Java first".

True. For Hibernate and its users, this is both a blessing and a curse. In Hibernate you can simply write a couple of entities, like this:

	@Entity
class Book {
  @Id
  int id;
  String title;
}

And almost everything is ready. Now it is Hibernate's job to generate the tricky "details" of exactly how this entity will be defined in the DDL of your SQL "dialect":

	CREATE TABLE book (
  id INTEGER GENERATED ALWAYS AS IDENTITY,
  title VARCHAR(50),
 
  CONSTRAINT pk_book PRIMARY KEY (id)
);
 
CREATE INDEX i_book_title ON book (title);

... and off we go running the application. A really great way to get started quickly and try things out.

But wait a moment. I lied.

  • Will Hibernate really apply this named primary key definition?
  • Will Hibernate create the index on TITLE? I know for sure we will need it.
  • Will Hibernate actually make this key an identity, with an identity specification?

Probably not. While you are developing your project from scratch, it is always convenient simply to drop the old database and generate a new one as soon as you add the necessary annotations. So the Book entity eventually takes the form:

	@Entity
@Table(name = "book", indexes = {
  @Index(name = "i_book_title", columnList = "title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
  String title;
}

Cool. Regenerate. Again, at the start this is all very easy.

But then you have to pay for it


Sooner or later you have to go into production. That is exactly when this model stops working. Because:

In production it will no longer be possible to drop the old database and start over from scratch whenever necessary. Your database will have become a legacy database.

From now on you will have to write DDL migration scripts, for example using Flyway. And what happens to your entities then? Either you adapt them manually (thus doubling your workload), or you tell Hibernate to regenerate them for you (and what are the chances that the result will meet your expectations?). Either way, you lose.

Thus, as soon as you go into production, you will need hot patches. And they need to go into production very quickly. Since you have not prepared a smooth migration pipeline for production, you patch everything wildly. And then you no longer have time to do things properly. And you blame Hibernate, because anyone is always to blame except you...

Instead, everything could have been done quite differently from the very beginning. For example, by using round wheels.

Database first


The real "truth" in the schema of your database and the "sovereignty" over it lies inside the database. The scheme is defined only in the database itself and nowhere else, and each of the clients has a copy of this scheme, so it is completely advisable to impose compliance with the scheme and its integrity, do it directly in the database - where the information is stored.
This is old worn wisdom even. Primary and unique keys are good. Foreign keys are good. Checking for restrictions is good. Statements are good.

And that is not all. For example, if you use Oracle, you may want to specify:

  • What tablespace does your table live in?
  • What is its PCTFREE value?
  • What is the cache size of your sequence (behind the identity)?

Perhaps none of this matters in small systems, but you don't have to wait for the move into "big data" territory - you can start benefiting from vendor-provided storage optimizations, such as those mentioned above, much earlier. None of the ORMs I have seen (including jOOQ) provide access to the full set of DDL options you might want to use in your database. ORMs only offer some tools that help write DDL.

But in the end, a well-designed schema is written by hand in DDL. Any generated DDL is only an approximation of it.

What about the client model?


As mentioned above, on the client you will need a copy of your database schema - the client view. Needless to say, this client view must be kept in sync with the real model. What is the best way to achieve this? With a code generator.

All databases expose their meta information through SQL. Here is how to get all the tables from your database in various SQL dialects:

	-- H2, HSQLDB, MySQL, PostgreSQL, SQL Server
SELECT table_schema, table_name
FROM information_schema.tables
 
-- DB2
SELECT tabschema, tabname
FROM syscat.tables
 
-- Oracle
SELECT owner, table_name
FROM all_tables
 
-- SQLite
SELECT name
FROM sqlite_master
 
-- Teradata
SELECT databasename, tablename
FROM dbc.tables

These queries (or similar ones, depending on whether you also have to consider views, materialized views, or table-valued functions) can also be executed through the DatabaseMetaData.getTables() call in JDBC, or with the jOOQ meta module.
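
For example, a JDBC version of the same lookup can look like this - a minimal sketch, assuming an H2 database with placeholder credentials; any JDBC driver behaves the same way:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListTables {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:~/bookstore", "sa", "")) {
            DatabaseMetaData meta = c.getMetaData();

            // The vendor-neutral equivalent of the dialect-specific queries above
            try (ResultSet rs = meta.getTables(null, null, "%", new String[] { "TABLE" })) {
                while (rs.next())
                    System.out.println(rs.getString("TABLE_SCHEM") + "."
                                     + rs.getString("TABLE_NAME"));
            }
        }
    }
}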

From the results of such queries, it is relatively easy to generate any client-side view of your database model, whatever technology your client uses.

  • If you use JDBC or Spring, you can create a set of string constants
  • If you use JPA, you can generate the entities themselves
  • If you use jOOQ, you can generate the jOOQ meta-model

Depending on how many features your client API offers (e.g., jOOQ or JPA), the generated meta-model can be truly rich and complete. Take, for example, the implicit join feature that appeared in jOOQ 3.11, which relies on generated meta-information about the foreign key relationships between your tables.
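
As a sketch of what such an implicit join looks like, assume generated classes for a BOOK table with a foreign key to AUTHOR (the table and the generated package com.example.generated are hypothetical), and the same configuration as in the first example:

import static com.example.generated.Tables.BOOK;

import org.jooq.Result;
import org.jooq.impl.DSL;

// BOOK.author() navigates the generated foreign key path from BOOK to
// AUTHOR - no explicit JOIN is needed, because the generator derived this
// method from the foreign key meta-information in the database.
Result<?> books = DSL.using(configuration)
    .select(BOOK.author().FIRST_NAME, BOOK.author().LAST_NAME, BOOK.TITLE)
    .from(BOOK)
    .orderBy(BOOK.TITLE)
    .fetch();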

Now any database increment automatically leads to updated client code. Imagine, for example:

ALTER TABLE book RENAME COLUMN title TO book_title;

Would you really want to do this job twice? No way. Just commit the DDL, run it through your build pipeline, and get the updated entity:

@Entity
@Table(name = "book", indexes = {
 
  // Note that the index column changed, too
  @Index(name = "i_book_title", columnList = "book_title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
 
  @Column(name = "book_title")
  String bookTitle;
}

Or an updated jOOQ class. Most DDL changes affect not only the syntax but also the semantics. That is why it can be convenient to see in compiled code which code will be (or may be) affected by your database increment.
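
For example (a sketch reusing the hypothetical generated BOOK class from above): after the rename and regeneration, the old column reference simply stops compiling, and the compiler points you at every affected call site.

// Before the rename, client code compiled against the generated column:
//
//     DSL.using(configuration).select(BOOK.TITLE).from(BOOK).fetch();
//
// After ALTER TABLE and regeneration, BOOK.TITLE no longer exists, so the
// line above becomes a compile error, and every call site must be updated:
DSL.using(configuration)
   .select(BOOK.BOOK_TITLE)
   .from(BOOK)
   .fetch();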

The only truth


No matter what technology you use, there is always one model that is the single source of truth for a given subsystem - or at least we should strive for that, and avoid the enterprise-wide confusion where the "truth" is everywhere and nowhere at once. Everything can be much simpler. If you are just exchanging XML files with some other system, just use XSD. Take a look at the INFORMATION_SCHEMA meta-model from jOOQ in XML form:
https://www.jooq.org/xsd/jooq-meta-3.10.0.xsd

  • XSD is well understood
  • XSD validates XML content
  • XSD can be versioned and evolved while maintaining backwards compatibility
  • XSD can be translated into Java code with XJC

The last point is important. When communicating with an external system via XML messages, we want to be sure our messages are valid. This is very easy to achieve with JAXB, XJC, and XSD. It would be sheer madness to expect that, with a "Java first" approach, where messages are designed as Java objects, those objects could somehow be sensibly mapped to XML and sent to another system for consumption. XML generated that way would be of very poor quality, undocumented, and hard to evolve. If there were a service-level agreement (SLA) on such an interface, we would break it immediately.
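
A minimal sketch of such validation, assuming a hypothetical messages.xsd and a Message class generated from it by XJC (using the javax.xml.bind API, i.e., JAXB as bundled before Java 11):

import java.io.File;

import javax.xml.XMLConstants;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class ValidatedImport {
    public static void main(String[] args) throws Exception {
        // Compile the XSD - the single source of truth for the message format
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new File("messages.xsd"));

        // Message is the (hypothetical) class generated from messages.xsd by XJC
        JAXBContext ctx = JAXBContext.newInstance(Message.class);
        Unmarshaller unmarshaller = ctx.createUnmarshaller();
        unmarshaller.setSchema(schema); // reject any message violating the XSD

        Message message = (Message) unmarshaller.unmarshal(new File("incoming.xml"));
        System.out.println(message);
    }
}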

Honestly, this is exactly what happens all the time with JSON APIs, but that is another story; I will rant about it some other time...

Databases: it's the same thing


When working with databases, you realize that they are all basically similar. The database owns its data and must manage its schema. Any modifications to the schema must be implemented directly in DDL, so that the single source of truth is updated.

When the source of truth has been updated, all clients must update their copies of the model as well. Some clients may be written in Java using jOOQ and Hibernate, or JDBC (or all at once). Other clients may be written in Perl (we can only wish them luck), others still in C#. It does not matter. The main model is in the database. Models generated by ORMs are usually of poor quality, poorly documented, and hard to evolve.

So don't make this mistake. Don't make mistakes from the very beginning. Work database first. Build a deployment pipeline that can be automated. Enable code generators so that copying your database model and dumping it onto your clients is convenient. And stop worrying about code generators. They are good. With them you will become more productive. All it takes is a little time to set them up at the very start - and you get years of increased productivity over the history of your project.

And with that - thank you.

Explanation


For clarity: this article in no way advocates bending your entire system (i.e., your domain model, business logic, etc., etc.) to fit your database model. What I am saying is that client code interacting with the database should act on the basis of the database model, rather than reproducing the database model in a "first class" status of its own. This logic usually lives in the data access layer of your client.

In two-tier architectures, which do survive in places, such a system model may be the only option. But in most systems, I see the data access layer as a "subsystem" that encapsulates the database model.

Exceptions


There are exceptions to every rule, and I have already said that the database-first approach with source code generation can sometimes be inappropriate. Here are a couple of such exceptions (there are probably others):

  • When the schema is unknown and must be discovered. For example, you provide a tool that helps users navigate any schema. Phew - no code generation there. But still, the database comes first.
  • When the schema must be generated on the fly to solve a particular problem. This sounds like a slightly fanciful variant of the entity attribute value pattern, i.e., you don't really have a well-defined schema. In that case, it is often not even certain that an RDBMS is the right choice for you.

Exceptions are by nature exceptional. In most cases involving an RDBMS, the schema is known in advance, lives inside the RDBMS, and is the single source of "truth", with all clients obtaining copies derived from it. Ideally, a code generator does that work for you.
