Why NoSQL Databases Are a Bad Solution for Modern Applications

Hello, Habr.

Today we bring you a translation of an article from the MemSQL blog. It is, at heart, promotional (dedicated to the advantages of MemSQL and updated as of early January 2020), but we decided to translate it in abridged form anyway, since it explains in detail why we have so far chosen not to publish anything on MongoDB, Cassandra, or other non-relational databases. Perhaps we were right to limit ourselves to the highly successful book "MySQL to the Maximum".

The time has come to acknowledge a long-known truth: NoSQL databases are unsuitable for many of the practical problems facing modern applications, and their time has passed.

NoSQL databases appeared because the traditional databases of the day could not cope with the scale of the tasks at hand. This new generation of data services, which came into circulation more than a decade ago, solved many problems common across the Web and made it possible to work with rapidly growing data sets. NoSQL also offered a new, cost-effective method for cold storage and occasional batch access to petabyte-scale data. However, in the hasty rush to answer the challenges of big data and large numbers of concurrent users, the NoSQL paradigm gave up some of the core properties of traditional databases - the very ones that made them so productive and easy to use.

Perhaps exposing these trade-offs is NoSQL's greatest contribution to the database world. It provoked an evolution that combined the best of big-data processing with the structure and flexibility of the proven relational model, producing scalable relational databases.

Relational databases have evolved, giving rise to an entirely new generation of systems that can cope with virtually any workload and satisfy the scalability, reliability, and availability requirements of modern applications. These workloads range from traditional ones, such as transactional applications and business analytics, to newer ones, such as multi-tenancy and operational analytics. The rise of new databases - Google Spanner, Azure Data Warehouse, and MemSQL in particular - has shown that, in most cases, relational databases are easier to use and, as a rule, perform better than NoSQL systems.

I know these are contentious claims, and you could easily dismiss my point of view as biased. But let me walk through the history, the architecture, and the application side of these databases - and then judge for yourself.

The Rise of NoSQL


NoSQL came into its own in the late 2000s, although its history began much earlier. These systems were developed mainly to solve the scaling problems inherent in the databases that existed at the time. It was obvious that horizontal scaling was the more economical model for building large systems. The largest of them - the search engines and email services of Google, Facebook, Microsoft, and Yahoo - could only scale that way.

Personally, I first appreciated the full value of horizontal scaling when I read James Hamilton's article on designing and deploying Internet-scale services. The application tier was the easy part: stateless systems scale well. The storage tier is another story. Databases are, by definition, stateful, and it is genuinely hard to provide guarantees (meaning ACID) about that state across an entire distributed system. So new layers were built on top of existing databases (MySQL, SQL Server, and so on) to create a distributed storage tier.

I ran into several situations of this kind while working as a product manager on the SQL Server team at Microsoft. The first involved an internal Microsoft product: the company had built Webstore, a sharding layer on top of SQL Server used by Hotmail and its related services. In fact, it was Webstore that spurred the product that became the prototype of today's Azure SQL Database. Webstore was somewhat clumsy and lacked much key functionality, but it worked, giving Microsoft both scaling to any desired data volume and high availability. Building and maintaining Webstore, however, took an entire team of engineers.

In the mid-2000s, MySpace used a large fleet of SQL Server instances to run its rapidly growing site. The user base grew so fast that new SQL Server instances had to be brought up daily. Operating all of those servers, and running queries across all of them, turned out to be so enormously complex that it, too, occupied an army of engineers.
Similar stories played out at Facebook and other companies, as every fast-growing technology giant ran into the scaling problem.

It became clear that, at such rates of growth and usage, these new digital services needed a new solution for ingesting, managing, and surfacing data. Ideally, it would be a solution that natively exposed a single interface yet scaled horizontally across many machines, with built-in high availability.

As a result, the large cloud-scale services (Google, Facebook, Yahoo, Microsoft, and others) built their own bespoke systems to meet their scaling needs. These systems differed, but they shared common ideas. Soon, open-source systems built on the same ideas began to multiply, and so the NoSQL movement arose.

To solve web-scale problems, NoSQL diverged from traditional databases in several key respects. Let's look at why those particular decisions were made.

Performance and Eventual Consistency


There are two architectural approaches: ACID and BASE.

ACID stands for "Atomicity, Consistency, Isolation, Durability". This paradigm covers the guarantees typically provided by relational databases. Under ACID, a write operation must wait until the data reaches disk before the client is told that the operation succeeded. Moreover, if you truly care about durability (that is, about not losing data), you configure the database so that the write also travels over the network to another machine and is written to disk there as well. You get a guarantee that exactly what you wrote is always in the data, but you sacrifice some write speed.
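
As a minimal sketch of what those durability knobs look like in practice (these are MySQL/InnoDB settings; the accounts table is invented for the example, and the second setting requires MySQL's semi-synchronous replication plugin):

    -- Make COMMIT wait until the redo log is flushed to disk (full durability).
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;
    -- With the semi-synchronous replication plugin loaded, COMMIT also waits
    -- until at least one replica acknowledges receipt of the transaction.
    SET GLOBAL rpl_semi_sync_master_enabled = 1;

    START TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;  -- returns only after the change is durable (and replicated)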

The BASE architecture typical of NoSQL systems stands for "Basically Available, Soft State, Eventually Consistent". Eventual consistency buys faster writes, because the application does not have to wait for confirmation that the write has been made durable. As soon as the data store has accepted the write - even before the data has been durably stored on its own disk or on another machine's - the database can tell the application that the write succeeded, and the application can move on to the next operation. You win on performance, but you risk not seeing data you just wrote, or losing it entirely to some failure.
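
For contrast, a sketch of how a NoSQL store exposes this trade-off. In Cassandra (CQL shown here, being the closest of these dialects to SQL; the keyspace and table are invented for the example), the client chooses per session or per query how many replicas must acknowledge a write:

    -- cqlsh: acknowledge after a single replica accepts the write;
    -- the remaining replicas converge later (eventual consistency).
    CONSISTENCY ONE;
    INSERT INTO social.posts (id, author, body)
    VALUES (uuid(), 'alice', 'hello world');

    -- The stricter setting waits for a majority of replicas instead.
    CONSISTENCY QUORUM;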

Eventual consistency is a reasonable compromise in the pursuit of durability and availability. If your business depends on consumer engagement, any delay directly affects your revenue (and this applies equally to content, community, and commerce applications), so naturally you aim for the most responsive user interface possible. If your task is to scale to millions of concurrent users, bottlenecks are unacceptable. Adopting eventual consistency in your database architecture means you risk occasionally losing someone's post or comment - and in this class of application, that risk is acceptable.

At the other end of the durability-versus-risk spectrum sit financial applications. If you are making a transaction at an ATM, eventual consistency will certainly not do. The same goes for trading on an exchange. Here too users will tolerate only minimal delays (or none at all), but they will not accept a transaction failing to make it to disk.

So eventual consistency has its place, but it is by no means the only answer. Architects and developers of data systems must be able to choose the level of consistency they need, and that choice should be driven by the use case, not by the limitations of the platform.

Trying to Get By Without a Schema


It is not entirely clear why the NoSQL movement decided to abandon schemas. Yes, at the dawn of NoSQL it was hard to build a distributed metadata manager that could maintain schemas across an entire distributed system and support operations such as adding a column, so it is no surprise that schemas were absent from the earliest of these databases. But instead of finding a way to add schemas back later, they were dropped altogether. The argument of schema opponents goes like this: a schema makes the database less flexible; designing a good schema is hard and requires thinking everything through carefully in advance; and when the situation changes rapidly (as it did then and still does), who wants to imprison themselves in a schema?

But this is a fallacy.

True, the absence of a schema benefits the engineer whose job is to write data into the system. But it pushes the problems onto those who read the data - and readers usually outnumber the writing engineers by an order of magnitude, and often lack the context the data had at the moment it was written. It is the readers who typically derive value from the data, and they should face as few obstacles to working with it as possible.

Let me offer an analogy. Imagine librarians announcing that they are tired of the Dewey Decimal catalog and will now simply drop books into a big hole in the floor - after all, it greatly simplifies the librarian's job. Semi-structured data does have its place: sometimes you cannot anticipate the shape of some data, or the data itself is too sparse. But if you genuinely have no idea where a piece of data comes from or what it should look like, what use is it to you?

The truth is that there is always a schema. Data always means something to someone. But someone has to spend the time to embed that knowledge of its meaning into the platform so that other people can use the data afterward. If part of our data is well understood and another part changes rapidly, the second part goes into a semi-structured column, and later we decide which typed columns to extract from it. SQL Server and Oracle were doing this with XML 15 years ago. In MemSQL and some other modern databases, the same is done today with JSON. Document storage (and key-value access) should be features of modern databases, not the sole capability of a product.
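
A minimal sketch of this hybrid approach (the table and field names are invented; the JSON operators shown are the MySQL-style ones):

    -- Stable fields get typed columns; the fast-changing part lives in JSON.
    CREATE TABLE events (
        id         BIGINT PRIMARY KEY,
        user_id    BIGINT NOT NULL,
        created_at DATETIME NOT NULL,
        payload    JSON                -- the semi-structured part
    );

    -- Readers can still query into the semi-structured part...
    SELECT user_id, payload->>'$.campaign' AS campaign
    FROM events
    WHERE created_at > NOW() - INTERVAL 1 DAY;

    -- ...and once a field stabilizes, it can be promoted to a real column.
    ALTER TABLE events ADD COLUMN campaign VARCHAR(64);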

Non-SQL Query Syntax


This NoSQL design decision followed from abandoning the schema: if there is no schema, dropping SQL syntax seems natural too. Besides, a query processor is hard to write even for a single machine, and much harder still for a distributed system. Most importantly, if you are a developer who needs to launch a new application quickly, such a system looks easier.

MongoDB perfected the art of easy installation and use without prior experience. Yet the relational model turns out to be very powerful. You can get along fine with get and put functions if you never face anything harder than "select the object with id 2". But most real applications need to do much more. If you want an excellent article by an author who reached this conclusion (and who does not work on a data-storage product) after building two separate projects on MongoDB - read this one. It is a great example of where document databases hit their limits.

In any system beyond the most trivial, sooner or later you will need to query data along different lines than those you stored it by. Ironically, the relational model was invented in the 1960s to solve exactly this problem with the data stores of the day (IMS and CODASYL). The relational model, with its ability to join, proved to be the only sensible way to retrieve data. Yes, it is harder at first, but far easier than pulling all the data into your application and building the associations yourself. I watched customers try to do exactly that with NoSQL databases again and again, and it always led them into a mess.
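
To make the contrast concrete, a sketch (users and orders are invented tables): the database performs the join in one declarative statement, whereas a pure key-value interface forces the application to stitch records together by hand:

    -- One declarative statement; the optimizer picks the join strategy.
    SELECT u.name, o.total
    FROM users  AS u
    JOIN orders AS o ON o.user_id = u.id
    WHERE o.created_at >= '2020-01-01';

    -- The get/put equivalent, roughly, is application-side code:
    --   for each user: get(user_id); for each of their orders: get(order_id)
    -- i.e., N+1 round trips plus hand-written association logic.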

Many of these NoSQL systems achieved their primary goal. They provided a single interface to a data store that could scale across many machines, with built-in high availability. Yet although NoSQL kept making some progress, its adoption kept stalling.

There are several reasons for this. The key one is performance, particularly for analytic queries that must meet a quality-of-service agreement. Another is manageability: distributed systems are notoriously hard to operate. But nothing has held back the widespread adoption of NoSQL more than the need to retrain people. A great many specialists learned their craft and grew professionally in the world of relational databases. NoSQL has been trying to change that world for more than a decade and has achieved almost nothing: all the NoSQL companies combined hold only a few percent of a database market worth $50 billion.

Programmers clearly liked NoSQL, but data specialists (DBAs, data architects, analysts) moved into the NoSQL world reluctantly, because it seemed the only paradigm capable of solving real scaling problems. Moving, however, meant relearning new APIs and tools and building a new ecosystem, discarding years spent mastering proven approaches, patterns, and resources. What they wanted was to keep working in the familiar model while gaining the necessary scalability, without giving up durability, availability, and reliability.

Goodbye NoSQL


NoSQL databases arose so that engineers could cope with the scalability requirements of modern web applications and multi-tenant services. Given how hard those problems were, it is understandable that the first attempts to scale the storage tier forced customers into difficult compromises.

However, relational databases have evolved. Today they can handle almost any workload while meeting all the scalability, reliability, and availability requirements of modern applications.

Take, for example, a workload such as operational analytics. As companies everywhere recognize the value of a data-driven approach, they strive to put up-to-date data in their employees' hands. This calls for a new generation of analytical systems that can scale to hundreds of concurrent queries, answer queries quickly without prior aggregation, and ingest data at the same pace it is generated. On top of all this, you need to expose data to customers and partners, which means meeting service-level agreements (SLAs) on security, performance, and scalability that most modern data warehouses struggle with. This is a workload that neither legacy databases nor NoSQL systems can handle.
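
As an illustration of what such a query might look like (page_views is an invented event table), here is an ad-hoc aggregation run directly over raw, continuously ingested events, with no pre-built rollups:

    -- The last five minutes of raw events, aggregated on the fly.
    SELECT product_id,
           COUNT(*)        AS views,
           AVG(latency_ms) AS avg_latency
    FROM page_views
    WHERE event_time > NOW() - INTERVAL 5 MINUTE
    GROUP BY product_id
    ORDER BY views DESC
    LIMIT 10;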

The relational model has stood the test of time and keeps producing innovations, such as MemSQL's SingleStore. The old paradigm has also absorbed many new data types (search, spatial, semi-structured, and so on) along with the matching models that let them all coexist in a single system. There are no insurmountable obstacles facing the relational model and SQL query syntax. What is needed is simply a different data-store implementation - one that takes full advantage of a horizontally scalable (scale-out) architecture.

New databases such as MemSQL prove that, in most practical cases, relational databases are easier to use and generally perform better than NoSQL systems.

Thank you, NoSQL. You put the pressure on the database development community that pushed us to give a worthy answer to the challenges of the cloud world. It worked: relational databases evolved and began to meet modern requirements. We have you to thank.
