Book "Databases. Reliability Engineering"

Hello, habrozhiteli! A real revolution has taken place in IT: infrastructure is now treated as code. This shift creates not only new challenges but also new opportunities to keep databases up and running. The authors have prepared this practical guide for anyone who wishes to join the community of modern database reliability engineers (DBRE).

In this book:

  • data storage requirements and risk management;
  • building and evolving an architecture that provides transparent database support;
  • streamlining the release management process;
  • data storage, indexing, and replication;
  • determining the characteristics of a data store and choosing the best ways to use it;
  • exploring architectural components and building architectures oriented toward big data processing.

Who is this book for?
This practical guide is addressed to anyone who wants to join the community of modern database reliability engineers (DBRE): database administrators, systems and operations engineers familiar with Linux/Unix, and developers who build and run services on top of data stores.

Publication structure
Throughout the book, the abbreviations DBRE (database reliability engineer) and RE (reliability engineer) are used. Below is a brief overview of the chapters.
Chapter 1 introduces database reliability engineering: the guiding principles of the DBRE role and how it differs from the traditional database administrator.

Chapter 2 covers service-level management: defining service-level objectives (SLOs) and the indicators used to track them.

Chapter 3 is devoted to risk management: assessing and mitigating the risks to your data and services.

Chapter 4 discusses operational visibility: the metrics and monitoring needed to understand how data stores behave.

Chapters 5 and 6 cover infrastructure engineering and infrastructure management: designing, building, and managing the distributed infrastructure in which databases run.

Chapter 7 deals with backup and recovery, one of the most critical responsibilities of the DBRE; an excerpt from this chapter is given below.

Chapter 8 looks at release management: testing, integrating, and deploying changes to databases and SQL.

Chapter 9 is about the security of data and data stores.

Chapter 10 covers data storage, indexing, and replication.

Chapter 11 is a field guide to data stores: determining the characteristics of a data store and choosing the best ways to use it.

Chapter 12 surveys common data architectures and pipelines, including architectures oriented toward big data processing.

Finally, Chapter 13 discusses how to make the case for DBRE and build a culture of database reliability in your organization.

Backup & Restore


In chapters 5 and 6, we focused on designing and managing infrastructure. This means that you have a good idea of how to create and deploy the distributed infrastructures in which databases run, as well as how to manage them. We looked at methods for quickly adding new nodes to increase capacity or replace a failed node. Now it's time to discuss the most important thing: backing up and restoring data.

Let's face it: everyone considers backup and restore boring, tedious work. For most people, these procedures are the epitome of routine. They tend to be handed off to junior engineers, external contractors, and third-party tools. And we have all had to deal with truly awful backup software. We sympathize with you, honestly.

Nevertheless, this is one of the most important processes in your work. Moving critical data between nodes and data centers and into long-term archives is the constant movement of your business's most valuable asset: its information. We strongly recommend that you do not treat backup and recovery as second-class operations, but as VIP operations. Everyone should not only understand the goals of data recovery, but also be familiar with how it works and how it is monitored. The DevOps philosophy assumes that everyone should be able to write code and deploy it to a working production system. We encourage every engineer to take part in critical data recovery processes at least once.

We create and store copies of data - backups and archives - as a means of meeting a specific need. They are needed for recovery. Sometimes recovery is a pleasant and leisurely affair, for example, when creating an environment for auditors or setting up an alternative environment. But most often it is required to quickly replace failed nodes or increase the capacity of existing clusters.

Today, in distributed environments, we face new challenges in data backup and recovery. As before, most local data sets are of a manageable size, a few terabytes at most. The difference is that each local data set is only part of a much larger distributed one. Recovering a single node is a relatively manageable process, but maintaining state across a cluster is a harder task.

Basic principles


We start by discussing the basic principles of data backup and recovery. To an experienced database specialist or systems engineer, some of them may seem elementary. If so, feel free to skim the next few pages.

Physical or logical?


A physical backup copies the actual files in which the data is stored. This means the backup is in the database's own on-disk format, and the database usually maintains a set of metadata describing which files exist and which database structures they contain. If you expect another database instance to be able to use the copied files, you will also need to back up the associated metadata that the database relies on, so that the backup is portable.

When creating a logical backup, the data is exported from the database into a format that can, in theory, be loaded into any system. Metadata is usually saved as well, but most likely it will only be current as of the moment the backup was taken. One example is an export of all the insert statements needed to populate an empty database; another is an export to JSON. As a result, logical backups usually take a long time, because they are not a physical copy of files but a row-by-row extraction of data. Similarly, restoring them incurs all the usual database overhead, such as locking and the creation of redo and undo logs.
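To make the distinction concrete, here is a minimal sketch using SQLite from the Python standard library (the file names and the table are invented for illustration): a physical backup copies the database file in the engine's own format, while a logical backup exports the contents as portable SQL statements, row by row.

    import shutil
    import sqlite3

    # Create a tiny example database (stand-in for a real data store).
    conn = sqlite3.connect("orders.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.execute("INSERT INTO orders (item) VALUES ('book')")
    conn.commit()
    conn.close()

    # Physical backup: copy the underlying file in the engine's own format.
    # The copy is only usable by an engine that understands this file format.
    shutil.copyfile("orders.db", "orders_physical.bak")

    # Logical backup: export the contents as SQL statements that any compatible
    # engine could replay; slower, because rows are extracted one by one.
    conn = sqlite3.connect("orders.db")
    with open("orders_logical.sql", "w") as f:
        for statement in conn.iterdump():   # yields CREATE/INSERT statements
            f.write(statement + "\n")
    conn.close()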

A great example of this distinction is the difference between statement-based and row-based replication. In many relational databases, statement-based replication means that, once transactions commit, a log of the data manipulation language (DML) statements (insert, update, replace, delete) is written; these statements are shipped to the replicas, where they are replayed. The other approach is row-based replication, also known as change data capture (CDC), which ships the changed rows themselves.
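As an illustration only (not any particular engine's replication protocol), the sketch below contrasts the two formats on toy in-memory databases: statement-based replication re-executes the DML statement on the replica, while row-based replication ships the concrete row images that changed.

    import sqlite3

    def new_node():
        # In-memory stand-in for a database node.
        c = sqlite3.connect(":memory:")
        c.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
        c.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 200.0)])
        c.commit()
        return c

    primary, replica = new_node(), new_node()

    # Statement-based: the DML statement itself is logged and replayed.
    stmt = "UPDATE accounts SET balance = balance * 1.10 WHERE id = 1"
    primary.execute(stmt)
    replica.execute(stmt)          # replica re-executes the same statement

    # Row-based (change data capture): the changed row images are shipped.
    changed_rows = primary.execute(
        "SELECT id, balance FROM accounts WHERE id = 1").fetchall()
    replica.executemany(
        "UPDATE accounts SET balance = ? WHERE id = ?",
        [(balance, row_id) for row_id, balance in changed_rows])

    primary.commit()
    replica.commit()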

Offline or online?


An offline (or cold) backup is one taken while the database instance that uses the files is shut down. This lets you copy the files quickly without worrying about capturing a consistent state while other processes read and write data. This is the ideal, but rare, situation.

During an online (or hot) backup, you still copy all the files, but there is extra complexity: you need a consistent snapshot of the data as of some point in time during the backup. In addition, if production traffic is hitting the database while the backup runs, you must be careful not to saturate the I/O capacity of the storage layer. And even with the load throttled, you may find that the mechanisms used to maintain consistency introduce unacceptable latency into the application.
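For example, SQLite exposes an online-backup API through the Python standard library; the sketch below (file names are illustrative, continuing the earlier example) copies a live database to a backup file in small batches of pages so that readers and writers are not blocked for the whole duration.

    import sqlite3

    src = sqlite3.connect("orders.db")        # live database, writers still active
    dst = sqlite3.connect("orders_hot.bak")

    # Copy a consistent snapshot while the source stays online. Copying in
    # small page batches (pages=...) limits how long other connections are
    # held up, at the cost of a longer overall backup.
    src.backup(dst, pages=100)

    src.close()
    dst.close()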

Full, incremental and differential


Having a full backup, no matter how it is created, means that the local data set has been backed up in its entirety. For small data sets this is fairly routine. For 10 TB it can take an enormous amount of time.

Differential backups copy only the data that has changed since the last full backup. In practice, though, more data is usually backed up than has actually changed, because data is organized in fixed-size structures called pages. A page is, for example, 16 or 64 KB and contains many rows of data. A differential backup copies every page on which any data has changed, so with large page sizes the backup can be considerably larger than the changed data alone.

An incremental backup is similar to a differential one, except that the reference point for changed data is the last backup of any kind, whether incremental or full. Consequently, when restoring from an incremental backup you may need to restore the last full backup plus one or more incremental backups to reach the current point.
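A minimal file-level sketch of the difference (real engines track changes at page level, not by file modification time, and the directory path and intervals here are invented): a differential backup takes everything changed since the last full backup, while an incremental backup takes everything changed since the last backup of any kind.

    import os
    import time

    def changed_since(data_dir, reference_time):
        """Return files modified after the given reference timestamp."""
        return [
            os.path.join(root, name)
            for root, _dirs, files in os.walk(data_dir)
            for name in files
            if os.path.getmtime(os.path.join(root, name)) > reference_time
        ]

    last_full = time.time() - 7 * 24 * 3600        # e.g., weekly full backup
    last_incremental = time.time() - 24 * 3600     # e.g., nightly incremental

    # Differential: everything changed since the last FULL backup.
    differential_set = changed_since("/var/lib/exampledb", last_full)

    # Incremental: everything changed since the last backup of ANY kind.
    incremental_set = changed_since("/var/lib/exampledb", last_incremental)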

Knowing this, we will discuss several points that should be considered when choosing an effective backup and data recovery strategy.

Data Recovery Considerations


When choosing an effective strategy, you should first revisit your service-level objectives (SLOs), which were discussed in Chapter 2, in particular the availability and durability indicators. Whatever strategy you ultimately choose, it must allow data to be recovered within the predefined limits on downtime. And your backups will have to run frequently and quickly enough to meet your durability requirements.

If you back up once a day and keep the transaction logs written between backups only on node-local storage, you can easily lose all of those transactions up to the next backup.
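As a rough worked example (the numbers are invented), the worst-case data loss for a node whose transaction logs live only on local storage is bounded by the backup interval, whereas shipping logs off the node shrinks it to the shipping interval:

    # Worst-case data loss (recovery point) under two policies, in minutes.
    backup_interval_min = 24 * 60      # daily full backup
    log_ship_interval_min = 5          # logs copied off the node every 5 minutes

    # Logs kept only on node-local storage: losing the node loses everything
    # written since the last backup.
    worst_case_local_only = backup_interval_min            # up to 24 hours

    # Logs shipped off the node: only the not-yet-shipped tail can be lost.
    worst_case_with_log_shipping = log_ship_interval_min   # about 5 minutes

    print(worst_case_local_only, worst_case_with_log_shipping)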

In addition, you need to consider how the data set functions within the broader ecosystem. For example, your orders may be stored in a relational database, where everything is recorded transactionally and can therefore be restored consistently relative to the other data in that database. However, once an order is created, a workflow may be triggered by an event stored in a queueing system or a key-value store. Those systems may guarantee integrity only loosely, or may even be ephemeral, relying on the relational database for reconciliation or recovery when needed. How do you account for these workflows during recovery?

If you are dealing with an environment under rapid development, the data in a backup may have been written and used by one version of the application, while a different version is running after the restore. How will the application interact with the outdated data? If the data is versioned, the application can take this into account, but you need to be aware of the possibility and be prepared for such cases. Otherwise, the application may logically corrupt the data, leading to even bigger problems down the road.

All of these nuances, and many others that cannot be predicted, must be taken into account when planning data recovery. As stated in Chapter 3, it is impossible to prepare for every situation, but it is very important to try. Data recovery is one of the most important duties of a database reliability engineer, so your recovery plan should be as broad as possible and cover as many potential problems as possible.

Recovery Scenarios


With all of the above in mind, we will discuss the types of incidents and operations that may require data recovery, so that every need can be planned for. First, all scenarios can be divided into planned and unplanned. If you treat data recovery only as a tool for resolving emergencies, you limit your team's toolkit to firefighting and disaster simulations. Conversely, if data recovery is part of everyday activities, you can expect a higher degree of awareness and more successful resolution of emergencies. You will also have more data with which to judge whether the recovery strategy supports your SLOs. With the scenario run several times a day, it becomes easier to collect a sample set, including outlier values, that can be used with reasonable confidence for planning.

Scheduled Recovery Scenarios


Into which everyday tasks can recovery processes be integrated? Here is the list we have most often encountered at different sites:

  • building new nodes and clusters in the production environment;
  • building the various other environments;
  • running extract, transform, and load (ETL) jobs and pipeline stages for downstream data stores;
  • operational and performance testing.

When performing these operations, be sure to instrument the recovery process as part of your operational visibility stack. Consider the following indicators.

  • Time. How long does each component, and the process as a whole, take? Copying? Unpacking? Log replay? Verification?
  • Size. How much space does the backup take, compressed and uncompressed?
  • Throughput. How fast do the data move during copy and restore?

This information will help you spot bandwidth and capacity problems early, which in turn keeps the recovery process stable and predictable.
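One way to capture these indicators is to time each phase of a scripted restore and record the sizes involved; the sketch below is a generic outline (the phase functions and the metrics dictionary are placeholders, not a real tool):

    import time

    def timed(phase_name, fn, metrics):
        """Run one phase of the restore and record how long it took."""
        start = time.monotonic()
        result = fn()
        metrics[f"{phase_name}_seconds"] = round(time.monotonic() - start, 2)
        return result

    def run_restore(metrics):
        # Placeholder phases; in practice these would call your backup tooling.
        timed("copy",    lambda: time.sleep(0.1), metrics)   # fetch backup from storage
        timed("unpack",  lambda: time.sleep(0.1), metrics)   # decompress / unarchive
        timed("replay",  lambda: time.sleep(0.1), metrics)   # apply transaction logs
        timed("verify",  lambda: time.sleep(0.1), metrics)   # run validation checks
        metrics["backup_size_bytes"] = 0                     # fill in from the artifact
        return metrics

    print(run_restore({}))   # ship the resulting dict to your monitoring system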

New Nodes and Clusters in the Production Environment

Whether or not your databases are part of an immutable infrastructure, there are opportunities to rebuild nodes regularly, exercising the recovery procedures in the process.

Databases are rarely included in automatic scaling because of the time it can take to bootstrap a new node and bring it into a cluster. Nevertheless, nothing prevents the team from scheduling the regular addition of new nodes to the cluster in order to exercise these processes. Chaos Monkey ( http://bit.ly/2zy1qsE ), a tool developed by Netflix that randomly shuts down systems, lets you take this further, testing the entire chain of monitoring, alerting, triage, and recovery. If you have not already done so, add this to the checklist of procedures your operations team performs at regular intervals, so that all employees become familiar with them. These exercises let you test not only full and incremental restores, but also re-enabling replication and bringing the node back into service.

Creating Different Environments

You will inevitably build development, integration, and operational testing environments, as well as environments for demos and other purposes. Some of them require full data sets, so node recovery and full cluster recovery need to be implemented there. Others have different requirements, such as partial data for feature testing, or data scrubbed to protect user privacy. These let you exercise point-in-time recovery and recovery of specific objects, which differ substantially from a standard full restore and are useful for repairing damage caused by operator actions and application errors. By creating APIs that provide recovery at the object level and at a specific point in time, you make it easier to automate these processes and to familiarize staff with them.
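The interface of such an API might look something like the following sketch (the class, method names, and signatures are hypothetical, intended only to show the two entry points: object-level restore and point-in-time restore):

    from datetime import datetime

    class RecoveryAPI:
        """Hypothetical facade over backup tooling; bodies are left as stubs."""

        def restore_object(self, database: str, table: str,
                           target_env: str) -> None:
            """Restore a single object (table) into the given environment."""
            raise NotImplementedError   # e.g., extract the table from the latest backup

        def restore_to_point_in_time(self, database: str, when: datetime,
                                     target_env: str) -> None:
            """Restore a database to its state at a specific moment."""
            raise NotImplementedError   # e.g., full restore plus log replay up to `when`

    # Example call a test-environment build script might make:
    # RecoveryAPI().restore_object("orders_db", "orders", target_env="staging")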

ETL and Pipeline Processes for Downstream Data Stores

As with building environments, snapshot-based and object-level recovery processes and APIs can be put to good use when moving data from production databases into analytics pipelines and downstream data stores.

Operational Testing

Various test scenarios require copies of the data. Some tests, such as capacity and load tests, need a full data set, which is a good fit for a full restore. Functional testing can often use smaller data sets, which makes point-in-time and object-level recovery a good fit.

Testing data recovery can itself be a continuous operation. In addition to using recovery processes in everyday scenarios, you can set up continuous restore operations. This automates testing and validation so that any problems, for example a silently broken backup job, are identified quickly. When it comes to implementing this, many people ask how to verify that a recovery succeeded.

When creating a backup, you can capture a good deal of data that can later be used for validation, for example:

  • the most recent identifier in an auto-increment sequence;
  • row counts for objects;
  • checksums for insert-only subsets of data that can therefore be treated as immutable;
  • checksums of schema definition files.

As with any testing, the approach should be tiered. Some tests either succeed or fail quickly; these should form the first tier. Examples are comparing checksums of metadata and object definitions, successfully starting the database instance, and successfully connecting to the replication stream. Operations that take longer, such as computing data checksums and counting rows in tables, should run later in the validation process.
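A sketch of such tiered validation for a restored copy (SQLite stands in for the real engine, and the table names and expected values are illustrative): the fast checks run first and fail the restore immediately; the slower checks run afterwards.

    import hashlib
    import sqlite3

    def schema_checksum(conn):
        """Fast check: checksum of object definitions (schema only)."""
        ddl = "\n".join(row[0] or "" for row in
                        conn.execute("SELECT sql FROM sqlite_master ORDER BY name"))
        return hashlib.sha256(ddl.encode()).hexdigest()

    def row_count(conn, table):
        """Slow check: row count for a specific object."""
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    def validate_restore(restored_path, expected):
        conn = sqlite3.connect(restored_path)
        # Tier 1: cheap checks that fail fast.
        if schema_checksum(conn) != expected["schema_checksum"]:
            return False
        # Tier 2: more expensive checks (row counts, data checksums).
        for table, count in expected["row_counts"].items():
            if row_count(conn, table) != count:
                return False
        return True

    # `expected` would be captured at backup time, e.g.:
    # expected = {"schema_checksum": "...", "row_counts": {"orders": 1042}}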

Unplanned Recovery Scenarios


If you take advantage of all the planned, everyday scenarios listed above, the data recovery process will be well debugged, documented, rehearsed, and reasonably free of errors and surprises. As a result, unplanned scenarios are rarely as frightening as they might otherwise be; the team should see little difference between a planned and an unplanned recovery. Let us list, and then consider in detail, the situations that may require us to run recovery processes:

  • user error;
  • application error;
  • availability of infrastructure services;
  • operating system errors and hardware errors;
  • hardware failures;
  • data center failures.

User Error

Ideally, user errors should be rare. By building guardrails for engineers, you can prevent many of these situations. However, there is always the possibility that an operator will accidentally cause damage. A classic example is the forgotten WHERE clause on an UPDATE or DELETE issued from a database client. Another is a data-cleanup script run not in a test environment but against the production system. Often the operation itself is correct but is performed at the wrong time or against the wrong hosts. All of this falls under user error. Such mistakes are often identified and corrected immediately, but sometimes the consequences go unnoticed for days or weeks.

Application errors

Application errors are the worst of the scenarios discussed here because they can be so insidious. Applications constantly change how they interact with data stores. Many applications also manage referential integrity and external pointers to resources such as files or third-party identifiers. It is frightening to imagine a change that corrupts data, deletes it, or inserts incorrect data in ways that can go unnoticed for a long time.

Infrastructure Services

In Chapter 6, we introduced the magic of infrastructure management services. Unfortunately, these systems can be as destructive as they are useful: one edited file, a pointer to the wrong environment, or an incorrect configuration setting can have large-scale consequences.

OS errors and hardware errors

Operating systems and the hardware they interact with are also systems built by people, and so they too contain errors that can have unexpected consequences, often hidden behind undocumented or poorly understood configurations. In the context of data recovery, this applies in particular to the path data takes from the database through OS caches to file systems, controllers, and ultimately disks. Data corruption or loss happens much more often than we think. Unfortunately, our trust in and reliance on these mechanisms breeds an expectation of data integrity rather than skepticism about it.




Hardware failures

Hardware components simply fail, and in distributed systems this happens regularly. You will constantly encounter failures of disks, memory, CPUs, controllers, and network devices. The consequence can be failed nodes or slow nodes, either of which can make the system unusable. Shared components, such as network devices, can affect entire clusters, making them unavailable or splitting them into smaller clusters that are unaware a network partition has occurred. This can quickly produce significant divergence in the data, which then has to be merged and reconciled.

Data Center Failures

Sometimes hardware problems at the network level lead to failures across a data center. It happens that overloaded storage backplanes cause cascading failures, as happened to Amazon Web Services in 2012 ( http://bit.ly/2zxSpzR ). Sometimes hurricanes, earthquakes, and other disasters take out entire data centers. The subsequent recovery will test even the most robust recovery strategies.

Scenario Scope


Having enumerated the planned and unplanned scenarios that may call for recovery, we add one more dimension to these events: their scope. This will be useful when choosing the most appropriate way to respond. Consider the following options:

  • failure localized within a single node;
  • failure of the entire cluster;
  • a failure that affects an entire data center or multiple clusters.

In the event of a local, or single-node, failure, recovery is limited to one host. You may be adding a new node to the cluster to increase capacity or to replace a failed node. Or the system may use rolling rebuilds, with restores performed node by node. Either way, this is a local recovery.

At the cluster level, the need for recovery applies to every member of the cluster. Perhaps a destructive change or deletion of data cascaded to all nodes, or a new cluster must be brought up for capacity testing.

A failure at the scale of a data center or of several clusters means restoring all the data hosted in that physical location, or across the entire failure domain. This may be caused by a failure of shared storage or by a catastrophic data center outage. Such a recovery may also be required for the planned deployment of a new secondary site.

In addition to the scope, there is a dataset scope. Here you can list three possible options:

  • one object;
  • several objects;
  • database metadata.

At the scale of a single object, only that particular object needs data recovery, in part or in full. The case discussed earlier, in which a DELETE removed more data than intended, is a failure scoped to a single object. At the scale of multiple objects, several or possibly all objects in a particular database are affected; this can happen when an application corrupts data, an upgrade goes wrong, or a shard migration fails. Finally, there are failures at the scale of the database metadata, when the data stored in the database is intact but the metadata that makes it usable is lost: user accounts, security privileges, or the mapping of database structures to files in the OS.

Scenario Consequences


It is important not only to identify the scenario requiring recovery and the scope of the failure, but also to assess the possible consequences, because they will weigh heavily when choosing a recovery approach. If the data loss does not affect SLOs, you can take a methodical, slower approach that minimizes the risk of widening the impact. More global incidents that violate SLOs call for restoring service quickly first and deferring the long-term cleanup. All incidents can be placed into one of the following three categories.

  • SLO impacted: the application is down and most users are affected.
  • SLO threatened: some users are affected.
  • Functions that do not threaten the SLO are affected.

About the authors


Laine Campbell is a Senior Director of Engineering at Fastly. She was also the founder and CEO of PalominoDB/Blackbird, a database consultancy whose clients included Obama for America, Activision Call of Duty, Adobe EchoSign, Technorati, Livejournal, and Zendesk. She has 18 years of experience operating databases and scalable distributed systems.

Charity Majors is the CEO and co-founder of honeycomb.io. Combining the accuracy of log aggregators, the speed of time-series metrics, and the flexibility of APM (application performance management) tools, Honeycomb is the world's first truly next-generation analytics service. Charity previously ran operations at Parse/Facebook, managing a huge fleet of MongoDB replica sets as well as Redis, Cassandra, and MySQL. She also worked closely with the RocksDB team at Facebook on developing and deploying the world's first Mongo + Rocks installation using the storage plug-in API.

» More information on the book can be found on the publisher's website
» Contents
» Excerpt

For khabrozhiteli: a 25% discount with the coupon code Databases.

Upon payment of the paper version of the book, an electronic book is sent by e-mail.
