Resilience Engineering: Notes from the REDeploy Conference



While conferences around the world search for workable online formats, it is worth remembering how they (and all of us) lived in the pre-pandemic era. At the end of last year I attended REDeploy 2019, a conference dedicated to Resilience Engineering. For a long time I struggled with how to translate the term into Russian, until I found that a translation has long been in use in narrow circles. The next step would be to give a definition of resilience, but that is hard to do in one simple sentence. It also turned out that the topics raised six months ago are very relevant in our new reality.

First of all, it is important to understand that resilience engineering is a cross-disciplinary field that researches, formalizes, and shapes practices which increase the ability of complex socio-technical systems to prepare for unusual situations and accidents, to adapt to them, and to improve their own capacity to adapt.

For many years the mechanical picture of the world prevailed in software development: the belief that we can build software that works without failures, and that if an accident does occur, it has some root cause which can be fixed to prevent such errors from recurring; and since the number of errors is finite, we will eventually fix all the errors that lead to accidents (see the excellent article "Dev, Ops and Determinism" on this).

The same "engineering" approach is applied to how people interact during an accident: it is supposedly enough to create tools with which people can fix the problem (while avoiding mistakes of their own).

But the catch is that software keeps being updated and becomes complex, fragmented, and branched; the accidents that occur do not have one single cause (it may even lie outside the system); and the people communicating to resolve an accident can themselves make mistakes.

Thus the task is not to avoid errors and accidents in the system entirely (which is nearly impossible), but to prepare people and the system so that a potential accident has the least impact on the system, its users, and its creators.

Software development long stood apart from the other, "offline" engineering disciplines, where harm-reduction practices for accidents have been in use for a long time. And those practices relate more to people than to tooling and technical measures for preventing accidents.

The questions resilience engineering asks are thus as follows:
  1. Which cultural and social features of human interaction should be considered in order to better understand what can and cannot happen when people communicate during an accident? How can adaptation and communication be improved? And conversely, what makes the situation worse?
  2. What knowledge from other disciplines can we apply to make the system more flexible and stable in the event of an accident?
  3. How should training and teamwork be organized so that, when an accident happens, both the damage it causes and the stress on those resolving it are minimal?
  4. What technical solutions and practices can serve this goal? How can deliberate actions increase the system's stability and its adaptability to accidents?


That is what the conference was about. Below are notes on what some of the speakers discussed.

A Few Observations on the Marvelous Resilience of Bone and Resilience Engineering. Richard Cook


First, a few words about the speaker. Richard Cook is a physician, a scientist, and one of the main popularizers of resilience engineering in IT. Together with David Woods and John Allspaw (the man who effectively launched DevOps as a movement), he co-founded Adaptive Capacity Labs, a company that helps other organizations adopt resilience engineering.

It should be noted that REDeploy is not a purely IT conference, and this talk is a vivid example of that.

Most of the talk is a detailed analysis of how a broken bone heals; healing is an archetype of resilience. Left on their own, bones do not knit properly. Medicine learned to treat fractures by understanding the healing process. In fact, medicine does not treat the bone itself at all; it creates conditions that promote healing.

In general, the history of treatment can be divided into two directions:
  • treatment as a process that creates the most favorable conditions for bone healing (we apply a cast so that the bone does not move);
  • treatment as a process of "improving" healing itself (understanding, at the biochemical level, how healing proceeds, we use drugs that accelerate it).


And here is the talk's main thesis, which is in fact programmatic for the entire discipline: why do we need to understand how socio-technical processes unfold during an accident?

By understanding how the "treatment" mechanism works (for example, incident resolution), we can, at a minimum, arrange conditions favorable enough that the accident causes minimal damage, and at best accelerate the resolution of the accident itself. We cannot prevent people from breaking bones, but we can improve how they heal.

The Art of Embracing Failure at Scale. Adrian Hornsby


Next, a technical talk on the evolution of fault tolerance in AWS infrastructure.
Without going into the technical details (you can find them in the slides), let us consider the talk's main thesis. When building its systems, AWS designs the architecture on the assumption that an accident will happen sooner or later, so the architecture must limit the "blast radius" when it does. For example, client databases and storage are divided into groups of "cells," and the load created by one client affects only the users of that cell. Replicas do not duplicate the primary cells client-for-client; clients are shuffled across them, which shrinks the impact radius further.

By increasing the number of such combinations, we reduce the share of customers affected when an incident does occur.
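The combinatorial idea behind this can be illustrated with a toy shuffle-sharding sketch (a minimal sketch with made-up numbers and function names, not AWS's actual implementation): each customer is deterministically assigned a small combination of workers, and because the number of possible combinations grows combinatorially, very few customers share a full shard with any given one.

```python
import itertools
import random

WORKERS = 8      # size of the fleet (illustrative numbers, not AWS's)
SHARD_SIZE = 2   # workers serving each customer

def shuffle_shard(customer_id, workers=WORKERS, size=SHARD_SIZE):
    """Pick a stable pseudo-random combination of workers for a customer."""
    rng = random.Random(customer_id)  # seeding by id keeps the assignment stable
    return frozenset(rng.sample(range(workers), size))

# With 8 workers and shards of 2 there are C(8, 2) = 28 distinct shards,
# so a "poison" request that takes down one customer's shard fully overlaps
# with only ~1/28 of the other customers; most lose at most one of two nodes.
print(len(list(itertools.combinations(range(WORKERS), SHARD_SIZE))))  # 28

print(sorted(shuffle_shard("customer-a")))  # two workers out of the eight
```

With realistic fleet sizes (say, 100 workers and shards of 5) the number of combinations runs into the tens of millions, which is why the fraction of fully affected customers becomes negligible.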



Getting Comfortable With Being Underwater. Ronnie Chen


A talk by a Twitter manager with experience in technical deep-sea diving, about safety practices during dives.

Team deep diving is a high-risk activity. If such dives were organized only on the condition that every risk is fully eliminated, there would be no deep dives at all. One way or another, problems can occur, and that is normal; the speaker frames conscious risk-taking as a way of developing human potential. If we avoid risk entirely, we limit that potential. The task, again, is to make resolving the situation as easy as possible if the risks do materialize.

How can a team live with the pressure that comes with risky activities?

An example: the rules of interaction in a diving team:
  • Reliable, constant communication between participants and maximum psychological safety: every team member must feel safe, and any participant can stop the dive at any time (and blame for doing so is prohibited).
  • Acceptance of errors. Anyone can make mistakes, and mistakes are inevitable in the course of the work; blaming people for errors is likewise unacceptable.
  • The team can redefine the project's objectives and its definition of success during the dive, depending on changing conditions.
  • , .
  • — . , , .
  • ( ) , root cause ( ), , .


The Practice of Practice: Teamwork in Complexity. Matt Davis


During an accident, engineers act largely on intuition, and the talk compares that intuition to musical improvisation. Improvisation is intuitive music-making, but the intuition rests on experience: knowledge of scales, previous improvisations, playing in an ensemble. Moreover, the process is bidirectional: intuition is built on experience, and processes are built on analyzing intuitive actions (in music, the notes of an improvised piece get written down; in technology, the process of resolving an accident gets documented).

Two ways of teaching and forming intuition:
- The post-mortem, not as a means of assigning blame or even of preventing future problems, but as a training tool and a way to share experience. Regularly publish post-mortems / incident reports so that other people can learn from how you resolved problems.
- Chaos engineering as a way of generating experience in a controlled environment. By artificially creating an accident that has to be resolved, we build intuition in the engineers who will handle it. At the same time, by limiting the blast radius within the system, we can choose the part of the stack in which we want to develop competence.
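As a minimal sketch of the chaos-engineering idea (the wrapper and names here are hypothetical, not a real chaos tool; real systems inject faults at the infrastructure level), one can inject faults into a dependency at a controlled rate and check that the calling code degrades gracefully:

```python
import random

class FlakyDependency:
    """Chaos wrapper: injects failures into a callable at a controlled rate."""
    def __init__(self, call, failure_rate=0.2, rng=None):
        self.call = call
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("injected fault: dependency unavailable")
        return self.call(*args, **kwargs)

def fetch_price(item):
    """The 'real' dependency (a stand-in for a network call)."""
    return {"book": 10}[item]

def resilient_fetch(dep, item, retries=3, default=None):
    """Code under test: retries on failure, then degrades to a default."""
    for _ in range(retries):
        try:
            return dep(item)
        except TimeoutError:
            continue
    return default

# Run the code under test against a dependency that fails half the time.
chaotic = FlakyDependency(fetch_price, failure_rate=0.5, rng=random.Random(42))
print(resilient_fetch(chaotic, "book", default=-1))
```

The point of the exercise is not the wrapper itself but what the team learns: whether the retries, defaults, and alerts around the dependency actually behave as expected when faults arrive.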

Those were the talks most memorable to me. These ideas seem especially useful right now, when it may feel as if reality itself has broken down ("bring us a new one"). Perhaps some of these points will help you look at reality, and at accidents, from a new angle.

I update my Telegram channel a little more regularly than this blog; subscribe :-)
