Normalization of deviance: how bad practices become the norm in our industry

Has it ever happened to you that you say something that seems completely normal to you, and everyone else is stunned? It happens to me constantly when I describe what was considered normal at companies where I've worked. For some reason, my interlocutor's face gradually shifts from a pleasant smile into a grimace of utter amazement. Here are some typical examples.

There is one very good company, one of the most pleasant places I've ever worked, a combination of all the best parts of Valve and Netflix. The people there are amazing, and you get almost complete freedom to do whatever you want. But as a side effect of that culture, roughly 50% of new employees leave within the first year, some voluntarily and some not. Absolutely normal, huh?

There is a company that is incredibly secretive about its infrastructure. For example, it is afraid to report bugs to its hardware vendor, because then the bugs would get fixed and competitors could benefit from the fixes. That cannot be allowed. The solution: request the firmware and fix the bugs yourself! Fine.

Quite recently, I met specialists who had tried to reproduce an algorithm that this company published in a paper. They could not reproduce the result. Worse, the published algorithm led to an unusually high level of instability. When one of the authors was asked about this, he replied: "Well, yes, there are some tricks that didn't make it into the paper," and refused to share them. That is, the company deliberately published an underperforming result so as not to give away details, just as it did with the bugs.

The company threatens to fire on the spot any employee who leaks information. Every newcomer is told this, complete with examples of people fired for leaks (one guy, for instance, leaked the fact that a concert was going to be held in a certain office). Each such dismissal is loudly announced to all employees. As a result of this policy, many people are afraid to forward even innocuous emails, such as an update to insurance paperwork. Instead, they print the email from another computer and pass it around on paper, or photograph it on their phones. Fine.

In one office, I once asked why two particular employees seemed to be avoiding each other. I was told their feud had been going on for ten years. Actually, the situation had improved lately: for many years they literally could not be in the same room, or one of them would fly into a rage and do something unfortunate. But the guys have cooled down now, so they can occasionally be found in the same wing of the office, or even in the same room. And these are not just random people. They are the managers of the only two teams at the company. Fine!

There is a company with a culture so strange you could write a small book about it. In fact, I recently started writing a post about this company, and I have already written more than 100,000 words, more than all the posts on my blog combined.

This company explained to me that it is much better to make decisions based on political relationships rather than data, and that the whole idea of data-driven decision making is a myth anyway: nobody actually does it.

This company gave me four reasons to come work for them. All four turned out to be lies. In the end, my job duties boiled down to the very thing I had agreed, when I was hired, that I would not have to do.

When I joined this company, my team had not touched the version control system for several months. I had to fight to get everyone to use it. I won that battle, but I could not convince people to run the tests first. So the build breaks several times a day.

In a conversation with management, I hinted that I considered this a performance problem for our department. I was told that this is normal: everyone is in the same situation, so the playing field is level. The manager's job is to rank people, and if a problem affects everyone equally, there is nothing to worry about.

Another company launched many large-scale initiatives to attract women developers, yet women are still screened out in interviews with questions like "Do you have experience with algorithms, or only with coding?" I thought a candidate I referred with a very strong recommendation would get past this barrier, but I forgot how normal the company was.

At another company, I worked on a team of four on a project with a budget of several hundred million dollars and an expected annual impact of a billion dollars. At the same time, requests for things costing hundreds of dollars sat under review for months, or were rejected.

It may seem that I have only worked at unusually bad companies. Not so: they have good reputations, and two of them are regularly named among the best places to work. And I have heard similar stories from employees of other companies, including ones with excellent engineering reputations. The only difference was that this time I was the one in shock, while my interlocutor thought everything was fine.

Many companies use the Flaky library, which adds Python decorators for marking unreliable tests, the ones that sometimes pass and sometimes fail. I asked employees of three different companies what Flaky does. All of them guessed that it reruns a test repeatedly and reports a failure. Close, but not quite. Technically you can use it that way, but in practice it reruns the test repeatedly and reports success. Flaky was developed by one storage-infrastructure company, and the library is actively used by its largest competitor. Marking tests with potential bugs as passed is completely normal.
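For reference, here is roughly what that looks like in practice. This is a minimal sketch assuming pytest with the flaky plugin installed; the test itself is invented for illustration:

    # A test that fails nondeterministically about 30% of the time.
    # With the decorator below, flaky reruns it on failure, and a single
    # passing run out of three is enough for it to be reported green.
    import random

    from flaky import flaky


    @flaky(max_runs=3, min_passes=1)
    def test_unreliable_operation():
        assert random.random() > 0.3

With these settings, the test is reported as failed only if all three runs fail, which in this toy case happens about 0.3^3, or 2.7%, of the time. A test hiding a real race condition gets exactly the same green checkmark, which is the whole problem.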

There is a company that is known for good engineering practices. When I last checked, it had about two nines of uptime, which is fully explained by the engineering practices adopted there. If a startup looks like Twitter or Reddit, two nines may be enough, but we are talking about an infrastructure platform that really needs more. Many companies that build infrastructure at two nines consider the practices that lead to such reliability completely normal.
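For a sense of scale, each nine has a concrete downtime budget; a quick back-of-the-envelope calculation:

    # Yearly downtime budget implied by an availability target.
    for label, availability in [("one nine", 0.9), ("two nines", 0.99),
                                ("three nines", 0.999), ("four nines", 0.9999)]:
        hours = (1 - availability) * 365 * 24
        print(f"{label} ({availability:.2%}): {hours:.1f} hours of downtime per year")

    # one nine (90.00%): 876.0 hours of downtime per year
    # two nines (99.00%): 87.6 hours of downtime per year
    # three nines (99.90%): 8.8 hours of downtime per year
    # four nines (99.99%): 0.9 hours of downtime per year

Two nines means more than three and a half days of downtime a year, which is a strange number for an infrastructure platform to consider normal.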

As far as I can tell, many of these companies arrived at this state by a similar path. At first, they focused only on product growth. That is absolutely reasonable, because initially the value of the company is approximately zero. It does not invest in competent system administration or real security, because it has nothing to lose. Well, except user data, when it inevitably gets breached; and if you talk to security people at large private startups (the so-called unicorns), this happens all the time.

The result is a culture overly focused on growth and indifferent to risk. That culture often persists even after the company has grown to a billion dollars and has something to lose. Someone who used to work at Google, Amazon, or another company with solid procedures is shocked by the situation. Often they try to fix something, find they can do nothing, and resign.

Google today probably has the best operational and security practices of any IT company in the world. It is easy to say we should all follow its example, but it is instructive to see how it got there, and what came before. If you look at the code base, you will see many services whose names end in z, as well as a surprisingly large number of variables that also end in z. One old-timer told me that long ago someone wanted to add monitoring. Exposing monitoring at google.com/somename was not very safe, so they added a z, that is, google.com/somenamez, "for security." This is the company now considered to have the best security in the world.

Now it has come so far in security that new employees vehemently deny such practices ever existed, and the reasons they offer don't really make sense (for example, avoiding name collisions).

Google's tremendous progress in security, from appending the letter z to having the world's best security practices, didn't happen because someone gave a pep talk or wrote a compelling essay. It began after several major screw-ups. Only then were security people given the authority to fix fundamental problems. In good, well-run companies, reforms almost always begin this way. People laughed at Microsoft's security for years, until several disastrous exploits forced a change in attitude. It sounds simple, but eyewitnesses say the change was brutal. Even with a mandate from the top, inertia remained very strong. Why change what worked? So there was heavy pushback from people used to doing everything the old way.

You can see the same thing in any industry. A classic example often cited by technical people is hand washing by doctors and nurses. It is well known that germs exist and that washing hands with soap greatly reduces the likelihood of transmitting them, thereby significantly reducing hospital mortality. Despite this, even experienced doctors and nurses still often don't do it. Interventions are required. Signs reminding people to wash their hands save lives. Better still is posting an actual person to demand hand washing: that saves even more lives. People can ignore a sign, but they cannot walk past the person in charge.

IT companies implement best practices in much the same way. Telling employees what to do helps a little. Enforcing code review makes the effect immediately visible.

Statistics clearly show that people are bad at keeping up routine habits that produce no visible effect but demonstrably reduce the risk of rare catastrophic events. Cutting corners feels like the right, reasonable route. There is a special term for this: "normalization of deviance." It has been studied extensively in a number of other contexts, including healthcare, aviation, mechanical engineering, aerospace, and civil engineering, but not yet in the context of software.

Can we learn from others' mistakes instead of our own? The state of the industry gives little reason to hope, but let's try. John Banja wrote a short paper on the normalization of deviance in healthcare whose findings can be applied to software development. Treating patients turns out to map surprisingly well onto what devops teams do. However, the normalization of deviance also occurs in a cultural context, where the analogy is less obvious.

The first section of the paper describes in detail a number of catastrophes, both in healthcare and in other fields. Here is one typical example:

[A quoted case study follows in the original (2005, pp. 87-101): a patient death preceded by silenced alarms and improvised manual workarounds that had long since become routine.]

Turning off or ignoring alerts because there are too many of them and they're too annoying? Doing things by hand at the risk of making a mistake? Yes, I can immediately name a few companies where the post-disaster debriefing starts with exactly these points, except that in the end nobody dies and only a few million dollars are lost. If you have read many postmortems of such incidents in IT, every example in Banja's paper will feel familiar, even if the details differ.

The section ends with this conclusion:

As a rule, these disasters are explained by "a long period of rule violations, contradictory events that accumulated undetected, and a mistaken cultural notion of the dangers involved. Together, these factors prevented interventions that could have forestalled the harmful outcome." It is especially striking how numerous rule violations and errors combine to create the opportunity for disaster.

Once again, the text reads as if it came from an article about technical failures. So the next section, on the causes of these failures, deserves attention. The causes are as follows.

Silly and ineffective rules


The paper gives the example of administering medication to newborns. To prevent "drug diversion," a nurse must enter a password on a computer; she then gets access to the medicine cabinet and takes the right amount of medication. To make sure the first nurse has not stolen anything, a second nurse must watch the process, and then enter her own password on the computer to confirm that she observed the correct handling of the medication.

Sounds familiar. A lot of incident reports start with "someone skipped some steps because they are inefficient." For example, "a programmer pushed a bad configuration or bad code because he was sure it was fine and didn't want to waste time on staging or testing." The notorious Azure outage of November 2014 happened for exactly this reason.

At about the same time, at one of Azure's competitors, developers overrode the rule prohibiting pushing a configuration that fails tests to the canary environment; they were sure the configuration was fine. When the canary started failing, they overrode the rule prohibiting promotion from a failing canary to staging; they were sure the configuration was fine and the failure was caused by something else. Subsequent analysis showed the configuration was technically correct, but it exposed a bug in the main software. It was pure luck that the hidden bug surfaced by the configuration was not as serious as Azure's.

People are bad at understanding how errors compound. That is why we adopt rules for safe deployment. But for the very same reason that people are bad at understanding how errors compound, those rules seem silly and ineffective!
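To make the compounding concrete, here is a toy model (the 90% catch rate is invented for illustration): if each independent stage of a pipeline, say review, tests, and canary, catches nine out of ten bad changes, the slip-through rate falls by a factor of ten with every stage, and rises by a factor of ten with every stage you skip.

    # Toy model: fraction of bad changes that reach production when each
    # independent pipeline stage (review, tests, canary, ...) misses 10%.
    miss_rate = 0.1  # invented figure for illustration

    for stages in range(4):
        print(f"{stages} stages: {miss_rate ** stages:.1%} of bad changes slip through")

    # 0 stages: 100.0% of bad changes slip through
    # 1 stages: 10.0% of bad changes slip through
    # 2 stages: 1.0% of bad changes slip through
    # 3 stages: 0.1% of bad changes slip through

From inside any single incident, each stage looks redundant ("the configuration was fine"), which is exactly why the rules feel silly right up until the errors line up.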

Knowledge is imperfect and uneven


The concept of a norm is not innate. When new people join a company, they easily absorb the deviant processes that have become the norm there.

Julia Evans described to me how this happens:

a newbie arrives
newbie: WTF WTF WTF WTF WTF
veterans: yes, we know, that's how we do things
newbie: WTF WTF wTF wtf wtf w...
the newbie gets used to it
a second newbie arrives
newbie #2: WTF WTF WTF WTF
newbie: yes, we know, that's how we do things

The most insidious thing is that people really do internalize the WTF idea, and then they can spread it to other places throughout their careers. I once worked with an open source project whose build broke regularly. I was told this was normal and that the project was better than average. I checked and found it was the worst in its class by almost every measure. I sketched out how to cut releases, with relatively little effort, that would almost always pass the tests. The most common response was: "Wow, this guy must be working with superstar programmers. The rest of us have to be realistic: everyone's build breaks at least a few times a week." As if running the tests (or, for that matter, even trying to compile) before checking in code required superhuman effort. But once people believe some deviation is normal, they really do absorb the idea.
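The effort really is modest. As a minimal sketch, a git pre-push hook that runs the test suite is a few lines; the test command here is an assumption, adjust it to the project:

    #!/usr/bin/env python3
    # Minimal git pre-push hook: save as .git/hooks/pre-push and make it
    # executable. Blocks the push if the test suite fails.
    import subprocess
    import sys

    # Assumes a pytest suite; substitute the project's own test command.
    result = subprocess.run(["pytest", "-q"])
    if result.returncode != 0:
        print("Tests failed; push aborted (bypass with git push --no-verify).")
        sys.exit(1)

Nothing superhuman about it; the hard part is the culture, not the tooling.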

I break the rule for the good of the patient


The paper gives the example of a doctor who breaks the rule that gloves must be worn when searching for a vein. He believes that gloves make it harder to find the vein, meaning the child would have to be poked with the needle several times. It's hard to argue with that. Nobody wants to hurt a child!

The second-largest outage I have ever seen happened for exactly this reason. Someone noticed a database slowdown. They quickly wrote a patch and, so that the degradation would not spread further, ignored the rule about slow, staged deployment. Instead, they rolled the patch out to all machines at once. It's hard to argue with that. Nobody wants customers to suffer degraded service! Unfortunately, the patch triggered a bug that caused a global outage of the service.

The rules do not apply to me / You can trust me


Most people see themselves as good and decent, so they can regard their own rule breaking as a perfectly rational and ethically acceptable response to a problem situation. They are sure they are doing nothing wrong, and they will be indignant, and often fiercely defensive, when confronted with evidence to the contrary.

As a company grows, it has to introduce security controls that stop every employee from having access to almost everything. And when that happens, at most companies some employees get genuinely upset: "Don't you trust me? If you trust me, why are you denying me access to X, Y, and Z?"

Facebook famously long gave employees access to any user's profile. Some recruiters even mentioned this as a perk of working at Facebook. And I know more than one respected startup where, to this day, every employee has access to almost everything, even after one or two leaks. It takes real political will to restrict people's access to what they have come to consider necessary, or theirs by right. Many trendy startups have declared "trust" and "transparency" core values, which makes access restrictions hard to justify.

Employees are afraid to speak up


People don't want to voice an opinion to certain colleagues, because it may be met with hostility, and words once spoken cannot be taken back. In the paper, the author gives the example of a doctor with poor handwriting who gets angry when anyone asks him to clarify what he wrote. As a result, people guess rather than ask.

Most companies have developed cultures in which giving feedback is difficult. I have seen projects drag on for months and then get cancelled because everyone was afraid, from the very start, to speak up for fear of criticism. The problem exists even in cultures that encourage politeness: sincere criticism is hard to voice there too. It turns out that at some companies people are afraid to speak because someone mean will attack them; at others, because they themselves will be branded as mean. A tough problem.

Management hides problems


The paper describes how information about a problem gets washed out as it passes up the chain. One example is a manager taking suboptimal actions so as not to look bad in front of his superiors.

I was shocked the first time I saw this. People understand they are doing something clearly suboptimal. But if you try to do better, there is a nonzero chance of failure, and failure would be very embarrassing, so it's easier to leave things as they are. With years of professional experience I understand better how and why people play this game, but I still find it absurd.

Solutions


Suppose your company has a typical problem: people are rewarded for heroically putting out fires rather than for preventing them, and promoted for shipping new features rather than for doing critical maintenance and bug fixing. How do you change that?

The easiest option is simply to do the right thing yourself and ignore what is happening around you. That brings some benefit, but the scope of your influence is limited. The next step is to convince your team to do the right thing: I have done this several times to implement practices I consider really important.

But if the incentives are working against you, it takes constant, tireless effort to keep people doing the right thing. In that case the problem is to persuade someone to change the incentives, and then to make sure the changes work as intended. How to convince management to change incentives is a topic for a separate article. As for implementing the changes, I have seen many "obvious" mistakes repeated at company after company.

Small companies have it easy here. When I worked at a company of a hundred people, there was a simple hierarchy: individual contributor (IC) -> team lead (TL) -> CEO. That's all. The CEO did not intervene much, but when he did say something, it was carried out without question. Importantly, he knew very well what each employee was doing, and he could effectively adjust compensation in real time. If you did something good for the company, you could expect a raise: not nine months later, when the next performance review cycle rolled around, but almost immediately. Not every small company makes this work, but with the right leadership it is possible. At a large company, there is no chance.

One large company had exactly this problem. Management ordered that employees be rewarded for doing critical but inconspicuous work. There were too many employees to hand out bonuses directly, but a manager could review reports, run spot checks, and award bonuses, so that over time the right incentives would become part of the culture. In my personal opinion, the company never reached parity between boring maintenance work and brilliant new projects. But at least people could start working on infrastructure and fixing bugs without serious damage to their careers.

At another large company, rank-and-file employees agreed it was wrong to reward building new features more generously than doing critical work. When I talked to managers, they often agreed too. Nevertheless, promotions went mostly to the developers of brilliant new things. Management attempted cultural and technological change, mostly in the form of inspirational statements from people with fancy titles. For the really important things, you had to watch a video and then pass a multiple-choice test on it. The only result of this campaign was a general consensus that management is very far removed from the lives of ordinary employees.

It's a little funny that in the end everything comes down to incentives. Our industry thinks a lot about how to nudge consumers into doing what we want, yet inside our companies we build incentive systems that push us toward the wrong things. A kind of cross between a game of telephone and a cargo cult. In the old days Microsoft was the role model: we copied its methods and asked puzzle questions in interviews. Now Google is the model, so we ask algorithm questions. If you look at fashionable companies younger than Google, most of them basically copy Google's job-ladder system with minor changes. The good news is that Google thought through most of its processes carefully, and its decisions are based on data. The bad news is that Google is in many ways a unique company. Its practices often don't work for everyone else, so people are simply practicing a cargo cult, often long after Google itself has abandoned the practice in question.

The same diffusion happens with technical decisions. Stripe built a robust message queue on top of Mongo, so we will also build robust message queues on top of Mongo [1]. The cargo cult propagates down the chain [2].
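For the curious, the pattern those admiring posts describe usually boils down to a polling loop around an atomic find-and-modify. A minimal sketch, assuming pymongo, with collection and field names invented for illustration:

    # Claim-and-process loop of a "message queue on MongoDB".
    from pymongo import MongoClient, ReturnDocument

    jobs = MongoClient()["app"]["jobs"]  # invented database/collection names

    def claim_next_job():
        # Atomically flip the oldest pending job to 'processing'.
        return jobs.find_one_and_update(
            {"status": "pending"},
            {"$set": {"status": "processing"}},
            sort=[("created_at", 1)],
            return_document=ReturnDocument.AFTER,
        )

    job = claim_next_job()
    if job is not None:
        # ... do the work, then:
        jobs.update_one({"_id": job["_id"]}, {"$set": {"status": "done"}})

Note what the sketch does not handle: a worker that dies after claiming leaves the job stuck in "processing" forever, and there are no retries, visibility timeouts, or backpressure. Those gaps are roughly where the private complaints in the footnote come from.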

The medical paper has dedicated subsections on how to prevent the normalization of deviance:

  • Pay attention to weak signals.
  • Resist the desire to be unduly optimistic.
  • Teach employees how to conduct emotionally uncomfortable conversations.
  • System operators should feel safe when expressing an opinion.
  • Realize that oversight and monitoring never stop.

Let's look at how the first principle plays out when a newcomer joins a company and starts screaming "WTF WTF WTF."

When a vice president expresses an opinion, people usually listen. That is a strong signal. And if they don't listen, the vice president still knows how to get his decision implemented. A newcomer doesn't know which levers to pull or whom to talk to. Newcomers produce weak signals that are easy to ignore, and by the time one of them has studied the system well enough to send strong signals, he has already acclimatized.

"Pay attention to weak signals" sounds good, but how do you actually do it? Strong signals are few and rare, so they are easy to attend to. Weak signals are far too numerous. How do you filter out the noise? And how do you get a whole team or organization to really do this? There is no simple answer to these questions; they require serious attention.

Unfortunately, companies rarely give them that attention. Startups think mostly about growth. Everyone says they care deeply about engineering culture, but in practice this is rarely so. With a few exceptions, large companies are not much different. At one of them I saw competitive-analysis slides, and they were amazing. Hundreds of products studied down to the smallest detail to ensure users get perfect quality in every respect, from the first-run experience to interoperability with competitors' products. If even one flow is more complicated or confusing than any competitor's, people get upset and rush to fix it. Very impressive.

Then the same company hires new employees, and every third one has no account in the systems, or no desk, or no computer, and this can go on for weeks or months. The competitive-analysis slides say you get only one chance to make a first impression; meanwhile, new employees get the impression that the company is unable to take care of them, and that constantly broken work processes are normal.

The company cannot even get the basics of onboarding right, let alone something as genuinely hard as acculturation. The reasons are clear. External indicators, such as audience growth or decline, are measurable; the acculturation of newcomers is not, so its weak signals get ignored. But that does not make it less important. There is a lot of talk about how new languages or methods such as TDD or Agile raise productivity, but a strong engineering culture is a far more powerful multiplier.



1. People seem to think I'm joking. Just try googling mongodb message queue. You will find claims like "replica sets in MongoDB provide excellent redundancy and automatic failover." Almost every company I know of that tried this at scale found the result suboptimal, to put it mildly. But you will find nothing about that. There are only articles and talks from companies that tried it and are enchanted by the DBMS. This is characteristic of many technologies: in public there are glowing recommendations, and in private people will tell you about all the problems. If you run such a search today, you will find a ton of admiring articles about how great it is to build a message queue on top of Mongo, you will find this article, and perhaps a few posts from Kyle Kingsbury's blog, depending on the exact search phrase.

When a serious outage occurs, you will see a postmortem with a technical analysis. But we love doing these analyses for incidents like "the site was down for 30 seconds," and we rarely analyze situations like "this takes ten times more effort than the alternative, and it's death by a thousand cuts," or "we designed the system poorly, and now changes that should be trivial are very difficult," or "our competitor managed to do the same thing with an order of magnitude less effort." I sometimes run an informal debriefing, asking everyone involved leading questions, but it's mostly for myself, because I'm not sure people really want to hear the truth, especially if several employees were promoted for the project in question. It seems the more troubled the project, the more often it is rewarded: the bigger the project, the more visible it is and the bigger the bonuses, even if it could have been done with far less effort.

2. I have often asked this question at successful companies, and at others where everything is bad. Where things are bad, everyone has ideas. But where things are good, nobody has any idea why it all works, as at the aforementioned small company with the CEO who didn't intervene much. Amazing. People literally say that everything looks like some other company where they worked, except that there everything was bad and here it is magically good, for reasons they don't understand. But it isn't magic. It is hard work that few people notice. Many times I have seen a vice president leave and the company gradually become an unpleasant place to work. Slowly it dawns on people: that VP had been making sure every employee was happy in their job. It's hard to appreciate this until things go bad. If you don't see anything clearly wrong, either you aren't paying attention, or someone has put in a great deal of effort to make everything run smoothly.
