Why do we write software of such poor quality?


To aircraft engineers:
- The engineering is solid, and we have learned from a century of experience. Crashes are extremely rare.
To elevator engineers:
- Elevators are protected by multiple redundant failsafes. They almost never fall.
To software engineers:
- That is terrifying.
- Wait, really?
- Don't trust voting software. And don't listen to anyone who tells you it's safe.
- Why?
- I don't quite know how to put it, but our entire field is bad at what we do, and if you rely on us, everyone will die.
- They say that reliability is guaranteed by a technology called blockchain.
- Ahhhh!!! Whatever they sold you, don't touch it. Bury it in the desert. Don't forget the gloves!

Source: XKCD, Creative Commons BY-NC 2.5 license.

A glitch in a mobile vote-counting app threw last week's Iowa Democratic caucuses into chaos. A few hours after caucusing began across the state, it became clear that something had gone wrong. The results were still unknown. Reports described technical problems and confusion. The Iowa Democratic Party issued a statement denying rumors of a cyber attack but confirming technical problems with the mobile app.

A week later, we have a better picture of what happened. The mobile app had been written specifically for this event in Iowa. It was distributed as a beta, bypassing the major app stores. Users struggled to install it and struggled to get it working; once installed, it frequently failed to respond. Some caucus sites had no internet connection at all, which made an online app useless. The Democrats had a backup plan: report the results by telephone, as in previous years. But the phone lines were jammed by online trolls doing it for the lulz.

Twitter filled with a wave of #app and #problems hashtags, and software engineers passed around the xkcd comic above. I did too. The words of that comic sum up the general mood: "I don't quite know how to put it, but our entire field is bad at what we do, and if you rely on us, everyone will die." Software engineers don't say it outright. But it sounds a lot like that. What do we mean?

Here's what we mean: we do software well, provided the consequences of failure don't matter. On average, programs are good enough to more or less work. Yet we aren't particularly surprised by bugs in most programs; they are not rare incidents. Many common practices in software engineering rest on the assumption that failures are normal and that new features matter most. Failure really is cheap. If an online service from one of the largest companies goes down completely for two hours, everyone forgets about it within a week. The assumption is embodied in the familiar mantras: "Move fast and break things," "Launch and iterate."

The market rewards such "irresponsible" behavior generously. At many web companies, a small profit per user is multiplied by millions (or billions!) of users. This favors companies with consumer apps or websites: development is expensive, but its cost is finite, and distribution is nearly free. The consumer software industry has settled on a compromise: reduce development speed just enough to keep the defect rate moderately low, but no more.

We call this style of software development the "website economic model": when the payoff from shipping is high and the cost of retries is low, management incentivizes shipping features as fast as possible. This is reflected in modern project-management and engineering practices, which I will discuss below.

But, as I said, "we do software well, provided the consequences of failure don't matter." This approach leads to spectacular failures when retries are not cheap, as in Iowa. Common software development practice grew out of the website economic model, and when that model's assumptions are violated, software engineers do a poor job.

How does software development work at web companies?


Imagine a hypothetical company, QwertyCo: a consumer software company making $100 million a year. We can gauge QwertyCo's size by comparison. WP Engine, a WordPress host, reached $100 million ARR in 2018. Blue Apron generated $667 million in a year. So QwertyCo is a mid-sized company: it has somewhere between a few dozen and a few hundred engineers, and it is not publicly traded.

First, consider the economics of project management at QwertyCo. Its leaders have learned that you can't simply decree that a new feature appear instantly. There is a tradeoff between software quality, deadlines, and development speed.

How important is software quality to them? Not very. If the QwertyCo website went down for 24 hours a year, they would put the loss at only $273,972 (assuming uptime is linearly correlated with revenue). Say the site goes offline for 15 minutes now and then; nobody really cares. If a feature takes down the site, they roll it back and try again later. Retries are cheap.
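To make the arithmetic explicit, here is a back-of-envelope sketch (the $100 million figure and the linear uptime-revenue assumption come from the text; the rest is illustrative):

```python
# Back-of-envelope downtime cost for QwertyCo, assuming revenue scales
# linearly with uptime (the same assumption made in the text).
ANNUAL_REVENUE = 100_000_000  # dollars per year

def downtime_cost(hours_down: float) -> float:
    """Revenue lost while the site is offline."""
    hours_per_year = 365 * 24
    return ANNUAL_REVENUE * hours_down / hours_per_year

print(f"24 hours down:   ${downtime_cost(24):,.0f}")    # ~$274,000
print(f"15 minutes down: ${downtime_cost(0.25):,.0f}")  # ~$2,854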

How valuable is a new feature to QwertyCo? In my personal observation, one engineer-month of work on an already optimized site shifts revenue somewhere between -2% and +1%. That is a monthly shot at up to $1 million in extra QwertyCo revenue per engineer. Techniques like A/B testing even soften the downside: within a few weeks you can detect negative or neutral changes and remove those features. Bad features are cheap, because they are only live for a bounded time; wins pay out indefinitely. Even a low hit rate turns a profit for QwertyCo.
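A rough simulation shows why this bet pays. The -2% to +1% range is from the text; the uniform distribution and the two-week kill switch for bad features are my assumptions, so treat the numbers as illustrative only:

```python
import random

# Toy expected-value model of one engineer-month of feature work.
ANNUAL_REVENUE = 100_000_000

def feature_payoff() -> float:
    impact = random.uniform(-0.02, 0.01)  # revenue impact of one feature
    if impact < 0:
        # Bad feature: the A/B test detects it and it is removed in ~2 weeks.
        return impact * ANNUAL_REVENUE * (2 / 52)
    # Good feature: kept, pays out for (at least) the rest of the year.
    return impact * ANNUAL_REVENUE

payoffs = [feature_payoff() for _ in range(100_000)]
print(f"expected payoff per engineer-month: ${sum(payoffs) / len(payoffs):,.0f}")
# Positive, even though roughly two out of three features lose money:
# the downside is capped at two weeks of exposure, the upside is not.
```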

Given these pros and cons, when should QwertyCo ship a feature? The economics say that even high-risk features should ship if they sometimes pay off. So every project turns into an optimization game: "How much can we ship by this date?", "How long would the full project take? What if we cut X? What if we cut X and Y? How do we speed up this particular part?"

Now consider a software project from the software engineer's point of view.

The engineer's main resource is time. Safe software development eats time. Once a product crosses a certain complexity threshold, it goes through several stages of development (even if they aren't part of an explicit process). The feature is planned with a product manager, turned into a technical design, and, if necessary, broken into subtasks. Then code is written along with tests, reviewed, instrumented with metrics, and wired into dashboards and alerts. Manual testing happens where needed. On top of that, coding often carries the extra overhead known as refactoring: modifying an existing system to make a new feature easier to implement. For a "small" feature, the coding itself may take only 10-30% of the time.

How do developers lose time? Most often to system failures. When the site goes down, everyone piles in; the most experienced engineers drop their current projects to bring the site back up. But time spent firefighting is time not spent creating value for the company, and their projects fall behind schedule. How do you reduce downtime? Written tests, monitoring, automated alerts, and manual testing all lower the risk of catastrophic events.

How else do engineers lose time? Through subtler, rarer bugs. Some errors appear rarely but do great damage; perhaps users lose data when they perform a particular sequence of actions. When an engineer receives a report of such a bug, they have to drop everything and fix it. That pulls them off the current project, and this kind of interruption adds up.

Accordingly, experienced software engineers start paying close attention to code quality and want it carefully checked. That is why engineering organizations adopt practices that appear to slow development down: code review, continuous integration, observability, monitoring, and so on. Errors are cheaper when caught early, so engineers invest heavily in early error detection. They also invest in refactoring, which simplifies implementation; the simpler the implementation, the less room for error.
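As a deliberately tiny illustration of why catching errors early pays, here is a hypothetical function and the unit tests that would catch its boundary bug in continuous integration rather than in production:

```python
# A unit test that fails in CI costs minutes; the same bug in production
# costs an outage. Function and tests are hypothetical examples.

def percentage(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    if whole == 0:  # the kind of edge case that takes a site down at 3 a.m.
        return 0.0
    return 100.0 * part / whole

def test_basic() -> None:
    assert percentage(1, 4) == 25.0

def test_empty_whole() -> None:
    # Without the guard above, this raises ZeroDivisionError.
    assert percentage(5, 0) == 0.0
```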

Thus management and engineering hold opposing views on quality. Management accepts a high error rate (though not too high), while engineers want errors kept to an absolute minimum.

How does this affect project management? Managers and developers break the project into small tasks. The project's lead time depends on the number of tasks and the number of engineers. Most often a project takes too long, and it is trimmed by cutting features. Then the engineers execute the tasks, usually inside a sprint. If a sprint is two weeks, every task carries an implicit two-week timer. But tasks routinely take longer than expected, and engineers make hard prioritization calls to finish on time: "I can get this done by the end of the sprint if I write only basic tests and skip the refactoring I was planning." Sprints apply constant pressure to the developer, who can either compromise on quality or admit failure at the next meeting.

Some will say I'm being too hard on sprints, and they're right. Really, this is about time pressure in general. The sprint process is just a convenient way to multiply that pressure by applying it repeatedly: once when estimating the whole project, and once for each task. If a product is valued by what it adds to the company, it's natural for the schedule to adjust itself accordingly. Engineers want to ship fast too, but they tend to optimize costs over the long run rather than the short run. Many organizations, however, reward only short-term speed.

With these incentives in place, a manager gets what they want: they can name a feature and a date, and management and developers will negotiate how to do it. "I want one-click purchases, without account creation, within two months." Managers and developers write out all the two-week tasks and trim the list until they can launch something called "one-click purchases." It will carry a moderate risk of failure and will probably work only after a few iterations. But the failure is temporary, and the feature is forever.

What happens when the assumptions of this economic model are violated?


As I said, we do software well, provided the consequences of failure don't matter. That's what the slogans "Move fast and break things" and "Launch and iterate" point to. But anyone can imagine a situation where a redo is expensive or outright impossible. At the extreme, a collapsing building can kill thousands of people and cause billions of dollars in damage. The 2020 Iowa caucuses are a milder example: if the event fails, everyone still goes home alive that evening. But the party cannot hold the caucuses a second time... not without enormous time, money, and effort.

A brief note: in this section I use "high-risk" as shorthand for both "situations where a retry is impossible" and "situations where a retry is expensive."

What happens when the website economic model is applied to a high-risk situation? Let's pick an example at random: say you're writing an app to report the results of the Iowa caucuses. What steps would you take to design, build, and test it?

First, the engineering logistics: you have to write both an Android app and an iPhone app. Reporting results is the central requirement, so you need a server. The confusing caucus rules must be encoded on both the client and the server. The system has to report results to end users, which is yet another interface to build. The Democratic Party probably has validation and reporting requirements you must build into the app. And it would be extremely awkward for the server to fail during the caucuses, so you need some kind of monitoring.
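To make "rules on both client and server" concrete, here is a minimal, hypothetical sketch of the server-side half; the field names and rules are simplifications for illustration, not the actual app's logic:

```python
from dataclasses import dataclass

# Sketch of server-side validation for precinct reports. The same checks
# would have to be duplicated in the Android and iPhone clients.

@dataclass
class PrecinctReport:
    precinct_id: str
    attendees: int
    alignments: dict[str, int]  # candidate -> supporter count

def validate(report: PrecinctReport) -> list[str]:
    """Return a list of rule violations; an empty list means 'plausible'."""
    errors = []
    if report.attendees <= 0:
        errors.append("attendee count must be positive")
    if any(n < 0 for n in report.alignments.values()):
        errors.append("supporter counts cannot be negative")
    if sum(report.alignments.values()) > report.attendees:
        errors.append("more supporters than attendees")
    return errors
```

Even this toy version would have flagged the "theoretically impossible results" described later in this article before they reached the public tally.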

Next, how do you validate the app? One option is user testing. You show mockups of a hypothetical app to potential users and ask questions like "What do you think this screen does?" and "If you wanted to do $a_thing, where would you tap?" Design always takes several iterations, so you can reasonably expect high quality after a few rounds of such testing. Large companies often run several rounds before shipping important features; sometimes they even cancel a feature based on the feedback, before a single line of code is written. User testing is cheap. How hard is it to find five people who'll spend 15 minutes on a questionnaire in exchange for a five-dollar gift card? In our case, the hardest part is assembling a sample that is representative of Iowa's Democratic precinct representatives.

Then you need to test the app in action: install and configure it on real smartphones. The Democratic Party has to understand how the results will come in, and it needs a backup plan in case of failure. A good test might be a "trial caucus," in which some members of the Iowa Democratic Party download the app and report results to a central server on a given date. That would surface problems and give an overall picture. Testing can proceed in stages as individual parts of the product come online.

Next, the internet is full of villains. Russian groups, for example, have widely disseminated misinformation via social media such as Facebook, Reddit, and Twitter. So you have to make sure no outsider can interfere. How do you verify that results are authentic? Besides villains, the internet is full of jokers ready to disrupt any event just for fun. How does our system hold up under a DDoS attack? If it doesn't, is there a fallback plan? Who is responsible for invoking that plan and communicating it to the caucus sites? What happens if members' accounts are hacked? If the company has no security experts, the app should probably undergo an independent audit.
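One standard way to keep strangers out of the reporting channel is to issue each precinct a pre-shared secret and require every report to carry an HMAC tag. This is a minimal sketch of the idea, not the actual app's scheme; key distribution, replay protection, and transport security are deliberately left out:

```python
import hashlib
import hmac

def sign_report(secret: bytes, payload: bytes) -> str:
    """Tag a report payload with an HMAC-SHA256 of a per-precinct secret."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_report(secret: bytes, payload: bytes, tag: str) -> bool:
    expected = sign_report(secret, payload)
    return hmac.compare_digest(expected, tag)  # constant-time comparison

# Hypothetical usage: the key is issued to the precinct in advance.
precinct_key = b"per-precinct secret, distributed out of band"
payload = b'{"precinct_id": "IA-0042", "attendees": 180}'
tag = sign_report(precinct_key, payload)
assert verify_report(precinct_key, payload, tag)
```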

Next, how do you guarantee there is no bug in the software that distorts the results? And the Democratic Party should be suspicious even of itself: can the results be trusted if there is a traitor in its own ranks? The results must be verifiable against paper records.

Okay, let's stop listing problems. One thing is clear: making sure everything works takes a lot of time and resources.

The makers of the Iowa caucus app were given $60,000 and two months. They had four programmers. That sum is not enough to pay four good programmers, let alone cover other expenses. Money cannot be traded for time. And there was practically no outside help.

Imagine applying the standard practice of cutting tasks from the project until the schedule becomes feasible. You do everything you can to save time. App store review usually takes less than a day, but in the worst case it can drag on for a week, and the app may be rejected. So skip it: party members will download the app through beta links. Even a free security audit would take too long once you account for implementing all its recommendations, so skip the audit too. Perhaps, while building the backend, you pay a designer $1,000 for app mockups and a logo. You plan one round of user testing (but drop it when the deadline closes in). Roll out fast and iterate! Everything can be fixed later.

And programming always takes longer than expected! You hit snags. First, the caucus rules are not entirely clear; that always comes out when a digital solution is overlaid on the analog world. The real world tolerates ambiguity and inconsistency; the digital world cannot. In response to your questions, the Democratic Party committee prepares clarifications. That holds you back. The committee may even change the rules at the last second, forcing you to change the app right before the deadline. Next, you have several developers, which means coordination overhead. Is every coder 100% fluent in both mobile and server development? Do they all know React Native inside out? JavaScript? TypeScript? Client-server communication? Which frameworks and libraries did you pick? Every "no" adds coordination and ramp-up time to the schedule. Is everyone happy with the test frameworks you're using? Just kidding. What tests? Sure, a couple were written early on, but the app changed so fast that they were deleted.

Time waits for no one. The two months run out, and with a final push you stagger across the finish line and ship the release.

By the website economic model, a rushed finish is fine. After all, the rush doesn't matter, because you crossed the finish line! Any problems can be fixed over the next few weeks, and then it's on to the next project.

But the rush told at the Iowa caucuses themselves. As the event unfolded, calls poured in complaining about the app. Theoretically impossible results and duplicates started arriving. Soon gleeful programmers were posting the xkcd comic above and saying that the Iowa caucuses should never have commissioned an app at all, and that voting can only be trusted to paper.

Conclusions


For me personally, this essay led to a conclusion: when planning a project, formalize the cost of a redo. I've done this intuitively in the past, but it deserves to be made explicit. Formalizing it makes clear which tasks must not fail under any circumstances. It was like that in my mobile-robotics work: implementation cycles were long, and the damage from a malfunction could go through the roof, so we spent a great deal of time building monitoring and reliable ways to suppress and shut down runaway systems. I have also spent ten years working on consumer web services, where the consequences of failure are lower. There, people are far more willing to take on short-term debt and push ahead at the risk of a temporary failure, especially when rollback is cheap and data loss is unlikely; at the very least, the incentives push toward exactly that behavior. Our industry has techniques for heading off such problems. One of them is the pre-mortem: investigating an imagined failure before it happens. We should do it more often.

The Iowa failure has one upside: some people outside IT realized that programs have bugs. In the coming years, those sponsoring app development for political parties will start asking, "What guarantees that the Iowa caucus situation won't repeat itself?" Perhaps they will discover the literature that tries to teach managers how to work with engineers. The US Department of Defense, for example, has a guide called "Detecting Agile BS" that describes warning signs in a contract. Startup forums are full of non-technical people asking for (and getting) advice on hiring developers.

The IT industry, though, has learned nothing. The Iowa caucuses were an opportunity to examine how the assumption "failure is expensive" should change our core processes. We didn't take that opportunity and extracted nothing from it. The consumer software industry pays no attention to the risk of errors; in fact, we positively embrace mistakes. If the outside world wants higher-quality code in particular domains, it will have to regulate those domains. It wouldn't be the first time: Sarbanes-Oxley and HIPAA are examples of regulation imposed on software built under the website economic model. Regulation is not sufficient, but it may prove necessary.

That is what we mean when we say: "I don't quite know how to put it, but our entire field is bad at what we do, and if you rely on us, everyone will die." Our industry took shape in an environment where failure is cheap, and we are rewarded for moving fast. When a redo is impossible or too expensive, our usual processes serve us poorly.
