What to think about when setting up on-call duty

Ryn Daniels, author of Effective DevOps, shares strategies that anyone can use to create better, less disruptive, and more sustainable on-call rotations.



With the rise of DevOps, many engineers now take part in some kind of on-call rotation, which was once the sole responsibility of system administrators or operations engineers. Being on call, especially during off-hours, is not something most people enjoy. On-call duty can disrupt our sleep, interfere with the regular work we try to do during the day, and intrude on our lives in general. As more and more teams take on on-call duty, it is worth asking: “What can we, as individuals, teams, and organizations, do to make on-call more humane and sustainable?”

Protecting sleep


Often the first thing people think of when on-call comes up is the damage to their sleep; no one wants an alert to wake them in the middle of the night. If your organization or team is large enough, you can use a “follow-the-sun” rotation, in which teams in several time zones share a single rotation with shorter shifts, so that each time zone is on call only during its working (or at least waking) hours. Setting up such a rotation can do wonders for reducing the overnight load on on-call engineers.
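
As a rough illustration of the idea, a follow-the-sun schedule can be reduced to a lookup from the current UTC hour to the site that is awake. This is only a sketch: the three sites and their shift windows below are hypothetical and are not taken from any real rotation.

```python
from datetime import datetime, timezone

# Hypothetical sites and the UTC hours each one covers
# (start inclusive, end exclusive). The three 8-hour windows
# together cover the whole day, and each falls roughly within
# that site's local daytime.
SHIFTS = [
    ("Singapore", 0, 8),
    ("Dublin", 8, 16),
    ("San Francisco", 16, 24),
]

def on_call_site(now: datetime) -> str:
    """Return the site whose shift window contains the given time.

    `now` should be timezone-aware; it is converted to UTC first.
    """
    hour = now.astimezone(timezone.utc).hour
    for site, start, end in SHIFTS:
        if start <= hour < end:
            return site
    raise ValueError(f"no site covers hour {hour} UTC")
```

With this layout, an alert at 03:00 UTC goes to someone in Singapore who is already at work, rather than waking an engineer in Dublin or San Francisco.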

If you do not have enough engineers, or enough geographic spread, for a “follow-the-sun” rotation, there is still plenty you can do to reduce the chance that people are woken up unnecessarily. After all, it is one thing to get out of bed at 4 a.m. to fix an urgent customer-facing problem; it is quite another to wake up only to discover a false alarm. It helps to review every alert you have configured and ask your team which ones really need to wake someone up after hours and which can wait until morning. It can be hard to get people to agree to turn off non-actionable alerts, especially if missed alerts have caused problems in the past, but it is important to remember that a sleep-deprived engineer is not the most effective engineer. Restrict alerts that can wait to business hours, and page after hours only for the ones that truly matter. Most alerting tools these days let you set different notification rules for off-hours, whether through Nagios notification periods or separate schedules in PagerDuty.
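
The routing decision described above can be sketched in a few lines. This is not the Nagios or PagerDuty configuration itself, only the logic those tools let you express; the 09:00 to 18:00 weekday window, the `urgent` flag, and the two return values are assumptions made for illustration.

```python
from datetime import datetime, time

# Assumed business hours: weekdays, 09:00 to 18:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

def in_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END

def route_alert(urgent: bool, now: datetime) -> str:
    """Page immediately if the alert is urgent or someone is at work anyway;
    otherwise hold it until the next business day."""
    if urgent or in_business_hours(now):
        return "page"
    return "hold_until_morning"
```

The useful part of the exercise is deciding, alert by alert, what `urgent` should be: that is the conversation to have with the team.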

Sleep, on-call, and team culture


Other ways to address sleep disruption involve more significant cultural change. One is to track alerts, paying particular attention to when they arrive and whether they are actionable. Opsweekly is a tool created and open-sourced by Etsy that lets teams track and categorize the alerts they receive. It can generate graphs showing how many alerts woke people up (using sleep data from fitness trackers) and how many actually required action from a person. With this kind of tooling, you can track the effectiveness of your on-call rotation and its effect on sleep over time.

The team can play a role in making sure each on-call engineer gets enough rest. Create a culture that encourages people to take care of themselves: if you lose sleep because you were paged at night, sleep in a little the next morning to make up for it. Team members can also look out for each other: when teams share their sleep data through something like Opsweekly, they can go to an on-call colleague and say, “Hey, it looks like you had a rough night with PagerDuty. Do you want me to cover for you tonight so you can get some rest?” Encourage people to support each other this way, and discourage a “hero culture” in which people push themselves to the limit rather than ask for help.

Reducing the impact of on-call on daytime work


When engineers are tired because they were woken up while on call, they obviously will not work at 100% capacity during the day. But even setting sleep loss aside, on-call has other costs for daytime work. One of the most significant is interruption and context switching: a single interruption can cost at least 20 minutes through lost focus and the time needed to switch context back. Your teams probably have other sources of interruptions as well, such as tickets from other teams and requests or questions that arrive via chat or email. Depending on their volume, you might fold these interruptions into the existing on-call rotation or set up a second rotation just to handle them.

It is important to take this into account when planning the team's work, both long-term and short-term. If your team tends to have fairly intense on-call shifts, factor that into long-term planning, since someone on the team is effectively doing on-call at any given time instead of other work. In short-term planning, you may find that the on-call engineer cannot meet deadlines because of their on-call duties. This is to be expected, and the rest of the team should be ready to adapt and help so that the work gets done and the on-call engineer is supported. Whether or not the on-call engineer actually gets paged, a shift will affect their ability to do other work; do not expect them to work nights to finish planned projects on top of their on-call duties.

Teams also need a way to handle the extra work that on-call generates. This may be real work to fix real problems detected by monitoring and alerting, or it may be work on the monitoring and alerts themselves to reduce false positives. Whatever its nature, it is important to distribute this work fairly and sustainably across the team. Not all on-call shifts are equal, and some are much harder than others, so a rule that whoever received an alert is responsible for all of its follow-up can lead to an uneven distribution of work. It may make more sense for the on-call engineer to be responsible for scheduling or handing out the work, with the expectation that the rest of the team will help complete it.

Work-life balance


Think about the impact of on-call outside of work. When you are on call, you will probably feel tethered to your phone and laptop, which means either carrying a laptop and a mobile hotspot (or USB modem) everywhere or simply not leaving your home or office. Being on call usually means giving up things like seeing friends or family during your shift. Depending on how many people are on your team, the length and frequency of shifts can become an undue burden. You may need to experiment with shift length and scheduling to find an arrangement that suits at least most of the people involved, since different teams and people have different priorities and preferences.

It is essential to be aware of the impact on-call has on people's lives, both at the management level and at the individual level. That impact is felt more by people with less privilege. For example, if you spend time caring for children or other family members, or if most of the housework falls on your shoulders, you already have less time and energy than someone without those responsibilities. This “second shift” or “third shift” work tends to fall disproportionately on people who already have less privilege, and if you set up an on-call rotation with a schedule or intensity that assumes participants have no life outside the office, you limit who is able to be part of your team.

Encourage people to keep as much of their regular schedule as they can. Consider providing the team with mobile hotspots (or USB modems) so that people can leave the house with their laptop and still have some semblance of a life. Encourage people to swap on-call hours with each other for short periods when needed, so that someone can go to the gym or see a doctor during their shift. Do not create a culture in which being on call means engineers literally do nothing but watch for alerts. Work-life balance is an important part of any job, and it matters even more when off-hours are involved. Senior members of your team should set an example for others by balancing work and personal life as much as possible while on call.

On an individual level, do not forget to explain what on-call means to your friends, family members, partners, pets, and so on (your cats will probably not mind, since they are up at 4 a.m. when the alert arrives anyway, though they will certainly not help you resolve it). Make sure you make up for lost time once your shift is over, whether that means seeing friends and family or simply sleeping. If you can, consider a silent alert (for example, on a smartwatch) that wakes you by buzzing your wrist so that it does not wake anyone around you. Find ways to take care of yourself both during a shift and after it ends. You might put together an “on-call survival kit” that helps you relax: a playlist of your favorite music, a favorite book, or time set aside to play with your pet. Managers should encourage self-care by giving people a day off after a week on call and by making sure people ask for (and receive) help when they need it.


In general, on-call should not be treated as nothing more than a terrible chore. As someone taking part in on-call, you have both the opportunity and the responsibility to actively make it better for the people who will be on call after you, which means fewer alerts, and more accurate ones. Again, tracking the value of your alerts with something like Opsweekly can help you figure out what makes your on-call annoying and fix it. For non-actionable alerts, ask yourself whether there is a way to get rid of them. Perhaps they should fire only during business hours, because there are some things you simply do not need to respond to in the middle of the night. Do not be afraid to delete alerts, change them, or switch their delivery from “phone and email” to “email only”. Experimentation and iteration are the key to improving on-call over time.

For alerts that are genuinely actionable, consider how easy it is for the engineer to carry out the necessary actions. Every actionable alert should come with a runbook; consider using a tool like nagios-herald to add runbook links to your alerts. If an alert is so simple that it does not need a runbook, it is probably simple enough that you can automate the response with something like Nagios event handlers, which saves people from being woken up or interrupted for easily automated tasks. Both runbooks and nagios-herald can help you add valuable context to your alerts, which helps people respond to them more effectively. See whether you can answer common questions such as: When did this alert last fire? Who responded last time, and what actions did they ultimately take (if any)? What other alerts are firing at the same time, and are they related? This kind of context often lives only in people's heads, so encouraging a culture of documenting and sharing it can reduce the overhead of responding to alerts.
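
If alert history is recorded somewhere, the questions above become simple lookups. The sketch below assumes a hypothetical in-memory list of `AlertEvent` records with made-up field names; it is not the data model of nagios-herald or Opsweekly, and the 10-minute window for "firing at the same time" is an arbitrary choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class AlertEvent:
    name: str
    fired_at: datetime
    responder: str
    action_taken: str  # empty string if no action was needed

def last_occurrence(history: List[AlertEvent], name: str,
                    before: datetime) -> Optional[AlertEvent]:
    """Most recent earlier firing of this alert: when it fired,
    who responded, and what they did. None if it never fired before."""
    earlier = [e for e in history if e.name == name and e.fired_at < before]
    return max(earlier, key=lambda e: e.fired_at, default=None)

def fired_nearby(history: List[AlertEvent], name: str, at: datetime,
                 window: timedelta = timedelta(minutes=10)) -> List[str]:
    """Names of other alerts that fired within `window` of the given time."""
    return sorted({e.name for e in history
                   if e.name != name and abs(e.fired_at - at) <= window})
```

Attaching the results of lookups like these to the alert itself means the person paged at 4 a.m. does not have to go digging for them.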

A significant part of on-call fatigue comes from the fact that it never ends: if your team has an on-call rotation, it is unlikely to go away in the foreseeable future. On-call goes on indefinitely, and it can feel as though it will always be terrible. This lack of hope is a serious mental burden that contributes to stress and burnout, so addressing the perception (as well as the reality) that on-call will always be terrible is a good place to start when thinking about on-call over the long term.

To give people hope that on-call can actually improve, you need an observable system (the same tracking and categorization of alerts mentioned earlier). Keep track of how many alerts you get, what percentage of them require action from the on-call engineer, and how many of them wake people up, and then work to create a culture that encourages people to make things better. On a large team, it can be tempting, as your shift is about to end, to give up and say “that's the next on-call engineer's problem” instead of digging in to fix something. Who wants to put more effort into on-call than is required of them? This is where a culture of empathy can make a big difference, because you care not only about your own well-being on call, but also about your colleagues'.
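
The three numbers mentioned above are easy to compute once each alert is categorized. The following is a minimal sketch, assuming a hypothetical `Alert` record with two boolean flags filled in by whoever handled the shift; real tools such as Opsweekly store richer data.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Alert:
    woke_someone: bool   # did the alert wake the on-call engineer?
    needed_action: bool  # did a person actually have to do something?

def summarize(alerts: List[Alert]) -> Dict[str, int]:
    """Total alerts, how many woke someone, and the share that were actionable."""
    total = len(alerts)
    actionable = sum(a.needed_action for a in alerts)
    return {
        "total": total,
        "woke_someone": sum(a.woke_someone for a in alerts),
        # Rounded to a whole percent; 0 when there were no alerts at all.
        "actionable_pct": round(100 * actionable / total) if total else 0,
    }
```

Running this per shift and comparing the summaries from one rotation to the next gives a team visible evidence of whether its alert cleanup work is paying off.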

It's all about empathy


Empathy is an important part of what lets us encourage the work that improves the on-call experience. As a manager or a teammate, you can recognize or even reward people for behavior that makes on-call better. Operations is one of those fields where engineers often feel that people notice them only when something goes wrong: people will be there to yell at them when the site goes down, but rarely learn about the behind-the-scenes effort operations engineers put into keeping the site running the rest of the time. Recognizing that work can mean a great deal, whether it is thanking someone in a meeting or in a team-wide email for improving a specific alert or some technical aspect of on-call, or giving someone time off in return for covering part of another engineer's shift.

Encourage people to spend time and effort on improving on-call over the long term. If your team has an on-call rotation, plan and prioritize this work as you would any other item on your roadmap. On-call is 90% entropy: if you do not actively work to improve it, it will get worse over time. Work with your team to find out what motivates people, and then use it to encourage them to reduce alert noise, write runbooks, and build tools that solve their on-call problems. Whatever you do, do not accept terrible on-call as an unchangeable fact of life.
