Incident management for remote/WFH teams

As the world tries to battle COVID-19, most of our customers here at Zenduty have started implementing social distancing measures within their companies by asking all their employees, including the NOC, SRE, ITOps, Support, and software engineering teams, to work remotely or from home. While that may appear to be a drastic change in your day-to-day operations, it need not disrupt your reliability and support operations. With the right incident response practices, teams can be just as effective remotely as they are on-site. Here are some tips to supercharge your incident response with Zenduty if you’re part of a remote operations team.

Tip 1: Set up an incident command channel

Since everybody is already communicating remotely on Slack or Teams, your team chat application is the obvious choice for the “command center” of incidents. Depending on the service or component affected, or on the severity of the incident, you can set up outgoing integrations with Slack/Teams and send your incident data to a specific channel; let’s call it the “incidents” channel. You can also go for the multi-channel (one-channel-per-incident) configuration if you’re worried about noise in your channels. Once an incident is triggered, the on-call engineer acknowledges it, looks at the incident context and, if necessary, adds responders through the Zenduty bot. Zenduty then notifies the responders via SMS/IVR/Push/Email/Slack/Teams and adds them to the channel automatically.
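To make the channel choice concrete, here is a minimal Python sketch of the routing decision described above. It does not use Zenduty's API or its integration settings; the `Incident` class, the channel names and the `pick_channel` function are all hypothetical and only illustrate how service, severity and the one-channel-per-incident option could determine where incident data goes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    number: int
    service: str
    severity: str  # e.g. "critical", "high", "low" (illustrative values)


# Hypothetical mapping from affected service to a dedicated chat channel.
SERVICE_CHANNELS = {
    "payments": "#incidents-payments",
    "database": "#incidents-db",
}
DEFAULT_CHANNEL = "#incidents"
CRITICAL_CHANNEL = "#incidents-critical"


def pick_channel(incident: Incident, one_channel_per_incident: bool = False) -> str:
    """Return the chat channel that should receive this incident's data."""
    if one_channel_per_incident:
        # Multi-channel mode: every incident gets its own channel.
        return f"#incident-{incident.number}"
    if incident.severity == "critical":
        # Severity takes precedence over the affected service.
        return CRITICAL_CHANNEL
    # Otherwise route by service, falling back to the shared channel.
    return SERVICE_CHANNELS.get(incident.service, DEFAULT_CHANNEL)
```

In this sketch severity is checked before the service mapping, so a critical payments incident lands in the critical channel rather than the payments one. That ordering is a design choice for the example, not a statement about how any particular tool behaves.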

Tip 2: Set up incident roles

If you haven’t already done so, now would be a good time to experiment with setting up an incident command system, a practice inspired by the US National Incident Management System (NIMS) and followed by the SRE teams at Google, Dropbox, Fastly, GitHub, New Relic, Stripe and thousands of other experienced SRE teams globally. The goal of having incident roles is a recursive separation of responsibilities.

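“It’s important to make sure that everybody involved in the incident knows their role and doesn’t stray onto someone else’s turf. Somewhat counterintuitively, a clear separation of responsibilities allows individuals more autonomy than they might otherwise have, since they need not second-guess their colleagues.” (Google’s SRE handbook)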

Roles may include an Incident Commander (leads the incident), a Communications Lead (talks to stakeholders), an Operations Lead (RCA, changes), a DB Lead, an Infra Lead, and so on. Setting up roles helps you delegate specific tasks effectively and increases transparency into your response operations. You can define the incident roles in Zenduty within your Team pages.

Team ICS roles

Tip 3: Set up task templates/playbooks

Most teams cover 40–80% of their incidents in their playbooks, which they host on a team wiki, in git, or in a tool like Confluence. A major challenge with remote teams is ensuring that responders follow the playbooks to a T. Take a good look at your incident playbook document and create a task template comprising discrete, role-mapped tasks. Each playbook you have can be converted into a task template on Zenduty. Once you create the task templates, map them to your services. As incidents start pouring in, Zenduty will automatically take the tasks from the mapped task template and append them to the incident’s tasks. As the incident commander assigns roles to the various responders, Zenduty will assign the role-mapped tasks to the respective responders.

Task templates can dramatically increase your incident preparedness, decrease on-call anxiety, reduce RCA/response errors and improve response times.
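The flow from playbook to role-mapped tasks can be pictured with a small Python sketch. This is an illustration only, not Zenduty's data model or API: the `Task` class, the example template, the `SERVICE_TEMPLATES` mapping and both functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    title: str
    role: str                       # the incident role this task belongs to
    assignee: Optional[str] = None  # filled in once the role is assigned


# Hypothetical task template derived from a database-outage playbook.
DB_OUTAGE_TEMPLATE = [
    ("Declare the incident and open the command channel", "Incident Commander"),
    ("Post a first update to stakeholders", "Communications Lead"),
    ("Check replication lag and failover status", "DB Lead"),
]

# Hypothetical mapping of services to their task templates.
SERVICE_TEMPLATES = {"database": DB_OUTAGE_TEMPLATE}


def tasks_for_service(service: str) -> list:
    """Create fresh, unassigned tasks from the template mapped to a service."""
    return [Task(title, role) for title, role in SERVICE_TEMPLATES.get(service, [])]


def assign_role(tasks: list, role: str, responder: str) -> list:
    """Give every task mapped to `role` to the responder who now holds it."""
    for task in tasks:
        if task.role == role:
            task.assignee = responder
    return tasks
```

For example, calling `tasks_for_service("database")` when an incident opens and then `assign_role(tasks, "DB Lead", "priya")` leaves the DB Lead task assigned to the (made-up) responder `priya`, while the other tasks stay unassigned until their roles are filled.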

Tip 4: Comms integrations (Jira, Zoom, Statuspage)

In the confusion of a major incident, setting up Jira tickets, Zoom rooms and status updates might not be a priority. Adding outgoing integrations on Zenduty with Jira, Zoom, and Statuspage helps reduce that friction. Once you set up these outgoing integrations, Zenduty will automatically create a linked (two-way) Jira ticket, create a Zoom room and update your Statuspage. The links to these comms channels will appear on your incident page as well as in your Slack messages.

Tip 5: Set up Alert Rules

Remote on-call teams will invariably face challenges when it comes to finding the right person or team to triage or take on an incident. One way to avoid such scenarios is to use Alert Rules to override the escalation policies or to assign specific subject matter experts to an incident, depending on the nature or source of the alert. Alert Rules can also help you suppress incidents.
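As a rough picture of how such rules behave, the Python sketch below evaluates an alert against an ordered list of rules: the first matching rule either suppresses the alert or assigns it to a named expert, and alerts that match nothing fall through to the default escalation policy. None of this is Zenduty's rule syntax or API; `Alert`, `Rule`, `Decision`, `evaluate` and the example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Alert:
    source: str
    message: str


@dataclass(frozen=True)
class Rule:
    matches: Callable[[Alert], bool]  # condition on the alert
    action: str                       # "suppress" or "assign"
    target: Optional[str] = None      # who to assign to, if any


@dataclass(frozen=True)
class Decision:
    action: str
    target: Optional[str]


# Hypothetical rules, checked in order; the first match wins.
RULES = [
    Rule(lambda a: "test" in a.message.lower(), "suppress"),
    Rule(lambda a: a.source == "database-monitor", "assign", "db-expert"),
]


def evaluate(alert: Alert, rules=RULES,
             default_policy: str = "default-escalation-policy") -> Decision:
    """Return what to do with an alert: suppress, assign, or escalate."""
    for rule in rules:
        if rule.matches(alert):
            return Decision(rule.action, rule.target)
    # No rule matched: fall back to the normal escalation policy.
    return Decision("escalate", default_policy)
```

Because the rules are evaluated in order, a test alert from the database monitor would be suppressed rather than assigned in this sketch; reordering the list changes that behavior.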

Bringing all of this together

The above tips can help remote teams supercharge their incident response, improve incident preparedness and give your team the confidence to handle critical incidents. While working remotely in an SRE, ITOps, NOC, or Support role might be overwhelming at first, rest assured that it gets better with every passing day.

As the world gears up towards fighting COVID-19, we wish the patients a speedy recovery, and we continue to be inspired by our healthcare workers and by others who are caring for people around the world.

Alka Gupta

Lover of all things organic - digitally and otherwise! Founding team @Zenduty. Taekwondo Black Belt, Potter, and a coffee connoisseur.