Post Mortems - Bringing clarity to incident reviews
An incident post mortem is known by many names: incident review, root cause analysis (RCA), learning review. But what does it entail? A post mortem is a post-incident activity that helps an organization understand how the incident happened and learn from it. Service incidents are an unavoidable hurdle for any company, and when they do happen, the teams involved are wholly focused on restoring service as quickly as possible.
Most of the time, they do not have time to find the root of the problem because they are focused on fixing the problem at hand. Most organizations therefore hold post-mortem meetings to assess what happened once service has been restored. An incident's resolution is only complete once the entire team has participated in the post-mortem meeting.
The entire team needs to participate in post mortems because the people who worked on the frontlines of incident management can share insights from their experience and increase the collective knowledge of the team. Post mortems tend to be structured differently depending on the organization, and there is no single playbook for running them efficiently. Teams need to view post mortems as a learning exercise to understand what they must do to avoid a similar issue in the future.
Here are some guidelines to help your company conduct streamlined post mortems:
Document Everything: A post mortem is not just a meeting where people talk; it is a record of everything that was done during an incident. The documentation from these meetings serves as a reference for future teams, helping them resolve downtime when it happens. If all the steps taken during the incident are documented immediately after it ends, they will serve as a game plan for engineers who face a similar incident down the line.
Devise a system to ensure that this documentation is available to the entire team for transparency and preparedness. This also supports effective stakeholder communication.
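As a rough illustration of what such a record might look like, here is a minimal Python sketch of a post-mortem document with a timeline of the steps taken. The class names, field names and text layout are assumptions made for this example; they are not a standard format or any particular tool's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One step taken during the incident (hypothetical structure)."""
    at: datetime
    actor: str
    action: str


@dataclass
class PostMortem:
    """A minimal post-mortem record; all field names are illustrative."""
    title: str
    summary: str
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, at: datetime, actor: str, action: str) -> None:
        """Record a step; entries may be added in any order."""
        self.timeline.append(TimelineEntry(at, actor, action))

    def render(self) -> str:
        """Return a plain-text summary with the timeline in chronological order."""
        lines = [self.title, "", self.summary, "", "Timeline:"]
        for entry in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"{entry.at:%Y-%m-%d %H:%M} {entry.actor}: {entry.action}")
        return "\n".join(lines)
```

Calling `render()` produces text that can be pasted into whatever shared space the team already uses, which keeps the record available to everyone without requiring a dedicated tool.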
Foster Blamelessness: Industry leaders like Google discuss the importance of blamelessness during post mortems extensively in their Site Reliability Engineering handbook. Post mortems are time-consuming and should be free of unnecessary blame directed at individuals who may or may not be at fault. Assigning blame creates an environment of fear in which people are unwilling to come forward and explain where they went wrong. Studies have shown that this negatively impacts the growth of an organization in the long run.
Blamelessness encourages team members to talk about the actions they took and the assumptions they made, and to help establish the timeline of events. It also builds trust among team members, so that they can openly talk about their concerns without fearing repercussions. A trusting team contributes to the strength of the system as a whole.
Track Post Mortems: As companies perfect their incident review game, they build up a database of valuable documentation for future incident response professionals. This can benefit current teams as well, especially if the tracking system has been set up for quick access during times of crisis.
Over time, executives can review these post mortems to identify patterns that point to potential weaknesses in the system.
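One simple way to look for such patterns, sketched here in Python, is to tag each tracked post mortem with its contributing factors and count how often each factor recurs. The record layout, the sample entries and the function names below are all made up for illustration; a real tracking system would have its own schema.

```python
from collections import Counter

# Hypothetical tracked post mortems; the IDs, services and factors are sample data.
post_mortems = [
    {"id": "PM-1", "service": "checkout", "factors": ["config change", "missing alert"]},
    {"id": "PM-2", "service": "search", "factors": ["config change"]},
    {"id": "PM-3", "service": "checkout", "factors": ["missing alert", "expired certificate"]},
    {"id": "PM-4", "service": "login", "factors": ["config change"]},
]


def recurring_factors(records, min_count=2):
    """Return (factor, count) pairs seen in at least min_count post mortems."""
    # set() ensures a factor is counted once per post mortem.
    counts = Counter(f for record in records for f in set(record["factors"]))
    return [(f, n) for f, n in counts.most_common() if n >= min_count]


def find_by_service(records, service):
    """Quick lookup of past post mortems for a service during an incident."""
    return [record for record in records if record["service"] == service]
```

With the sample data above, `recurring_factors(post_mortems)` returns `[("config change", 3), ("missing alert", 2)]`, and `find_by_service(post_mortems, "checkout")` returns the records PM-1 and PM-3. The lookup covers the quick-access case during an incident, while the factor count is the kind of summary a periodic review could start from.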
Every technology worker knows that downtime is unavoidable. It comes when least expected and leaves a trail of chaos behind. Post mortems help teams examine incidents calmly, without the stress of an ongoing incident. There are dozens of publications and handbooks by industry leaders to help organizations with best practices in post-incident reviews.
Zenduty is a cutting-edge incident management platform designed by developers with the well-being of engineers in mind. Sign up for free here.
Deepak Kumar
Hybrid Cloud Engineering.