Incident management is more important than ever. However, many IT teams still don’t follow proven best practices and often fall behind. A high-performing incident response team needs to stay on top of its monitoring, on-call management, and escalation policies to ensure the best possible performance.
Let’s look at the best practices you can apply to improve your team’s incident management performance today.
Starting with monitoring
Monitoring is naturally the first part of any incident management process. Whether you use only uptime monitoring or go with full server monitoring as well, having a reliable monitoring service or tool is absolutely vital. Since the monitoring tool spots and verifies website incidents, its importance should not be underestimated. Picking the right tool should be a clear focus of any incident response team.
The monitoring tool you choose should have an incident verification feature, an option to change check frequencies, and configurable alert thresholds. What do those features mean? Incident verification means that the monitoring tool checks each incident for false positives, so that no false or duplicate alerts get created. The check frequency is how often the monitor checks a specific website. The alert thresholds are the conditions under which a given monitor triggers an alert.
Having those capabilities helps your incident response team significantly, as they can customize exactly what kinds of situations trigger an incident. Furthermore, they can easily choose which websites are under strict monitoring with a very short check frequency. The industry standard for that is 30 seconds. This, together with appropriate alerting, ensures that no time is wasted between the moment users or customers start experiencing the incident and the moment the incident is created.
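To make these three features more concrete, here is a minimal Python sketch of how they might fit together. It is not taken from any particular monitoring product: the `MonitorConfig` and `Monitor` names, the `probe` callback, and the default of three consecutive failures are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MonitorConfig:
    url: str
    check_interval_s: int = 30   # check frequency: how often a scheduler would call run_check()
    failures_to_alert: int = 3   # alert threshold (hypothetical): consecutive failed checks required


class Monitor:
    def __init__(self, config: MonitorConfig, probe: Callable[[str], bool]) -> None:
        self.config = config
        self.probe = probe             # hypothetical callback; returns True when the site looks healthy
        self.consecutive_failures = 0
        self.incident_open = False

    def run_check(self) -> bool:
        """Run one check and return True only when a new alert should be sent."""
        if self.probe(self.config.url):
            # A healthy response resets the counter and closes any open incident.
            self.consecutive_failures = 0
            self.incident_open = False
            return False

        self.consecutive_failures += 1
        # Incident verification: a single failed check is treated as a possible
        # false positive, so the alert fires only once the threshold is reached.
        verified = self.consecutive_failures >= self.config.failures_to_alert
        if verified and not self.incident_open:
            self.incident_open = True  # remember the incident so no duplicate alert is created
            return True
        return False
```

With these assumed defaults, an outage would be confirmed after three failed checks, roughly 90 seconds at a 30-second interval. Lowering either value shortens detection time at the cost of more false alarms.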
Mastering on-call management
On-call management is a heavily discussed topic. However, there is no one-size-fits-all solution, mainly because of the large number of variables that affect each individual company. Those factors include, among others, team size, team locations, individual preferences, and team member abilities. Team size is the most important factor in on-call management. On-call rotations (the pre-set, repeating on-call schedules for each team) look very different depending on the team size. In a team of two or three, the “every other day” rotation is very popular and is considered a best practice. It means, for example, that one person on the team takes Monday, Wednesday, and Friday while the other person takes Tuesday, Thursday, and Saturday. Sunday on-call duty then alternates between them from week to week.
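The “every other day” rotation described above can be sketched in a few lines of Python. This is only an illustration for a two-person team. The `on_call_for` function name and the use of a running week count to alternate Sundays are assumptions, not something prescribed by any tool.

```python
from datetime import date


def on_call_for(day: date, team: tuple[str, str]) -> str:
    """Return who is on call on `day` in a two-person every-other-day rotation."""
    first, second = team
    weekday = day.weekday()  # Monday == 0 ... Sunday == 6

    if weekday == 6:
        # Sundays alternate week by week. toordinal() counts days from a fixed
        # start date, so the quotient by 7 grows by one every week and its
        # parity flips on each consecutive Sunday, including across year ends.
        week_index = day.toordinal() // 7
        return first if week_index % 2 == 0 else second

    # Monday, Wednesday, Friday go to the first person;
    # Tuesday, Thursday, Saturday go to the second.
    return first if weekday % 2 == 0 else second
```

Calling `on_call_for(date(2024, 1, 1), ("alice", "bob"))` (a Monday, with hypothetical names) would return the first person in the tuple.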
In larger teams, there is much more flexibility. Rotations can even be hourly, so that no one has to be on call for more than 8 to 12 hours. On the other hand, when incident management is handled by an individual developer, there is no flexibility. In those single-founder or single-developer situations, the person must choose selectively which incidents to be alerted to, in order to avoid alert fatigue while still maintaining a reasonable MTTR. Another major part of creating the best possible on-call schedule is the location of the individual team members. Why? When the on-call team works remotely from different locations all over the world, there may be an option to get rid of night shifts altogether. In this regard, remote work is a great benefit that makes incident management easier.
Follow-the-sun approach
Specifically, having team members in different time zones can help the on-call manager create a schedule in which no one has a night shift, because a colleague in a different time zone takes over. This is called the follow-the-sun approach. It can be very beneficial in creating a high-performing on-call team. Lastly, individual team members’ preferences and their respective abilities play a role. Preferences must be taken into account, as some people on the on-call team may be happy to do night shifts, while others might be on board with doing early morning ones.
Checking in with the team helps significantly in building the most personalized on-call schedule possible and in ensuring that everyone performs to the best of their abilities. It must be noted that in some cases not all team members possess the same set of abilities, which means that they won’t always be able to solve incidents on their own. Because of this, the best incident management practices include proper incident escalation policies.
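The follow-the-sun approach from this section can be sketched as a simple lookup: given the current UTC time, pick a team member for whom it is currently daytime. The sketch below is illustrative only. The `Engineer` type, the fixed UTC offsets (which ignore daylight saving time), and the 08:00 to 16:00 local shift window are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Engineer:
    name: str
    utc_offset_hours: int  # simplification: fixed offset, no daylight saving time


def local_hour(engineer: Engineer, now_utc: datetime) -> int:
    """Hour of the day in the engineer's own time zone."""
    tz = timezone(timedelta(hours=engineer.utc_offset_hours))
    return now_utc.astimezone(tz).hour


def follow_the_sun(team: list[Engineer], now_utc: datetime,
                   day_start: int = 8, day_end: int = 16) -> Optional[Engineer]:
    """Return the first engineer whose local time falls inside the daytime shift."""
    for engineer in team:
        if day_start <= local_hour(engineer, now_utc) < day_end:
            return engineer
    return None  # coverage gap: nobody is in daytime hours right now
```

With three hypothetical engineers at UTC+1, UTC-7, and UTC+9, the three eight-hour daytime windows cover all 24 hours, so nobody needs a night shift. A `None` result would signal a gap that still needs a regular night rotation.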
Creating the right escalation policies
Escalation policies are basically runbooks that the team or the incident management tool executes when the original on-call person doesn’t acknowledge the incident alert in time, or when the on-call person is not able to solve the incident alone and needs assistance from a colleague. The first type of escalation based on the need for assistance is seniority-based escalation. This is usually the default escalation policy for most incident response teams. Seniority-based means that the team member who is next in line by seniority is alerted. If that person is unable to solve the issue, it is escalated again.
If you have a more complex system setup, seniority-based escalation might not be the best practice. You should instead opt for function-based escalation. This means that the person who has the necessary skills is alerted to help with the incident. Depending on the size of your operations and how far you want to take your incident management, you can even combine the two to streamline the process further.
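The difference between the two escalation types, and one way of combining them, can be illustrated with a short Python sketch. The `Responder` type, the numeric `seniority` field, and the skill labels are hypothetical. Real incident management tools model this in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    seniority: int                    # assumption: a higher number means more senior
    skills: frozenset = frozenset()   # assumption: free-form skill labels


def seniority_chain(team: list[Responder], current: Responder) -> list[Responder]:
    """Seniority-based: everyone above the current responder, next in line first."""
    more_senior = [r for r in team if r.seniority > current.seniority]
    return sorted(more_senior, key=lambda r: r.seniority)


def function_chain(team: list[Responder], current: Responder,
                   required_skill: str) -> list[Responder]:
    """Function-based: colleagues who have the skill the incident calls for."""
    return [r for r in team if r is not current and required_skill in r.skills]


def combined_chain(team: list[Responder], current: Responder,
                   required_skill: str) -> list[Responder]:
    """One possible combination: filter by skill, then order by seniority."""
    return sorted(function_chain(team, current, required_skill),
                  key=lambda r: r.seniority)
```

The combined variant shown here is only one possible design. A team could just as well try the function-based chain first and fall back to the seniority-based chain when nobody has the required skill.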
Now, for cases in which the regular on-call person doesn’t acknowledge the incident within the preset time, what is called automated escalation is necessary. Automated escalation should be set up in any on-call team, as it ensures that incidents get solved as fast as possible. Automated escalation can follow the seniority-based or function-based escalation policies mentioned above. However, it should be balanced with regard to response time. What does this mean? It means that the automated escalation must be set up so that the on-call person has enough time to acknowledge the incident, but not so much time that the incident goes unnoticed for a significant stretch.
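As a rough illustration of automated escalation, the sketch below works out who should currently be alerted, based on how long the incident has gone unacknowledged. The `who_to_alert` function and the five-minute timeout used in the note afterwards are assumptions, not recommendations from any specific tool.

```python
from datetime import timedelta
from typing import Optional


def who_to_alert(chain: list[str], elapsed: timedelta, ack_timeout: timedelta,
                 acknowledged: bool = False) -> Optional[str]:
    """Return the person in `chain` who should be alerted right now.

    `chain` is the ordered escalation list (for example, seniority-based),
    `elapsed` is the time since the first alert, and `ack_timeout` is how long
    each person gets to acknowledge before the alert moves on.
    """
    if acknowledged:
        return None  # someone has taken the incident; stop escalating
    if not chain:
        raise ValueError("escalation chain must not be empty")
    if ack_timeout <= timedelta(0):
        raise ValueError("ack_timeout must be positive")

    # Dividing one timedelta by another with // gives the number of
    # full timeouts that have already passed.
    level = max(0, elapsed // ack_timeout)
    # Stay on the last person once the chain is exhausted.
    return chain[min(level, len(chain) - 1)]
```

With a hypothetical five-minute timeout, the first person has the first five minutes to acknowledge, the second person is alerted from minute five, and so on. The timeout is the value to tune for the balance described above.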