I am often asked how I always seem calm and poised during incidents. That image is not entirely accurate: in these scenarios I am more like a duck, calm above the water but paddling frantically beneath it. Still, I have learned some tricks that help me stay calm and drive incidents to resolution.
Tip # 1: Checklists
For some reason, when our adrenaline starts pumping we lose the ability to think straight. I keep a checklist of the mandatory items in front of me, so I do not have to remember them.
Tip # 2: Drills
This is my military service talking, but I like drills because they free you from having to think: things become instinctual. I have run exercises in the past for the HipChat DevOps team to great success, and I will start the practice for Stride Major Incident Managers (MIMs) as we distribute this task across the organization. Drills should cover standard protocol processes only. These should be second nature for a MIM. If you do not have to think about the process during an incident, you can focus your energy on restoring the service. Here are the things that I drill the teams on:
- Establish Comms
- Update Statuspage
- Send initial M-Team Comms
- Assess impact
- Assign a tech lead
- Quiz on the intervals for updating the M-Team and Statuspage
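A checklist like this can live on paper, but it can also be a small script the MIM keeps open during an incident or a drill. The following is a minimal sketch, not a description of any tool actually used by the teams mentioned here: the `IncidentChecklist` class and its method names are hypothetical, and the steps are simply the first five drill items above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Standard steps, copied from the drill list above.
STANDARD_STEPS = [
    "Establish Comms",
    "Update Statuspage",
    "Send initial M-Team Comms",
    "Assess impact",
    "Assign a tech lead",
]


@dataclass
class IncidentChecklist:
    """Hypothetical helper: tracks which mandatory steps are done."""

    steps: List[str] = field(default_factory=lambda: list(STANDARD_STEPS))
    completed: Dict[str, datetime] = field(default_factory=dict)

    def complete(self, step: str, at: Optional[datetime] = None) -> None:
        """Mark a step as done and remember when it was done."""
        if step not in self.steps:
            raise ValueError(f"Unknown step: {step}")
        self.completed[step] = at or datetime.now(timezone.utc)

    def remaining(self) -> List[str]:
        """Return the steps still open, in their original order."""
        return [s for s in self.steps if s not in self.completed]

    def next_step(self) -> Optional[str]:
        """Return the first open step, or None when everything is done."""
        left = self.remaining()
        return left[0] if left else None
```

In a drill, you would create `IncidentChecklist()`, call `complete("Establish Comms")` as each step is finished, and glance at `next_step()` instead of relying on memory.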
Tip # 3: Stop the chaos
Managing change during an incident is crucial. We must know what was changed, when, and by whom. Set the ground rules with the incident team: all changes must go through you. You will decide what to do and in which order. Of course, you will take input from the group, but you are the Incident Manager.
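One way to keep track of what changed, when, and by whom is a simple change log that refuses entries the Incident Manager has not approved. This is only an illustrative sketch under that assumption: `ChangeLog`, `Change`, and the `approved_by` parameter are hypothetical names, and in practice a shared chat channel or ticket may serve the same purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Change:
    at: datetime  # when the change was made (UTC)
    who: str      # who made it
    what: str     # what was changed


class ChangeLog:
    """Hypothetical log: every change must go through the Incident Manager."""

    def __init__(self, incident_manager: str) -> None:
        self.incident_manager = incident_manager
        self._changes: List[Change] = []

    def record(self, who: str, what: str, approved_by: str) -> Change:
        """Record a change, rejecting it unless the Incident Manager approved."""
        if approved_by != self.incident_manager:
            raise PermissionError(
                f"Changes must be approved by {self.incident_manager}"
            )
        change = Change(datetime.now(timezone.utc), who, what)
        self._changes.append(change)
        return change

    def timeline(self) -> List[str]:
        """Return the changes in the order they were made, one line each."""
        return [f"{c.at:%H:%M:%S}Z {c.who}: {c.what}" for c in self._changes]
```

The point of the design is the single gate in `record`: an unapproved change never reaches the timeline, so the log always reflects decisions the Incident Manager actually made.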
Tip # 4: Focus on restoring service
Be quick to revert any recent changes that may be related to the incident. Engineers tend to want to fix the problem for good, but that could take much longer than a revert.
Tip # 5: Think creatively
- Turn off feature flags that could shed load from the problem areas.
- Remove instances manually from ELBs to limit upstream requests.
- Eliminate possibilities – try to identify ways to eliminate large sections of the systems to allow the team to focus troubleshooting efforts in the right areas.
- Divide and conquer – Assign owners to separate subcomponents (database, vendors, frontend, backend, proxies).
Tip # 6: Take charge
Remember, the organization is looking to you to run the incident. If people see that you are not doing that, someone else will step in and take over. Recognize this and take ownership back. You cannot be a passive bystander.