Incident management for high-velocity teams
Get to know the incident response lifecycle
Hang around security and incident management pros long enough, and you’ll notice a pattern. The smartest people in these industries think in cycles, not straight lines.
Why is that? What does that even mean? That means every incident and outage isn’t an isolated event with a beginning and end point (though it may seem like that). Incidents are a learning opportunity.
Just because a service is “operational” again, doesn’t mean your team’s work is over. Post-incident activities should have you putting plans on future roadmaps, changing the way you prepare for future incidents, and discovering new things to build which will prevent more incidents in the future. It’s a never-ending cycle of improvement, and there are a few different ways to think about the various stages, depending on what school of thought you subscribe to.
What is an incident response lifecycle?
Incident response is an organization’s process of reacting to IT threats such as cyberattack, security breach, and server downtime.
The incident response lifecycle is your organization’s step-by-step framework for identifying and reacting to a service outage or security threat.
Atlassian’s incident response lifecycle
1. Detect the incident
Our incident detection typically starts with monitoring and alerting tools. Though sometimes we first learn about an incident from customers or team members.
Since incident alerts can originate from different sources, having a solution that integrates a variety of alerting and reporting tools can be the difference between a disjointed, cumbersome response and a cohesive, collaborative one. A solution like Jira Service Management allows teams to customize and filter alerts across all monitoring, logging, and CI/CD tools to ensure teams swarm incidents quickly while avoiding alert fatigue.
2. Set up team communication channels
An important first step is to set up the incident team's communication channels. The goal at this point is to focus team communications in well-known places, such as a dedicated Slack channel and video conference bridge.
Within Jira Service Management, coordinating incident responses can be a smooth process. Not only are teams enabled to communicate in ways that work best for them — like Slack and video conferencing — but communicating with customers becomes easier with automation and customization, too. We'll cover external communication in Step 4.
3. Assess the impact and apply a severity level
Now it’s time to assess the impact of the incident so the team can decide who else to contact and what to communicate with customers and stakeholders. Assigning a severity level not only identifies the impact of the incident, but it also lays the groundwork for resolution plans and external communications. In Jira Service Management, escalating an incident and assigning severity triggers automated actions as well as notifications to responders to stay on top of resolution progress.
4. Communicate with customers
We aim to communicate to stakeholders internally and externally as soon as possible. Communicating quickly and accurately helps build trust with customers and the rest of the organization. Like we mentioned before, the ability to customize the way you communicate empowers your team to work they way they want, facilitating a faster resolution. The ability to customize communication also empowers your team to take control of what message they want to project and when. Moreover, save your team time in the midst of an incident with automated replies from within a ticket, sent directly to the customer.
5. Escalate to the right responders
Initial responders often need to bring other teams into the incident by paging them using an alerting features in Jira Service Management. Bring responders directly to the incident ticket by grouping related tickets and tagging relevant responders directly on the ticket. This way, notifications are coordinated and everyone has the full context.
6. Delegate incident response roles
As additional team members join the response, the incident manager delegates a role to them. This is where it's helpful to have s proper incident response playbook — developed beforehand — that outlines clear roles and responsibilities. Individuals on the incident response team are familiar with each role and know what they’re responsible for during an incident.
7. Resolve the incident
An incident is resolved when the current or imminent business impact has ended. At that point, the emergency response process ends and the team transitions onto any cleanup tasks and the postmortem.
Ideally, your incident management solution is keeping a robust incident timeline — which is the case when using Jira Service Management. Responders can access crucial incident data afterwards and develop a report that helps teams avoid similar incidents in the future and find the root cause. Postmortems can also act as a resource, on the off chance something similar should happen again.
The NIST incident response lifecycle
Another industry standard incident response lifecycle comes from The National Institute of Standards and Technology, or NIST. NIST is a government agency which sets standards and practices around topics like incident response and cybersecurity.
NIST stands for National Institute of Standards and Technology. They’re a U.S. government agency proudly proclaiming themselves as “one of the nation’s oldest physical science laboratories”. They work in all-things-technology, including cybersecurity, where they’ve become one of the two industry standard go-tos for incident response with their incident response steps.
Like Atlassian, NIST believes that not every incident can be prevented. So it’s best to be prepared:
“Preventive activities based on the results of risk assessments can lower the number of incidents, but not all incidents can be prevented. An incident response capability is therefore necessary for rapidly detecting incidents, minimizing loss and destruction, mitigating the weaknesses that were exploited, and restoring IT services.” — NIST
The NIST incident response lifecycle breaks incident response down into four main phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Event Activity.
Phase 1: Preparation
The Preparation phase covers the work an organization does to get ready for incident response, including establishing the right tools and resources and training the team. This phase includes work done to prevent incidents from happening.
Phase 2: Detection and Analysis
Accurately detecting and assessing incidents is often the most difficult part of incident response for many organizations, according to NIST.
Phase 3: Containment, Eradication, and Recovery
This phase focuses on keeping the incident impact as small as possible and mitigating service disruptions.
Phase 4: Post-Event Activity
Learning and improving after an incident is one of the most important parts of incident response and the most often ignored. In this phase the incident and incident response efforts are analyzed. The goals here are to limit the chances of the incident happening again and to identify ways of improving future incident response activity.
Incident response for modern DevOps teams
Over the past decade, the DevOps movement has helped teams reshape how they build, deploy, and operate software. Along with that are innovations on how these teams respond to incidents.
The DevOps approach to managing incidents isn’t dramatically different from the traditional steps to effective incident management. DevOps incident management includes an explicit emphasis on involving developer teams from the beginning--including on call--and assigning work based on expertise, not job titles.
Incident response and continuous improvement
We started the article by talking about cycles vs. straight lines. You’ll notice something all these incident management approaches have in common: they are not linear. Each of them include the same basic component parts: ways of defining, detecting and identifying incidents; ways of quickly responding and taking action to mitigate incidents; and ways of analyzing incidents to improve future detection and response. There is no point in analyzing an incident that already happened just for the sake of that incident. You can’t go back in time and change what happened. You’re learning from the incident to improve the future detection and response. Constant, continuous learning and improvement is how teams close that cycle.
There are many moving parts to the (nonlinear) incident response process. Keeping track of each step with integrated collaboration and communication tools is easy with an incident management solution like Jira Service Management. Centralize alerts and unify teams with flexibility to respond to and resolve incidents quickly.
Setting up an on-call schedule with Opsgenie
In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.
Read this tutorialPros and cons of different approaches to on-call management
On call teams are rapidly evolving. Explore the pros and cons of different approaches to on call management.
Read this article