Incident management for high-velocity teams
The changing roles of incident and problem management
In the last decade, incident management has changed a lot.
ITIL guidelines have been updated. IT teams have started sharing responsibility with DevOps and SecOps. Ever more complicated systems have led to more complicated incident management solutions. And many companies are embracing blameless postmortems and new ways of measuring performance.
As incident management continues to shift and evolve, so too does its close cousin, problem management, and the relationship between the two practices.
What is a problem and how does it differ from an incident?
As ITIL defines it, a problem is “a cause or potential cause of one or more incidents.”
And an incident is a single unplanned event that causes a service disruption.
In other words, incidents are the nasty episodes that on-call employees scramble to resolve as quickly and completely as possible, and problems are the root causes of those disruptive events.
A problem can cause a single incident, or it can cause multiple incidents. And an incident may be traced back to a single problem or—sometimes—multiple problems.
For example, the five-hour outage that cost Delta Air Lines $150 million in 2016 was an incident. The problem that caused that incident was a loss of power at an operations center and, presumably, no backup plan in case of that loss of power.
Similarly, the 12-hour app store outage that cost Apple an estimated $25 million was an incident. The problem behind it? A DNS issue.
If we used these terms outside the world of tech, rushing to the doctor for a migraine would be an incident. The cause of the migraine—allergies or vision issues or stress—would be the problem.
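The many-to-many relationship between problems and incidents described above can be sketched as a small data model. This is a minimal illustration, not how any particular tool stores its records: the class names, the `link` helper, the keys such as `PRB-1`, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """A cause (or potential cause) of one or more incidents."""
    key: str
    summary: str
    incident_keys: set[str] = field(default_factory=set)


@dataclass
class Incident:
    """A single unplanned event that disrupts a service."""
    key: str
    summary: str
    problem_keys: set[str] = field(default_factory=set)


def link(incident: Incident, problem: Problem) -> None:
    """Record, in both directions, that a problem contributed to an incident."""
    incident.problem_keys.add(problem.key)
    problem.incident_keys.add(incident.key)


# Hypothetical records, for illustration only.
dns = Problem("PRB-1", "Misconfigured DNS record")
power = Problem("PRB-2", "No backup power at the operations center")
outage_a = Incident("INC-1", "Storefront unreachable")
outage_b = Incident("INC-2", "Login API timing out")

link(outage_a, dns)    # one problem ...
link(outage_b, dns)    # ... behind two separate incidents
link(outage_b, power)  # one incident traced back to two problems
```

Keeping the links on both records means a responder can go from an incident to its suspected causes, and a problem owner can go from a cause to every incident it has produced.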
Problem management vs. incident management
Obviously, problems and incidents are inextricably linked. One causes the other and teams have to pay attention to both.
For traditional IT teams, the latest ITIL guidelines call for teams to manage both problems and incidents, but to do so separately. Problem management is a practice focused on preventing incidents or reducing their impact. Incident management is focused on addressing incidents in real time.
The benefit of the ITIL approach is that it prioritizes the core goals of both problem management and incident management. By making them separate and equally important practices, presumably, the guidelines are attempting to avoid the common problem of IT teams constantly putting out incident fires without dealing with the root cause of those fires.
If an incident manager’s primary goal is the quick resolution of incidents and a problem manager’s primary goal is prevention, combining these roles may mean that one of those goals, both of which are vital to an organization, gets deprioritized in favor of the other.
The downside to this approach is that separating the two practices, which are so tightly linked in reality, can create knowledge gaps and communication breakdowns between incident resolution and the root cause analysis that identifies the underlying cause.
DevOps and the changing roles of problem and incident management
As usual, the collaborative DevOps movement has blurred the lines of traditional IT thinking—seeing problem and incident management not as two distinct practices, but as overlapping halves of a holistic view.
This shift comes not only from the fact that the practices are two sides of the same coin (preventing and resolving incidents) but also from a DevOps approach that typically affirms that:
- There is often more than one root cause of an incident
- Postmortems should be blameless and inclusive of any team impacted by an incident
- Collaboration is at the core of continuous improvement
The overlap in problem and incident management may also be connected with the industry-wide shift toward a “you build it, you run it” approach. As the teams who build systems become responsible for resolving incidents within those systems, it’s logical that the same team be responsible for running postmortems, doing the detective work to get to the root cause of an incident, and making recommendations that will prevent or lessen the impact of future incidents.
The bridge between problem and incident management here is the blameless postmortem: once the urgency has cooled, incident managers turn detective and take up the task of problem management and prevention.
The key challenge for DevOps teams that blur the lines between these two practices is making sure that problem management, with its less urgent but deeply valuable long-term goals, doesn’t get deprioritized in favor of the in-your-face urgency of incident management.
Of course, uniting incident management and problem management is often easier said than done, but it is imperative for finding and resolving root causes. Discover how Jira Service Management, an incident management solution, gives teams the flexibility to work collaboratively: record context and create rich timelines while resolving incidents, then use that record to help teams better manage problems.