Incident management for high-velocity teams
Incident management in the age of DevOps
Applying principles of open, blameless communication to incident management teams
You can’t rethink how you build, deploy, and operate software without rethinking how you respond to incidents.
In their seminal 2009 talk, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr,” John Allspaw and Paul Hammond sketched out a vision of a world where developers and IT ops teams work together and ship often. Over the next decade, that vision took shape as the DevOps movement.
DevOps depends on new ways of responding to incidents, so it’s not surprising that incident management got so much attention in Allspaw and Hammond’s talk.
“The important thing to realize is that failure is going to happen,” Hammond said in the talk. “It’s not a question of if, it’s a question of when.”
Unlike ITIL and similar frameworks, DevOps has no “official” document of best practices. But we can generally agree that, at its core, DevOps is about delivering business value to an organization by breaking down organizational silos, increasing transparency, and fostering open communication between developers and IT operations teams.
That same culture of transparency, visibility, and rapid learning extends to incident management.
Why? Because the first and most critical steps in incident management involve understanding what's gone wrong, getting the right people working on the problem, and fostering a blameless culture.
DevOps incident management calls for a culture of open, blameless communication between developers and IT ops teams, along with lightweight processes that improve the reliability of IT services, increase customer satisfaction, and drive business value. A DevOps engineer can help implement DevOps culture and practices.
ITIL, by comparison, is a prescribed set of 26 processes, procedures, tasks, and checklists designed to improve specific practices in IT service management. ITIL focuses on service quality, consistency, and the resilience of systems.
One of the benefits of ITIL is that organizations that want to improve ITSM can begin with templated best practices instead of starting from scratch. And while some believe ITIL is best suited for large enterprises, the framework is flexible enough that smaller companies can pick and choose the processes that make sense for their business and still find value.
One downside to ITIL, if you're in a hurry to make changes to your incident response process, is that it can involve formal change management and an expert consultant, both of which can delay improvements.
For teams who want to get started right away, the DevOps incident management approach will help them come together and realize benefits immediately.
Setting up an on-call schedule with Opsgenie
In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.
Incident communication best practices
Incident communication is the process of alerting users that a service is experiencing some type of outage or degraded performance.