Incident management for high-velocity teams
Atlassian Incident Handbook
Teams running tech services today are expected to maintain 24/7 availability.
When something goes wrong, whether it's an outage or a broken feature, team members need to respond immediately and restore service. This process is called incident management, and it’s an ongoing, complex challenge for companies big and small.
We want to help teams everywhere improve their incident management. Inspired by teams like Google, we've created this handbook as a summary of Atlassian's incident management process. These are the lessons we've learned responding to incidents for more than a decade. While it’s based on our unique experiences, we hope it can be adapted to suit the needs of your own team.
Get the handbook in print or PDF
We've got a limited supply of print versions of the Incident Management Handbook that we're shipping out for free. Or download a PDF version.
Stage | Incident Value | Related Atlassian Value | Rationale |
1. Detect | Atlassian knows before our customers do | Build with Heart and Balance | A balanced service includes enough monitoring and alerting to detect incidents before our customers do. The best monitoring alerts us to problems before they even become incidents. |
2. Respond | Escalate, escalate, escalate | Play, as a Team | Nobody likes being woken up and we don’t take the responsibility lightly. But people understand that occasionally they will be woken for an incident where it turns out they aren't needed. What’s usually harder is waking up to a major incident and playing catch-up when you should have been alerted earlier. We won't always have all the answers, so "don't hesitate to escalate." |
3. Recover | Shit happens, clean it up quickly | Don't !@#$ the Customer | Our customers don't care why their service is down, only that we restore service as quickly as possible. Never hesitate to get an incident resolved quickly so that we minimise the impact on our customers. |
4. Learn | Always Blameless | Open Company, No Bullshit | Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame. |
5. Improve | Never have the same incident twice | Be the change you seek | Identify the root cause and the changes that will prevent the whole class of incident from occurring again. Commit to delivering specific changes by specific dates. |
Setting up an on-call schedule with Opsgenie
In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.
Read this tutorial
How we respond to an incident
Here's Atlassian's process for responding to incidents, from our handbook. Learn the steps an incident manager takes from detection to resolution.
Read this article