Incident postmortems
We practice blameless postmortems at Atlassian to ensure we understand and remediate the root cause of every incident with a severity of level 2 or higher. Here's a summarized version of our internal documentation describing how we run postmortems at Atlassian.
| Field | Instructions | Example |
| --- | --- | --- |
| Incident summary | Summarize the incident in a few sentences. Include what the severity was, why, and how long the impact lasted. | |
| Leadup | Describe the circumstances that led to this incident, for example, prior changes that introduced latent bugs. | |
| Fault | Describe what didn't work as expected. Attach screenshots of relevant graphs or data showing the fault. | |
| Impact | Describe what internal and external customers saw during the incident. Include how many support cases were raised. | |
| Detection | How and when did Atlassian detect the incident? How could time to detection be improved? As a thought exercise, how would you have cut the time in half? | |
| Response | Who responded, when, and how? Were there any delays or barriers to our response? | After being paged at 14:34 UTC, the KITT on-call engineer came online at 14:38 in the incident chat room. However, the on-call engineer did not have sufficient background on the Escalator autoscaler, so a further alert was sent at 14:50 and brought a senior KITT engineer into the room at 14:58. |
| Recovery | Describe how and when service was restored. How did you reach the point where you knew how to mitigate the impact? Additional questions to ask, depending on the scenario: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half? | Recovery was a three-pronged response: |
| Timeline | Provide a detailed incident timeline, in chronological order, timestamped with timezone(s). Include any lead-up; start of impact; detection time; escalations, decisions, and changes; and end of impact. | All times are UTC. 11:48 - K8S 1.9 upgrade of control plane finished. 12:46 - Goliath upgrade to V1.9 completed, including cluster-autoscaler and the BuildEng scheduler instance. 14:20 - BuildEng reports a problem to KITT Disturbed. 14:27 - KITT Disturbed starts investigating failures of a specific EC2 instance (ip-203-153-8-204). 14:42 - KITT Disturbed cordons the node. 14:49 - BuildEng reports the problem as affecting more than just one node; 86 instances of the problem show the failures are more systemic. 15:00 - KITT Disturbed suggests switching to the standard scheduler. 15:34 - BuildEng reports 300 pods failed. 16:00 - BuildEng kills all failed builds with OutOfCpu reports. 16:13 - BuildEng reports that the failures are consistently recurring with new builds and were not just transient. 16:30 - KITT recognizes the failures as an incident and runs it as an incident. 16:36 - KITT disables the Escalator autoscaler to stop it from removing compute and alleviate the problem. 16:40 - KITT confirms the ASG is stable, cluster load is normal, and customer impact is resolved. |
| Five whys | Use the root cause identification technique. Start with the impact and ask why it happened and why it had the impact it did. Continue asking why until you arrive at the root cause. Document your "whys" as a list here or in a diagram attached to the issue. | |
| Root cause | What was the root cause? This is the thing that needs to change in order to stop this class of incident from recurring. | |
| Backlog check | Is there anything on your backlog that would have prevented this or greatly reduced its impact? If so, why wasn't it done? An honest assessment here helps clarify past decisions around priority and risk. | Not specifically. Improvements to flow typing were known ongoing tasks with rituals in place (e.g., adding flow types when changing or creating a file). Tickets for fixing up integration tests had been raised, but attempts at them had not been successful. |
| Recurrence | Has this incident (with the same root cause) occurred before? If so, why did it happen again? | This same root cause resulted in incidents HOT-13432, HOT-14932, and HOT-19452. |
| Lessons learned | What have we learned? Discuss what went well, what could have gone better, and where we got lucky, in order to find improvement opportunities. | |
| Corrective actions | What are we going to do to make sure this class of incident doesn't happen again? Who will take the actions, and by when? Create "Priority action" issue links to the issues tracking each action. | |
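The fields above are normally filled in directly in the postmortem issue. If you want to script the skeleton for new issues, here is a minimal, hypothetical sketch (not Atlassian tooling) that mirrors the template fields as a Python dataclass and renders markdown sections to paste into a fresh postmortem; the class and field names are illustrative only.

```python
from dataclasses import dataclass, fields

@dataclass
class Postmortem:
    """One attribute per field in the postmortem template above."""
    incident_summary: str = ""
    leadup: str = ""
    fault: str = ""
    impact: str = ""
    detection: str = ""
    response: str = ""
    recovery: str = ""
    timeline: str = ""
    five_whys: str = ""
    root_cause: str = ""
    backlog_check: str = ""
    recurrence: str = ""
    lessons_learned: str = ""
    corrective_actions: str = ""

    def to_markdown(self) -> str:
        """Render a markdown skeleton, marking unfilled sections as TODO."""
        sections = []
        for f in fields(self):
            heading = f.name.replace("_", " ").capitalize()
            body = getattr(self, f.name) or "_TODO_"
            sections.append(f"## {heading}\n\n{body}\n")
        return "\n".join(sections)

# Seed a new postmortem issue description with the template headings.
print(Postmortem(incident_summary="SEV-2: elevated build failures for ~2 hours").to_markdown())
```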
| Scenario | Proximate cause & action | Root cause | Root cause mitigation |
| --- | --- | --- | --- |
| Stride "Red Dawn" squad's services did not have Datadog monitors and on-call alerts, or they were not properly configured. | Team members did not configure monitoring and alerting for new services. Configure it for these services. | There is no process for standing up new services that includes monitoring and alerting. | Create a process for standing up new services and teach the team to follow it. |
| Stride was unusable on IE11 due to an upgrade to Fabric Editor that doesn't work on this browser version. | An upgrade of a dependency. Revert the upgrade. | Lack of cross-browser compatibility testing. | Automate continuous cross-browser compatibility testing. |
| Logs from Micros EU were not reaching the logging service. | The role provided to Micros to send logs with was incorrect. Correct the role. | We can't tell when logging from an environment isn't working. | Add monitoring and alerting on missing logs for any environment. |
| Triggered by an earlier AWS incident, Confluence Vertigo nodes exhausted their connection pool to Media, leading to intermittent attachment and media errors for customers. | AWS fault. Get the AWS postmortem. | A bug in Confluence connection pool handling led to leaked connections under failure conditions, combined with a lack of visibility into connection state. | Fix the bug and add monitoring that will detect similar future situations before they have an impact. |
| Category | Definition | What should you do about it? |
| --- | --- | --- |
| Bug | A change to code made by Atlassian (this is a specific type of change) | Test. Canary. Do incremental rollouts and watch them. Use feature flags. Talk to your quality engineer. |
| Change | A change made by Atlassian (other than to code) | Improve the way you make changes, for example, your change reviews or change management processes. Everything next to "bug" also applies here. |
| Scale | Failure to scale (e.g., being blind to resource constraints, or a lack of capacity planning) | What are your service's resource constraints? Are they monitored and alerted on? If you don't have a capacity plan, make one. If you do have one, what new constraint do you need to factor in? |
| Architecture | Design misalignment with operational conditions | Review your design. Do you need to change platforms? |
| Dependency | Third-party (non-Atlassian) service fault | Are you managing the risk of third-party faults? Have we made the business decision to accept a risk, or do we need to build mitigations? See "Root causes with dependencies" below. |
| Unknown | Indeterminable (the action is to increase the ability to diagnose) | Improve your system's observability by adding logging, monitoring, debugging, and similar capabilities. |
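As a concrete illustration of the "Bug" guidance above (incremental rollouts behind feature flags), here is a minimal sketch of percentage-based bucketing. The flag name, user ID, and code paths are hypothetical placeholders, not a real Atlassian or vendor feature-flag API.

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def new_code_path() -> str:
    return "new behavior"   # placeholder for the change being canaried

def old_code_path() -> str:
    return "old behavior"   # placeholder for the known-good behavior

# Ramp the change gradually (e.g. 1% -> 10% -> 50% -> 100%), watching error
# rates and dashboards at each step before raising `percent`.
result = (new_code_path() if in_rollout("new-scheduler", "user-123", percent=10)
          else old_code_path())
print(result)
```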
| Category | Question to ask | Examples |
| --- | --- | --- |
| Investigate this incident | "What happened to cause this incident, and why?" Determining the root causes is your ultimate goal. | Log analysis, diagramming the request path, reviewing heap dumps |
| Mitigate this incident | "What immediate actions did we take to resolve and manage this specific event?" | Rolling back, cherry-picking, pushing configs, communicating with affected users |
| Repair damage from this incident | "How did we resolve immediate or collateral damage from this incident?" | Restoring data, fixing machines, removing traffic re-routes |
| Detect future incidents | "How can we decrease the time to accurately detect a similar failure?" | Monitoring, alerting, plausibility checks on input/output |
| Mitigate future incidents | "How can we decrease the severity and/or duration of future incidents like this?" "How can we reduce the percentage of users affected by this class of failure the next time it happens?" | Graceful degradation; dropping non-critical results; failing open; augmenting current practices with dashboards or playbooks; incident process changes |
| Prevent future incidents | "How can we prevent a recurrence of this sort of failure?" | Stability improvements in the code base, more thorough unit tests, input validation and robustness to error conditions, provisioning changes |
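To make the "graceful degradation / dropping non-critical results" examples above more concrete, here is a minimal sketch. The service call, field names, and fallback are hypothetical; a production service would typically also bound latency with deadlines or a circuit breaker rather than relying on exceptions alone.

```python
def fetch_recommendations(user_id: str) -> list:
    """Placeholder for a call to a non-critical downstream service."""
    return ["doc-1", "doc-2"]

def render_page(user_id: str) -> dict:
    critical = {"user": user_id, "content": "core page data"}  # must always be served
    try:
        recommendations = fetch_recommendations(user_id)  # optional enhancement
    except Exception:
        # Degrade gracefully: drop the non-critical results instead of
        # failing the whole request when the dependency is unhealthy.
        recommendations = []
    return {**critical, "recommendations": recommendations}

print(render_page("user-123"))
```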
We also use Lueder and Beyer's advice on how to word our postmortem actions.
Wording postmortem actions:
The right wording for a postmortem action can make the difference between an easy completion and indefinite delay due to infeasibility or procrastination. A well-crafted postmortem action should have these properties:
Actionable: Phrase each action as a sentence starting with a verb. The action should result in a useful outcome, not a process. For example, “Enumerate the list of critical dependencies” is a good action, while “Investigate dependencies” is not.
Specific: Define each action's scope as narrowly as possible, making clear what is and what is not included in the work.
Bounded: Word each action to indicate how to tell when it is finished, as opposed to leaving the action open-ended or ongoing.
| From... | To... |
| --- | --- |
| Investigate monitoring for this scenario. | (Actionable) Add alerting for all cases where this service returns >1% errors. |
| Fix the issue that caused the outage. | (Specific) Handle invalid postal code in user address form input safely. |
| Make sure engineer checks that database schema can be parsed before updating. | (Bounded) Add automated pre-submit check for schema changes. |
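As an example of what the "Actionable" row above might translate into, here is a minimal sketch of a >1% error-rate check. The threshold, window, and request counters are stand-ins for whatever your monitoring system actually exposes; this is not a specific vendor API.

```python
ERROR_RATE_THRESHOLD = 0.01  # the 1% bound named in the action item

def error_rate(total_requests: int, error_responses: int) -> float:
    """Fraction of requests in the window that returned errors."""
    return 0.0 if total_requests == 0 else error_responses / total_requests

def should_alert(total_requests: int, error_responses: int) -> bool:
    """True when the service exceeds the 1% error threshold for the window."""
    return error_rate(total_requests, error_responses) > ERROR_RATE_THRESHOLD

# e.g. 180 errors out of 12,000 requests in the last five minutes -> page on-call
assert should_alert(total_requests=12_000, error_responses=180)
```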