Downtime happens. While it can certainly be chaotic and stressful, if handled properly, it can also be a chance to build customer trust and loyalty. The way you respond to and communicate around incidents and downtime tells customers a lot about what you value.
Therefore, it’s essential to show customers you value them by communicating early, often, and candidly during an incident.
You wouldn’t wait until you felt the floor shake to buy an earthquake kit; don’t wait until an incident is detected to create a communication plan. Spend the time creating a plan now so you can enjoy peace of mind, team harmony, and happier customers later.
✅ Tip: Download our incident communication plan template, grab any teammate who touches incident response (dev, ops, support, social media, etc.), and create your own custom plan to reference next time things go wrong.
Define an “incident”
A great incident communication plan is pretty useless if you don’t know when to use it. That’s why you’ll want to start by defining what constitutes an incident for your team.
Some things to keep in mind when creating your incident definition:
- Service impact: Is the service completely down or partially down?
- Customer impact: How many customers are affected? Which ones?
- Data/Security: Is there any data loss? Has private customer information been leaked?
At Atlassian, we define an incident as:
An event which is not part of the standard operation of an Atlassian Service and which causes disruption to or a reduction in the quality of services and customer productivity.
We also go a step further to define what an incident is not:
Something that could potentially cause an incident in the future, or that is undesirable but has no effect on service availability or usability.
Having a solid definition ensures you won’t waste time determining whether an event is or isn’t an incident.
Create incident roles & responsibilities
Once you determine an event is in fact an incident, you need to know who does what. Without documented and trained incident roles, it’s likely you’ll have delayed customer comms and duplicate work being done across the company. This can lead to frustrated teams, and, worse, confused and disgruntled customers. By establishing the proper roles now, you’re much more likely to have quick and consistent comms that give your internal teams harmony and your customers clarity.
While roles will inevitably vary team to team and company to company, we recommend starting out with roles similar to these:
- Major Incident Manager: Think of this as the main incident lead. Someone who’s there to make the tough decisions, escalate/delegate as needed, and close the loop on incidents with a Post-Incident Review (PIR) or post-mortem.
- Communications Managers (internal and external): These folks are in charge of internal stakeholder and external customer communications. They determine which channels should be used, when they should be used, and what level of detail is appropriate to communicate out. They should work with product, engineering, and marketing teams to ensure accurate and consistent comms are being sent out throughout the duration of an incident.
- Customer Support Lead: This person, probably already leading your support team, will be in charge of making sure all incoming tickets, emails, phone calls, tweets etc. are taken care of in a timely manner.
- Social Media Lead: This is often someone from your support or marketing team. Their job is to work alongside support to field questions or complaints that come in via any social media channels you’re using to communicate with customers (Twitter, Facebook, etc.)
Once you document and define your roles, you’ll want to properly train the team and assign backups. You’ll want at least 1-2 backups for each role in case someone is not available during the incident.
Choose communication channels
Once you have roles set, you need to know where and how you’re going to communicate with customers. There’s no shortage of tools to use for this; the key is finding what works best for your team. We recommend reducing confusion and streamlining your comms by minimizing the number of channels you use. This is where Statuspage can really come in handy, since it’s a single tool that lets you communicate the same message out of various channels (email, in-app, SMS, Twitter, etc.).
You’ll want to document the channels you decide on so your incident response teams know what to use and how to use it. It’s also important to have a team or individual dedicated to owning each channel. Last, but certainly not least, you’ll want to surface the channels to your customers so they know exactly where to go for information or help during an incident. The more in-the-know they are, the more trust they’ll have.
Communication tips
Once you know who’s involved and what tools they’re using, you need to know what you’re actually going to communicate. What should you say? What should it sound like?
We like to use these 5 tips as a guide when crafting incident comms:
✅ Tip: Save or print this card so you can have it on-hand next time you’re tasked with customer comms.
Communication templates
It’s hard to write clear and eloquent updates when things are on fire and you’re under pressure. We recommend creating pre-written communication templates to help. Having a pre-defined structure and tone established will let you get updates out quickly and clearly every time. The faster you can tell customers what’s going on, the less incoming comms you’ll get, and the more trust you’ll build.
The templates you write can be generic – you can fill in more details as they become available during each incident. We have a library of incident update templates you can pull from, or you can use these 4 we’ve written up below for each stage of an incident:
- Investigating: We’re currently experiencing a service disruption. Our **team name** team is working to identify the root cause and implement a solution. **General impact** users may be affected. Next update in **time until next update**.
- Identified: We’ve determined that **describe problem**. **General impact** users may be affected. We’re working to **fix you are working on**. Next update in **time until next update**.
- Monitoring: We’ve deployed a fix that *describe fix**. Users should now see **current expected behavior**. We’re monitoring to ensure **expected result**. Next update in **time until next update**.
- Resolved: The **type of issue** is resolved. We’re sorry for **insert apology for what users may have experienced or felt**. For more information, please see **link to post-mortem (if applicable).**
When in doubt, communicate! Even if there’s nothing new to say, tell your customers you’re still working on the problem and let them know when they can expect the next update.
Once you have a couple incidents under your belt, we recommend meeting to discuss what’s working well and where you can improve. Try running our Incident Communication Play from the Atlassian Team Playbook to spot gaps in your current process and create an improved plan you can use in future incidents.
Incident Values
Congrats! You’re close to having a complete incident communicate plan. However, even the most comprehensive incident communication plan lacks instructions for more subjective, nuanced situations that can and will arise when things go wrong. It’s best when teams are armed with the right knowledge and a documented plan, but it also helps to have guidance in the form of mutually agreed upon values that can align behavior during an incident.
At Atlassian, we’ve created a set of incident values that act as north star for our teams during incident response. You can use them as your own, or as inspiration when creating a set with your team. It’s a powerful exercise to discover what your team values during incident response and to work towards living those values more closely.
Fail to prepare, prepare to fail
A great incident communication plan is all about preparation and practice. Put in the time now using the tips above, and you’re sure to see an improvement in the way your team responds to and communicates around incidents.
The best part is, you don’t have to wait for a real incident to practice and improve. Begin to run mock incidents to see how your plan holds up, and continue to make tweaks to perfect it before the real deal.
Tweet us @Statuspage to let us know how it goes!