Top tips from great incident response teams

Published August 15, 2018 in Statuspage

Shannon Winter

Sr. Brand Manager

Learn how support, operations, and development teams like Mixpanel, Front, and Grand Rounds come together for great incident response.

We caught up with three Statuspage customers to learn how their teams come together for awesome incident response. Want to learn from more incident response teams or submit your own tips and stories for us to feature? Head over to our new HugOps hub.

Mixpanel

Cassie (far left): Role: Senior Support Engineer. **Top tip for incident response**: Make sure incident comms are written clearly and in layman’s terms. Users shouldn’t have to be product experts to understand the situation.

Will (back left): Role: Senior Support Engineer. Favorite activity during (the good kind of) downtime: Baking sourdough bread (yum!)

Marie (front right): Role: Support Engineer. **Top tip for incident response**: Agree on rules of engagement between teams before downtime strikes. Assign ownership of both communication and troubleshooting, and decide how these owners communicate with each other.

Ted (back right): Role: Engineer. Favorite activity during (the good kind of) downtime: Spending time with his dog and baby, and singing/playing music

Ariadne (front pup): Role: Team mascot. **Top tip for incident response**: Lots of treats make it less ruff 😉

Mixpanel’s two-time award-winning support team is the heart of the company. So when the leading user analytics platform hits a problem, its skilled incident responders know how to resolve it quickly as a team. Even though downtime is rare (Mixpanel has an impressive track record of 99.98% uptime, which translates to a mere 17 seconds of downtime a day), they are always prepared and able to build trust with their customers when it matters most.
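For readers who want to check the arithmetic behind that uptime figure, here is a minimal sketch of the conversion from an uptime percentage to the downtime it implies. The function name `downtime_seconds` and the 365-day year are our own assumptions for illustration, not anything Mixpanel shared with us.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def downtime_seconds(uptime_percent: float,
                     period_seconds: float = SECONDS_PER_DAY) -> float:
    """Return the downtime implied by an uptime percentage over a period.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The share of time that is NOT uptime, applied to the whole period.
    return (100.0 - uptime_percent) / 100.0 * period_seconds


per_day = downtime_seconds(99.98)
per_year = downtime_seconds(99.98, 365 * SECONDS_PER_DAY)

print(f"{per_day:.1f} seconds of downtime per day")
print(f"{per_year / 3600:.2f} hours of downtime per 365-day year")
```

At 99.98% uptime, the daily figure works out to a little over 17 seconds, which matches the number quoted above.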

What makes them so good at incident response? Our bet’s on the strong collaboration and trust formed between dev and support teams, their excellent processes and documentation, and a habit of over-communicating both internally and externally when things go wrong. We caught up with a few members of their support team and an engineer to learn more.

Unify dev & support

Traditionally, technical folks (developers, SREs, etc.) are the ones getting paged when something goes wrong. But at Mixpanel, the support team leads the organization through incident response, too.

Support team members are on the on-call list right beside their engineering counterparts so they can start updating users as soon as an issue is detected. They work in lockstep during incident response so customers receive the most up-to-date and accurate information possible. Jira tickets, dedicated Slack channels, and Statuspage act as sources of incident truth that keep teams in sync during an incident.

Tools are only part of the equation during incident response, though. The Mixpanel team also has well-defined roles, a communication style guide, and incident communication templates down pat before an incident strikes so everyone is aligned when it matters most. They created the style guide in collaboration with their marketing team so they could quickly reference tips for tone, words to avoid, etc. while writing incident updates. One of their guiding communication principles is to be “honest, but not alarmist,” aiming to be as transparent as possible without ever giving users inaccurate or irrelevant information.

Ultimately, Mixpanel is able to provide legendary support not only with their solid technical skill set, but also with a deep level of empathy. By quickly identifying the root cause of someone’s question, support engineers are able to connect with customers and teach them how to make more informed decisions about their products and company, faster. By updating users early and honestly, they’re able to clear up confusion and build lasting trust.

Over-communicate to stay in sync

Clear, comprehensive, and organized communication is the name of the game during incident response. “Over-communication internally is key,” Cassie told us. “If I know something is an issue and an engineer knows about it, it doesn’t mean everyone knows… we need to make sure that all stakeholders and all people communicating with customers are on the same page.”

Mixpanel organizes communication during an incident by breaking out different types of conversations into different Slack channels and documenting which channels to use for what. Anyone can reference these documents and jump into the right chat at the right time. For example, they talk through incident fixes in their “Ops team” channel, but use “downtime chatter” for related convos not connected to the actual fix. Strong internal collaboration and communication help them deliver quick and consistent comms externally.

Front

Pierre (left): Role: Engineer Lead (Mobile & Support). **Top tips for incident response**: Remain calm and think twice (don’t make a rash decision that makes the problem worse); call for help (don’t try to be the lone hero); keep communication lines open between engineering and support so everyone is on the same page.

Cori (middle): Role: Customer Support Manager. **Top tip for incident response**: Be as transparent as possible in your incident communication, but never lie or make promises you can’t keep!

Samantha (right): Role: Customer Success Manager. Favorite activity during (the good kind of) downtime: Experimenting with new recipes in the kitchen

Front brings teams together in shared inboxes, where they can collaborate on all kinds of communication: email, SMS, social media, and more. By design, Front helps their customers work with more transparency, so it’s no surprise that Front excels with communication and transparency as an organization. We spent some time with folks from their support, customer success, and engineering teams to learn how they keep customers in the loop during an incident.

Transparency during and after an incident

“Transparency is key — making sure we are keeping customers updated and not leaving them in the dark is huge for us,” Sam, Front’s Customer Success Manager, told us. Front keeps transparency at the forefront of their incident response process with a handy product integration and a solid post-incident review process.

Front typically uses their own product to communicate with customers, but brings in Statuspage during an incident. They brought the two together by creating a Front + Statuspage integration that provides current status information via pop-ups right inside the Front app. Serving up incident information before a customer even has a chance to submit a support ticket builds a lot of trust and goodwill. Customers know exactly what’s going on and where they can go for more information for the entirety of the incident.
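We don’t know how Front built their integration, but the general idea of turning a status page’s current state into an in-app notice can be sketched in a few lines. The example below assumes a payload shaped like the one a Statuspage public `/api/v2/status.json` endpoint returns (a `status` object with an `indicator` and a `description`). The sample values, the `status.example.com` URL, and the `banner_text` helper are all made up for illustration.

```python
import json
from typing import Optional

# Illustrative payload only: the shape follows a Statuspage public
# status.json response, but every value here is invented.
SAMPLE_PAYLOAD = """
{
  "page": {"name": "Example Co", "url": "https://status.example.com"},
  "status": {"indicator": "major", "description": "Partial System Outage"}
}
"""


def banner_text(payload: dict) -> Optional[str]:
    """Return text for an in-app banner, or None when all is well.

    Hypothetical helper; not Front's actual implementation.
    """
    status = payload.get("status", {})
    # An indicator of "none" means there is nothing to report.
    if status.get("indicator", "none") == "none":
        return None
    description = status.get("description", "Service disruption")
    url = payload.get("page", {}).get("url", "")
    suffix = f" Details: {url}" if url else ""
    return f"{description}.{suffix}"


print(banner_text(json.loads(SAMPLE_PAYLOAD)))
```

In a real app the payload would be fetched on a schedule rather than hard-coded, and the banner would be shown only while the function returns text.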

The Front teams aim to be open with users after resolution, too. They conduct internal post-incident reviews, often translating their findings into public postmortems that tell customers what happened, how they fixed it, and how they’re making sure the same problem doesn’t happen again. Sometimes their CTO sends this information out in an email that also urges users to subscribe to Statuspage notifications so they can stay in the loop during future incidents.

Their commitment to transparency definitely pays off. Customers often send notes of gratitude for clear and candid postmortems, which help to boost team morale after an incident. “It’s really nice to see the encouragement and support during a stressful time,” said Sam. “It makes me want to give out more #HugOps to services I use since I know how good it feels.”

Grand Rounds

Bryan (left): Role: Engineering Ops veteran of 15+ years. Favorite activity during (the good kind of) downtime: Watching dystopian thrillers on Netflix

Aaron (right): Role: Operations. **Top tip for incident response**: Support your team and have a positive attitude

When Bryan Kroger isn’t hanging at the beach, doing burpees in the ocean (we have photo evidence!), he’s helping to build system resiliency and create a culture of iterative development as an Engineering Ops Lead at Grand Rounds. Grand Rounds is a healthcare company that helps connect patients with specialty medical care through technology, information, and support. They play a critical role in patients’ medical choices and decisions, so it’s crucial that bugs or incidents are identified and resolved before ever affecting patients. Bryan has been in the Ops game for over 15 years, so he has a lot of experience and insight into what makes an incident response team strong.

Find your tribe

“We’re in ops, we’re used to being in the mud together and having each other’s backs,” Bryan told us. Shared experiences, especially high-stress ones, are great for quickly bringing folks together. Ops teams like Bryan’s rely on each other down in the incident trenches, and have a deep sense of camaraderie because of it. This support becomes essential when sleep has been lost and frustration sets in. If you’re reading this, you have most likely felt that “I got paged at 3am” exhaustion, and know how far a little positivity and support can go when s#*t hits the fan.

“Coming into the office dreary and mad doesn’t help,” Bryan told us. “A good attitude can save a lot of time and effort for our whole team.” And when a good attitude just isn’t in the cards on a particularly rough day, a supportive team that encourages you to take the morning off, or buys you your favorite coffee or beer, makes all the difference.

A little empathy goes a long way

This camaraderie and empathy extend outside the walls of the office, too. Bryan is part of a “Brotherhood of Ops” Slack channel where fellow DevOps folks share tips and send notes of encouragement during downtime. And his team makes a point of continually improving their on-call processes to make sure their families stay balanced and well-rested, too: “I like to remind people that it’s not just you who is on-call… the person next to you is getting woken up, too. It’s beneficial for everyone involved that we get it right.”

Learn even more from the incident response community on our HugOps hub. We hope you’ll feel inspired to share tips and stories of your own. In return, we’ll send ya a HugOps poster to proudly display in your workspace. 🙌
