Key principles for tackling this pervasive problem at the root.
If your phone is constantly interrupting your beauty sleep with false alarms, you eventually stop paying attention. Once you lose faith in your alerting, you start to assume that every alert is false, and real issues inevitably get missed. This phenomenon is known as alert fatigue.
Alert fatigue is a problem in many industries, including software, healthcare, and emergency response. This desensitization has a deep impact on businesses, and in some extreme cases it can cost lives. In 2010, it was reported that a Massachusetts hospital patient died after alarms signaling a critical event went unnoticed by ten nurses. Patient safety officials shared that many deaths have been reported because of malfunctioning, switched-off, ignored, or unheard alarms.
Though software alerts are usually not life-or-death matters, unhappy customers, lost revenue, and waning customer trust aren’t great results either. When an on-call engineer misses an important alert, the blast radius of the incident grows. The stress then spreads in an unhappy circle that includes upper management, customers, and whoever is on call. Eventually, being on call becomes a miserable task and leads to burned-out, exhausted engineers.
As Kishore Jalleda pointed out in his talk at SRECon17, while prioritizing alerts is useful, it can’t solve the fatigue problem alone – you need a combination of good practices to protect uptime and prevent burnout. We’ve included five key principles that are great starting points for solving alerting problems at the root.
1. Distribute responsibilities
If there are just a couple of people responding to alerts at a big company, alert fatigue is inevitable. Operational complexity scales as companies scale, and to keep alerts under control, you must distribute ops responsibilities. Put developers on call, not just an ops team; holding developers accountable for the code they ship can reduce alerts in the long run and creates an incentive to resolve them in a timely manner. Nowadays, most leading tech companies follow this model. In Increment’s on-call issue, the SRE Manager at Airbnb says that having a separate operations team “creates a divide and simply doesn’t scale.” This is true for alerting, too.
Alerts should be routed to the appropriate team via well-planned escalations that ensure each alert is acknowledged. Don’t notify people unnecessarily, and empower them with personalized notification options. Distributing responsibilities means that each team deals with fewer alerts and has visibility into what works and what doesn’t.
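As a rough sketch of what team-based routing with a timed escalation chain might look like, here is a minimal Python example; the service names, team names, timings, and printed notifications are all hypothetical stand-ins for whatever your alerting tool actually provides.

```python
from dataclasses import dataclass, field

@dataclass
class EscalationPolicy:
    """Who owns an alert and who gets paged if it goes unacknowledged."""
    owning_team: str
    # Hypothetical escalation chain: (responder, minutes to wait before escalating)
    steps: list = field(default_factory=list)

# Example: alerts for each service route to the team that ships it,
# not to a central ops team.
POLICIES = {
    "checkout-service": EscalationPolicy(
        owning_team="payments",
        steps=[("primary-on-call", 0), ("secondary-on-call", 10), ("team-lead", 30)],
    ),
    "search-service": EscalationPolicy(
        owning_team="discovery",
        steps=[("primary-on-call", 0), ("secondary-on-call", 15)],
    ),
}

def route_alert(service: str, alert: dict) -> None:
    """Send the alert to the owning team's escalation chain only."""
    policy = POLICIES.get(service)
    if policy is None:
        # Unowned services are an organizational gap worth fixing,
        # not a reason to page everyone.
        print(f"No owner for {service!r}; file a follow-up instead of paging broadly.")
        return
    for responder, delay_minutes in policy.steps:
        print(f"[{policy.owning_team}] notify {responder} "
              f"after {delay_minutes} min unless acknowledged: {alert['summary']}")

route_alert("checkout-service", {"summary": "p99 latency above 2s"})
```

The point of the sketch is that each service has exactly one owning team and a bounded escalation chain, so nobody outside that chain gets woken up.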
2. Set reliability objectives and tie them to the right incentives
Each team should have its own reliability objectives, or what SRE teams often call service level objectives (SLOs). Setting them requires understanding each service and its importance. Take care to choose SLOs that tie back to business metrics; for example, 90 percent test coverage may be an important reliability metric for the team, but it doesn’t mean anything to clients, who are usually interested in metrics like availability, error rate, request latency, and system throughput.
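As a back-of-the-envelope sketch, an availability SLO can be expressed as the ratio of good requests to total requests, which also yields an error budget to spend; the request counts below are made up for illustration.

```python
# Sketch of an availability SLO check; request counts are illustrative.
SLO_TARGET = 0.999  # 99.9% of requests should succeed over the window

total_requests = 1_200_000
failed_requests = 950  # e.g., 5xx responses

availability = (total_requests - failed_requests) / total_requests
error_budget = 1.0 - SLO_TARGET                      # fraction of requests allowed to fail
budget_spent = (failed_requests / total_requests) / error_budget

print(f"availability: {availability:.4%}")           # ~99.92%
print(f"error budget consumed: {budget_spent:.0%}")  # ~79% of the allowed failures used

if availability < SLO_TARGET:
    print("SLO breached: investigate before shipping new features.")
```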
Once objectives are in place, the next step is tying those objectives to the right incentives to create a culture of ownership. Site Reliability Engineering practices are a great way of dealing with this problem – development teams only get SRE support for on-call if they’ve met their reliability targets over a period of time.
3. Be cautious when introducing new alerts
Uptime is key to running a successful online business: most orgs need to catch problems before customers are affected. Deciding when to alert engineers about potential problems is a common challenge. Separating alerts that need immediate attention from those that don’t is also critical for maintaining a positive on-call experience.
Problems such as rising latency, an increasing rate of 5xx HTTP responses (error rate), or, in general, any failure that might frustrate internal or external clients need immediate attention. Leverage SLOs to define what is important by ensuring that alerts have priorities set according to the severity of the problem. If many alerts look similar to each other, no one can focus on what’s important. Follow the key principles of actionable alerts: route them to the right people at the right time, provide context, and give clues to start investigating and resolving the issue.
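As a hedged illustration of this kind of priority mapping, the sketch below assigns a priority based on symptom severity; the thresholds and priority labels are arbitrary examples, not a standard.

```python
def alert_priority(error_rate: float, latency_p99_ms: float) -> str:
    """Map symptom severity to a priority so that only clearly
    customer-impacting problems page someone immediately.
    Thresholds are illustrative, not prescriptive."""
    if error_rate >= 0.05 or latency_p99_ms >= 5000:
        return "P1"  # page immediately: customers are clearly affected
    if error_rate >= 0.01 or latency_p99_ms >= 2000:
        return "P2"  # notify the on-call, but no middle-of-the-night page
    return "P5"      # log it; review during working hours

print(alert_priority(error_rate=0.07, latency_p99_ms=800))    # P1
print(alert_priority(error_rate=0.002, latency_p99_ms=2500))  # P2
print(alert_priority(error_rate=0.001, latency_p99_ms=300))   # P5
```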
4. Collect data and regularly review alert reports
To identify an alert fatigue problem, don’t wait for employees (or customers) to complain – collect data. Tying everything to a consolidated alerting solution like Opsgenie allows you to trace the alert’s lifecycle as well as evaluate on-call and service performance. These tools provide key metrics such as alerts per priority, daily created alerts, alert sources over a period of time, mean time to acknowledge, mean time to resolve, and more. Over time, this data can be used to assess various indicators of alert fatigue.
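For instance, MTTA (mean time to acknowledge) and MTTR (mean time to resolve) can be derived from the timestamps such a tool exports; the record layout below is a hypothetical example, not any tool’s actual export format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical export: one record per alert with its lifecycle timestamps.
alerts = [
    {"created": "2024-05-01T02:14:00", "acked": "2024-05-01T02:19:00", "resolved": "2024-05-01T03:02:00"},
    {"created": "2024-05-02T11:40:00", "acked": "2024-05-02T11:41:00", "resolved": "2024-05-02T11:55:00"},
    {"created": "2024-05-03T23:05:00", "acked": "2024-05-03T23:30:00", "resolved": "2024-05-04T00:10:00"},
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

mtta = mean(minutes_between(a["created"], a["acked"]) for a in alerts)
mttr = mean(minutes_between(a["created"], a["resolved"]) for a in alerts)

print(f"MTTA: {mtta:.1f} min")  # average time until someone acknowledged
print(f"MTTR: {mttr:.1f} min")  # average time until the alert was resolved
```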
Based on your findings, respond with actions like adjusting thresholds, changing priorities, grouping related alerts, and deleting unnecessary ones. For example, if many alerts consistently result in no action and no impact, get rid of them.
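One way to surface such pruning candidates is to count, per alert rule, how often an alert closed without any human action; the field names and the 90 percent noise threshold below are assumptions for illustration.

```python
from collections import Counter

# Hypothetical export: each record notes the rule that fired and whether
# anyone actually did something in response.
alert_history = [
    {"rule": "disk-90-percent", "action_taken": False},
    {"rule": "disk-90-percent", "action_taken": False},
    {"rule": "disk-90-percent", "action_taken": False},
    {"rule": "checkout-error-rate", "action_taken": True},
    {"rule": "checkout-error-rate", "action_taken": True},
]

fired = Counter(a["rule"] for a in alert_history)
ignored = Counter(a["rule"] for a in alert_history if not a["action_taken"])

for rule, count in fired.items():
    noise_ratio = ignored[rule] / count
    if noise_ratio >= 0.9:  # threshold is arbitrary; tune it to your own data
        print(f"{rule}: {noise_ratio:.0%} of firings led to no action -> "
              "raise the threshold, lower the priority, or delete it")
```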
5. Create a culture of “effective alerting”
Culture change is key to truly solving alert fatigue, and management buy-in and support are imperative. Share key statistics with management stakeholders to justify the extra effort needed to improve alerting. Once management is on board, introduce useful routines to create the alerting culture that all on-call teams need. There are a few ways to do this: regular meetings with the team to go over alerts, reducing repetitive alerts, and holding engineers responsible for corrective actions that are long-term solutions. Finally, always ask teams for feedback on a regular basis to get a pulse on how they’re feeling. Every team is unique, and you can use that to the team’s advantage to create an alerting culture that works for them.
Remember, quick fixes won’t solve the problem of alert fatigue; taking proactive action on each of the principles outlined above helps get at the root of the issue. Start small by introducing these five tactics gradually. Then, after implementation, regularly review your alerting reports and, most importantly, build an effective alerting culture. The fight against alert fatigue isn’t a short one, and it requires everyone’s participation and a willingness to learn from mistakes and achievements.
To learn more about the impact and prevention of alert fatigue, check out our site on the topic.