Incident management for high-velocity teams
What is IT incident alerting?
Incident alerting is when monitoring tools generate alerts to notify your team of changes, high-risk actions, or failures in the IT environment.
For example, a system built to allow doctors to prescribe medication may generate an alert if the dose a doctor requests is unusually high, not matching up with the body weight listed in a patient file, or poses a drug interaction risk with other common medications.
Similarly, a system built to monitor a tech product may generate an alert if a system goes offline, web requests are taking longer than usual to process, or database latency slows beyond a set threshold.
The goal of IT alerting is to quickly identify and resolve issues that impact product uptime, speed, and functionality—around the clock and without manual monitoring.
Why is IT alerting important?
As the importance of always-on systems continues to rise, so too does the cost of downtime, with experts estimating an average cost between $5,600 and $9,000 per minute. Since every minute of system failure is so pricey, identifying issues before they get out of hand has a big impact on the business bottom line (not to mention IT teams’ schedules and stress levels).
IT alerts are the first line of defense against system outages or changes that can turn into major incidents. By automatically monitoring systems and generating alerts for outages and risky changes, IT teams can minimize downtime—and the high cost that comes with it.
Alerting best practices
IT alerts are undeniably an important part of incident management, but the truth is that they’re not just a simple fix you can set and forget. Setting alert thresholds too low can lead to overflowing inboxes, unhappy on-call teams, and alert fatigue. Setting thresholds too high can mean missing critical issues and costing the company millions.
That’s why the most effective IT alerting systems are set up with the following best practices in mind.
Automate your monitoring
The best way to quickly and effectively identify issues is to automate monitoring.
Is a database responding slower than usual? Are users experiencing slower-than-average load times on your app? Is a vital system down? Has one of your technicians made a request that seems like a red flag? Your system should automatically be watching out for problems like these and letting you know when they arise.
Set smart alerting thresholds
Does every alert need immediate attention? For most companies, the answer is no—which is why you need to set sensible alert thresholds.
Knowing whether something is worth waking a developer in the middle of the night—or if it can wait until morning—can be the difference between happy developers with fast response times and alert-fatigued teams who spend their weekends looking for a new job.
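As a rough illustration of the idea above, here is a minimal sketch of two-tier thresholds: one level that can wait until morning and one that justifies a page. The metric names, the limits, and the `route` helper are all hypothetical; sensible values depend entirely on your own systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float  # at or above this, open a ticket for business hours
    page: float  # at or above this, page the on-call engineer now


# Hypothetical metric names and limits, for illustration only.
THRESHOLDS = {
    "db_latency_ms": Threshold(warn=200.0, page=1000.0),
    "error_rate_pct": Threshold(warn=1.0, page=5.0),
}


def route(metric: str, value: float) -> str:
    """Return 'page', 'ticket', or 'ignore' for a single metric reading."""
    limits = THRESHOLDS[metric]
    if value >= limits.page:
        return "page"
    if value >= limits.warn:
        return "ticket"
    return "ignore"
```

Keeping the limits in one table, rather than scattered across checks, makes it easier to review and tune them after each incident.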
De-duplicate your alerts
A study on alert fatigue found that, for clinicians in a hospital setting, alert attention dropped by 30% every time a duplicate alert came in. The results would likely be similar for developers: the more we see the same alert, the less we pay attention to it. That’s why the best practice here is to de-duplicate your alerts and minimize reminders.
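One common way to de-duplicate is to suppress repeats of the same alert inside a time window and keep a count instead. The sketch below assumes an alert is identified by its source and check name, and that a five-minute window is acceptable; both are illustrative choices, and the `Deduplicator` class is hypothetical rather than part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Deduplicator:
    window_seconds: float = 300.0  # assumed window; tune for your team
    _last_sent: dict = field(default_factory=dict, init=False, repr=False)
    _suppressed: dict = field(default_factory=dict, init=False, repr=False)

    def should_notify(self, source: str, check: str, now: float) -> bool:
        """Return True only for the first alert per key in each window."""
        key = (source, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            # Duplicate inside the window: count it, but stay quiet.
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        self._suppressed[key] = 0
        return True

    def suppressed_count(self, source: str, check: str) -> int:
        """How many duplicates were swallowed since the last notification."""
        return self._suppressed.get((source, check), 0)
```

The suppressed count can be shown on the original alert, so responders still see that the problem is recurring without receiving a new notification each time.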
Set priority and severity levels
Obviously, some alerts are more important than others. A website outage is probably going to take precedence over a brief slow-down on an infrequently-used feature. Malicious hacking is probably a higher priority than an image that isn’t rendering correctly in your app.
Not only should your system recognize alert priority and severity, but it should also communicate that priority clearly to the people responsible for resolving incidents. The best practice here is to use visual, audible, and other sensory cues to quickly and clearly indicate what teams should focus on next.
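To make the ordering concrete, here is a small sketch that sorts an alert queue by severity first and arrival time second. The three severity levels and the `Alert` fields are assumptions for illustration; real systems often use more levels and more fields.

```python
from enum import IntEnum
from typing import List, NamedTuple


class Severity(IntEnum):
    # Lower number means more urgent (hypothetical scale).
    CRITICAL = 1
    HIGH = 2
    LOW = 3


class Alert(NamedTuple):
    severity: Severity
    received_at: float  # seconds since some reference point
    summary: str


def triage_order(alerts: List[Alert]) -> List[Alert]:
    """Most severe first; within a severity, oldest first."""
    return sorted(alerts, key=lambda a: (a.severity, a.received_at))
```

For example, a website outage and a suspected intrusion would both sort ahead of a slow checkout page, which in turn sorts ahead of an image that isn’t rendering.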
Make alerts actionable
Knowing what’s wrong is good. Knowing what to do next is better. Which is why if your alerts aren’t actionable, they should be.
This is one place where DevOps teams can learn from the aviation industry. When an alert shows up on a pilot’s dashboard during a flight, it comes with an actionable checklist. Building this kind of detail into your alert system cuts down on diagnostic time and helps developers move quickly through your process.
This is especially helpful when a developer is up in the middle of the night, bleary-eyed and not at the top of their game.
Choosing the right alerting technology
Developing an IT alerting system that follows these best practices means being strategic about alerts up front. It also means choosing the right technology to do so. When choosing a vendor, we recommend looking for:
Multiple alerting channels
Email is often the channel of choice when it comes to alerts. But the truth is that email doesn’t always cut it. For urgent alerts, you may want or need SMS, mobile push notifications, or even voice calls. Look for a system that allows you to alert in a variety of ways.
Alert enrichment
Actionable alerts are detailed alerts. Which means a short text message isn’t always enough. Beware of strict character limits and look for technology that lets you attach charts, logs, runbooks, and checklists to provide additional context to an alert and let the developer know what they should do next.
Custom alert actions
Most alert technology will let you add a note to your alert or close it out. But sometimes there are steps in between, like escalating the alert for further investigation, creating a service ticket, or restarting a server. Look for tech solutions that let you do more than just open and close.
Automated actions
For some alerts, what to do next is complicated and requires an experienced developer’s insight. For others, the way forward is clear.
For alerts with clear next steps—diagnostic tests, remedial actions—you’ll want a system that triggers those responses automatically in response to an alert that meets your predefined criteria.
For example, if a database slows, perhaps you set your alert system to automatically switch to a backup database. If the first step in fixing Issue A is always to restart a server, maybe you set your alert system to restart the server and monitor the result before sending out a middle-of-the-night alert.
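The restart-then-check flow described above might look something like the sketch below. The three callables (`is_healthy`, `remediate`, `page_on_call`) are hypothetical hooks you would wire to your own health check, restart command, and paging tool; nothing here reflects a specific vendor’s API.

```python
from typing import Callable


def handle_alert(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    page_on_call: Callable[[str], None],
    max_attempts: int = 1,
) -> str:
    """Try the known fix first; page a human only if it does not work."""
    for _ in range(max_attempts):
        remediate()  # e.g. restart the server
        if is_healthy():  # monitor the result before waking anyone
            return "resolved_automatically"
    page_on_call("Automatic remediation did not restore service")
    return "escalated"
```

Capping the number of attempts matters: an automated fix that loops forever can hide a real outage from the people who need to know about it.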
Alert customization and classification
As alerts come in, your team should be able to organize them, tag them with additional info, and filter them.
Alert lifecycle tracking
In your incident postmortem, you’ll want to know when the alert came in, who received it, when they saw it, and what action was taken. Make sure any technology you choose automatically tracks these details. It’ll make it simpler to understand what is and isn’t working, improve your KPIs, and document past incidents so that on-call teams can learn from them and refer back to those learnings for future incidents.
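As a sketch of what lifecycle tracking captures, the snippet below records who did what and when, then derives a figure such as time to acknowledge. The `AlertTimeline` class and the action names ("created", "acknowledged", "resolved") are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AlertTimeline:
    alert_id: str
    # Each event is (timestamp_seconds, actor, action).
    events: List[Tuple[float, str, str]] = field(default_factory=list)

    def record(self, timestamp: float, actor: str, action: str) -> None:
        self.events.append((timestamp, actor, action))

    def _first(self, action: str) -> Optional[float]:
        # Timestamp of the first recorded event with this action, if any.
        return next((t for t, _, a in self.events if a == action), None)

    def seconds_between(self, start: str, end: str) -> Optional[float]:
        """Elapsed time between two actions, or None if either is missing."""
        first, second = self._first(start), self._first(end)
        if first is None or second is None:
            return None
        return second - first
```

With a timeline like this, the time from "created" to "acknowledged" and from "created" to "resolved" can be read straight off the record during a postmortem.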
Alert and notification policies
To set intelligent thresholds for your alerts and keep minor issues from waking your developers in the middle of their REM sleep, you need technology that lets you suppress, delay, and expedite alerts based on their content and timing.
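A notification policy of this kind can be sketched as a small decision function over an alert’s content (severity, tags) and timing. The quiet hours, the "maintenance" tag, and the rule order below are illustrative assumptions, not recommended defaults.

```python
from datetime import datetime, time
from typing import Set

# Assumed quiet hours: 22:00 up to (but not including) 07:00.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(moment: datetime) -> bool:
    t = moment.time()
    # The window wraps past midnight, so it is an "or", not an "and".
    return t >= QUIET_START or t < QUIET_END


def decide(severity: str, tags: Set[str], moment: datetime) -> str:
    """Return 'suppress', 'expedite', 'delay', or 'notify'."""
    if "maintenance" in tags:
        return "suppress"  # expected noise during planned work
    if severity == "critical":
        return "expedite"  # always goes out immediately
    if in_quiet_hours(moment):
        return "delay"  # hold minor issues until morning
    return "notify"
```

Rule order is the design choice that matters here: suppression is checked first and expediting second, so a critical alert is never delayed by quiet hours.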
Real-time monitoring for your monitoring
How do you know, at any given moment, that your alert systems are up and running?
The answer, with the right technology, should be that the tech has its own monitoring system. With Opsgenie, we do this with a tool called Heartbeats, which continuously checks that monitoring tools are active and connected and that custom tasks are completed on schedule. If the signal goes down, the system alerts you instantly.
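The general idea behind a heartbeat check can be sketched in a few lines: each monitoring tool pings on a schedule, and anything that goes quiet for too long is flagged. This is a generic illustration of the pattern, not how Opsgenie Heartbeats is implemented; the `HeartbeatMonitor` class and the tool names in the usage note are hypothetical.

```python
from typing import Dict, List


class HeartbeatMonitor:
    def __init__(self, interval_seconds: float) -> None:
        # Maximum silence allowed before a sender is considered down.
        self.interval_seconds = interval_seconds
        self._last_ping: Dict[str, float] = {}

    def ping(self, name: str, now: float) -> None:
        """Called by a monitoring tool or scheduled task to say 'still alive'."""
        self._last_ping[name] = now

    def expired(self, now: float) -> List[str]:
        """Names whose last ping is older than the allowed interval."""
        return sorted(
            name
            for name, last in self._last_ping.items()
            if now - last > self.interval_seconds
        )
```

A scheduler would call `expired` periodically and raise an alert for each name returned, for example a metrics agent that stopped reporting or a nightly backup that never checked in.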