Figure out what you’re trying to learn, then see which metrics can serve you.
Setting and tracking key performance indicators based on the right data can help incident management teams reduce the impact of incidents and strengthen the business.
But what exactly is the right data? That can be a deceptively tricky question. Incidents are complex, and no two are exactly the same – and your KPIs must reflect this complexity.
You’re probably familiar with a few of the most popular incident management metrics: MTTA (mean time to acknowledge), MTTR (mean time to resolve), and total number of incidents. It’s tempting to just pick whatever metric sounds right and designate it as your KPI.
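As a quick refresher, those time-based metrics are just averages over incident timestamps. Here’s a minimal sketch of how they’re typically computed – the incident records and field names below are hypothetical, not pulled from any particular tool:

```python
from datetime import datetime, timedelta

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"created": datetime(2023, 5, 1, 9, 0), "acknowledged": datetime(2023, 5, 1, 9, 4),
     "resolved": datetime(2023, 5, 1, 10, 30)},
    {"created": datetime(2023, 5, 2, 14, 0), "acknowledged": datetime(2023, 5, 2, 14, 12),
     "resolved": datetime(2023, 5, 2, 15, 0)},
]

def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# MTTA: average time from incident creation to first acknowledgment.
mtta = mean_minutes([i["acknowledged"] - i["created"] for i in incidents])

# MTTR: average time from incident creation to resolution.
mttr = mean_minutes([i["resolved"] - i["created"] for i in incidents])

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```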
But then all you’re left with is a fancy chart showing you what you already know: an incident is underway. What can you learn from that data? How can you take action based on it?
The answer: flip your thinking. Figure out what you’re trying to learn and accomplish first, then see which available metrics can serve you in those pursuits.
Ask the right questions, figure out your goals
It feels good to set and track against a KPI, but it’s important to remember that this isn’t the end goal. Just as stepping on a scale every day won’t make you lose weight, tracking a number won’t improve anything on its own. You need to find a way to take action based on your data – otherwise, you’re just staring at numbers.
Our goal, then, is to find data points we can ultimately use as inputs to learn more from incidents and answer hard questions.
Questions like:
- How do people escalate incidents?
- How can we have fewer incidents?
- Which services experience critical incidents?
- What types of weaknesses lead to incidents?
Think about what these questions might be for your particular team and organization, and work backward from there. Consider your product and customers: if you work for a bank, you’ll probably prioritize security and data integrity over total uptime, whereas if you’re in ecommerce, you might prioritize fewer incidents and higher total uptime. Then, look for data points that could shed light on these questions. Maybe it’s time between escalations, or the number of services impacted per incident. This process will help uncover potential KPIs you may never have considered. And if you do end up measuring a more “conventional” metric, like MTTA or total “9s” of uptime, it will be for the right reasons.
Consider the available metrics
Now that you have an understanding of what you’re trying to learn, you can begin combing through available metrics to see which data points could determine your KPIs.
Here’s an example. Let’s say your goal is shorter incidents. You might get your team together in front of a whiteboard and start with that goal in the form of a question: How can we resolve incidents faster?
From there, you’d brainstorm all the possible ways this might happen. One answer you might land on: “Make sure the right person is on call.” Then look at the available data. Ask yourself, “What are the indicators that we might have the wrong person on call?” One possible indicator could be the number of on-call escalations. If the first on-call responder is constantly having to escalate to additional responders, that could be slowing down incident resolution. Your KPI, then, could be 50 percent fewer escalations. A note of caution, though – you don’t want engineers avoiding necessary escalations just to hit a KPI target.
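If you did want to track that indicator, a rough sketch might look like the following, assuming you can export incident records with an escalation count – the data shape and baseline figure here are hypothetical, not any particular tool’s API:

```python
from statistics import mean

# Hypothetical export: one record per incident, with how many times
# the first responder had to escalate before resolution.
incidents = [
    {"id": "INC-101", "escalations": 2},
    {"id": "INC-102", "escalations": 0},
    {"id": "INC-103", "escalations": 3},
]

# Illustrative baseline, e.g. measured over the previous quarter.
baseline_avg_escalations = 1.8

current_avg = mean(i["escalations"] for i in incidents)
reduction = (baseline_avg_escalations - current_avg) / baseline_avg_escalations * 100

print(f"Average escalations per incident: {current_avg:.2f}")
print(f"Reduction vs. baseline: {reduction:.0f}% (goal: 50%)")
```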
With that in mind, let’s take a look at some of the most common key metrics on-call teams might consider. You may find answers to your key questions here, or you may want to dig even deeper.
Team balance and burnout: on-call times and performance
Don’t overlook the impact of team member satisfaction on incidents and recovery time. For example, if one person shoulders most of the on-call duty for a service because nobody else is familiar with it, note that imbalance and take action to prevent on-call burnout.
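One simple way to spot that pattern is to tally on-call hours per person from a schedule export. Here’s a minimal sketch, assuming a hypothetical list of shifts rather than any specific scheduling API:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical on-call shift export; names and fields are illustrative.
shifts = [
    {"who": "priya", "start": datetime(2023, 5, 1, 9), "end": datetime(2023, 5, 3, 9)},
    {"who": "sam",   "start": datetime(2023, 5, 3, 9), "end": datetime(2023, 5, 4, 9)},
    {"who": "priya", "start": datetime(2023, 5, 4, 9), "end": datetime(2023, 5, 8, 9)},
]

hours_by_person = defaultdict(float)
for shift in shifts:
    hours_by_person[shift["who"]] += (shift["end"] - shift["start"]).total_seconds() / 3600

# Flag anyone carrying a disproportionate share of the load.
total = sum(hours_by_person.values())
for person, hours in sorted(hours_by_person.items(), key=lambda kv: -kv[1]):
    share = hours / total * 100
    flag = "  <- possible burnout risk" if share > 50 else ""
    print(f"{person}: {hours:.0f}h ({share:.0f}% of on-call time){flag}")
```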
Alerts per: status, priority, tag, team, date
“Alerts per” is a grouping schema used in Opsgenie to organize alerts based on different fields. It’s a good way to identify toil – an SRE term for the manual, repetitive work tied to running a production service.
For example, when you look at alert severity levels per source, you may see a pattern emerge for a specific type of alert and conclude that most high-priority alerts are coming from a particular service. You can also slice the data further with queries like alerts per status or alerts per priority.
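Conceptually, these breakdowns are just grouping alert records by a field and counting. If you exported your alerts to work with them directly, a sketch of that idea might look like this – the field names and values below are assumptions for illustration, not Opsgenie’s actual schema:

```python
from collections import Counter

# Hypothetical alert export; priority, source, and status values are illustrative.
alerts = [
    {"priority": "P1", "source": "payments-service", "status": "closed"},
    {"priority": "P1", "source": "payments-service", "status": "open"},
    {"priority": "P3", "source": "search-service",   "status": "closed"},
    {"priority": "P2", "source": "payments-service", "status": "acked"},
]

# Alerts per priority.
print(Counter(a["priority"] for a in alerts))

# High-priority alerts per source: does one service dominate?
high_priority_sources = Counter(
    a["source"] for a in alerts if a["priority"] in ("P1", "P2")
)
print(high_priority_sources.most_common())
```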
Don’t lose sight of what you want to learn – and what you can’t measure
Collecting and visualizing metrics can help teams assess their own performance and improve. Managers also benefit from visibility into the incident management process and how teams are performing.
But remember: use these metrics to answer the right questions. And don’t forget about the things you want to learn that aren’t easily measured. Don’t give up on learning those things just because the answers don’t fit neatly into a dashboard or SQL query.
And don’t forget to talk with each other and run incident postmortems. Couple those conversations with your data to surface even more insights. Every incident is a learning opportunity, and the key to learning is asking the right questions.