It’s fairly common for projects to live in alerting chaos. Not only have I been immersed in that mess a few times, but I’ve also contributed to it because of the many misconceptions I had about alerting.
We had more alerts than we could handle, missed many important ones, and took a lot of time to act on them because the only context they carried was the error or problem described in the alert message.
So, what principles and practices should you follow to avoid the alerting chaos in your projects?
Alerts shouldn’t be ignorable
If the people responsible for handling alerts have to decide whether a notification can be ignored before taking action, that judgment call will likely lead to important events being ignored. Figuring out whether an alert represents a real threat to the system also takes time that could be spent responding to the issue instead.
Every alert that doesn’t represent a real problem compromises the precision of your alerting strategy and the overall trust in it. Such alerts should be reviewed and repurposed or deleted - don’t be afraid of letting them go!
It’s also essential to alert only the people who can take action on the problem. If an alert isn’t actionable by the person who receives it, it only creates noise and distracts them from their work. The consequences of those events can be communicated to stakeholders, or whoever else needs to be aware of them, in other ways.
Information and context are key
The alert title and message should include details about the problem in a human-readable way, and that’s already a significant first step. However, there’s also work you can do before alerts ever fire that can save valuable time when responding to them:
- Create runbook documentation with one page for every known alert, including as many instructions as possible about how to debug the problem and its potential causes.
- Include links to reports of past similar alerts along with their solutions.
- Attach a link to this documentation to the alert description (see the sketch after this list), and
- When alerts happen, encourage your team to improve this documentation.
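If your alerts come from CloudWatch alarms publishing to SNS (similar to the setup described later in this section), the runbook link can live in the alarm description so it travels with every notification. Here’s a minimal sketch; the metric, threshold, topic ARN, and runbook URL are placeholders:

```python
# Sketch: a CloudWatch alarm whose description carries the runbook link,
# so every notification arrives with debugging context attached.
# The metric, namespace, SNS topic ARN, and runbook URL are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-errors",
    AlarmDescription=(
        "Orders API is returning 5xx responses. "
        "Runbook: https://example.com/runbooks/orders-api-5xx"
    ),
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Statistic="Sum",
    Period=60,                      # evaluate one-minute buckets
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerting-ticket"],
)
```

The same idea applies to other tools; Prometheus users, for instance, often attach a `runbook_url` annotation to their alerting rules.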
Frequency matters
A single bad event won’t necessarily represent a problem with a real impact on the system, and alerting on every one of them can become very noisy. As a result, we end up ignoring alerts and losing sight of when they become a relevant problem.
To identify the failures that actually indicate an issue, take the frequency of occurrences into account before triggering an alert. You can configure alerts that fire only when a certain rate of bad events happens within a given time window.
As an example, if a pod once takes longer than expected to scale under a spike in demand, that might not be a problem. However, if it happens multiple times in a short period, there’s likely an issue with the scaling process, and you’d probably want to be notified about it.
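To make the idea concrete, here’s a minimal sketch (not tied to any particular monitoring tool) of a sliding-window counter that only fires once bad events repeat within a time window; the threshold and window below are arbitrary examples. Most monitoring systems offer equivalent built-in settings, such as CloudWatch’s evaluation periods, so you rarely need to implement this by hand.

```python
# Sketch: fire an alert only when enough bad events land inside a sliding
# time window, rather than on every single occurrence.
import time
from collections import deque


class FrequencyAlert:
    """Tracks bad events and reports when they cross a rate threshold."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()

    def record_bad_event(self):
        """Record one bad event; return True when the alert should fire."""
        now = time.monotonic()
        self.events.append(now)
        # Drop events that have fallen out of the time window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.threshold


# Hypothetical usage: alert only if slow pod scaling happens 3 times in 10 minutes.
slow_scaling = FrequencyAlert(threshold=3, window_seconds=600)
if slow_scaling.record_bad_event():
    print("Trigger the alert: slow scaling keeps happening.")
```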
“Hello, is this a good time?”
Balancing feature work and maintenance work can be challenging, and timing plays a vital role in that task.
Alerts that need immediate action should notify someone on call, while other alerts can be grouped and delivered at scheduled times, or routed to a channel that is reviewed and prioritized regularly.
To achieve good timing for handling alerts, you need to categorize them. You can use a typical severity or priority scale (info, moderate, or critical severity; low, medium, or high priority) as long as the action to take for each alert category is clear.
At thoughtbot, we chose to link each alert category to the action that handles it so we can respond more effectively. We use three main categories:
- Warning: problems that don’t need immediate action. Those alerts are sent to Sentry, and the team reviews them regularly and creates tickets that get prioritized later.
- Ticket: issues that don’t need to happen more than once to require action, so they go to Opsgenie, which automatically creates a Jira ticket for them. For example, an alert that a database’s storage is almost at full capacity doesn’t have to fire a few times to become a problem.
- Page: incidents or critical errors that require an immediate response. Those alerts go to Opsgenie to notify an on-call developer.
This diagram illustrates how we use that in practice:
🔎 A closer look at our alerting architecture: we developed a telemetry module in Flightdeck that, among other things, bakes in SNS topics for our alerting categories. Those SNS topics trigger a Lambda function that forwards the alerts to their final destination.
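To illustrate the forwarding step, a Lambda handler along these lines can route each alert based on the category encoded in its SNS topic. This is a simplified sketch, not the actual Flightdeck module; the topic names, payload shape, and destination helpers are assumptions:

```python
# Sketch of a forwarding Lambda: route each alert to a destination based on
# the category encoded in its SNS topic name. Topic names, payload shape,
# and the destination helpers below are illustrative assumptions.
import json


def forward_to_sentry(message):
    ...  # e.g. send the event to Sentry via its SDK or API


def create_opsgenie_alert(message, page):
    ...  # e.g. call the Opsgenie Alert API with the appropriate priority


def handler(event, context):
    for record in event["Records"]:
        sns = record["Sns"]
        message = json.loads(sns["Message"])
        topic_arn = sns["TopicArn"]

        if topic_arn.endswith("alerting-warning"):
            forward_to_sentry(message)                  # reviewed regularly
        elif topic_arn.endswith("alerting-ticket"):
            create_opsgenie_alert(message, page=False)  # becomes a Jira ticket
        elif topic_arn.endswith("alerting-page"):
            create_opsgenie_alert(message, page=True)   # notifies on-call
        else:
            raise ValueError(f"Unknown alerting topic: {topic_arn}")
```

Keeping the routing in one place means adding a new category is a matter of creating another SNS topic and another branch, rather than reconfiguring every alarm.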
You can go as granular as you want with those categories. If you have multiple teams working on the same project, for example, you can have additional categories to filter alerts that different teams should handle.
Contrary to the idea of alerting everyone, everywhere, all at once…
By implementing these principles and practices, teams can establish a well-organized, healthy, and efficient alerting system, ensuring important issues are promptly addressed while minimizing the chaos associated with excessive or mismanaged alerts. As opposed to alerting everyone, everywhere, all at once, you can alert the right people at the right time, and that’s what matters!