How True Engineering Manages Alerts in Product Monitoring
We keep explaining how True Engineering builds its support processes. In this article, we'll go over the main tools that give our teams monitoring and alerting capabilities.
For us, this is a key task when creating IT products, since effective support brings a clear commercial benefit to the client. In previous articles, we have already shared our thoughts on what competent and effective tech support should look like:
- How to establish business monitoring of the product
- How to automate support request handling
- How analytics based on an OLAP cube works
- How we isolated common logging functions in libraries
The overall point of all these projects and activities is to automate processes so that support engineers do intelligent work instead of digging through logs. Automatic alerts, and the metrics by which monitoring systems track the state of the product, play an important role here. Below is how we process these metrics and what tools we work with.
How metrics work
We use Prometheus to collect technical and business metrics for our services. The indicators are visualized in Grafana, where dashboards with the key metrics are set up. These dashboards are deployed to all environments at once, including customer projects.
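As an illustration of the collection side, here is a minimal sketch of a service exposing a metric for Prometheus to scrape, using the official prometheus_client Python library. The metric name and port are placeholders, not our production values:

```python
# A minimal sketch of exposing a custom metric for Prometheus to scrape,
# using the official client library (pip install prometheus-client).
# The metric name and port are placeholders.
import random
import time

from prometheus_client import Gauge, start_http_server

PENDING_ITEMS = Gauge(
    "app_pending_items",
    "Number of items waiting to be processed",
)

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        PENDING_ITEMS.set(random.randint(0, 100))  # stand-in for a real reading
        time.sleep(15)
```

Prometheus then scrapes this endpoint on a schedule, and Grafana panels query the resulting time series.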
Each product has many business metrics. They are our tool for rapid investigation: they let us react to failures, look at the history, analyze the causes, and fix bugs where possible.
For example, in one of our solutions we track the number of uploads to integrated systems: the Bitrix data warehouse and the Opus product catalog management system. The uploads are broken down by file type, each shown in its own color, and the metric is updated every five minutes. Another metric is the duration of an upload by stage: a progress bar shows, in seconds, how long each stage of the upload takes.
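To make this concrete, below is a minimal sketch of how such metrics could be declared with the Python Prometheus client. The metric names, labels, and helper functions are hypothetical, chosen to mirror the example above rather than our actual code:

```python
from prometheus_client import Counter, Histogram

# Uploads to integrated systems, broken down by destination and file type;
# in Grafana each label combination becomes its own colored series.
UPLOADS = Counter(
    "upload_files_total",
    "Files uploaded to integrated systems",
    ["destination", "file_type"],
)

# Duration of each upload stage, for the per-stage progress view.
STAGE_DURATION = Histogram(
    "upload_stage_duration_seconds",
    "Duration of each upload stage in seconds",
    ["stage"],
)

def record_upload(destination: str, file_type: str) -> None:
    """Count one upload, e.g. record_upload("bitrix", "csv")."""
    UPLOADS.labels(destination=destination, file_type=file_type).inc()

def run_stage(stage: str) -> None:
    """Time one stage of an upload."""
    with STAGE_DURATION.labels(stage=stage).time():
        pass  # the actual work of the stage (e.g. the transfer itself) goes here
```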
The same metrics are also used to set up alerts. These arrive in Slack and let teams react quickly to failures without wasting time studying graphs.
Another option is to route alerts into Jira, where they automatically generate tickets.
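For Slack, Prometheus's Alertmanager has a built-in receiver; the Jira path can be wired up with a small webhook handler. Below is a hedged sketch of the latter, assuming an Alertmanager-style webhook payload and the standard Jira REST endpoint for creating issues; the URLs, credentials, and project key are placeholders:

```python
# Sketch: turning Alertmanager webhook payloads into Jira tickets.
# URLs, credentials, and the project key are placeholders.
import requests
from flask import Flask, request

app = Flask(__name__)

JIRA_ISSUES = "https://example.atlassian.net/rest/api/2/issue"
JIRA_AUTH = ("bot@example.com", "api-token")  # placeholder credentials

@app.route("/alert", methods=["POST"])
def handle_alert():
    payload = request.get_json(force=True)
    # Alertmanager sends a batch of alerts per webhook call.
    for alert in payload.get("alerts", []):
        summary = alert.get("annotations", {}).get(
            "summary", alert["labels"]["alertname"]
        )
        requests.post(
            JIRA_ISSUES,
            auth=JIRA_AUTH,
            timeout=10,
            json={
                "fields": {
                    "project": {"key": "SUP"},  # placeholder project key
                    "issuetype": {"name": "Bug"},
                    "summary": summary,
                    "description": str(alert.get("labels", {})),
                }
            },
        )
    return "", 204
```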
The main thing here is to set the conditions for each notification correctly, so that the team is not flooded with false positives. For some signals, a single Error message is reason enough to investigate; for RabbitMQ queues, it is crossing a conventional threshold, say 50 messages, that deserves attention.
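In our stack such conditions live in the alerting rules themselves, but as a rough illustration of the two example thresholds, here is a sketch that polls Prometheus's query API directly. The metric names (`log_messages_total`, `rabbitmq_queue_messages`) are assumptions modeled on common exporters, not our actual series:

```python
# Sketch: checking the two example conditions via Prometheus's HTTP query API.
# In production these would be Prometheus alerting rules; metric names are assumed.
import requests

PROMETHEUS = "http://prometheus:9090/api/v1/query"  # placeholder address

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the matching series."""
    resp = requests.get(PROMETHEUS, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Even a single Error-level message in the last 5 minutes is worth investigating...
errors = instant_query('increase(log_messages_total{level="error"}[5m]) > 0')

# ...while a RabbitMQ queue only deserves attention past the ~50-message mark.
backlog = instant_query("rabbitmq_queue_messages > 50")

for series in errors + backlog:
    print("needs attention:", series["metric"])
```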
To keep issues from getting mixed up, each product has its own channel. The notifications are visible to the team members who develop that particular product and, of course, to the support department.
Customers also have access to the Slack channels, so they too can monitor the status of alerts and respond to them if necessary.
How we handle alerts
All notifications are first handled by the support engineers. They close some alerts themselves when they can figure out what is causing the error and fix it. The rest turn into several types of tasks for the development teams:
- Urgent tasks. These require an immediate response.
- New bugs. These are cases no user has reported yet. When a metric-based alert reveals a bug, the team investigates and fixes it, and users never even know the bug existed.
- Adjusting metrics and alert conditions. Teams continuously improve the setup, searching for a balance that, on the one hand, catches all the critical bugs and, on the other, doesn't drown the team in endless notifications.
The alert system must live and grow
In conclusion, a few words on creating and maintaining a metrics system. The first thing to do is to immerse support engineers in the operation of the product, so that they know how it is structured and how its business processes are implemented technically. Together with the developers and PMs, they should break down all the major scenarios and define the metrics that reflect the health of the product as a whole and of its individual components.
In our practice, for the first two or three weeks after a product launch, tech support engineers meet with the team every morning for a debriefing: how the system behaved, which alerts fired, and whether there were any false positives or missed important events.
Based on these observations of the product in use, the team makes decisions to fine-tune the monitoring: change the measurement period, abandon some metrics in favor of others, and so on.
For example, among our products there is a sales accounting system where it is very important to track the arrival of certain data from external sources. Some of this data is downloaded automatically and some manually. Documents that arrived the manual way were uploaded later, which kept triggering an alert for support. We added a feature to the monitoring system to distinguish between automatic and manual uploads, and since then there have been no more false positives.
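In Prometheus terms, a fix like this can be as simple as adding a label to the metric and alerting only on the automatic flow. Here is a minimal sketch of the idea; the metric name, label values, and the PromQL expression in the comment are illustrative, not our production setup:

```python
from prometheus_client import Counter

# A "source" label splits automatic and manual uploads into separate series.
DOCUMENTS = Counter(
    "documents_received_total",
    "Documents received from external sources",
    ["source"],  # "auto" or "manual"
)

DOCUMENTS.labels(source="auto").inc()    # recorded by the integration job
DOCUMENTS.labels(source="manual").inc()  # recorded on a manual upload

# The delayed-data alert can then watch only the automatic flow, e.g. with a
# PromQL expression such as:
#   increase(documents_received_total{source="auto"}[1h]) == 0
# so late manual uploads no longer cause false positives.
```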
After about three weeks, a base of the typical situations that occur within the standard business flow has been built up, and the meetings can become less frequent. Support should still stay in touch with the developers, though, so that even before a release, support engineers understand which features will appear and how they will work.
Of course, alerts should be reviewed after every failure to understand why the team did not see the early signs of trouble. But it is equally important to do the same work after new features are released. Launching a new service or feature means new data flows, new integration links, and so on. It is hard for the team to predict how these will affect the monitoring metrics, so you need to reserve support resources for this kind of review.