You have been building a highly requested feature, and now it is complete. The last pull request is merged, there are no further change requests, and all stakeholders are happy. Now what?

As a developer, you want to make sure everything works as expected. You want instruments that show how the feature performs and help you identify issues with it.

The answer is monitoring. Plenty of tools are available, including New Relic, Datadog, and Prometheus. Each offers a rich feature set, so choosing one is a matter of preference and company budget.

But what kind of monitoring should we set up? Do we need to have all possible alerts? Can we identify the most important monitors?

I discovered that at least the following three monitors are always crucial. They will give you confidence in your code and provide sufficient observability to everybody.

System Outage

The most critical monitor checks whether any data is arriving at all. Missing data is worrying: it means either your entire application is unavailable or the monitoring is misconfigured.

First, we need to ensure we are sending the events properly. After the monitor is created, verify the configured metrics. The monitor should show the received data.

The next step is adding an alert for when there is no data. Choose the evaluation period based on how frequently data arrives, and configure a notification to be sent when no events are received during that period.
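The no-data check itself is simple, and a small sketch can make the role of the evaluation period clearer. The function name `is_no_data`, its parameters, and the 10-minute period are assumptions made for illustration. Real monitoring tools expose this as a configuration option rather than as code you write yourself.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_no_data(last_event_at: Optional[datetime],
               now: datetime,
               evaluation_period: timedelta) -> bool:
    """Return True when no event arrived within the evaluation period."""
    if last_event_at is None:
        # Nothing was ever received: treat it as "no data".
        return True
    return now - last_event_at > evaluation_period


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Assumed for this sketch: events normally arrive every minute or so,
# so ten silent minutes are suspicious.
period = timedelta(minutes=10)

print(is_no_data(now - timedelta(minutes=3), now, period))   # False
print(is_no_data(now - timedelta(minutes=25), now, period))  # True
```

A period that is too short for the data frequency will page people for ordinary gaps, while one that is too long delays the reaction to a real outage.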

The no-data monitor is critical. Its alert must page an on-call engineer, because it could indicate a service outage. You have to react quickly to mitigate the issue and keep your customers satisfied.

Error Rate

Another monitor type that can save the business is the error rate. This monitor helps to identify issues in the code, especially if the defect was introduced recently.

How does the error rate monitor work? It notifies you once the number of errors crosses a certain threshold. Usually, the threshold is set as a percentage of the total number of events. But it can also be an absolute number if you know precisely how often “bad” events happen.

How can you set up an error rate monitor? Error rate monitoring requires two types of events: one on success and another on failure. Then we calculate the ratio of the failures to the total events.

Once the error rate is calculated, we can define the threshold to watch. The threshold value depends on how reliable the system needs to be.

For example, if the system processes 1M network requests daily, an error rate threshold of 1% corresponds to 10k failed requests. That can be a lot for a critical system, but it is fine for an entertainment system with automatic retries.
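The calculation described above can be sketched in a few lines. The function names `error_rate` and `should_alert` are hypothetical, and the choice to alert only when the rate is strictly above the threshold is an assumption. Monitoring tools usually let you pick the comparison.

```python
def error_rate(successes: int, failures: int) -> float:
    """Ratio of failed events to all events, as a fraction between 0 and 1."""
    total = successes + failures
    if total == 0:
        # No events at all is a case for the no-data monitor, not this one.
        return 0.0
    return failures / total


def should_alert(successes: int, failures: int, threshold: float) -> bool:
    """Alert once the error rate exceeds the threshold (0.01 means 1%)."""
    return error_rate(successes, failures) > threshold


# 1M daily requests with 10k failures is exactly 1%, so a 1% threshold
# is reached but not exceeded.
print(error_rate(990_000, 10_000))          # 0.01
print(should_alert(990_000, 10_000, 0.01))  # False
print(should_alert(985_000, 15_000, 0.01))  # True
```

Note that both counters are needed: the number of failures alone says little unless you also know how many events succeeded in the same period.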

Anomaly Observation

There are situations when the monitored events are seasonal. For instance, a banking system receives its peak load during regular business hours. Meanwhile, entertainment applications like Netflix get the most traffic in the evening.

The seasonality pattern can be daily, weekly, monthly, or even custom if your product runs regular sales on a specific day or time.

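How can you track whether the system follows its seasonal pattern?

Many monitoring tools provide an anomaly monitor for this. It compares the incoming values with what the historical data suggests for that time and alerts on significant deviations.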

Tuning anomaly monitors can be difficult because you don’t know the patterns and the deviations precisely. Additionally, there is always a chance of a false positive alert. This usually happens during the Christmas season, when many people are on holiday and traffic drops. The opposite happens on Black Friday, when everybody rushes to purchase gifts for their loved ones.

An anomaly monitor works best if you have enough events. With a low number of events, the seasonal pattern has no clear shape, which leads to low precision and frequent false positive alerts.

Setting up an anomaly monitor takes time because you need to learn the data patterns. Therefore, you should start conservatively. A less sensitive monitor that tolerates wider deviations will reduce the noise at the beginning. Later, you can gradually tighten the settings, making the monitor more precise.
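To make the idea concrete, here is a deliberately simplified sketch of a seasonal check. It is not the algorithm any particular monitoring tool uses. It only compares a new value with the mean and standard deviation of earlier values from the same hour of the day. The function `is_anomaly`, the `tolerance` parameter, and the sample numbers are all assumptions made for this illustration.

```python
from statistics import mean, stdev
from typing import Dict, List


def is_anomaly(history: Dict[int, List[float]],
               hour: int,
               value: float,
               tolerance: float = 3.0) -> bool:
    """Flag `value` when it is far from what was seen at the same hour before."""
    samples = history.get(hour, [])
    if len(samples) < 2:
        # Too few events to describe the pattern, so stay quiet.
        return False
    baseline = mean(samples)
    spread = stdev(samples)
    # A larger tolerance accepts wider deviations (more conservative).
    return abs(value - baseline) > tolerance * spread


# Made-up requests per minute, keyed by hour of the day.
history = {
    3: [100, 110, 95, 105],        # quiet night hours
    20: [1000, 1100, 950, 1050],   # evening peak
}

# The same value is normal in the evening but unusual at night.
print(is_anomaly(history, 20, 1000))  # False
print(is_anomaly(history, 3, 1000))   # True

# Starting conservatively: a wider tolerance silences a borderline case.
print(is_anomaly(history, 20, 800, tolerance=3.0))  # True
print(is_anomaly(history, 20, 800, tolerance=4.0))  # False
```

With only a handful of samples per hour, the spread is unreliable, which is the low-volume problem described above. Starting with a higher `tolerance` and lowering it as more history accumulates mirrors the conservative approach.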

Afterword

Setting up monitoring sounds like an easy task.

However, software developers often forget about it.

A common mistake is not sending enough data or sending incomplete data. As a result, they can’t fully evaluate the system’s performance or analyze a specific data segment.

Being able to analyze and observe a feature’s performance is a sign of an experienced engineer. Master this skill and keep growing as a software engineer!
