Designing Systems and Teams For Failure

Imagine the following situation. You are a software developer in an organization building a product.

Your days look much the same: you start working in the morning, have a few catch-up meetings with teammates, take a lunch break, continue working, and by the end of the day, your pull request is likely completed.

It is 6 PM, the laptop is closed, and you move on to your private time: hobbies, dinner, or family. Suddenly, at midnight, you get a call.

Your heart rate goes up, your mind goes blurry, and on the other end of the line, a robotic voice tells you that the application is down. You rush to the laptop and find seemingly random errors in the logs. But what should you do?

The first thing that comes to mind is to fix the error. However, the users have to wait until the fix is deployed. And who is going to review your changes in the middle of the night?

Hopefully, you have never been in such a situation. But if you have, how did you react? Did everything go smoothly, with as few users impacted as possible?

Incidents are unavoidable in software development, so we need to be ready for them. How can you prepare?

Over the years, I have observed how organizations grew and how their engineering culture changed. Only those that learned from their mistakes became successful.

From my point of view, that success comes from adopting the practices below.

Better Handling Through Training

Making the right decisions under pressure is a skill to develop through experience or training. While gaining experience in the production environment can be costly, regular training can significantly contribute to the confidence of the team.

What does incident response training look like?

It is a simulation of an incident in a production environment, without a real incident behind it. How can we achieve that?

You can have a dedicated dummy endpoint with logic that throws errors. It can also be a background job that randomly produces errors without impacting any users or functionality.
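
As a minimal sketch of such a background job, the snippet below raises and logs a synthetic error at a configurable rate. The names `DrillError`, `run_drill_job`, and the `incident_drill` logger are hypothetical and chosen only for illustration; the job touches no user data, so the only visible effect is an error in the logs for the team to investigate.

```python
import logging
import random

logger = logging.getLogger("incident_drill")  # hypothetical logger name


class DrillError(RuntimeError):
    """Synthetic failure raised only by the training job."""


def run_drill_job(failure_rate: float, rng: random.Random) -> bool:
    """Run one tick of the drill job.

    Returns True if a synthetic error was raised and logged, False otherwise.
    """
    try:
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise DrillError("synthetic failure for incident response training")
    except DrillError:
        # Log with a stack trace, exactly as a real failure would appear.
        logger.exception("drill job failed")
        return True
    return False
```

A scheduler could call `run_drill_job(0.05, random.Random())` every minute during a training window; passing the random generator in explicitly keeps the job easy to test.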

As part of the training, the team must investigate the issue and mitigate it. Typically, the team works toward the root cause using various tools. Checking logs and metrics, localizing the issue, and evaluating the impact are all preliminary steps.

The key aspect is to work as a team while handling the incident. Everybody learns how to work efficiently in a stressful environment.

To avoid chaos, incident management defines multiple roles. The most common are:

The incident commander is in charge of responding to the incident. The person leads and manages all processes within the incident. They distribute the roles in the team and coordinate the work.
The communication lead takes care of the communication processes with the stakeholders and users. Sharing updates and notifying when the incident is over are their main responsibilities.
The documenter records all important information about the incident: timeline, actions, and decisions. Based on that data, they generate a postmortem to share with the rest of the organization.
The technical specialists dive into the problem and try to mitigate the incident. Usually, those are software engineers, and when necessary, data analysts.

The roles above can vary with team size and structure. You might have extra roles to cover specific needs. In a smaller organization, on the other hand, you have fewer roles, and each one carries more responsibilities.

There are a few key aspects in the training.

First, it should happen in production or a close-to-production environment. The steps during a real incident should be the same as during the training. Otherwise, you risk having people who were trained only on a staging environment and who, during a production incident, discover that they lack access rights or credentials for certain services.

Second, it is important to wear the hat of each role during the training. If you are a good technical specialist, try to be the incident commander next time. If you are good at communication, jump into the role of the technical specialist. This ensures that everybody on the team can handle an incident efficiently.

Personally, I always discover something new while watching how engineers from my team use the same tools in different ways.

After a couple of rounds of training, you will notice how the team’s confidence grows. Everyone becomes aware of how their domain operates and can provide effective assistance during incidents.

Clear Action Plan

In a stressful situation, people act differently than in a calm environment. Regular training every quarter is a first step, but the probability of missing an important action under pressure remains high. When every minute counts, a delay or an extra mistake is very costly.

This is why runbooks exist. A runbook is a document with step-by-step instructions for accomplishing a task.

For that reason, it makes sense to have a runbook for incident management. It describes the important steps to take during an incident. Following a document is much easier than trying to recall the steps from memory.

In a well-functioning organization, you don’t deal with incidents often. Therefore, the knowledge fades away over time. To prevent that, you want to keep it documented somewhere.

Runbooks are handy in various situations: for instance, how to do a rollback or what to do if a third-party provider stops working. Clear guidelines covering every scenario are beneficial for the team, especially when new team members join.

Following defined steps is a safer and less stressful way to execute any action. You can think of a runbook as the software counterpart of a checklist in aviation. It is a proven concept in many companies.
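
To illustrate the checklist idea, here is a small sketch that models a runbook as an ordered list of steps that must be ticked off in sequence. The `Runbook` class and the rollback steps in the usage note are hypothetical examples, not a prescribed format; in practice a runbook is often just a well-maintained document.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Runbook:
    """An ordered checklist; steps must be completed in sequence."""

    title: str
    steps: List[str]
    done: Set[int] = field(default_factory=set)

    def next_step_index(self) -> Optional[int]:
        """Return the index of the first unfinished step, or None if all are done."""
        for i in range(len(self.steps)):
            if i not in self.done:
                return i
        return None

    def complete(self, index: int) -> None:
        """Mark a step as done, refusing to skip ahead or repeat a step."""
        if index != self.next_step_index():
            raise ValueError("steps must be completed in order")
        self.done.add(index)
```

For example, `Runbook("Rollback", ["Announce the rollback", "Redeploy the previous version", "Check logs and metrics"])` would refuse `complete(2)` until the first two steps are marked as done, which is the same guarantee a paper checklist gives when it is read from top to bottom.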

Game Days

Do you know how your system operates when crucial components stop working? Imagine that a third-party service responsible for processing payments is unavailable due to maintenance. Does it mean the users cannot make any purchases in your application? Or do you have a backup plan?

To learn how the system behaves in such situations, you need to test it under failure conditions. One way to do that is to run a Game Day.

It is similar to the training covered in the section above, but with a slight difference: you deliberately try to break the system. Then you observe what happens and how to recover quickly.

To make this concrete, let's explore a realistic scenario in which the payment provider of your application experiences an outage. You can simulate it with a dedicated feature flag that overrides the network response from the provider.
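
One possible shape for such a flag is sketched below. Everything here is an assumption made for illustration: `FeatureFlags`, `PaymentClient`, `PaymentProviderUnavailable`, and the `send` callable stand in for whatever flag system and provider SDK your application actually uses.

```python
from dataclasses import dataclass
from typing import Callable


class PaymentProviderUnavailable(Exception):
    """Raised when the provider cannot be reached (or the outage is simulated)."""


@dataclass
class FeatureFlags:
    # Hypothetical flag, switched on only for the duration of the Game Day.
    simulate_payment_outage: bool = False


class PaymentClient:
    """Thin wrapper around the real provider call, injected as `send`."""

    def __init__(self, flags: FeatureFlags, send: Callable[[int], str]) -> None:
        self._flags = flags
        self._send = send

    def charge(self, amount_cents: int) -> str:
        if self._flags.simulate_payment_outage:
            # Short-circuit before any network call, as if the provider were down.
            raise PaymentProviderUnavailable("simulated outage (Game Day)")
        return self._send(amount_cents)
```

Because the override sits in the client wrapper, the rest of the application sees the same exception it would see during a real outage, which is what makes the exercise meaningful.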

What happens next?

Do the customers see a blank page with error code 500? Or can they still navigate through the website?

During this exercise, you will see the weak points of your system.

In our example, there are multiple options for improving the system.

The first that comes to mind is to tell the users that payments are temporarily unavailable. Not the best option, but at least you will lose fewer paying customers.

The next step is to integrate a backup payment provider that developers can enable when the primary one is down. This ensures that your product keeps generating revenue.

Lastly, the system can become self-recovering. If there are too many errors on the primary payment provider, the traffic is redirected to the backup one. And as soon as the main provider is healthy again, the traffic is diverted back to it.
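
The self-recovering behavior can be sketched as a simple failover router, loosely following the circuit breaker pattern. This is one possible approach under stated assumptions, not the only way to do it: the `FailoverRouter` class, the thresholds, and the `primary` and `backup` callables are all hypothetical.

```python
import time
from typing import Callable, Optional


class FailoverRouter:
    """Route charges to the primary provider, falling back to the backup."""

    def __init__(
        self,
        primary: Callable[[int], str],
        backup: Callable[[int], str],
        max_failures: int = 3,
        retry_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary = primary
        self._backup = backup
        self._max_failures = max_failures
        self._retry_after = retry_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # when the primary was last disabled

    def _primary_allowed(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let one request probe the primary again.
        return self._clock() - self._opened_at >= self._retry_after

    def charge(self, amount_cents: int) -> str:
        if self._primary_allowed():
            try:
                result = self._primary(amount_cents)
            except Exception:  # broad on purpose in this sketch
                self._failures += 1
                if self._failures >= self._max_failures:
                    # Too many errors: skip the primary until the cool-down ends.
                    self._opened_at = self._clock()
            else:
                # Primary is healthy again: divert traffic back to it.
                self._failures = 0
                self._opened_at = None
                return result
        return self._backup(amount_cents)
```

Injecting the clock makes the cool-down testable without sleeping, and a Game Day is a natural moment to check that the thresholds behave as intended against the simulated outage.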

To ensure that the system is functioning as described above, you need to simulate the outage of the payment provider. And that’s exactly the main goal of the Game Day.

Do it regularly in the production environment to be confident that all components of the system are resilient. This practice lets teams sleep better and reduces the business’s monetary losses.

In the real world, you can’t cover all possible scenarios. Mistakes and incidents will happen, and that is fine.

What we want is to avoid common mistakes and ensure everybody is prepared to handle unexpected situations.

This will make the software and the teams truly resilient.
