Modern infrastructure consists of many small applications running under a single orchestrator, which manages their number, their updates, and their resource requests. This is not because administrators found it convenient to manage things this way. Such infrastructure reflects current thinking in software development. To understand why we talk about microservice architecture as an ideology, we need to go back about 30 years.

In the late 80s and early 90s, object-oriented programming was the answer to the growing volume of software; after all, that was when personal computers came into widespread use. Where software had earlier been a small “utility”, around this time the development of large-scale software turned into big business. Development teams of thousands of people building extensive functionality stopped being unusual. Businesses needed to understand how to divide work among teams of programmers so that everything did not get mixed up, and object-oriented programming was the answer to that question. The release cycle, however, was still slow.

You planned the release of your product (say, Microsoft Office 95) a few years ahead. Once development was complete, you tested it thoroughly (after all, you could not easily fix errors once the application had been delivered to the end user). Then you sent the “binaries” to a factory that produced the run of compact discs (or diskettes); these were packed in cardboard boxes and shipped to stores around the world, where users bought them and installed the application on their computers. This is the main difference from what happens now.

Since 2010, microservice architecture has been the answer to the need of large applications and companies to update as quickly as possible. We no longer need to install the application on users’ computers. In effect, we “install” the application in our own infrastructure and can update it quickly. So we can (and the business wants us to) ship updates as fast as possible in order to experiment and test hypotheses. The business needs new functionality to attract new users and keep current ones. The business needs to experiment and see what makes users pay more. Finally, it is important for the business to keep up with competitors. The business wants to update the codebase dozens of times a day. In theory, this can be done with one large application, but if you split it into many small pieces, updates become easier to manage. In other words, the transition to microservice architecture was almost never driven by a desire to make the application and infrastructure more stable: microservices are part of agile, and the business was driven by the ability to increase the “flexibility” of the application.

What does flexibility mean? It means speed, ease of change, and the opportunity to change your mind quickly. The main thing, then, is not the solidity of the solution but how fast it can be delivered and how quickly the concept can be tested. It is assumed that once the solution has been validated, resources will be allocated to make it robust; in practice (especially in small teams and growing businesses, where the whole focus is on product development) this does not happen. Technical debt accumulates, and it becomes especially dangerous when combined with the belief that “Kubernetes will bring everything back up by itself.”

But this is not so. I recently came across an excellent quote that captures both the advantages and the horrors of Kubernetes for operations:
“Kubernetes is so awesome that one of our JVM containers has been periodically running out of memory for more than a year, and we just recently realized about it.”

Let’s think about this phrase. For a year, the application ran out of memory from time to time, and the operations team paid no attention. Does this mean that the application kept working stably and doing its job most of the time? At first glance, the feature really is good: instead of an admin receiving a message about a service crash and going to bring it back up, Kubernetes itself detected that the application had crashed and restarted it. Over the year this happened regularly, and no alerts reached the operations team. I have seen a project, however, where a similar situation occurred in exactly one case: when generating a monthly report. The reporting functionality was developed and rolled out to help the business, but after a short time users began to receive HTTP 502 in response to their requests: the application crashed, the request was interrupted, and Kubernetes restarted the application. And although the application otherwise worked fine, the report was never actually generated.

The employees who needed the report preferred to prepare it the old-fashioned way and did not report the error (it is needed only once a month, so why bother people), and the operations team gave no priority to a problem that occurred at most once a month. In the end, all the resources spent on creating this functionality (business analysis, planning, development) were wasted. This, too, was discovered only a year later.

In our work, drawing on previous experience, we have settled on a number of practices that help us cover the risks of supporting microservice applications. In this article, I will talk about the 10 most important of them.

Service restarts are not monitored, or are not given due importance

Example:
The example is described above. The problem: at a minimum, each restart means a user who did not get a result; at worst, the same function may systematically fail to run.

What to do?
Basic level of monitoring: monitor the fact that your services restart. Perhaps you should not give priority to a service that restarts once every three months, but if a service starts restarting every five minutes, it is undoubtedly a priority.
Advanced level of monitoring: pay attention to every service that has ever restarted, and set up a process that creates tasks to analyze such restarts.
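
The two levels above can be sketched roughly as follows. This is only an illustration: it assumes restart counts per service have already been collected somewhere (for example, from the container restart counters Kubernetes exposes), and the names RestartStats and classify, the threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RestartStats:
    service: str
    restarts: int        # restarts observed during the window
    window_hours: float  # length of the observation window, must be > 0


def classify(stats: RestartStats, urgent_per_hour: float = 1.0) -> str:
    """Return 'ok', 'investigate', or 'urgent' for one service."""
    if stats.restarts == 0:
        return "ok"
    rate = stats.restarts / stats.window_hours
    # Basic level: a service that restarts frequently is a priority.
    if rate >= urgent_per_hour:
        return "urgent"
    # Advanced level: any restart at all deserves a task for analysis.
    return "investigate"


if __name__ == "__main__":
    # Hypothetical sample data.
    samples = [
        RestartStats("billing", 0, 24),
        RestartStats("reports", 1, 720),  # one restart in a month
        RestartStats("api", 12, 1),       # one restart every five minutes
    ]
    for s in samples:
        print(s.service, classify(s))
```

The threshold is a judgment call; the point is only that a rare restart creates a task while a frequent one pages someone.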

Service errors (such as Fatal Error or Exceptions) are not monitored

Example:
The application may not crash but instead return a stack trace to the user (or to another application through the API). Thus, even if we monitor application restarts, we may miss cases where a request completed incorrectly.

What to do?
Application logs should be aggregated in some central system and analyzed. Errors must be reviewed systematically by a human, and critical errors must trigger alerts and be escalated immediately.
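
As a rough illustration of the idea, the sketch below counts log lines by level and separates the ones that should alert immediately. It assumes plain-text logs in which the level appears as an uppercase word; the level names, the summarize function, and the sample lines are hypothetical, and a real setup would do this inside a log aggregation system rather than in a script.

```python
import re
from collections import Counter

LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL|FATAL)\b")
ALERT_LEVELS = {"CRITICAL", "FATAL"}  # assumed: these escalate immediately


def summarize(lines):
    """Return (counts per level, lines that must be alerted on)."""
    counts = Counter()
    alerts = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match is None:
            continue  # line carries no recognizable level
        level = match.group(1)
        counts[level] += 1
        if level in ALERT_LEVELS:
            alerts.append(line)
    return counts, alerts


if __name__ == "__main__":
    # Hypothetical log lines.
    sample = [
        "svc=reports INFO report requested",
        "svc=reports ERROR database timeout",
        "svc=reports FATAL out of memory",
    ]
    counts, alerts = summarize(sample)
    print(dict(counts))
    print(alerts)
```

Errors that do not alert still end up in the counts, which is what the periodic human review would look at.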

The health-check endpoint is missing or does no meaningful work

Example:
Creating endpoints that expose service metrics (preferably in the OpenMetrics format) so that they can be scraped (for example, by Prometheus) has become, thank God, practically a standard. However, when the business is pushing for new functionality, developers often do not want to spend extra time thinking through metrics. As a result, it is quite common to find health checks whose whole purpose is to return an “ok” message. If the application is able to output anything at all, that is already considered ok. It is not. If the application cannot reach the database server, such a check will still say “ok” and report incorrect information, making the problem harder to detect.

What to do:
First, a health-check endpoint should become standard for every one of your services, if it is not already. Second, the health check should verify the health and availability of the systems the service critically depends on (access to queues, databases, other services, and so on).
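
A minimal sketch of a health check that actually probes its dependencies, as opposed to one that always answers “ok”. The run_health_checks function and the probe names are hypothetical; each probe is assumed to be a callable that raises an exception when its dependency is unreachable, and wiring the result into an HTTP handler is left to whatever framework the service uses.

```python
from typing import Callable, Dict, Tuple

# A probe raises an exception when its dependency is unhealthy.
Probe = Callable[[], None]


def run_health_checks(probes: Dict[str, Probe]) -> Tuple[int, Dict[str, str]]:
    """Run every probe and return (HTTP status code, per-dependency result)."""
    results: Dict[str, str] = {}
    healthy = True
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            healthy = False
            results[name] = f"failed: {exc}"
    # 503 tells the caller the service cannot do useful work right now.
    return (200 if healthy else 503), results


if __name__ == "__main__":
    def queue_probe() -> None:
        pass  # stand-in for a real connectivity check

    def database_probe() -> None:
        raise ConnectionError("db unreachable")  # simulated outage

    print(run_health_checks({"queue": queue_probe, "database": database_probe}))
```

Returning the per-dependency detail in the body makes it obvious which dependency failed, which is exactly the information an “ok”-only check hides.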

API response time and interaction with other services are not monitored

Example:
When most parts of an application act as both clients and servers to one another, it is critically important to know how long each service’s API takes to respond. If that time grows for some reason, the total response time of the application can turn into a cascade of delays.

What to do:
Use tracing. Jaeger has become almost the standard. There is also an excellent working group behind the OpenTracing format, which plays a role similar to that of OpenMetrics.
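
To show what a trace records, here is a toy sketch that times nested sections of a request. It is not Jaeger’s or OpenTracing’s API, only an illustration of the underlying idea (named spans, nesting, and durations); the SpanRecorder class and the span names are hypothetical.

```python
import time
from contextlib import contextmanager


class SpanRecorder:
    """Collects (name, depth, seconds) for nested timed sections."""

    def __init__(self):
        self.spans = []
        self._depth = 0

    @contextmanager
    def span(self, name):
        depth = self._depth
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self._depth -= 1
            # Inner spans finish first, so they are appended before their parent.
            self.spans.append((name, depth, elapsed))


if __name__ == "__main__":
    recorder = SpanRecorder()
    with recorder.span("handle_request"):
        with recorder.span("query_database"):
            time.sleep(0.02)  # stand-in for a slow downstream call
    for name, depth, elapsed in recorder.spans:
        print(f"{'  ' * depth}{name}: {elapsed * 1000:.1f} ms")
```

A real tracer additionally propagates a trace identifier between services, which is what lets you see where in the chain of calls the delay came from.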

A service is still an application, and an application is still memory and CPU (and sometimes disk)

An example and what to do: here, I think, everything is clear. Quite often, performance monitoring of individual services is missing: how much CPU they consume, how much RAM, and, where measurable, how much “disk”. In other words, all the standard metrics you would use to monitor a “server”. We often see the node as a whole being monitored while individual services are not, which is a mistake.

Emergence of new services should be monitored

Example:
This one is rather amusing. When there are many development teams, even more services, and SRE responsibility is shifting toward the development teams, the cluster operations team should still monitor (and receive notifications about) the appearance of new services. Even if you have a standard for how a new service should be monitored, which metrics it should export, and how its performance should be tracked, when a new service appears you still need to make sure those standards are followed.

What to do:
Set an alert for new services in the infrastructure.
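
A minimal sketch of such an alert, assuming you can periodically list the services running in the cluster and compare the list against an inventory of known ones. The function names, the REQUIRED_FIELDS standard, and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical house standard: every service must declare these.
REQUIRED_FIELDS = ("metrics_path", "health_path")


def find_new_services(known: Iterable[str], current: Dict[str, dict]) -> List[str]:
    """Names present in `current` but absent from `known`, sorted."""
    return sorted(set(current) - set(known))


def missing_standards(spec: dict) -> List[str]:
    """Required fields that the service spec does not declare."""
    return [field for field in REQUIRED_FIELDS if not spec.get(field)]


if __name__ == "__main__":
    known = ["api", "billing"]
    current = {
        "api": {"metrics_path": "/metrics", "health_path": "/health"},
        "billing": {"metrics_path": "/metrics", "health_path": "/health"},
        "reports": {"metrics_path": "/metrics"},  # new, no health check
    }
    for name in find_new_services(known, current):
        print(f"new service: {name}, missing: {missing_standards(current[name])}")
```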

Monitoring delivery time and other CI / CD metrics

Example:
Another problem that has appeared relatively recently.
Performance is not only the performance of the application, but also the speed of its deployment. Complex CI / CD processes + a more complicated build process + building the delivery container = and suddenly a simple deployment is not so simple anymore 😉
At some point, you may find that deploying the service has started to take twenty minutes instead of one …

What to do:
Monitor the delivery time of the application, from build to its appearance in production. If the time has increased, investigate what happened.
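
One simple way to notice such an increase is to compare the latest delivery time with the median of previous runs. The sketch below assumes the durations are already being recorded by your CI / CD system; the delivery_regression function, the factor of 2, and the sample numbers are hypothetical.

```python
from statistics import median
from typing import Sequence


def delivery_regression(history_seconds: Sequence[float],
                        latest_seconds: float,
                        factor: float = 2.0) -> bool:
    """True when the latest run took more than `factor` times the median of past runs."""
    if not history_seconds:
        return False  # nothing to compare against yet
    return latest_seconds > factor * median(history_seconds)


if __name__ == "__main__":
    history = [60, 62, 58, 65]  # hypothetical past deliveries, in seconds
    print(delivery_regression(history, 70))    # within the usual range
    print(delivery_regression(history, 1200))  # twenty minutes instead of one
```

The median is used rather than the mean so that a single unusually slow run in the history does not hide the next one.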

APM and application profiling

Example and what to do:
The moment you find out that a service has a problem (it has started to respond slowly, has become unavailable, or something else has gone wrong), the last thing you will want is to dig into the internals of the service or try restarting it to localize the error. In our practice, it is very often possible to track down the problem when you have detailed monitoring at the Application Performance Monitoring level: most often the problem does not appear all at once but “accumulates” over time, and with APM data you can see when it began. Also, learn how to use system-level profilers. With the development of eBPF, a huge number of possibilities have opened up here.

WAF Event Monitoring, Shodan Monitoring, Image and Package Security Monitoring

Example:
Monitoring now concerns not only performance issues. There are a number of simple steps you can take to improve the security of your service:

Monitor the output of commands such as npm audit (or similar) when building your application: you can receive alerts about vulnerabilities in the current versions of the libraries you use, and upgrade them to patched versions.
Connect the Shodan service API, which scans for open databases and ports on the Internet. Check your IP addresses through the API to make sure that you have no unexpectedly open ports and that your databases have not leaked into open access.
If you use a WAF, set alerts on WAF events to learn when an attacker is deliberately targeting you and to see which attack vectors they use.
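
As an illustration of the first step, the sketch below decides whether a build should raise an alert based on a dependency audit report. It assumes the JSON produced by npm audit --json contains severity counts under metadata.vulnerabilities, as recent npm versions do; check this against the npm version you actually run. The audit_needs_alert function and the sample report are hypothetical.

```python
import json

# Assumed policy: only these severities should raise an alert.
ALERT_SEVERITIES = ("high", "critical")


def audit_needs_alert(report_json: str) -> bool:
    """True when the audit report lists any high or critical vulnerability."""
    report = json.loads(report_json)
    # Assumed layout: {"metadata": {"vulnerabilities": {"high": 1, ...}}}
    counts = report.get("metadata", {}).get("vulnerabilities", {})
    return any(counts.get(severity, 0) > 0 for severity in ALERT_SEVERITIES)


if __name__ == "__main__":
    # Hypothetical report fragment, not real audit output.
    sample = json.dumps(
        {"metadata": {"vulnerabilities": {"low": 3, "moderate": 1, "high": 1, "critical": 0}}}
    )
    print(audit_needs_alert(sample))
```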

Bonus track: SRE, application response time – no longer server response time!

We are used to measuring system performance by server response time, but in a modern web application 80% of the logic has moved to the front end. If you are not yet measuring the response time of the application as the time it takes to display a page, using front-end page load metrics, start now. It makes no difference to the user whether the server responded in 200 or 400 milliseconds if the Angular or React front end then takes 10 seconds to render the page. In general, I believe that performance optimization will largely move to the front end or emerge as a new discipline.