This article is a brief distillation of my own experience and that of my colleagues, with whom I deal with incidents day and night. Many of those incidents would never have happened if our beloved microservices had been written at least a little more carefully.

Unfortunately, some programmers seriously believe that a Dockerfile with any command at all inside it is a microservice in itself and can be deployed right now. This approach is fraught with problems, starting with degraded performance, impossible debugging and denial of service, and ending with a nightmare called Data Inconsistency.

If you feel that the time has come to launch another app in Kubernetes / ECS / whatever, then I have something to tell you.

I have put together a set of criteria for assessing whether an application is ready to launch in production. Some items on this checklist apply only to particular kinds of applications; others apply to everything. I’m sure you can add your own in the comments or dispute some of these points.

If your microservice fails even one of these criteria, I will not allow it into my ideal cluster.

Note: the order of the items does not matter.

Readme Short Description

It contains a short description of itself at the very beginning of Readme.md in its repository.

God, it seems so simple. Yet so often I have come across repositories that do not contain the slightest explanation of why the service is needed, what tasks it solves, and so on. Never mind anything more complicated, such as configuration options.

Integration with a monitoring system

Sends metrics to DataDog, NewRelic, Prometheus, and so on.

Resource consumption, memory leaks, stack traces, service interdependencies, error rates: without understanding all of this (and more), it is extremely difficult to control what happens in a large distributed application.

Alerts configured

The service includes alerts that cover all standard situations plus known service-specific ones.

Metrics are good, but nobody will watch them around the clock. Therefore, we automatically receive calls / push notifications / SMS if:

  • CPU / memory consumption has increased rapidly.
  • Traffic increased / fell sharply.
  • The number of processed transactions per second has changed significantly in either direction.
  • The size of the build artifact (exe, app, jar, …) has changed.
  • The percentage of errors or their frequency exceeded the permissible threshold.
  • The service has stopped sending metrics (a common situation).
  • The regularity of certain expected events is violated (a cron job doesn’t run, not all events are processed, etc.).

Runbooks created

A document has been created for the service that describes known or expected contingencies:

  • how to make sure that the error is internal and does not depend on a third party;
  • if it does, where, to whom and what to write;
  • how to safely restart it;
  • how to restore from backup and where the backups are;
  • what special dashboards / queries have been created to monitor this service;
  • whether the service has its own admin panel and how to get there;
  • whether there is an API / CLI and how to use it to fix known issues;
  • and so on.

The list can vary greatly between organizations, but at least basic things should be there.

All logs are written in STDOUT / STDERR

The service does not create any log files in production mode, does not send them to any external services, does not contain any redundant abstractions for log rotation, etc.

When an application creates log files, these logs are useless. You are not going to log into five containers running in parallel, hoping to catch the error you need (and yet here you are, crying…). Restarting a container loses these logs completely.

If an application ships its own logs to a third-party system, for example Logstash, this creates useless redundancy. What happens when the neighboring service cannot do the same because it is built on a different framework?

Does the application write some logs to files and some to stdout because the developer finds it convenient to see INFO in the console and DEBUG in files? This is the worst option of all. Nobody needs the complexity of completely redundant code and configuration that has to be learned and maintained.

Logs are JSON

Each log line is written in JSON format and contains a consistent set of fields.

To this day, almost everyone writes logs in plain text. This is a real disaster. I would be happy never to have known about Grok patterns. I dream of them sometimes and I freeze, trying not to move, so as not to attract their attention. Just try parsing Java exceptions in the logs once.

JSON is good, it is fire given from heaven. Just add:

  • a timestamp with millisecond precision in RFC 3339 format;
  • level: info, warning, error, debug;
  • user_id;
  • app_name;

and other fields.

Load the logs into any suitable system (a correctly configured ElasticSearch, for example) and enjoy. Join the logs of many microservices together and feel once again how good monolithic applications were.

(And you can add Request-Id and get tracing …)
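As an illustration of the two points above (logs go to stdout, one JSON object per line), here is a minimal sketch using only the Python standard library. The names `JsonFormatter` and `build_logger`, the `message` and `request_id` fields, and the choice of UTC are assumptions made for this example, not a prescribed schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, app_name: str) -> None:
        super().__init__()
        self.app_name = app_name

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # RFC 3339 timestamp with millisecond precision, in UTC.
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "app_name": self.app_name,
            "message": record.getMessage(),
            # Optional context passed via `extra=`; None when absent.
            "user_id": getattr(record, "user_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            # The whole stack trace stays inside one field, so a
            # multi-line exception is still a single log line.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(app_name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or `stream`)."""
    logger = logging.getLogger(app_name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(app_name))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate plain-text output
    return logger
```

A call such as `build_logger("billing-api").warning("card declined", extra={"user_id": 42})` then emits a single line that a log collector can index without any Grok patterns.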

Logs with verbosity levels

The application must support an environment variable, for example LOG_LEVEL, with at least two operating modes: ERRORS and DEBUG.

It is desirable that all services in the same ecosystem support the same environment variable: not a config option, not a command-line option (although this is reversible, of course), but straight from the environment by default. You should be able to get as many logs as possible when something goes wrong and as few as possible when all is well.
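A minimal sketch of reading the level from the environment, assuming Python's standard `logging` module. The function name `level_from_env`, the accepted spellings and the fallback to ERROR are choices made for this example.

```python
import logging
import os

# Accepted values of LOG_LEVEL; "ERRORS" is kept as an alias because
# that is how the mode is named in the text above.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "ERRORS": logging.ERROR,
}


def level_from_env(variable: str = "LOG_LEVEL") -> int:
    """Return the logging level named by the environment variable.

    Unset or unrecognized values fall back to ERROR, so a typo never
    turns a quiet production service into a noisy one.
    """
    name = os.environ.get(variable, "ERROR").strip().upper()
    return LEVELS.get(name, logging.ERROR)
```

At startup the service would call something like `logging.getLogger().setLevel(level_from_env())`, and switching to verbose output becomes a matter of redeploying with `LOG_LEVEL=DEBUG`.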

Fixed dependency versions

Dependencies are pinned in the package manager down to the exact version, including minor and patch numbers (for example, cool_framework = 2.5.3).

This has been discussed a lot already, of course. Some pin only the major version, hoping that minor releases will contain only small bug fixes and security fixes. That is a mistake.

Every change in every dependency should be reflected in a separate commit, so that it can be reverted if problems arise. Hard to keep track of by hand? There are useful bots that follow the updates and create a pull request for each of them.

Dockerized

The repository contains a production-ready Dockerfile and docker-compose.yml.

Docker has long been the standard for many companies. There are exceptions, but even if you don’t have Docker in production, any engineer should be able to just run docker-compose up and not think about anything else to get a dev build for local verification. And the system administrator should receive a build already verified by the developers, with the required versions of libraries, utilities and so on, in which the application at least somehow works, as a starting point for adapting it to production.

Environment configuration

All important configuration options are read from the environment, and the environment takes precedence over configuration files (but not over command-line arguments given at startup).

No one will ever want to read your configuration files and study their format. Just accept it.
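The precedence described above (command line, then environment, then configuration file) can be sketched as a single lookup function. The `APP_` prefix, the function name `resolve_option` and the dictionary-based inputs are assumptions for illustration; real projects usually get this from their configuration library.

```python
import os
from typing import Any, Mapping, Optional


def resolve_option(
    name: str,
    cli_args: Mapping[str, Any],
    file_config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
    default: Any = None,
) -> Any:
    """Look an option up as: command line > environment > file > default."""
    env = os.environ if environ is None else environ

    # 1. An explicit command-line argument always wins.
    if cli_args.get(name) is not None:
        return cli_args[name]

    # 2. Then the environment, e.g. option "db_host" -> APP_DB_HOST.
    env_key = "APP_" + name.upper()
    if env_key in env:
        return env[env_key]

    # 3. Only then the configuration file.
    if name in file_config:
        return file_config[name]

    return default
```

With this order an operator can override any file setting per environment without opening the file, and a developer can still override both for a one-off run.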

Readiness and liveness probes

Contains appropriate endpoints or CLI commands to check readiness to serve requests at startup and liveness throughout its lifetime.

If the application serves HTTP requests, it should have two endpoints by default:

A liveness probe verifies that the application is alive. If the application does not respond, an orchestrator such as Kubernetes may stop it automatically, “but this is not accurate.” In fact, killing a frozen application can cause a domino effect and take your service down for good. But that is not the developer’s problem; just provide the endpoint.

A readiness probe verifies that the application has not merely started but is ready to accept requests. Once the application has established its connections to the database, queuing system and so on, it should respond with a status code of at least 200 and below 400 (for Kubernetes).
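A minimal sketch of the two endpoints using only Python's standard library. The paths `/healthz` and `/readyz`, the `ready` flag and the helper `start_probe_server` are assumptions for this example; a real service would normally expose them through its own web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set by the application once the database, queues and other
# dependencies are connected.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            # Liveness: the process is up and able to answer at all.
            self._reply(200, b"alive")
        elif self.path == "/readyz":
            # Readiness: report 503 until dependencies are connected.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the application logs


def start_probe_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Serve the probes in a background thread and return the server."""
    server = ThreadingHTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application calls `ready.set()` after its connections are established. The liveness handler deliberately checks nothing beyond "the process can answer", so a slow database does not get the container killed.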

Resource limits

Contains limits on the consumption of memory, CPU, disk space and any other available resources in a consistent format.

The specific implementation of this item will be very different in different organizations and for different orchestrators. However, these limits must be set in a single format for all services, be different for different environments (prod, dev, test, …) and be outside the repository with application code.

Assembly and delivery are automated

The CI / CD system used in your organization or project is configured and can deliver the application to the desired environment according to the accepted workflow.

Nothing is ever delivered to production manually.

No matter how difficult it is to automate the assembly and delivery of your project, this must be done before this project gets into production. This item includes building and running Ansible / Chef cookbooks / Salt / …, building applications for mobile devices, building a pool of the operating system, building images of virtual machines, whatever.

Can’t automate it? Then you can’t release it into the world. Nobody will be able to build it after you.

Graceful shutdown

The application handles SIGTERM and other signals and stops in an orderly way once it has finished processing the current task.

This is an extremely important point. Docker processes become orphaned and work for months in the background where no one sees them. Nontransactional operations break in the middle of execution, creating data inconsistency between services and databases. This leads to errors that cannot be foreseen and can be very, very expensive.

If you do not control any dependencies and cannot guarantee that your code will correctly process SIGTERM, use something like dumb-init.
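A minimal sketch of this behavior in Python: the signal handler only sets a flag, and the worker loop checks it between tasks, so the task in progress always completes. The callbacks `fetch_task` and `handle_task` are hypothetical placeholders for the real queue or job source.

```python
import signal
import threading

stop_requested = threading.Event()


def request_stop(signum, frame) -> None:
    """Signal handler: only set a flag; the main loop does the rest."""
    stop_requested.set()


def install_signal_handlers() -> None:
    """Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run_worker(fetch_task, handle_task) -> int:
    """Process tasks until a stop is requested; return how many were done."""
    done = 0
    while not stop_requested.is_set():
        task = fetch_task()
        if task is None:
            # Idle: wait briefly, but wake up at once on a stop request.
            stop_requested.wait(timeout=0.1)
            continue
        handle_task(task)  # the current task always runs to completion
        done += 1
    return done
```

After `run_worker` returns, the process can close its connections and exit with code 0 well within the orchestrator's termination grace period, provided individual tasks are short enough.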

Database connection checked regularly

The application regularly pings the database and reacts automatically to a “connection lost” exception on any query, either trying to restore the connection on its own or shutting down cleanly.

I have seen many cases (and this is not a figure of speech) where services built to process queues or events lost their connection on a timeout and began endlessly pouring errors into the logs, returning messages to the queues, sending them to the Dead Letter Queue, or simply not doing their job.
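The "restore the connection or terminate" policy can be sketched as a small wrapper around any database call. `ConnectionLost` is a hypothetical stand-in for whatever exception the real driver raises, and `operation` and `reconnect` are callbacks supplied by the application.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's "connection lost" error."""


def run_with_reconnect(
    operation: Callable[[], T],
    reconnect: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Run `operation`, reconnecting and retrying on a lost connection.

    After `attempts` failures the last error is re-raised, so the
    process can exit and be restarted instead of spinning forever.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[ConnectionLost] = None
    for attempt in range(attempts):
        try:
            if attempt > 0:
                time.sleep(delay_seconds)
                reconnect()  # a failed reconnect counts as an attempt
            return operation()
        except ConnectionLost as error:
            last_error = error
    assert last_error is not None
    raise last_error
```

The same wrapper works for a periodic ping (for example, a trivial query on a timer): a ping that exhausts its attempts is the signal to shut down cleanly rather than keep returning messages to the queue.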

Scaled horizontally

When the load increases, it is enough to run more application instances to ensure that all requests or tasks are processed.

Not all applications can scale horizontally. A striking example is Kafka consumers (within one consumer group, useful parallelism is capped by the number of partitions). This is not necessarily bad, but if a particular application cannot be launched twice, everyone concerned needs to know about it in advance. This information should be impossible to miss: put it in the Readme and wherever else you can. Some applications cannot be run in parallel under any circumstances, which makes them seriously difficult to support.

It is much better if the application handles these situations itself, or if it has a wrapper that watches for “competitors” and simply does not let the process start, or start working, until another process has finished or until some external configuration allows N processes to work simultaneously.
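As a very small illustration of such a guard, the sketch below refuses to start when another instance already holds a lock file. It relies on `fcntl.flock`, so it is Unix-only and protects a single host; in a cluster the same idea would need a shared lock (a database row, a lease in the orchestrator, and so on). The function name and the lock path are assumptions.

```python
import fcntl
from typing import IO, Optional


def acquire_single_instance_lock(path: str) -> Optional[IO[str]]:
    """Try to become the only running instance on this host.

    Returns the open lock file on success (keep it open for the whole
    lifetime of the process) or None if another instance holds it.
    """
    handle = open(path, "a+")
    try:
        # Exclusive, non-blocking: fail at once instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

A service would call this first thing at startup and exit with a clear log message when it gets `None`. The lock disappears automatically when the process dies, so there is no stale-lock cleanup to write.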

Dead letter queues and bad message resilience

If the service listens to queues or responds to events, a change in the format or content of the messages does not crash it. Failed attempts to process a task are retried N times, after which the message is sent to the Dead Letter Queue.

Many times I have seen endlessly restarting consumers and queues that swelled to such a size that working through them afterwards took many days. Any queue listener should be prepared for format changes, for random errors in the message itself (data types in JSON, for example), and for failures in the downstream code that processes it. I even came across a case where the standard RabbitMQ library for one extremely popular framework did not support retries, attempt counters, etc.

It is even worse when a message is simply destroyed on failure.
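The retry-then-dead-letter logic can be sketched with in-memory queues standing in for a real broker. The message shape (a dict with `body` and `attempts`) and the function name `consume` are assumptions for illustration; with RabbitMQ, SQS or similar, the attempt counter usually lives in message headers or broker settings.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def consume(
    queue: Deque[Dict],
    dead_letters: List[Dict],
    handler: Callable[[str], object],
    max_attempts: int = 3,
) -> int:
    """Drain the queue; return the number of messages processed.

    A message that fails is put back for another try; after
    `max_attempts` failures it goes to the dead letter queue instead
    of crashing the consumer or being silently dropped.
    """
    processed = 0
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
            processed += 1
        except Exception:
            # Any failure counts: bad format, bad types, handler bugs.
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= max_attempts:
                dead_letters.append(message)
            else:
                queue.append(message)
    return processed
```

Catching a broad `Exception` is deliberate here: a listener cannot know in advance which way a malformed message will fail, and the dead letter queue preserves it for inspection.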

Limit on the number of messages and tasks processed per process

It supports an environment variable that caps the maximum number of processed tasks, after which the service shuts down cleanly.

Everything flows, everything changes, especially memory. A continuously growing memory graph ending in OOMKilled is the norm of life for modern Kubernetes minds. Implementing a primitive check that spares you the very need to investigate all these memory leaks would make life easier. I have often seen people spend a lot of time and effort (and money) to stop this churn, but there is no guarantee that your colleague’s next commit will not make it worse. If the application can survive a week, that is a great result. Then let it simply terminate itself and be restarted. This is better than SIGKILL (about SIGTERM see above) or an “out of memory” exception. This approach will last you a couple of decades.
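A minimal sketch of such a limit. The variable name `MAX_TASKS` and both function names are assumptions; unset, empty or zero is treated as "no limit".

```python
import os
from typing import Callable, Iterable, Optional


def max_tasks_from_env(variable: str = "MAX_TASKS") -> Optional[int]:
    """Read the task limit; unset, empty or 0 means unlimited."""
    raw = os.environ.get(variable, "").strip()
    limit = int(raw) if raw else 0
    return limit if limit > 0 else None


def run_until_limit(
    tasks: Iterable,
    handle: Callable[[object], None],
    limit: Optional[int],
) -> int:
    """Handle tasks until the source is empty or the limit is reached."""
    done = 0
    for task in tasks:
        handle(task)
        done += 1
        # Check after handling, so no task is taken and then abandoned.
        if limit is not None and done >= limit:
            break
    return done
```

When `run_until_limit` returns, the process exits normally and the orchestrator starts a fresh one with a clean heap, which is the whole point of the exercise.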

Doesn’t use third-party integration with filtering by IP addresses

If the application makes requests to a third-party service that allows access from limited IP addresses, the service performs these calls indirectly through a reverse proxy.

This is a rare case, but an extremely unpleasant one. It is very inconvenient when one tiny service blocks the possibility of changing the cluster or moving the entire infrastructure to another region. If you need to communicate with someone who doesn’t know how to use OAuth or a VPN, configure a reverse proxy in advance. Do not build dynamic addition / removal of such external integrations into your program, because by doing so you nail yourself to the only runtime available. It is better to automate the management of Nginx configs for these processes right away, and have your application call the proxy.

Obvious HTTP User-Agent

The service replaces the User-Agent header with a custom one in all requests to any API, and this header contains enough information about the service itself and its version.

When you have 100 different applications talking to each other, you can go crazy seeing something like “Go-http-client/1.1” and the dynamic IP address of a Kubernetes container in the logs. Always identify your application and its version explicitly.
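A minimal sketch with Python's standard `urllib`, assuming a `name/version` format for the header value. The helper name `build_request` and the example service name and version are made up for illustration.

```python
import urllib.request


def build_request(url: str, app_name: str, app_version: str) -> urllib.request.Request:
    """Create a request that identifies the calling service explicitly."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": f"{app_name}/{app_version}"},
    )
```

For example, `build_request("https://api.example.com/orders", "billing-api", "1.4.2")` would show up in the other side's access logs as `billing-api/1.4.2` instead of the HTTP library's default string. In practice the header is usually set once on a shared client or session rather than per request.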

Doesn’t violate the license

It doesn’t contain dependencies that excessively limit the application, it is not a copy of someone else’s code, and so on.

This should go without saying, but I have seen cases that still give the lawyer who wrote the NDA hiccups.

Doesn’t use unsupported dependencies

When the service is first launched, it does not include dependencies that are already outdated.

If the library you have pulled into the project is no longer maintained by anyone, look for another way to achieve the goal or take over development of the library yourself.

Conclusion

There are also some very specific checks on my list for particular technologies or situations, and there are surely others I simply forgot to add. I am sure you will find something to add as well.