5/5 - (1 vote)

A lot of time has already passed since the publication of our last essay on interesting cases from the practice of servicing Kubernetes clusters and the services / applications launched. We would like to present another series of adventures of our engineers in the turbulent waters of a large (and not really) web development.

Story # 1. Kafka and variables from Docker in K8s

Container systems are slowly becoming the industry standard, and many tools are already quite firmly established in this environment. However, you can still find funny (and sometimes not so) bugs related to integration into the Docker / Kubernetes infrastructure.

Once we needed to deploy the standard ZooKeeper + Kafka bundle for Sentry, and, perhaps, other components in the future. It looks not difficult: we take as a basis the hard-to-read Kafka Helm chart from the well-known project on GitHub, cut off all unnecessary, add our own to the needs of the project.

Trying to run it – we get the first error:

===> Configuring ...
port is deprecated. Please use KAFKA_ADVERTISED_LISTENERS instead.

Strange message – ok, let’s add:

    value: {{ printf "%s:%s" .Chart.Name  .Values.port._default }}

The situation repeats itself. Perhaps it makes sense to see who writes this and why? And then the main question arises: what image do we use and what is there? The chart from the largest repository is based on the rather popular confluentinc / cp-kafka image, which has a highly functional entrypoint (like many other Confluent Community images). If you study what is launched and in what sequence, you can find a script in which the source of problems lies:

if [[ -n "${KAFKA_PORT-}" ]]
  echo "port is deprecated. Please use KAFKA_ADVERTISED_LISTENERS instead."
  exit 1

Everything seems to be clear, but after all, we did not declare this variable (and indeed nothing similar)! ..

However, if you look at the actually declared environment variables inside the container, you will find that such a variable does exist … as well as some others of a similar format:

root @ kafka-0: / # env | grep KAFKA_PORT
KAFKA_PORT = tcp: // 9092
KAFKA_PORT_9092_TCP = tcp: // 9092

The thing is that the kubelet at the start of the pod adds a set of variables formed on the basis of the created services – this is one of the ways to find the address of the required service, which is briefly described in the K8s documentation.

One of the most obvious ways to close the question is to simply change the name of the service in Kubernetes to something more explicit (and most importantly, different from the laconic kafka).

Let’s try and run:

$ kubectl -n sentry logs sentry-kafka-0
===> ENV Variables ...
SENTRY_KAFKA_PORT = tcp: // 9092
===> Configuring ...
===> Running preflight checks ...
===> Check if / var / lib / kafka / data is writable ...

Congratulations! You are wonderful!

Story № 2. How the modest 100 bytes caused big problems

We love ClickHouse and use this database in many projects as the main base for collecting statistics. We also store the logs of K8s clusters there, which is very convenient and efficient – for this, we even created a specialized loghouse tool. So we will not be surprised that, in practice, accidents related to ClickHouse can periodically occur, and of varying severity … but this case is special.

We have been servicing for many years a ClickHouse cluster of quite impressive size: 5 shards, 10 iron nodes with 45 TB of data each. There are many replicated tables in the cluster, and the ZooKeeper cluster of 3 nodes on virtual machines serves replication.

One night, one of the hypervisors on which ZooKeeper’s virtual machines are running, stopped. A decision was made to restart the hypervisor, after which the ZK cluster lost its integrity, stopped collecting and serving ClickHouse. The first thought – “something was broken”, so there was a need to restore metadata from a snapshot.

Done! However, ZooKeeper starts to experience problems when trying to get it together, choose a leader and split a snapshot:

2020-07-03 18: 19: 50.681 [myid: 2] - WARN [QuorumPeer [myid = 2] (plain = [0: 0: 0: 0: 0: 0: 0: 0]: 2181) (secure = disabled): Follower @ 129] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out

If we try to run without a snapshot and create tables in ClickHouse, then the situation is not better:

CREATE TABLE database.storage1 (`date` Date, `created` DateTime, `value` Int32) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/storage1', '{replica}', date, value, 8192);
Received exception from server (version 19.14.3):
Code: 999. DB::Exception: Received from Coordination::Exception. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. Addresses:,,
Code: 209, e.displayText() = DB::NetException: Timeout exceeded while reading from socket ( while receiving handshake from ZooKeeper (version (official build)),
(Connection loss).

It’s curious: we haven’t changed any settings … neither on the servers, nor in the configuration of CH and ZK themselves. But indirectly, these problems indicate network problems, moreover, they are clearly tied to ZK.

I will not go into unnecessary detailing of all our actions taken to find the source of the problems – suffice it to say that we were pretty nervous. The result of these studies was … the most common MTU!

2: ens3: <BROADCAST, MULTICAST, UP, LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link / ether 52: 54: 00: e5: 6d: fc brd ff: ff: ff: ff: ff: ff
inet brd scope global ens3
valid_lft forever preferred_lft forever

The thing is that the network on virtual machines for ZK was created quite a long time ago – in 2018, immediately after the release of Ubuntu 18.04, and then netplan, which had not yet become really familiar, was involved. And it sets the MTU to 1500 bytes by default. At the same time, as many know, the default MTU value in Hetzner vSwitch is 1400 bytes.

Bottom line: the use of the cluster until now did not require packets of this size, so the problem never popped up, and now the difference of 100 bytes has turned out to be fatal. When the MTU decreased, the cluster started up and returned to work without any problems (and we continued to eliminate the consequences).


When restoring metadata in ZK, we tried (among other things) to manually recreate tables and found an interesting feature.

This table creation format (when the partitioning key and the primary key are specified directly in the ENGINE section) is deprecated since version 20 of ClickHouse:

CREATE TABLE database.storage1foo (`date` Date,` created` DateTime, `value` Int32)
ENGINE = ReplicatedMergeTree ('/ clickhouse / tables / {shard} / storage1foo', '{replica}', date, value, 8192);

In current versions of ClickHouse, the correct option is:

CREATE TABLE database.storage1bar
`date` Date,
`created` DateTime,
`value` Int32
) ENGINE = ReplicatedMergeTree ('/ clickhouse / tables / {shard} / storage1bar', '{replica}')
ORDER BY value
SETTINGS index_granularity = 8192

Take a closer look: in fact, both options do exactly the same job and create the same tables with the same parameters. But let’s see what metadata these commands generate:

[zk: localhost: 2181 (CONNECTED) 5] get / clickhouse / tables / 1 / storage1foo / metadata
metadata format version: 1
date column: date
sampling expression:
index granularity: 8192
mode: 0
sign column:
primary key: value
granularity bytes: 10485760

[zk: localhost: 2181 (CONNECTED) 6] get / clickhouse / tables / 1 / storage1bar / metadata
metadata format version: 1
date column:
sampling expression:
index granularity: 8192
mode: 0
sign column:
primary key: value
data format version: 1
partition key: date
granularity bytes: 10485760

In the output, there is a difference in the date column and partition key fields, although the input data is, at first glance, the same, which makes the metadata and tables in CH, created with different query formatting, incompatible.

Story №3. How the K8s cluster was overloaded … but in fact – overheated

One day a client received a request about a malfunction of one of the services. When examining the pods of this service, it turned out that the pods are periodically restarted and do not really work.

The service was launched on several K8s nodes, whose CPU load was simply colossal. The increase in the load occurred at different nodes not simultaneously, but very close to each other in time. The situation was so bad that at some point the kubelet simply stopped responding.

At the same time, on problem nodes, the CPU consumption by cluster resources was not very high: the lion’s share fell on kidle_inj, i.e. the kernel was constantly running the idle process.

$ dmesg -T | grep intel_powercl
[Sat Jun 13 06:29:00 2020] intel_powerclamp: Start idle injection to reduce power
[Sat Jun 13 06:29:02 2020] intel_powerclamp: Stop forced idle injection
[Sat Jun 13 06:29:05 2020] intel_powerclamp: Start idle injection to reduce power
[Sat Jun 13 06:29:07 2020] intel_powerclamp: Stop forced idle injection
[Sat Jun 13 06:29:08 2020] intel_powerclamp: Start idle injection to reduce power

This cluster is built on bare metal, i.e. iron knots. A legitimate question arose: what about their temperature?

$ sensors
Adapter: ISA adapter
Package id 0: + 92.0 ° C (high = + 80.0 ° C, crit = + 100.0 ° C)
Core 0: + 90.0 ° C (high = + 80.0 ° C, crit = + 100.0 ° C)
Core 1: + 91.0 ° C (high = + 80.0 ° C, crit = + 100.0 ° C)
Core 2: + 92.0 ° C (high = + 80.0 ° C, crit = + 100.0 ° C)
Core 3: + 91.0 ° C (high = + 80.0 ° C, crit = + 100.0 ° C)

90 degrees!

It’s time to write to the technical support of the data center with a request to check the machines, and at the same time look for any opportunities where to switch the load.

The technical support response was a message about the need to turn off the machines for 30 minutes to carry out work. To agree to this, we have drained the first problem node. And as a result of the operations performed, the DC received the following response:

Dear Client,
unfortunately the CPU fan was faulty, we’ve replaced this and started the server again.

Q.E.D! The next steps are clear: one-by-one withdrawal of nodes from the cluster and coordination of our actions with employees in the DC who change coolers. The situation with the load literally returns to normal in front of us.

Story №4. The pg_repack ode for PostgreSQL

We usually do not provide DBA services. Yes, we can look at your database, create a couple of indexes, optimize queries. Often such things happen simply by coincidence: they noticed some oddity, went to look carefully and found something (you can always find something if you look closely).

One day our attendant noticed an increase in long-term transactions on a client basis. More precisely, it was recorded by the monitoring system: the planned task of clearing irrelevant data from the database began to take more and more time, which is why it fell under the trigger of long transactions.

Although a simple step was taken to address the initial problem reported by monitoring, it became clear that a more careful analysis of the situation was required.

To do this, using the pgstattuple module, we estimated the fragmentation of tables with queries:

SELECT schemaname,
dead_tuple_count (stats) AS dead_tuples,
pg_size_pretty (dead_tuple_len (stats)) AS dead_space,
pg_size_pretty (free_space (stats)) AS free_space,
(SELECT pgt.schemaname AS schemaname,
pgt.tablename AS tablename,
pgstattuple (pgt.schemaname || '.' || pgt.tablename)
AS stats,
uts.last_autovacuum AS last_autovacuum,
uts.last_autoanalyze AS last_autoanalyze
FROM pg_tables AS pgt
LEFT JOIN pg_stat_user_tables
AS uts
ON pgt.schemaname = uts.schemaname
AND pgt.tablename = uts.relname
WHERE pgt.schemaname NOT IN ('repack', 'pg_catalog')) AS r
ORDER BY dead_tuples DESC;

… and fragmented indexes:

SELECT schemaname,
dead_tuple_count (stats) AS dead_tuples,
pg_size_pretty (dead_tuple_len (stats)) AS dead_space,
pg_size_pretty (free_space (stats)) AS free_space
(SELECT pgt.schemaname AS schemaname,
pgt.indexname AS indexname,
pgt.tablename AS tablename,
pgstattuple (pgt.schemaname || '.' || pgt.indexname) AS stats
FROM pg_indexes AS pgt
WHERE pgt.schemaname NOT IN ('repack',
AND pgt.indexname NOT IN ('some',
'triggers')) AS r
ORDER BY dead_tuples DESC;

In the course of this research, a huge number of fragmented tables and indexes were identified. The solution is to run pg_repack and clean up a number of duplicate indexes in parallel.

The repack operation on a 620 GB database took a whole night and morning. The result was a gain of about 150 GB of disk space and the accompanying “brushing” of the database, combined with a slight increase in performance. The closest plan is to migrate from PostgreSQL v11 to v12, as it uses techniques to reduce data fragmentation, which means that the problem will become less of a problem for us.

The conclusion is trivial: watch the bases, optimize wisely and don’t turn them into noodles filled with useless indexes.