In November 2018, LitRes created an information support department and invited Andrei Yumashev to lead it. Over the past year, the department has helped the company work and grow, and it keeps the entire infrastructure under control. But it was not always so. Before he could put things in order, Andrei ran into ruins: a half-dead Nagios, a conditionally alive Cacti and a comatose Puppet, a dead 120-page Wiki, incoherent task and hardware lists, an outdated architecture, and 340 idle cores, 2 TB of RAM and 17 TB of disk space that for some reason were not recorded in the inventory tables. Plans that do not work, deadlines that slip, a working environment and tools that do not exist: all this awaited Andrei in the new project.

At DevOpsConf 2019, Andrei gave a talk in which he used real examples to show what is worth doing, and what is not, when you join a project you have never seen or know poorly. Below is an updated version of that story: how to properly analyze the range of problems and build an action plan, how to calculate KPIs correctly, and when to stop.

Andrei Yumashev owns development companies in various fields (online and offline), consults on building processes, and heads the information support department at LitRes.

A little about LitRes. It is the largest supplier of e-books and audiobooks in Russia, a publishing house and a bunch of partnership projects. It is hundreds of thousands of lines of Perl, several database clusters, and repositories. It is 2 GB of outgoing traffic per second, hundreds of thousands of unique requests per day, several racks in different data centers and more than 100 servers. All in all, it is not just an e-bookstore.

First steps

I used to work for LitRes as a freelancer. The company practices outstaffed development, with remote workers formally registered as staff.

Task assignment at LitRes works on the principle of an “auction”. In the internal task tracker, managers and architects describe tasks for internal projects and price them in an internal currency. The currency is “mushrooms, grass and trees”.

Then a “light auction” begins: any developer can take the task or bargain over it. Do it well and you get paid. Fail to deliver and you get nothing. This playful format keeps people interested in completing tasks.

 Work for the mushrooms

The system suited me: I kept my Perl programming skills in shape, worked when it was convenient and did not spend too much time on it. I worked in this mode for a couple of years, until November 2018, and thought that I understood how the ecosystem was structured.

I was wrong

I was invited to the company and told that my services as a Perl developer were no longer needed. Instead, I was offered a new department to lead. In November 2018, I became the head of the information support department.

Wide open spaces lay before me: several racks of hardware in several data centers in Moscow, an outdated architecture, third-party resources and an almost complete absence of up-to-date documentation for all of it. The brief sounded something like this: “Now this is yours. Improve it, do not break it, and support it.” There was a ready-made list of tasks and rough plans for the coming year. I had to make sense of all of this and bring it into a normal condition.

Past experience helped me a lot: over the years I had developed a clear approach to working with anything unfamiliar or incomprehensible. First comes a thorough study, then a minimal action plan. This is where I started.

In the first month I managed to find:

  • assorted Google Sheets with current tasks and a pinch of useful information;
  • assorted documents: Word files, text files, scraps of an old Wiki – 120 pages;
  • a half-dead old Nagios;
  • conditionally alive monitoring in Cacti;
  • a very old Puppet showing rare signs of life.

All these ruins were still collecting 400 metrics.

I was a little amused, skimmed through everything and set up our processes in Trello. I moved my colleagues' current tasks there and began to dream: I wrote a plan for the quarter and for the year.

First mistake

Make no plans until you have explored the terrain.

The plan seemed great, but it did not take reality into account. It was beautiful and simple: implement monitoring, analyze logs and move deployment to CI/CD. Somewhere near the end there was a listless “analysis of the project’s weaknesses”.

Classic first steps

I forgot the main thing. My first priority is not to introduce tools for the sake of introducing tools, but to ensure the viability and stability of the service as a whole.

While I was writing the plan and pestering my colleagues with questions, the first problems arrived. One of the cluster nodes ran out of disk space, and on another node of the same cluster the SSD itself wore out. I urgently bought new, larger disks, and our department quickly gained experience in replacing disks by copying the system from one disk to another. After replacing the disks, we rebuilt the cluster from scratch via SST (State Snapshot Transfer). The cluster is built on Percona and Galera, and that kind of entertainment is no joke.
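An incident like a node running out of space is the kind of thing a very small check can flag early. The sketch below is only an illustration, not what the department actually ran: the 85% threshold and the function names are assumptions.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of used space on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def needs_attention(used_percent: float, threshold: float = 85.0) -> bool:
    """Return True once usage reaches the threshold (85% is an assumed default)."""
    return used_percent >= threshold
```

Calling `needs_attention(disk_usage_percent("/var/lib/mysql"))` from a scheduled job would be one way to use it; the path here is hypothetical.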

While I was traveling between data centers, the first seeds of doubt about the plan took root.

At the same time, the grunt work was so intense that I did not even think of collecting a complete case history. I simply took some photos for further study.

In parallel, another requirement arrived. LitRes has audio versions of books. So that listeners do not have to rewind the recording every time, we have a mechanism that tracks the moment playback stopped. The next time you listen, the audio version resumes from that point. For this task, I quickly had to find about 500 cores, a terabyte of RAM and a bit of analytics in Java.

I started looking into Azure, Google, DigitalOcean and everyone else who offered droplet-style solutions. Containerization was begging to be used, so why not cheerfully roll it out? Besides, the “great” plan had a separate item about it.

A month passed in correspondence and bargaining. Everything was added to the tasks in Trello, a significant part of which I had generated myself, but the result did not move forward. I wondered whether I was heading in the right direction. Before me, everything had worked somehow and was not going to stop, no matter how much empty activity I displayed. I sat down to carefully study the inventory that I had managed to collect bit by bit. Then I got up and went on a second round of the data centers.
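The stop-tracking mechanism described above can be pictured as a store keyed by listener and book. This is a minimal in-memory sketch under that assumption. The class and method names are hypothetical, and the real service (which needed hundreds of cores) is certainly more involved.

```python
from typing import Dict, Tuple


class PlaybackPositions:
    """Hypothetical in-memory store of the last stop position per (user, book)."""

    def __init__(self) -> None:
        self._positions: Dict[Tuple[str, str], float] = {}

    def save_stop(self, user_id: str, book_id: str, seconds: float) -> None:
        """Remember where the listener stopped, in seconds from the start."""
        if seconds < 0:
            raise ValueError("position must be non-negative")
        self._positions[(user_id, book_id)] = seconds

    def resume_from(self, user_id: str, book_id: str) -> float:
        """Return the saved position, or 0.0 if this book was never played."""
        return self._positions.get((user_id, book_id), 0.0)
```

Each new stop event simply overwrites the previous one, so the store always holds the latest position for a given listener and book.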

The second, unhurried visit to the data centers put everything in its place. When I saw all of this not in the console, not in Excel, but in real life, my perception of reality changed radically. I realized that I was not doing what I should be doing at all, because first I needed to understand what I was working with.

Until I understand what I’m working with, all plans are a waste of time.

I studied the racks, compared reality with the lists, made corrections, and stumbled upon a unit holding 20 blades. Out of the 20, only 4 were in use. I poked at the blades and realized that we did not need any droplets for the solution, because I had found 340 dormant cores, 2 TB of RAM and 17 TB of disk space! These were old backends and old cluster nodes that people had simply stopped using, and time had erased the memory of their existence. I rolled out Kubernetes on these blades and crossed off one major task.
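Comparing reality with the lists boils down to a set difference plus a sum over whatever turns out to be idle. The sketch below shows one way to do it, assuming each machine found on site is written down as a small record. The `Host` fields and function names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Host:
    """One machine found during the walk-through (all fields are assumed)."""
    name: str
    cores: int
    ram_gb: int
    disk_tb: float
    in_use: bool


def reconcile(inventory: Set[str], found: List[Host]) -> Tuple[List[str], List[str]]:
    """Return (hosts seen on site but not in the inventory, hosts listed but not seen)."""
    found_names = {h.name for h in found}
    unrecorded = sorted(found_names - inventory)
    missing = sorted(inventory - found_names)
    return unrecorded, missing


def idle_capacity(found: List[Host]) -> Dict[str, float]:
    """Sum the resources of every host that is physically present but not in use."""
    idle = [h for h in found if not h.in_use]
    return {
        "cores": sum(h.cores for h in idle),
        "ram_gb": sum(h.ram_gb for h in idle),
        "disk_tb": sum(h.disk_tb for h in idle),
    }
```

Running `reconcile` against the inventory sheet shows what the tables are missing, and `idle_capacity` then shows how much hardware is already paid for and waiting.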

Conclusion from the first mistake

Analyze and study. Without a preliminary analysis of the situation, the train does not move.

Thanks to that thoughtful field trip, I now had a fairly up-to-date map of the equipment and of the overall system architecture. It was already January. I had spent two months on this, half of which I simply rushed from side to side. I did not know which fire to put out first or which task to tackle first, and the flow of support requests never let up.