
To paraphrase a famous historical figure: of all services, the most important for us right now is delivery. What would we do in the current situation without rolls or pizza brought to the door? I don't even want to imagine. It so happened that two years ago we took over support for one of the largest delivery chains. Today, in 2020, the site's monthly audience is about a million people; in 2018, when the company started working with us, it was half that. We ensured a painless move to a new data center and completely rebuilt the infrastructure, thanks to which the site and the applications now handle the increased load without problems.

Sometimes we sincerely regret that we are geographically far from the nearest of their restaurants: otherwise we would celebrate every success (and eat away every little misfortune) with delicious rolls.

In general, today we want to tell the story of supporting one of the largest web projects in the domestic food-service industry.

We met at the end of March 2018.

International Women's Day had long passed, but the guys had only just coped with its consequences. The story is pretty banal: on the eve of March 8, traffic spiked sharply and the site was unavailable for a long time. Really long, not just a couple of hours. Traffic came not only through the main site, but also from the mobile application (available for Android and iOS) and from aggregators (Yandex.Food, Delivery Club, Zaka-Zaka, etc.).

What we saw

Technically, the project turned out to be quite complicated:

  • The site: a React application with SSR (server-side rendering).
  • Mobile applications for iOS and Android.
  • The API, which all the applications work with.
  • External systems, including order processing.

At the entry point of the system were reverse proxy servers: traffic reached them through a DDoS-protection service and was distributed from there to the backend servers. At the time of handover there was the old site and the API for the mobile apps, and development of a new site had just begun; the new API was being developed on separate servers.

The database cluster consisted of two servers with master-master replication, where failover was performed at the network level by means of a floating IP. All writing applications worked with this IP, while for reads there was a MySQL slave on each backend server, which the application accessed via localhost.

At the backend, we saw the following problems:

  • An insufficiently reliable balancing mechanism in the database configuration: the master-master replication setup led to frequent failures.
  • A slave on every backend server required a large amount of disk space, and any manipulation, including adding new backend servers, was expensive.
  • There was no common deployment system for the applications, only an ad-hoc web-based deployment tool.
  • There was no system for collecting logs, which made incidents difficult to investigate, especially those involving the order system, since there was no way to trace how a particular order had been processed.
  • There was no monitoring of business indicators, so a drop in orders, or their complete absence, could not be detected in time.


After an initial audit of the servers taken under monitoring, we drew up an operational roadmap. Two main areas of work were identified at the start:

  • Stabilization of the applications.
  • Organization of a comfortable development environment for the new API and site.

Work in the first area was primarily about stabilizing the MySQL cluster. We did not want to give up master-master replication, but it was impossible to keep working with the floating IP: periodic losses of network connectivity led to disruptions in the cluster.

First of all, we decided to abandon the floating IP in favor of a proxy server that controls the upstream between the masters; we used nginx as a TCP proxy for MySQL. The second step was allocating two separate servers for the slaves, and work with them was also organized through the proxy. Since this reorganization we have forgotten about the problems associated with the database.
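
In simplified form, the scheme looks roughly like this. This is only a minimal sketch of the idea; the host names and ports are illustrative, not taken from the real configuration:

```nginx
# nginx built with the stream module proxies TCP connections to MySQL.
stream {
    upstream mysql_write {
        # Both masters are listed, but traffic stays on the first one;
        # the backup server only takes over if the primary becomes unavailable.
        server db-master-1:3306;
        server db-master-2:3306 backup;
    }

    upstream mysql_read {
        # Dedicated slave servers for read traffic.
        server db-slave-1:3306;
        server db-slave-2:3306;
    }

    server {
        listen 3306;           # applications send writes here
        proxy_pass mysql_write;
    }

    server {
        listen 3307;           # and reads here
        proxy_pass mysql_read;
    }
}
```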

Next, we set up monitoring of orders at the level of database queries: any deviation from the norm, in either direction, immediately triggered an investigation. Then, at the level of logs, we built metrics for monitoring external interactions, in particular with the order management system.
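
A check of this kind can be run every few minutes; the table and column names below are hypothetical, for illustration only:

```sql
-- Count the orders placed in the last 10 minutes; the monitoring system
-- compares the result with the usual level for this time of day and
-- raises an alert on a noticeable deviation.
SELECT COUNT(*) AS recent_orders
FROM orders
WHERE created_at >= NOW() - INTERVAL 10 MINUTE;
```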

Together with the client's team, and at their request, we additionally tuned all the systems for stable and fast operation: this included MySQL tuning and upgrading PHP versions. In addition, the developers implemented a Redis-based caching layer, which also helped reduce the load on the database.

All this was important… but the main thing for the business was growth in sales, and in this context the company's managers pinned great hopes on the new site. The developers needed a stable and convenient system for deploying and managing the applications.

First of all, we thought about CI/CD pipelines for building and delivering the applications, as well as about a system for collecting and working with logs.

We started by implementing the pipelines in the dev environments, which significantly increased development speed. Then they were rolled out to the production circuits, where automated deployment helped avoid the frequent errors usually caused by the human factor.
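
The specific CI tool is not named here, so the pipeline below is only a hypothetical GitLab-CI-style sketch of the approach; the stage names, commands, and deploy script are made up for illustration:

```yaml
stages:
  - build
  - deploy_dev
  - deploy_prod

build:
  stage: build
  script:
    - composer install --no-dev     # assuming the PHP backend mentioned above
    - npm ci && npm run build       # assuming the React/SSR frontend

deploy_dev:
  stage: deploy_dev
  script:
    - ./deploy.sh dev               # hypothetical deployment script
  environment: dev

deploy_prod:
  stage: deploy_prod
  script:
    - ./deploy.sh production
  environment: production
  when: manual                      # production rollout remains a deliberate step
```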

With CI/CD in place, we moved on to organizing the collection of logs and work with them. We chose the ELK stack, which let the client investigate incidents faster and more thoroughly; as a result, application development also sped up.
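
Log shipping in the ELK stack is typically organized with Filebeat; a minimal sketch of such a configuration (paths, hosts, and field names are illustrative, not the project's real ones) might look like this:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/*.log
      - /var/www/app/logs/*.log      # hypothetical application log path
    fields:
      project: delivery-site         # hypothetical tag to filter by in Kibana

output.logstash:
  hosts: ["logstash.internal:5044"]  # illustrative Logstash address
```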

“Worse than two fires…”

After solving quite complex, but nonetheless standard tasks, the company told us what they had wanted to say for a long time: “Let’s move!”

The change of data center was driven by economic factors. In addition, the client had expanded its infrastructure with additional services that were already hosted in the new DC, which also influenced the decision to move.

The migration of any system is a process that requires careful planning and significant resources.

The move was carried out iteratively. At the first stage, reverse proxy servers were set up in the new DC; since only they had public IPs, they also served as the administrators' access points to the system.

Then we launched all the infrastructure services: logging and CI/CD. Consul made it possible to organize convenient, manageable, and fairly reliable interaction between the client's applications.
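
To illustrate the idea: a service registered with Consul becomes resolvable as <name>.service.consul through Consul DNS. A minimal service definition, with a purely hypothetical name, port, and health check, looks roughly like this:

```hcl
service {
  name = "mysql-master-1"
  port = 3306

  # A simple TCP health check; unhealthy instances drop out of DNS answers.
  check {
    tcp      = "127.0.0.1:3306"
    interval = "10s"
    timeout  = "2s"
  }
}
```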

Next we migrated the databases, Redis, and the message broker, RabbitMQ. Here it was important to register everything correctly in service discovery, which in turn controlled DNS. Note that the applications did not work with the databases directly, but through HAProxy, which makes it convenient to balance between databases and switch over in case of a failure.
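
The actual HAProxy configuration is not shown in the text; a rough sketch of the idea (the server names reuse the hypothetical Consul DNS names from the previous example) could look like this:

```haproxy
# Applications connect to localhost; HAProxy keeps write traffic on the
# active master and fails over when the health check on it fails.
# In a real setup, a resolvers section pointing at Consul DNS would
# typically be added so that name changes are picked up at runtime.
listen mysql-write
    bind 127.0.0.1:3306
    mode tcp
    option tcp-check
    server master1 mysql-master-1.service.consul:3306 check
    server master2 mysql-master-2.service.consul:3306 check backup

listen mysql-read
    bind 127.0.0.1:3307
    mode tcp
    balance roundrobin
    option tcp-check
    server slave1 mysql-slave-1.service.consul:3306 check
    server slave2 mysql-slave-2.service.consul:3306 check
```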

At the preparatory stage, we did not set up database replication between the data centers; we simply transferred backups. Next, we moved on to configuring the applications themselves, which meant organizing all interaction through the internal DNS: between the application and the database, Redis, RabbitMQ, and external services (for example, the order services). Naturally, all CI/CD mechanisms were connected at the same stage, and this is where a second architectural change came in. Previously, application settings could not be changed through an interface, only by editing files directly in the console. We introduced a solution that allows settings to be managed conveniently through a web interface. It was based on HashiCorp Vault (with Consul as its storage backend), which let us build convenient mechanisms for managing environment variables.
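
As a rough illustration of how such a scheme works (the secret path and keys here are hypothetical): settings edited through the Vault web UI land in its KV store, with Consul as the storage backend, and are pulled at deploy time to become environment variables.

```bash
# Store the application settings (normally done through the web UI):
vault kv put secret/site/production DB_HOST=db.service.consul REDIS_HOST=redis.service.consul

# At deploy time, read a value and expose it to the application:
export DB_HOST="$(vault kv get -field=DB_HOST secret/site/production)"
```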

The next step was switching the services over to the new DC. Since all the systems communicated over HTTP, and all domains went through the DDoS-protection service, the switchover came down to changing the upstreams directly in that service's interface.

Beforehand, the necessary replicas had been set up from the old DC to the new one, and the switchover itself was performed during an agreed maintenance window.

What the infrastructure looks like now

  • All traffic goes to the load balancers. Traffic to the API from the mobile applications (Android / iOS) does not go there directly, but through Qrator.
  • The main site of the project lives on a static server; there is also a server with landing pages.
  • The backend cluster now consists of frontend servers, static servers, and application servers.

Order flow chart
An incoming order passes through Qrator (where attacks are filtered out right away) and reaches the API. Then it goes to Raiden for order delivery, passes through Redis and nginx, and after that ends up in the database.

What has changed for the customer

  • System reliability: the last problems were observed in July 2019, when orders could not be placed for about an hour; but that was before the big move. No major incidents have been observed since.
  • Developers' quality of life: they now have a convenient development environment and CI/CD.
  • Fault tolerance: the infrastructure now withstands heavy traffic. For example, during the holidays the load peaked at 550 RPS.

What’s next

In today's conditions, online sales come to the fore. The project must remain reliable and available for the service's customers. But development is also a very important component: product releases should be as fast as possible and invisible to end users.

Another important issue is resource utilization and reducing the cost of maintaining the system.

All this leads to the need to rethink the system as a whole. The first step is to containerize the applications; after that, we plan to set up a Kubernetes cluster. But that is a topic for the next article.