This article is for readers who ran their own data center or used classic hosting and then stepped into the world of public cloud platforms. If you don't think through the details at the planning stage, problems are almost inevitable. I'll explain how to avoid them.
What is a public cloud
Gartner calls public cloud vendors Cloud Infrastructure and Platform Service Providers, i.e. companies that offer infrastructure and platform services. The market leaders are Amazon, Google, and Microsoft.
A close study of the main players and buyers in the cloud services market shows that SaaS products, media services, and internet-facing resources spend the most money on clouds.
Companies choose the cloud primarily to reduce a service's time to market. A modern microservice web application involves a lot of technology, and you have three options for running it:
- Build everything yourself on your own hardware, and spend roughly 2/3 of product development time preparing the infrastructure.
- Use virtual machines, and spend about 1/2 of the time.
- Choose platform services, start writing revenue-generating code immediately, and decide later whether it is worth running the infrastructure yourself.
The public cloud lets you provision infrastructure automatically (via scripts, code, or tools like Terraform) and manage it not as traditional physical servers but, for example, as code. And thanks to automatic scaling, you spend exactly as much money as you need right now: the cloud lets you quickly obtain a lot of resources and then release them.
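As a minimal illustration of infrastructure as code, here is a sketch in AWS CloudFormation's YAML (Terraform plays the same role with its own syntax). The image ID and instance type are placeholders, not values from this article:

```yaml
# Hypothetical single-VM stack, declared as code instead of a hand-built server.
Resources:
  AppServer:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro               # placeholder instance type
      ImageId: ami-0123456789abcdef0       # placeholder image ID, substitute your own
      Tags:
        - Key: Name
          Value: app-server
```

Creating, changing, and deleting this stack is a single CLI call, which is what makes it practical to release resources the moment you no longer need them.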
What will have to change in the clouds
Scaling Approach
On-prem, if you run out of resources, you just buy a more powerful server. Vertical scaling is possible in the cloud too, but there is always a ceiling: platforms usually buy servers of the same type, with a fixed maximum of CPU and RAM. Therefore, spread workloads horizontally. With stateless workloads this is easy: frontends and backends usually scale horizontally. Difficulties usually arise with stateful databases.
Recommendations:
- Divide workloads into stateless and stateful:
- Stateless workloads autoscale horizontally (see the sketch after this list).
- Stateful workloads are usually scaled vertically, by hand.
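A sketch of horizontal autoscaling for a stateless tier, again in CloudFormation YAML; the subnets, image ID, and thresholds are illustrative assumptions:

```yaml
# Hypothetical autoscaling group for a stateless backend.
Parameters:
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>        # subnets in different availability zones

Resources:
  AppLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        InstanceType: t3.small              # placeholder instance type
        ImageId: ami-0123456789abcdef0      # placeholder image ID

  AppGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "10"
      VPCZoneIdentifier: !Ref SubnetIds     # spread instances across zones
      LaunchTemplate:
        LaunchTemplateId: !Ref AppLaunchTemplate
        Version: !GetAtt AppLaunchTemplate.LatestVersionNumber

  CpuTargetPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref AppGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60                     # add/remove instances to hold ~60% CPU
```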
Availability Zones
Traditionally, you know exactly where your virtual machines and other resources live, down to the server and rack. In the cloud, you can pinpoint them only down to the data center. These data centers are called availability zones and are often grouped into regions, so that you can build a service that survives a data center failure.
What is important to understand about data centers and availability zones? Zonal services (for example, virtual machine disks) never leave their zone. Regional services (for example, Object Storage at virtually every vendor and service provider) are tied to a region. Global services (for example, Amazon's Route 53, a DNS service that is global worldwide) do not depend on specific data centers.
Recommendations:
- Examine the scope of the services you plan to use: different services span different availability zones and operate in different modes.
Fault tolerance
In the cloud, at a certain scale, new failure scenarios appear, and the old ones do not go away. Possible scenarios:
- Cascading failures.
- DDoS attacks.
- Maintenance updates within an availability zone.
Recommendations:
- Account for failure scenarios when sizing the project.
Quotas
In the cloud you can get a lot of resources, but you are limited by quotas. A quota is the maximum amount of a resource you are allowed to consume. Quotas are usually raised fairly quickly and semi-automatically, but complex requests need manual approval. If you don't prepare for a surge in advance, you'll hit a quota and be left without the resources to handle the load.
Recommendations:
- Carefully read the section on quotas and limits for the services you plan to use.
- Constantly monitor quotas to avoid incidents.
Checklist for IaaS
Now let’s move on to specific services and their implementation in the clouds. Let’s start with IaaS.
Virtual machines
A virtual machine is a zonal entity, a kind of abstract container. It has a disk and a network interface, which usually cannot leave the availability zone.
Fault tolerance. Because of this, if you need fault tolerance, place the service in several availability zones, or at the very least attach identical virtual machines to different racks.
Oversubscription. Providers offer instance types with, for example, a 5/20/50% or 100% CPU guarantee. Why does this matter? Suppose you run tests on oversubscribed CPUs: one such CPU handles 100 RPS and keeps doing so for a week, two, three. Then another tenant lands on the same physical CPU, and your performance drops sharply. Therefore, choose CPUs without oversubscription for production.
Scaling. Cloud platforms can seem magical, as if their resources were endless. But virtual machines land on specific servers with specific CPU and GPU models. The provider buys one generation of server models first, then another. At some point you will be unable to expand or create a new virtual machine on the old generation: there simply won't be enough servers. It is therefore important to follow your service provider's platform news, read its documentation, and update the platforms in your application. New platforms are also very often cheaper than old ones: that is how the provider encourages you to switch.
What to look for:
- Zonal VMs (x86 or ARM) usually cannot migrate to another data center.
- There are modes with a guaranteed and a non-guaranteed CPU share.
- VMs differ by platform (CPU vendor/model, number of CPUs per host, CPU/RAM ratio).
Recommendations:
- Reserve compute resources for the application in different availability zones.
- Use guaranteed, dedicated CPUs for predictable scaling in production. Pack workloads that need less than one CPU together using containers.
- Watch for new platform releases so you are not caught out when the current platform's resources run out.
VPC network
A cloud network is a system of its own, with its own rules and patterns. For example, broadcast and multicast are prohibited in service provider networks. As a result, traditional clustering and fault tolerance tools, such as VRRP-based failover, which assume servers plugged into an ordinary switch, will not work. You have to use new tools, such as load balancers.
Latency
Every entity in the cloud lives in a data center. The equipment often differs between DCs, and the DCs themselves are connected by different links. So there is always latency within an availability zone, between zones, and between regions, and it is unique in every case. Pay attention to your regions and availability zones and run network latency tests.
Load
Cloud networks are usually designed for a general-purpose load profile: SaaS, transactional services, and so on. By default, 80% of workloads run fine out of the box. But there are workloads that are network-bound, with elevated performance requirements.
Accordingly, you need to know that every cloud offers both a regular network and an accelerated one.
VPC tier
How do you know which network tier you need? Very simple: run load tests.
What to look for:
- Cloud architecture.
- SDN and physical network layers.
- Performance of each network tier.
Recommendations:
- Learn the network's rules and supported protocols.
- Learn what latency to expect within an availability zone, between zones, and between regions. Use that to draw the line between synchronous and asynchronous replication for your systems.
- Run a load test to find out whether you need higher network performance.
Load balancers
Clouds typically have two types of load balancers. The first is the network balancer. It processes packets, usually by doing destination NAT, and nothing else; because of that, it scales continuously. In other words, you are more likely to hit the performance ceiling of your backends than of this balancer.
When you distribute load, other tasks appear: terminating SSL, routing by URL, and so on. That is the responsibility of application-layer load balancers, which sit behind the network-layer ones. Here everything depends on the cloud: you can run a load balancer yourself or use the one the cloud provides. Why is there a choice? First, the application-level balancers a cloud provides can be expensive. Second, Kubernetes Ingress controllers, for example, give you the same functionality in a form that is stable, convenient, and standardized across clouds, so you can use them not just in one cloud but in different places. The important conclusion: you cannot avoid network load balancers in any cloud, but at the application level you choose whatever is more convenient.
What to look for:
- Network balancers work at L3/L4. “Dumb”, but they scale linearly and are inexpensive.
- Application balancers work at L7. “Smart”, but they usually scale in steps and can be expensive.
Recommendations:
- You cannot live without network balancers in the cloud.
- Learn the balancer types in your cloud, their scope, and their protocol support.
- An application load balancer is optional: here you choose between taking it from the cloud and running your own (see the Ingress sketch below).
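For the "run your own" option, here is a minimal sketch of URL routing and TLS termination with a Kubernetes Ingress; the hostnames, service names, and certificate Secret are illustrative assumptions:

```yaml
# Hypothetical L7 routing: TLS ends here; an L4 network balancer forwards packets in front.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
spec:
  tls:
    - hosts: [example.com]
      secretName: example-com-tls      # assumed pre-created certificate Secret
  rules:
    - host: example.com
      http:
        paths:
          - path: /api                 # route API traffic to one backend...
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8080
          - path: /                    # ...and everything else to the frontend
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 80
```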
Block storage
Storage is an important part of any IaaS.
A disk is almost always a zonal entity. Disks that span multiple zones are usually very slow.
Cloud disks offer several ways to raise performance, for example by increasing the disk size: this lets the provider isolate workloads between users while giving you more performance. But with almost any disk type, you will sooner or later hit its limits. To go beyond them, attach multiple disks to a virtual machine and build RAID, or run systems that can write to multiple disks in parallel.
Recommendations:
- Replicate stateful applications (at least to another AZ).
- If a disk's performance is insufficient, increase its size.
- Disks can be aggregated at the OS level (RAID 0 or other software tools).
- Do a load test to see if you need more disk performance.
Checklist for PaaS
Kubernetes
Kubernetes, as a cloud-native system, is usually aware of the topology of the cloud its workers run on. That is, Kubernetes knows about availability zones and can apply anti-affinity rules to nodes. This knowledge, exposed through node labels, very often lets you schedule pods inside K8s across different availability zones, different node types, and so on.
Autoscalers. The cloud provides a Cluster Autoscaler. It adds nodes to the cluster and removes them depending on the load profile. Importantly, the Kubernetes Cluster Autoscaler operates on the sum of the resource requests of your pods: if pods cannot be scheduled because their requests exceed the cluster's free capacity, the autoscaler goes off to create a new node. This works in parallel with the Horizontal Pod Autoscaler, which uses metrics from the Metrics Server. Cluster Autoscaler and Horizontal Pod Autoscaler are thus parallel autoscalers with different scaling principles. To scale both pods and nodes properly, you need to set up requests and understand how each of them works.
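A sketch of how the two interact, with illustrative names and numbers: the requests drive the Cluster Autoscaler's node math, while the HPA scales replicas on CPU metrics:

```yaml
# Hypothetical stateless deployment: requests are what the scheduler
# and Cluster Autoscaler count when deciding whether a new node is needed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0   # placeholder image
          resources:
            requests:
              cpu: 250m        # workloads under 1 CPU pack densely onto nodes
              memory: 256Mi
---
# The HPA scales pod replicas from Metrics Server data, independently of nodes.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```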
Pod disruption budget. In the cloud, Kubernetes updates (usually in Managed Kubernetes) work like this. Say you have a three-node cluster and need to run an update. What happens? Three new nodes are added to the cluster, and the old ones go through cordon & drain: pods are evicted from those nodes and restarted elsewhere, then the old nodes are stopped and removed. The cluster ends up with three new nodes. But during cordon & drain the pods restart, so application downtime can happen. A pod disruption budget in Kubernetes makes cordon & drain proceed in a way that limits the application's downtime.
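A minimal PodDisruptionBudget sketch, assuming an app labeled `app: api` with several replicas:

```yaml
# Drains will be throttled so that at least two pods stay up at all times.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```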
Local External Traffic Policy. In any cloud you can create a Service of type LoadBalancer, and the cloud load balancer will attach to your cluster. It could be an application balancer, but it is usually a network balancer, which can be connected to an Ingress. If you simply create it with defaults, the Cluster external traffic policy takes effect: the balancer spreads load across all nodes of the cluster. Even nodes running none of your application's pods receive traffic; they pass it to kube-proxy, which balances it again, Kubernetes-style, to a pod. That is, by default you get two balancing hops, and the application never sees users' IP addresses.
Kubernetes has a second way to connect a load balancer: the Local external traffic policy. The balancer distributes load only to the nodes where your pods run, effectively straight to the pods. Latency drops because there is a single balancing hop, and you see users' IP addresses. In most production cases, the Local external traffic policy is the one to use.
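A sketch of such a Service, with illustrative names and ports:

```yaml
# Hypothetical Service: the cloud network balancer sends traffic only
# to nodes hosting matching pods, in one hop, preserving client IPs.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
```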
Recommendations:
- Write anti-affinity rules with your constraints in mind (see the sketch after this list):
- Deployments without disks can be spread across different nodes.
- But a StatefulSet that uses disks may need to be spread across different availability zones to keep the service running if one AZ fails.
- Combine affinity rules with Cluster Autoscaler for uniform scaling across availability zones.
- Use PodDisruptionBudget and node deployment policy settings to minimize downtime when upgrading nodes. Tune the RollingUpdate strategy to minimize downtime when updating a Deployment.
- Connect the load balancer using the Local external traffic policy: it will reduce the latency of user requests.
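One way to express the zone-spreading rule above is a topology spread constraint on the standard zone label (pod anti-affinity keyed on `topology.kubernetes.io/zone` achieves a similar effect); all names here are illustrative:

```yaml
# Hypothetical deployment spread evenly across availability zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                  # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone    # standard node label set by clouds
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: {app: api}
      containers:
        - name: api
          image: registry.example.com/api:1.0         # placeholder image
```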
Object storage
In the cloud, you can store static content cheaply: images, JavaScript, HTML, and so on. Almost every cloud has an Object Storage service that speaks the S3 protocol. It lets you pile up content and serve it to users, but you pay for traffic here. It is therefore recommended to put a CDN in front of Object Storage. It gives you and your users lower latency and saves money, because less outgoing traffic leaves the storage. Object Storage also lets you save not only on storage but on the processing of static content.
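A hedged sketch of the bucket-plus-CDN pattern in CloudFormation YAML; the cache policy ID is AWS's managed "CachingOptimized" policy, and the access setup is simplified for illustration:

```yaml
# Hypothetical static-content stack: an S3 bucket fronted by CloudFront.
Resources:
  StaticBucket:
    Type: AWS::S3::Bucket

  StaticCdn:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Enabled: true
        Origins:
          - Id: static-origin
            DomainName: !GetAtt StaticBucket.RegionalDomainName
            S3OriginConfig:
              OriginAccessIdentity: ""                      # simplified: public bucket access
        DefaultCacheBehavior:
          TargetOriginId: static-origin
          ViewerProtocolPolicy: redirect-to-https
          CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6  # AWS managed "CachingOptimized"
```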
Remember that Object Storage is not file storage.
Database
Each cloud has its own set of databases, but all of them fall into two groups:
1. Open-source databases (for example, Postgres). They are available almost everywhere. Usually you buy pre-provisioned resources (choosing the number and size of nodes) and scale the database by hand. Some things scale automatically, but you need to study the documentation of the specific database in the specific cloud.
2. Proprietary databases. They are usually cheaper and are even available in serverless mode. There is a drawback, though: leaving a proprietary database is hard, because you have to rewrite the code completely.
So choose a managed database based on the cases you will use it for, primarily from a business point of view: how far do you plan to expand your geography, and will the number of users grow?
What to look for:
- Open-source databases are pre-provisioned and have limits. They are usually scaled horizontally.
- Proprietary databases are usually serverless and scale as much as you like. But there is vendor lock-in.