If you are just starting out with Infrastructure as Code tools or thinking about how to integrate them into your CI/CD pipeline, this is the article for you. We will learn how to build an infrastructure automation process and implement Infrastructure as Code.

The article provides a basic overview of Infrastructure as Code as a concept and focuses on the methodology and principles of its implementation in daily development and deployment.

Disclaimer: This article is NOT serious documentation about specific tools and technologies.

What is infrastructure

Infrastructure is the resources needed to support the code. Someone may still imagine server racks, switches, and cables… but that was yesterday. Today, 99% of projects live in the “clouds”; that is, the resources are virtual machines, containers, and load balancers.

That is, all cloud resources are simply other software running on our cloud provider’s computers.

Infrastructure as Code is a way of provisioning and managing computing and network resources by describing them as software code, as opposed to setting up the necessary equipment yourself or using interactive tools.

Why should you pay attention to Infrastructure as Code

Infrastructure as Code is a (no longer) new trend that solves the current problem of infrastructure automation.

Many of us have been in a situation like this:

– Listen, I need to deploy a load balancer…
– Sorry, we have an outage! Please create a ticket in JIRA and come back in two days…

If the infrastructure were automated, this dialogue (and the delays) would never have happened, because the load balancer would be deployed automatically. That is why infrastructure automation is so popular: it solves not only technical issues, but also organizational and communication ones. Automation makes our lives easier and turns chaos into a predictable process.

If you are just starting to get acquainted with this topic, you may be stunned by the number of tools on the market. How do you build a process that lets the architecture evolve and the tools change?

Problem of scaling

According to my observations, one microservice requires an average of 10–12 infrastructure resources (a load balancer, an RDS instance, security groups, etc.). With three environments – test, staging, production – that is already about 30 resources. And with 10, 20, or 100 microservices, the problem grows accordingly.

Problem of predictability

If you create all these resources manually, the question “What do we do if we make a mistake and our environments diverge? What bugs can that lead to?” turns into “What do we do when…”, because the probability of making a mistake in several hundred manual operations is close to 100%.

Given these problems, infrastructure automation is becoming not just a fashion trend, but a necessity.

Ways to solve it

We have proven code methodologies that we can use. We already know how to build a process: how to store, test and deploy code.

One of the most well-known methodologies for working with code is The 12 Factor App. It was promoted by Heroku, one of the cloud providers. Among the goals of this methodology are:

  • to ensure maximum portability between environments, reducing the risk of differences and bugs – this is what makes Continuous Deployment possible;
  • to make automation as simple as possible, so that developers do not spend a lot of time getting started on a project.

Among the 12 principles of The 12 Factor App the most important are:

  • Codebase
  • Configuration
  • Logging
  • Development / Production Parity

Codebase

When we work with microservice code, we do not store it locally; we use version control systems (Git, Mercurial, etc.). Infrastructure code should be no exception: this way we do not lose the history of changes, and we know the reason for each of them.

If our code becomes the single source of truth for all environments, with no custom patches for individual environments, then we can get rid of the problems of manual deployment, where each release is an attempt to remember where the pitfalls are hidden. We can make infrastructure deployment fully automated.

Configuration

But for automated deployment, the configuration must be stored separately from the code and made available during deployment. The 12 Factor App recommends doing this through environment variables. This is a universal approach that works on any operating system. Moreover, it is safer than command-line arguments, because environment variables cannot be read from another process with a simple ‘ps aux’.
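As a minimal sketch (the service binary and variable names here are hypothetical), the configuration enters the process through the environment, while the code stays identical everywhere:

# hypothetical configuration for one environment, set at deploy time
export DATABASE_URL="postgres://app:secret@db.staging.internal:5432/app"
export LOG_LEVEL="info"

# the same binary runs in every environment; only the environment variables differ
./my-service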

Logging

During infrastructure deployment, we need to monitor the state of the deployment and of the system as a whole. And we can do this with logs. To make them work anywhere, The 12 Factor App suggests treating logs as a stream of events with no beginning or end, sorted chronologically and written to stdout.

Log management is a separate issue that can be solved with Fluentd, Elasticsearch, or by directing the stream to another file or process. Most importantly, logs in stdout can be integrated with any system, and they also work locally when you are debugging.
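For example (the service binary is again hypothetical), the same stdout stream serves both cases:

# local debugging: read the stream directly in the terminal
./my-service

# production-like setup: duplicate the same stream to a file or hand it to a collector
./my-service | tee service.log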

Development / Production Parity

And the most important principle is Development/Production Parity. If we store the code in the version control system and use it as the single source of truth, the code itself has no special cases for individual environments, and all unique settings are stored separately and made available during deployment as environment variables – we get a system with no differences between environments (test vs staging vs production).

The 12-Factor App FTW

Yes, we may face a situation like “I need two instances for test and 50 for production.” But the difference will be in the settings, not in the code. And that paves the way for Continuous Deployment: if we can automatically deploy and test our changes, we can do so in any environment.
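A minimal sketch of this idea with Terraform (the variable name is illustrative; the TF_VAR_ prefix is Terraform’s standard way of reading input variables from the environment):

# the code declares a variable "instance_count" once; each environment sets its own value
export TF_VAR_instance_count=2    # test
# export TF_VAR_instance_count=50 # production
terraform apply -input=false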

If our logs go to stdout and our configuration comes from environment variables, we have no problem integrating with modern CI/CD solutions. Travis CI, GitLab CI, GitHub Actions, Jenkins, and other tools can read code from the version control system, pass configuration through environment variables, and work with logs in stdout.
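For example, here is a minimal sketch of a GitLab CI job (the job name and variable are illustrative; the Terraform commands are the ones described later in this article):

# .gitlab-ci.yml
deploy_infra:
  stage: deploy
  variables:
    TF_VAR_environment: "staging"  # configuration enters as an environment variable
  script:                          # each command writes its logs to stdout, i.e. the job log
    - terraform init
    - terraform plan -input=false
    - terraform apply -input=false -auto-approve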

“Hello, World!”, or how to start

If we start to google “infrastructure tools”, we may be surprised by the choice on the market. We need a tool with which we can not only write an analog of “Hello, World!”, but also conveniently maintain a real system.

The first choice (if you use AWS) may be the AWS CLI. You can use it to create, modify and delete cloud resources:

aws elb create-load-balancer \
  --load-balancer-name myELB \
  --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
  --subnets subnet-15aaab61

Here is an example of a command that creates a load balancer using the AWS CLI. At first glance it is quite clear: this command will work as planned and create a load balancer… But do the other resources (subnets, security groups) that this command refers to actually exist? If not, they need to be created (and that is a few more commands). If they do exist but have different identifiers, you need to find the correct identifiers and substitute them into the command.

And what if the load balancer already exists? Then you need to add a command that checks its existence. And what if it exists, but with different parameters than it should have? Then we have to check its state and wrap our command in an if-else statement: “If the resource does not exist – create it; if it does – change its parameters.”
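Here is a sketch of what such a wrapper starts to look like (error handling and most parameter checks are omitted):

#!/bin/bash
if aws elb describe-load-balancers --load-balancer-names myELB >/dev/null 2>&1; then
  # the balancer exists: compare each parameter and fix the ones that differ
  aws elb attach-load-balancer-to-subnets \
    --load-balancer-name myELB \
    --subnets subnet-15aaab61
  # ...plus similar checks for listeners, security groups, and so on
else
  aws elb create-load-balancer \
    --load-balancer-name myELB \
    --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
    --subnets subnet-15aaab61
fi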

That’s too many commands! And the problem is not even the size of the script, but its fragility: for every trivial operation we need to handle many edge cases. And if we miss one, there is a chance of destroying the entire infrastructure.

Declarative vs imperative tools

The fragility of such scripts comes from the fact that the AWS CLI is an imperative tool. Imperative tools work according to the scheme “I want to change the world; to do this, I will perform X.” But if the world is not in the state we expect, then at best the tool will return an error and do nothing; at worst, it will do something other than what we expect.

Declarative tools are better suited to infrastructure. They work according to the scheme “I want to change the world and leave it in state Y.” Instead of describing each step needed to reach the goal, declarative tools describe the goal itself – the final state. The tool decides for itself which steps to take; the user does not have to describe every action and every condition.

Declarative tools for AWS include AWS CloudFormation and Terraform. For our example, let’s choose Terraform. In it, the load balancer looks like this:

resource "aws_elb" "myELB" {
  name = "myELB"

  listener {
    instance_port     = 8000
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  subnets = [...]
  security_groups = [...]
}

...

Here we see another problem – linking some resources to others. To make our pseudocode real, we need to add definitions for the security groups (subnets, etc.). For example, a link to a security group might look like this:

resource "aws_elb" "myELB" {
  name = "myELB"
  ...
  security_groups = [aws_security_group.elb.id]
}

resource "aws_security_group" "elb" {
  name = "web_alb"
  description = "Allow incoming HTTP connections to ALB."

  ingress {
    from_port = 80
    to_port = 80
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

We define the rules for this security group – accept traffic on port 80 – and the deployment details of the group become irrelevant. Terraform will deploy the resource itself, obtain its ID, and substitute it into the load balancer’s parameters.
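Terraform derives the deployment order from such references. You can inspect the resulting dependency graph yourself (rendering the image assumes Graphviz is installed):

# print the dependency graph in DOT format and render it as an image
terraform graph | dot -Tsvg > graph.svg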

We can add Terraform declarations to the version control system and use them as a source of truth for different environments. And here we see the next problem.

State of the world

How does Terraform know what to deploy and what not? To do this, it needs to store the current state of all described resources somewhere. By default, this state is stored in the terraform.tfstate file. A question may arise: why not keep this file in the version control system? There are two reasons not to do that:

First, this file de facto contains the configuration, and we are trying to keep code and configuration separate; therefore, they should not be stored together.

Second, this file may contain sensitive information – for example, the password of a database you created in AWS.

We need to keep this state in a separate, safe place. Fortunately, Terraform has the remote state concept: we can keep the state of the world separate from the code and even export attributes of it that can be useful to other developers:

output "web_alb_sg_id" {
  value = aws_security_group.web_alb.id
}

terraform {
 backend "s3" {
   key    = "iacdemo.tfstate"
   region = "us-west-2"
   bucket = "demobucket"
 }
}

Terraform supports many remote state backends, but if you work with AWS, I recommend AWS S3, because this backend:

  • supports encryption, which solves the problem of storing confidential data;
  • supports versioning – in case of a malfunction, you will always be able to find the last correct version of the state of your world;
  • most teams can use it for free for a year (AWS free tier).
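The first two of these features can be switched on with the AWS CLI, using the bucket from the example above:

# enable versioning on the state bucket
aws s3api put-bucket-versioning \
  --bucket demobucket \
  --versioning-configuration Status=Enabled

# enable default server-side encryption for everything stored in the bucket
aws s3api put-bucket-encryption \
  --bucket demobucket \
  --server-side-encryption-configuration \
    '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'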

But even if we solve this problem, we immediately see the next one. Yes, we have created a load balancer and other resources, but …

Not all resources are the same

Yes, the load balancer is important for our microservice. But the security group it refers to is important for all microservices. Breaking a security group is much worse than breaking a single load balancer.

This is where the conflict between developers and DevOps lies. Developers need to describe and deploy resources for their services as quickly as possible. DevOps needs the entire system to be stable. However, developers do not want to wait until every infrastructure change is thoroughly checked by hand before release; they want to deploy their changes to production as soon as possible.

The only way to avoid this conflict is to divide the infrastructure into key infrastructure (which provides basic resources for the entire system) and service infrastructure (which provides resources for a particular service). If the key infrastructure is maintained separately, developers can work on the infrastructure of their microservices without fear of breaking something they know nothing about. All their mistakes and failures will stay within their own services.

How to share the infrastructure

But developers still need to refer to key resources. How to do it?

Consider the example of Terraform. We keep the state of the world separate from the code. This allows us to import it and, when declaring resources, refer to the parameters it exports:

data "terraform_remote_state" "core" {
 backend = "s3"
 config = {
   key    = "iacdemo.tfstate"
   region = "us-west-2"
   bucket = "demobucket"
 }
}


resource "aws_elb" "myELB" {
  name = "myELB"
  ...
  security_groups = "${terraform_remote_state.core.web_alb_sg_id}"

Now the key infrastructure can be deployed separately from the microservice infrastructure.

And we move on to the most important issue …

How to organize the process of infrastructure deployment

As with regular code, we can break the deployment process into parts:

  • Validation
  • Testing
  • Deployment
  • Smoke testing

(and then everything is repeated in the next environment, starting with step 1)

Testing and smoke testing deserve a separate article, so for now we will focus on validation and deployment.

Validation of infrastructure – especially key infrastructure – is very important. We need to make sure that the infrastructure changes that will be deployed are actually what we need and that there are no unpredictable side effects.

In Terraform this can be done with the following commands:

  • terraform init
  • terraform plan -input=false

The first command initializes Terraform and creates a remote state or synchronizes with it.

The second command returns the list of resources that will be created or modified, so that we can check whether the planned changes are really what we expect. (The -input=false parameter ensures that all variables are taken from environment variables, without waiting for console input. This is very useful when commands are executed in a headless environment, such as Jenkins, where there is no console.)

Carefully remove

When reviewing the list of changes, pay special attention to resource removals: without knowing the specifics of how resources work, you can make mistakes. For example, a seemingly trivial change – renaming a load balancer – can lead to the existing load balancer being destroyed and a new one being created a minute later.

If your infrastructure does not require 99.9% uptime, you can survive this; if it does, you may need to apply the create_before_destroy setting. But to do so, you need to understand how it will affect the resources that depend on the one being replaced.
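A minimal sketch of that setting, using Terraform’s lifecycle meta-argument, so the replacement balancer is created before the old one is destroyed:

resource "aws_elb" "myELB" {
  name = "myELB"
  ...
  lifecycle {
    create_before_destroy = true # build the new resource before removing the old one
  }
}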

And now let’s deploy

If all the changes are as we need, we can safely proceed to deployment. Let’s see what the final script required for full automation will look like:

# initializing the configuration
export TF_VAR_<your variable>=...

# setting up or syncing with the remote state
terraform init

# reviewing the list of changes
terraform plan -input=false

# deploying our infrastructure changes
terraform apply -input=false -auto-approve

As you can see, this is a plain shell script that can be integrated with Jenkins, GitLab CI, and other CI/CD solutions, including our own CI/CD pipeline. This solves the non-automated load balancer problem described at the beginning of this article: we can now describe our infrastructure as code, save changes to the version control system, put them through code review, and deploy automatically in minutes rather than days or weeks. But so far, all of this applies only to Terraform…

Is there life outside of Terraform

Yes! Consider the example of Kubernetes, which lets you describe infrastructure for containers and network resources declaratively. We can wrap this declaration in a template using a plain shell script (or more advanced tools – from YTT to Helm):

#!/bin/bash
cat <<YAML
apiVersion: apps/v1beta1
kind: Deployment
...
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: $SERVICE_NAME
          image: $DOCKER_IMAGE
          imagePullPolicy: Always
          ports:
            - containerPort: 8090
...
YAML

This script, too, is configured through environment variables. And if we look at the deployment process for Kubernetes resources, we will see that there is no fundamental difference:

# initializing the configuration
export DOCKER_IMAGE=hello:latest
export SERVICE_NAME=helloworld

# reviewing the list of changes
./k8s_template.yaml.sh | kubectl apply --dry-run -f -

# deploying our infrastructure changes
./k8s_template.yaml.sh | kubectl apply -f -

Here, too, we describe our resources as code and store them in the version control system, we check the changes before deploying, and we can integrate this process with a CI/CD pipeline.

And what about the future

And even if a fundamentally new declarative tool appears in the future – for managing, say, neural networks – all these methods can be applied to it as well! I hope it will now be easier for you to start experimenting with infrastructure.