How to Save on Resources with the Terraform Module for AWS Spot Instances

Saving money with Amazon Web Services (AWS) Spot Instances can be harder than it looks. Remember that the Spot market is, in effect, an auction. Prices do not swing as sharply as on a stock exchange, but they can still climb until they reach the On-Demand price. Such a rise can last not just a day or two but several months, which doubles the cost of the same resources.

How do you avoid a situation where everything went perfectly in November, but in December the holiday rush doubled the price and left you with a system that is both more expensive and interrupted? Let's look at why this happens. To do that, we will go through the Spot Instance allocation strategies and see how a Terraform module can help you avoid these problems and still save money.

A Spot Instance is unused EC2 (Elastic Compute Cloud) capacity that is put up for auction at a low price. Its main drawback is interruptions, and you should always be ready for them as something that will definitely happen. Interruptions are not an exceptional event for Spot Instances: they are part of the normal workflow, not an incident.

There are three main reasons for interruptions:

The current price is higher than what you are willing to pay.
Spot capacity of the instance type you are using has run out.
Additional constraints that you set yourself. For example, you first allowed all Availability Zones but later left only one.

Interruption notices are delivered through the EC2 Metadata Service and EventBridge, so you can react to them programmatically. Keep in mind, though, that the AWS SLA for a single instance is only 90%, so you should expect interruptions on all instances, not just Spot ones.
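As a rough illustration of the metadata route, here is a minimal Python sketch that parses the interruption notice a Spot Instance can read from its metadata endpoint. The URL is the documented `spot/instance-action` path; the sample body, the timestamps, and the function name `seconds_until_interruption` are assumptions made for this example. Fetching the document is left out on purpose, since it only works from inside an instance and IMDSv2 additionally requires a session token.

```python
import json
from datetime import datetime, timezone

# Documented metadata path; it answers 404 until an interruption notice exists.
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(body: str, now: datetime) -> float:
    """Return how many seconds remain before the action named in the notice."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()


# Hypothetical notice body and clock, used only to exercise the parser.
sample_body = '{"action": "terminate", "time": "2020-12-24T10:02:00Z"}'
now = datetime(2020, 12, 24, 10, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample_body, now)
print(f"Interruption in {remaining:.0f} seconds")
```

A small agent on the instance could poll the endpoint every few seconds and start draining work as soon as a notice appears.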

Simple recommendations for dealing with interruptions

Add stability and elasticity

To make your Spot Instances more stable and resilient, use an Auto Scaling Group. Then, when an interruption happens, you can run some cleanup actions or, at the very least, get a new instance to replace the old one.

Increase launch chance

By default, AWS itself picks from several instance types. If you create an Auto Scaling Group manually, you can provide a complete list of acceptable instance types, so that when one type is unavailable another can be used. In some cases that can even include previous generations.

AWS publishes statistics for all instance types, with the interruption probability of each type in each region. It is therefore best to start choosing an instance type with the Spot Instance Advisor service. It shows which types are the better choice for each specific region. On the same page you can quickly sort them, for example by which ones were least prone to interruptions over the last month.

Spot Instance Distribution Strategies

There are several allocation strategies. They determine which Spot pool (a group of unused EC2 instances that share an Availability Zone and an instance type) a Spot Instance is launched from.

Default strategy

The default allocation strategy is Capacity optimized. With it you get Spot Instances from the Spot pool with the most available capacity. However, when combined with an Auto Scaling Group, only one type per Availability Zone is selected: the one least likely to be interrupted.

This strategy works very well for workloads that are expensive to interrupt, but if you want greater savings, there is an alternative.

Lowest price

This strategy selects Spot Instances from the cheapest Spot pools in each Availability Zone. You can even specify how many pools to draw from: the default is two, and the maximum is 20.

It also works well with an Auto Scaling Group and multiple instance types, which means you can end up with different instance types in different Availability Zones. This is convenient, but keep in mind that the lowest price implies the most interruptions, because you no longer have any hint about which Spot pool has the lowest interruption probability.
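The two strategies can be sketched as the `MixedInstancesPolicy` structure that the Auto Scaling API accepts. This is a minimal Python sketch, not a complete setup: the helper `mixed_instances_policy`, the launch template name `app-template`, and the instance types are assumptions for illustration, while the dictionary keys follow the Auto Scaling API. The optional `SpotMaxPrice` field is the price ceiling discussed later in this article.

```python
def mixed_instances_policy(template_name, instance_types,
                           strategy="capacity-optimized",
                           spot_pools=2, spot_max_price=None):
    """Build a MixedInstancesPolicy dict for an all-Spot Auto Scaling Group."""
    distribution = {
        "OnDemandBaseCapacity": 0,
        "OnDemandPercentageAboveBaseCapacity": 0,
        "SpotAllocationStrategy": strategy,
    }
    if strategy == "lowest-price":
        # The pool count only applies to lowest-price: default 2, maximum 20.
        if not 1 <= spot_pools <= 20:
            raise ValueError("spot_pools must be between 1 and 20")
        distribution["SpotInstancePools"] = spot_pools
    if spot_max_price is not None:
        # The API expects the maximum price as a string, e.g. "0.05".
        distribution["SpotMaxPrice"] = spot_max_price
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # One override per acceptable instance type.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": distribution,
    }


# Hypothetical template name and instance types.
policy = mixed_instances_policy(
    "app-template", ["t3.large", "t3a.large", "c5.large"],
    strategy="lowest-price", spot_pools=2,
)
print(policy["InstancesDistribution"])
```

The resulting dictionary could be passed as the `MixedInstancesPolicy` argument of `create_auto_scaling_group` on a boto3 `autoscaling` client.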

Resilience with Capacity rebalance

Since November 5, 2020, there has been another way to improve Spot resilience: the new EC2 Rebalance Recommendation signal. It may arrive earlier than the standard two-minute Spot Instance interruption notice. Together with it, EC2 Auto Scaling released a new feature, Capacity rebalance.

Let's look at it in more detail, because it made Spot Instances noticeably more stable. First, you receive the Capacity rebalance signal. The Auto Scaling Group (or your own third-party script that received the signal) then starts a new instance and takes it through all of its launch steps. After that, the instance that has an elevated interruption risk and received the signal begins its termination process.

This finally makes it possible to shut a Spot Instance down gracefully. You can use an Auto Scaling Group lifecycle hook to handle termination, or take the event from EventBridge and run a Lambda function with your own logic. For example, with containers on ECS or EKS, you can evict all Pods or ECS tasks from the instance and avoid interrupting the workload at all.
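The EventBridge route described above can be sketched as a small Lambda-style handler in Python. The two `detail-type` strings and the `instance-id` field are the ones EC2 emits for these events; the function names, the returned "plan" dictionary, and the sample instance ID are assumptions for illustration, and the actual draining call is deliberately left as a placeholder.

```python
REBALANCE = "EC2 Instance Rebalance Recommendation"
INTERRUPTION = "EC2 Spot Instance Interruption Warning"


def plan_for_event(event):
    """Map one EventBridge event (a plain dict) to a hypothetical action plan."""
    if event.get("source") != "aws.ec2":
        return None
    instance_id = event.get("detail", {}).get("instance-id")
    detail_type = event.get("detail-type")
    if detail_type == REBALANCE:
        # Early signal: there is still time to bring up a replacement first.
        return {"instance_id": instance_id, "step": "replace-then-drain"}
    if detail_type == INTERRUPTION:
        # Two-minute notice: drain immediately.
        return {"instance_id": instance_id, "step": "drain-now"}
    return None


def handler(event, context=None):
    """Lambda entry point; a real handler would drain the instance here."""
    plan = plan_for_event(event)
    if plan is not None:
        print(f"{plan['step']} for {plan['instance_id']}")
    return plan


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "source": "aws.ec2",
    "detail-type": REBALANCE,
    "detail": {"instance-id": "i-0123456789abcdef0"},
}
result = handler(sample_event)
```

Keeping the decision in a pure function makes it easy to test without AWS, while the side effects (cordoning a node, draining ECS tasks) stay in one place.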

Perhaps for some, these simple recommendations will already be enough, but let’s see how you can combine savings and high efficiency.

How much does a Spot Instance cost?

This is probably the most important question. There are a couple of tools for the answer:

Cost Explorer, familiar to everyone, with basic information that is available to all;
The Spot Instance data feed, an additional detailed report delivered to S3, which is enabled in the settings.

Let's first look at how Cost Explorer can be used. Group the data by Instance Type, filter by the Usage Type Group "EC2 Running Hours", and additionally set the important Purchase Option filter to Spot. The result is two graphs: Costs and Usage.

In theory, dividing one by the other gives the average hourly price for each day. If you add filters by tags, you can get a general picture of what is happening. But when the results contain many different instance types, this becomes quite difficult. In addition, there is no hourly breakdown by default. It can be enabled in Cost Explorer, but it is not really needed for this task, and it costs money.
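The division mentioned above can be sketched in a few lines of Python. The daily cost and usage figures below are invented for illustration and do not come from a real Cost Explorer export; `Decimal` is used only to keep the arithmetic exact.

```python
from decimal import Decimal

# Hypothetical daily totals for one instance type: cost in USD, usage in hours.
daily_cost = {"2020-12-01": Decimal("4.80"), "2020-12-02": Decimal("7.20")}
daily_hours = {"2020-12-01": Decimal("48"), "2020-12-02": Decimal("48")}


def average_hourly_price(cost, hours):
    """Divide each day's cost by that day's running hours."""
    return {day: cost[day] / hours[day] for day in cost if hours.get(day)}


averages = average_hourly_price(daily_cost, daily_hours)
for day, price in sorted(averages.items()):
    print(day, price)
```

With several instance types mixed into one series, the same division only yields a blended average, which is exactly why the result gets hard to interpret.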

It is therefore better to look at the Spot Instance data feed. Data arrives in it every hour, but with a delay. Keep in mind that, according to tests, the delay can range from 15 minutes to several hours.

The most important thing the Spot Instance data feed provides is detailed information: the instance type, the maximum price, the current market price and, finally, the amount you actually paid (the Charge column). The only peculiarity of the setup is a slightly custom ACL for the S3 bucket, but that is easy to solve.

Spot Instance data feed settings

The current price is called the Spot Price (the Charge column). It is set per Availability Zone and instance type, based on Amazon's supply and demand, and is adjusted gradually according to long-term forecasts. Remember that this price changes smoothly, both up and down. It will not jump within an hour, and if it was $1 today, it will not be $3 tomorrow.

This is where the important but lesser-known Spot Max Price setting comes in. It is the maximum price you are willing to pay for a Spot Instance. As soon as the Spot price rises above it, the instance is interrupted and you simply stop paying. Spot Max Price is often left unconfigured, because it used to be missing from the UI and had to be set through the CLI (it has since been added to the Spot request).

Setting Spot Max Price is very important. If you do not specify a value, you automatically agree to pay any price up to the On-Demand price, which is always the ceiling. As an example, consider the price dynamics during the December holidays for three different instance types.

At first the most cost-effective type is t3a, but because it is about 10% cheaper on average, demand for it starts to grow. It is quickly bought up, and at some point t3 becomes the better deal. When t3 is bought up in turn, c5 ends up being the cheapest, which is not what you would expect.

Keep in mind that on the Spot market, "a"-type instances (such as t3a) are often more expensive than their regular counterparts. Most often this is due to their popularity: they are cheaper On-Demand, so they are bought more often.

So what should you do? Give up on Spot Instances? Not so fast. There is a solution, and a Terraform module will help you solve most of these problems without constantly studying the Spot price history.

Terraform module

The module solves the problem of automating the calculation of Spot Max Price. It has several behaviors that can be useful:

spot_price_current_min – at least one Instance Type in at least one AZ;
spot_price_current_optimal – at least one Instance Type in all AZs;
spot_price_current_max – all Instance Types in all AZs;
spot_price_current_max_mod – all Instance Types in all AZs with increased reliability.
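One way to read these four behaviors is as simple aggregations over a table of current prices per instance type and Availability Zone. The Python sketch below is an interpretation of the descriptions above, not the module's source code. The price table is hypothetical: only the values 0.20 for r5 in 1a, 0.10 for r5a in 1b, 0.20 for r5 in 1c, and the overall maximum of 0.30 are taken from this article; the remaining numbers and the third type `r4` are made up to fill the table.

```python
from decimal import Decimal

D = Decimal
# Hypothetical current Spot prices: PRICES[instance_type][availability_zone].
PRICES = {
    "r5":  {"1a": D("0.20"), "1b": D("0.25"), "1c": D("0.20")},
    "r5a": {"1a": D("0.15"), "1b": D("0.10"), "1c": D("0.30")},
    "r4":  {"1a": D("0.25"), "1b": D("0.20"), "1c": D("0.25")},
}


def all_prices(prices):
    return [p for by_az in prices.values() for p in by_az.values()]


def price_min(prices):
    """At least one type in at least one AZ: the cheapest cell of the table."""
    return min(all_prices(prices))


def price_optimal(prices):
    """At least one type in every AZ: the highest of the per-AZ minimums."""
    zones = {az for by_az in prices.values() for az in by_az}
    return max(
        min(by_az[az] for by_az in prices.values() if az in by_az)
        for az in zones
    )


def price_max(prices):
    """Every type in every AZ: the most expensive cell of the table."""
    return max(all_prices(prices))


def price_max_mod(prices, modifier=D("1.10")):
    """The maximum price plus a safety margin (10% here, an assumed default)."""
    return price_max(prices) * modifier


for name, fn in [("min", price_min), ("optimal", price_optimal),
                 ("max", price_max), ("max_mod", price_max_mod)]:
    print(name, fn(PRICES))
```

Each result is a candidate Spot Max Price: the higher it is, the more Spot pools stay usable and the less often a small price move causes an interruption.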

Let's go through them from simple to complex, using a price matrix as the example.

First, the simplest example: getting the current spot_price without opening the UI or the Spot price history.

It is enough to pass just one Availability Zone and one instance type to get a result. The answer here is 0.20, because we passed only r5 and only AZ 1a.
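Outside Terraform, the same single lookup can be sketched in Python against the EC2 `describe_spot_price_history` API. This is a minimal sketch under stated assumptions: the function `current_spot_price`, the stub client, the instance type `r5.large`, and the zone `us-east-1a` are hypothetical, and the stub stands in for `boto3.client("ec2")` so that the example runs without AWS credentials.

```python
from datetime import datetime, timezone
from decimal import Decimal


def current_spot_price(ec2_client, instance_type, availability_zone,
                       product="Linux/UNIX"):
    """Return the Spot price in effect right now for one type in one AZ."""
    response = ec2_client.describe_spot_price_history(
        InstanceTypes=[instance_type],
        AvailabilityZone=availability_zone,
        ProductDescriptions=[product],
        # With StartTime set to now, the API returns the price currently in effect.
        StartTime=datetime.now(timezone.utc),
    )
    history = response["SpotPriceHistory"]
    if not history:
        raise LookupError(f"no Spot price for {instance_type} in {availability_zone}")
    latest = max(history, key=lambda item: item["Timestamp"])
    return Decimal(latest["SpotPrice"])  # the API returns the price as a string


class StubEC2Client:
    """Stand-in for boto3.client('ec2') that returns one canned record."""

    def describe_spot_price_history(self, **kwargs):
        return {"SpotPriceHistory": [{
            "AvailabilityZone": kwargs["AvailabilityZone"],
            "InstanceType": kwargs["InstanceTypes"][0],
            "ProductDescription": kwargs["ProductDescriptions"][0],
            "SpotPrice": "0.200000",
            "Timestamp": datetime(2020, 12, 24, 9, 0, tzinfo=timezone.utc),
        }]}


price = current_spot_price(StubEC2Client(), "r5.large", "us-east-1a")
print(price)
```

In real use the stub would be replaced by a boto3 EC2 client created for the region that contains the Availability Zone.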

Minimum price

A slightly more complex case is when your goal is to run the cheapest instance possible. For that, use the spot_price_current_min behavior.

In this case, instances of at least one Instance Type will be launched in one Availability Zone. Even if you pass several Availability Zones, all instances end up in the same zone, because only one is selected. Remember that the cheapest instance always exists in only one place.

Looking at our table, there is exactly one such option: type r5a in Availability Zone 1b, with the minimum price of 0.10. As prices change, you get the lowest price every time. For example, if the price starts to rise in one Availability Zone, the next time you run terraform apply you will get the next minimum price, possibly from another Availability Zone. This behavior naturally causes the most frequent interruptions: if the price rises even slightly, you immediately lose the instance.

Therefore, this behavior is not very effective, and it is better to look for the best balance between price and availability.

Optimal price

This behavior makes it possible to run at least one Instance Type in every Availability Zone, which frees you from deciding which Instance Type is the better deal: t3, t3a, c5, or one of the r family. When prices change, you switch to the most cost-effective instance types for each Availability Zone, and if prices go up, you lose Availability Zones one at a time, just as if you had disabled them. Returning to the table, we now see many more options.

In Availability Zone 1c, only r5 instances will be available, at 0.20. If their price rises to, say, 0.30, you will still have two more Availability Zones and two other instance types. Note that such a switch interrupts the instances whose price has gone above the limit. But in the world of Spot you always need to be ready for interruptions as a fact of life.

All the previous options aim for the highest savings. If you need more stability, the spot_price_current_max behavior is the one to use.

Maximum current price

This behavior can run all instance types in all Availability Zones. By setting a price that covers the different types, it solves the problem of running out of Spot capacity. It also tolerates long gaps between terraform apply runs, so they can be run less often. This is especially convenient if the module is run manually rather than in CI/CD.

Judging by the table, any Instance Type can be launched in any AZ in this case, because none of them exceeds the maximum price. This means that if you run r5 in 1a, which costs 0.20, you keep the instance while its price rises, up to and including 0.30. This behavior makes you independent of instance types and Availability Zones. If you add many instance types, you can cover practically every case and always have Spot capacity.

Modified Price

If you need even more stability and are sometimes willing to pay 5-10% extra for it, there is the spot_price_current_max_mod behavior.

This behavior reduces the chance of being interrupted by minor price fluctuations, and it helps when Terraform is run very infrequently or manually. You state in advance that you are willing to pay, for example, 10% above the current price, that is, 0.33 instead of 0.30. It is a small overpayment in exchange for fewer interruptions.

Remember, however, that the overpayment depends heavily on the number of instances you run. A difference of 2 cents per hour can add up to about $14 per month for each EC2 instance, and with 100 instances that becomes about $1,400. So work out whether the lower probability of interruptions is worth it. If you decide it is, the custom price modifier lets you apply this behavior to all the other scenarios as well.
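The arithmetic behind these figures is easy to check. The Python sketch below assumes a 30-day month and instances that run around the clock; both are assumptions, as is the helper name `monthly_overpayment`.

```python
from decimal import Decimal

HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, running 24/7


def monthly_overpayment(extra_per_hour, instances=1):
    """Extra monthly cost of paying `extra_per_hour` more on each instance."""
    return extra_per_hour * HOURS_PER_MONTH * instances


per_instance = monthly_overpayment(Decimal("0.02"))
fleet = monthly_overpayment(Decimal("0.02"), instances=100)
print(per_instance, fleet)
```

Under these assumptions the exact values are $14.40 per instance and $1,440 for a fleet of 100, in line with the rounded figures above.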

Application areas of EC2 Spot Price module

The most basic problem the Terraform module solves is keeping the price from floating all the way up to the ceiling, so that your costs do not double. It can therefore be used wherever an Auto Scaling Group is used.

ECS capacity providers, EKS worker nodes, GitLab runners, build machines, auxiliary monitoring tools, DevTest environments, and any other workload that can tolerate interruption can all run entirely on Spot. If you set the ECS_ENABLE_SPOT_INSTANCE_DRAINING flag in ECS, or use the AWS Node Termination Handler for EKS, you can even run production this way. There are examples of production running entirely on Spot.

A more typical setup, however, is to launch one or two On-Demand instances in case all Spot capacity is taken away, and to send everything that scales to Spot, because the load is not known in advance and nothing can be reserved ahead of time.