What Is Terraform Drift And How To Avoid It?


Drift is the term for when the real-world state of your infrastructure differs from the state defined in your configuration.

Have you ever been about to finish washing the dishes when someone suddenly puts a dirty glass and some pots from the last meal in front of you? You clean the glass and the pots, and then the scene repeats itself over and over again. Resolving Terraform drift is like washing a huge pile of dishes that never ends. In a real environment, at any scale, mitigating drift is not an easy task.

Let’s take a look at how we can avoid or reduce drift in our Terraform configurations.

Manage All Changes With Terraform

HashiCorp’s blog post Change Management At Scale: How Terraform Helps End Out-of-Band Anti-Patterns recommends working toward having all changes go through Terraform. By locking down the cloud accounts, you ensure that changes are applied only through Terraform and not through other means. This also aligns with security principles such as the Principle of Least Privilege, where a user is given the minimum level of access or permissions required to perform an action or job function.

This is not always possible, especially when there is no agreement or policy to lock down the cloud accounts where Terraform manages the infrastructure. In such cases, it is important to establish a drift detection system that fires alarms when the infrastructure is not in sync with the Terraform state. I really liked this YouTube video where Michael Simo talks about the topic: Terraform Drift Detection and Reporting.
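As a minimal sketch of such a detection check, the Python snippet below runs `terraform plan -detailed-exitcode` and maps the exit code to a status. With that flag, Terraform exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The function name `check_drift`, the status strings, and the injectable `run` parameter are assumptions made for illustration, not part of any Terraform tooling. Note that exit code 2 only signals drift if the configuration in the working directory has already been applied; otherwise it may simply reflect unapplied code changes.

```python
import subprocess
from typing import Callable

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {
    0: "in_sync",          # plan succeeded, nothing to change
    1: "error",            # the plan itself failed
    2: "changes_pending",  # plan succeeded and found differences
}


def check_drift(
    workdir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run a plan in `workdir` and map its exit code to a status string.

    `run` is injectable (hypothetical design choice) so the function can be
    exercised without a Terraform installation.
    """
    cmd = [
        "terraform", "plan",
        "-detailed-exitcode",  # makes the exit code reflect the diff
        "-input=false",        # never prompt in a scheduled job
        "-no-color",           # keep captured logs readable
    ]
    result = run(cmd, cwd=workdir, capture_output=True, text=True)
    # Treat any unexpected exit code as an error rather than guessing.
    return PLAN_STATUS.get(result.returncode, "error")
```

A scheduled job could call `check_drift(".")` and raise an alert whenever the result is anything other than `"in_sync"`.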

Bring Automation

AWS says it in its Reliability Pillar:

Although conventional wisdom suggests that you keep humans in the loop for the most difficult operational procedures, we suggest that you automate the most difficult procedures for that very reason.

Changes to your infrastructure should be made using automation. The changes that need to be managed include changes to the automation, which then can be tracked and reviewed.

HashiCorp has published a guide, Running Terraform in Automation, which is worth reading if you want to automate your infrastructure.
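The core of that workflow is a non-interactive init, plan, and apply sequence in which the apply step consumes the saved plan file. The Python sketch below shows one way to drive it from a pipeline script. The function names, the `tfplan` file name, and the injectable `run` parameter are assumptions for illustration; a real pipeline would normally also insert a review or approval step between plan and apply.

```python
import os
import subprocess
from typing import Callable, List


def automation_commands(plan_file: str = "tfplan") -> List[List[str]]:
    """Return the non-interactive init/plan/apply command sequence."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        # Applying the saved plan file means the reviewed plan is what runs.
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run_pipeline(
    workdir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run each step in order and stop at the first failure."""
    # TF_IN_AUTOMATION tells Terraform to adjust its output for automation.
    env = dict(os.environ, TF_IN_AUTOMATION="1")
    for cmd in automation_commands():
        if run(cmd, cwd=workdir, env=env).returncode != 0:
            return False
    return True
```

Because every change goes through the same three commands, the pipeline itself becomes something that can be tracked and reviewed like any other code.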

Visit the following links to learn more about the process:

Better Terraform Modules

With Terraform modules, you can create reusable pieces of infrastructure that can be used across multiple Terraform runs without repeating the same code. The end result is a more maintainable infrastructure. Infrastructure as Code (IaC) is a great way to manage your infrastructure, and since all workloads are defined in code, the same rules that apply to any other code can be used to improve its quality. Less code that is easier to maintain also takes less time to deploy, which reduces toil and error and, in turn, the amount of drift in the infrastructure.

Keep in mind the following general principles:

Track Origin and Adjust the Configuration

Knowing where the changes come from helps determine the best course of action. There are three origins of change:

Caused By Changes in the Code

This is the most common type of change, and it is the one you should be able to track. It often affects teams that do not leverage automation, apply changes without a centralized pipeline, or do not have a clear deployment flow.

Caused By the Context

The cloud itself changes, sometimes for good reasons and sometimes for bad ones. These kinds of changes are not always easy to track, since they are usually outside the scope of the code and the cloud provider is responsible for them.

Caused By Manual Changes Directly in the Infrastructure

This is a real burden. It is not always possible to track changes made directly to the infrastructure, or at least it is time-consuming. Applying changes directly might seem like the easiest option for one-off tasks, but for recurring operations it consumes valuable engineering time and makes changes difficult to track and manage.
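One way to see where out-of-band changes landed is to inspect the JSON form of a plan, produced with `terraform show -json` on a saved plan file. In recent Terraform versions that output includes a `resource_drift` list describing changes detected outside of Terraform. The Python sketch below summarizes which attributes drifted per resource. The function names and the `tolerated` parameter are hypothetical, and the sketch only compares top-level attributes, so treat it as a starting point rather than a complete report.

```python
import json
from typing import Dict, Iterable, List, Optional, Set


def changed_attributes(before: Optional[dict], after: Optional[dict]) -> Set[str]:
    """Return the top-level attribute names whose values differ."""
    before = before or {}
    after = after or {}
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def summarize_drift(
    plan_json: str,
    tolerated: Iterable[str] = (),
) -> Dict[str, List[str]]:
    """Map each drifted resource address to its drifted attributes.

    `tolerated` (a hypothetical parameter) lists attribute names that the
    team has agreed to ignore, such as tags managed by another tool.
    """
    plan = json.loads(plan_json)
    ignore = set(tolerated)
    report: Dict[str, List[str]] = {}
    for item in plan.get("resource_drift", []):
        change = item.get("change", {})
        attrs = changed_attributes(change.get("before"), change.get("after")) - ignore
        if attrs:  # skip resources whose only drift is tolerated
            report[item["address"]] = sorted(attrs)
    return report
```

Resources that appear in the report were changed outside of Terraform; anything filtered out by `tolerated` is a candidate for being ignored in the configuration itself.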

Immutable Infrastructure

By definition, an immutable infrastructure is one that cannot be changed. Even though it is hard to achieve, the reward is worth the effort.

HashiCorp co-founder and CTO Armon Dadgar explains the differences and tradeoffs between immutable and mutable infrastructure really well in this video: What is mutable vs. immutable infrastructure?

TL;DR

At an organizational level, aim to define the roles and permissions of the individuals who use Terraform, and manage all changes with it in an automated way with a review process. Ease the toil and reduce the code base by using Terraform modules, and keep looking for ways to improve them, as with any other codebase in the organization. It is important to understand where drift is coming from and adjust the configuration to avoid it or to ignore tolerable changes. And finally, aim for immutable infrastructure whenever possible.