Breaking Through Terraform's Ceiling: A New Approach to IaC State Management

November 6, 2024 · 9 min read

The industry is at a crossroads with AI-led code generation. We can either let tech debt proliferate and dig ourselves into a hole, or use this opportunity to rethink our approach and build next-generation solutions that remove bottlenecks in cloud usage.

This post dives into how we can move from Terraform to next-generation solutions like Mantis, examining why we need to rethink our approach to infrastructure automation at scale.

Understanding Terraform and State Management

What is Terraform?

Terraform is an Infrastructure as Code (IaC) tool that allows teams to define and provision cloud infrastructure using declarative configuration files. It revolutionized infrastructure management by introducing:

Declarative infrastructure definitions
Resource dependency management
State tracking for infrastructure
Multi-cloud provider support

What is State Management?

State management in Terraform refers to how it tracks and manages the real-world infrastructure resources it creates. Think of it as a mapping between your code and actual cloud resources. This state:

Records what infrastructure exists
Tracks resource dependencies
Manages resource updates and deletions
Handles concurrent access to infrastructure

The State Management Ceiling

While this approach worked well for smaller teams and simpler infrastructures, it creates a ceiling when scaling because:

Single Source of Truth: All infrastructure state is stored in one or a few state files
Global State Lock: Only one operation at a time can modify the infrastructure tracked by a state file
All-or-Nothing Updates: Changes are applied as a single transaction
State File Size: As infrastructure grows, state files become unwieldy
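To make the locking point concrete, here is a toy model in Python. It is a sketch under simplifying assumptions, not a description of how Terraform implements locking: every pending change takes one "round", a shared state file allows one change per round, and per-component state only serializes changes that touch the same component. The function name `rounds_needed` and the team and component names are hypothetical.

```python
from collections import Counter


def rounds_needed(operations, per_component_state):
    """Count the sequential rounds needed to run all pending operations.

    operations: list of (team, component) pairs, one pending change each.
    With one shared state file, a single lock serializes every change.
    With per-component state, only changes to the same component queue up.
    """
    if not operations:
        return 0
    if not per_component_state:
        # One lock for everything: each change waits for the previous one.
        return len(operations)
    # One lock per component: the busiest component sets the pace.
    per_component = Counter(component for _team, component in operations)
    return max(per_component.values())


pending = [
    ("platform", "vpc"),
    ("payments", "database"),
    ("search", "cache"),
    ("payments", "vpc"),
]

print(rounds_needed(pending, per_component_state=False))  # 4
print(rounds_needed(pending, per_component_state=True))   # 2
```

In this model the shared state file forces four teams' changes into four rounds, while splitting state by component leaves only the two VPC changes waiting on each other.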

The Growing Gap

This creates fundamental conflicts between the assumptions Terraform was built on and what teams need today.

Hidden Assumptions and Today's Reality

Terraform's Original Assumptions

When Terraform was designed (circa 2014), it made several assumptions:

Teams were smaller and more centralized
Infrastructure changes were less frequent
Cloud architectures were simpler
Microservices weren't as prevalent; Kubernetes grew up alongside Terraform, but it was not yet the norm
CI/CD wasn't as sophisticated; shipping once or twice a week was acceptable, and cloud governance was still in its infancy

Today's Infrastructure Needs

The landscape has dramatically changed:

Scale: Organizations manage thousands of resources
Speed: Multiple teams need to deploy simultaneously
Complexity: Microservices require intricate dependencies
Distribution: Teams are global and work asynchronously
Automation: CI/CD requires more granular control
Governance: Cloud spend and security are now high priorities in developer workflows

Needs for next-gen IaC in the coming years

AI-led code generation: AI will accelerate code generation, and pressure will increase on IaC tools to keep up.
Infrastructure will be a competitive advantage: Companies that build on solid, well-organized infrastructure will be able to deploy, innovate, and scale faster than those that don't.
Change management will be key: As infrastructure becomes more complex, we need a good handle on what changed and what didn't. Keeping the house in order at all times will matter more.

The Ideal Solution

A modern IaC solution should provide:

Granular State Management: Independent state for different components
Parallel Operations: Multiple teams working simultaneously
Clear Dependencies: Explicit and manageable resource relationships
Scalable Performance: No degradation with infrastructure growth
Strong Type Safety: Catch errors before deployment
Task-Based Workflows: Clear operation ordering and dependencies

How Mantis Addresses These Needs

State Management that scales

Let's take an example to illustrate the difference between how Terraform and Mantis approach state management.

Terraform Approach:

main.tf
# Single state file managing all resources
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "allow_http" {
  vpc_id = aws_vpc.main.id
  # ... security group rules
}

Terraform Challenges: State files grow in size and become difficult to manage due to Terraform's single entry point approach (main.tf). While its declarative approach promises simplicity, trying to sync everything in main.tf with a single state file creates several critical problems at scale:

Monolithic State Management: All resources are tracked in one state file, making it unwieldy and hard to reason about
Limited Update Control: Partial updates become difficult to manage safely
Risk of Unintended Changes: Without granular control over resources and desired state, accidental deletions or modifications become more likely. For example, when replacing resource A with resource D in a Terraform script, resource A will be deleted and D will be created. While this is the expected behavior, such changes can be easily missed during code reviews of large scripts, potentially leading to unintended resource deletions. This sensitivity is a well-known challenge with Terraform, as documented in various community discussions and articles (see References section below).
Complex Dependencies: Resource relationships become harder to track and maintain as the state file grows
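The replacement scenario above (resource A swapped for resource D) can be sketched as a toy reconciliation step. This is not Terraform's planner, only a minimal illustration of how a declarative diff turns "A is gone from the code" into a delete. The function `plan` and the resource names and attributes are hypothetical.

```python
def plan(state, desired):
    """Return the (action, name) pairs needed to move state to desired.

    state:   resources that currently exist, keyed by name.
    desired: resources declared in the configuration, keyed by name.
    """
    actions = []
    # Anything tracked in state but no longer declared gets deleted.
    for name in sorted(state.keys() - desired.keys()):
        actions.append(("delete", name))
    # Anything declared but not yet tracked gets created.
    for name in sorted(desired.keys() - state.keys()):
        actions.append(("create", name))
    # Anything present in both with different attributes gets updated.
    for name in sorted(state.keys() & desired.keys()):
        if state[name] != desired[name]:
            actions.append(("update", name))
    return actions


state = {
    "A": {"size": "small"},
    "B": {"size": "large"},
    "C": {"size": "large"},
}
# The author replaced A with D in the configuration and left B and C alone.
desired = {
    "B": {"size": "large"},
    "C": {"size": "large"},
    "D": {"size": "small"},
}

print(plan(state, desired))  # [('delete', 'A'), ('create', 'D')]
```

The delete of A never appears in the configuration itself; it only shows up in the computed plan, which is why it is easy to overlook when reviewing a large change.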

Mantis Approach:

vpc_setup.cue
vpc_setup: {
    @flow("vpc_setup")
    
    create_vpc: {
        @task(mantis.core.TF)
        config: {
            resource: aws_vpc: main: {
                cidr_block: "10.0.0.0/16"
            }
        }
    }
    
    create_subnet: {
        @task(mantis.core.TF)
        dep: create_vpc
        config: {
            resource: aws_subnet: public: {
                vpc_id: @var(vpc_id)
                cidr_block: "10.0.1.0/24"
            }
        }
    }
}

This approach provides several key benefits:

Independent State Management: Each task maintains its own state file, reducing complexity and making operations more predictable. This prevents accidental resource deletions since changes are explicitly scoped to individual tasks.
Clear Dependencies and Safety: Task dependencies are explicitly defined, making the workflow easier to understand and debug. When replacing resources (e.g., resource A with resource B), deletions are intentional and clearly visible in the task definition.
Enhanced Scalability: By splitting state files per task, the system:
- Reduces contention on state files
- Enables parallel operations across different tasks
- Makes the workflow more maintainable at scale
- Simplifies debugging and review processes
Better Error Isolation: Issues are contained within individual tasks, making it easier to identify and fix problems without affecting the entire infrastructure.
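The dependency ordering and error isolation described above can be sketched with Python's standard `graphlib` module. This is an illustration of the general idea, not Mantis's execution engine: `run_flow`, the task names beyond those in the example above, and the simulated failure are all hypothetical, and it assumes every dependency is itself listed as a task.

```python
from graphlib import TopologicalSorter


def run_flow(tasks, run_task):
    """Run tasks in dependency order and record a status per task.

    tasks:    dict mapping each task name to the list of tasks it depends on.
    run_task: callable taking a task name and returning True on success.
    A task is skipped when any of its dependencies did not succeed.
    """
    status = {}
    # static_order() yields every task after all of its dependencies.
    for name in TopologicalSorter(tasks).static_order():
        deps = tasks.get(name, [])
        if any(status[dep] != "ok" for dep in deps):
            status[name] = "skipped"
            continue
        status[name] = "ok" if run_task(name) else "failed"
    return status


flow = {
    "create_vpc": [],
    "create_subnet": ["create_vpc"],
    "create_security_group": ["create_vpc"],
    "deploy_app": ["create_subnet"],
}


def fake_runner(name):
    # Simulate a failure in one task to show how far the damage spreads.
    return name != "create_subnet"


result = run_flow(flow, fake_runner)
for name, outcome in sorted(result.items()):
    print(name, outcome)
```

In this run the subnet failure only blocks `deploy_app`, which depends on it; the security group task still completes because its own dependency succeeded.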

The rsync analogy:

For folks familiar with good old rsync, I'd like to use it to explain the difference between the two approaches.

rsync, or remote synchronization, is a software utility for Unix-like systems that efficiently syncs files and directories between two hosts or machines. It takes a source directory and a destination directory, which may be on a remote host, and syncs them via secure or insecure protocols.

rsync -av source_directory destination_directory

The rsync command works well when you're syncing a single folder. When you're syncing a large number of folders, though, you don't have control over the order of syncing or a way to track intermediate states. Both matter at scale, because these operations can be time-consuming and may fail, and you'd want to be able to retry the failed ones.

Terraform's approach is like syncing a single folder with a large number of sub-folders and letting rsync figure out the order of syncing and the intermediate states.

Mantis's approach is to write a script that iterates over the list of folders and runs rsync on each of them as specified by the author. This gives the author full control over the order of syncing and the intermediate states.
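Staying with the analogy, such a script might look like the sketch below. It is only an illustration of the "iterate, track, retry" idea, not anything Mantis ships: `sync_in_order`, the folder names, and the retry limit are hypothetical, and the demo uses a fake sync function so it runs without rsync installed.

```python
import subprocess


def rsync_folder(source, destination):
    """Sync one folder with `rsync -av`; return True when rsync exits with 0."""
    result = subprocess.run(["rsync", "-av", source, destination])
    return result.returncode == 0


def sync_in_order(folders, destination, sync=rsync_folder, max_attempts=3):
    """Sync folders one by one, in the order the author listed them.

    Returns a dict mapping each attempted folder to (succeeded, attempts).
    Stops at the first folder that still fails after max_attempts, so
    later folders are never synced on top of a broken intermediate state.
    """
    results = {}
    for folder in folders:
        attempts = 0
        succeeded = False
        while attempts < max_attempts and not succeeded:
            attempts += 1
            succeeded = sync(folder, destination)
        results[folder] = (succeeded, attempts)
        if not succeeded:
            break
    return results


# Demo with a fake sync function: "configs" fails once, then succeeds.
failures_left = {"configs": 1}


def flaky_sync(source, destination):
    if failures_left.get(source, 0) > 0:
        failures_left[source] -= 1
        return False
    return True


outcome = sync_in_order(["base", "configs", "apps"], "/backup", sync=flaky_sync)
print(outcome)
# {'base': (True, 1), 'configs': (True, 2), 'apps': (True, 1)}
```

The per-folder results are the "intermediate state" from the analogy: the author can see exactly which step failed, how many attempts it took, and where a rerun should resume.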

Now, in order to fully achieve its mission and be a 10x improvement over Terraform, Mantis also addresses other challenges, and we'll do separate posts on them in the coming weeks:

Improved Readability with Make-Style Build Systems

Terraform Challenge: Configurations are complex and make it difficult to understand the deployment flow and responsibilities.

Mantis Solution: Task-Oriented Workflow Engine

setup_vpc.tf.cue
vpc_setup: {
    // High-level VPC setup flow with annotated tasks
    @flow("vpc_setup")

    // Task 1: Create VPC
    create_vpc: {
        @task(mantis.core.TF)
        description: "Create and configure VPC"
        config: {
            resource: {
                aws_vpc: main: {
                    cidr_block: "10.0.0.0/16"
                    tags: {
                        Name: "mantis-production-vpc"
                        Environment: "production"
                    }
                }
            }
        }
        exports: {
            vpc_id: "aws_vpc.main.id"
            subnet_ids: []
        }
    }

    // Task 2: Set up networking
    setup_networking: { }

    // Task 3: Configure security
    security_setup: { }

    // Task 4: Deploy monitoring
    monitoring: { }
}

First-Class Package Management and Dependencies

Terraform Challenge: Limited module system makes it difficult to manage dependencies and share code.

Mantis Solution: First-Class Package System using CUE modules

cue.mod/module.cue
module: "augur.ai/vpc-setup"
language: {
    version: "v0.10.0"
}

// Define the dependencies for the project
dependencies: [
    "augur.ai/org/base-networking",
    "augur.ai/org/security-baseline",
]

network.cue
package network

import (
    base "augur.ai/org/base-networking"
    security "augur.ai/org/security-baseline"
)

// Define network configuration using base templates
network_config: {
    base.NetworkingConfig & {
        ...
    }
}

Making the Transition to Mantis

The evolution from Terraform to Mantis is more than a tool change; it is a fundamental shift in how we approach infrastructure automation at scale. By leveraging CUE and building upon OpenTofu, Mantis provides a powerful yet familiar platform that addresses the key challenges of modern infrastructure management.

Mantis offers:

Improved readability and maintainability through task-oriented workflows
Better governance and control mechanisms built into the core platform
Advanced modularity and dependency management
Flexible execution models that support complex deployment patterns

For organizations considering the transition:

Start with new projects or non-critical workloads
Use Mantis's Terraform import capabilities to gradually migrate existing infrastructure
Consider a phased approach, moving team by team or application by application

The future of infrastructure automation demands tools that can handle the complexity of modern cloud environments while supporting the needs of large, distributed teams. Mantis represents the next generation of IaC tools, designed specifically to meet these challenges head-on.

References

Links that discuss mitigation strategies for accidental deletion in Terraform:
