Breaking Through Terraform's Ceiling: A New Approach to IaC State Management

Terraform has been the de facto standard for infrastructure as code for a long time. With its vast ecosystem of 3000+ providers, it revolutionized how organizations approach infrastructure management. However, as tech stacks grow in size and complexity, and with remote and distributed teams becoming the norm, we hit the limits of Terraform.

This post dives specifically into how we can move from Terraform to next-generation solutions like Mantis, examining why we need to rethink our approach to infrastructure automation at scale.

Understanding Terraform and State Management

Terraform's declarative approach transformed infrastructure provisioning from manual, error-prone processes into clean, version-controlled code. Key innovations included:

  • Declarative infrastructure definitions that enable teams to specify desired end-states rather than step-by-step procedures
  • Resource dependency management
  • State tracking for infrastructure
  • Multi-cloud provider support

What is State Management?

State management in Terraform refers to how it tracks and manages the real-world infrastructure resources it creates. Think of it as a mapping between your code and actual cloud resources. This state:

  • Records what infrastructure exists
  • Tracks resource dependencies
  • Manages resource updates and deletions
  • Coordinates concurrent access to infrastructure through state locking
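
To make the mapping idea concrete, here is a minimal Python sketch. It is a toy model, not Terraform's actual state format or planning algorithm: the `plan` function, the resource addresses, and the IDs are all made up for illustration. It only compares the addresses declared in code against those recorded in state.

```python
# Toy model of IaC state: resource address in code -> ID of the real resource.
# Illustration only; this is not Terraform's state format or planning algorithm.

def plan(declared: set[str], state: dict[str, str]) -> dict[str, list[str]]:
    """Compare declared resource addresses against recorded state."""
    recorded = set(state)
    return {
        "create": sorted(declared - recorded),  # in code, not yet in state
        "delete": sorted(recorded - declared),  # in state, no longer in code
        "keep": sorted(declared & recorded),    # present in both
    }

# Hypothetical state recorded after an earlier apply.
state = {
    "aws_vpc.main": "vpc-0a1b2c",
    "aws_subnet.public": "subnet-0d4e5f",
}

# The code now drops the subnet and declares a security group instead.
declared = {"aws_vpc.main", "aws_security_group.allow_http"}

result = plan(declared, state)
print(result)
```

Even in this simplified form, removing a resource from the code shows up as a deletion in the plan, which is why the state has to be accurate and carefully reviewed.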

The State Management Ceiling

While this approach worked well for smaller teams and simpler infrastructures, it creates a ceiling when scaling because:

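  • Global State Lock: Only one operation can modify a given state at a time, so a single large state serializes every change
  • All-or-Nothing Updates: Changes are planned and applied as one run over the whole state rather than as smaller, independently tracked steps
  • State File Size: As infrastructure grows, state files become unwieldy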
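
The cost of a single lock can be estimated with some back-of-the-envelope arithmetic. The sketch below is a simplified model with invented numbers, not a measurement. It assumes the components are independent and that there are enough runners to work on all of them at once.

```python
from collections import defaultdict

# Hypothetical pending changes: (component, minutes the apply holds the lock).
pending = [
    ("networking", 10),
    ("database", 15),
    ("networking", 5),
    ("monitoring", 5),
]

def wall_time_single_lock(ops):
    """One shared lock: every operation waits for the previous one."""
    return sum(minutes for _, minutes in ops)

def wall_time_per_component_lock(ops):
    """One lock per component: only operations on the same component queue up.

    Assumes the components are independent and there are enough runners to
    work on all of them at once.
    """
    per_component = defaultdict(int)
    for component, minutes in ops:
        per_component[component] += minutes
    return max(per_component.values(), default=0)

single = wall_time_single_lock(pending)
split = wall_time_per_component_lock(pending)
print(single, split)  # prints "35 15"
```

Under these assumptions the same four changes take 35 minutes behind one lock and 15 minutes with a lock per component. The gap widens as more teams queue up behind the same state.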

The Growing Gap

This creates fundamental conflicts:

[Figure: Terraform State Management Bottlenecks]

The Assumptions Don't Hold Anymore

When Terraform was designed (circa 2014), it made several assumptions:

  • Teams were smaller and more centralized
  • Infrastructure changes were less frequent
  • Cloud architectures were simpler
  • Microservices weren't as prevalent; Kubernetes grew up alongside Terraform but was not yet the norm
  • CI/CD wasn't as sophisticated; the industry was fine with shipping once or twice a week, and cloud governance was still in its infancy

Today's Infrastructure Needs

Terraform is the first-gen multi-cloud IaC tool, and it has done a great job at enabling teams to manage their infrastructure at scale. However, the landscape has dramatically changed, and Terraform's assumptions no longer hold true:

  • Scale: Organizations manage thousands, and sometimes tens of thousands, of resources
  • Complexity: Tech stacks are more complex, with microservices and Kubernetes becoming a big part of the stack
  • Distribution: Teams are global and work asynchronously
  • Automation: CI/CD requires more granular control
  • Governance: Cloud spend and security are now high priorities in developer workflows

The Ideal Solution

A modern multi-cloud IaC solution should provide:

  1. Granular State Management: Independent state for different components
  2. Unified tool chain across the stack: As Kubernetes has become the substrate for traditional microservices and ML workloads, the lines between cloud infra and app infra are blurring. A unified tool chain across the stack will enable a more integrated approach to infrastructure management.
  3. Strong Type Safety: Catch errors before deployment
  4. Built-in governance: Clear operation ordering and dependencies
  5. First-class package management: Manage dependencies and share code via OCI com

How Mantis Addresses These Needs

Task-based state management with independent state files

At the core of Mantis is the concept of a task: a self-contained unit of work. Each task has its own state file, and that state is managed independently.

The benefits of this approach are easiest to see in an example that compares how Terraform and Mantis handle state for the same setup.

Terraform Approach:

main.tf
# Single state file managing all resources
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "allow_http" {
  vpc_id = aws_vpc.main.id
  # ... security group rules
}

Terraform Challenges:

  • State files grow in size: A Terraform root module (typically entered through main.tf) is tracked in a single state file that grows as the infrastructure grows. While the declarative approach promises simplicity, syncing everything in that configuration against one state file creates critical problems at scale, including the two below.

  • Risk of Unintended Changes: Without granular control over resources and desired state, accidental deletions or modifications become more likely. For example, when resource A is replaced with resource D in a Terraform configuration, A is deleted and D is created. This is the expected behavior, but in a large configuration such a change is easy to miss during code review and can lead to unintended resource deletions. This sensitivity is a well-known challenge with Terraform, as documented in various community discussions and articles (see References section below).

  • Complex Dependencies: Resource relationships become harder to track and maintain as the state file grows

Mantis Approach:

vpc_setup.cue
vpc_setup: {
    @flow("vpc_setup")

    create_vpc: {
        @task(mantis.core.TF)
        config: {
            resource: aws_vpc: main: {
                cidr_block: "10.0.0.0/16"
            }
        }
    }

    create_subnet: {
        @task(mantis.core.TF)
        dep: create_vpc
        config: {
            resource: aws_subnet: public: {
                vpc_id: @var(vpc_id)
                cidr_block: "10.0.1.0/24"
            }
        }
    }
}

Mantis translates the CUE code into Terraform-compatible JSON and hence inherits all the good parts of Terraform, while addressing the challenges above through:

  • Independent State Management: Each task maintains its own state file, reducing complexity and making operations more predictable. This prevents accidental resource deletions since changes are explicitly scoped to individual tasks.

  • Clear Dependencies and Safety: Task dependencies are explicitly defined, making the workflow easier to understand and debug. When replacing resources (e.g., resource A with resource D), deletions are intentional and clearly visible in the task definition.

  • Enhanced Scalability: By splitting state files per task, the system:

    • Reduces contention on state files
    • Enables parallel operations across different tasks
    • Makes the workflow more maintainable at scale
    • Simplifies debugging and review processes
  • Better Error Isolation: Issues are contained within individual tasks, making it easier to identify and fix problems without affecting the entire infrastructure.
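
The combination of explicit dependencies and per-task state can be sketched in a few lines of Python. This is not how Mantis is implemented: the task functions, the placeholder IDs, and the in-memory `states` dictionary (standing in for one state file per task) are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks. Each function receives the states of the tasks it
# depends on and returns its own state.
def create_vpc(deps):
    return {"vpc_id": "vpc-123"}  # placeholder ID, not a real resource

def create_subnet(deps):
    return {"subnet_id": "subnet-456", "vpc_id": deps["create_vpc"]["vpc_id"]}

# name -> (dependencies, function producing the task's state)
tasks = {
    "create_vpc": ([], create_vpc),
    "create_subnet": (["create_vpc"], create_subnet),
}

def run_flow(tasks):
    """Run tasks in dependency order, keeping one state entry per task."""
    graph = {name: set(deps) for name, (deps, _) in tasks.items()}
    states = {}  # stands in for one state file per task
    order = []
    for name in TopologicalSorter(graph).static_order():
        deps, fn = tasks[name]
        states[name] = fn({d: states[d] for d in deps})
        order.append(name)
    return order, states

order, states = run_flow(tasks)
print(order)
print(states["create_subnet"])
```

Because each task only reads the states of its declared dependencies and writes its own, a failure or a change in one task does not touch the state of the others.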

The rsync Analogy

For folks familiar with good old rsync, I'd like to use it to explain the difference between the two approaches.

rsync (remote sync) is a utility for Unix-like systems that efficiently syncs files and directories between two hosts or machines. It takes a source directory and a destination directory, one of which may be on a remote host, and syncs them over secure or insecure protocols.

rsync -av source_directory destination_directory

The rsync command works well when you're syncing a single folder. When you're syncing a large number of folders, however, you don't control the order of syncing and can't track intermediate states. Both matter at scale, because these operations can take a long time and may fail, and you'd want to be able to retry the failed ones.

Terraform's approach is like syncing a single folder that contains a large number of sub-folders and letting rsync figure out the order of syncing and the intermediate states.

Mantis's approach is like writing a script that iterates over the list of folders and runs rsync on each of them as specified by the author. This gives the author full control over the order of syncing and the intermediate states.
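
Such a script might look like the following Python sketch. The function names, the retry count, and the folder list are illustrative assumptions. The demo replaces the real rsync call with a stand-in that fails once, so it runs without any remote host.

```python
import subprocess

def rsync_folder(src: str, dst: str) -> None:
    """Sync one folder with the same rsync -av command shown earlier."""
    subprocess.run(["rsync", "-av", src, dst], check=True)

def sync_in_order(folders, dst, sync=rsync_folder, retries=2):
    """Sync folders one by one, in the given order, recording each outcome.

    Stops at the first folder that still fails after its retries, so later
    folders are never synced ahead of one they may depend on.
    """
    status = {}
    for folder in folders:
        for attempt in range(1, retries + 2):
            try:
                sync(folder, dst)
                status[folder] = f"ok after {attempt} attempt(s)"
                break
            except (subprocess.CalledProcessError, OSError):
                status[folder] = f"failed after {attempt} attempt(s)"
        if not status[folder].startswith("ok"):
            break
    return status

# Demo with a stand-in for rsync that fails once on "db".
calls = []

def flaky_sync(src, dst):
    calls.append(src)
    if src == "db" and calls.count("db") == 1:
        raise OSError("simulated network error")

status = sync_in_order(["config", "db", "app"], "backup-host:/srv", sync=flaky_sync)
print(status)
```

The `status` dictionary plays the role of the intermediate state: it records which folders are done, so a failed folder can be retried without redoing the ones that already succeeded.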

To fully achieve its mission and be a 10x improvement over Terraform, Mantis also addresses other challenges. We'll cover them in separate posts in the coming weeks:

Improved Readability with Makefile-like Tasks

Terraform Challenge: Configurations are complex and make it difficult to understand the deployment flow and responsibilities.

Mantis Solution: Task-Oriented Workflow Engine

setup_vpc.tf.cue
vpc_setup: {
    // High-level VPC setup flow with annotated tasks
    @flow("vpc_setup")

    // Task 1: Create VPC
    create_vpc: {
        @task(mantis.core.TF)
        description: "Create and configure VPC"
        config: {
            resource: {
                aws_vpc: main: {
                    cidr_block: "10.0.0.0/16"
                    tags: {
                        Name: "mantis-production-vpc"
                        Environment: "production"
                    }
                }
            }
        }
        exports: {
            vpc_id: "aws_vpc.main.id"
            subnet_ids: []
        }
    }

    // Task 2: Set up networking
    setup_networking: {
        dep: [create_vpc] // explicit dependencies
        config: {}
    }

    // Task 3: Configure security
    security_setup: { ... }

    // Task 4: Deploy monitoring
    monitoring: { ... }
}

First-Class Package Management and Dependencies

Terraform Challenge: Limited module system makes it difficult to manage dependencies and share code.

Mantis Solution: First-Class Package System using CUE modules

cue.mod/module.cue
module: "augur.ai/vpc-setup"
language: {
    version: "v0.10.0"
}

// Define the dependencies for the project
dependencies: [
    "[augur.ai/org/base-networking]",
    "[augur.ai/org/security-baseline]",
]

network.cue
package network

import (
    base "augur.ai/org/base-networking"
    security "augur.ai/org/security-baseline"
)

// Define network configuration using base templates
network_config: {
    base.NetworkingConfig & {
        ...
    }
}

Making the Transition to Mantis

The evolution from Terraform to Mantis represents more than just a tool change—it's a fundamental shift in how we approach infrastructure automation at scale. By leveraging CUE and building upon OpenTofu, Mantis provides a powerful yet familiar platform that addresses the key challenges of modern infrastructure management.

Mantis offers:

  • Improved readability and maintainability through task-oriented workflows
  • Better governance and control mechanisms built into the core platform
  • Advanced modularity and dependency management
  • Flexible execution models that support complex deployment patterns

For organizations considering the transition:

  1. Start with new projects or non-critical workloads
  2. Use Mantis's Terraform import capabilities to gradually migrate existing infrastructure
  3. Consider a phased approach, moving team by team or application by application

The future of infrastructure automation demands tools that can handle the complexity of modern cloud environments while supporting the needs of large, distributed teams. Mantis represents the next generation of IaC tools, designed specifically to meet these challenges head-on.

References

Links that discuss mitigation strategies for accidental deletion in Terraform: