Breaking Through Terraform's Ceiling: A New Approach to IaC State Management
The industry is at a crossroads with AI-led code generation. We can either let it proliferate tech debt and push us down a rabbit hole, or use the opportunity to rethink our approach and build next-generation solutions that remove bottlenecks in cloud usage.
This post dives into how we can move from Terraform to next-generation solutions like Mantis, examining why we need to rethink our approach to infrastructure automation at scale.
Understanding Terraform and State Management
What is Terraform?
Terraform is an Infrastructure as Code (IaC) tool that allows teams to define and provision cloud infrastructure using declarative configuration files. It revolutionized infrastructure management by introducing:
- Declarative infrastructure definitions
- Resource dependency management
- State tracking for infrastructure
- Multi-cloud provider support
What is State Management?
State management in Terraform refers to how it tracks and manages the real-world infrastructure resources it creates. Think of it as a mapping between your code and actual cloud resources. This state:
- Records what infrastructure exists
- Tracks resource dependencies
- Manages resource updates and deletions
- Handles concurrent access to infrastructure
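To make the "mapping between your code and actual cloud resources" concrete, here is a minimal sketch in Python. It is a toy model, not Terraform's actual state format: the `plan` function and the resource IDs are made up for illustration.

```python
# Toy model of a state file: resource address in code -> real cloud ID.
# The IDs below are made-up placeholders, not real resources.
state = {
    "aws_vpc.main": "vpc-0a1b2c3d",
    "aws_subnet.public": "subnet-4e5f6a7b",
}


def plan(desired: set[str], state: dict[str, str]) -> dict[str, list[str]]:
    """Classify addresses by comparing the code (desired) with the state."""
    recorded = set(state)
    return {
        "create": sorted(desired - recorded),   # in code, not yet in state
        "destroy": sorted(recorded - desired),  # in state, no longer in code
        "keep": sorted(desired & recorded),     # present in both
    }


# The code now declares a security group and no longer declares the subnet.
desired = {"aws_vpc.main", "aws_security_group.allow_http"}
result = plan(desired, state)
print(result)
```

The point of the sketch is that every decision about what to create or destroy comes from comparing the code against the recorded state, which is why the state file becomes so central.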
The State Management Ceiling
While this approach worked well for smaller teams and simpler infrastructures, it creates a ceiling when scaling because:
- Single Source of Truth: All infrastructure state is stored in one or a few state files
- Global State Lock: Only one operation at a time can modify the resources tracked by a given state file
- All-or-Nothing Updates: Each plan and apply evaluates everything in the state file as one unit, even when only one component changed
- State File Size: As infrastructure grows, state files become unwieldy
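A rough back-of-the-envelope model shows why the lock matters. The sketch below is in Python and assumes the operations are independent of each other; the component names, durations, and the `wall_clock_minutes` helper are hypothetical. Operations that share a state file queue behind its lock, while operations on different state files can run side by side.

```python
from collections import defaultdict
from typing import Callable


def wall_clock_minutes(
    ops: list[tuple[str, int]],
    state_file_for: Callable[[str], str],
) -> int:
    """Estimate total elapsed time for a batch of independent operations.

    Operations mapped to the same state file run one after another (they
    wait on the lock); different state files proceed in parallel.
    """
    busy: dict[str, int] = defaultdict(int)
    for component, minutes in ops:
        busy[state_file_for(component)] += minutes
    return max(busy.values(), default=0)


# Made-up durations for three teams deploying at the same time.
ops = [("network", 10), ("database", 15), ("app", 5)]

monolithic = wall_clock_minutes(ops, lambda component: "global.tfstate")
per_component = wall_clock_minutes(ops, lambda component: f"{component}.tfstate")
print(monolithic, per_component)
```

With one shared state file the three deployments take the sum of their durations; with one state file per component they take only as long as the slowest one.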
The Growing Gap
This creates fundamental conflicts between the assumptions Terraform was built on and what infrastructure teams need today.
Hidden Assumptions and Today's Reality
Terraform's Original Assumptions
When Terraform was designed (circa 2014), it made several assumptions:
- Teams were smaller and more centralized
- Infrastructure changes were less frequent
- Cloud architectures were simpler
- Microservices weren't as prevalent; Kubernetes grew up alongside Terraform but was not yet the norm
- CI/CD wasn't as sophisticated; the industry was fine with shipping once or twice a week, and cloud governance was still in its infancy
Today's Infrastructure Needs
The landscape has dramatically changed:
- Scale: Organizations manage thousands of resources
- Speed: Multiple teams need to deploy simultaneously
- Complexity: Microservices require intricate dependencies
- Distribution: Teams are global and work asynchronously
- Automation: CI/CD requires more granular control
- Governance: Cloud spend and security are now high priorities in developer workflows
Needs for next-gen IaC in the coming years
- AI-led code generation: AI will accelerate code generation, increasing the pressure on IaC tools to keep up.
- Infrastructure will be a competitive advantage: Companies that build on solid, well-organized infrastructure will be able to deploy, innovate, and scale faster.
- Change management will be key: As infrastructure becomes more complex, we need a good handle on what changed and what didn't. Keeping the house in order at all times will matter more.
The Ideal Solution
A modern IaC solution should provide:
- Granular State Management: Independent state for different components
- Parallel Operations: Multiple teams working simultaneously
- Clear Dependencies: Explicit and manageable resource relationships
- Scalable Performance: No degradation with infrastructure growth
- Strong Type Safety: Catch errors before deployment
- Task-Based Workflows: Clear operation ordering and dependencies
How Mantis Addresses These Needs
State Management that scales
Let's take an example to illustrate how Terraform and Mantis differ in their approach to state management.
Terraform Approach:
# Single state file managing all resources
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}
resource "aws_security_group" "allow_http" {
  vpc_id = aws_vpc.main.id
  # ... security group rules
}
Terraform Challenges: State files grow and become difficult to manage because Terraform reconciles an entire root module (conventionally main.tf) against a single state file. While its declarative approach promises simplicity, syncing everything through one state file creates several critical problems at scale:
- Monolithic State Management: All resources are tracked in one state file, making it unwieldy and hard to reason about
- Limited Update Control: Partial updates become difficult to manage safely
- Risk of Unintended Changes: Without granular control over resources and desired state, accidental deletions or modifications become more likely. For example, when resource A is replaced with resource D in a Terraform script, A is deleted and D is created. This is the expected behavior, but such changes are easily missed when reviewing large scripts and can lead to unintended resource deletions. This sensitivity is a well-known challenge with Terraform, documented in various community discussions and articles (see the References section below).
- Complex Dependencies: Resource relationships become harder to track and maintain as the state file grows
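The "replace A with D" risk can be sketched as a simple review guard. This is a hypothetical Python helper, not a Terraform or Mantis feature: `check_deletions`, `UnapprovedDeletion`, and the resource names are invented for illustration. The idea is that any deletion implied by a change must be named explicitly before the change goes through.

```python
class UnapprovedDeletion(Exception):
    """Raised when a change would delete resources nobody signed off on."""


def check_deletions(before: set[str], after: set[str], approved: set[str]) -> list[str]:
    """Return the addresses a change would delete, or raise if any is unapproved."""
    deleted = before - after
    unapproved = sorted(deleted - approved)
    if unapproved:
        raise UnapprovedDeletion(f"would delete without approval: {unapproved}")
    return sorted(deleted)


before = {"resource_a", "resource_b", "resource_c"}
after = {"resource_b", "resource_c", "resource_d"}  # A replaced with D

try:
    check_deletions(before, after, approved=set())
except UnapprovedDeletion as err:
    print(err)  # the implicit deletion of resource_a is surfaced

# Once the deletion is acknowledged explicitly, the change passes.
print(check_deletions(before, after, approved={"resource_a"}))
```

In a large script the deletion of A is a side effect of an edit elsewhere; a guard like this turns it into something a reviewer has to approve by name.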
Mantis Approach:
vpc_setup: {
    @flow("vpc_setup")
    create_vpc: {
        @task(mantis.core.TF)
        config: {
            resource: aws_vpc: main: {
                cidr_block: "10.0.0.0/16"
            }
        }
    }
    create_subnet: {
        @task(mantis.core.TF)
        dep: create_vpc
        config: {
            resource: aws_subnet: public: {
                vpc_id: @var(vpc_id)
                cidr_block: "10.0.1.0/24"
            }
        }
    }
}
This approach provides several key benefits:
- Independent State Management: Each task maintains its own state file, reducing complexity and making operations more predictable. This prevents accidental resource deletions since changes are explicitly scoped to individual tasks.
- Clear Dependencies and Safety: Task dependencies are explicitly defined, making the workflow easier to understand and debug. When replacing resources (e.g., resource A with resource B), deletions are intentional and clearly visible in the task definition.
- Enhanced Scalability: By splitting state files per task, the system:
  - Reduces contention on state files
  - Enables parallel operations across different tasks
  - Makes the workflow more maintainable at scale
  - Simplifies debugging and review processes
- Better Error Isolation: Issues are contained within individual tasks, making it easier to identify and fix problems without affecting the entire infrastructure.
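The benefits above can be sketched with a small task runner in Python. This is an illustration of the idea only, not how Mantis is implemented: `run_flow`, `fake_task`, and the `create_security_group` task are hypothetical. Each task keeps its own state, runs only after its dependencies, and a failure affects only the tasks downstream of it.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical flow: task name -> names of the tasks it depends on.
FLOW = {
    "create_vpc": [],
    "create_subnet": ["create_vpc"],
    "create_security_group": ["create_vpc"],
}


def run_flow(flow: dict[str, list[str]], run_task: Callable[[str, dict], dict]):
    """Run tasks in dependency order, keeping one state entry per task."""
    states: dict[str, dict] = {}
    status: dict[str, str] = {}
    for name in TopologicalSorter(flow).static_order():
        deps = flow[name]
        if any(status[dep] != "ok" for dep in deps):
            status[name] = "skipped"  # an upstream task did not succeed
            continue
        try:
            # A task only sees the state of the tasks it depends on.
            states[name] = run_task(name, {dep: states[dep] for dep in deps})
            status[name] = "ok"
        except Exception:
            status[name] = "failed"  # contained: unrelated tasks still run
    return status, states


def fake_task(name: str, upstream: dict) -> dict:
    """Stand-in for a real apply; fails for one task to show isolation."""
    if name == "create_subnet":
        raise RuntimeError("simulated provider error")
    return {"task": name, "inputs": sorted(upstream)}


status, states = run_flow(FLOW, fake_task)
print(status)
```

Here the subnet task fails, but the security group task, which depends only on the VPC, still completes, and the VPC's state is untouched.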
The rsync analogy:
For folks familiar with good old rsync, I'd like to use it to explain the difference between the two approaches.
rsync, or remote sync, is a utility for Unix-like systems that efficiently synchronizes files and directories between two hosts. It takes a source directory and a destination directory (possibly on a remote host) and syncs them via secure or insecure protocols.
rsync -av source_directory destination_directory
The rsync command works well when you're syncing a single folder. But when you're syncing a large number of folders, you don't control the order of syncing or track intermediate states. Both matter at scale, because these operations can be slow and may fail, and you'd want to be able to retry the failed ones.
Terraform's approach is like syncing a single folder with a large number of sub-folders and letting rsync figure out the order of syncing and the intermediate states.
Mantis's approach is to write a script that iterates over the list of folders and rsyncs each of them as specified by the author. This gives the author full control over the order of syncing and the intermediate states.
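The script in this analogy might look like the Python sketch below. It is only an illustration of the analogy, not Mantis code: `sync_in_order` and its `retries` parameter are hypothetical, and the `sync` argument exists so the loop can be tried out without touching real files.

```python
import subprocess
from typing import Callable, Optional


def rsync(source: str, destination: str) -> None:
    """Sync one folder; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(["rsync", "-av", source, destination], check=True)


def sync_in_order(
    folders: list[str],
    destination: str,
    sync: Callable[[str, str], None] = rsync,
    retries: int = 2,
) -> tuple[list[str], Optional[str]]:
    """Sync folders in the author's order, retrying each one on failure.

    Returns the folders that completed and the folder that gave up (or None),
    so a later run can resume from a known intermediate state.
    """
    done: list[str] = []
    for folder in folders:
        for _attempt in range(1 + retries):
            try:
                sync(folder, destination)
            except (subprocess.CalledProcessError, OSError):
                continue  # this attempt failed; try the same folder again
            done.append(folder)
            break
        else:
            # Every attempt failed: stop here and report where we got to.
            return done, folder
    return done, None
```

Calling `sync_in_order(["configs", "assets", "logs"], "/backup")` would sync the three folders in exactly that order, and a failure in `assets` would leave `configs` recorded as done instead of forcing a restart from scratch.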
Now, to fully achieve its mission and be a 10x improvement over Terraform, Mantis also addresses other challenges. We'll cover them in separate posts in the coming weeks:
Improved Readability with Make-Style Build Systems
Terraform Challenge: Configurations are complex and make it difficult to understand the deployment flow and responsibilities.
Mantis Solution: Task-Oriented Workflow Engine
vpc_setup: {
    // High-level VPC setup flow with annotated tasks
    @flow("vpc_setup")
    // Task 1: Create VPC
    create_vpc: {
        @task(mantis.core.TF)
        description: "Create and configure VPC"
        config: {
            resource: {
                aws_vpc: main: {
                    cidr_block: "10.0.0.0/16"
                    tags: {
                        Name: "mantis-production-vpc"
                        Environment: "production"
                    }
                }
            }
        }
        exports: {
            vpc_id: "aws_vpc.main.id"
            subnet_ids: []
        }
    }
    // Task 2: Set up networking
    setup_networking: { }
    // Task 3: Configure security
    security_setup: { }
    // Task 4: Deploy monitoring
    monitoring: { }
}
First-Class Package Management and Dependencies
Terraform Challenge: Limited module system makes it difficult to manage dependencies and share code.
Mantis Solution: First-Class Package System using CUE modules
module: "augur.ai/vpc-setup"
language: {
    version: "v0.10.0"
}
// Define the dependencies for the project
dependencies: [
    "augur.ai/org/base-networking",
    "augur.ai/org/security-baseline",
]
package network
import (
    base "augur.ai/org/base-networking"
    security "augur.ai/org/security-baseline"
)
// Define network configuration using base templates
network_config: {
    base.NetworkingConfig & {
        ...
    }
}
Making the Transition to Mantis
The evolution from Terraform to Mantis represents more than just a tool change; it's a fundamental shift in how we approach infrastructure automation at scale. By leveraging CUE and building upon OpenTofu, Mantis provides a powerful yet familiar platform that addresses the key challenges of modern infrastructure management.
Mantis offers:
- Improved readability and maintainability through task-oriented workflows
- Better governance and control mechanisms built into the core platform
- Advanced modularity and dependency management
- Flexible execution models that support complex deployment patterns
For organizations considering the transition:
- Start with new projects or non-critical workloads
- Use Mantis's Terraform import capabilities to gradually migrate existing infrastructure
- Consider a phased approach, moving team by team or application by application
The future of infrastructure automation demands tools that can handle the complexity of modern cloud environments while supporting the needs of large, distributed teams. Mantis represents the next generation of IaC tools, designed specifically to meet these challenges head-on.
References
Links that discuss mitigation strategies for accidental deletions in Terraform: