Gitpod and the Importance of Understanding Kubernetes Workloads
In a provocative and detailed post, the Gitpod team explained why they are leaving Kubernetes. It is a deeply reflective piece that journals their experiments and learnings, shows how complex the decision was, and could greatly benefit the community.
We want to frame the discussion to help the community better understand what happened and how it applies to them.
The Gitpod team's key points can be summarized as follows:
Reason | Description |
---|---|
Complexity and Limitations | Kubernetes is optimized for production workloads, but its complexity hinders Gitpod's development environments, which need unique resource allocations and secure, interactive setups. |
Resource Management Issues | Kubernetes struggles to handle the bursty and unpredictable CPU and memory demands of development environments without causing latency. |
Storage and Startup Performance Challenges | Persistent Volume Claims and Kubernetes storage solutions introduced delays, impacting startup speed and overall environment reliability. |
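To make the resource-management row more concrete, here is a toy model, assuming a workspace whose CPU demand spikes briefly (for example during a compile) while it runs under a fixed per-second CPU cap. The function name `simulate_backlog` and every number are hypothetical and are not taken from Gitpod's post. The sketch only illustrates that a cap sized for the idle case turns a short burst into several seconds of queued work, which the user perceives as latency.

```python
# Toy model: bursty CPU demand running under a fixed CPU cap.
# All numbers are invented for illustration; this is not Gitpod's data.

def simulate_backlog(demand_per_tick, cpu_limit):
    """Return the backlog of unserved CPU work after each one-second tick.

    demand_per_tick: CPU-seconds of work arriving in each tick.
    cpu_limit: CPU-seconds the workload may consume per tick (a fixed cap).
    """
    backlog = 0.0
    history = []
    for demand in demand_per_tick:
        pending = backlog + demand          # old queue plus new work
        served = min(pending, cpu_limit)    # the cap bounds what can run
        backlog = pending - served          # the rest waits for the next tick
        history.append(backlog)
    return history


# Mostly idle editing, a two-second compile burst, then idle again.
demand = [0.1, 0.1, 4.0, 4.0, 0.1, 0.1, 0.1, 0.1]

tight = simulate_backlog(demand, cpu_limit=1.0)     # cap sized for idle use
generous = simulate_backlog(demand, cpu_limit=4.0)  # cap sized for the burst

print("limit=1.0 backlog:", [round(b, 1) for b in tight])
print("limit=4.0 backlog:", [round(b, 1) for b in generous])
```

With the tight cap the queue is still not drained at the end of the trace, while the generous cap never queues anything but reserves four times the CPU for a workload that is idle most of the time. That tension is the trade-off the table describes.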
Our assessment is that there was a mismatch between the assumptions and structure of K8s and what the workload required. Cloud development environments like Gitpod and GitHub Codespaces are like VS Code sessions in the browser. Their sessions are highly sticky in both compute and data, with very little tolerance for data loss. This is hard to do at scale in general. As an approach to dealing with deployment complexity, Kubernetes ran into two issues:
- Statelessness: Bin-packing tech like K8s was primarily meant for either stateless microservices or long-running stateful databases that can be pinned to certain nodes. The load here was stateful and dynamic over short periods.
- Complexity during scaling: Having worked in high-growth startups, I can imagine the immense complexity engineering teams face while trying to make Kubernetes work for their systems. Fine-tuning requires managing multiple configuration sets for different environments, and there is a need for smoother support for iterative experiments, without the manual effort of constantly switching configs, testing, rolling out changes, and keeping stakeholders informed.
Understanding Workloads
Image credit: dchan.cc/a-kubernetes-jouney
The analogy of a cattle shelter, a pet shelter, and a subscription pet daycare is quite fitting and helps illustrate the nature and needs of each workload category:
- Cattle Shelter (Stateless Microservices)
  - A cattle shelter is a place where animals (like cattle) are housed in bulk, generally with standardized care and minimal individual attention. This aligns with stateless microservices, which are scalable, uniform, and often designed to be quickly created and destroyed as needed.
  - Just like cattle in a shelter, each microservice is more or less "anonymous" in its purpose, meaning if one goes down, another can easily replace it without major impact. This is where Kubernetes shines, orchestrating these services reliably at scale.
- Pet Shelter (Long-Running Stateful Services)
  - A pet shelter (or a home) cares for individual animals that require specific, dedicated attention and persistence over time. Similarly, long-running stateful services (e.g., databases) need sustained care, attention, and persistent storage.
  - The complexity of running databases on Kubernetes is well documented, and every distributed database has a different set of requirements that need special configuration and tuning. See the post "To Run or Not to Run a Database on Kubernetes: What to Consider."
- Pet Daycare on Subscription (Temporary, Bursty Development Environments)
  - A pet daycare is a place where pets come and go on a regular, often daily basis. This parallels cloud development environments, where users bring their work environment up and down as needed, with sessions that might last hours but are limited to certain times or purposes.
  - Daycares need to handle bursts of activity, especially during drop-off and pick-up hours, just as development environments experience bursty traffic and high resource demand. Additionally, a daycare requires adaptable spaces for each pet’s needs (like custom setups or tools), which mirrors the high-persistence and quick-setup needs of development environments in the cloud.
  - Gitpod and cloud development environments are like pet daycare, which means they have to be treated as a special case of a database, where each user session is like a database instance with writes and snapshots that need to be persisted.
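The three categories above can be sketched as a small classifier. This is a simplification under stated assumptions: the `Workload` type, its two boolean attributes, and the `classify` function are hypothetical names introduced here, and real workloads will rarely reduce to two yes/no questions.

```python
from dataclasses import dataclass
from enum import Enum


class WorkloadClass(Enum):
    CATTLE_SHELTER = "stateless microservice"
    PET_SHELTER = "long-running stateful service"
    PET_DAYCARE = "temporary, bursty development environment"


@dataclass(frozen=True)
class Workload:
    name: str
    stateful: bool      # holds data that must survive a restart?
    long_running: bool  # stays up continuously, rather than coming and going?


def classify(w: Workload) -> WorkloadClass:
    """Map a workload onto the shelter/daycare analogy (illustrative only)."""
    if not w.stateful:
        return WorkloadClass.CATTLE_SHELTER  # replaceable, anonymous instances
    if w.long_running:
        return WorkloadClass.PET_SHELTER     # e.g. a database pinned to nodes
    return WorkloadClass.PET_DAYCARE         # stateful but short-lived sessions


for w in (
    Workload("checkout-api", stateful=False, long_running=True),
    Workload("orders-db", stateful=True, long_running=True),
    Workload("dev-workspace", stateful=True, long_running=False),
):
    print(f"{w.name}: {classify(w).value}")
```

The third branch is the interesting one: the workload is stateful like a database, yet its lifetime looks more like that of a stateless service.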
Architectural Approach for Each Workload Class
We've put together a rubric to help understand the trade-offs and fit for different workloads:
Dimension | Description | When K8s is a Good Fit | When Alternatives May Be Preferred |
---|---|---|---|
Persistence Requirements | Defines whether the workload needs long-term data storage and strong persistence guarantees. | Low persistence (stateless microservices), or managed through external services (e.g., databases) | High persistence for workloads requiring dedicated storage and data consistency (e.g., stateful services, databases) |
Startup Time Sensitivity | Indicates the tolerance for startup latency, especially in scenarios where fast boot times are essential (e.g., interactive environments). | Startup latency is non-critical; services can wait a few seconds or more. | Low-latency, fast start-up environments like development platforms or analytics |
Data Consistency | Level of consistency required across distributed components or nodes, especially in stateful applications. | Low-to-medium consistency (eventual consistency is acceptable). | High consistency, requiring synchronous replication or ACID guarantees |
Isolation and Security | The degree of isolation and control required for workloads, including root permissions and network isolation. | Low-to-moderate isolation, where namespaces and RBAC provide adequate security boundaries. | High isolation needs, where users need root access or control over system resources (e.g., dev environments) |
Compute and Resource Elasticity | Measures the need for flexible scaling and resource pooling, especially CPU and memory. | High elasticity, particularly for stateless or auto-scaled workloads. | Non-elastic or resource-bound workloads, such as long-running, high-memory databases |
Operational Overhead | Complexity of operating and managing the workload infrastructure, including cluster management and maintenance. | When infrastructure can be standardized, and Kubernetes ecosystem tools (e.g., Helm, Operators) reduce complexity. | When operational complexity is high due to custom configurations or unique resource demands |
Data Security and Compliance | Level of required data protection, compliance, and regulatory needs that can impact workload placement. | Basic data security policies (e.g., namespace-level separation) or compliance needs that can be handled within a cluster. | Higher regulatory and security demands where dedicated environments offer more control. |
Applying the Rubric: Example Workload Evaluations
Now, let's see how this rubric applies to different workloads, including Gitpod's use case:
- Stateless Microservices
  - Persistence Requirements: Low (typically stateless)
  - Startup Time Sensitivity: Low (delays acceptable)
  - Data Consistency: Low-to-medium (eventual consistency acceptable)
  - Traffic Patterns: Predictable/auto-scaled
  - Isolation and Security: Moderate (minimal root access)
  - Compute Elasticity: High
  - Operational Overhead: Low-to-medium
  - Data Security: Moderate (handled within Kubernetes)
  - K8s Fit: High – Kubernetes is a great fit here for scaling and orchestrating.
- Long-Running Stateful Services (e.g., Databases)
  - Persistence Requirements: High (strong data durability)
  - Startup Time Sensitivity: Moderate
  - Data Consistency: High (ACID compliance)
  - Traffic Patterns: Consistent
  - Isolation and Security: Moderate (dedicated storage, but less system control)
  - Compute Elasticity: Low
  - Operational Overhead: High (requires custom PV setups)
  - Data Security: High (often separate compliance)
  - K8s Fit: Medium – Kubernetes can work here but may require custom PV and specialized setup.
- Temporary, Bursty Development Environments (Gitpod, Codespaces)
  - Persistence Requirements: High (workspace snapshots, data retention)
  - Startup Time Sensitivity: High (low-latency boot times)
  - Data Consistency: Moderate (session consistency)
  - Traffic Patterns: Bursty (user sessions, high bursts)
  - Isolation and Security: High (root access, network isolation)
  - Compute Elasticity: High (rapid resource release and acquisition)
  - Cost Efficiency: High (needs resource optimization)
  - Operational Overhead: High (complex infrastructure requirements)
  - Data Security: High (user isolation, security controls)
  - K8s Fit: Low – Kubernetes struggles with bursty, high-security dev environments where custom control is required.
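The three evaluations are easier to compare side by side. The sketch below encodes the ratings listed above (with the parenthetical notes dropped for brevity) and renders them as a pipe table. The names `DIMENSIONS`, `PROFILES`, and `comparison_table` are hypothetical, and a dimension that an evaluation does not rate is shown as `-`. It does not compute a fit score; the K8s Fit row simply repeats the verdicts stated above.

```python
# Ratings copied from the three evaluations above (parentheticals omitted).
DIMENSIONS = [
    "Persistence Requirements", "Startup Time Sensitivity", "Data Consistency",
    "Traffic Patterns", "Isolation and Security", "Compute Elasticity",
    "Cost Efficiency", "Operational Overhead", "Data Security", "K8s Fit",
]

PROFILES = {
    "Stateless Microservices": {
        "Persistence Requirements": "Low", "Startup Time Sensitivity": "Low",
        "Data Consistency": "Low-to-medium",
        "Traffic Patterns": "Predictable/auto-scaled",
        "Isolation and Security": "Moderate", "Compute Elasticity": "High",
        "Operational Overhead": "Low-to-medium", "Data Security": "Moderate",
        "K8s Fit": "High",
    },
    "Long-Running Stateful Services": {
        "Persistence Requirements": "High",
        "Startup Time Sensitivity": "Moderate", "Data Consistency": "High",
        "Traffic Patterns": "Consistent", "Isolation and Security": "Moderate",
        "Compute Elasticity": "Low", "Operational Overhead": "High",
        "Data Security": "High", "K8s Fit": "Medium",
    },
    "Development Environments": {
        "Persistence Requirements": "High", "Startup Time Sensitivity": "High",
        "Data Consistency": "Moderate", "Traffic Patterns": "Bursty",
        "Isolation and Security": "High", "Compute Elasticity": "High",
        "Cost Efficiency": "High", "Operational Overhead": "High",
        "Data Security": "High", "K8s Fit": "Low",
    },
}


def comparison_table(profiles, dimensions=DIMENSIONS):
    """Render one row per dimension and one column per workload."""
    names = list(profiles)
    lines = [
        "Dimension | " + " | ".join(names) + " |",
        "---|" * (len(names) + 1),
    ]
    for dim in dimensions:
        # "-" marks a dimension that the evaluation did not rate.
        cells = [profiles[name].get(dim, "-") for name in names]
        lines.append(dim + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(comparison_table(PROFILES))
```

Reading across a row shows where development environments diverge from both other classes: they pair database-like persistence and security needs with the startup sensitivity and elasticity of a stateless service.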
Handling Complex Workload Scenarios
When dealing with mixed or dynamic workloads, the team faces a vast design space with many potential configurations (e.g., 9 dimensions, each with 3 options, resulting in 3^9 = 19,683 combinations). In such situations, consider the following approaches:
- Adjust the Application or Relax Guarantees: Where possible, modify the application to reduce dependency on strict guarantees, focusing only on essential areas.
- Utilize Multiple Management Mechanisms: Employ different management techniques for various workload components, allowing flexibility and optimized resource allocation.
- Adopt Improved Abstractions: Develop abstractions that consider these challenges, allowing for workload categorization and more effective resource management across different scenarios.
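The size of the design space quoted above, and the effect of relaxing guarantees on it, can be checked with a few lines. The level labels and the choice of four pinned dimensions are assumptions made for illustration; only the 9 dimensions and 3 options come from the text.

```python
from itertools import islice, product

LEVELS = ("low", "medium", "high")  # assumed labels for the 3 options
NUM_DIMENSIONS = 9


def design_space_size(num_dimensions, num_levels=len(LEVELS)):
    """Number of distinct configurations: one level chosen per dimension."""
    return num_levels ** num_dimensions


print(design_space_size(NUM_DIMENSIONS))      # 19683

# If the application is adjusted so that 4 dimensions are pinned to a single
# acceptable level (a hypothetical choice), only 5 dimensions remain free.
print(design_space_size(NUM_DIMENSIONS - 4))  # 243

# The space can also be walked lazily instead of being materialized.
for combo in islice(product(LEVELS, repeat=NUM_DIMENSIONS), 3):
    print(combo)
```

Each dimension that is fixed up front divides the number of configurations to evaluate by three, which is why adjusting the application or relaxing guarantees is listed first.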
Summary
In summary, the Gitpod team learned the hard way that while Kubernetes is a powerful tool for managing complex workloads, its trade-offs did not suit their workload. Deciding whether Kubernetes is the right choice requires a nuanced understanding of the nature of the workload and the trade-offs involved. By understanding and categorizing workloads effectively, teams can make more informed decisions about infrastructure management, potentially leading to better performance, cost efficiency, and operational agility.