Gang Scheduling Arrives in Kubernetes 1.35: What AI/ML Teams Need to Know (2026 Guide)

Kubernetes 1.35 (released December 17, 2025) quietly introduced Gang Scheduling, an alpha feature that AI/ML teams have long been waiting for. Here’s why this matters: imagine your distributed training job needs 100 GPUs to start, but only 95 are available. Without Gang Scheduling, the Pods that fit get placed and hold those 95 GPUs while the job waits for the rest, so the GPUs sit idle and can deadlock the cluster. Gang Scheduling solves this with an “all-or-nothing” approach, and even if you’re not training models, it affects batch workloads, HPC simulations, and multi-pod StatefulSets.

The Problem: The “Partial Pod Deadlock” That’s Costing You Money

Before Gang Scheduling, Kubernetes scheduled Pods individually based on available resources. This sounds reasonable—until you realize what happens with tightly coupled workloads:

Scenario: You launch a distributed PyTorch training job that requires 10 GPUs (10 worker Pods). Your cluster has only 5 GPUs available.

Without Gang Scheduling:

5 Pods get scheduled and start consuming GPUs
The remaining 5 Pods sit in Pending state
Your training job CANNOT start because it needs all 10 workers
Those 5 scheduled Pods are now blocking resources for other workloads
Other jobs that could fit in the available capacity are now starved
Result: Cluster deadlock + wasted GPU costs
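
The scenario above can be reproduced with a toy model. This is only an illustrative sketch, not how kube-scheduler is implemented: it assumes every Pod needs exactly one GPU, and the function names are hypothetical.

```python
def schedule_pod_by_pod(free_gpus, job_sizes):
    """Bind Pods one at a time, with no knowledge of which job they belong to.

    Each Pod is assumed to need exactly one GPU. Returns (placed, free_left),
    where placed[i] is the number of Pods of job i that were bound.
    """
    placed = []
    for size in job_sizes:
        bound = min(size, free_gpus)  # take whatever is left, even if it is not enough
        free_gpus -= bound
        placed.append(bound)
    return placed, free_gpus


def runnable_jobs(placed, job_sizes):
    """A tightly coupled job can only start once all of its Pods are bound."""
    return [i for i, (p, s) in enumerate(zip(placed, job_sizes)) if p == s]


# 5 free GPUs, a 10-worker training job, then a 4-Pod job that would have fit.
job_sizes = [10, 4]
placed, free_left = schedule_pod_by_pod(5, job_sizes)
print(placed, free_left)                 # [5, 0] 0
print(runnable_jobs(placed, job_sizes))  # []
```

All five GPUs are held, yet neither job can run: the training job is missing five workers, and the smaller job that would have fit is starved.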

This problem is known as the “partial scheduling deadlock,” and it’s particularly brutal for:

Distributed AI/ML training (PyTorch, TensorFlow, Ray)
MPI-based HPC simulations
Batch jobs requiring synchronous execution
StatefulSets with strict sequencing requirements

Without Gang Scheduling, teams resort to external schedulers and queueing layers like Volcano, Kueue, or custom operators, adding operational complexity just to solve what should be a core Kubernetes capability.

The Solution: Native Gang Scheduling + Workload API in Kubernetes 1.35

Kubernetes v1.35 introduces native Gang Scheduling through two new components:

Workload API (scheduling.k8s.io/v1alpha1): A new core API resource that defines a group of Pods as a single scheduling entity.
GangScheduling Plugin: A scheduler plugin that enforces “all-or-nothing” placement for Pods belonging to a Workload.

How It Works:

Instead of the scheduler seeing 10 individual Pod requests, it now sees 1 Workload with 10 members. The GangScheduling plugin ensures that ALL 10 Pods can be scheduled before binding ANY of them to nodes.
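
The “check everything first, bind afterwards” idea can be sketched in a few lines. This is a simplified model under stated assumptions (one GPU per Pod, hypothetical node names and function names), not the actual plugin logic.

```python
def plan_gang(node_free, pod_count):
    """Try to find a node for every Pod without binding anything.

    node_free maps node name -> free GPUs. Returns a list with one node name
    per Pod, or None if the whole group does not fit.
    """
    remaining = dict(node_free)  # work on a copy; nothing is bound yet
    plan = []
    for _ in range(pod_count):
        node = next((n for n, free in remaining.items() if free > 0), None)
        if node is None:
            return None  # one Pod does not fit, so the whole gang is rejected
        remaining[node] -= 1
        plan.append(node)
    return plan


def bind_gang(node_free, pod_count):
    """Commit the plan only if every Pod in the gang has a place."""
    plan = plan_gang(node_free, pod_count)
    if plan is None:
        return False
    for node in plan:
        node_free[node] -= 1
    return True


nodes = {"node-a": 3, "node-b": 2}
print(bind_gang(nodes, 10), nodes)  # False {'node-a': 3, 'node-b': 2}
print(bind_gang(nodes, 5), nodes)   # True {'node-a': 0, 'node-b': 0}
```

The 10-Pod gang leaves the cluster untouched, so the capacity stays available for a group that does fit.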

Key Concepts:

PodGroup: Defines the group of Pods that must be scheduled together

Why This Matters Even If You’re Not Training AI Models

Gang Scheduling isn’t just for AI/ML teams. Here’s where else it solves critical problems:
Batch Processing Jobs
Spark, Hadoop, or custom batch workloads that require all workers to be present before starting. Without Gang Scheduling, partially scheduled batch jobs waste cluster resources for hours.
StatefulSets with Dependencies
Databases or distributed systems where Pods must start in a specific sequence. Gang Scheduling ensures all replicas are schedulable before binding any.
HPC Simulations
Scientific computing workloads (weather modeling, fluid dynamics) that require MPI-based coordination across all nodes simultaneously.
Multi-Tenant Clusters
In shared environments, Gang Scheduling prevents one tenant’s partially scheduled job from blocking another tenant’s smaller, fully schedulable workload.
Cost Optimization
By preventing partial scheduling, you avoid wasting money on idle resources that can’t do useful work until their peers are scheduled.
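
The multi-tenant and cost points can be illustrated with a toy admission loop. It is a sketch under simplifying assumptions: one GPU per Pod, jobs considered in submission order, and no priorities or preemption. The tenant and job names are made up for the example.

```python
def admit_gangs(free_gpus, jobs):
    """Admit each job only if all of its Pods fit; otherwise leave capacity alone.

    jobs is a list of (name, pod_count) pairs. Returns (admitted_names, free_left).
    """
    admitted = []
    for name, pod_count in jobs:
        if pod_count <= free_gpus:
            free_gpus -= pod_count
            admitted.append(name)
        # A gang that does not fit holds nothing, so later jobs can still use the GPUs.
    return admitted, free_gpus


queue = [("tenant-a-training", 10), ("tenant-b-batch", 4)]
print(admit_gangs(5, queue))  # (['tenant-b-batch'], 1)
```

Tenant A’s oversized job waits without holding anything, and tenant B’s job runs on GPUs that would otherwise have been stranded.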

How to Enable Gang Scheduling in Kubernetes 1.35

Gang Scheduling is alpha in Kubernetes 1.35, meaning it’s disabled by default. Here’s what you need to enable it:

Feature Gates
Enable two feature gates on both the API server and scheduler:

GenericWorkload: Enables the new Workload API resource
GangScheduling: Activates the scheduler plugin for all-or-nothing placement

API Group Activation
Enable the alpha API group scheduling.k8s.io/v1alpha1 on the API server, since alpha API groups are not served by default.
Workload Definition
Define a Workload object with PodGroup specifications, including minCount for quorum-based scheduling.
Pod Association
Pods reference the Workload using workloadRef in their spec.
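
The four steps above might look roughly like the sketch below, which builds the component flags and the two manifests as Python data and prints the Workload as JSON (which kubectl also accepts). The feature gate names, API group, minCount, and workloadRef come from this article; the exact field nesting (podGroups, policy.gang, and the name/podGroup keys inside workloadRef), the object names, and the container image are assumptions for illustration, so check them against the v1alpha1 API reference before use.

```python
import json

# Step 1 and 2: flags for the control plane components.
FEATURE_GATES = "GenericWorkload=true,GangScheduling=true"
apiserver_flags = [
    f"--feature-gates={FEATURE_GATES}",
    "--runtime-config=scheduling.k8s.io/v1alpha1=true",  # serve the alpha API group
]
scheduler_flags = [f"--feature-gates={FEATURE_GATES}"]

# Step 3: a Workload with one PodGroup (field layout is an assumption).
workload = {
    "apiVersion": "scheduling.k8s.io/v1alpha1",
    "kind": "Workload",
    "metadata": {"name": "training-job"},  # hypothetical name
    "spec": {
        "podGroups": [
            {"name": "workers", "policy": {"gang": {"minCount": 10}}},
        ],
    },
}


# Step 4: each Pod points back at the Workload through workloadRef.
def worker_pod(index):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"worker-{index}"},
        "spec": {
            "workloadRef": {"name": "training-job", "podGroup": "workers"},  # assumed keys
            "containers": [
                {
                    "name": "trainer",
                    "image": "example.com/trainer:latest",  # placeholder image
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


print(" ".join(apiserver_flags))
print(json.dumps(workload, indent=2))
```

Generating the ten worker Pods is then a matter of calling worker_pod(i) for i in range(10) and applying each manifest to a test cluster.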

Important Caveats for Alpha Testing:

This is NOT production-ready—it’s for testing and feedback only
APIs may change significantly before beta/stable
Performance impact on large clusters is still being evaluated
Integration with other scheduling features (pod affinity, taints/tolerations) may have edge cases

If you’re running AI/ML or batch workloads on Kubernetes, spin up a test cluster with 1.35 and try it out. Your feedback will help shape the beta release in 2026.

Key Concepts, continued:

minCount Parameter: Allows “quorum-based” scheduling (e.g., “start only once at least 8 of 10 Pods can be placed”)
Atomic Scheduling: The scheduler evaluates capacity for the entire group before making placement decisions
Standardization: This native implementation aims to standardize gang scheduling across all Kubernetes distributions
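
The quorum rule behind minCount can be written as a small function. This is a sketch of the idea only: it assumes one GPU per Pod, and whether Pods beyond the quorum are bound in the same scheduling cycle is an assumption here, not something this article specifies.

```python
def pods_to_bind(free_gpus, total_pods, min_count):
    """Return how many Pods of the group may be bound right now.

    The group starts only if at least min_count Pods fit; otherwise nothing
    is bound and no capacity is held.
    """
    fit = min(free_gpus, total_pods)
    return fit if fit >= min_count else 0


print(pods_to_bind(9, 10, 8))   # 9: quorum reached, one Pod keeps waiting
print(pods_to_bind(7, 10, 8))   # 0: below quorum, nothing is bound
print(pods_to_bind(12, 10, 8))  # 10: the whole group fits
```

Setting min_count equal to total_pods gives strict all-or-nothing behavior.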

FAQ: Gang Scheduling in Kubernetes 1.35

Is Gang Scheduling production-ready in Kubernetes 1.35?
No. It’s in alpha stage, disabled by default. It’s intended for testing and feedback, not production workloads.
Does this replace Volcano, Kueue, or other batch schedulers?
Not yet. Those tools offer more mature gang scheduling plus additional features (queue management, resource quotas). However, having native gang scheduling in Kubernetes means they can eventually simplify their architectures or focus on higher-level features.
What’s the performance impact on large clusters?
Still being evaluated. The scheduler now needs to evaluate capacity for entire Workloads instead of individual Pods, which could impact scheduling latency in clusters with thousands of Pods.
Can I use Gang Scheduling with existing Jobs or Deployments?
Not automatically. You need to define a Workload object and reference it from your Pods using workloadRef. Existing resources won’t pick up gang scheduling on their own.
When will this graduate to beta or stable?
No graduation date is committed. Based on typical Kubernetes timelines of roughly three releases a year, expect beta in mid-to-late 2026 (likely Kubernetes 1.37 or 1.38) if alpha feedback is positive. Stable graduation would follow 1-2 releases after beta.

The Bottom Line: Start Testing Now, Deploy in 2026

Gang Scheduling in Kubernetes 1.35 is a long-awaited feature that solves a fundamental problem: the partial scheduling deadlock. While it’s still in alpha and not ready for production, its arrival signals a major shift in how Kubernetes handles tightly coupled, multi-pod workloads.

For AI/ML teams running distributed training on PyTorch, TensorFlow, or Ray, this is a game-changer. For batch processing teams dealing with Spark or Hadoop, it can eliminate hours of wasted cluster capacity. For HPC teams, it brings Kubernetes closer to being a viable alternative to traditional HPC schedulers.

The native implementation means you won’t need to maintain external schedulers just to get all-or-nothing semantics, though mature solutions like Volcano and Kueue will continue to offer value-added features.

What you should do now:

If you’re running AI/ML or batch workloads: Spin up a Kubernetes 1.35 test cluster and try Gang Scheduling. Your feedback will shape the beta release.
If you’re using external schedulers: Monitor how your vendor integrates with or adopts the native Workload API.
If you’re planning 2026 infrastructure: Factor Gang Scheduling into your Kubernetes roadmap. Expect beta in mid-to-late 2026 and stable by late 2026 or early 2027 at the earliest.

The era of cluster deadlocks from partial scheduling is ending. Gang Scheduling in Kubernetes 1.35 is just the beginning, and 2026 is when it could start moving toward production readiness.

Ready to test? Grab Kubernetes 1.35, enable the feature gates, and see how all-or-nothing scheduling changes your workload management strategy.
