Kubernetes 1.35 (released December 17, 2025) quietly introduced Gang Scheduling, an alpha feature that AI/ML teams have been desperately waiting for. Here’s why this matters: imagine your distributed training job needs 100 GPUs to start, but only 95 are available. Without Gang Scheduling, the Pods that fit are scheduled anyway and hold those 95 GPUs idle while they wait for the rest, which can deadlock the cluster. Gang Scheduling solves this with an “all-or-nothing” approach, and even if you’re not training models, this impacts batch workloads, HPC simulations, and multi-pod StatefulSets.
The Problem: The “Partial Pod Deadlock” That’s Costing You Money
Before Gang Scheduling, Kubernetes scheduled Pods individually based on available resources. This sounds reasonable until you realize what happens with tightly coupled workloads:
Scenario: You launch a distributed PyTorch training job that requires 10 GPUs (10 worker Pods). Your cluster has only 5 GPUs available.
Without Gang Scheduling:
- 5 Pods get scheduled and start consuming GPUs
- The remaining 5 Pods sit in Pending state
- Your training job CANNOT start because it needs all 10 workers
- Those 5 scheduled Pods are now blocking resources for other workloads
- Other jobs that could fit in the available capacity are now starved
- Result: Cluster deadlock + wasted GPU costs
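The failure mode above can be reproduced with a toy model. The following is a minimal sketch, not scheduler code: it assumes one GPU per Pod and a single pool of free GPUs, and `Job`, `schedule_individually`, and the second `small-batch` job are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pods_needed: int       # the job can only start once all of these are running
    pods_running: int = 0

    @property
    def can_start(self) -> bool:
        return self.pods_running == self.pods_needed


def schedule_individually(jobs: list, free_gpus: int) -> int:
    """Bind Pods one at a time (one GPU each) until the GPUs run out.

    Returns the number of GPUs still free afterwards.
    """
    for job in jobs:
        while job.pods_running < job.pods_needed and free_gpus > 0:
            job.pods_running += 1
            free_gpus -= 1
    return free_gpus


# Scenario from the text: a 10-Pod training job, but only 5 GPUs are free.
training = Job("pytorch-training", pods_needed=10)
# Hypothetical second job that would have fit entirely in the 5 free GPUs.
small = Job("small-batch", pods_needed=4)

remaining = schedule_individually([training, small], free_gpus=5)

for job in (training, small):
    print(f"{job.name}: {job.pods_running}/{job.pods_needed} running, "
          f"can start: {job.can_start}")
print(f"free GPUs: {remaining}")
```

In this model the training job ends up holding all 5 GPUs without being able to start, and the smaller job, which would have fit, gets nothing.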
This problem is known as the “partial scheduling deadlock,” and it’s particularly brutal for:
- Distributed AI/ML training (PyTorch, TensorFlow, Ray)
- MPI-based HPC simulations
- Batch jobs requiring synchronous execution
- StatefulSets with strict sequencing requirements
Without Gang Scheduling, teams resort to external schedulers like Volcano, Kueue, or custom operators, adding operational complexity just to solve what should be a core Kubernetes capability.
The Solution: Native Gang Scheduling + Workload API in Kubernetes 1.35
Kubernetes v1.35 introduces native Gang Scheduling through two new components:
- Workload API (scheduling.k8s.io/v1alpha1): A new core API resource that defines a group of Pods as a single scheduling entity.
- GangScheduling Plugin: A scheduler plugin that enforces “all-or-nothing” placement for Pods belonging to a Workload.
How It Works:
Instead of the scheduler seeing 10 individual Pod requests, it now sees 1 Workload with 10 members. The GangScheduling plugin ensures that ALL 10 Pods can be scheduled before binding ANY of them to nodes.
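To make the all-or-nothing rule concrete, here is a minimal sketch under simplifying assumptions: each Pod needs one GPU, nodes are described only by their free GPU counts, and `place_gang` is a hypothetical helper rather than the GangScheduling plugin’s actual algorithm. It plans a placement for every Pod against a copy of the cluster state and commits only if the whole group fits.

```python
from typing import Dict, List, Optional


def place_gang(pods: List[str], free_gpus: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Try to place every Pod (one GPU each); return pod -> node, or None.

    free_gpus is only modified when the whole group fits.
    """
    tentative = dict(free_gpus)           # plan against a copy, not the live state
    plan: Dict[str, str] = {}
    for pod in pods:
        # First-fit: take the first node that still has a free GPU.
        node = next((n for n, free in tentative.items() if free > 0), None)
        if node is None:
            return None                   # one Pod does not fit: bind nothing
        tentative[node] -= 1
        plan[pod] = node
    free_gpus.update(tentative)           # commit: every Pod fit, so bind them all
    return plan


cluster = {"node-a": 3, "node-b": 2}      # hypothetical nodes, 5 free GPUs in total
workers = [f"worker-{i}" for i in range(10)]

# 10 Pods, 5 GPUs: the group is rejected and no GPU is taken.
rejected = place_gang(workers, cluster)
print(rejected, cluster)

# A 4-Pod group fits, so all of its Pods are bound together.
small_plan = place_gang(workers[:4], cluster)
print(small_plan, cluster)
```

The point of the design is that the rejected 10-Pod group leaves the cluster state untouched, so a smaller group can still use the capacity.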
Key Concepts:
- PodGroup: Defines the group of Pods that must be scheduled together
- minCount Parameter: Allows “quorum-based” scheduling (e.g., “schedule at least 8 of 10 Pods”)
- Atomic Scheduling: The scheduler evaluates capacity for the entire group before making placement decisions
- Standardization: This native implementation aims to standardize gang scheduling across all Kubernetes distributions
Why This Matters Even If You’re Not Training AI Models
Gang Scheduling isn’t just for AI/ML teams. Here’s where else it solves critical problems:
- Batch Processing Jobs: Spark, Hadoop, or custom batch workloads that require all workers to be present before starting. Without Gang Scheduling, a partially scheduled batch job can hold cluster resources for hours without doing any work.
- StatefulSets with Dependencies: Databases or distributed systems where Pods must start in a specific sequence. Gang Scheduling ensures all replicas are schedulable before binding any.
- HPC Simulations: Scientific computing workloads (weather modeling, fluid dynamics) that require MPI-based coordination across all nodes simultaneously.
- Multi-Tenant Clusters: In shared environments, Gang Scheduling prevents one tenant’s partially scheduled job from blocking another tenant’s smaller, fully schedulable workload.
- Cost Optimization: By preventing partial scheduling, you avoid wasting money on idle resources that can’t do useful work until their peers are scheduled.
How to Enable Gang Scheduling in Kubernetes 1.35
Gang Scheduling is alpha in Kubernetes 1.35, meaning it’s disabled by default. Here’s what you need to enable it:
- Feature Gates: Enable two feature gates on both the API server and scheduler:
  - GenericWorkload: Enables the new Workload API resource
  - GangScheduling: Activates the scheduler plugin for all-or-nothing placement
- API Group Activation: Enable the alpha API group scheduling.k8s.io/v1alpha1.
- Workload Definition: Define a Workload object with PodGroup specifications, including minCount for quorum-based scheduling.
- Pod Association: Pods reference the Workload using workloadRef in their spec.
Important Caveats for Alpha Testing:
- This is NOT production-ready; it’s for testing and feedback only
- APIs may change significantly before beta/stable
- Performance impact on large clusters is still being evaluated
- Integration with other scheduling features (pod affinity, taints/tolerations) may have edge cases
If you’re running AI/ML or batch workloads on Kubernetes, spin up a test cluster with 1.35 and try it out. Your feedback will help shape the beta release in 2026.
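As a rough illustration of the enablement steps in this section, the sketch below assembles the pieces as plain Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The feature gate names, the API group, `minCount`, and `workloadRef` come from the text; the exact field layout (`spec.podGroups[].policy.gang.minCount`, `workloadRef.name`, `workloadRef.podGroup`), the object names, and the container image are assumptions for illustration and should be checked against the 1.35 API reference, since alpha APIs can change.

```python
import json

# Steps 1 and 2: --feature-gates and --runtime-config are existing
# kube-apiserver flags (kube-scheduler accepts --feature-gates as well).
FEATURE_GATES = "--feature-gates=GenericWorkload=true,GangScheduling=true"
RUNTIME_CONFIG = "--runtime-config=scheduling.k8s.io/v1alpha1=true"


def build_workload(name: str, pod_group: str, min_count: int) -> dict:
    """Step 3: a Workload with one PodGroup (field layout is an assumption)."""
    return {
        "apiVersion": "scheduling.k8s.io/v1alpha1",
        "kind": "Workload",
        "metadata": {"name": name},
        "spec": {
            "podGroups": [
                {"name": pod_group, "policy": {"gang": {"minCount": min_count}}}
            ]
        },
    }


def build_worker_pod(index: int, workload: str, pod_group: str) -> dict:
    """Step 4: a Pod that points at the Workload via workloadRef (assumed sub-fields)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"worker-{index}"},
        "spec": {
            "workloadRef": {"name": workload, "podGroup": pod_group},
            "containers": [
                {
                    "name": "trainer",
                    "image": "example.com/trainer:latest",  # hypothetical image
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


# Hypothetical job: 10 workers, schedulable once at least 8 can be placed.
workload = build_workload("training-job", "workers", min_count=8)
pods = [build_worker_pod(i, "training-job", "workers") for i in range(10)]

print(FEATURE_GATES)
print(RUNTIME_CONFIG)
print(json.dumps(workload, indent=2))
print(json.dumps(pods[0], indent=2))
```

Generating the manifests from one function keeps the Workload name and PodGroup name identical across all ten Pods, which is the link the scheduler would rely on under these assumptions.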
FAQ: Gang Scheduling in Kubernetes 1.35
- Is Gang Scheduling production-ready in Kubernetes 1.35?
No. It’s in alpha stage, disabled by default. It’s intended for testing and feedback, not production workloads.
- Does this replace Volcano, Kueue, or other batch schedulers?
Not yet. Those external schedulers offer more mature gang scheduling plus additional features (queue management, resource quotas). However, having native gang scheduling in Kubernetes means these tools can eventually simplify their architectures or focus on higher-level features.
- What’s the performance impact on large clusters?
Still being evaluated. The scheduler now needs to evaluate capacity for entire Workloads instead of individual Pods, which could impact scheduling latency in clusters with thousands of Pods.
- Can I use Gang Scheduling with existing Jobs or Deployments?
Not automatically. You need to define a Workload object and reference it from your Pods using workloadRef. Existing resources won’t use gang scheduling on their own.
- When will this graduate to beta or stable?
Based on typical Kubernetes timelines (roughly three releases per year), expect beta in mid-to-late 2026 (likely Kubernetes 1.37 or 1.38) if alpha feedback is positive. Stable graduation would follow 1-2 releases after beta.
The Bottom Line: Start Testing Now, Deploy in 2026
Gang Scheduling in Kubernetes 1.35 is a long-awaited feature that solves a fundamental problem: the partial scheduling deadlock. While it’s still in alpha and not ready for production, its arrival signals a major shift in how Kubernetes handles tightly coupled, multi-pod workloads.
For AI/ML teams running distributed training on PyTorch, TensorFlow, or Ray, this is a game-changer. For batch processing teams dealing with Spark or Hadoop, this can eliminate hours of wasted cluster capacity. For HPC teams, this brings Kubernetes closer to being a viable alternative to traditional HPC schedulers.
The native implementation means you won’t need to maintain external schedulers just to get all-or-nothing semantics, though mature solutions like Volcano and Kueue will continue to offer value-added features.
What you should do now:
- If you’re running AI/ML or batch workloads: Spin up a Kubernetes 1.35 test cluster and try Gang Scheduling. Your feedback will shape the beta release.
- If you’re using external schedulers: Monitor how your vendor integrates with or adopts the native Workload API.
- If you’re planning 2026 infrastructure: Factor Gang Scheduling into your Kubernetes roadmap. Expect beta in mid-2026, stable by late 2026/early 2027.
The era of cluster deadlocks from partial scheduling is ending. Gang Scheduling in Kubernetes 1.35 is just the beginning, and 2026 is when it is expected to mature toward production readiness.
Ready to test? Grab Kubernetes 1.35, enable the feature gates, and see how all-or-nothing scheduling changes your workload management strategy.