Kubernetes v1.36 DRA: What's New in Dynamic Resource Allocation?

Dynamic Resource Allocation (DRA) continues to evolve in Kubernetes v1.36, introducing significant improvements that make hardware management more flexible and efficient. This release brings several feature graduations, including stable support for prioritized device requests, beta capabilities for extended resources, partitionable devices, device taints, and enhanced scheduling reliability through device binding conditions. Additionally, the ecosystem of drivers has expanded beyond GPUs to include networking and other hardware types, while support for ResourceClaims in PodGroups improves workload orchestration. Whether you're managing large GPU fleets or seeking better failure handling, this release delivers practical enhancements. Below, we answer key questions about these updates.

1. What are the major feature graduations in Kubernetes v1.36 DRA?

Kubernetes v1.36 marks several important milestones for DRA. The prioritized list feature has graduated to stable, allowing workloads to define ordered device preferences (e.g., prefer an H100, fall back to an A100) to improve scheduling flexibility and cluster utilization. Extended resource support has moved to beta, bridging the gap between legacy extended resource requests and DRA by enabling gradual migration without forcing immediate adoption of ResourceClaims. Partitionable devices (beta) allow dynamic carving of hardware such as MIG GPUs into smaller logical units, boosting efficiency. Device taints (beta) give operators granular control over device accessibility, similar to node taints. Finally, device binding conditions (beta) improve scheduling reliability by confirming that a device is ready and compatible before a Pod is bound.

2. How does the Prioritized List feature improve cluster utilization?

The prioritized list feature, now stable, lets you specify a ranked sequence of device preferences when requesting hardware, for example: "give me an NVIDIA H100 if available; otherwise, fall back to an A100." The scheduler evaluates the subrequests in order and selects the most preferred option that is available. This eliminates the need for rigid, device-specific requests, allowing workloads to run even when the ideal hardware is in use. Cluster utilization increases as a result, because devices that might otherwise sit idle, such as older GPU models, are picked up as fallbacks. This flexibility is crucial for managing hardware heterogeneity in large fleets, reducing waste while still meeting performance requirements. Administrators can define fallback strategies per workload or globally, enabling smarter resource allocation without manual intervention.
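A prioritized request is expressed as an ordered list of subrequests in a ResourceClaim. The sketch below illustrates the idea; the `gpu.example.com` device class and the `model` attribute are placeholders for whatever your DRA driver actually publishes:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpu
spec:
  devices:
    requests:
    - name: accelerator
      # Subrequests are evaluated in order; the first one that can be
      # satisfied is allocated.
      firstAvailable:
      - name: h100
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: 'device.attributes["gpu.example.com"].model == "H100"'
      - name: a100
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: 'device.attributes["gpu.example.com"].model == "A100"'
```

If no H100 is free, the scheduler falls through to the A100 subrequest instead of leaving the Pod pending.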

3. What is Extended Resource support in DRA and why is it important?

Extended resource support (beta) in v1.36 lets DRA satisfy requests made via traditional extended resources in a Pod spec. This bridges the gap between legacy resource management and DRA's modern ResourceClaim API. Cluster operators can migrate infrastructure to DRA step by step, continuing to expose common resources like GPUs as extended resources, while application developers adopt ResourceClaims at their own pace. This gradual transition minimizes disruption and lets teams test DRA without overhauling existing manifests. Once extended resources are mapped to DRA drivers, the scheduler can manage them with DRA's advanced features, such as taints and prioritized lists. This feature is vital for large clusters where immediate migration is impractical, ensuring a smooth evolution toward a fully DRA-backed environment.
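Conceptually, the mapping lives on the DeviceClass, so existing Pod manifests keep working unchanged. A minimal sketch, assuming a hypothetical `gpu.example.com` driver and an `example.com/gpu` extended resource name (the exact field name is the beta feature's mapping field and may differ in your cluster's API version):

```yaml
# DeviceClass mapping a legacy extended resource name onto DRA devices.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  # Requests for this extended resource name are translated into
  # DRA allocations against this class.
  extendedResourceName: example.com/gpu
  selectors:
  - cel:
      expression: 'device.driver == "gpu.example.com"'
---
# Existing manifests keep requesting the device the legacy way.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-gpu-pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      limits:
        example.com/gpu: 1
```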

4. How do Partitionable Devices enable better hardware sharing?

With Partitionable devices (beta), DRA natively supports carving high-end accelerators—like NVIDIA's Multi-Instance GPU (MIG) or AMD's MxGPU—into smaller, logical partitions. Administrators can define partition specifications in the device driver, enabling the scheduler to allocate fractions of a single physical device to different Pods. This maximizes utilization of expensive hardware by running multiple compatible workloads simultaneously on one GPU. For example, a single GPU could serve two inference jobs requiring only a slice of memory and compute. The feature automatically prevents overlapping partitions and ensures isolation. This is a game-changer for cost optimization in AI/ML clusters, as it reduces idle capacity and allows finer-grained resource billing without sacrificing performance. Operators can also combine partitionable devices with taints and prioritized lists for advanced scheduling strategies.
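Under the hood, the driver advertises partitions that draw from a shared pool of capacity, so the scheduler can't hand out overlapping slices. A sketch of what a driver-published ResourceSlice might look like, with the driver name, node name, and counter values all invented for illustration:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-1-gpu-0
spec:
  driver: gpu.example.com
  nodeName: node-1
  pool:
    name: node-1
    generation: 1
    resourceSliceCount: 1
  # Shared capacity of the physical GPU that all partitions draw from.
  sharedCounters:
  - name: gpu-0
    counters:
      memory:
        value: 40Gi
  devices:
  - name: gpu-0-mig-1g-5gb
    consumesCounters:
    - counterSet: gpu-0
      counters:
        memory:
          value: 5Gi
  - name: gpu-0-mig-2g-10gb
    consumesCounters:
    - counterSet: gpu-0
      counters:
        memory:
          value: 10Gi
```

Because each partition declares what it consumes from the shared counters, the scheduler will not allocate a combination of partitions that exceeds the physical device's capacity.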

5. What are Device Taints and how do they improve hardware management?

Just as Kubernetes nodes can be tainted to control pod placement, Kubernetes v1.36 introduces Device taints (beta) for individual DRA devices. This feature lets administrators mark specific hardware—like faulty GPUs or high-priority accelerators—with taints that restrict their allocation. Only Pods with matching tolerations can claim tainted devices. Use cases include: isolating experimental hardware to prevent production workloads from accessing it, reserving top-tier accelerators for critical jobs, or marking malfunctioning devices for maintenance without removing them from the cluster. Combined with the Prioritized list feature, operators can direct workloads to fall back to non-tainted devices if preferred ones are unavailable. Device taints give operators fine-grained control over hardware usage, improving reliability by preventing allocation of unstable devices and ensuring critical workloads get the resources they need.
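The toleration side mirrors Pod tolerations for node taints. A minimal sketch, assuming a driver (or operator) has tainted a device with a hypothetical `example.com/maintenance` key, and using the v1 request shape with an `exactly` subrequest:

```yaml
# The driver (or a DeviceTaintRule) has marked the device, e.g.:
#   taints:
#   - key: example.com/maintenance
#     effect: NoSchedule
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: tolerant-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        # Without this toleration, the tainted device would be
        # skipped during allocation.
        tolerations:
        - key: example.com/maintenance
          operator: Exists
          effect: NoSchedule
```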

6. How do Device Binding Conditions enhance scheduling reliability?

The new device binding conditions feature (beta) ensures that a DRA device isn't bound to a Pod until all scheduling constraints are satisfied. Previously, the scheduler might select a device early based on availability, only to fail later due to incompatible node or Pod requirements. Binding conditions postpone the final binding until the scheduler can confirm that the entire resource environment is compatible, including node taints, resource requirements, and DRA-specific constraints. This reduces failed scheduling attempts and wasted device reservations, improving overall cluster efficiency. For example, when using partitionable devices, the feature verifies that the partition parameters match the Pod's needs before committing the device. This is especially valuable in dynamic environments where devices may have temporary restrictions. The result is more reliable Pod placement and less contention for specialized hardware.
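From the driver's side, this works by advertising conditions that must become true before binding completes. A heavily simplified fragment of a device entry in a ResourceSlice, with the driver-specific condition names being pure placeholders:

```yaml
# Fragment of a device entry published by a (hypothetical)
# fabric-attached accelerator driver.
devices:
- name: fabric-gpu-7
  # The scheduler keeps the Pod unbound until the driver reports
  # these conditions as True on the allocated device...
  bindingConditions:
  - example.com/DeviceAttached
  # ...and aborts the allocation (so the Pod can be rescheduled)
  # if a failure condition becomes True instead.
  bindingFailureConditions:
  - example.com/AttachFailed
```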

7. What new driver ecosystem and PodGroup support come with v1.36?

Kubernetes v1.36 extends the DRA driver ecosystem beyond traditional compute accelerators like GPUs. Drivers now support networking hardware (e.g., SmartNICs) and other specialized devices such as FPGAs, enabling DRA to manage a broader range of resources. This reflects the community's push toward hardware-agnostic infrastructure. Additionally, support for ResourceClaims in PodGroups has been enhanced, allowing a group of Pods to collectively claim a set of devices. This is critical for distributed workloads (e.g., multi-GPU training or data pipelines) where all Pods in the group need coordinated access to the same hardware. PodGroup support ensures that the scheduler allocates devices atomically, preventing partial allocations that could stall the entire job. These expansions make DRA a more versatile tool for modern data center and edge deployments, accommodating diverse hardware requirements while simplifying management.
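Coordinated access builds on the existing ability of multiple Pods to reference the same ResourceClaim by name. A minimal sketch, with the device class, count, and image names as illustrative placeholders:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-training-gpus
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: gpu.example.com
        allocationMode: ExactCount
        count: 4
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  resourceClaims:
  - name: gpus
    # Other workers in the group reference the same claim, so they
    # all see the same allocated set of devices.
    resourceClaimName: shared-training-gpus
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpus
```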
