feat: add Intel XPU deployment templates with Kubernetes DRA support (#7464)

Signed-off-by: Yao, Qing <qing.yao@intel.com>

Commit 310f8ca9 (unverified), authored Mar 27, 2026 by Yao Qing; committed by GitHub Mar 27, 2026
3 changed files
--- a/examples/backends/vllm/deploy/README.md
+++ b/examples/backends/vllm/deploy/README.md
@@ -38,6 +38,21 @@ Advanced disaggregated deployment with KV cache routing capabilities.
 ### 5. **Global Planner Deployments** (see [`examples/global_planner/`](../../../global_planner/))
 Centralized scaling across multiple DGDs via GlobalPlanner. Examples include single-endpoint multi-pool and multi-model GPU budget patterns. See the [global planner examples](../../../global_planner/) for details.
+### 6. **Deployments with Intel XPU (Optional)** (`agg_xpu_dra.yaml` or `disagg_xpu_dra.yaml`)
+Hardware-specific aggregated or disaggregated deployment on Intel XPU devices, allocated through Kubernetes Dynamic Resource Allocation (DRA).
+**Aggregated Architecture:**
+- `Frontend`: OpenAI-compatible API server
+- `VllmDecodeWorker`: Single worker with XPU target (`VLLM_TARGET_DEVICE=xpu`)
+- GPU allocation via `ResourceClaimTemplate` and pod-level `resourceClaims`
+**Disaggregated Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `VllmDecodeWorker`: Specialized decode-only worker with XPU target
+- `VllmPrefillWorker`: Specialized prefill-only worker with XPU target
+- GPU allocation via `ResourceClaimTemplate` and pod-level `resourceClaims`
+- Communication via NIXL transfer backend with XPU buffer
 ## CRD Structure
 All templates use the **DynamoGraphDeployment** CRD:
@@ -97,7 +112,7 @@ Before using these templates, ensure you have:
 1. **Dynamo Kubernetes Platform installed** - See [Quickstart Guide](../../../../docs/kubernetes/README.md)
 2. **Kubernetes cluster with GPU support**
-3. **Container registry access** for vLLM runtime images
+3. **Container registry access** for vLLM runtime images. This is optional for the default NGC CUDA images, because `nvcr.io/nvidia/ai-dynamo/*` images are publicly accessible. Intel XPU users should build custom images with `--device xpu`.
 4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
 ### Container Images
@@ -124,6 +139,8 @@ Select the deployment pattern that matches your requirements:
 - Use `disagg.yaml` for maximum performance
 - Use `disagg_router.yaml` for high-performance with KV cache routing
 - Use `disagg_planner.yaml` for SLA-optimized performance
+- Use `agg_xpu_dra.yaml` for aggregated deployment on Intel XPU clusters using Kubernetes DRA
+- Use `disagg_xpu_dra.yaml` for disaggregated deployment on Intel XPU clusters using Kubernetes DRA
 - Use [global planner examples](../../../global_planner/) for centralized scaling across multiple DGDs
 ### 2. Customize Configuration
@@ -162,6 +179,31 @@ export DEPLOYMENT_FILE=agg.yaml
 kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
 ```
+#### Deploy with Intel XPU (Optional)
+If your cluster uses Intel GPU devices via Kubernetes Dynamic Resource Allocation (DRA), ensure:
+- Your Kubernetes cluster is **v1.34+** (required for DRA API v1), and
+- The [Intel XPU Resource Driver](https://github.com/intel/intel-resource-drivers-for-kubernetes) is installed.
+Deploy the XPU template (includes the ResourceClaimTemplate):
+```bash
+cd <dynamo-source-root>/examples/backends/vllm/deploy
+# For aggregated deployment
+kubectl apply -f agg_xpu_dra.yaml -n $NAMESPACE
+# OR for disaggregated deployment
+kubectl apply -f disagg_xpu_dra.yaml -n $NAMESPACE
+```
+Verify claim allocation:
+```bash
+kubectl get resourceclaim -n $NAMESPACE
+kubectl get dynamographdeployment -n $NAMESPACE
+```
+`agg_xpu_dra.yaml` and `disagg_xpu_dra.yaml` are optional hardware-specific templates and do not change the default deployment paths defined by `agg.yaml` and `disagg.yaml`.
 ### 4. Using Custom Dynamo Frameworks Image for vLLM
 To use a custom dynamo frameworks image for vLLM, you can update the deployment file using yq:

--- a/examples/backends/vllm/deploy/agg_xpu_dra.yaml
+++ b/examples/backends/vllm/deploy/agg_xpu_dra.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: resource.k8s.io/v1
+kind: ResourceClaimTemplate
+metadata:
+  name: gpu-template
+spec:
+  spec:
+    devices:
+      requests:
+        - name: gpu
+          exactly:
+            deviceClassName: gpu.intel.com
+            count: 1
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: vllm-agg-xpu-dra
+spec:
+  services:
+    Frontend:
+      envFromSecret: hf-token-secret
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+    VllmDecodeWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      resources:
+        requests:
+          custom:
+            # Increase this value for larger models
+            ephemeral-storage: "2Gi"
+      extraPodSpec:
+        resourceClaims:
+          - name: gpu
+            resourceClaimTemplateName: gpu-template
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          resources:
+            claims:
+              - name: gpu
+          env:
+            - name: VLLM_TARGET_DEVICE
+              value: xpu
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --block-size
+            - "64"
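
Reviewer note: the worker above gets its device through two linked references. The pod-level `resourceClaims` entry names a `ResourceClaimTemplate`, and the container-level `resources.claims` entry names that pod-level claim. The following is a minimal sketch of a consistency check for that wiring. It assumes the manifest has already been loaded into plain Python dictionaries; `check_claim_wiring` is a hypothetical helper for illustration, not part of Dynamo or Kubernetes.

```python
def check_claim_wiring(pod_spec, template_names):
    """Return a list of problems in the DRA claim wiring of one service.

    pod_spec is the service's extraPodSpec as a dict (hypothetical input shape);
    template_names is the set of ResourceClaimTemplate names defined alongside it.
    """
    problems = []

    # Map each pod-level claim name to the template it references.
    pod_claims = {
        claim["name"]: claim.get("resourceClaimTemplateName")
        for claim in pod_spec.get("resourceClaims", [])
    }
    for name, template in pod_claims.items():
        if template not in template_names:
            problems.append(f"pod claim {name!r} references unknown template {template!r}")

    # Every container-level claim must point at a pod-level claim.
    container = pod_spec.get("mainContainer", {})
    for claim in container.get("resources", {}).get("claims", []):
        if claim["name"] not in pod_claims:
            problems.append(f"container claim {claim['name']!r} has no pod-level resourceClaims entry")

    return problems


# Mirrors the VllmDecodeWorker wiring in agg_xpu_dra.yaml.
worker_pod_spec = {
    "resourceClaims": [{"name": "gpu", "resourceClaimTemplateName": "gpu-template"}],
    "mainContainer": {"resources": {"claims": [{"name": "gpu"}]}},
}
print(check_claim_wiring(worker_pod_spec, {"gpu-template"}))
```

A renamed template or a mistyped claim name would show up as a non-empty list here, before the pod is left pending on an unresolvable claim.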
--- a/examples/backends/vllm/deploy/disagg_xpu_dra.yaml
+++ b/examples/backends/vllm/deploy/disagg_xpu_dra.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+apiVersion: resource.k8s.io/v1
+kind: ResourceClaimTemplate
+metadata:
+  name: gpu-template
+spec:
+  spec:
+    devices:
+      requests:
+        - name: gpu
+          exactly:
+            deviceClassName: gpu.intel.com
+            count: 1
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: vllm-disagg-xpu-dra
+spec:
+  services:
+    Frontend:
+      envFromSecret: hf-token-secret
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+    VllmDecodeWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      subComponentType: decode
+      replicas: 1
+      resources:
+        requests:
+          custom:
+            # Increase this value for larger models
+            ephemeral-storage: "2Gi"
+      extraPodSpec:
+        resourceClaims:
+          - name: gpu
+            resourceClaimTemplateName: gpu-template
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          resources:
+            claims:
+              - name: gpu
+          env:
+            - name: VLLM_TARGET_DEVICE
+              value: xpu
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --disaggregation-mode
+            - decode
+            - --block-size
+            - "64"
+    VllmPrefillWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      subComponentType: prefill
+      replicas: 1
+      resources:
+        requests:
+          custom:
+            # Increase this value for larger models
+            ephemeral-storage: "2Gi"
+      extraPodSpec:
+        resourceClaims:
+          - name: gpu
+            resourceClaimTemplateName: gpu-template
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          resources:
+            claims:
+              - name: gpu
+          env:
+            - name: VLLM_TARGET_DEVICE
+              value: xpu
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - Qwen/Qwen3-0.6B
+            - --disaggregation-mode
+            - prefill
+            - --kv-transfer-config
+            - '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"xpu"}'
+            - --block-size
+            - "64"
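
Reviewer note: the prefill worker passes `--kv-transfer-config` as a hand-written JSON string, which is easy to break when editing quotes inside YAML. As a sketch, the same string can be generated from a dictionary so that quoting and key names stay consistent. The helper name `kv_transfer_config` is hypothetical; the keys and values are copied from the manifest above.

```python
import json


def kv_transfer_config(buffer_device="xpu"):
    """Build the compact JSON value used for --kv-transfer-config."""
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_buffer_device": buffer_device,
    }
    # Compact separators reproduce the whitespace-free form used in the manifest.
    return json.dumps(config, separators=(",", ":"))


# Argument list matching the VllmPrefillWorker args in disagg_xpu_dra.yaml.
prefill_args = [
    "--model", "Qwen/Qwen3-0.6B",
    "--disaggregation-mode", "prefill",
    "--kv-transfer-config", kv_transfer_config(),
    "--block-size", "64",
]
print(prefill_args[5])
```

Parsing the result back with `json.loads` is a cheap way to confirm that the string in the manifest is still valid JSON after an edit.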