Commit 310f8ca9 authored by Yao Qing, committed by GitHub

feat: add Intel XPU deployment templates with Kubernetes DRA support (#7464)


Signed-off-by: Yao, Qing <qing.yao@intel.com>
parent a2154ba5
@@ -38,6 +38,21 @@ Advanced disaggregated deployment with KV cache routing capabilities.
### 5. **Global Planner Deployments** (see [`examples/global_planner/`](../../../global_planner/))
Centralized scaling across multiple DGDs via GlobalPlanner. Examples include single-endpoint multi-pool and multi-model GPU budget patterns. See the [global planner examples](../../../global_planner/) for details.
### 6. **Deployments with Intel XPU (Optional)** (`agg_xpu_dra.yaml` or `disagg_xpu_dra.yaml`)

Hardware-specific aggregated and disaggregated deployments for Intel XPU, with devices allocated through Kubernetes Dynamic Resource Allocation (DRA).

**Aggregated Architecture:**

- `Frontend`: OpenAI-compatible API server
- `VllmDecodeWorker`: Single worker with XPU target (`VLLM_TARGET_DEVICE=xpu`)
- GPU allocation via `ResourceClaimTemplate` and pod-level `resourceClaims`

**Disaggregated Architecture:**

- `Frontend`: HTTP API server coordinating between workers
- `VllmDecodeWorker`: Specialized decode-only worker with XPU target
- `VllmPrefillWorker`: Specialized prefill-only worker with XPU target
- GPU allocation via `ResourceClaimTemplate` and pod-level `resourceClaims`
- Communication via the NIXL transfer backend with an XPU KV buffer (`kv_buffer_device` set to `xpu`)

## CRD Structure
All templates use the **DynamoGraphDeployment** CRD:
@@ -97,7 +112,7 @@ Before using these templates, ensure you have:
1. **Dynamo Kubernetes Platform installed** - See [Quickstart Guide](../../../../docs/kubernetes/README.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for vLLM runtime images (optional for the default NGC CUDA images, since `nvcr.io/nvidia/ai-dynamo/*` images are publicly accessible; Intel XPU users should build custom images with `--device xpu`)
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
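
The secret named in item 4 can be created ahead of time. This is a minimal sketch, assuming the token is stored under the key `HF_TOKEN` and that `HF_TOKEN` and `NAMESPACE` are exported in your shell. The helper name `create_hf_token_secret` is hypothetical and not part of Dynamo.

```bash
# Hypothetical helper: creates the secret that the templates reference
# through `envFromSecret: hf-token-secret`.
# Assumption: the token is expected under the key HF_TOKEN.
create_hf_token_secret() {
  if [ -z "${HF_TOKEN:-}" ] || [ -z "${NAMESPACE:-}" ]; then
    echo "HF_TOKEN and NAMESPACE must both be set" >&2
    return 1
  fi
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="$HF_TOKEN" \
    -n "$NAMESPACE"
}
```

Call `create_hf_token_secret` once per namespace; it refuses to run when either variable is empty.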
### Container Images
@@ -124,6 +139,8 @@ Select the deployment pattern that matches your requirements:
- Use `disagg.yaml` for maximum performance
- Use `disagg_router.yaml` for high-performance with KV cache routing
- Use `disagg_planner.yaml` for SLA-optimized performance
- Use `agg_xpu_dra.yaml` for aggregated deployment on Intel XPU clusters using Kubernetes DRA
- Use `disagg_xpu_dra.yaml` for disaggregated deployment on Intel XPU clusters using Kubernetes DRA
- Use [global planner examples](../../../global_planner/) for centralized scaling across multiple DGDs
### 2. Customize Configuration
@@ -162,6 +179,31 @@ export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
```
#### Deploy with Intel XPU (Optional)

If your cluster exposes Intel GPU devices through Kubernetes Dynamic Resource Allocation (DRA), confirm both of the following first:

- Your Kubernetes cluster is **v1.34+** (required for DRA API v1, `resource.k8s.io/v1`).
- The [Intel XPU Resource Driver](https://github.com/intel/intel-resource-drivers-for-kubernetes) is installed.

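
Both prerequisites can be checked from a shell before applying a template. The following is a sketch rather than an official check: the function names are hypothetical, `jq` is assumed to be installed, and `gpu.intel.com` is the device class that the XPU templates request.

```bash
# Returns 0 when a Kubernetes version string such as "v1.34.1" or "1.35"
# is at least 1.34, the release this guide requires for DRA API v1.
k8s_version_supports_dra_v1() {
  local version="${1#v}"          # drop a leading "v"
  local major="${version%%.*}"    # text before the first dot
  local rest="${version#*.}"      # text after the first dot
  local minor="${rest%%[!0-9]*}"  # leading digits only ("34.1" -> "34")
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 34 ]; }
}

# Hypothetical helper: runs the checks against the current kubectl context.
check_xpu_dra_prereqs() {
  local server_version
  server_version="$(kubectl version -o json | jq -r '.serverVersion.gitVersion')"
  if ! k8s_version_supports_dra_v1 "$server_version"; then
    echo "Kubernetes '$server_version' is older than v1.34" >&2
    return 1
  fi
  # The resource.k8s.io group should list resourceclaims and deviceclasses.
  kubectl api-resources --api-group=resource.k8s.io || return 1
  # Device class referenced by the templates; expected to come from the driver.
  kubectl get deviceclass gpu.intel.com
}
```

Run `check_xpu_dra_prereqs` with cluster access; a non-zero exit status means at least one check failed.
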
Deploy the XPU template (includes the ResourceClaimTemplate):
```bash
cd <dynamo-source-root>/examples/backends/vllm/deploy
# For aggregated deployment
kubectl apply -f agg_xpu_dra.yaml -n $NAMESPACE
# OR for disaggregated deployment
kubectl apply -f disagg_xpu_dra.yaml -n $NAMESPACE
```
Verify claim allocation:
```bash
kubectl get resourceclaim -n $NAMESPACE
kubectl get dynamographdeployment -n $NAMESPACE
```
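
To spot claims that are still waiting for a device, the claim list can be reduced to claim name and allocated device. This is a sketch under the assumption that the cluster serves `resource.k8s.io/v1`, where allocated devices appear under `.status.allocation.devices.results`. Both helper names are hypothetical.

```bash
# Prints one "name<TAB>device(s)" line per ResourceClaim in $NAMESPACE.
# The device column is empty while a claim has no allocation.
list_claim_devices() {
  kubectl get resourceclaim -n "$NAMESPACE" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocation.devices.results[*].device}{"\n"}{end}'
}

# Reads "name<TAB>device(s)" lines on stdin and prints the names
# of claims whose device column is empty.
unallocated_claims() {
  awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}
```

`list_claim_devices | unallocated_claims` prints nothing once every claim has a device. Any names it does print belong to claims that have not been allocated yet.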
`agg_xpu_dra.yaml` and `disagg_xpu_dra.yaml` are optional hardware-specific templates and do not change the default deployment paths defined by `agg.yaml` and `disagg.yaml`.
### 4. Using Custom Dynamo Frameworks Image for vLLM
To use a custom dynamo frameworks image for vLLM, you can update the deployment file using yq:
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.intel.com
            count: 1
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg-xpu-dra
spec:
  services:
    Frontend:
      envFromSecret: hf-token-secret
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        requests:
          custom:
            # Increase this value for larger models
            ephemeral-storage: "2Gi"
      extraPodSpec:
        resourceClaims:
          - name: gpu
            resourceClaimTemplateName: gpu-template
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          resources:
            claims:
              - name: gpu
          env:
            - name: VLLM_TARGET_DEVICE
              value: xpu
          workingDir: /workspace/examples/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --block-size
            - "64"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.intel.com
            count: 1
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg-xpu-dra
spec:
  services:
    Frontend:
      envFromSecret: hf-token-secret
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        requests:
          custom:
            # Increase this value for larger models
            ephemeral-storage: "2Gi"
      extraPodSpec:
        resourceClaims:
          - name: gpu
            resourceClaimTemplateName: gpu-template
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          resources:
            claims:
              - name: gpu
          env:
            - name: VLLM_TARGET_DEVICE
              value: xpu
          workingDir: /workspace/examples/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - decode
            - --block-size
            - "64"
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        requests:
          custom:
            # Increase this value for larger models
            ephemeral-storage: "2Gi"
      extraPodSpec:
        resourceClaims:
          - name: gpu
            resourceClaimTemplateName: gpu-template
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          resources:
            claims:
              - name: gpu
          env:
            - name: VLLM_TARGET_DEVICE
              value: xpu
          workingDir: /workspace/examples/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - prefill
            - --kv-transfer-config
            - '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"xpu"}'
            - --block-size
            - "64"