# vLLM Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying vLLM inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with frontend and a single decode worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with KV router mode disabled)
- `VLLMDecodeWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with KV router mode enabled)
- `VLLMDecodeWorker`: Single worker handling both prefill and decode

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend

### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
Advanced disaggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: HTTP API server with KV-aware routing
- `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)

### 5. **Global Planner Deployments** (see [`examples/global_planner/`](../../../global_planner/))
Centralized scaling across multiple DGDs via GlobalPlanner. Examples include single-endpoint multi-pool and multi-model GPU budget patterns. See the [global planner examples](../../../global_planner/) for details.

### 6. **Deployments with Intel XPU (Optional)** (`agg_xpu_dra.yaml` or `disagg_xpu_dra.yaml`)
Hardware-specific aggregated/disaggregated deployment using Kubernetes Dynamic Resource Allocation (DRA).

**Aggregated Architecture:**
- `Frontend`: OpenAI-compatible API server
- `VllmDecodeWorker`: Single worker with XPU target (`VLLM_TARGET_DEVICE=xpu`)
- GPU allocation via `ResourceClaimTemplate` and pod-level `resourceClaims`

**Disaggregated Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `VllmDecodeWorker`: Specialized decode-only worker with XPU target
- `VllmPrefillWorker`: Specialized prefill-only worker with XPU target
- GPU allocation via `ResourceClaimTemplate` and pod-level `resourceClaims`
- Communication via NIXL transfer backend with XPU buffer

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```
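
As a rough illustration of the structure above, the following Python sketch assembles a minimal manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The helper `build_dgd` and the deployment name `my-vllm-agg` are hypothetical, and only fields that appear in this README are used; the shipped templates contain more settings, so treat this as a sketch rather than a replacement for them.

```python
import json


def build_dgd(name, services):
    """Return a DynamoGraphDeployment manifest as a plain dict.

    `services` maps a service name to its configuration block.
    This helper is illustrative only and is not part of Dynamo.
    """
    if not services:
        raise ValueError("at least one service is required")
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoGraphDeployment",
        "metadata": {"name": name},
        "spec": {"services": dict(services)},
    }


# Image and args values are the placeholders used elsewhere in this README.
IMAGE = "my-registry/vllm-runtime:my-tag"

manifest = build_dgd(
    "my-vllm-agg",  # hypothetical deployment name
    {
        # Service names should match the ones in the template you start from.
        "Frontend": {"extraPodSpec": {"mainContainer": {"image": IMAGE}}},
        "VLLMDecodeWorker": {
            "extraPodSpec": {
                "mainContainer": {
                    "image": IMAGE,
                    "args": ["python3", "-m", "dynamo.vllm", "--model", "Qwen/Qwen3-0.6B"],
                }
            }
        },
    },
)

print(json.dumps(manifest, indent=2))
```

Generating the manifest from code is mainly useful when several deployments differ only in a few values; for a single deployment, editing the YAML template directly is simpler.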

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
    image: my-registry/vllm-runtime:my-tag
    workingDir: /workspace/examples/backends/vllm
    args:
      - "python3"
      - "-m"
      - "dynamo.vllm"
      - "--model"
      - "Qwen/Qwen3-0.6B"
      # Optional: Enable prompt embeddings feature
      # - "--enable-prompt-embeds"
      # Other model-specific arguments
```

**Common vLLM Flags:**
- `--enable-prompt-embeds`: Enable prompt embeddings feature
- `--enable-multimodal`: Enable multimodal (vision) support
- `--disaggregation-mode prefill`: Prefill-only mode for disaggregated serving
- `--kv-transfer-config '<json>'`: KV transfer backend configuration (e.g., `'{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`)
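
The JSON value for `--kv-transfer-config` is easy to break with hand-written quoting. One way to avoid that, sketched below, is to generate the string with `json.dumps` and pass it as its own `args` entry. The helper name `kv_transfer_args` is hypothetical; the connector and role values are the ones from the example above.

```python
import json
import shlex


def kv_transfer_args(connector="NixlConnector", role="kv_both"):
    """Return the flag and its JSON value as two separate args entries."""
    config = {"kv_connector": connector, "kv_role": role}
    # Compact separators keep the value free of spaces.
    return ["--kv-transfer-config", json.dumps(config, separators=(",", ":"))]


args = kv_transfer_args()

# The JSON value, ready to paste into a YAML args list as one quoted string.
print(args[1])

# The same two entries quoted for use on a shell command line.
print(shlex.join(args))
```

In a YAML `args` list, no shell is involved, so the JSON string goes in as a single list entry without the extra shell-level single quotes.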

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Kubernetes Platform installed** - See [Quickstart Guide](../../../../docs/kubernetes/README.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for vLLM runtime images (optional if you use the default NGC CUDA images, since `nvcr.io/nvidia/ai-dynamo/*` images are publicly accessible; Intel XPU users should build custom images with `--device xpu`)
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

### Container Images

Public images are available on the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you prefer to use your own registry, build and push your own image:

```bash
python container/render.py --framework=vllm --output-short-filename
docker build -f container/rendered.Dockerfile .
# Tag and push to your container registry
# Update the image references in the YAML files
```

### Pre-Deployment Profiling (SLA Planner Only)

If you use the SLA Planner deployment (`disagg_planner.yaml`), complete pre-deployment profiling first by following the [pre-deployment profiling guide](../../../../docs/components/profiler/profiler-guide.md).

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for simple testing
- Use `agg_router.yaml` for production with load balancing
- Use `disagg.yaml` for maximum performance
- Use `disagg_router.yaml` for high performance with KV cache routing
- Use `disagg_planner.yaml` for SLA-optimized performance
- Use `agg_xpu_dra.yaml` for aggregated deployment on Intel XPU clusters using Kubernetes DRA
- Use `disagg_xpu_dra.yaml` for disaggregated deployment on Intel XPU clusters using Kubernetes DRA
- Use [global planner examples](../../../global_planner/) for centralized scaling across multiple DGDs

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: my-registry/vllm-runtime:my-tag

# Configure your model
args:
  - "--model"
  - "your-org/your-model"
```

### 3. Deploy

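Deploying takes two steps: create a secret for the HuggingFace token, then apply the deployment file. Both commands read the `NAMESPACE` environment variable, so export it first.

First, create a secret for the HuggingFace token.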
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```
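
If you would rather manage the secret declaratively, the sketch below builds a Secret manifest comparable to what the command above creates: an `Opaque` Secret whose `data` values are base64-encoded. The helper name `hf_token_secret` and the namespace `my-namespace` are hypothetical, and the output contains the token in easily reversible form, so it should not be committed to version control.

```python
import base64
import json


def hf_token_secret(token, namespace, name="hf-token-secret"):
    """Build an Opaque Secret manifest holding the token under the HF_TOKEN key."""
    if not token:
        raise ValueError("token must not be empty")
    # Kubernetes expects values under `data` to be base64-encoded.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"HF_TOKEN": encoded},
    }


# Placeholder values; substitute your real token and namespace.
secret = hf_token_secret("your_hf_token", "my-namespace")
print(json.dumps(secret, indent=2))
```

The secret name defaults to `hf-token-secret` because that is the name the templates reference through `envFromSecret`.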

Then deploy the model using the deployment file. `NAMESPACE` must be the namespace you used in your Dynamo Kubernetes Platform installation.

```bash
cd <dynamo-source-root>/examples/backends/vllm/deploy
export DEPLOYMENT_FILE=agg.yaml

kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
```

#### Deploy with Intel XPU (Optional)
If your cluster uses Intel GPU devices via Kubernetes Dynamic Resource Allocation (DRA), ensure:
- Your Kubernetes cluster is **v1.34+** (required for DRA API v1), and
- The [Intel XPU Resource Driver](https://github.com/intel/intel-resource-drivers-for-kubernetes) is installed.

Deploy the XPU template (includes the ResourceClaimTemplate):
```bash
cd <dynamo-source-root>/examples/backends/vllm/deploy

# For aggregated deployment
kubectl apply -f agg_xpu_dra.yaml -n $NAMESPACE

# OR for disaggregated deployment
kubectl apply -f disagg_xpu_dra.yaml -n $NAMESPACE
```

Verify claim allocation:

```bash
kubectl get resourceclaim -n $NAMESPACE
kubectl get dynamographdeployment -n $NAMESPACE
```

`agg_xpu_dra.yaml` and `disagg_xpu_dra.yaml` are optional hardware-specific templates and do not change the default deployment paths defined by `agg.yaml` and `disagg.yaml`.

### 4. Using Custom Dynamo Frameworks Image for vLLM

To use a custom Dynamo framework image for vLLM, update the deployment file with `yq`:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<vllm-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
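
For readers who generate manifests from scripts, the same substitution can be sketched in Python on a manifest that has already been loaded into a dictionary (for example with a YAML parser, which is not shown here). The function name `set_runtime_image` is hypothetical; like the `yq` expression above, it sets the image on every service.

```python
import copy


def set_runtime_image(manifest, image):
    """Return a copy of the manifest with every service's main container image replaced."""
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    for service in updated["spec"]["services"].values():
        # In this sketch, missing extraPodSpec/mainContainer blocks are created.
        container = service.setdefault("extraPodSpec", {}).setdefault("mainContainer", {})
        container["image"] = image
    return updated


# A trimmed-down manifest with placeholder values, for demonstration only.
original = {
    "spec": {
        "services": {
            "Frontend": {"extraPodSpec": {"mainContainer": {"image": "old-image"}}},
            "VLLMDecodeWorker": {"extraPodSpec": {"mainContainer": {"image": "old-image"}}},
        }
    }
}

updated = set_runtime_image(original, "my-registry/vllm-runtime:my-tag")
for name, service in updated["spec"]["services"].items():
    print(name, service["extraPodSpec"]["mainContainer"]["image"])
```

Returning a copy mirrors the `.generated` file in the command above: the original template stays unchanged and can be reused with a different image.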

### 5. Port Forwarding

After deployment, forward the frontend service to access the API:

```bash
kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
```

## Configuration Options

### Environment Variables

To change the `DYN_LOG` level, edit the YAML file by adding:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```
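
When the manifest is produced by a script, the same `spec.envs` entry can be added programmatically. The sketch below assumes the manifest is already a Python dictionary and uses a hypothetical helper, `set_env`, that replaces an existing entry with the same name instead of appending a duplicate.

```python
def set_env(manifest, name, value):
    """Set an entry in spec.envs in place, replacing one with the same name if present."""
    envs = manifest.setdefault("spec", {}).setdefault("envs", [])
    for entry in envs:
        if entry.get("name") == name:
            entry["value"] = value
            return
    envs.append({"name": name, "value": value})


manifest = {"spec": {"services": {}}}
set_env(manifest, "DYN_LOG", "info")
set_env(manifest, "DYN_LOG", "debug")  # overrides the earlier value
print(manifest["spec"]["envs"])
```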

### vLLM Worker Configuration

vLLM workers are configured through command-line arguments. Key parameters include:

- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--disaggregation-mode prefill`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo

See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the full list of configuration options.

## Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
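
The same request can be sent from Python using only the standard library. This is a minimal sketch: the function names are hypothetical, the short prompt stands in for the longer one above, and the response parsing assumes the usual OpenAI-style chat completion shape (`choices[0].message.content`), which this README does not spell out.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=30):
    """Build a non-streaming POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response body; adjust if your response differs.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Say hello.")
print(request.full_url)
# print(send(request))  # uncomment once the port-forward from the previous step is active
```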

## Model Configuration

All templates use **Qwen/Qwen3-0.6B** as the default model, but you can use any vLLM-supported model and configuration arguments.

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Liveness probes**: Check process health regularly
- **KV metrics**: Published via metrics endpoint port
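
Model loading can take a while, so it can help to poll the frontend health endpoint before sending traffic. The sketch below is one way to do that; it assumes the endpoint answers with HTTP 200 once the frontend is healthy, which this README does not state, and the function names, attempt count, and delay are all hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as not healthy.
        return False


def wait_until_healthy(url, attempts=30, delay=10.0, check=is_healthy, sleep=time.sleep):
    """Poll `check(url)` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example usage with the port-forward from the Usage section active:
# ready = wait_until_healthy("http://localhost:8000/health")
```

The `check` and `sleep` parameters exist so the polling logic can be exercised without a running deployment.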

## Request Migration

You can enable [request migration](../../../../docs/fault-tolerance/request-migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:

```yaml
args:
  - "--migration-limit"
  - "3"
```
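
If worker `args` lists are assembled by a script, a small helper can add the flag without duplicating it. This is only a sketch with a hypothetical function name; the meaning of the limit value is described in the request migration document linked above.

```python
def with_migration_limit(args, limit):
    """Return a copy of `args` with --migration-limit set to `limit`."""
    if limit < 0:
        raise ValueError("limit must not be negative")
    result = []
    skip_next = False
    for item in args:
        if skip_next:
            # Drop the value that belonged to a previous --migration-limit.
            skip_next = False
            continue
        if item == "--migration-limit":
            skip_next = True
            continue
        result.append(item)
    # Entries in args are strings, so the number is converted explicitly.
    return result + ["--migration-limit", str(limit)]


worker_args = ["python3", "-m", "dynamo.vllm", "--model", "Qwen/Qwen3-0.6B"]
print(with_migration_limit(worker_args, 3))
```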

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create-deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation-guide.md)
- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/components/planner/planner-guide.md)
- **Global Planner**: [Global Planner Deployment Guide](../../../../docs/components/planner/global-planner.md)
- **Examples**: [Deployment Examples](../../../../docs/getting-started/examples.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design-docs/disagg-serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/kubernetes/README.md).