"csrc/vscode:/vscode.git/clone" did not exist on "d5c4800112c12bbcd4955858ef1b415c16ae16e7"
README.md 9.77 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# TensorRT-LLM Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying TensorRT-LLM inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with frontend and a single worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
- `TRTLLMWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with kv router mode enabled)
- `TRTLLMWorker`: Multiple workers handling both prefill and decode (2 replicas for load balancing)

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker

### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
Advanced disaggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: HTTP API server (with kv router mode enabled)
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker (2 replicas for load balancing)

37
38
39
40
41
42
43
44
### 5. **Aggregated Deployment with Config** (`agg-with-config.yaml`)
Aggregated deployment with custom configuration.

**Architecture:**
- `nvidia-config`: ConfigMap containing a custom trtllm configuration
- `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
- `TRTLLMWorker`: Single worker handling both prefill and decode with custom configuration mounted from the configmap

45
46
47
48
49
50
51
52
53
54
55
### 6. **Disaggregated Planner Deployment** (`disagg_planner.yaml`)
Advanced disaggregated deployment with SLA-based automatic scaling.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `Planner`: SLA-based planner that monitors performance and scales workers automatically
- `Prometheus`: Metrics collection and monitoring
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker

> [!NOTE]
56
> This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/benchmarks/sla_driven_profiling.md) for detailed instructions.
57

58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
92
    image: my-registry/tensorrtllm-runtime:my-tag
93
    workingDir: /workspace/examples/backends/trtllm
94
95
96
97
98
99
100
101
102
103
104
    args:
      - "python3"
      - "-m"
      - "dynamo.trtllm"
      # Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

105
1. **Dynamo Kubernetes Platform installed** - See [Quickstart Guide](../../../../docs/kubernetes/README.md)
106
107
108
109
110
111
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for TensorRT-LLM runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

### Container Images

112
The deployment files currently require access to `my-registry/tensorrtllm-runtime`. If you don't have access, build and push your own image:
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143

```bash
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files
```

**Note:** TensorRT-LLM uses git-lfs, which needs to be installed in advance:
```bash
apt-get update && apt-get -y install git git-lfs
```

For ARM machines, use:
```bash
./container/build.sh --framework tensorrtllm --platform linux/arm64
```

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for simple testing
- Use `agg_router.yaml` for production with KV cache routing and load balancing
- Use `disagg.yaml` for maximum performance with separated workers
- Use `disagg_router.yaml` for high-performance with KV cache routing and disaggregation

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
144
image: my-registry/tensorrtllm-runtime:my-tag
145
146
147
148
149
150
151
152
153
154
155

# Configure your model and deployment settings
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  # Add your model-specific arguments
```

### 3. Deploy

156
See the [Create Deployment Guide](../../../../docs/kubernetes/deployment/create_deployment.md) to learn how to deploy the deployment file.
157
158
159
160
161
162
163
164
165
166
167

First, create a secret for the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

168
Export the NAMESPACE you used in your Dynamo Kubernetes Platform Installation.
169
170

```bash
171
cd dynamo/examples/backends/trtllm/deploy
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
```

### 4. Using Custom Dynamo Frameworks Image for TensorRT-LLM

To use a custom dynamo frameworks image for TensorRT-LLM, you can update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```

### 5. Port Forwarding

After deployment, forward the frontend service to access the API:

```bash
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
```

## Configuration Options

### Environment Variables

To change `DYN_LOG` level, edit the yaml file by adding:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```

### TensorRT-LLM Worker Configuration

TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:

- **KV Cache Transfer**: Choose between UCX (default) or NIXL for disaggregated serving
- **Request Migration**: Enable graceful failure handling with `--migration-limit`

## Testing the Deployment

220
Send a test request to verify your deployment. See the [client section](../../../../docs/backends/vllm/README.md#client) for detailed instructions.
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241

**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.

## Model Configuration

The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Worker health endpoints**: `http://<worker-service>:9090/health`
- **Liveness probes**: Check process health every 5 seconds
- **Readiness probes**: Check service readiness with configurable delays

## KV Cache Transfer Methods

TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:

- **UCX** (default): Standard method for KV cache transfer
- **NIXL** (experimental): Alternative transfer method

242
For detailed configuration instructions, see the [KV cache transfer guide](../../../../docs/backends/trtllm/kv-cache-transfer.md).
243
244
245

## Request Migration

246
You can enable [request migration](../../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:
247
248
249
250
251
252
253
254
255
256
257
258

```yaml
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  - "--migration-limit"
  - "3"
```

## Benchmarking

259
To benchmark your deployment with AIPerf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh)
260
261
262
263
264

Configure the `model` name and `host` based on your deployment.

## Further Reading

265
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md)
266
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
267
- **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md)
268
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
269
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md)
270
271
- **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
272
273
274
275
276
277
278
279
280
281
282
283
284
285
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command
6. **Git LFS issues**: Ensure git-lfs is installed before building containers
7. **ARM deployment**: Use `--platform linux/arm64` when building on ARM machines

286
For additional support, refer to the [deployment troubleshooting guide](../../../../docs/kubernetes/README.md).