## Inference Gateway Setup with Dynamo

When integrating Dynamo with the Inference Gateway, you must use the custom Dynamo EPP image.

The custom Dynamo EPP image integrates the Dynamo router directly into the gateway's endpoint picker. Using the `dyn-kv` plugin, it selects the optimal worker based on KV cache state and the tokenized prompt before routing the request. The integration moves intelligent routing upstream to the gateway layer.

EPP's default KV-routing approach is not token-aware because the prompt is not tokenized. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).

The Dynamo integration with the Inference Gateway supports aggregated and disaggregated serving.
If you want to use LoRA, deploy Dynamo without the Inference Gateway, or use the BlackBox approach with the Inference Gateway.

Currently, these setups are only supported with the kGateway-based Inference Gateway.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation Steps](#installation-steps)
  - [1. Install Dynamo Platform](#1-install-dynamo-platform)
  - [2. Deploy Inference Gateway](#2-deploy-inference-gateway)
  - [3. Setup secrets](#3-setup-secrets)
  - [4. Build EPP image (Optional)](#4-build-epp-image-optional)
  - [5. Deploy](#5-deploy)
  - [6. Verify Installation](#6-verify-installation)
  - [7. Usage](#7-usage)
  - [8. Deleting the installation](#8-deleting-the-installation)
- [Gateway API Inference Extension Details](#gateway-api-inference-extension-integration)
  - [Router bookkeeping operations](#router-bookkeeping-operations)
  - [Header Routing Hints](#header-routing-hints)


## Prerequisites

- Kubernetes cluster with kubectl configured
- NVIDIA GPU drivers installed on worker nodes

## Installation Steps

### 1. Install Dynamo Platform ###

See the [Quickstart Guide](../../docs/kubernetes/README.md) to install the Dynamo Kubernetes Platform.

### 2. Deploy Inference Gateway ###

First, deploy an inference gateway service. In this example, we'll install the `kgateway`-based gateway implementation.

```bash
cd deploy/inference-gateway
export NAMESPACE=my-model # You can put the inference gateway into another namespace and then adjust your http-route.yaml
./scripts/install_gaie_crd_kgateway.sh
```

**Note**: The manifest at `config/manifests/gateway/kgateway/gateway.yaml` uses `gatewayClassName: agentgateway`, but kGateway's Helm chart creates a GatewayClass named `kgateway`. The patch command in the script fixes this mismatch.

#### Verify the Gateway is running

```bash
kubectl get gateway inference-gateway

# Sample output
# NAME                CLASS      ADDRESS   PROGRAMMED   AGE
# inference-gateway   kgateway             True         1m
```

### 3. Setup secrets ###

Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy the `Qwen/Qwen3-0.6B` model in aggregated mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in the `my-model` Kubernetes namespace.

Make sure to enable KV routing by adding the following env var to the FrontEnd:

```yaml
    mainContainer:
      image: ...
      env:
        - name: DYN_ROUTER_MODE
          value: "kv"
```

Sample commands to deploy the model:

```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```

Take note of, or change, the `DYNAMO_IMAGE` in the model deployment file.

Do not forget to create the Docker registry secret if needed.

```bash
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=$DOCKER_SERVER \
  --docker-username=$DOCKER_USERNAME \
  --docker-password=$DOCKER_PASSWORD \
  --namespace=$NAMESPACE
```

Do not forget to include the HuggingFace token if required.

```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

Create a model configuration file for your model, similar to `vllm_agg_qwen.yaml`.
That file demonstrates the values needed for the vLLM aggregated setup in [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml).
Take note of the model's block size provided in the model card.

### 4. Build EPP image (Optional)

You can either use the provided Dynamo FrontEnd image as the EPP image or build your own custom Dynamo EPP image by following the steps below.

```bash
# Export env vars
export DOCKER_SERVER=ghcr.io/nvidia/dynamo # Container registry
export IMAGE_TAG=YOUR-TAG # Or auto from git tag
cd deploy/inference-gateway/epp
make all # Do everything in one command
# or make all-push to also push

# Or step-by-step
make dynamo-lib # Build Dynamo library and copy to project
make image-load # Build Docker image and load locally
make image-push # Build and push to registry
make info # Check image tag
```

#### All-in-one Targets

| Target | Description |
|--------|-------------|
| `make dynamo-lib` | Build Dynamo static library and copy to project |
| `make all` | Build Dynamo lib + Docker image + load locally |
| `make all-push` | Build Dynamo lib + Docker image + push to registry |

### 5. Deploy

We recommend deploying the Inference Gateway's Endpoint Picker (EPP) as a component managed by the Dynamo operator. Alternatively,
you can deploy it as a standalone pod.

#### 5.a. Deploy as a DGD component

```bash
kubectl apply -f operator-managed/examples/agg.yaml -n ${NAMESPACE}
kubectl apply -f operator-managed/examples/http-route.yaml -n ${NAMESPACE}
```

**Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example,
to allow 60 minutes for startup:

```yaml
extraPodSpec:
  mainContainer:
    startupProbe:
      failureThreshold: 360  # 10s × 360 = 60 minutes
```
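
The arithmetic behind `failureThreshold` can be sketched in shell. This is only an illustration: the helper name `failure_threshold` is hypothetical, and the 10-second default is the probe period quoted above, so adjust it if your probe uses a different period.

```shell
# Hypothetical helper (not part of Dynamo): compute the startupProbe
# failureThreshold needed to allow a given number of minutes.
# Usage: failure_threshold <minutes> [period_seconds]
failure_threshold() {
  minutes=$1
  period_seconds=${2:-10}  # assumed probe period; 10s matches the text above
  # Round up so the allowed startup time is never shorter than requested.
  echo $(( (minutes * 60 + period_seconds - 1) / period_seconds ))
}

failure_threshold 30  # prints 180 (the default quoted above)
failure_threshold 60  # prints 360 (the 60-minute example)
```

The printed value is what you would place in `failureThreshold` for the desired startup budget.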

**Gateway Namespace**
Note that this assumes your gateway is installed into `NAMESPACE=my-model` (the examples' default).
If you installed it into a different namespace, adjust the HTTPRoute entry in `http-route.yaml`.


#### 5.b. Deploy as a standalone pod

##### 5.b.1 Deploy Your Model ###

Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy the `Qwen/Qwen3-0.6B` model in aggregated mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in the `my-model` Kubernetes namespace.

Sample commands to deploy model:

```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```

Take note of, or change, the `DYNAMO_IMAGE` in the model deployment file.

Do not forget to create the Docker registry secret if needed.

##### 5.b.2 Install Dynamo GIE helm chart ###

```bash
cd deploy/inference-gateway/standalone

# Export the EPP image - use the Dynamo FrontEnd image or build your own EPP image (see section 4)
export EPP_IMAGE=<the-epp-image>
```

```bash
helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set-string extension.image=$EPP_IMAGE
```

By default, the Kubernetes discovery mechanism is used. If you prefer etcd, please use the `--set epp.dynamo.useEtcd=true` flag below.

```bash
helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set-string extension.image=$EPP_IMAGE --set epp.dynamo.useEtcd=true
```
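
If you switch between the two discovery mechanisms often, the two commands above can be folded into one small wrapper. The sketch below only prints the command for review; the function name `gaie_helm_command` and the `USE_ETCD` variable are hypothetical names introduced here, not part of the chart or its scripts.

```shell
# Hypothetical wrapper: build the helm command shown above, optionally
# appending the etcd flag. It prints the command instead of running it.
# Usage: gaie_helm_command <epp-image> [true|false]
gaie_helm_command() {
  cmd="helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set-string extension.image=$1"
  if [ "${2:-false}" = "true" ]; then
    cmd="$cmd --set epp.dynamo.useEtcd=true"
  fi
  printf '%s\n' "$cmd"
}

# USE_ETCD is an assumption of this sketch; it defaults to Kubernetes discovery.
gaie_helm_command "${EPP_IMAGE:-<the-epp-image>}" "${USE_ETCD:-false}"
```

Review the printed command, then run it from `deploy/inference-gateway/standalone` as in the examples above.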

Key configurations include:

- An InferenceModel resource for the Qwen model
- A service for the inference gateway
- Required RBAC roles, bindings, and permissions
- `dynamoGraphDeploymentName`: the name of the Dynamo Graph where your model is deployed.


**Configuration**
You can configure the plugin by setting environment variables in the EPP component of your DGD for the operator-managed installation, or in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) for the standalone installation.

Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative, meaning this check is disabled.
- Set `DYN_ENFORCE_DISAGG=true` to enforce that every request is served in disaggregated mode. By default it is `false`, meaning that if the prefill worker is not available, the request is served in aggregated mode.
- By default the Dynamo plugin uses KV routing. Set `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer round-robin routing.
- If using KV routing:
  - Overwrite `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to stop the workers from sending KV events while using KV routing.
  - See the [Router Guide](../../docs/components/router/router_guide.md) for details.
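
Because `DYN_KV_BLOCK_SIZE` is described as mandatory, it can help to check it before deploying. The following is a minimal pre-flight sketch, not part of the Dynamo tooling: the function name `check_block_size` is hypothetical, and it only checks that the value is a positive integer, not that it matches your model.

```shell
# Hypothetical pre-flight check: succeed only if the argument looks like a
# positive integer. It cannot tell whether the value matches the model.
# Usage: check_block_size "$DYN_KV_BLOCK_SIZE"
check_block_size() {
  case "${1:-}" in
    ''|*[!0-9]*)
      echo "DYN_KV_BLOCK_SIZE must be a positive integer, got: '${1:-}'" >&2
      return 1
      ;;
  esac
  if [ "$1" -le 0 ]; then
    echo "DYN_KV_BLOCK_SIZE must be greater than zero" >&2
    return 1
  fi
}

# The value 16 is only an example; use the block size from your model card.
check_block_size 16 && echo "block size looks valid"
```

Comparing the value against the block size from the model card remains a manual step.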


Standalone installation only:
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's Dynamo namespace.

### 6. Verify Installation ###

Check that all resources are properly deployed:

```bash
kubectl get inferencepool
kubectl get httproute
kubectl get service
kubectl get gateway
```

Sample output:

```bash
# kubectl get inferencepool
NAME        AGE
qwen-pool   33m

# kubectl get httproute
NAME        HOSTNAMES   AGE
qwen-route               33m
```

### 7. Usage ###

The Inference Gateway provides HTTP endpoints for model inference.

#### 1: Populate gateway URL for your k8s cluster ####

To test the gateway in minikube, use one of the following options.

a. Use `minikube tunnel` to expose the gateway to the host.
   This requires `sudo` access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host as shown in alternative (b).

```bash
# in first terminal
ps aux | grep "minikube tunnel" | grep -v grep # make sure minikube tunnel is not already running.
minikube tunnel # start the tunnel

# in second terminal where you want to send inference requests
GATEWAY_URL=$(kubectl get svc inference-gateway -n my-model -o jsonpath='{.spec.clusterIP}') && echo $GATEWAY_URL
```

b. Use port-forward to expose the gateway to the host.

```bash
# in first terminal
kubectl port-forward svc/inference-gateway 8000:80 -n my-model

# in second terminal where you want to send inference requests
GATEWAY_URL=http://localhost:8000
```
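
The two options leave `GATEWAY_URL` in different shapes: option (a) yields a bare IP address, while option (b) yields a full `http://` URL. `curl` accepts both, but other clients may require a scheme. A minimal sketch for normalizing the value follows; the helper name `normalize_gateway_url` is hypothetical, and plain HTTP is assumed, as in the examples above.

```shell
# Hypothetical helper: prepend http:// when the value has no scheme.
# Assumes the gateway is served over plain HTTP, as in the examples above.
normalize_gateway_url() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *)                  printf 'http://%s\n' "$1" ;;
  esac
}

# Falls back to the port-forward address from option (b) when unset.
GATEWAY_URL=$(normalize_gateway_url "${GATEWAY_URL:-localhost:8000}")
echo "$GATEWAY_URL"
```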

#### 2: Check models deployed to inference gateway ####

a. Query models:

```bash
# in the second terminal where your GATEWAY_URL is set

curl $GATEWAY_URL/v1/models | jq .
```

Sample output:

```json
{
  "data": [
    {
      "created": 1753768323,
      "id": "Qwen/Qwen3-0.6B",
      "object": "object",
      "owned_by": "nvidia"
    }
  ],
  "object": "list"
}
```

b. Send inference request to gateway:

```bash
MODEL_NAME="Qwen/Qwen3-0.6B"
curl $GATEWAY_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "'"${MODEL_NAME}"'",
      "messages": [
      {
          "role": "user",
          "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
      ],
      "stream":false,
      "max_tokens": 30,
      "temperature": 0.0
    }'
```
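
For repeated tests with different prompts, the request body can be generated by a small function instead of inline quoting. This is a sketch under stated assumptions: the function name `build_chat_payload` is hypothetical, and it performs no JSON escaping, so it only works for prompts without double quotes, backslashes, or newlines.

```shell
# Hypothetical helper: build a chat completion payload like the one above.
# No JSON escaping is done; the prompt must not contain quotes or backslashes.
# Usage: build_chat_payload <model> <prompt> [max_tokens]
build_chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "stream": false, "max_tokens": %s, "temperature": 0.0}\n' \
    "$1" "$2" "${3:-30}"
}

build_chat_payload "Qwen/Qwen3-0.6B" "Hello"

# To send it (requires a reachable gateway):
# curl "$GATEWAY_URL/v1/chat/completions" \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload "$MODEL_NAME" "Hello")"
```

For prompts with arbitrary characters, a JSON-aware tool such as `jq` is a safer way to build the body.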

Sample inference output:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "audio": null,
        "content": "<think>\nOkay, I need to develop a character background for the user's query. Let me start by understanding the requirements. The character is an",
        "function_call": null,
        "refusal": null,
        "role": "assistant",
        "tool_calls": null
      }
    }
  ],
  "created": 1753768682,
  "id": "chatcmpl-772289b8-5998-4f6d-bd61-3659b684b347",
  "model": "Qwen/Qwen3-0.6B",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 29,
    "completion_tokens_details": null,
    "prompt_tokens": 196,
    "prompt_tokens_details": null,
    "total_tokens": 225
  }
}
```

### 8. Deleting the installation ###

If you need to uninstall, run:

```bash
kubectl delete dynamoGraphDeployment vllm-agg
helm uninstall dynamo-gaie -n my-model

# To uninstall GAIE
# 1. Delete the inference-gateway
kubectl delete gateway inference-gateway --ignore-not-found

# 2. Uninstall kgateway helm releases
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system

# 3. Delete the kgateway-system namespace (optional, cleans up everything in it)
kubectl delete namespace kgateway-system --ignore-not-found

# 4. Delete the Inference Extension CRDs
IGW_LATEST_RELEASE=v1.2.1
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${IGW_LATEST_RELEASE}/manifests.yaml --ignore-not-found

# 5. Delete the Gateway API CRDs
GATEWAY_API_VERSION=v1.4.1
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api/releases/download/$GATEWAY_API_VERSION/standard-install.yaml --ignore-not-found
```

## Gateway API Inference Extension Integration

This section documents the updated plugin implementation for Gateway API Inference Extension **v1.2.1**.

### Router bookkeeping operations

The EPP performs Dynamo router bookkeeping operations so the FrontEnd's router does not have to sync its state.


### Header Routing Hints

Since v1.2.1, the EPP uses a **header-only approach** for communicating routing decisions.
The plugins set HTTP headers that are forwarded to the backend workers.

#### Headers Set by Dynamo Plugins

| Header | Description | Set By |
|--------|-------------|--------|
| `x-worker-instance-id` | Primary worker ID (decode worker in disagg mode) | kv-aware-scorer |
| `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer |