@@ -63,26 +63,6 @@ kubectl get gateway inference-gateway
...
### 3. Setup secrets ###
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Make sure to enable KV routing by setting the `DYN_ROUTER_MODE` environment variable on the FrontEnd.
```yaml
mainContainer:
  image: ...
  env:
    - name: DYN_ROUTER_MODE
      value: "kv"
```
Sample commands to deploy model:
```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
Note the `DYNAMO_IMAGE` value in the model deployment file, or change it as needed.
- When using GAIE, the FrontEnd does not choose the workers; the EPP determines the routing.
- You must set the `--router-mode direct` flag on the FrontEnd CLI, as shown below.
```yaml
command:
  - python3
args:
  - -m
  - dynamo.frontend
  - --router-mode
  - direct
```
- The pre-selected workers (decode and prefill, in the case of disaggregated serving) are passed in the request headers.
- The flag ensures that routing respects this selection.
**Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example,
to allow 60 minutes for startup:
...
@@ -166,6 +167,18 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
...
##### 5.b.1 Deploy Your Model ###
We provide an example for Qwen with vLLM below.
Before deploying, you must set the `--router-mode direct` flag on the FrontEnd CLI in your Dynamo Graph.
```yaml
command:
  - python3
args:
  - -m
  - dynamo.frontend
  - --router-mode
  - direct
```
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Sample commands to deploy model:
...
@@ -176,10 +189,6 @@ cd examples/backends/vllm/deploy
...
kubectl apply -f agg.yaml -n my-model
```
Note the `DYNAMO_IMAGE` value in the model deployment file, or change it as needed.
Do not forget to create the Docker registry secret if needed.
##### 5.b.2 Install Dynamo GIE helm chart ###
```bash
...
@@ -214,14 +223,14 @@ You can configure the plugin by setting environment variables in the EPP compone
...
Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative, meaning this check is disabled.
- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce that every request is served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available, the request is served in the aggregated manner.
- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- If using kv-routing:
- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Override `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true).
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- `DYN_ROUTER_TEMPERATURE`: temperature for worker sampling via softmax (default: 0.0).
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
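
To make the interplay between the busy threshold and the router temperature concrete, here is a minimal Python sketch. It is not the EPP plugin's implementation. The function name `pick_worker`, the score and load dictionaries, and the "higher score is better" convention are all assumptions made for illustration. It only mirrors the behavior described above: a negative threshold disables the busy check, a temperature of 0 picks the top candidate deterministically, and a higher temperature lets lower-scoring workers through more often.

```python
import math
import random


def pick_worker(scores, loads, busy_threshold=-1.0, temperature=0.0, rng=random):
    """Illustrative worker selection (hypothetical helper, not Dynamo's API).

    scores: dict of worker id -> score, higher is better (assumption).
    loads:  dict of worker id -> load in the same units as busy_threshold.
    """
    # A negative threshold means the busy check is disabled.
    if busy_threshold >= 0:
        candidates = {w: s for w, s in scores.items() if loads[w] <= busy_threshold}
    else:
        candidates = dict(scores)

    if not candidates:
        return None  # every worker is over the threshold

    # Temperature 0: deterministic choice of the best remaining candidate.
    if temperature <= 0.0:
        return max(candidates, key=candidates.get)

    # Temperature > 0: softmax sampling. Subtracting the top score keeps
    # exp() numerically stable without changing the probabilities.
    top = max(candidates.values())
    weights = [math.exp((s - top) / temperature) for s in candidates.values()]
    return rng.choices(list(candidates), weights=weights, k=1)[0]
```

With scores `{"a": 3.0, "b": 2.0, "c": 1.0}` and loads `{"a": 0.95, "b": 0.5, "c": 0.1}`, the default call returns `"a"`, while `busy_threshold=0.9` skips the overloaded `"a"` and falls back to `"b"`, the next best candidate.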