Create a model configuration file for your model, similar to `vllm_agg_qwen.yaml`.
That file shows the values needed for the vLLM aggregated setup in [agg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/deploy/agg.yaml).
Take note of the model's block size provided in the model card.
### 4. Build EPP image (Optional)
You can either use the provided Dynamo FrontEnd image as the EPP image or build your own custom Dynamo EPP image by following the steps below.
...
@@ -124,6 +120,7 @@ For the HttpRoute service make sure to specify the namespace where your gateway
```bash
cd <dynamo-source-root>
# kubectl get httproutes -n my-model # Make sure you do not have an incompatible HttpRoute running, delete if so.
We provide examples for llama-3-70b with vLLM under `recipes/llama-3-70b/vllm/agg/gaie/` for aggregated serving and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/` for disaggregated serving.
Note that for aggregated serving you need to disable `DYN_ENFORCE_DISAGG` in the EPP config:
```yaml
- name: DYN_ENFORCE_DISAGG
  value: "false"
```
Use the proper folder in commands below.
```bash
...
@@ -143,7 +145,7 @@ Use the proper folder in commands below.
You can configure the plugin by setting environment variables in the EPP component of your DGD for an operator-managed installation, or in your [values.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/inference-gateway/standalone/helm/dynamo-gaie/values.yaml) for a standalone Helm installation.
Common Vars for Routing Configuration:
**Enabling KV-Aware Routing (most precise)**
KV-aware routing uses live KV cache block events from workers so the EPP can route requests to the worker with the best prefix cache overlap. To enable it (default):
1. **Workers: enable prefix caching and KV event publishing.** Each worker must publish KV cache events to the event plane (NATS/ZMQ) so the EPP's router can track per-worker cache state.
   - **vLLM:** Pass `--enable-prefix-caching` and `--kv-events-config '{"enable_kv_cache_events":true}'`.
   - **SGLang:** Pass `--kv-events-config` with the appropriate endpoint.
   - **Disaggregated vLLM (prefill/decode separation):** Do **not** pass `--disaggregation-mode decode` on decode workers, because this flag hardcodes KV event publishing to off. Instead, omit the flag (defaults to aggregated mode) so decode workers also publish their cache state.
2. **EPP: leave `DYN_USE_KV_EVENTS` at its default (`true`).** The EPP subscribes to worker KV events via the event plane (NATS/ZMQ) and uses them for prefix-overlap scoring.
3. **Block size: must be consistent.** The `--block-size` on all workers must match `DYN_KV_CACHE_BLOCK_SIZE` on the EPP (default: 128). Mismatched block sizes cause incorrect block hash computation.
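The block-size requirement is easy to get wrong, so a quick pre-deployment check may help. The sketch below is only an illustration: `check_block_size` is a hypothetical helper, not part of Dynamo, and it assumes you know the `--block-size` value you pass to your workers and that `DYN_KV_CACHE_BLOCK_SIZE` is exported in your shell with the value you intend to set on the EPP (falling back to the default of 128 stated above).

```bash
# Hypothetical helper (not part of Dynamo): compare the worker --block-size
# with the DYN_KV_CACHE_BLOCK_SIZE value intended for the EPP.
check_block_size() {
  local worker_block_size="$1"
  # Fall back to 128, the EPP default mentioned above, when the variable is unset.
  local epp_block_size="${DYN_KV_CACHE_BLOCK_SIZE:-128}"

  if [ "$worker_block_size" -eq "$epp_block_size" ]; then
    echo "ok: worker and EPP both use block size $worker_block_size"
  else
    echo "mismatch: worker --block-size=$worker_block_size, EPP DYN_KV_CACHE_BLOCK_SIZE=$epp_block_size" >&2
    return 1
  fi
}
```

For example, `check_block_size 128` succeeds when `DYN_KV_CACHE_BLOCK_SIZE` is unset, while `DYN_KV_CACHE_BLOCK_SIZE=64 check_block_size 128` reports a mismatch and returns a non-zero status.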
**Disabling KV-Aware Routing**
To disable the EPP from listening for KV events (e.g., when prefix caching is off on workers, or for simpler load-balanced routing):
1. **EPP:** Set `DYN_USE_KV_EVENTS=false`. The router falls back to approximate mode (routing decisions are tracked locally with TTL decay instead of live KV events from workers).
2. **Workers:** Pass `--no-enable-prefix-caching` to disable prefix caching entirely. Without prefix caching, no KV events are generated regardless of other flags.
3. **Optionally** set `DYN_OVERLAP_SCORE_WEIGHT=0` on the EPP to skip prefix-overlap scoring altogether, making the router select workers based on load only.
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
- Set `DYN_ENFORCE_DISAGG=true` to strictly enforce disaggregated mode. When enabled, requests fail if prefill workers have not registered yet. Without this, requests arriving before prefill workers are discovered fall through to decode-only routing. Prefill errors always fail requests regardless of this setting.
- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYN_USE_KV_EVENTS=false` if you want to disable the router listening for KV events while using KV routing (default: true). SGLang workers require `--kv-events-config` and TRT-LLM workers require `--publish-events-and-metrics` to publish KV events. For vLLM, KV events are auto-configured when prefix caching is active, but this behavior is deprecated; pass `--kv-events-config` explicitly.
- `DYN_ROUTER_TEMPERATURE`: Temperature for worker sampling via softmax (default: 0.0)
- `DYN_ROUTER_TRACK_ACTIVE_BLOCKS`: Track active blocks (default: true)
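To build intuition for how the temperature setting changes worker selection, the sketch below computes softmax selection probabilities for a list of per-worker scores. It is an illustration only: `softmax_probs` is a hypothetical helper, the convention that a higher score means a better worker is an assumption made for this example, and the router's actual scoring and sampling code may differ.

```bash
# Hypothetical helper: print softmax selection probabilities for worker scores.
# Usage: softmax_probs TEMPERATURE SCORE [SCORE...]
softmax_probs() {
  local temperature="$1"
  shift
  printf '%s\n' "$@" | awk -v t="$temperature" '
    # Remember every score and the index of the first maximum.
    { score[NR] = $1; if (NR == 1 || $1 > max) { max = $1; best = NR } }
    END {
      for (i = 1; i <= NR; i++) {
        if (t <= 0) {
          # Temperature 0: all probability goes to the top-scoring worker.
          p[i] = (i == best) ? 1 : 0
        } else {
          # Subtract the maximum before exponentiating for numerical stability.
          p[i] = exp((score[i] - max) / t)
          sum += p[i]
        }
      }
      if (t <= 0) sum = 1
      for (i = 1; i <= NR; i++) printf "%s%.3f", (i > 1 ? " " : ""), p[i] / sum
      print ""
    }'
}
```

With scores `1 2 3`, `softmax_probs 0 1 2 3` prints `0.000 0.000 1.000` (deterministic choice of the best worker), while `softmax_probs 1 1 2 3` prints `0.090 0.245 0.665`, so lower-scoring workers are still selected some of the time.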
...
@@ -264,9 +287,8 @@ The Inference Gateway provides HTTP endpoints for model inference.
#### 1: Populate gateway URL for your k8s cluster ####
a. To test the integration in minikube, proceed as below:
Use `minikube tunnel` to expose the gateway to the host. This requires `sudo` access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host as shown in alternative (b).
```bash
# in first terminal
...
@@ -277,11 +299,13 @@ minikube tunnel # start the tunnel
GATEWAY_URL=$(kubectl get svc inference-gateway -n my-model -o jsonpath='{.spec.clusterIP}') && echo "$GATEWAY_URL"
```
b. To test on a cluster, use port-forward to expose the gateway to the host:
```bash
# in first terminal
kubectl port-forward svc/inference-gateway 8000:80 -n ${NAMESPACE} # for NAMESPACE, use the namespace where you installed the gateway, e.g. kgateway-system or my-model
# in second terminal where you want to send inference requests
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream":false,
"max_tokens": 30,
"temperature": 0.0
}'
```
Sample inference output:
...
@@ -369,7 +412,9 @@ Sample inference output:
```
***If you have more than one HttpRoute running on the cluster***
Add the host to your HttpRoute.yaml and send the matching `Host` header with every request, for example `curl -H "Host: llama3-70b-agg.example.com" ...`.
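If you send many requests, a small wrapper can keep the header consistent. This is a minimal sketch, not part of the Dynamo tooling: `gateway_curl` and the `GATEWAY_HOST` variable are hypothetical names introduced here, and the default hostname is simply the example value used above.

```bash
# Hypothetical wrapper: add the Host header of your HttpRoute to every request.
# GATEWAY_HOST is an assumed variable name; the default is the example hostname.
GATEWAY_HOST="${GATEWAY_HOST:-llama3-70b-agg.example.com}"

gateway_curl() {
  # Forward all remaining arguments (URL, -d payload, other flags) to curl unchanged.
  curl -H "Host: ${GATEWAY_HOST}" "$@"
}
```

Call `gateway_curl` with the same arguments you would otherwise pass to `curl`; set `GATEWAY_HOST` beforehand if your HttpRoute uses a different hostname.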