When integrating Dynamo with the Inference Gateway, you can use either the default EPP image provided by the extension or the custom Dynamo image.
When integrating Dynamo with the Inference Gateway, we recommend using the custom Dynamo EPP image.
1. When using the custom Dynamo EPP image, you take advantage of the Dynamo router when the EPP chooses the best worker to route the request to. This setup uses a custom Dynamo plugin, `dyn-kv`, to pick the best worker. In this case, the Dynamo routing logic is moved upstream. We recommend this approach.
1. **Dynamo EPP (Recommended):** The custom Dynamo EPP image integrates the Dynamo router directly into the gateway's endpoint picker. Using the `dyn-kv` plugin, it selects the optimal worker based on KV cache state and tokenized prompt before routing the request. The integration moves intelligent routing upstream to the gateway layer.
2. When using the GAIE-provided image for the EPP, the Dynamo deployment is treated as a black box and the EPP routes round-robin. In this case, GAIE only fans out the traffic, and the routing intelligence remains within the Dynamo graph. Use this approach if you have one Dynamo graph and do not want to obtain the Dynamo EPP image. This is a fallback approach.
2. **Standard EPP (Fallback):** You can use the default GAIE EPP image, which treats the Dynamo deployment as a black box and routes requests round-robin. Routing intelligence remains within the Dynamo graph itself. Use this approach if you have a single Dynamo graph and don't need the custom EPP image.
The EPP’s default KV routing approach is not token-aware because the prompt is not tokenized. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
The setup provided here uses the custom Dynamo EPP by default. Set `epp.useDynamo=false` in your deployment to select approach 2.
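As a rough illustration of toggling between the two approaches, the sketch below only prints a Helm command for review instead of running it. The release name `dynamo-gaie`, the chart path `./helm/dynamo-gaie`, the namespace default, and the helper `gaie_install_cmd` are assumptions made for this example; only the `epp.useDynamo` value comes from the text above.

```bash
#!/usr/bin/env bash
# Sketch only: prints the Helm command so it can be reviewed before running.
# RELEASE, CHART, and NAMESPACE are assumed names; adjust them to your setup.
RELEASE="${RELEASE:-dynamo-gaie}"
CHART="${CHART:-./helm/dynamo-gaie}"
NAMESPACE="${NAMESPACE:-my-model}"

# Hypothetical helper.
# $1: "true" for the custom Dynamo EPP (approach 1),
#     "false" for the standard EPP (approach 2).
gaie_install_cmd() {
  printf 'helm upgrade --install %s %s -n %s --set epp.useDynamo=%s\n' \
    "$RELEASE" "$CHART" "$NAMESPACE" "$1"
}

# Print the command that selects approach 2.
gaie_install_cmd false
```

Copy the printed command into your shell once the release name, chart path, and namespace match your environment.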
The EPP’s default KV routing approach is not token-aware because the prompt is hashed without tokenization. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving.
If you want to use LoRA, deploy Dynamo without the Inference Gateway or use the black box approach with the Inference Gateway.
Currently, these setups are only supported with the kGateway-based Inference Gateway.
...
...
@@ -16,7 +19,19 @@ Currently, these setups are only supported with the kGateway based Inference Gat
- [Prerequisites](#prerequisites)
- [Installation Steps](#installation-steps)
- [Usage](#6-usage)
- [1. Install Dynamo Platform](#1-install-dynamo-platform)
- [5. Install Dynamo GAIE helm chart](#5-install-dynamo-gaie-helm-chart)
- [6. Verify Installation](#6-verify-installation)
- [7. Usage](#7-usage)
- [8. Deleting the installation](#8-deleting-the-installation)
- [Gateway API Inference Extension Details](#gateway-api-inference-extension-integration)
- [v1.2.1 API Changes](#v121-api-changes)
- [Building for v1.2.1](#building-for-v121)
- [Header-Only Routing for v1.2.1](#header-only-routing-for-v121)
## Prerequisites
...
...
@@ -34,19 +49,22 @@ Currently, these setups are only supported with the kGateway based Inference Gat
First, deploy an inference gateway service. In this example, we'll install the `kgateway`-based gateway implementation.
```bash
./install_gaie_crd_kgateway.sh
cd deploy/inference-gateway
./scripts/install_gaie_crd_kgateway.sh
```
**Note**: The manifest at `config/manifests/gateway/kgateway/gateway.yaml` uses `gatewayClassName: agentgateway`, but kGateway's helm chart creates a GatewayClass named `kgateway`. The patch command in the script fixes this mismatch.
Verify installation:
#### f. Verify the Gateway is running
```bash
kubectl get gateway inference-gateway -n my-model
kubectl get gateway inference-gateway
# Sample output
# NAME CLASS ADDRESS PROGRAMMED AGE
# inference-gateway kgateway x.x.x.x True 1m
# inference-gateway kgateway True 1m
```
### 3. Deploy Your Model ###
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy the `Qwen/Qwen3-0.6B` model in aggregated mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in the `my-model` Kubernetes namespace.
...
...
@@ -54,7 +72,8 @@ Follow the steps in [model deployment](../../examples/backends/vllm/deploy/READM
Sample commands to deploy model:
```bash
cd <dynamo-source-root>/examples/backends/vllm/deploy
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
...
...
@@ -83,14 +102,42 @@ Create a model configuration file similar to the vllm_agg_qwen.yaml for your mod
This file demonstrates the values needed for the vLLM aggregated setup in [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml).
Take note of the model's block size provided in the model card.
### 4. Install Dynamo GAIE helm chart ###
### 4. Build EPP image
You can either use the provided Dynamo FrontEnd image as the EPP image or build your own custom Dynamo EPP image by following the steps below.
@@ -122,7 +169,7 @@ You can configure the plugin by setting environment vars in your [values-dynamo-
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default, the value is negative, meaning the threshold is disabled.
- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher to keep multiple router instances in sync (important if you run more than one KV router per component).
- Set `DYNAMO_ENFORCE_DISAGG=true` if you want to enforce that every request is served in the disaggregated manner. By default, it is false, meaning that if the prefill worker is not available, the request is served in the aggregated manner.
- By default, the Dynamo plugin uses KV routing. Set `DYNAMO_USE_KV_ROUTING=false` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) if you prefer round-robin routing.
- If using kv-routing:
- Overwrite `DYNAMO_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
...
...
@@ -132,52 +179,25 @@ You can configure the plugin by setting environment vars in your [values-dynamo-
- See the [KV cache routing design](../../docs/router/kv_cache_routing.md) for details.
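
Because a missing or malformed `DYNAMO_KV_BLOCK_SIZE` is described above as failing silently, a small preflight check before deploying may help. The helper below, `check_kv_block_size`, is hypothetical and not part of Dynamo or the Helm chart; it only inspects shell variables with the same names as the plugin's env vars, and the value `16` is purely illustrative, so use the block size from your model card.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper; not part of Dynamo or the Helm chart.
check_kv_block_size() {
  # The block size only matters when KV routing is on (the plugin default).
  if [ "${DYNAMO_USE_KV_ROUTING:-true}" = "false" ]; then
    echo "KV routing disabled; DYNAMO_KV_BLOCK_SIZE is not checked"
    return 0
  fi
  # Reject an unset value, any non-digit character, and a literal zero.
  case "${DYNAMO_KV_BLOCK_SIZE:-}" in
    ''|*[!0-9]*|0)
      echo "DYNAMO_KV_BLOCK_SIZE must be a positive integer" >&2
      return 1
      ;;
  esac
  echo "DYNAMO_KV_BLOCK_SIZE=${DYNAMO_KV_BLOCK_SIZE}"
}

# Illustrative value only; use your model's actual block size.
DYNAMO_KV_BLOCK_SIZE=16
check_kv_block_size
```

The check returns a non-zero status on a bad value, so it can gate a deployment script.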
Dynamo provides a custom routing plugin `pkg/epp/scheduling/plugins/dynamo_kv_scorer/plugin.go` to perform efficient kv routing.
The Dynamo router is built as a static library that the EPP router calls into for fast inference.
You can either use the special FrontEnd image as the EPP_IMAGE in the Helm deployment command and proceed to step 2, or build the image yourself by following the steps below.
##### 1. Build the custom EPP image #####
If you choose to build your own image, use the `container/build.sh` script with the `--target frontend` option:
- Clones the Gateway API Inference Extension (GAIE) repository at the correct version
- Builds the Dynamo Router static library
- Applies the necessary patches to the EPP codebase
- Builds the custom EPP image with Dynamo KV routing support
- Builds the frontend image with the EPP binary and Dynamo runtime components
Re-tag the freshly built image and push it to your registry:
```bash
docker images
docker tag <your-new-id> <your-image-tag>
docker push <your-image-tag>
```
**Note**
You can also use the standard EPP image `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. For the basic black box integration, run:
You can also use the standard EPP image, i.e. `us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v1.2.1`, for the basic black box integration.