When integrating Dynamo with the Inference Gateway you must use the custom Dynamo EPP image.
# Inference Gateway (GAIE)
The custom Dynamo EPP image integrates the Dynamo router directly into the gateway's endpoint picker. Using the `dyn-kv` plugin, it selects the optimal worker based on KV cache state and tokenized prompt before routing the request. The integration moves intelligent routing upstream to the gateway layer.
Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.
EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](../../../deploy/inference-gateway/helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving. The epp config is the same for both. If no prefill workers found the service degrades gracefully to perform aggregated serving.
If you want to use LoRA deploy Dynamo without the Inference Gateway.
Currently, these setups are only supported with the kGateway based Inference Gateway.
Currently, these setups are only supported with the kGateway based Inference Gateway.
We provide examples for llama-3-70b vLLM under the `recipes/llama-3-70b/vllm/agg/gaie/` for aggregated and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/` for disaggregated serving.
Use the proper folder in commands below.
```bash
```bash
# Deploy PVC, first Update `storageClassName` in recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster before deploying
- When using GAIE the FrontEnd does not choose the workers. The routing is determined in the EPP.
- When using GAIE the FrontEnd does not choose the workers. The routing is determined in the EPP.
...
@@ -168,29 +189,9 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
...
@@ -168,29 +189,9 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
#### 5.b. Deploy as a standalone pod
#### 5.b. Deploy as a standalone pod
##### 5.b.1 Deploy Your Model ###
We do not recommend this method but there are hints on how to do this here.
We provide an example for Qwen vLLM below.
##### 5.b.1 Deploy Your Model ###
Before deploying you must enable the `--direct-route` flag in the FrontEnd cli in your Dynamo Graph.
```bash
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
Follow the steps in [model deployment](../../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Sample commands to deploy model:
```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
##### 5.b.2 Install Dynamo GIE helm chart ###
##### 5.b.2 Install Dynamo GIE helm chart ###
...
@@ -284,7 +285,7 @@ b. use port-forward to expose the gateway to the host
...
@@ -284,7 +285,7 @@ b. use port-forward to expose the gateway to the host