@@ -5,7 +5,7 @@ This guide demonstrates two setups.
...
@@ -5,7 +5,7 @@ This guide demonstrates two setups.
- The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
- The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
EPP’s default approach is token-aware only `by approximation` because it relies on the non-tokenized text in the prompt. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
EPP’s default kv-routing approach is token-aware only `by approximation` because the prompt is tokenized with a generic tokenizer unaware of the model deployed. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Currently, these setups are only supported with the kGateway based Inference Gateway.
Currently, these setups are only supported with the kGateway based Inference Gateway.
Do not forget to include the the HuggingFace token if required.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN}\
-n${NAMESPACE}
```
Create a model configuration file similar to the vllm_agg_qwen.yaml for you model.
This file demonstrates the values needed for the Vllm Agg setup in [agg.yaml](../../components/backends/vllm/deploy/agg.yaml)
Take a note of the model's block size provided in the model card.
### 4. Install Dynamo GAIE helm chart ###
### 4. Install Dynamo GAIE helm chart ###
The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
...
@@ -95,7 +117,7 @@ Deploy the Inference Gateway resources to your Kubernetes cluster by running one
...
@@ -95,7 +117,7 @@ Deploy the Inference Gateway resources to your Kubernetes cluster by running one
#### Basic Black Box Integration ####
#### Basic Black Box Integration ####
For the basic black box integration run:
The basic black box integration uses a standard EPP image`us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. For the basic black box integration run:
#### EPP-aware Integration with the custom Dynamo Plugin ####
#### EPP-aware Integration with the custom Dynamo Plugin ####
Dynamo provides a custom routing plugin `pkg/epp/scheduling/plugins/dynamo_kv_scorer/plugin.go` to perform efficient kv routing.
The Dynamo router is built as a static library, the EPP router will call to provide fast inference.
You can either use the image `nvcr.io/nvstaging/ai-dynamo/epp-inference-extension-dynamo:v0.6.0-1` for the EPP_IMAGE in the Helm deployment command and proceed to the step 2 or you can build the image yourself following the steps below.
##### 1. Build the custom EPP image #####
##### 1. Build the custom EPP image #####
We provide git patches for you to use.
If you choose to build your own image use the steps below. Proceed to step 2 otherwise to deploy with Helm.
##### 1.1 Clone the official GAIE repo in a separate folder #####
##### 1.1 Clone the official GAIE repo in a separate folder #####
...
@@ -116,44 +142,74 @@ cd gateway-api-inference-extension
...
@@ -116,44 +142,74 @@ cd gateway-api-inference-extension
git checkout v0.5.1
git checkout v0.5.1
```
```
##### 1.2 Apply patch(es) #####
##### 1.2 Build the Dynamo Custom EPP #####
###### 1.2.1 Clone the official EPP repo ######
```bash
# Clone the official GAIE repo in a separate folder
export EPP_IMAGE=<the-epp-image-you-built> # i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
# Export the image tag you have used when building the EPP i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-2
export EPP_IMAGE=<the-epp-image-you-built>
```
**Configuration**
You can configure the plugin by setting environment vars in your [values-epp-aware.yaml].
- Overwrite the `DYNAMO_NAMESPACE` env var if needed to match your model's dynamo namespace.
- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher to keep multiple router instances in sync (important if you run more than one KV router per component).
- By default the Dynamo plugin uses KV routing. You can expose `DYNAMO_USE_KV_ROUTING=false` in your [values-epp-aware.yaml] if you prefer to route in the round-robin fashion.
- If using kv-routing:
- Overwrite the `DYNAMO_KV_BLOCK_SIZE` in your [values-epp-aware.yaml](./values-epp-aware.yaml) to match your model's block size.The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using kv-routing
- See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details.
- values-epp-aware.yaml sets eppAware.dynamoNamespace=vllm-agg for the bundled example. Point it at your actual Dynamo namespace by editing that file or adding --set eppAware.dynamoNamespace=<namespace> (and likewise for dynamoComponent, dynamoKvBlockSize if they differ).