feat: Streamline GAIE deployment (blackbox is still available with a simple flag) (#3591)

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

Commit 15a01f75 (unverified), authored Oct 15, 2025 by atchernych; committed by GitHub on Oct 15, 2025
4 changed files
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
 ## Inference Gateway Setup with Dynamo

-This guide demonstrates two setups.
+When integrating Dynamo with the Inference Gateway you could either use the default EPP image provided by the extension or use the custom Dynamo image.

- The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
+1. When using the Dynamo custom EPP image you will take advantage of the Dynamo router when EPP chooses the best worker to route the request to. This setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker. In this case the Dynamo routing logic is moved upstream. We recommend this approach.
+
+2. When using the GAIE-provided image for the EPP, the Dynamo deployment is treated as a black box and the EPP would route round-robin. In this case GAIE just fans out the traffic, and the smarts only remain within the Dynamo graph. Use this if you have one Dynamo graph and do not want to obtain the Dynamo EPP image. This is a "backup" approach.
+
+The setup provided here uses the Dynamo custom EPP by default. Set `epp.useDynamo=false` in your deployment to select approach 2.

 EPP’s default kv-routing approach is token-aware only `by approximation` because the prompt is tokenized with a generic tokenizer unaware of the model deployed. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).

@@ -26,13 +29,13 @@ Currently, these setups are only supported with the kGateway based Inference Gat

 [See Quickstart Guide](../../docs/kubernetes/README.md) to install Dynamo Cloud.

-
 ### 2. Deploy Inference Gateway ###

 First, deploy an inference gateway service. In this example, we'll install `kgateway` based gateway implementation.
 You can use the script below or follow the steps manually.

 Script:
+
 ```bash
 ./install_gaie_crd_kgateway.sh
 ```
@@ -40,18 +43,21 @@ Script:
 Manual steps:

 a. Deploy the Gateway API CRDs:
+
 ```bash
 GATEWAY_API_VERSION=v1.3.0
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/$GATEWAY_API_VERSION/standard-install.yaml
 ```

 b. Install the Inference Extension CRDs (Inference Model and Inference Pool CRDs)
+
 ```bash
 INFERENCE_EXTENSION_VERSION=v0.5.1
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$INFERENCE_EXTENSION_VERSION/manifests.yaml -n  my-model
 ```

 c. Install `kgateway` CRDs and kgateway.
+
 ```bash
 KGATEWAY_VERSION=v2.0.3

@@ -63,6 +69,7 @@ helm upgrade -i --namespace kgateway-system --version $KGATEWAY_VERSION kgateway
 ```

 d. Deploy the Gateway Instance
+
 ```bash
 kubectl create namespace my-model
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml -n  my-model
@@ -81,14 +88,16 @@ kubectl get gateway inference-gateway -n my-model
 Follow the steps in [model deployment](../../components/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../components/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.

 Sample commands to deploy model:
+
 ```bash
 cd <dynamo-source-root>/components/backends/vllm/deploy
 kubectl apply -f agg.yaml -n my-model
 ```
-Take a note of or change the DYNAMO_IMAGE in the model deployment file.

+Take a note of or change the DYNAMO_IMAGE in the model deployment file.

 Do not forget docker registry secret if needed.
+
 ```bash
 kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=$DOCKER_SERVER \
@@ -97,7 +106,8 @@ kubectl create secret docker-registry docker-imagepullsecret \
  --namespace=$NAMESPACE
 ```

-Do not forget to include the the HuggingFace token if required.
+Do not forget to include the HuggingFace token if required.
+
 ```bash
 export HF_TOKEN=your_hf_token
 kubectl create secret generic hf-token-secret \
@@ -105,7 +115,7 @@ kubectl create secret generic hf-token-secret \
  -n ${NAMESPACE}
 ```

-Create a model configuration file similar to the vllm_agg_qwen.yaml for you model.
+Create a model configuration file similar to the vllm_agg_qwen.yaml for your model.
 This file demonstrates the values needed for the Vllm Agg setup in [agg.yaml](../../components/backends/vllm/deploy/agg.yaml)
 Take a note of the model's block size provided in the model card.

@@ -113,18 +123,46 @@ Take a note of the model's block size provided in the model card.

 The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.

-Deploy the Inference Gateway resources to your Kubernetes cluster by running one of the commands below.
+Deploy the Inference Gateway resources to your Kubernetes cluster by running the command below.

-#### Basic Black Box Integration ####
+```bash
+cd deploy/inference-gateway

-The basic black box integration uses a standard EPP image`us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. For the basic black box integration run:
+# Export the Dynamo image you have used when deploying your model in Step 3.
+export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
+# Export the image tag provided by Dynamo (nvcr.io/nvstaging/ai-dynamo/epp-inference-extension-dynamo:v0.6.0-1) or you can build the Dynamo EPP image by following the commands later in this README.
+export EPP_IMAGE=<the-epp-image-you-built>
+```

 ```bash
-cd deploy/inference-gateway
-helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml
+helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set-string extension.image=$EPP_IMAGE
+# do not include --set-string extension.image=$EPP_IMAGE to use the default images
 ```

-#### EPP-aware Integration with the custom Dynamo Plugin ####
+Key configurations include:
+
+- An InferenceModel resource for the Qwen model
+- A service for the inference gateway
+- Required RBAC roles and bindings
+- RBAC permissions
+- values-dynamo-epp.yaml sets epp.dynamo.namespace=vllm-agg for the bundled example. Point it at your actual Dynamo namespace by editing that file or adding --set epp.dynamo.namespace=<namespace> (and likewise for epp.dynamo.component, epp.dynamo.kvBlockSize if they differ).
+
+
+**Configuration**
+You can configure the plugin by setting environment vars in your [values-dynamo-epp.yaml].
+
+- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
+- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
+- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher to keep multiple router instances in sync (important if you run more than one KV router per component).
+- By default the Dynamo plugin uses KV routing. You can expose `DYNAMO_USE_KV_ROUTING=false`  in your [values-dynamo-epp.yaml] if you prefer to route in the round-robin fashion.
+- If using kv-routing:
+  - Overwrite the `DYNAMO_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
+  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
+  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
+  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using kv-routing
+  - See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details.
+
+

 Dynamo provides a custom routing plugin `pkg/epp/scheduling/plugins/dynamo_kv_scorer/plugin.go` to perform efficient kv routing.
 The Dynamo router is built as a static library, the EPP router will call to provide fast inference.
@@ -132,7 +170,7 @@ You can either use the image `nvcr.io/nvstaging/ai-dynamo/epp-inference-extensio

 ##### 1. Build the custom EPP image #####

-If you choose to build your own image use the steps below. Proceed to step 2 otherwise to deploy with Helm.
+If you choose to build your own image use the steps below.

 ##### 1.1 Clone the official GAIE repo in a separate folder #####

@@ -144,8 +182,6 @@ git checkout v0.5.1

 ##### 1.2 Build the Dynamo Custom EPP #####

-
-
 ###### 1.2.1 Clone the official EPP repo ######

 ```bash
@@ -178,48 +214,18 @@ docker tag <your-new-id> <your-image-tag>
 docker push  <your-image-tag>
 ```

-##### 2. Deploy through helm #####
-
-```bash
-cd deploy/inference-gateway
-
-# Export the Dynamo image you have used when deploying your model in Step 3.
-export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
-# Export the image tag you have used when building the EPP i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-2
-export EPP_IMAGE=<the-epp-image-you-built>
-```
-
-**Configuration**
-You can configure the plugin by setting environment vars in your [values-epp-aware.yaml].
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher to keep multiple router instances in sync (important if you run more than one KV router per component).
- By default the Dynamo plugin uses KV routing. You can expose `DYNAMO_USE_KV_ROUTING=false`  in your [values-epp-aware.yaml] if you prefer to route in the round-robin fashion.
- If using kv-routing:
-  - Overwrite the `DYNAMO_KV_BLOCK_SIZE` in your [values-epp-aware.yaml](./values-epp-aware.yaml) to match your model's block size.The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
-  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
-  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
-  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using kv-routing
-  - See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details.

+**Note**
+You can also use the standard EPP image `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. For the basic black box integration, run:

 ```bash
-helm upgrade --install dynamo-gaie ./helm/dynamo-gaie \
-  -n my-model \
-  -f ./vllm_agg_qwen.yaml \
-  -f ./values-epp-aware.yaml \
-  --set eppAware.enabled=true \
-  --set-string eppAware.eppImage=$EPP_IMAGE
+cd deploy/inference-gateway
+# Optionally export the standard EPP image if you do not want to use the default we suggest.
+export EPP_IMAGE=us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v0.4.0
+helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set epp.useDynamo=false
+# Optionally overwrite the image --set-string extension.image=$EPP_IMAGE
 ```

-
-Key configurations include:
- An InferenceModel resource for the Qwen model
- A service for the inference gateway
- Required RBAC roles and bindings
- RBAC permissions
- values-epp-aware.yaml sets eppAware.dynamoNamespace=vllm-agg for the bundled example. Point it at your actual Dynamo namespace by editing that file or adding --set eppAware.dynamoNamespace=<namespace> (and likewise for dynamoComponent, dynamoKvBlockSize if they differ).
-
 ### 5. Verify Installation ###

 Check that all resources are properly deployed:
@@ -248,12 +254,12 @@ NAME        HOSTNAMES   AGE
 qwen-route               33m
 ```

-
 ### 6. Usage ###

 The Inference Gateway provides HTTP endpoints for model inference.

 #### 1: Populate gateway URL for your k8s cluster ####
+
 ```bash
 export GATEWAY_URL=<Gateway-URL>
 ```
@@ -261,6 +267,7 @@ export GATEWAY_URL=<Gateway-URL>
 To test the gateway in minikube, use the following command:
 a. Use minikube tunnel to expose the gateway to the host
   This requires `sudo` access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host as shown in alternative (b).
+
 ```bash
 # in first terminal
 ps aux | grep "minikube tunnel" | grep -v grep # make sure minikube tunnel is not already running.
@@ -272,6 +279,7 @@ echo $GATEWAY_URL
 ```

 b. use port-forward to expose the gateway to the host
+
 ```bash
 # in first terminal
 kubectl port-forward svc/inference-gateway 8000:80 -n my-model
@@ -282,14 +290,16 @@ GATEWAY_URL=http://localhost:8000

 #### 2: Check models deployed to inference gateway ####

-
 a. Query models:
+
 ```bash
 # in the second terminal where your GATEWAY_URL is set

 curl $GATEWAY_URL/v1/models | jq .
 ```
+
 Sample output:
+
 ```json
 {
  "data": [
@@ -360,6 +370,7 @@ Sample inference output:
 ```

 ### 7. Deleting the installation ###
+
 If you need to uninstall run:

 ```bash

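The README changes above reduce the two deployment modes to one Helm command that differs only in its `--set` flags. A minimal sketch of that difference follows, assuming the flag names shown in the diff (`epp.useDynamo`, `extension.image`). The helper `gaie_helm_cmd` is hypothetical and not part of the repository; it only prints the command, it does not run Helm.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): prints the helm command for a mode.
# $1: "dynamo" (chart default) or "blackbox"
# $2: optional EPP image override; empty means "use the chart defaults"
gaie_helm_cmd() {
  local mode="${1:-dynamo}"
  local image="${2:-}"
  local cmd="helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml"

  # Black box mode only needs the toggle switched off.
  if [ "$mode" = "blackbox" ]; then
    cmd="$cmd --set epp.useDynamo=false"
  fi

  # An explicit image is optional in both modes.
  if [ -n "$image" ]; then
    cmd="$cmd --set-string extension.image=$image"
  fi

  printf '%s\n' "$cmd"
}

gaie_helm_cmd dynamo
gaie_helm_cmd blackbox
```

Printing the command rather than executing it lets you review the flags before applying them to a cluster.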
--- a/deploy/inference-gateway/helm/dynamo-gaie/templates/dynamo-epp.yaml
+++ b/deploy/inference-gateway/helm/dynamo-gaie/templates/dynamo-epp.yaml
@@ -12,6 +12,21 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+{{- /* ------------ file-scope vars (no output) ------------ */ -}}
+{{- $platformNs   := default .Release.Namespace .Values.platformNamespace -}}
+{{- $platformName := default "dynamo-platform" .Values.platformReleaseName -}}
+{{- $useDynamo    := default false .Values.epp.useDynamo -}}
+{{- $dynNsAll     := default .Values.dynamoNamespace .Values.epp.dynamo.namespace -}}
+{{- $ns           := ternary (required "set epp.dynamo.namespace (or top-level dynamoNamespace) when epp.useDynamo=true" $dynNsAll) "" $useDynamo -}}
+{{- $comp         := default "backend" .Values.epp.dynamo.component -}}
+{{- $kv           := default "16" .Values.epp.dynamo.kvBlockSize -}}
+{{- $std          := .Values.extension.standardImage -}}
+{{- $dyn          := .Values.extension.dynamoImage -}}
+{{- $fallback     := ternary $dyn $std .Values.epp.useDynamo -}}
+{{- $eppImage     := default $fallback .Values.extension.image -}}
+
+
+---  # <-- start of actual YAML document
 apiVersion: apps/v1
 kind: Deployment
 metadata:
@@ -29,7 +44,6 @@ spec:
      labels:
        app: {{ .Values.model.shortName }}-epp
    spec:
-      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
      terminationGracePeriodSeconds: 130

      {{- if .Values.imagePullSecrets }}
@@ -41,7 +55,7 @@ spec:

      containers:
      - name: epp
-        image: "{{ if .Values.eppAware.enabled }}{{ default .Values.extension.image .Values.eppAware.eppImage }}{{ else }}{{ .Values.extension.image }}{{ end }}"
+        image: "{{ $eppImage }}"
        imagePullPolicy: {{ default "IfNotPresent" .Values.epp.imagePullPolicy }}
        args:
        {{- if .Values.epp.argsOverride }}
@@ -59,19 +73,13 @@ spec:
          - "9002"
          - -grpcHealthPort
          - "9003"
-        {{- if .Values.eppAware.enabled }}
+          {{- if $useDynamo }}
          - -configFile
-          - "/etc/epp/epp-config-dynamo.yaml"
-        {{- end }}
+          - "{{ .Values.epp.configFile }}"
+          {{- end }}
        {{- end }}

-        {{- $platformNs := default .Release.Namespace .Values.platformNamespace -}}
-        {{- $platformName := default "dynamo-platform" .Values.platformReleaseName -}}
-        {{- $ns := required "set eppAware.dynamoNamespace via values" .Values.eppAware.dynamoNamespace -}}
-        {{- $comp := default "backend" .Values.eppAware.dynamoComponent -}}
-        {{- $kv := default "16" .Values.eppAware.dynamoKvBlockSize -}}
-
-        {{- if .Values.eppAware.enabled }}
+        {{- if $useDynamo }}
        volumeMounts:
          - name: epp-config
            mountPath: /etc/epp
@@ -79,7 +87,7 @@ spec:
        {{- end }}

        env:
-        {{- if .Values.eppAware.enabled }}
+        {{- if $useDynamo }}
          - name: ETCD_ENDPOINTS
            value: "{{ $platformName }}-etcd.{{ $platformNs }}:2379"
          - name: NATS_SERVER
@@ -90,6 +98,8 @@ spec:
            value: "{{ $comp }}"
          - name: DYNAMO_KV_BLOCK_SIZE
            value: "{{ $kv }}"
+          - name: USE_STREAMING
+            value: "true"
        {{- end }}
        {{- range .Values.epp.extraEnv }}
          - name: {{ .name }}
@@ -114,7 +124,7 @@ spec:
          initialDelaySeconds: 5
          periodSeconds: 10

-      {{- if .Values.eppAware.enabled }}
+      {{- if $useDynamo }}
      volumes:
        - name: epp-config
          configMap:
@@ -122,4 +132,4 @@ spec:
            items:
              - key: epp-config-dynamo.yaml
                path: epp-config-dynamo.yaml
-      {{- end }}
\ No newline at end of file
+      {{- end }}
--- a/deploy/inference-gateway/helm/dynamo-gaie/values.yaml
+++ b/deploy/inference-gateway/helm/dynamo-gaie/values.yaml
@@ -49,24 +49,33 @@ httpRoute:
    request: "300s"

 extension:
-  # default (non-epp-aware) EPP image for the GAIE extension
-  image: us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v0.4.0
+  # EPP image for the GAIE extension (Dynamo EPP image by default)
+  image: "" # leave empty to use defaults below
+  standardImage: us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v0.4.0
+  dynamoImage: nvcr.io/nvstaging/ai-dynamo/gaie-epp-dynamo:v0.6.0-1

 # generic knobs you may want in both modes
-imagePullSecrets: []     # e.g. ["docker-imagepullsecret"]
+imagePullSecrets:
+  - docker-imagepullsecret
+
 epp:
  imagePullPolicy: IfNotPresent
  # Add env in name/value pairs
-  extraEnv: []           # e.g. [{name: USE_STREAMING, value: "true"}]
+  extraEnv: []
  # If you ever want to completely override args, supply a list here.
  # When empty, chart will render sane defaults
  argsOverride: []

-# epp-aware mode toggle + specific settings
-eppAware:
-  enabled: false
-  # Optional: override EPP image when epp-aware=true
-  eppImage: nvcr.io/nvstaging/ai-dynamo/gaie-epp-dynamo:v0.6.0-1
-  dynamoNamespace: ""
-  dynamoComponent: ""
-  dynamoKvBlockSize: ""
+  # Dynamo routing mode - set to true to enable KV-aware routing via Dynamo EPP image
+  useDynamo: true
+
+  # Dynamo-specific settings (only used when useDynamo: true)
+  configFile: "/etc/epp/epp-config-dynamo.yaml"
+  dynamo:
+    namespace: "vllm-agg" # Required when useDynamo: true.
+    component: "backend"
+    kvBlockSize: "16"
+
+# Platform configuration (for Dynamo mode)
+platformReleaseName: dynamo-platform
+platformNamespace: "my-model"
--- a/deploy/inference-gateway/values-epp-aware.yaml
+++ b/deploy/inference-gateway/values-epp-aware.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-eppAware:
-  enabled: true
-  eppImage: nvcr.io/nvstaging/ai-dynamo/gaie-epp-dynamo:v0.6.0-1
-  dynamoNamespace: vllm-agg
-  dynamoComponent: backend
-  dynamoKvBlockSize: "16"
-
-imagePullSecrets:
-  - docker-imagepullsecret
-
-platformReleaseName: dynamo-platform
-platformNamespace: "my-model"
-
-epp:
-  extraEnv:
-    - name: USE_STREAMING
-      value: "true"
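
The image selection added to `templates/dynamo-epp.yaml` (`$fallback := ternary $dyn $std ...` followed by `$eppImage := default $fallback .Values.extension.image`) can be mirrored in plain shell to make the precedence explicit. This is only an illustrative sketch: `resolve_epp_image` is a hypothetical function, and the two image strings are copied from the new `values.yaml` defaults in this commit.

```bash
#!/usr/bin/env bash
# Hypothetical mirror of the chart's image resolution (not part of the repo).
# $1: value of epp.useDynamo ("true" or "false")
# $2: value of extension.image (empty string means "not set")
resolve_epp_image() {
  local use_dynamo="$1"
  local override="$2"
  local standard="us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v0.4.0"
  local dynamo="nvcr.io/nvstaging/ai-dynamo/gaie-epp-dynamo:v0.6.0-1"

  if [ -n "$override" ]; then
    # A non-empty extension.image always wins, as with `default`.
    printf '%s\n' "$override"
  elif [ "$use_dynamo" = "true" ]; then
    # Same choice as `ternary $dyn $std` when the flag is true.
    printf '%s\n' "$dynamo"
  else
    printf '%s\n' "$standard"
  fi
}

resolve_epp_image true ""
resolve_epp_image false ""
resolve_epp_image false "registry.example.com/epp:dev"
```

The precedence is: explicit `extension.image` first, then the mode-specific default. Leaving `extension.image` empty is therefore what lets `epp.useDynamo` switch images on its own.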