feat: Deployment for Dynamo EPP - aware gateway (#2633)

Signed-off-by: atchernych <atchernych@nvidia.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>

Unverified Commit 490cdc18 authored Aug 26, 2025 by atchernych Committed by GitHub Aug 26, 2025
7 changed files
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
 ## Inference Gateway Setup with Dynamo
 This guide demonstrates two setups.
-The EPP-unaware setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
-The EPP-aware setup first uses Dynamo Router to pick the worker instance id for serving the model. Then traffic gets directed straight to the selected worker.
+- The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
+- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
+EPP’s default approach is token-aware only by approximation, because it relies on the non-tokenized text of the prompt. The Dynamo plugin instead uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml), per the EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
 Currently, these setups are only supported with the kGateway based Inference Gateway.
 ## Table of Contents
@@ -18,12 +22,12 @@ Currently, these setups are only supported with the kGateway based Inference Gat
 ## Installation Steps
-1. **Install Dynamo Platform**
+### 1. Install Dynamo Platform ###
 [See Quickstart Guide](../../docs/guides/dynamo_deploy/README.md) to install Dynamo Cloud.
-2. **Deploy Inference Gateway**
+### 2. Deploy Inference Gateway ###
 First, deploy an inference gateway service. In this example, we'll install the `kgateway`-based gateway implementation.
 You can use the script below or follow the steps manually.
@@ -72,7 +76,7 @@ kubectl get gateway inference-gateway -n my-model
 # inference-gateway   kgateway   x.x.x.x   True         1m
 ```
-3. **Deploy model**
+### 3. Deploy Your Model ###
 Follow the steps in [model deployment](../../components/backends/vllm/deploy/README.md) to deploy the `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../components/backends/vllm/deploy/agg.yaml) in the `my-model` Kubernetes namespace.
@@ -81,51 +85,85 @@ Sample commands to deploy model:
 cd <dynamo-source-root>/components/backends/vllm/deploy
 kubectl apply -f agg.yaml -n my-model
 ```
+Note the `DYNAMO_IMAGE` value in the model deployment file, or change it; the EPP-aware install in Step 4 reuses it.
-4. **Install Dynamo GAIE helm chart**
+### 4. Install Dynamo GAIE helm chart ###
 The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
 Deploy the Inference Gateway resources to your Kubernetes cluster by running one of the commands below.
-For the EPP-unaware black box integration run:
+#### Basic Black Box Integration ####
+For the basic black box integration run:
 ```bash
 cd deploy/inference-gateway
 helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml
 ```
-For the EPP-aware integration run:
+#### EPP-aware Integration with the custom Dynamo Plugin ####
+##### 1. Build the custom EPP image #####
+We provide git patches for you to use.
+##### 1.1 Clone the official GAIE repo in a separate folder #####
 ```bash
-cd deploy/inference-gateway
+git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension.git
+cd gateway-api-inference-extension
+git checkout v0.5.1
+```
-helm install dynamo-gaie ./helm/dynamo-gaie \
+##### 1.2 Apply patch(es) #####
-  -n my-model \
-  -f ./vllm_agg_qwen.yaml \
+```bash
-  -f ./values-epp-aware.yaml
+git apply <dynamo-folder>/deploy/inference-gateway/epp-patches/v0.5.1-1/epp-v0.5.1-dyn1.patch
+```
+##### 1.3 Build the custom EPP image #####
+```bash
+# Build the image <your-docker-registry>/dynamo-custom-epp:<your-tag> locally, then push it manually
+make image-local-load \
+  IMAGE_REGISTRY=<your-docker-registry> \
+  IMAGE_NAME=dynamo-custom-epp \
+  EXTRA_TAG=<your-tag>
+# Or run the command below to build and push to your registry
+make image-local-push \
+  IMAGE_REGISTRY=<your-docker-registry> \
+  IMAGE_NAME=dynamo-custom-epp \
+  EXTRA_TAG=<your-tag>
 ```
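The registry, image name, and tag passed to `make` determine the reference you later export as `EPP_IMAGE`. As a minimal sketch, assuming the `<registry>/<name>:<tag>` layout shown in the build comment above, a hypothetical helper `epp_image_ref` (not part of the Dynamo or GAIE repos) can compose that reference once, so the build and install steps stay consistent:

```shell
# Hypothetical helper, not part of the Dynamo or GAIE repos.
# Composes <registry>/<name>:<tag>, the layout assumed from the build comment.
epp_image_ref() {
  if [ "$#" -ne 3 ] || [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
    echo "usage: epp_image_ref <registry> <image-name> <tag>" >&2
    return 1
  fi
  # Strip one trailing slash from the registry to avoid "registry//name".
  printf '%s/%s:%s\n' "${1%/}" "$2" "$3"
}
```

For example, `export EPP_IMAGE="$(epp_image_ref "$REGISTRY" dynamo-custom-epp "$TAG")"`, where `REGISTRY` and `TAG` are illustrative variables holding the same values you passed to `make`.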
-Or customize the EPP further using flags, i.e:
+##### 2. Install through helm #####
 ```bash
-helm install dynamo-gaie ./helm/dynamo-gaie \
+cd deploy/inference-gateway
+# Export the Dynamo image you have used when deploying your model in Step 3.
+export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
+export EPP_IMAGE=<the-epp-image-you-built>  # e.g. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
+helm upgrade --install dynamo-gaie ./helm/dynamo-gaie \
  -n my-model \
  -f ./vllm_agg_qwen.yaml \
+  -f ./values-epp-aware.yaml \
  --set eppAware.enabled=true \
-  --set eppAware.eppImage=docker.io/lambda108/epp-inference-extension-dynamo:1.0.0 \
+  --set-string eppAware.eppImage=$EPP_IMAGE \
-  --set imagePullSecrets='{docker-imagepullsecret}' \
+  --set-string eppAware.sidecar.image=$DYNAMO_IMAGE
-  --set-string epp.extraEnv[0].name=USE_STREAMING \
-  --set-string epp.extraEnv[0].value=true
 ```
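The install command above reads `EPP_IMAGE` and `DYNAMO_IMAGE` from the environment, so an unset variable or a leftover `<placeholder>` would be passed straight into the chart values. The following is a sketch of a pre-flight check under that assumption; `check_images` is a hypothetical function, not something the chart or this commit provides:

```shell
# Hypothetical pre-flight check, not part of the chart or of this commit.
# Fails if EPP_IMAGE or DYNAMO_IMAGE is empty or still a <placeholder>.
check_images() {
  local rc=0 var val
  for var in EPP_IMAGE DYNAMO_IMAGE; do
    # Indirect lookup of the variable named by $var (empty if unset).
    eval "val=\${$var:-}"
    case "$val" in
      "")    echo "$var is not set" >&2; rc=1 ;;
      *"<"*) echo "$var still looks like a placeholder: $val" >&2; rc=1 ;;
    esac
  done
  return "$rc"
}
```

One way to use it is to chain it in front of the install, for example `check_images && helm upgrade --install ...`, so Helm runs only when both variables hold real image references.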
 Key configurations include:
 - An InferenceModel resource for the Qwen model
 - A service for the inference gateway
 - Required RBAC roles and bindings
 - RBAC permissions
-5. **Verify Installation**
+### 5. Verify Installation ###
 Check that all resources are properly deployed:
@@ -153,11 +191,11 @@ NAME        HOSTNAMES   AGE
 qwen-route               33m
 ```
-## Usage
+### 6. Usage ###
 The Inference Gateway provides HTTP endpoints for model inference.
-### 1: Populate gateway URL for your k8s cluster
+#### 1: Populate gateway URL for your k8s cluster ####
 ```bash
 export GATEWAY_URL=<Gateway-URL>
 ```
@@ -183,7 +221,7 @@ kubectl port-forward svc/inference-gateway 8000:80 -n my-model
 GATEWAY_URL=http://localhost:8000
 ```
-### 2: Check models deployed to inference gateway
+#### 2: Check models deployed to inference gateway ####
 a. Query models:

--- a/deploy/inference-gateway/epp-patches/v0.5.1-1/epp-v0.5.1-dyn1.patch
+++ b/deploy/inference-gateway/epp-patches/v0.5.1-1/epp-v0.5.1-dyn1.patch
--- a/deploy/inference-gateway/helm/dynamo-gaie/epp-config-dynamo.yaml
+++ b/deploy/inference-gateway/helm/dynamo-gaie/epp-config-dynamo.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+apiVersion: inference.networking.x-k8s.io/v1alpha1
+kind: EndpointPickerConfig
+plugins:
+  # Required: tells EPP which profile to use (even if you only have one)
+  - type: single-profile-handler
+  # Picker: chooses the final endpoint after scoring
+  - name: picker
+    type: max-score-picker
+  - name: dyn-pre
+    type: dynamo-inject-workerid
+    parameters: {}
+  - name: dyn-kv
+    type: kv-aware-scorer
+    parameters:
+      frontendURL: http://127.0.0.1:8000/v1/chat/completions
+      timeoutMS: 10000
+schedulingProfiles:
+  - name: default
+    plugins:
+      - pluginRef: dyn-kv
+        weight: 1
+      - pluginRef: picker
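In the config above, each `pluginRef` under `schedulingProfiles` points at a plugin declared by `name` under `plugins`. A rough way to catch a mistyped reference before deploying is sketched below. `unresolved_plugin_refs` is a hypothetical helper, and it assumes the simple layout used in this file: top-level `plugins:` and `schedulingProfiles:` keys, with entries written as `- name: ...` and `- pluginRef: ...`. It is a line-based check, not a YAML parser.

```shell
# Hypothetical lint helper; assumes the flat layout of epp-config-dynamo.yaml.
# Prints every pluginRef that has no matching "- name:" entry under plugins.
unresolved_plugin_refs() {
  awk '
    /^plugins:/            { section = "plugins";  next }
    /^schedulingProfiles:/ { section = "profiles"; next }
    section == "plugins"  && $1 == "-" && $2 == "name:"      { defined[$3] = 1 }
    section == "profiles" && $1 == "-" && $2 == "pluginRef:" { refs[$3] = 1 }
    END { for (r in refs) if (!(r in defined)) print r }
  ' "$1"
}
```

Running `unresolved_plugin_refs helm/dynamo-gaie/epp-config-dynamo.yaml` from `deploy/inference-gateway` should print nothing when every reference resolves.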
--- a/deploy/inference-gateway/helm/dynamo-gaie/templates/dynamo-epp.yaml
+++ b/deploy/inference-gateway/helm/dynamo-gaie/templates/dynamo-epp.yaml
@@ -41,11 +41,7 @@ spec:
      containers:
      - name: epp
-        image: {{ if .Values.eppAware.enabled }}
+        image: {{ if .Values.eppAware.enabled }}{{ default .Values.extension.image .Values.eppAware.eppImage }}{{ else }}{{ .Values.extension.image }}{{ end }}
-          {{ default .Values.extension.image .Values.eppAware.eppImage }}
-        {{ else }}
-          {{ .Values.extension.image }}
-        {{ end }}
        imagePullPolicy: {{ .Values.epp.imagePullPolicy | default "IfNotPresent" }}
        args:
        {{- if .Values.epp.argsOverride }}
@@ -63,6 +59,14 @@ spec:
          - "9002"
          - -grpcHealthPort
          - "9003"
+          - -configFile
+          - "/etc/epp/epp-config-dynamo.yaml"
+        {{- end }}
+        {{- if .Values.eppAware.enabled }}
+        volumeMounts:
+            - name: epp-config
+              mountPath: /etc/epp
+              readOnly: true
        {{- end }}
        env:
        {{- range .Values.epp.extraEnv }}
@@ -107,4 +111,13 @@ spec:
        {{- toYaml .Values.eppAware.sidecar.ports | nindent 8 }}
        resources:
        {{- toYaml .Values.eppAware.sidecar.resources | nindent 10 }}
+      {{- end }}
+      {{- if .Values.eppAware.enabled }}
+      volumes:
+        - name: epp-config
+          configMap:
+            name: {{ include "dynamo-gaie.fullname" . }}-epp-config
+            items:
+              - key: epp-config-dynamo.yaml
+                path: epp-config-dynamo.yaml
      {{- end }}
\ No newline at end of file
--- a/deploy/inference-gateway/helm/dynamo-gaie/templates/epp-configmap.yaml
+++ b/deploy/inference-gateway/helm/dynamo-gaie/templates/epp-configmap.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: {{ include "dynamo-gaie.fullname" . }}-epp-config
+  labels:
+    app.kubernetes.io/name: {{ include "dynamo-gaie.name" . }}
+    app.kubernetes.io/instance: {{ .Release.Name }}
+data:
+  epp-config-dynamo.yaml: |
+{{ (.Files.Get "epp-config-dynamo.yaml") | indent 4 }}
--- a/deploy/inference-gateway/helm/dynamo-gaie/values.yaml
+++ b/deploy/inference-gateway/helm/dynamo-gaie/values.yaml
@@ -66,7 +66,7 @@ epp:
 eppAware:
  enabled: false
  # Optional: override EPP image when epp-aware=true
-  eppImage: docker.io/lambda108/epp-inference-extension-dynamo:1.0.0
+  eppImage: docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
  # Sidecar (frontend-router)
  sidecar:

--- a/deploy/inference-gateway/values-epp-aware.yaml
+++ b/deploy/inference-gateway/values-epp-aware.yaml
@@ -15,7 +15,7 @@
 eppAware:
  enabled: true
-  eppImage: docker.io/lambda108/epp-inference-extension-dynamo:1.0.0
+  eppImage: docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
 imagePullSecrets:
  - docker-imagepullsecret