Unverified commit 490cdc18, authored by atchernych, committed by GitHub

feat: Deployment for Dynamo EPP - aware gateway (#2633)


Signed-off-by: atchernych <atchernych@nvidia.com>
Co-authored-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
parent ac9665c2
## Inference Gateway Setup with Dynamo ## Inference Gateway Setup with Dynamo
This guide demonstrates two setups. This guide demonstrates two setups.
The EPP-unaware setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
The EPP-aware setup first uses Dynamo Router to pick the worker instance id for serving the model. Then traffic gets directed straight to the selected worker. - The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
EPP’s default approach is token-aware only by approximation, because it relies on the non-tokenized text of the prompt. The Dynamo plugin instead uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Currently, these setups are only supported with the kGateway based Inference Gateway. Currently, these setups are only supported with the kGateway based Inference Gateway.
## Table of Contents ## Table of Contents
...@@ -18,12 +22,12 @@ Currently, these setups are only supported with the kGateway based Inference Gat ...@@ -18,12 +22,12 @@ Currently, these setups are only supported with the kGateway based Inference Gat
## Installation Steps ## Installation Steps
1. **Install Dynamo Platform** ### 1. Install Dynamo Platform ###
[See Quickstart Guide](../../docs/guides/dynamo_deploy/README.md) to install Dynamo Cloud. [See Quickstart Guide](../../docs/guides/dynamo_deploy/README.md) to install Dynamo Cloud.
2. **Deploy Inference Gateway** ### 2. Deploy Inference Gateway ###
First, deploy an inference gateway service. In this example, we'll install `kgateway` based gateway implementation. First, deploy an inference gateway service. In this example, we'll install `kgateway` based gateway implementation.
You can use the script below or follow the steps manually. You can use the script below or follow the steps manually.
...@@ -72,7 +76,7 @@ kubectl get gateway inference-gateway -n my-model ...@@ -72,7 +76,7 @@ kubectl get gateway inference-gateway -n my-model
# inference-gateway kgateway x.x.x.x True 1m # inference-gateway kgateway x.x.x.x True 1m
``` ```
3. **Deploy model** ### 3. Deploy Your Model ###
Follow the steps in [model deployment](../../components/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../components/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace. Follow the steps in [model deployment](../../components/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../components/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
...@@ -81,51 +85,85 @@ Sample commands to deploy model: ...@@ -81,51 +85,85 @@ Sample commands to deploy model:
cd <dynamo-source-root>/components/backends/vllm/deploy cd <dynamo-source-root>/components/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model kubectl apply -f agg.yaml -n my-model
``` ```
Take note of, or change, the `DYNAMO_IMAGE` in the model deployment file.
4. **Install Dynamo GAIE helm chart** ### 4. Install Dynamo GAIE helm chart ###
The Inference Gateway is configured through the `inference-gateway-resources.yaml` file. The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
Deploy the Inference Gateway resources to your Kubernetes cluster by running one of the commands below. Deploy the Inference Gateway resources to your Kubernetes cluster by running one of the commands below.
For the EPP-unaware black box integration run: #### Basic Black Box Integration ####
For the basic black box integration run:
```bash ```bash
cd deploy/inference-gateway cd deploy/inference-gateway
helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml
``` ```
For the EPP-aware integration run: #### EPP-aware Integration with the custom Dynamo Plugin ####
##### 1. Build the custom EPP image #####
We provide git patches for you to use.
##### 1.1 Clone the official GAIE repo in a separate folder #####
```bash ```bash
cd deploy/inference-gateway git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension.git
cd gateway-api-inference-extension
git checkout v0.5.1
```
helm install dynamo-gaie ./helm/dynamo-gaie \ ##### 1.2 Apply patch(es) #####
-n my-model \
-f ./vllm_agg_qwen.yaml \ ```bash
-f ./values-epp-aware.yaml git apply <dynamo-folder>/deploy/inference-gateway/epp-patches/v0.5.1-1/epp-v0.5.1-dyn1.patch
```
##### 1.3 Build the custom EPP image #####
```bash
# Build the image <your-docker-registry>/dynamo-custom-epp:<your-tag> locally, then push it manually
make image-local-load \
IMAGE_REGISTRY=<your-docker-registry> \
IMAGE_NAME=dynamo-custom-epp \
EXTRA_TAG=<your-tag>
# Or run the command below to build and push to your registry
make image-local-push \
IMAGE_REGISTRY=<your-docker-registry> \
IMAGE_NAME=dynamo-custom-epp \
EXTRA_TAG=<your-tag>
``` ```
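The image reference needed by the later Helm step can be assembled from the same three values passed to `make`. This is a minimal sketch, assuming the build tags the image as `<registry>/<name>:<tag>` as the build comment suggests. `epp_image_ref` is a hypothetical helper, not part of the GAIE Makefile, and the registry and tag values are placeholders.

```bash
# Hypothetical helper: compose an image reference as registry/name:tag.
epp_image_ref() {
  registry="$1"; name="$2"; tag="$3"
  # Refuse to build a reference from an empty component.
  if [ -z "$registry" ] || [ -z "$name" ] || [ -z "$tag" ]; then
    echo "usage: epp_image_ref <registry> <name> <tag>" >&2
    return 1
  fi
  printf '%s/%s:%s\n' "$registry" "$name" "$tag"
}

# Example with placeholder values; substitute your own registry and tag.
EPP_IMAGE="$(epp_image_ref my-registry.example.com/team dynamo-custom-epp dev)"
export EPP_IMAGE
```

Deriving `EPP_IMAGE` from the build inputs keeps the value passed to Helm in step with the image that was actually built.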
Or customize the EPP further using flags, i.e: ##### 2. Install through helm #####
```bash ```bash
helm install dynamo-gaie ./helm/dynamo-gaie \ cd deploy/inference-gateway
# Export the Dynamo image you have used when deploying your model in Step 3.
export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
export EPP_IMAGE=<the-epp-image-you-built> # e.g. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
helm upgrade --install dynamo-gaie ./helm/dynamo-gaie \
-n my-model \ -n my-model \
-f ./vllm_agg_qwen.yaml \ -f ./vllm_agg_qwen.yaml \
-f ./values-epp-aware.yaml \
--set eppAware.enabled=true \ --set eppAware.enabled=true \
--set eppAware.eppImage=docker.io/lambda108/epp-inference-extension-dynamo:1.0.0 \ --set-string eppAware.eppImage=$EPP_IMAGE \
--set imagePullSecrets='{docker-imagepullsecret}' \ --set-string eppAware.sidecar.image=$DYNAMO_IMAGE
--set-string epp.extraEnv[0].name=USE_STREAMING \
--set-string epp.extraEnv[0].value=true
``` ```
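Both image variables are interpolated directly into `--set-string`, and the shell expands an unset variable to an empty string, so the chart would receive an empty value without any warning. A small guard can be run before the Helm command. This is a sketch only: `require_env` is a hypothetical helper, not part of the chart or of Helm.

```bash
# Hypothetical helper: fail if any named environment variable is unset or empty.
require_env() {
  missing=0
  for var in "$@"; do
    # Indirect lookup of the variable named by $var (POSIX-compatible).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: stop before calling helm if either image is not set.
# require_env DYNAMO_IMAGE EPP_IMAGE || exit 1
```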
Key configurations include: Key configurations include:
- An InferenceModel resource for the Qwen model - An InferenceModel resource for the Qwen model
- A service for the inference gateway - A service for the inference gateway
- Required RBAC roles and bindings - Required RBAC roles and bindings
- RBAC permissions - RBAC permissions
5. **Verify Installation** ### 5. Verify Installation ###
Check that all resources are properly deployed: Check that all resources are properly deployed:
...@@ -153,11 +191,11 @@ NAME HOSTNAMES AGE ...@@ -153,11 +191,11 @@ NAME HOSTNAMES AGE
qwen-route 33m qwen-route 33m
``` ```
## Usage ### 6. Usage ###
The Inference Gateway provides HTTP endpoints for model inference. The Inference Gateway provides HTTP endpoints for model inference.
### 1: Populate gateway URL for your k8s cluster #### 1: Populate gateway URL for your k8s cluster ####
```bash ```bash
export GATEWAY_URL=<Gateway-URL> export GATEWAY_URL=<Gateway-URL>
``` ```
...@@ -183,7 +221,7 @@ kubectl port-forward svc/inference-gateway 8000:80 -n my-model ...@@ -183,7 +221,7 @@ kubectl port-forward svc/inference-gateway 8000:80 -n my-model
GATEWAY_URL=http://localhost:8000 GATEWAY_URL=http://localhost:8000
``` ```
### 2: Check models deployed to inference gateway #### 2: Check models deployed to inference gateway ####
a. Query models: a. Query models:
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
# Required: tells EPP which profile to use (even if you only have one)
- type: single-profile-handler
# Picker: chooses the final endpoint after scoring
- name: picker
type: max-score-picker
- name: dyn-pre
type: dynamo-inject-workerid
parameters: {}
- name: dyn-kv
type: kv-aware-scorer
parameters:
frontendURL: http://127.0.0.1:8000/v1/chat/completions
timeoutMS: 10000
schedulingProfiles:
- name: default
plugins:
- pluginRef: dyn-kv
weight: 1
- pluginRef: picker
...@@ -41,11 +41,7 @@ spec: ...@@ -41,11 +41,7 @@ spec:
containers: containers:
- name: epp - name: epp
image: {{ if .Values.eppAware.enabled }} image: {{ if .Values.eppAware.enabled }}{{ default .Values.extension.image .Values.eppAware.eppImage }}{{ else }}{{ .Values.extension.image }}{{ end }}
{{ default .Values.extension.image .Values.eppAware.eppImage }}
{{ else }}
{{ .Values.extension.image }}
{{ end }}
imagePullPolicy: {{ .Values.epp.imagePullPolicy | default "IfNotPresent" }} imagePullPolicy: {{ .Values.epp.imagePullPolicy | default "IfNotPresent" }}
args: args:
{{- if .Values.epp.argsOverride }} {{- if .Values.epp.argsOverride }}
...@@ -63,6 +59,14 @@ spec: ...@@ -63,6 +59,14 @@ spec:
- "9002" - "9002"
- -grpcHealthPort - -grpcHealthPort
- "9003" - "9003"
- -configFile
- "/etc/epp/epp-config-dynamo.yaml"
{{- end }}
{{- if .Values.eppAware.enabled }}
volumeMounts:
- name: epp-config
mountPath: /etc/epp
readOnly: true
{{- end }} {{- end }}
env: env:
{{- range .Values.epp.extraEnv }} {{- range .Values.epp.extraEnv }}
...@@ -107,4 +111,13 @@ spec: ...@@ -107,4 +111,13 @@ spec:
{{- toYaml .Values.eppAware.sidecar.ports | nindent 8 }} {{- toYaml .Values.eppAware.sidecar.ports | nindent 8 }}
resources: resources:
{{- toYaml .Values.eppAware.sidecar.resources | nindent 10 }} {{- toYaml .Values.eppAware.sidecar.resources | nindent 10 }}
{{- end }}
{{- if .Values.eppAware.enabled }}
volumes:
- name: epp-config
configMap:
name: {{ include "dynamo-gaie.fullname" . }}-epp-config
items:
- key: epp-config-dynamo.yaml
path: epp-config-dynamo.yaml
{{- end }} {{- end }}
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ include "dynamo-gaie.fullname" . }}-epp-config
labels:
app.kubernetes.io/name: {{ include "dynamo-gaie.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
data:
epp-config-dynamo.yaml: |
{{ (.Files.Get "epp-config-dynamo.yaml") | indent 4 }}
...@@ -66,7 +66,7 @@ epp: ...@@ -66,7 +66,7 @@ epp:
eppAware: eppAware:
enabled: false enabled: false
# Optional: override EPP image when epp-aware=true # Optional: override EPP image when epp-aware=true
eppImage: docker.io/lambda108/epp-inference-extension-dynamo:1.0.0 eppImage: docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
# Sidecar (frontend-router) # Sidecar (frontend-router)
sidecar: sidecar:
......
...@@ -15,7 +15,7 @@ ...@@ -15,7 +15,7 @@
eppAware: eppAware:
enabled: true enabled: true
eppImage: docker.io/lambda108/epp-inference-extension-dynamo:1.0.0 eppImage: docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1
imagePullSecrets: imagePullSecrets:
- docker-imagepullsecret - docker-imagepullsecret
......