Commit 6d62fc74 authored by atchernych, committed by GitHub

feat: Deployment for EPP as a static library (#3314)


Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
parent af7a41c3
...@@ -5,7 +5,7 @@ This guide demonstrates two setups. ...@@ -5,7 +5,7 @@ This guide demonstrates two setups.
- The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments. - The basic setup treats each Dynamo deployment as a black box and routes traffic randomly among the deployments.
- The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker. - The EPP-aware setup uses a custom Dynamo plugin `dyn-kv` to pick the best worker.
EPP’s default approach is token-aware only `by approximation` because it relies on the non-tokenized text in the prompt. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/). EPP’s default kv-routing approach is token-aware only `by approximation` because the prompt is tokenized with a generic tokenizer unaware of the model deployed. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
Currently, these setups are only supported with the kGateway based Inference Gateway. Currently, these setups are only supported with the kGateway based Inference Gateway.
...@@ -87,6 +87,28 @@ kubectl apply -f agg.yaml -n my-model ...@@ -87,6 +87,28 @@ kubectl apply -f agg.yaml -n my-model
``` ```
Take a note of or change the DYNAMO_IMAGE in the model deployment file. Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget to create the Docker registry secret if needed.
```bash
kubectl create secret docker-registry docker-imagepullsecret \
--docker-server=$DOCKER_SERVER \
--docker-username=$DOCKER_USERNAME \
--docker-password=$DOCKER_PASSWORD \
--namespace=$NAMESPACE
```
Do not forget to include the Hugging Face token if required.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
Create a model configuration file similar to `vllm_agg_qwen.yaml` for your model.
This file demonstrates the values needed for the vLLM agg setup in [agg.yaml](../../components/backends/vllm/deploy/agg.yaml).
Take note of the model's block size provided in the model card.
### 4. Install Dynamo GAIE helm chart ### ### 4. Install Dynamo GAIE helm chart ###
The Inference Gateway is configured through the `inference-gateway-resources.yaml` file. The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
...@@ -95,7 +117,7 @@ Deploy the Inference Gateway resources to your Kubernetes cluster by running one ...@@ -95,7 +117,7 @@ Deploy the Inference Gateway resources to your Kubernetes cluster by running one
#### Basic Black Box Integration #### #### Basic Black Box Integration ####
For the basic black box integration run: The basic black box integration uses a standard EPP image `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.4.0`. For the basic black box integration run:
```bash ```bash
cd deploy/inference-gateway cd deploy/inference-gateway
...@@ -104,9 +126,13 @@ helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml ...@@ -104,9 +126,13 @@ helm install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml
#### EPP-aware Integration with the custom Dynamo Plugin #### #### EPP-aware Integration with the custom Dynamo Plugin ####
Dynamo provides a custom routing plugin `pkg/epp/scheduling/plugins/dynamo_kv_scorer/plugin.go` to perform efficient kv routing.
The Dynamo router is built as a static library that the EPP router calls to provide fast inference.
You can either use the image `nvcr.io/nvstaging/ai-dynamo/epp-inference-extension-dynamo:v0.6.0-1` as the `EPP_IMAGE` in the Helm deployment command and proceed to step 2, or build the image yourself by following the steps below.
##### 1. Build the custom EPP image ##### ##### 1. Build the custom EPP image #####
We provide git patches for you to use. If you choose to build your own image, use the steps below. Otherwise, proceed to step 2 to deploy with Helm.
##### 1.1 Clone the official GAIE repo in a separate folder ##### ##### 1.1 Clone the official GAIE repo in a separate folder #####
...@@ -116,44 +142,74 @@ cd gateway-api-inference-extension ...@@ -116,44 +142,74 @@ cd gateway-api-inference-extension
git checkout v0.5.1 git checkout v0.5.1
``` ```
##### 1.2 Apply patch(es) ##### ##### 1.2 Build the Dynamo Custom EPP #####
###### 1.2.1 Clone the official EPP repo ######
```bash
# Clone the official GAIE repo in a separate folder
git clone git@github.com:kubernetes-sigs/gateway-api-inference-extension.git
cd gateway-api-inference-extension
git checkout v0.5.1
```
###### 1.2.2 Run the script to build the EPP image ######
The script applies a custom patch to the code in your GAIE repo and builds the image for you to use.
```bash ```bash
git apply <dynamo-folder>/deploy/inference-gateway/epp-patches/v0.5.1-1/epp-v0.5.1-dyn1.patch # Use your custom paths
export DYNAMO_DIR=/path/to/dynamo
export EPP_DIR=/path/to/gateway-api-inference-extension
# Run the script
cd deploy/inference-gateway
./build-epp-dynamo.sh
``` ```
##### 1.3 Build the custom EPP image ##### Under the hood, the script builds the Dynamo router static library, copies it and its C header into the EPP code base, applies the Dynamo patch, and builds a custom EPP image with it.
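The build script shells out to `cargo`, `cbindgen`, `git`, and `make`, and the re-tag step uses `docker`, so it can help to confirm they are on your `PATH` before starting a long build. The helper below is a minimal sketch under that assumption; the function name `missing_tools` is hypothetical and not part of the Dynamo repo.

```bash
# Hypothetical preflight helper (not part of the Dynamo repo).
# Prints the name of every listed tool that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Example: report anything the build would need but cannot find.
# missing_tools cargo cbindgen git make docker
```

An empty result means every listed tool was found; it says nothing about tool versions.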
Re-tag the freshly built image and push it to your registry.
```bash ```bash
# Build the image <your-docker-registry/dynamo-custom-epp:<your-tag> and then manually push docker images
make image-local-load \ docker tag <your-new-id> <your-image-tag>
IMAGE_REGISTRY=<your-docker-registry> \ docker push <your-image-tag>
IMAGE_NAME=dynamo-custom-epp \
EXTRA_TAG=<your-tag>
# Or run the command below to build push to your registry
make image-local-push \
IMAGE_REGISTRY=<your-docker-registry> \
IMAGE_NAME=dynamo-custom-epp \
EXTRA_TAG=<your-tag>
``` ```
##### 2. Install through helm ##### ##### 2. Deploy through helm #####
```bash ```bash
cd deploy/inference-gateway cd deploy/inference-gateway
# Export the Dynamo image you have used when deploying your model in Step 3. # Export the Dynamo image you have used when deploying your model in Step 3.
export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model> export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
export EPP_IMAGE=<the-epp-image-you-built> # i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1 # Export the image tag you have used when building the EPP i.e. docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-2
export EPP_IMAGE=<the-epp-image-you-built>
```
**Configuration**
You can configure the plugin by setting environment variables in your [values-epp-aware.yaml](./values-epp-aware.yaml).
- Overwrite the `DYNAMO_NAMESPACE` env var if needed to match your model's Dynamo namespace.
- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. The default value is negative, which disables the threshold.
- Set `DYNAMO_ROUTER_REPLICA_SYNC=true` to enable a background watcher that keeps multiple router instances in sync (important if you run more than one KV router per component).
- By default the Dynamo plugin uses KV routing. Set `DYNAMO_USE_KV_ROUTING=false` in your [values-epp-aware.yaml](./values-epp-aware.yaml) if you prefer round-robin routing.
- If using KV routing:
  - Overwrite `DYNAMO_KV_BLOCK_SIZE` in your [values-epp-aware.yaml](./values-epp-aware.yaml) to match your model's block size. The `DYNAMO_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). A higher weight biases routing toward workers with similar cached prefixes.
  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. A low temperature makes the router pick the top candidate deterministically; a higher temperature lets lower-scoring workers through more often (exploration).
  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable KV event tracking while using KV routing.
  - See the [KV cache routing design](../../docs/architecture/kv_cache_routing.md) for details.
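Because a wrong or missing `DYNAMO_KV_BLOCK_SIZE` fails silently, a small check before running Helm may catch obvious mistakes. The sketch below is a hypothetical helper, not part of the Dynamo repo; the function name `check_kv_block_size` and the "positive integer without leading zero" rule are assumptions. It validates only the format of the value, not whether it matches the block size in your model card.

```bash
# Hypothetical preflight helper (not part of the Dynamo repo).
# Returns 0 if the argument is a positive integer, 1 otherwise.
check_kv_block_size() {
  case "$1" in
    ''|*[!0-9]*|0*)
      # Empty, contains a non-digit, or starts with 0 (which also covers "0").
      echo "DYNAMO_KV_BLOCK_SIZE must be a positive integer, got: '$1'" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# Example: stop before deploying if the value is unusable.
# check_kv_block_size "${DYNAMO_KV_BLOCK_SIZE}" || exit 1
```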
```bash
helm upgrade --install dynamo-gaie ./helm/dynamo-gaie \ helm upgrade --install dynamo-gaie ./helm/dynamo-gaie \
-n my-model \ -n my-model \
-f ./vllm_agg_qwen.yaml \ -f ./vllm_agg_qwen.yaml \
-f ./values-epp-aware.yaml \ -f ./values-epp-aware.yaml \
--set eppAware.enabled=true \ --set eppAware.enabled=true \
--set-string eppAware.eppImage=$EPP_IMAGE \ --set-string eppAware.eppImage=$EPP_IMAGE
--set-string eppAware.sidecar.image=$DYNAMO_IMAGE
``` ```
...@@ -162,6 +218,7 @@ Key configurations include: ...@@ -162,6 +218,7 @@ Key configurations include:
- A service for the inference gateway - A service for the inference gateway
- Required RBAC roles and bindings - Required RBAC roles and bindings
- RBAC permissions - RBAC permissions
- `values-epp-aware.yaml` sets `eppAware.dynamoNamespace=vllm-agg` for the bundled example. Point it at your actual Dynamo namespace by editing that file or adding `--set eppAware.dynamoNamespace=<namespace>` (and likewise for `dynamoComponent` and `dynamoKvBlockSize` if they differ).
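If you prefer command-line overrides to editing the values file, the three `eppAware` keys can be passed as `--set-string` flags. The helper below is a sketch only: the function name `epp_override_args` is hypothetical, and the fallback values `backend` and `16` simply mirror the chart defaults shown in the deployment template.

```bash
# Hypothetical helper (not part of the Dynamo repo).
# Prints the Helm override flags for the EPP-aware values, one token per line.
#   $1: Dynamo namespace (required)
#   $2: Dynamo component (defaults to "backend")
#   $3: KV block size    (defaults to "16")
epp_override_args() {
  printf '%s\n' \
    "--set-string" "eppAware.dynamoNamespace=$1" \
    "--set-string" "eppAware.dynamoComponent=${2:-backend}" \
    "--set-string" "eppAware.dynamoKvBlockSize=${3:-16}"
}

# Example (relies on word splitting, so the values must not contain spaces):
# helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model \
#   -f ./vllm_agg_qwen.yaml -f ./values-epp-aware.yaml \
#   $(epp_override_args my-dynamo-namespace)
```

`--set-string` keeps the block size a string, which matches how `values-epp-aware.yaml` quotes it.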
### 5. Verify Installation ### ### 5. Verify Installation ###
......
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -e  # Exit on any error

# Configuration - Set these environment variables before running
if [[ -z "${DYNAMO_DIR}" ]]; then
    echo "DYNAMO_DIR environment variable must be set"
    echo "   Example: export DYNAMO_DIR=/path/to/dynamo"
    exit 1
fi
if [[ -z "${EPP_DIR}" ]]; then
    echo "EPP_DIR environment variable must be set"
    echo "   Example: export EPP_DIR=/path/to/gateway-api-inference-extension-dynamo"
    exit 1
fi

DYNAMO_LIB_DIR="${EPP_DIR}/pkg/epp/scheduling/plugins/dynamo_kv_scorer/lib"
DYNAMO_INCLUDE_DIR="${EPP_DIR}/pkg/epp/scheduling/plugins/dynamo_kv_scorer/include"

echo "🏗️ Building Dynamo KV Router C Library..."

# Step 1: Build the static library
echo "📦 Building static library..."
cd "${DYNAMO_DIR}"
cargo build --release -p libdynamo_llm

# Step 2: Generate header file (with fallback)
echo "📝 Generating C header..."
HEADER_OUTPUT="${DYNAMO_DIR}/lib/bindings/c/include/nvidia/dynamo_llm/llm_engine.h"
if ! cbindgen --config lib/bindings/c/cbindgen.toml --crate libdynamo_llm --output "${HEADER_OUTPUT}"; then
    echo "cbindgen failed, using fallback header..."
    cp "${DYNAMO_DIR}/lib/bindings/c/src/fallback_header.h" "${HEADER_OUTPUT}"
fi

# Step 3: Ensure EPP directories exist
echo "Preparing EPP directories..."
mkdir -p "${DYNAMO_LIB_DIR}"
mkdir -p "${DYNAMO_INCLUDE_DIR}"

# Step 4: Copy files to EPP
echo "Copying files to EPP..."
cp "${HEADER_OUTPUT}" "${DYNAMO_INCLUDE_DIR}/"
cp "${DYNAMO_DIR}/target/release/libdynamo_llm_capi.a" "${DYNAMO_LIB_DIR}/"

# Verify files were copied
if [[ ! -f "${DYNAMO_INCLUDE_DIR}/llm_engine.h" ]]; then
    echo "Header file copy failed!"
    exit 1
fi
if [[ ! -f "${DYNAMO_LIB_DIR}/libdynamo_llm_capi.a" ]]; then
    echo "Library file copy failed!"
    exit 1
fi
echo "Files copied successfully:"
echo "   Header: ${DYNAMO_INCLUDE_DIR}/llm_engine.h"
echo "   Library: ${DYNAMO_LIB_DIR}/libdynamo_llm_capi.a"

# Step 5: Apply Dynamo patch (if it exists)
echo "🔧 Applying Dynamo patch..."
cd "${EPP_DIR}"
PATCH_FILE="${DYNAMO_DIR}/deploy/inference-gateway/epp-patches/v0.5.1-2/epp-v0.5.1-dyn2.patch"
if [[ -f "${PATCH_FILE}" ]]; then
    if git apply --check "${PATCH_FILE}" 2>/dev/null; then
        git apply "${PATCH_FILE}"
        echo "Patch applied successfully"
    else
        echo "Patch doesn't apply cleanly - may already be applied or need manual resolution"
    fi
else
    echo "No patch file found at ${PATCH_FILE}"
fi

# Step 6: Build the EPP image
echo "Building the EPP image..."
make dynamo-image-local-load
echo "EPP with Dynamo KV routing built"
...@@ -41,8 +41,8 @@ spec: ...@@ -41,8 +41,8 @@ spec:
containers: containers:
- name: epp - name: epp
image: {{ if .Values.eppAware.enabled }}{{ default .Values.extension.image .Values.eppAware.eppImage }}{{ else }}{{ .Values.extension.image }}{{ end }} image: "{{ if .Values.eppAware.enabled }}{{ default .Values.extension.image .Values.eppAware.eppImage }}{{ else }}{{ .Values.extension.image }}{{ end }}"
imagePullPolicy: {{ .Values.epp.imagePullPolicy | default "IfNotPresent" }} imagePullPolicy: {{ default "IfNotPresent" .Values.epp.imagePullPolicy }}
args: args:
{{- if .Values.epp.argsOverride }} {{- if .Values.epp.argsOverride }}
{{- toYaml .Values.epp.argsOverride | nindent 8 }} {{- toYaml .Values.epp.argsOverride | nindent 8 }}
...@@ -64,22 +64,43 @@ spec: ...@@ -64,22 +64,43 @@ spec:
- "/etc/epp/epp-config-dynamo.yaml" - "/etc/epp/epp-config-dynamo.yaml"
{{- end }} {{- end }}
{{- end }} {{- end }}
{{- $platformNs := default .Release.Namespace .Values.platformNamespace -}}
{{- $platformName := default "dynamo-platform" .Values.platformReleaseName -}}
{{- $ns := required "set eppAware.dynamoNamespace via values" .Values.eppAware.dynamoNamespace -}}
{{- $comp := default "backend" .Values.eppAware.dynamoComponent -}}
{{- $kv := default "16" .Values.eppAware.dynamoKvBlockSize -}}
{{- if .Values.eppAware.enabled }} {{- if .Values.eppAware.enabled }}
volumeMounts: volumeMounts:
- name: epp-config - name: epp-config
mountPath: /etc/epp mountPath: /etc/epp
readOnly: true readOnly: true
{{- end }} {{- end }}
env: env:
{{- if .Values.eppAware.enabled }}
- name: ETCD_ENDPOINTS
value: "{{ $platformName }}-etcd.{{ $platformNs }}:2379"
- name: NATS_SERVER
value: "nats://{{ $platformName }}-nats.{{ $platformNs }}:4222"
- name: DYNAMO_NAMESPACE
value: "{{ $ns }}"
- name: DYNAMO_COMPONENT
value: "{{ $comp }}"
- name: DYNAMO_KV_BLOCK_SIZE
value: "{{ $kv }}"
{{- end }}
{{- range .Values.epp.extraEnv }} {{- range .Values.epp.extraEnv }}
- name: {{ .name }} - name: {{ .name }}
value: {{ .value | quote }} value: {{ .value | quote }}
{{- end }} {{- end }}
ports: ports:
- containerPort: 9002 - containerPort: 9002
- containerPort: 9003 - containerPort: 9003
- name: metrics - name: metrics
containerPort: 9090 containerPort: 9090
livenessProbe: livenessProbe:
grpc: grpc:
port: 9003 port: 9003
...@@ -93,27 +114,6 @@ spec: ...@@ -93,27 +114,6 @@ spec:
initialDelaySeconds: 5 initialDelaySeconds: 5
periodSeconds: 10 periodSeconds: 10
{{- if .Values.eppAware.enabled }}
- name: {{ .Values.eppAware.sidecar.name }}
image: {{ .Values.eppAware.sidecar.image }}
imagePullPolicy: {{ .Values.eppAware.sidecar.imagePullPolicy | default "IfNotPresent" }}
command: {{- toYaml .Values.eppAware.sidecar.command | nindent 8 }}
args: {{- toYaml .Values.eppAware.sidecar.args | nindent 8 }}
env:
{{- range .Values.eppAware.sidecar.env }}
{{- if .valueFromDynamoNamespace }}
- name: {{ .name }}
value: "{{ $.Values.dynamoNamespace }}"
{{- else }}
- name: {{ .name }}
value: {{ .value | quote }}
{{- end }}
{{- end }}
ports:
{{- toYaml .Values.eppAware.sidecar.ports | nindent 8 }}
resources:
{{- toYaml .Values.eppAware.sidecar.resources | nindent 10 }}
{{- end }}
{{- if .Values.eppAware.enabled }} {{- if .Values.eppAware.enabled }}
volumes: volumes:
- name: epp-config - name: epp-config
......
...@@ -66,32 +66,7 @@ epp: ...@@ -66,32 +66,7 @@ epp:
eppAware: eppAware:
enabled: false enabled: false
# Optional: override EPP image when epp-aware=true # Optional: override EPP image when epp-aware=true
eppImage: docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1 eppImage: nvcr.io/nvstaging/ai-dynamo/gaie-epp-dynamo:v0.6.0-1
dynamoNamespace: ""
# Sidecar (frontend-router) dynamoComponent: ""
sidecar: dynamoKvBlockSize: ""
# Container name for the sidecar
name: frontend-router
# Sidecar image
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
# Image pull policy for the sidecar
imagePullPolicy: IfNotPresent
# Command and args for running the frontend in router mode.
command: ["/bin/sh", "-c"]
args: ["python3 -m dynamo.frontend --http-port 8000 --router-mode kv"]
# Environment variables for the sidecar.
env:
- name: DYNAMO_NAMESPACE
valueFromDynamoNamespace: true
- name: ETCD_ENDPOINTS
value: "http://dynamo-platform-etcd:2379"
- name: NATS_SERVER
value: "nats://dynamo-platform-nats:4222"
# Resource requests/limits for the sidecar container.
resources:
requests:
cpu: "1"
memory: "2Gi"
# Ports exposed by the sidecar container.
ports:
- containerPort: 8000
...@@ -15,11 +15,17 @@ ...@@ -15,11 +15,17 @@
eppAware: eppAware:
enabled: true enabled: true
eppImage: docker.io/lambda108/epp-inference-extension-dynamo:v0.5.1-1 eppImage: nvcr.io/nvstaging/ai-dynamo/gaie-epp-dynamo:v0.6.0-1
dynamoNamespace: vllm-agg
dynamoComponent: backend
dynamoKvBlockSize: "16"
imagePullSecrets: imagePullSecrets:
- docker-imagepullsecret - docker-imagepullsecret
platformReleaseName: dynamo-platform
platformNamespace: "my-model"
epp: epp:
extraEnv: extraEnv:
- name: USE_STREAMING - name: USE_STREAMING
......