Unverified commit 575602bb, authored by atchernych and committed by GitHub

fix: Cleanup instructions for GAIE integrations fixes [DYN-284] (#5882)


Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
parent f597b75b
 ## Inference Gateway Setup with Dynamo
-When integrating Dynamo with the Inference Gateway it is recommended to use the custom Dynamo EPP image.
+When integrating Dynamo with the Inference Gateway you must use the custom Dynamo EPP image.
-1. **Dynamo EPP (Recommended):** The custom Dynamo EPP image integrates the Dynamo router directly into the gateway's endpoint picker. Using the `dyn-kv` plugin, it selects the optimal worker based on KV cache state and tokenized prompt before routing the request. The integration moves intelligent routing upstream to the gateway layer.
+The custom Dynamo EPP image integrates the Dynamo router directly into the gateway's endpoint picker. Using the `dyn-kv` plugin, it selects the optimal worker based on KV cache state and tokenized prompt before routing the request. The integration moves intelligent routing upstream to the gateway layer.
-2. **Standard EPP (Fallback):** You can use the default GAIE EPP image, which treats the Dynamo deployment as a black box and routes requests round-robin. Routing intelligence remains within the Dynamo graph itself. Use this approach if you have a single Dynamo graph and don't need the custom EPP image.
-EPP’s default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
+EPP's default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model's tokenizer inline. The EPP plugin configuration lives in [`helm/dynamo-gaie/epp-config-dynamo.yaml`](helm/dynamo-gaie/epp-config-dynamo.yaml) per EPP [convention](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/config-text/).
-The setup provided here uses the Dynamo custom EPP by default. Set `epp.useDynamo=false` in your deployment to pick the approach 2.
 Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving.
 If you want to use LoRA deploy Dynamo without the Inference Gateway or in the BlackBox approach with the Inference Gateway.
@@ -177,10 +173,8 @@ Do not forget docker registry secret if needed.
 ```bash
 cd deploy/inference-gateway/standalone
-# Export the Dynamo image you have used when deploying your model in Step 3.
-export DYNAMO_IMAGE=<the-dynamo-image-you-have-used-when-deploying-the-model>
-# Export the FrontEnd image tag provided by Dynamo (recommended) or build the Dynamo EPP image by following the commands later in this README.
-export EPP_IMAGE=<the-epp-image-you-built>
+# Export the EPP image - use the Dynamo FrontEnd image or build your own EPP image (see section 4)
+export EPP_IMAGE=<the-epp-image>
 ```
```bash ```bash
@@ -203,32 +197,22 @@ Key configurations include:
 **Configuration**
-You can configure the plugin by setting environment vars in your [values-dynamo-epp.yaml].
+You can configure the plugin by setting environment variables in the EPP component of your DGD in case of the operator-managed installation or in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml).
-- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
+Common Vars for Routing Configuration:
-- Set `DYNAMO_BUSY_THRESHOLD` to configure the upper bound on how full a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
+- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
-- Set `DYNAMO_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false meaning if the the prefill worker is not available the request will be served in the aggregated manner.
+- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false meaning if the the prefill worker is not available the request will be served in the aggregated manner.
-- By default the Dynamo plugin uses KV routing. You can expose `DYNAMO_USE_KV_ROUTING=false` in your [values-dynamo-epp.yaml] if you prefer to route in the round-robin fashion.
+- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
 - If using kv-routing:
-- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size.The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
+- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
-- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
+- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
-- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
+- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
-- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
+- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
 - See the [KV cache routing design](../../docs/router/kv_cache_routing.md) for details.
-**Note**
+Stand-Alone installation only:
-You can also use the standard EPP image i.e. `us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v1.2.1` for the basic black box integration.
+- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
```bash
cd deploy/inference-gateway
helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml
# Optionally export the standard EPP image if you do not want to use the default we suggest.
export EPP_IMAGE=us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v0.4.0
helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set epp.useDynamo=false --set-string extension.image=$EPP_IMAGE
# Optionally overwrite the image --set-string extension.image=$EPP_IMAGE
```
 ### 6. Verify Installation ###
......
@@ -235,10 +235,10 @@ func loadDynamoConfig() {
 ffiComponent = "backend" // The pipeline uses backend not DYN_COMPONENT which is epp
 ffiModel = getEnvOrDefault("DYN_MODEL", "Qwen/Qwen3-0.6B")
 ffiWorkerID = getEnvInt64OrDefault("DYNAMO_WORKER_ID", 1)
-ffiEnforceDisagg = getEnvBoolOrDefault("DYNAMO_ENFORCE_DISAGG", false)
+ffiEnforceDisagg = getEnvBoolOrDefault("DYN_ENFORCE_DISAGG", false)
-ffiOverlapScoreWeight = getEnvFloatOrDefault("DYNAMO_OVERLAP_SCORE_WEIGHT", -1.0)
+ffiOverlapScoreWeight = getEnvFloatOrDefault("DYN_OVERLAP_SCORE_WEIGHT", -1.0)
-ffiRouterTemperature = getEnvFloatOrDefault("DYNAMO_ROUTER_TEMPERATURE", -1.0)
+ffiRouterTemperature = getEnvFloatOrDefault("DYN_ROUTER_TEMPERATURE", -1.0)
 kvBlockSizeStr := os.Getenv("DYN_KV_BLOCK_SIZE")
 if kvBlockSizeStr == "" {
@@ -324,11 +324,11 @@ func initFFI() error {
 ns,
 cm,
 model,
-C.bool(getEnvBoolOrDefault("DYNAMO_USE_KV_ROUTING", true)),
+C.bool(getEnvBoolOrDefault("DYN_USE_KV_ROUTING", true)),
-C.double(getEnvFloatOrDefault("DYNAMO_BUSY_THRESHOLD", -1.0)),
+C.double(getEnvFloatOrDefault("DYN_BUSY_THRESHOLD", -1.0)),
 C.double(ffiOverlapScoreWeight),
 C.double(ffiRouterTemperature),
-C.bool(getEnvBoolOrDefault("DYNAMO_USE_KV_EVENTS", true)),
+C.bool(getEnvBoolOrDefault("DYN_USE_KV_EVENTS", true)),
 C.bool(getEnvBoolOrDefault("DYNAMO_ROUTER_REPLICA_SYNC", false)), // no need as long as we call the Router Book keeping operations from the EPP.
 C.bool(ffiEnforceDisagg),
 &pipeline,
......
@@ -21,10 +21,7 @@
 {{- $ns := ternary (required "set dynamoGraphDeploymentName when epp.useDynamo=true" $resolvedDynNs) "" $useDynamo -}}
 {{- $kv := default "16" .Values.epp.dynamo.kvBlockSize -}}
 {{- $useEtcd := default false .Values.epp.dynamo.useEtcd -}}
-{{- $std := .Values.extension.standardImage -}}
-{{- $dyn := .Values.extension.dynamoImage -}}
-{{- $fallback := ternary $dyn $std .Values.epp.useDynamo -}}
-{{- $eppImage := default $fallback .Values.extension.image }}
+{{- $eppImage := required "extension.image is required - set via --set-string extension.image=$EPP_IMAGE or in values file" .Values.extension.image }}
 ---
 # Deployment for the EPP (Endpoint Picker Plugin)
......
@@ -50,10 +50,9 @@ httpRoute:
 request: "300s"
 extension:
-# EPP image for the GAIE extension (Dynamo EPP image by default)
-image: "" # leave empty to use defaults below
-standardImage: us-central1-docker.pkg.dev/k8s-artifacts-prod/images/gateway-api-inference-extension/epp:v1.2.1
-dynamoImage: gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo/epp-inference-extension-dynamo:new-build-1
+# EPP image for the GAIE extension (required - no default)
+# Set via --set-string extension.image=$EPP_IMAGE or in your values file
+image: ""
 # generic knobs you may want in both modes
 imagePullSecrets:
......
@@ -90,7 +90,7 @@ spec:
 value: "128" # UPDATE to match the --block-size in your deploy.yaml engine command
 - name: USE_STREAMING
 value: "true"
-- name: DYNAMO_ENFORCE_DISAGG
+- name: DYN_ENFORCE_DISAGG
 value: "false"
 - name: DYN_DISCOVERY_BACKEND
 value: "kubernetes"
......