Commit 52271760 authored by atchernych, committed by GitHub

feat: Decomposed pipeline for EPP integration [DEP-730] (#5446)


Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
parent 16a28058
...@@ -167,7 +167,7 @@ class FrontendArgGroup(ArgGroup): ...@@ -167,7 +167,7 @@ class FrontendArgGroup(ArgGroup):
env_var="DYN_ROUTER_MODE", env_var="DYN_ROUTER_MODE",
default="round-robin", default="round-robin",
help="How to route the request.", help="How to route the request.",
choices=["round-robin", "random", "kv"], choices=["round-robin", "random", "kv", "direct"],
) )
add_argument( add_argument(
g, g,
......
...@@ -7,7 +7,7 @@ ...@@ -7,7 +7,7 @@
# - OpenAI HTTP server. # - OpenAI HTTP server.
# - Auto-discovery: Watches etcd for engine/worker registration (via `register_llm`). # - Auto-discovery: Watches etcd for engine/worker registration (via `register_llm`).
# - Pre-processor: Prompt templating and tokenization. # - Pre-processor: Prompt templating and tokenization.
# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv). # - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv, direct).
# #
# Pass `--interactive` or `-i` for text chat instead of HTTP server. # Pass `--interactive` or `-i` for text chat instead of HTTP server.
# #
...@@ -197,6 +197,9 @@ async def async_main(): ...@@ -197,6 +197,9 @@ async def async_main():
elif config.router_mode == "random": elif config.router_mode == "random":
router_mode = RouterMode.Random router_mode = RouterMode.Random
kv_router_config = None kv_router_config = None
elif config.router_mode == "direct":
router_mode = RouterMode.Direct
kv_router_config = None
else: else:
router_mode = RouterMode.RoundRobin router_mode = RouterMode.RoundRobin
kv_router_config = None kv_router_config = None
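The branch chain above maps the `--router-mode` string onto a router mode and falls back to round-robin for anything it does not recognize. A minimal sketch of the same mapping is shown below. `SketchRouterMode` and `parse_router_mode` are hypothetical names used only for illustration; they are not the real `RouterMode` binding.

```python
from enum import Enum


class SketchRouterMode(Enum):
    """Hypothetical stand-in for the real RouterMode binding."""

    ROUND_ROBIN = "round-robin"
    RANDOM = "random"
    KV = "kv"
    DIRECT = "direct"


def parse_router_mode(value: str) -> SketchRouterMode:
    """Map a CLI string to a mode, defaulting to round-robin like the else branch."""
    try:
        return SketchRouterMode(value)
    except ValueError:
        # Unknown strings fall through to round-robin in this sketch.
        return SketchRouterMode.ROUND_ROBIN
```

In the real frontend the argument parser already restricts the value to the listed `choices`, so the fallback should only be reached for `round-robin` itself.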
......
...@@ -63,26 +63,6 @@ kubectl get gateway inference-gateway ...@@ -63,26 +63,6 @@ kubectl get gateway inference-gateway
### 3. Setup secrets ### ### 3. Setup secrets ###
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Make sure to enable kv-routing by adding the env var in the FrontEnd.
```bash
mainContainer:
image: ...
env:
- name: DYN_ROUTER_MODE
value: "kv"
```
Sample commands to deploy model:
```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget docker registry secret if needed. Do not forget docker registry secret if needed.
```bash ```bash
...@@ -93,7 +73,7 @@ kubectl create secret docker-registry docker-imagepullsecret \ ...@@ -93,7 +73,7 @@ kubectl create secret docker-registry docker-imagepullsecret \
--namespace=$NAMESPACE --namespace=$NAMESPACE
``` ```
Do not forget to include the HuggingFace token if required. Do not forget to include the HuggingFace token.
```bash ```bash
export HF_TOKEN=your_hf_token export HF_TOKEN=your_hf_token
...@@ -139,13 +119,34 @@ make info # Check image tag ...@@ -139,13 +119,34 @@ make info # Check image tag
We recommend deploying Inference Gateway's Endpoint Picker as a Dynamo operator's managed component. Alternatively, We recommend deploying Inference Gateway's Endpoint Picker as a Dynamo operator's managed component. Alternatively,
you could deploy it as a standalone pod you could deploy it as a standalone pod
#### 5.a. Deploy as a DGD component #### 5.a. Deploy as a DGD component (recommended)
We provide an example for llama-3-70b vLLM below.
```bash ```bash
kubectl apply -f operator-managed/examples/agg.yaml -n ${NAMESPACE} # Deploy PVC, first Update `storageClassName` in recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster before deploying
kubectl apply -f operator-managed/examples/http-route.yaml -n ${NAMESPACE} kubectl apply -f recipes/llama-3-70b/model-cache/model-cache.yaml
kubectl apply -f recipes/llama-3-70b/model-cache/model-download.yaml
# Deploy your model
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml -n ${NAMESPACE}
# Deploy the GAIE http-route CR.
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/http-route.yaml -n ${NAMESPACE}
``` ```
- When using GAIE, the FrontEnd does not choose the workers. Routing is determined in the EPP.
- You must set `--router-mode direct` in the FrontEnd CLI, as shown below.
```yaml
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
- The pre-selected workers (decode, plus prefill in the case of disaggregated serving) are passed in the request headers.
- The `direct` router mode ensures that routing respects this selection.
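To make the header hand-off concrete, here is a minimal sketch of how a receiver could read the pre-selected workers. The header names come from the header table later in this README. The helper `preselected_workers` is hypothetical, and treating the IDs as decimal integers is an assumption, not a statement about the actual FrontEnd code.

```python
from typing import Mapping, Optional, Tuple

# Header names as listed in the header table of this README.
WORKER_HEADER = "x-worker-instance-id"
PREFILL_HEADER = "x-prefill-instance-id"


def preselected_workers(headers: Mapping[str, str]) -> Tuple[int, Optional[int]]:
    """Return (decode_worker_id, prefill_worker_id) as chosen by the EPP.

    Hypothetical helper for illustration; assumes decimal integer IDs.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw_worker = lowered.get(WORKER_HEADER)
    if raw_worker is None:
        # Direct mode has no fallback: the EPP must already have picked a worker.
        raise ValueError(f"direct routing requires the {WORKER_HEADER} header")
    raw_prefill = lowered.get(PREFILL_HEADER)
    prefill = int(raw_prefill) if raw_prefill is not None else None
    return int(raw_worker), prefill
```

The prefill ID is optional in this sketch because, per the header table, it is only set in disaggregated mode.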
**Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures). **Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example, If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example,
to allow 60 minutes for startup: to allow 60 minutes for startup:
...@@ -166,6 +167,18 @@ If you installed it into a different namespace, you need to adjust the HttpRoute ...@@ -166,6 +167,18 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
##### 5.b.1 Deploy Your Model ### ##### 5.b.1 Deploy Your Model ###
We provide an example for Qwen vLLM below.
Before deploying, you must set `--router-mode direct` in the FrontEnd CLI in your Dynamo Graph.
```yaml
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace. Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Sample commands to deploy model: Sample commands to deploy model:
...@@ -176,10 +189,6 @@ cd examples/backends/vllm/deploy ...@@ -176,10 +189,6 @@ cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model kubectl apply -f agg.yaml -n my-model
``` ```
Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget docker registry secret if needed.
##### 5.b.2 Install Dynamo GIE helm chart ### ##### 5.b.2 Install Dynamo GIE helm chart ###
```bash ```bash
...@@ -214,14 +223,14 @@ You can configure the plugin by setting environment variables in the EPP compone ...@@ -214,14 +223,14 @@ You can configure the plugin by setting environment variables in the EPP compone
Common Vars for Routing Configuration: Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled. - Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available the request will be served in the aggregated manner. - Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available the request will be served in the aggregated manner.
- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion. - Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- If using kv-routing: - Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size.The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures. - Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true)
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. - `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration). - `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing - `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)
- See the [Router Guide](../../docs/pages/components/router/router-guide.md) for details. - `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` — Track output blocks during generation (default: false)
- See the [KV cache routing design](../../docs/pages/design-docs/router-design.md) for details.
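The effect of `DYN_ROUTER_TEMPERATURE` can be illustrated with a small softmax sketch. This is not the plugin's implementation: the function name `selection_probabilities` is hypothetical, and the sketch assumes that a higher score means a better candidate. The actual router may combine scores differently.

```python
import math
from typing import Dict


def selection_probabilities(scores: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn per-worker scores into selection probabilities (illustrative only)."""
    if not scores:
        raise ValueError("no candidate workers")
    if temperature <= 0.0:
        # Temperature 0.0 (the documented default): always pick the top candidate.
        best = max(scores, key=scores.get)
        return {worker: 1.0 if worker == best else 0.0 for worker in scores}
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores.values())
    weights = {w: math.exp((s - peak) / temperature) for w, s in scores.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}
```

With scores `{"a": 2.0, "b": 1.0}`, raising the temperature from 1.0 to 10.0 moves the probabilities closer to uniform, which is the exploration behaviour described above.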
Stand-Alone installation only: Stand-Alone installation only:
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace. - Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
...@@ -272,7 +281,7 @@ b. use port-forward to expose the gateway to the host ...@@ -272,7 +281,7 @@ b. use port-forward to expose the gateway to the host
```bash ```bash
# in first terminal # in first terminal
kubectl port-forward svc/inference-gateway 8000:80 -n my-model kubectl port-forward svc/inference-gateway 8000:80 -n kgateway-system
# in second terminal where you want to send inference requests # in second terminal where you want to send inference requests
GATEWAY_URL=http://localhost:8000 GATEWAY_URL=http://localhost:8000
...@@ -359,6 +368,14 @@ Sample inference output: ...@@ -359,6 +368,14 @@ Sample inference output:
} }
``` ```
***If you have more than one HttpRoute running on the cluster***
Add the hostname to your HttpRoute.yaml and send the matching `Host` header with every request, for example `curl -H "Host: llama3-70b-agg.example.com" ...`.
```yaml
spec:
hostnames:
- llama3-70b-agg.example.com
```
### 8. Deleting the installation ### ### 8. Deleting the installation ###
If you need to uninstall run: If you need to uninstall run:
...@@ -407,4 +424,4 @@ The plugins set HTTP headers that are forwarded to the backend workers. ...@@ -407,4 +424,4 @@ The plugins set HTTP headers that are forwarded to the backend workers.
| Header | Description | Set By | | Header | Description | Set By |
|--------|-------------|--------| |--------|-------------|--------|
| `x-worker-instance-id` | Primary worker ID (decode worker in disagg mode) | kv-aware-scorer | | `x-worker-instance-id` | Primary worker ID (decode worker in disagg mode) | kv-aware-scorer |
| `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer | | `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer |
\ No newline at end of file
...@@ -19,7 +19,6 @@ ...@@ -19,7 +19,6 @@
{{- $useDynamo := default false .Values.epp.useDynamo -}} {{- $useDynamo := default false .Values.epp.useDynamo -}}
{{- $resolvedDynNs := (include "dynamo-gaie.dynamoNamespace" .) | trim -}} {{- $resolvedDynNs := (include "dynamo-gaie.dynamoNamespace" .) | trim -}}
{{- $ns := ternary (required "set dynamoGraphDeploymentName when epp.useDynamo=true" $resolvedDynNs) "" $useDynamo -}} {{- $ns := ternary (required "set dynamoGraphDeploymentName when epp.useDynamo=true" $resolvedDynNs) "" $useDynamo -}}
{{- $kv := default "16" .Values.epp.dynamo.kvBlockSize -}}
{{- $useEtcd := default false .Values.epp.dynamo.useEtcd -}} {{- $useEtcd := default false .Values.epp.dynamo.useEtcd -}}
{{- $eppImage := required "extension.image is required - set via --set-string extension.image=$EPP_IMAGE or in values file" .Values.extension.image }} {{- $eppImage := required "extension.image is required - set via --set-string extension.image=$EPP_IMAGE or in values file" .Values.extension.image }}
...@@ -113,8 +112,6 @@ spec: ...@@ -113,8 +112,6 @@ spec:
value: "nats://{{ $platformName }}-nats.{{ $platformNs }}:4222" value: "nats://{{ $platformName }}-nats.{{ $platformNs }}:4222"
- name: DYN_NAMESPACE - name: DYN_NAMESPACE
value: "{{ $ns }}" value: "{{ $ns }}"
- name: DYN_KV_BLOCK_SIZE
value: "{{ $kv }}"
- name: USE_STREAMING - name: USE_STREAMING
value: "true" value: "true"
# HuggingFace token for downloading model config files # HuggingFace token for downloading model config files
......
...@@ -72,7 +72,6 @@ epp: ...@@ -72,7 +72,6 @@ epp:
# Dynamo-specific settings (only used when useDynamo: true) # Dynamo-specific settings (only used when useDynamo: true)
configFile: "/etc/epp/epp-config-dynamo.yaml" configFile: "/etc/epp/epp-config-dynamo.yaml"
dynamo: dynamo:
kvBlockSize: "16"
# Use ETCD for discovery instead of Kubernetes (default: false) # Use ETCD for discovery instead of Kubernetes (default: false)
# Set to true via --set epp.dynamo.useEtcd=true to enable ETCD discovery # Set to true via --set epp.dynamo.useEtcd=true to enable ETCD discovery
useEtcd: false useEtcd: false
......
...@@ -87,10 +87,6 @@ func (e *EPPDefaults) GetBaseContainer(context ComponentContext) (corev1.Contain ...@@ -87,10 +87,6 @@ func (e *EPPDefaults) GetBaseContainer(context ComponentContext) (corev1.Contain
// EPP-specific environment variables // EPP-specific environment variables
container.Env = append(container.Env, []corev1.EnvVar{ container.Env = append(container.Env, []corev1.EnvVar{
{
Name: "DYN_KV_BLOCK_SIZE",
Value: "16",
},
{ {
Name: "USE_STREAMING", Name: "USE_STREAMING",
Value: "true", Value: "true",
......
...@@ -1635,6 +1635,7 @@ dependencies = [ ...@@ -1635,6 +1635,7 @@ dependencies = [
"ahash", "ahash",
"aho-corasick", "aho-corasick",
"akin", "akin",
"aligned-vec",
"anyhow", "anyhow",
"async-nats", "async-nats",
"async-stream", "async-stream",
...@@ -1651,6 +1652,7 @@ dependencies = [ ...@@ -1651,6 +1652,7 @@ dependencies = [
"bytes", "bytes",
"candle-core", "candle-core",
"chrono", "chrono",
"cudarc",
"dashmap 5.5.3", "dashmap 5.5.3",
"derive-getters", "derive-getters",
"derive_builder", "derive_builder",
...@@ -1685,6 +1687,8 @@ dependencies = [ ...@@ -1685,6 +1687,8 @@ dependencies = [
"ndarray", "ndarray",
"ndarray-interp", "ndarray-interp",
"ndarray-npy", "ndarray-npy",
"nix 0.26.4",
"nixl-sys",
"object_store", "object_store",
"offset-allocator", "offset-allocator",
"oneshot", "oneshot",
...@@ -3962,6 +3966,15 @@ version = "0.3.3" ...@@ -3962,6 +3966,15 @@ version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "38d1115007560874e373613744c6fba374c17688327a71c1476d1a5954cc857b" checksum = "38d1115007560874e373613744c6fba374c17688327a71c1476d1a5954cc857b"
[[package]]
name = "memoffset"
version = "0.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5de893c32cde5f383baa4c04c5d6dbdd735cfd4a794b0debdb2bb1b421da5ff4"
dependencies = [
"autocfg",
]
[[package]] [[package]]
name = "memoffset" name = "memoffset"
version = "0.9.1" version = "0.9.1"
...@@ -4255,6 +4268,19 @@ dependencies = [ ...@@ -4255,6 +4268,19 @@ dependencies = [
"thiserror 1.0.69", "thiserror 1.0.69",
] ]
[[package]]
name = "nix"
version = "0.26.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "598beaf3cc6fdd9a5dfb1630c2800c7acd31df7aaf0f565796fba2b53ca1af1b"
dependencies = [
"bitflags 1.3.2",
"cfg-if 1.0.4",
"libc",
"memoffset 0.7.1",
"pin-utils",
]
[[package]] [[package]]
name = "nix" name = "nix"
version = "0.29.0" version = "0.29.0"
...@@ -5501,7 +5527,7 @@ dependencies = [ ...@@ -5501,7 +5527,7 @@ dependencies = [
"cfg-if 1.0.4", "cfg-if 1.0.4",
"indoc", "indoc",
"libc", "libc",
"memoffset", "memoffset 0.9.1",
"once_cell", "once_cell",
"portable-atomic", "portable-atomic",
"pyo3-build-config", "pyo3-build-config",
......
...@@ -46,6 +46,9 @@ pub enum RouterMode { ...@@ -46,6 +46,9 @@ pub enum RouterMode {
RoundRobin, RoundRobin,
Random, Random,
KV, KV,
/// Direct routing - reads worker ID from each request's routing hints.
/// Used when an external orchestrator (e.g., EPP) handles worker selection.
Direct,
} }
impl From<RouterMode> for RsRouterMode { impl From<RouterMode> for RsRouterMode {
...@@ -54,6 +57,7 @@ impl From<RouterMode> for RsRouterMode { ...@@ -54,6 +57,7 @@ impl From<RouterMode> for RsRouterMode {
RouterMode::RoundRobin => Self::RoundRobin, RouterMode::RoundRobin => Self::RoundRobin,
RouterMode::Random => Self::Random, RouterMode::Random => Self::Random,
RouterMode::KV => Self::KV, RouterMode::KV => Self::KV,
RouterMode::Direct => Self::Direct,
} }
} }
} }
......
...@@ -950,6 +950,7 @@ class RouterMode: ...@@ -950,6 +950,7 @@ class RouterMode:
RoundRobin: "RouterMode" RoundRobin: "RouterMode"
Random: "RouterMode" Random: "RouterMode"
KV: "RouterMode" KV: "RouterMode"
Direct: "RouterMode"
... ...
class RouterConfig: class RouterConfig:
...@@ -968,7 +969,7 @@ class RouterConfig: ...@@ -968,7 +969,7 @@ class RouterConfig:
Create a RouterConfig. Create a RouterConfig.
Args: Args:
mode: The router mode (RoundRobin, Random, or KV) mode: The router mode (RoundRobin, Random, KV, or Direct)
config: Optional KV router configuration (used when mode is KV) config: Optional KV router configuration (used when mode is KV)
active_decode_blocks_threshold: Threshold percentage (0.0-1.0) for decode blocks busy detection active_decode_blocks_threshold: Threshold percentage (0.0-1.0) for decode blocks busy detection
active_prefill_tokens_threshold: Literal token count threshold for prefill busy detection active_prefill_tokens_threshold: Literal token count threshold for prefill busy detection
......
...@@ -9,7 +9,7 @@ use crate::{ ...@@ -9,7 +9,7 @@ use crate::{
engines::StreamingEngineAdapter, engines::StreamingEngineAdapter,
entrypoint::{EngineConfig, RouterConfig}, entrypoint::{EngineConfig, RouterConfig},
http::service::metrics::Metrics, http::service::metrics::Metrics,
kv_router::{KvPushRouter, KvRouter, PrefillRouter}, kv_router::{DirectRoutingRouter, KvPushRouter, KvRouter, PrefillRouter},
migration::Migration, migration::Migration,
model_card::ModelDeploymentCard, model_card::ModelDeploymentCard,
preprocessor::{OpenAIPreprocessor, prompt::PromptFormatter}, preprocessor::{OpenAIPreprocessor, prompt::PromptFormatter},
...@@ -274,10 +274,10 @@ where ...@@ -274,10 +274,10 @@ where
.await?; .await?;
let service_backend = match router_mode { let service_backend = match router_mode {
RouterMode::Random | RouterMode::RoundRobin | RouterMode::Direct(_) => { RouterMode::Direct => {
// Non-KV routing: use PushRouter directly. ServiceBackend::from_engine(Arc::new(DirectRoutingRouter::new(router)))
// Note: Per-worker metrics (active_prefill_tokens, active_decode_blocks) are only }
// available in KV routing mode where the router has actual bookkeeping. RouterMode::Random | RouterMode::RoundRobin => {
ServiceBackend::from_engine(Arc::new(router)) ServiceBackend::from_engine(Arc::new(router))
} }
RouterMode::KV => { RouterMode::KV => {
......
...@@ -41,7 +41,7 @@ pub mod worker_query; ...@@ -41,7 +41,7 @@ pub mod worker_query;
pub use config::{KvRouterConfig, RouterConfigOverride}; pub use config::{KvRouterConfig, RouterConfigOverride};
pub use prefill_router::PrefillRouter; pub use prefill_router::PrefillRouter;
pub use push_router::KvPushRouter; pub use push_router::{DirectRoutingRouter, KvPushRouter};
use crate::{ use crate::{
discovery::RuntimeConfigWatch, discovery::RuntimeConfigWatch,
......
...@@ -81,14 +81,6 @@ impl InnerPrefillRouter { ...@@ -81,14 +81,6 @@ impl InnerPrefillRouter {
InnerPrefillRouter::KvRouter(_) => None, InnerPrefillRouter::KvRouter(_) => None,
} }
} }
/// Peek next worker without incrementing state (for non-KV modes only)
fn peek_next_worker(&self) -> Option<u64> {
match self {
InnerPrefillRouter::SimpleRouter(router) => router.peek_next_worker(),
InnerPrefillRouter::KvRouter(_) => None,
}
}
} }
/// PrefillRouter is a forward-only operator that sits between Migration and the decode router. /// PrefillRouter is a forward-only operator that sits between Migration and the decode router.
...@@ -273,7 +265,7 @@ impl PrefillRouter { ...@@ -273,7 +265,7 @@ impl PrefillRouter {
preselected_worker: Option<u64>, preselected_worker: Option<u64>,
) -> Option<(u64, u32, BootstrapInfo)> { ) -> Option<(u64, u32, BootstrapInfo)> {
let endpoint_id = self.endpoint_id.get()?; let endpoint_id = self.endpoint_id.get()?;
let prefill_router = self.prefill_router.get()?; let _prefill_router = self.prefill_router.get()?;
// Worker selection // Worker selection
let (worker_id, dp_rank) = if let Some(id) = preselected_worker { let (worker_id, dp_rank) = if let Some(id) = preselected_worker {
...@@ -284,12 +276,8 @@ impl PrefillRouter { ...@@ -284,12 +276,8 @@ impl PrefillRouter {
"Using pre-selected prefill worker for bootstrap" "Using pre-selected prefill worker for bootstrap"
); );
(id, dp_rank) (id, dp_rank)
} else if self.router_mode.is_kv_routing() { } else {
// KV mode: use find_best_match // Use shared worker selection logic (update_states=false for peek behavior)
let kv_router = match prefill_router {
InnerPrefillRouter::KvRouter(r) => r,
_ => return None,
};
// Extract LORA name and priority jump from routing hints // Extract LORA name and priority jump from routing hints
let lora_name = req.routing.as_ref().and_then(|r| r.lora_name.clone()); let lora_name = req.routing.as_ref().and_then(|r| r.lora_name.clone());
let priority_jump = req let priority_jump = req
...@@ -297,24 +285,14 @@ impl PrefillRouter { ...@@ -297,24 +285,14 @@ impl PrefillRouter {
.as_ref() .as_ref()
.and_then(|r| r.priority_jump) .and_then(|r| r.priority_jump)
.unwrap_or(0.0); .unwrap_or(0.0);
match async { match self
kv_router .query_prefill_worker(&req.token_ids, false, lora_name, priority_jump)
.chooser .instrument(tracing::info_span!("query_prefill_worker"))
.find_best_match(None, &req.token_ids, None, false, lora_name, priority_jump) .await
.await
}
.instrument(tracing::info_span!("kv_find_best_match"))
.await
{ {
Ok((worker, _overlap)) => (worker.worker_id, worker.dp_rank), Ok((worker_id, dp_rank)) => (worker_id, dp_rank),
Err(_) => return None, Err(_) => return None,
} }
} else {
// Non-KV mode: use PushRouter's stateful selection
// We use peek_next_worker instead of select_next_worker to avoid double-incrementing the counter
// if we fall back to the original path.
let worker_id = prefill_router.peek_next_worker()?;
(worker_id, 0)
}; };
// Get bootstrap info from ModelManager (works for ANY mode) // Get bootstrap info from ModelManager (works for ANY mode)
...@@ -489,6 +467,55 @@ impl PrefillRouter { ...@@ -489,6 +467,55 @@ impl PrefillRouter {
// No phase permit needed - we wait for completion before changing phase // No phase permit needed - we wait for completion before changing phase
Self::execute_prefill(self.prefill_router.get().cloned(), request, None, None).await Self::execute_prefill(self.prefill_router.get().cloned(), request, None, None).await
} }
/// Query the best prefill worker without executing a request.
/// Returns (worker_id, dp_rank).
///
/// This is the shared worker selection logic used by both `build_bootstrap_info`
/// and `query_route`.
pub async fn query_prefill_worker(
&self,
token_ids: &[u32],
update_states: bool,
lora_name: Option<String>,
priority_jump: f64,
) -> Result<(u64, u32)> {
let prefill_router = self
.prefill_router
.get()
.ok_or_else(|| anyhow::anyhow!(PrefillError::NotActivated))?;
match prefill_router {
InnerPrefillRouter::KvRouter(r) => {
let (worker, _overlap) = r
.chooser
.find_best_match(
None,
token_ids,
None,
update_states,
lora_name,
priority_jump,
)
.await?;
Ok((worker.worker_id, worker.dp_rank))
}
InnerPrefillRouter::SimpleRouter(r) => {
let worker_id = if update_states {
r.select_next_worker()
} else {
r.peek_next_worker()
}
.ok_or_else(|| anyhow::anyhow!("No workers available for prefill"))?;
Ok((worker_id, 0))
}
}
}
/// Check if disaggregated mode is currently active (prefill router activated)
pub fn is_activated(&self) -> bool {
self.prefill_router.get().is_some()
}
} }
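The `update_states` switch in `query_prefill_worker` distinguishes a side-effect-free peek from a selection that advances the router's state. The following sketch shows that distinction for a simple round-robin cursor. `RoundRobinSelector` and `query_worker` are hypothetical stand-ins written for illustration; they are not the `PushRouter` API.

```python
from typing import List, Optional


class RoundRobinSelector:
    """Hypothetical stand-in for a simple (non-KV) router's worker cursor."""

    def __init__(self, worker_ids: List[int]) -> None:
        self._worker_ids = list(worker_ids)
        self._cursor = 0

    def peek_next_worker(self) -> Optional[int]:
        """Return the next worker without advancing the cursor."""
        if not self._worker_ids:
            return None
        return self._worker_ids[self._cursor % len(self._worker_ids)]

    def select_next_worker(self) -> Optional[int]:
        """Return the next worker and advance the cursor."""
        worker = self.peek_next_worker()
        if worker is not None:
            self._cursor += 1
        return worker


def query_worker(selector: RoundRobinSelector, update_states: bool) -> int:
    """Mirror the update_states switch: select when True, peek when False."""
    if update_states:
        worker = selector.select_next_worker()
    else:
        worker = selector.peek_next_worker()
    if worker is None:
        raise RuntimeError("No workers available for prefill")
    return worker
```

Peeking lets a caller ask which worker would be chosen and then fall back to another path without the counter having been incremented twice.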
impl Drop for PrefillRouter { impl Drop for PrefillRouter {
......
...@@ -48,7 +48,6 @@ struct WorkerSelection { ...@@ -48,7 +48,6 @@ struct WorkerSelection {
struct RequestGuard { struct RequestGuard {
chooser: Arc<KvRouter>, chooser: Arc<KvRouter>,
context_id: String, context_id: String,
handle_local_updates: bool,
tracker: Option<Arc<RequestTracker>>, tracker: Option<Arc<RequestTracker>>,
request_metrics: Arc<RouterRequestMetrics>, request_metrics: Arc<RouterRequestMetrics>,
cumulative_osl: usize, cumulative_osl: usize,
...@@ -59,9 +58,7 @@ struct RequestGuard { ...@@ -59,9 +58,7 @@ struct RequestGuard {
impl RequestGuard { impl RequestGuard {
async fn finish(&mut self) { async fn finish(&mut self) {
self.record_metrics(); self.record_metrics();
if self.handle_local_updates if let Err(e) = self.chooser.free(&self.context_id).await {
&& let Err(e) = self.chooser.free(&self.context_id).await
{
tracing::warn!("Failed to free request {}: {e}", self.context_id); tracing::warn!("Failed to free request {}: {e}", self.context_id);
} }
self.freed = true; self.freed = true;
...@@ -86,7 +83,7 @@ impl RequestGuard { ...@@ -86,7 +83,7 @@ impl RequestGuard {
impl Drop for RequestGuard { impl Drop for RequestGuard {
fn drop(&mut self) { fn drop(&mut self) {
self.record_metrics(); self.record_metrics();
if !self.freed && self.handle_local_updates { if !self.freed {
let chooser = self.chooser.clone(); let chooser = self.chooser.clone();
let context_id = self.context_id.clone(); let context_id = self.context_id.clone();
let Ok(handle) = tokio::runtime::Handle::try_current() else { let Ok(handle) = tokio::runtime::Handle::try_current() else {
...@@ -112,15 +109,13 @@ impl KvPushRouter { ...@@ -112,15 +109,13 @@ impl KvPushRouter {
/// Select a worker for the request, either using a preselected worker or finding the best match. /// Select a worker for the request, either using a preselected worker or finding the best match.
/// ///
/// When `is_query_only` is false and `handle_local_updates` is true, this also registers /// When `is_query_only` is false, this also registers the request with the scheduler via `add_request`.
/// the request with the scheduler via `add_request`.
async fn select_worker( async fn select_worker(
&self, &self,
context_id: &str, context_id: &str,
request: &PreprocessedRequest, request: &PreprocessedRequest,
phase: RequestPhase, phase: RequestPhase,
is_query_only: bool, is_query_only: bool,
handle_local_updates: bool,
) -> Result<WorkerSelection, Error> { ) -> Result<WorkerSelection, Error> {
let routing = request.routing.as_ref(); let routing = request.routing.as_ref();
let lora_name = routing.and_then(|r| r.lora_name.clone()); let lora_name = routing.and_then(|r| r.lora_name.clone());
...@@ -172,7 +167,7 @@ impl KvPushRouter { ...@@ -172,7 +167,7 @@ impl KvPushRouter {
.get_overlap_blocks(&request.token_ids, worker) .get_overlap_blocks(&request.token_ids, worker)
.await?; .await?;
if !is_query_only && handle_local_updates { if !is_query_only {
self.chooser self.chooser
.add_request( .add_request(
context_id.to_string(), context_id.to_string(),
...@@ -234,15 +229,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu ...@@ -234,15 +229,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
// Simple query-only detection: presence of query_instance_id annotation means query-only mode // Simple query-only detection: presence of query_instance_id annotation means query-only mode
let is_query_only = request.get_annotation_value("query_instance_id").is_some(); let is_query_only = request.get_annotation_value("query_instance_id").is_some();
// Determine if this router should handle local state updates (add_request, free, etc.)
// Default is true (router handles bookkeeping). Set to false for GAIE Stage 2 where
// an external orchestrator (e.g., EPP sidecar) handles bookkeeping via C FFI.
let handle_local_updates = request
.routing
.as_ref()
.and_then(|r| r.enable_local_updates)
.unwrap_or(true);
// Get phase from tracker (defaults to Aggregated if no tracker or phase not set) // Get phase from tracker (defaults to Aggregated if no tracker or phase not set)
let phase = request let phase = request
.tracker .tracker
...@@ -252,13 +238,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu ...@@ -252,13 +238,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
let block_size = self.chooser.block_size() as usize; let block_size = self.chooser.block_size() as usize;
let selection = self let selection = self
.select_worker( .select_worker(&context_id, &request, phase, is_query_only)
&context_id,
&request,
phase,
is_query_only,
handle_local_updates,
)
.await?; .await?;
let WorkerSelection { let WorkerSelection {
instance_id, instance_id,
...@@ -335,8 +315,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu ...@@ -335,8 +315,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
.routing .routing
.as_ref() .as_ref()
.and_then(|r| r.expected_output_tokens); .and_then(|r| r.expected_output_tokens);
let track_output_blocks = let track_output_blocks = self.chooser.kv_router_config().router_track_output_blocks;
self.chooser.kv_router_config().router_track_output_blocks && handle_local_updates;
let tracker = request.tracker.clone(); let tracker = request.tracker.clone();
let (mut backend_input, context) = request.into_parts(); let (mut backend_input, context) = request.into_parts();
...@@ -360,7 +339,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu ...@@ -360,7 +339,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
let mut guard = RequestGuard { let mut guard = RequestGuard {
chooser: chooser.clone(), chooser: chooser.clone(),
context_id: context_id.clone(), context_id: context_id.clone(),
handle_local_updates,
tracker: tracker.clone(), tracker: tracker.clone(),
request_metrics: request_metrics.clone(), request_metrics: request_metrics.clone(),
cumulative_osl: 0, cumulative_osl: 0,
...@@ -385,7 +363,9 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu ...@@ -385,7 +363,9 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
break; break;
}; };
if handle_local_updates && !prefill_marked { if !prefill_marked {
// Only mark prefill completed when we receive actual tokens,
// not empty bootstrap info (token_ids: []) from disaggregated prefill
let has_tokens = item.data.as_ref() let has_tokens = item.data.as_ref()
.map(|d| !d.token_ids.is_empty()) .map(|d| !d.token_ids.is_empty())
.unwrap_or(false); .unwrap_or(false);
...@@ -451,3 +431,48 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu ...@@ -451,3 +431,48 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
Ok(ResponseStream::new(wrapped_stream, stream_context)) Ok(ResponseStream::new(wrapped_stream, stream_context))
} }
} }
/// A direct routing wrapper for `RouterMode::Direct`.
///
/// This wraps a `PushRouter` and reads worker IDs from each request's routing hints,
/// then routes directly to the specified worker. Used when an external router
/// (e.g., EPP) handles worker selection.
pub struct DirectRoutingRouter {
inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>,
}
impl DirectRoutingRouter {
pub fn new(inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>) -> Self {
DirectRoutingRouter { inner }
}
/// Extract worker ID from request routing hints.
/// Returns an error if no worker ID is found (required in direct routing mode).
fn get_worker_id(request: &PreprocessedRequest) -> Result<u64, Error> {
let routing = request.routing.as_ref();
let worker_id = routing.and_then(|r| r.decode_worker_id.or(r.backend_instance_id));
worker_id.ok_or_else(|| {
anyhow::anyhow!(
"Worker ID required (--router-mode direct) but none found in request. \
Expected decode_worker_id or backend_instance_id to be set by external router (e.g., EPP)."
)
})
}
}
#[async_trait]
impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutput>>, Error>
for DirectRoutingRouter
{
async fn generate(
&self,
request: SingleIn<PreprocessedRequest>,
) -> Result<ManyOut<Annotated<LLMEngineOutput>>, Error> {
let worker_id = Self::get_worker_id(&request)?;
tracing::debug!(worker_id = worker_id, "Direct routing to specified worker");
self.inner.direct(request, worker_id).await
}
}
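The fallback order in `get_worker_id` above can be sketched in isolation. This is a minimal illustration, not the crate's code: `RoutingHints` here is a simplified stand-in holding only the two fields the lookup reads, and the error type is a plain `String` rather than the `Error` used in the diff.

```rust
/// Simplified, hypothetical stand-in for the routing hints on a request.
/// Only the two fields consulted by direct routing are modelled.
#[derive(Debug, Default)]
struct RoutingHints {
    decode_worker_id: Option<u64>,
    backend_instance_id: Option<u64>,
}

/// Prefer `decode_worker_id`, fall back to `backend_instance_id`,
/// and return an error when neither is set (or no hints exist at all).
fn get_worker_id(routing: Option<&RoutingHints>) -> Result<u64, String> {
    routing
        .and_then(|r| r.decode_worker_id.or(r.backend_instance_id))
        .ok_or_else(|| "worker ID required in direct routing mode".to_string())
}
```

With both fields set, `decode_worker_id` wins; a request with no routing hints at all is rejected rather than silently routed elsewhere.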
...@@ -280,7 +280,6 @@ impl OpenAIPreprocessor { ...@@ -280,7 +280,6 @@ impl OpenAIPreprocessor {
prefill_worker_id: nvext.prefill_worker_id, prefill_worker_id: nvext.prefill_worker_id,
decode_worker_id: nvext.decode_worker_id, decode_worker_id: nvext.decode_worker_id,
dp_rank: None, // dp_rank is set later in the pipeline dp_rank: None, // dp_rank is set later in the pipeline
enable_local_updates: nvext.enable_local_updates,
expected_output_tokens: hints.and_then(|h| h.osl), expected_output_tokens: hints.and_then(|h| h.osl),
priority_jump: hints.and_then(|h| h.latency_sensitivity), priority_jump: hints.and_then(|h| h.latency_sensitivity),
lora_name, lora_name,
......
...@@ -34,14 +34,6 @@ pub struct RoutingHints { ...@@ -34,14 +34,6 @@ pub struct RoutingHints {
#[serde(default, skip_serializing_if = "Option::is_none")] #[serde(default, skip_serializing_if = "Option::is_none")]
pub dp_rank: Option<u32>, pub dp_rank: Option<u32>,
/// Controls whether the router should manage local bookkeeping (add_request,
/// mark_prefill_completed, free) for this request.
///
/// - `None` or `Some(true)`: Router handles bookkeeping locally (default behavior)
/// - `Some(false)`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
#[serde(default, skip_serializing_if = "Option::is_none")]
pub enable_local_updates: Option<bool>,
/// Expected number of output tokens for this request. /// Expected number of output tokens for this request.
/// Used as a hint for routing decisions to estimate resource requirements. /// Used as a hint for routing decisions to estimate resource requirements.
#[serde(default, skip_serializing_if = "Option::is_none")] #[serde(default, skip_serializing_if = "Option::is_none")]
......
...@@ -11,16 +11,12 @@ pub use crate::protocols::common::timing::TimingInfo; ...@@ -11,16 +11,12 @@ pub use crate::protocols::common::timing::TimingInfo;
pub const HEADER_WORKER_INSTANCE_ID: &str = "x-worker-instance-id"; pub const HEADER_WORKER_INSTANCE_ID: &str = "x-worker-instance-id";
pub const HEADER_PREFILL_INSTANCE_ID: &str = "x-prefill-instance-id"; pub const HEADER_PREFILL_INSTANCE_ID: &str = "x-prefill-instance-id";
/// Header to disable local bookkeeping updates (for GAIE Stage 2)
/// When set to "false", the router skips add_request, mark_prefill_completed, and free calls.
pub const HEADER_ENABLE_LOCAL_UPDATES: &str = "x-enable-local-updates";
/// Apply routing overrides from HTTP headers to nvext. /// Apply routing overrides from HTTP headers to nvext.
/// ///
/// Header mappings: /// Header mappings:
/// - `x-worker-instance-id` -> `backend_instance_id` and `decode_worker_id` /// - `x-worker-instance-id` -> `backend_instance_id` and `decode_worker_id`
/// - `x-prefill-instance-id` -> `prefill_worker_id` /// - `x-prefill-instance-id` -> `prefill_worker_id`
/// - `x-enable-local-updates` -> `enable_local_updates` (set to false to disable router bookkeeping)
/// ///
/// Headers take priority over existing nvext values when present. /// Headers take priority over existing nvext values when present.
/// If no headers are present, returns the original nvext unchanged. /// If no headers are present, returns the original nvext unchanged.
...@@ -35,17 +31,7 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap) ...@@ -35,17 +31,7 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
.and_then(|v| v.to_str().ok()) .and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok()); .and_then(|s| s.parse::<u64>().ok());
// Parse enable_local_updates header: "true" or "false" if worker_id.is_none() && prefill_id.is_none() {
let enable_local_updates = headers
.get(HEADER_ENABLE_LOCAL_UPDATES)
.and_then(|v| v.to_str().ok())
.and_then(|s| match s.to_lowercase().as_str() {
"true" | "1" => Some(true),
"false" | "0" => Some(false),
_ => None,
});
if worker_id.is_none() && prefill_id.is_none() && enable_local_updates.is_none() {
return nvext; return nvext;
} }
...@@ -57,9 +43,6 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap) ...@@ -57,9 +43,6 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
if let Some(id) = prefill_id { if let Some(id) = prefill_id {
ext.prefill_worker_id = Some(id); ext.prefill_worker_id = Some(id);
} }
if let Some(enabled) = enable_local_updates {
ext.enable_local_updates = Some(enabled);
}
Some(ext) Some(ext)
} }
...@@ -169,17 +152,6 @@ pub struct NvExt { ...@@ -169,17 +152,6 @@ pub struct NvExt {
#[serde(default, skip_serializing_if = "Option::is_none")] #[serde(default, skip_serializing_if = "Option::is_none")]
pub decode_worker_id: Option<u64>, pub decode_worker_id: Option<u64>,
/// Controls whether the router should manage local bookkeeping (add_request,
/// mark_prefill_completed, free) for this request.
///
/// - `None` or `true`: Router handles bookkeeping locally (default behavior)
/// - `false`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
///
/// Set to `false` for GAIE Stage 2 when the EPP/sidecar manages request lifecycle.
#[builder(default, setter(strip_option))]
#[serde(default, skip_serializing_if = "Option::is_none")]
pub enable_local_updates: Option<bool>,
/// Agent-provided hints for request handling. /// Agent-provided hints for request handling.
#[builder(default, setter(strip_option))] #[builder(default, setter(strip_option))]
#[serde(default, skip_serializing_if = "Option::is_none")] #[serde(default, skip_serializing_if = "Option::is_none")]
...@@ -187,7 +159,7 @@ pub struct NvExt { ...@@ -187,7 +159,7 @@ pub struct NvExt {
} }
/// Hints from the agent/caller about request characteristics. /// Hints from the agent/caller about request characteristics.
#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default)] #[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default, PartialEq)]
pub struct AgentHints { pub struct AgentHints {
/// Latency sensitivity in seconds for queue ordering. /// Latency sensitivity in seconds for queue ordering.
/// Higher values cause the request to be scheduled sooner when the router queue is enabled. /// Higher values cause the request to be scheduled sooner when the router queue is enabled.
...@@ -249,7 +221,7 @@ mod tests { ...@@ -249,7 +221,7 @@ mod tests {
assert_eq!(nv_ext.extra_fields, None); assert_eq!(nv_ext.extra_fields, None);
assert_eq!(nv_ext.prefill_worker_id, None); assert_eq!(nv_ext.prefill_worker_id, None);
assert_eq!(nv_ext.decode_worker_id, None); assert_eq!(nv_ext.decode_worker_id, None);
assert_eq!(nv_ext.enable_local_updates, None); assert_eq!(nv_ext.agent_hints, None);
} }
// Test valid builder configurations // Test valid builder configurations
......
...@@ -83,15 +83,18 @@ pub enum RouterMode { ...@@ -83,15 +83,18 @@ pub enum RouterMode {
#[default] #[default]
RoundRobin, RoundRobin,
Random, Random,
Direct(u64),
// Marker value, KV routing itself is in dynamo-llm
KV, KV,
Direct,
} }
impl RouterMode { impl RouterMode {
pub fn is_kv_routing(&self) -> bool { pub fn is_kv_routing(&self) -> bool {
*self == RouterMode::KV *self == RouterMode::KV
} }
pub fn is_direct_routing(&self) -> bool {
*self == RouterMode::Direct
}
} }
async fn addressed_router(endpoint: &Endpoint) -> anyhow::Result<Arc<AddressedPushRouter>> { async fn addressed_router(endpoint: &Endpoint) -> anyhow::Result<Arc<AddressedPushRouter>> {
...@@ -415,14 +418,17 @@ where ...@@ -415,14 +418,17 @@ where
U: Data + for<'de> Deserialize<'de> + MaybeError, U: Data + for<'de> Deserialize<'de> + MaybeError,
{ {
async fn generate(&self, request: SingleIn<T>) -> Result<ManyOut<U>, Error> { async fn generate(&self, request: SingleIn<T>) -> Result<ManyOut<U>, Error> {
//InstanceSource::Static => self.r#static(request).await,
match self.router_mode { match self.router_mode {
RouterMode::Random => self.random(request).await, RouterMode::Random => self.random(request).await,
RouterMode::RoundRobin => self.round_robin(request).await, RouterMode::RoundRobin => self.round_robin(request).await,
RouterMode::Direct(instance_id) => self.direct(request, instance_id).await,
RouterMode::KV => { RouterMode::KV => {
anyhow::bail!("KV routing should not call generate on PushRouter"); anyhow::bail!("KV routing should not call generate on PushRouter");
} }
RouterMode::Direct => {
anyhow::bail!(
"Direct routing should not call generate on PushRouter directly; use DirectRoutingRouter wrapper"
);
}
} }
} }
} }
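The shape of the new `RouterMode` and its dispatch can be sketched on its own. This is an illustration only: `dispatch` is a hypothetical free function that returns a label instead of calling the real `random`/`round_robin` methods, and errors are plain `String`s rather than `anyhow` errors.

```rust
/// Mirror of the enum after the change: `Direct` no longer carries an
/// instance ID, because the ID now arrives with each request.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
enum RouterMode {
    #[default]
    RoundRobin,
    Random,
    KV,
    Direct,
}

impl RouterMode {
    fn is_kv_routing(&self) -> bool {
        *self == RouterMode::KV
    }

    fn is_direct_routing(&self) -> bool {
        *self == RouterMode::Direct
    }
}

/// Hypothetical stand-in for `PushRouter::generate`: modes that need a
/// wrapper (KV, Direct) are rejected instead of being handled here.
fn dispatch(mode: RouterMode) -> Result<&'static str, String> {
    match mode {
        RouterMode::Random => Ok("random"),
        RouterMode::RoundRobin => Ok("round_robin"),
        RouterMode::KV => Err("KV routing should not call generate on PushRouter".into()),
        RouterMode::Direct => Err("use the DirectRoutingRouter wrapper".into()),
    }
}
```

Making `Direct` a unit variant lets the enum describe a mode rather than a fixed target, at the cost of pushing the per-request lookup into a separate wrapper type.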
...@@ -16,12 +16,7 @@ spec: ...@@ -16,12 +16,7 @@ spec:
replicas: 1 replicas: 1
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/frontend:0.8.0 image: nvcr.io/nvidia/ai-dynamo/frontend:my-tag
env:
- name: DYN_KV_BLOCK_SIZE
value: "128"
- name: DYN_MODEL
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
eppConfig: eppConfig:
# This configuration uses Dynamo's KV-aware scorer for intelligent routing # This configuration uses Dynamo's KV-aware scorer for intelligent routing
config: config:
...@@ -49,8 +44,15 @@ spec: ...@@ -49,8 +44,15 @@ spec:
mountPoint: /opt/models mountPoint: /opt/models
extraPodSpec: extraPodSpec:
mainContainer: mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0 image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
envs: envs:
- name: HF_HOME - name: HF_HOME
value: /opt/models value: /opt/models
...@@ -79,7 +81,7 @@ spec: ...@@ -79,7 +81,7 @@ spec:
command: command:
- /bin/sh - /bin/sh
- -c - -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0 image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm workingDir: /workspace/examples/backends/vllm
replicas: 1 replicas: 1
resources: resources:
......