feat: Decomposed pipeline for EPP integration [DEP-730] (#5446)

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

Commit 52271760 (unverified), authored Feb 12, 2026 by atchernych; committed by GitHub Feb 13, 2026
20 changed files
--- a/components/src/dynamo/frontend/frontend_args.py
+++ b/components/src/dynamo/frontend/frontend_args.py
@@ -167,7 +167,7 @@ class FrontendArgGroup(ArgGroup):
            env_var="DYN_ROUTER_MODE",
            default="round-robin",
            help="How to route the request.",
-            choices=["round-robin", "random", "kv"],
+            choices=["round-robin", "random", "kv", "direct"],
        )
        add_argument(
            g,

--- a/components/src/dynamo/frontend/main.py
+++ b/components/src/dynamo/frontend/main.py
@@ -7,7 +7,7 @@
 # - OpenAI HTTP server.
 # - Auto-discovery: Watches etcd for engine/worker registration (via `register_llm`).
 # - Pre-processor: Prompt templating and tokenization.
-# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv).
+# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv, direct).
 #
 # Pass `--interactive` or `-i` for text chat instead of HTTP server.
 #
@@ -197,6 +197,9 @@ async def async_main():
    elif config.router_mode == "random":
        router_mode = RouterMode.Random
        kv_router_config = None
+    elif config.router_mode == "direct":
+        router_mode = RouterMode.Direct
+        kv_router_config = None
    else:
        router_mode = RouterMode.RoundRobin
        kv_router_config = None
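
The branch added above maps the `direct` CLI value onto a router mode and otherwise falls through to round-robin. A minimal, self-contained sketch of that mapping follows. The `RouterMode` enum here is a hypothetical stand-in for the real binding in `dynamo._core`, and `resolve_router_mode` is an illustrative helper, not a function from the repository.

```python
from enum import Enum


class RouterMode(Enum):
    """Hypothetical stand-in for the RouterMode binding; values mirror the CLI choices."""

    ROUND_ROBIN = "round-robin"
    RANDOM = "random"
    KV = "kv"
    DIRECT = "direct"


def resolve_router_mode(name: str) -> RouterMode:
    """Map a --router-mode string to a mode, defaulting to round-robin.

    Assumption: unknown names fall back to round-robin, mirroring the final
    `else` branch in the diff. In the real CLI, `choices` rejects unknown
    values before this point is reached.
    """
    try:
        return RouterMode(name)
    except ValueError:
        return RouterMode.ROUND_ROBIN


if __name__ == "__main__":
    for value in ("round-robin", "random", "kv", "direct"):
        print(value, "->", resolve_router_mode(value).name)
```

Only the `kv` mode carries an extra router config in the diff; the other modes, including `direct`, set `kv_router_config = None`.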

--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -63,26 +63,6 @@ kubectl get gateway inference-gateway
 ### 3. Setup secrets ###
-Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
-Make sure to enable kv-routing by adding the env var in the FrontEnd.
-```bash
-    mainContainer:
-      image: ...
-      env:
-        - name: DYN_ROUTER_MODE
-          value: "kv"
-```
-Sample commands to deploy model:
-```bash
-cd <dynamo-source-root>
-cd examples/backends/vllm/deploy
-kubectl apply -f agg.yaml -n my-model
-```
-Take a note of or change the DYNAMO_IMAGE in the model deployment file.
 Do not forget docker registry secret if needed.
 ```bash
@@ -93,7 +73,7 @@ kubectl create secret docker-registry docker-imagepullsecret \
  --namespace=$NAMESPACE
 ```
-Do not forget to include the HuggingFace token if required.
+Do not forget to include the HuggingFace token.
 ```bash
 export HF_TOKEN=your_hf_token
@@ -139,13 +119,34 @@ make info # Check image tag
 We recommend deploying Inference Gateway's Endpoint Picker as a Dynamo operator's managed component. Alternatively,
 you could deploy it as a standalone pod
-#### 5.a. Deploy as a DGD component
+#### 5.a. Deploy as a DGD component (recommended)
+We provide an example for llama-3-70b vLLM below.
 ```bash
-kubectl apply -f operator-managed/examples/agg.yaml -n ${NAMESPACE}
-kubectl apply -f operator-managed/examples/http-route.yaml -n ${NAMESPACE}
+# Deploy the PVC. First update `storageClassName` in recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster.
+kubectl apply -f recipes/llama-3-70b/model-cache/model-cache.yaml
+kubectl apply -f recipes/llama-3-70b/model-cache/model-download.yaml
+# Deploy your model
+kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml -n ${NAMESPACE}
+# Deploy the GAIE http-route CR.
+kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/http-route.yaml -n ${NAMESPACE}
 ```
+- When using GAIE the FrontEnd does not choose the workers. The routing is determined in the EPP.
+- You must set `--router-mode direct` in the FrontEnd CLI, as below.
+```bash
+    command:
+      - python3
+    args:
+      - -m
+      - dynamo.frontend
+      - --router-mode
+      - direct
+```
+- The pre-selected workers (decode, plus prefill in the case of disaggregated serving) are passed in the request headers.
+- The `--router-mode direct` flag ensures that routing respects this selection.
 **Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
 If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example,
 to allow 60 minutes for startup:
@@ -166,6 +167,18 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
 ##### 5.b.1 Deploy Your Model ###
+We provide an example for Qwen vLLM below.
+Before deploying, you must set `--router-mode direct` in the FrontEnd CLI in your Dynamo Graph.
+```bash
+    command:
+      - python3
+    args:
+      - -m
+      - dynamo.frontend
+      - --router-mode
+      - direct
+```
 Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
 Sample commands to deploy model:
@@ -176,10 +189,6 @@ cd examples/backends/vllm/deploy
 kubectl apply -f agg.yaml -n my-model
 ```
-Take a note of or change the DYNAMO_IMAGE in the model deployment file.
-Do not forget docker registry secret if needed.
 ##### 5.b.2 Install Dynamo GIE helm chart ###
 ```bash
@@ -214,14 +223,14 @@ You can configure the plugin by setting environment variables in the EPP compone
 Common Vars for Routing Configuration:
 - Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
 - Set `DYN_ENFORCE_DISAGG=true` if you want to enforce that every request is served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available the request will be served in the aggregated manner.
- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
+- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- If using kv-routing:
+- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
-  - Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size.The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
+- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true)
-  - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
+- `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
-  - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
+- `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
-  - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
+- `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)
-  - See the [Router Guide](../../docs/pages/components/router/router-guide.md) for details.
+- `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` — Track output blocks during generation (default: false)
+- See the [KV cache routing design](../../docs/pages/design-docs/router-design.md) for details.
 Stand-Alone installation only:
 - Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
@@ -272,7 +281,7 @@ b. use port-forward to expose the gateway to the host
 ```bash
 # in first terminal
-kubectl port-forward svc/inference-gateway 8000:80 -n my-model
+kubectl port-forward svc/inference-gateway 8000:80 -n kgateway-system
 # in second terminal where you want to send inference requests
 GATEWAY_URL=http://localhost:8000
@@ -359,6 +368,14 @@ Sample inference output:
 }
 ```
+***If you have more than one HttpRoute running on the cluster***
+Add the host to your HttpRoute.yaml and send a matching `Host` header with every request, for example `curl -H "Host: llama3-70b-agg.example.com" ...`.
+```bash
+spec:
+  hostnames:
+    - llama3-70b-agg.example.com
+```
 ### 8. Deleting the installation ###
 If you need to uninstall run:
@@ -407,4 +424,4 @@ The plugins set HTTP headers that are forwarded to the backend workers.
 | Header | Description | Set By |
 |--------|-------------|--------|
 | `x-worker-instance-id` | Primary worker ID (decode worker in disagg mode) | kv-aware-scorer |
 | `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer |
\ No newline at end of file
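
The two headers in the table above carry numeric worker IDs that the frontend parses into routing hints. The following is a rough Python sketch of that parsing step, assuming the IDs are unsigned 64-bit decimal integers and that a missing or unparsable header is simply ignored. The helper names `parse_u64` and `routing_overrides` are illustrative only; the actual logic lives in Rust in `apply_header_routing_overrides`, and its exact parsing rules may differ in edge cases.

```python
from typing import Mapping, Optional, Tuple

HEADER_WORKER_INSTANCE_ID = "x-worker-instance-id"
HEADER_PREFILL_INSTANCE_ID = "x-prefill-instance-id"
U64_MAX = 2**64 - 1


def parse_u64(value: Optional[str]) -> Optional[int]:
    """Return the value as an unsigned 64-bit integer, or None if absent or invalid."""
    if value is None or not value.isascii() or not value.isdigit():
        return None
    number = int(value)
    return number if number <= U64_MAX else None


def routing_overrides(headers: Mapping[str, str]) -> Tuple[Optional[int], Optional[int]]:
    """Extract (worker_id, prefill_id) from request headers, matching names case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    worker_id = parse_u64(lowered.get(HEADER_WORKER_INSTANCE_ID))
    prefill_id = parse_u64(lowered.get(HEADER_PREFILL_INSTANCE_ID))
    return worker_id, prefill_id


if __name__ == "__main__":
    print(routing_overrides({"X-Worker-Instance-Id": "42", "x-prefill-instance-id": "7"}))
```

When both values come back as `None`, the request carries no routing override, which corresponds to the early `return nvext` path in the Rust code.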
--- a/deploy/inference-gateway/epp/pkg/plugins/dynamo_kv_scorer/plugin.go
+++ b/deploy/inference-gateway/epp/pkg/plugins/dynamo_kv_scorer/plugin.go
--- a/deploy/inference-gateway/standalone/helm/dynamo-gaie/templates/dynamo-epp.yaml
+++ b/deploy/inference-gateway/standalone/helm/dynamo-gaie/templates/dynamo-epp.yaml
@@ -19,7 +19,6 @@
 {{- $useDynamo    := default false .Values.epp.useDynamo -}}
 {{- $resolvedDynNs := (include "dynamo-gaie.dynamoNamespace" .) | trim -}}
 {{- $ns           := ternary (required "set dynamoGraphDeploymentName when epp.useDynamo=true" $resolvedDynNs) "" $useDynamo -}}
-{{- $kv           := default "16" .Values.epp.dynamo.kvBlockSize -}}
 {{- $useEtcd      := default false .Values.epp.dynamo.useEtcd -}}
 {{- $eppImage     := required "extension.image is required - set via --set-string extension.image=$EPP_IMAGE or in values file" .Values.extension.image }}
@@ -113,8 +112,6 @@ spec:
            value: "nats://{{ $platformName }}-nats.{{ $platformNs }}:4222"
          - name: DYN_NAMESPACE
            value: "{{ $ns }}"
-          - name: DYN_KV_BLOCK_SIZE
-            value: "{{ $kv }}"
          - name: USE_STREAMING
            value: "true"
          # HuggingFace token for downloading model config files

--- a/deploy/inference-gateway/standalone/helm/dynamo-gaie/values.yaml
+++ b/deploy/inference-gateway/standalone/helm/dynamo-gaie/values.yaml
@@ -72,7 +72,6 @@ epp:
  # Dynamo-specific settings (only used when useDynamo: true)
  configFile: "/etc/epp/epp-config-dynamo.yaml"
  dynamo:
-    kvBlockSize: "16"
    # Use ETCD for discovery instead of Kubernetes (default: false)
    # Set to true via --set epp.dynamo.useEtcd=true to enable ETCD discovery
    useEtcd: false

--- a/deploy/operator/internal/dynamo/component_epp.go
+++ b/deploy/operator/internal/dynamo/component_epp.go
@@ -87,10 +87,6 @@ func (e *EPPDefaults) GetBaseContainer(context ComponentContext) (corev1.Contain
 	// EPP-specific environment variables
 	container.Env = append(container.Env, []corev1.EnvVar{
-		{
-			Name:  "DYN_KV_BLOCK_SIZE",
-			Value: "16",
-		},
 		{
 			Name:  "USE_STREAMING",
 			Value: "true",

--- a/lib/bindings/c/src/lib.rs
+++ b/lib/bindings/c/src/lib.rs
--- a/lib/bindings/python/Cargo.lock
+++ b/lib/bindings/python/Cargo.lock
@@ -1635,6 +1635,7 @@ dependencies = [
 "ahash",
 "aho-corasick",
 "akin",
+ "aligned-vec",
 "anyhow",
 "async-nats",
 "async-stream",
@@ -1651,6 +1652,7 @@ dependencies = [
 "bytes",
 "candle-core",
 "chrono",
+ "cudarc",
 "dashmap 5.5.3",
 "derive-getters",
 "derive_builder",
@@ -1685,6 +1687,8 @@ dependencies = [
 "ndarray",
 "ndarray-interp",
 "ndarray-npy",
+ "nix 0.26.4",
+ "nixl-sys",
 "object_store",
 "offset-allocator",
 "oneshot",
@@ -3962,6 +3966,15 @@ version = "0.3.3"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "38d1115007560874e373613744c6fba374c17688327a71c1476d1a5954cc857b"
+[[package]]
+name = "memoffset"
+version = "0.7.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "5de893c32cde5f383baa4c04c5d6dbdd735cfd4a794b0debdb2bb1b421da5ff4"
+dependencies = [
+ "autocfg",
+]
 [[package]]
 name = "memoffset"
 version = "0.9.1"
@@ -4255,6 +4268,19 @@ dependencies = [
 "thiserror 1.0.69",
 ]
+[[package]]
+name = "nix"
+version = "0.26.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "598beaf3cc6fdd9a5dfb1630c2800c7acd31df7aaf0f565796fba2b53ca1af1b"
+dependencies = [
+ "bitflags 1.3.2",
+ "cfg-if 1.0.4",
+ "libc",
+ "memoffset 0.7.1",
+ "pin-utils",
+]
 [[package]]
 name = "nix"
 version = "0.29.0"
@@ -5501,7 +5527,7 @@ dependencies = [
 "cfg-if 1.0.4",
 "indoc",
 "libc",
- "memoffset",
+ "memoffset 0.9.1",
 "once_cell",
 "portable-atomic",
 "pyo3-build-config",

--- a/lib/bindings/python/rust/lib.rs
+++ b/lib/bindings/python/rust/lib.rs
@@ -46,6 +46,9 @@ pub enum RouterMode {
    RoundRobin,
    Random,
    KV,
+    /// Direct routing - reads worker ID from each request's routing hints.
+    /// Used when an external orchestrator (e.g., EPP) handles worker selection.
+    Direct,
 }
 impl From<RouterMode> for RsRouterMode {
@@ -54,6 +57,7 @@ impl From<RouterMode> for RsRouterMode {
            RouterMode::RoundRobin => Self::RoundRobin,
            RouterMode::Random => Self::Random,
            RouterMode::KV => Self::KV,
+            RouterMode::Direct => Self::Direct,
        }
    }
 }

--- a/lib/bindings/python/src/dynamo/_core.pyi
+++ b/lib/bindings/python/src/dynamo/_core.pyi
@@ -950,6 +950,7 @@ class RouterMode:
    RoundRobin: "RouterMode"
    Random: "RouterMode"
    KV: "RouterMode"
+    Direct: "RouterMode"
    ...
 class RouterConfig:
@@ -968,7 +969,7 @@ class RouterConfig:
        Create a RouterConfig.
        Args:
-            mode: The router mode (RoundRobin, Random, or KV)
+            mode: The router mode (RoundRobin, Random, KV, or Direct)
            config: Optional KV router configuration (used when mode is KV)
            active_decode_blocks_threshold: Threshold percentage (0.0-1.0) for decode blocks busy detection
            active_prefill_tokens_threshold: Literal token count threshold for prefill busy detection

--- a/lib/llm/src/entrypoint/input/common.rs
+++ b/lib/llm/src/entrypoint/input/common.rs
@@ -9,7 +9,7 @@ use crate::{
    engines::StreamingEngineAdapter,
    entrypoint::{EngineConfig, RouterConfig},
    http::service::metrics::Metrics,
-    kv_router::{KvPushRouter, KvRouter, PrefillRouter},
+    kv_router::{DirectRoutingRouter, KvPushRouter, KvRouter, PrefillRouter},
    migration::Migration,
    model_card::ModelDeploymentCard,
    preprocessor::{OpenAIPreprocessor, prompt::PromptFormatter},
@@ -274,10 +274,10 @@ where
        .await?;
    let service_backend = match router_mode {
-        RouterMode::Random | RouterMode::RoundRobin | RouterMode::Direct(_) => {
-            // Non-KV routing: use PushRouter directly.
-            // Note: Per-worker metrics (active_prefill_tokens, active_decode_blocks) are only
-            // available in KV routing mode where the router has actual bookkeeping.
+        RouterMode::Direct => {
+            ServiceBackend::from_engine(Arc::new(DirectRoutingRouter::new(router)))
+        }
+        RouterMode::Random | RouterMode::RoundRobin => {
            ServiceBackend::from_engine(Arc::new(router))
        }
        RouterMode::KV => {

--- a/lib/llm/src/kv_router.rs
+++ b/lib/llm/src/kv_router.rs
@@ -41,7 +41,7 @@ pub mod worker_query;
 pub use config::{KvRouterConfig, RouterConfigOverride};
 pub use prefill_router::PrefillRouter;
-pub use push_router::KvPushRouter;
+pub use push_router::{DirectRoutingRouter, KvPushRouter};
 use crate::{
    discovery::RuntimeConfigWatch,

--- a/lib/llm/src/kv_router/prefill_router.rs
+++ b/lib/llm/src/kv_router/prefill_router.rs
@@ -81,14 +81,6 @@ impl InnerPrefillRouter {
            InnerPrefillRouter::KvRouter(_) => None,
        }
    }
-    /// Peek next worker without incrementing state (for non-KV modes only)
-    fn peek_next_worker(&self) -> Option<u64> {
-        match self {
-            InnerPrefillRouter::SimpleRouter(router) => router.peek_next_worker(),
-            InnerPrefillRouter::KvRouter(_) => None,
-        }
-    }
 }
 /// PrefillRouter is a forward-only operator that sits between Migration and the decode router.
@@ -273,7 +265,7 @@ impl PrefillRouter {
        preselected_worker: Option<u64>,
    ) -> Option<(u64, u32, BootstrapInfo)> {
        let endpoint_id = self.endpoint_id.get()?;
-        let prefill_router = self.prefill_router.get()?;
+        let _prefill_router = self.prefill_router.get()?;
        // Worker selection
        let (worker_id, dp_rank) = if let Some(id) = preselected_worker {
@@ -284,12 +276,8 @@ impl PrefillRouter {
                "Using pre-selected prefill worker for bootstrap"
            );
            (id, dp_rank)
-        } else if self.router_mode.is_kv_routing() {
-            // KV mode: use find_best_match
-            let kv_router = match prefill_router {
-                InnerPrefillRouter::KvRouter(r) => r,
-                _ => return None,
-            };
+        } else {
+            // Use shared worker selection logic (update_states=false for peek behavior)
            // Extract LORA name and priority jump from routing hints
            let lora_name = req.routing.as_ref().and_then(|r| r.lora_name.clone());
            let priority_jump = req
@@ -297,24 +285,14 @@ impl PrefillRouter {
                .as_ref()
                .and_then(|r| r.priority_jump)
                .unwrap_or(0.0);
-            match async {
-                kv_router
-                    .chooser
-                    .find_best_match(None, &req.token_ids, None, false, lora_name, priority_jump)
-                    .await
-            }
-            .instrument(tracing::info_span!("kv_find_best_match"))
-            .await
+            match self
+                .query_prefill_worker(&req.token_ids, false, lora_name, priority_jump)
+                .instrument(tracing::info_span!("query_prefill_worker"))
+                .await
            {
-                Ok((worker, _overlap)) => (worker.worker_id, worker.dp_rank),
+                Ok((worker_id, dp_rank)) => (worker_id, dp_rank),
                Err(_) => return None,
            }
-        } else {
-            // Non-KV mode: use PushRouter's stateful selection
-            // We use peek_next_worker instead of select_next_worker to avoid double-incrementing the counter
-            // if we fall back to the original path.
-            let worker_id = prefill_router.peek_next_worker()?;
-            (worker_id, 0)
        };
        // Get bootstrap info from ModelManager (works for ANY mode)
@@ -489,6 +467,55 @@ impl PrefillRouter {
        // No phase permit needed - we wait for completion before changing phase
        Self::execute_prefill(self.prefill_router.get().cloned(), request, None, None).await
    }
+    /// Query the best prefill worker without executing a request.
+    /// Returns (worker_id, dp_rank).
+    ///
+    /// This is the shared worker selection logic used by both `build_bootstrap_info`
+    /// and `query_route`.
+    pub async fn query_prefill_worker(
+        &self,
+        token_ids: &[u32],
+        update_states: bool,
+        lora_name: Option<String>,
+        priority_jump: f64,
+    ) -> Result<(u64, u32)> {
+        let prefill_router = self
+            .prefill_router
+            .get()
+            .ok_or_else(|| anyhow::anyhow!(PrefillError::NotActivated))?;
+        match prefill_router {
+            InnerPrefillRouter::KvRouter(r) => {
+                let (worker, _overlap) = r
+                    .chooser
+                    .find_best_match(
+                        None,
+                        token_ids,
+                        None,
+                        update_states,
+                        lora_name,
+                        priority_jump,
+                    )
+                    .await?;
+                Ok((worker.worker_id, worker.dp_rank))
+            }
+            InnerPrefillRouter::SimpleRouter(r) => {
+                let worker_id = if update_states {
+                    r.select_next_worker()
+                } else {
+                    r.peek_next_worker()
+                }
+                .ok_or_else(|| anyhow::anyhow!("No workers available for prefill"))?;
+                Ok((worker_id, 0))
+            }
+        }
+    }
+    /// Check if disaggregated mode is currently active (prefill router activated)
+    pub fn is_activated(&self) -> bool {
+        self.prefill_router.get().is_some()
+    }
 }
 impl Drop for PrefillRouter {
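
The `update_states` flag in `query_prefill_worker` separates peeking at the next worker from actually selecting it. The sketch below illustrates that distinction for a round-robin selector in Python. It assumes a simple cursor-based round-robin; `RoundRobinSelector` and `query_worker` are hypothetical names, and the real `SimpleRouter` may behave differently (for example in random mode).

```python
from typing import List, Optional


class RoundRobinSelector:
    """Hypothetical round-robin selector with separate peek and select operations."""

    def __init__(self, worker_ids: List[int]) -> None:
        self._workers = list(worker_ids)
        self._cursor = 0

    def peek_next_worker(self) -> Optional[int]:
        """Return the next worker without advancing the cursor."""
        if not self._workers:
            return None
        return self._workers[self._cursor % len(self._workers)]

    def select_next_worker(self) -> Optional[int]:
        """Return the next worker and advance the cursor."""
        worker = self.peek_next_worker()
        if worker is not None:
            self._cursor += 1
        return worker


def query_worker(selector: RoundRobinSelector, update_states: bool) -> int:
    """Pick a worker; only mutate selector state when update_states is True."""
    if update_states:
        worker = selector.select_next_worker()
    else:
        worker = selector.peek_next_worker()
    if worker is None:
        raise LookupError("No workers available for prefill")
    return worker


if __name__ == "__main__":
    selector = RoundRobinSelector([7, 8])
    print(query_worker(selector, update_states=False))  # peek, cursor unchanged
    print(query_worker(selector, update_states=True))   # select, cursor advances
    print(query_worker(selector, update_states=False))  # peek at the following worker
```

Peeking matters for `build_bootstrap_info`, which passes `update_states=false` so that a later fallback to the normal path does not advance the counter twice.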

--- a/lib/llm/src/kv_router/push_router.rs
+++ b/lib/llm/src/kv_router/push_router.rs
@@ -48,7 +48,6 @@ struct WorkerSelection {
 struct RequestGuard {
    chooser: Arc<KvRouter>,
    context_id: String,
-    handle_local_updates: bool,
    tracker: Option<Arc<RequestTracker>>,
    request_metrics: Arc<RouterRequestMetrics>,
    cumulative_osl: usize,
@@ -59,9 +58,7 @@ struct RequestGuard {
 impl RequestGuard {
    async fn finish(&mut self) {
        self.record_metrics();
-        if self.handle_local_updates
-            && let Err(e) = self.chooser.free(&self.context_id).await
-        {
+        if let Err(e) = self.chooser.free(&self.context_id).await {
            tracing::warn!("Failed to free request {}: {e}", self.context_id);
        }
        self.freed = true;
@@ -86,7 +83,7 @@ impl RequestGuard {
 impl Drop for RequestGuard {
    fn drop(&mut self) {
        self.record_metrics();
-        if !self.freed && self.handle_local_updates {
+        if !self.freed {
            let chooser = self.chooser.clone();
            let context_id = self.context_id.clone();
            let Ok(handle) = tokio::runtime::Handle::try_current() else {
@@ -112,15 +109,13 @@ impl KvPushRouter {
    /// Select a worker for the request, either using a preselected worker or finding the best match.
    ///
-    /// When `is_query_only` is false and `handle_local_updates` is true, this also registers
-    /// the request with the scheduler via `add_request`.
+    /// When `is_query_only` is false, this also registers the request with the scheduler via `add_request`.
    async fn select_worker(
        &self,
        context_id: &str,
        request: &PreprocessedRequest,
        phase: RequestPhase,
        is_query_only: bool,
-        handle_local_updates: bool,
    ) -> Result<WorkerSelection, Error> {
        let routing = request.routing.as_ref();
        let lora_name = routing.and_then(|r| r.lora_name.clone());
@@ -172,7 +167,7 @@ impl KvPushRouter {
            .get_overlap_blocks(&request.token_ids, worker)
            .await?;
-        if !is_query_only && handle_local_updates {
+        if !is_query_only {
            self.chooser
                .add_request(
                    context_id.to_string(),
@@ -234,15 +229,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
        // Simple query-only detection: presence of query_instance_id annotation means query-only mode
        let is_query_only = request.get_annotation_value("query_instance_id").is_some();
-        // Determine if this router should handle local state updates (add_request, free, etc.)
-        // Default is true (router handles bookkeeping). Set to false for GAIE Stage 2 where
-        // an external orchestrator (e.g., EPP sidecar) handles bookkeeping via C FFI.
-        let handle_local_updates = request
-            .routing
-            .as_ref()
-            .and_then(|r| r.enable_local_updates)
-            .unwrap_or(true);
        // Get phase from tracker (defaults to Aggregated if no tracker or phase not set)
        let phase = request
            .tracker
@@ -252,13 +238,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
        let block_size = self.chooser.block_size() as usize;
        let selection = self
-            .select_worker(
-                &context_id,
-                &request,
-                phase,
-                is_query_only,
-                handle_local_updates,
-            )
+            .select_worker(&context_id, &request, phase, is_query_only)
            .await?;
        let WorkerSelection {
            instance_id,
@@ -335,8 +315,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
            .routing
            .as_ref()
            .and_then(|r| r.expected_output_tokens);
-        let track_output_blocks =
-            self.chooser.kv_router_config().router_track_output_blocks && handle_local_updates;
+        let track_output_blocks = self.chooser.kv_router_config().router_track_output_blocks;
        let tracker = request.tracker.clone();
        let (mut backend_input, context) = request.into_parts();
@@ -360,7 +339,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
            let mut guard = RequestGuard {
                chooser: chooser.clone(),
                context_id: context_id.clone(),
-                handle_local_updates,
                tracker: tracker.clone(),
                request_metrics: request_metrics.clone(),
                cumulative_osl: 0,
@@ -385,7 +363,9 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
                            break;
                        };
-                        if handle_local_updates && !prefill_marked {
+                        if !prefill_marked {
+                            // Only mark prefill completed when we receive actual tokens,
+                            // not empty bootstrap info (token_ids: []) from disaggregated prefill
                            let has_tokens = item.data.as_ref()
                                .map(|d| !d.token_ids.is_empty())
                                .unwrap_or(false);
@@ -451,3 +431,48 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
        Ok(ResponseStream::new(wrapped_stream, stream_context))
    }
 }
+/// A direct routing wrapper for `RouterMode::Direct`.
+///
+/// This wraps a `PushRouter` and reads worker IDs from each request's routing hints,
+/// then routes directly to the specified worker. Used when an external router
+/// (e.g., EPP) handles worker selection.
+pub struct DirectRoutingRouter {
+    inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>,
+}
+impl DirectRoutingRouter {
+    pub fn new(inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>) -> Self {
+        DirectRoutingRouter { inner }
+    }
+    /// Extract worker ID from request routing hints.
+    /// Returns an error if no worker ID is found (required in direct routing mode).
+    fn get_worker_id(request: &PreprocessedRequest) -> Result<u64, Error> {
+        let routing = request.routing.as_ref();
+        let worker_id = routing.and_then(|r| r.decode_worker_id.or(r.backend_instance_id));
+        worker_id.ok_or_else(|| {
+            anyhow::anyhow!(
+                "Worker ID required (--direct-route) but none found in request. \
+                 Expected decode_worker_id or backend_instance_id to be set by external router (e.g., EPP)."
+            )
+        })
+    }
+}
+#[async_trait]
+impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutput>>, Error>
+    for DirectRoutingRouter
+{
+    async fn generate(
+        &self,
+        request: SingleIn<PreprocessedRequest>,
+    ) -> Result<ManyOut<Annotated<LLMEngineOutput>>, Error> {
+        let worker_id = Self::get_worker_id(&request)?;
+        tracing::debug!(worker_id = worker_id, "Direct routing to specified worker");
+        self.inner.direct(request, worker_id).await
+    }
+}
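
The worker lookup in `DirectRoutingRouter::get_worker_id` prefers `decode_worker_id`, falls back to `backend_instance_id`, and fails if neither is set. A small Python sketch of the same rule is shown here for illustration. The `RoutingHints` dataclass is a simplified, hypothetical stand-in containing only the two fields involved, and the error text is paraphrased.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingHints:
    """Simplified stand-in for the routing hints; only the two relevant fields."""

    decode_worker_id: Optional[int] = None
    backend_instance_id: Optional[int] = None


def get_worker_id(routing: Optional[RoutingHints]) -> int:
    """Return decode_worker_id if set, else backend_instance_id, else raise."""
    if routing is not None:
        for candidate in (routing.decode_worker_id, routing.backend_instance_id):
            # Compare against None explicitly: worker ID 0 is a valid value.
            if candidate is not None:
                return candidate
    raise ValueError(
        "Worker ID required in direct routing mode but none found in request"
    )


if __name__ == "__main__":
    print(get_worker_id(RoutingHints(decode_worker_id=3, backend_instance_id=9)))
    print(get_worker_id(RoutingHints(backend_instance_id=9)))
```

Failing loudly rather than silently picking a worker keeps the external router (for example the EPP) as the single source of routing decisions in this mode.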
--- a/lib/llm/src/preprocessor.rs
+++ b/lib/llm/src/preprocessor.rs
@@ -280,7 +280,6 @@ impl OpenAIPreprocessor {
                prefill_worker_id: nvext.prefill_worker_id,
                decode_worker_id: nvext.decode_worker_id,
                dp_rank: None, // dp_rank is set later in the pipeline
-                enable_local_updates: nvext.enable_local_updates,
                expected_output_tokens: hints.and_then(|h| h.osl),
                priority_jump: hints.and_then(|h| h.latency_sensitivity),
                lora_name,

--- a/lib/llm/src/protocols/common/preprocessor.rs
+++ b/lib/llm/src/protocols/common/preprocessor.rs
@@ -34,14 +34,6 @@ pub struct RoutingHints {
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub dp_rank: Option<u32>,
-    /// Controls whether the router should manage local bookkeeping (add_request,
-    /// mark_prefill_completed, free) for this request.
-    ///
-    /// - `None` or `Some(true)`: Router handles bookkeeping locally (default behavior)
-    /// - `Some(false)`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
-    #[serde(default, skip_serializing_if = "Option::is_none")]
-    pub enable_local_updates: Option<bool>,
    /// Expected number of output tokens for this request.
    /// Used as a hint for routing decisions to estimate resource requirements.
    #[serde(default, skip_serializing_if = "Option::is_none")]

--- a/lib/llm/src/protocols/openai/nvext.rs
+++ b/lib/llm/src/protocols/openai/nvext.rs
@@ -11,16 +11,12 @@ pub use crate::protocols::common::timing::TimingInfo;
 pub const HEADER_WORKER_INSTANCE_ID: &str = "x-worker-instance-id";
 pub const HEADER_PREFILL_INSTANCE_ID: &str = "x-prefill-instance-id";
-/// Header to disable local bookkeeping updates (for GAIE Stage 2)
-/// When set to "false", the router skips add_request, mark_prefill_completed, and free calls.
-pub const HEADER_ENABLE_LOCAL_UPDATES: &str = "x-enable-local-updates";
 /// Apply routing overrides from HTTP headers to nvext.
 ///
 /// Header mappings:
 /// - `x-worker-instance-id` -> `backend_instance_id` and `decode_worker_id`
 /// - `x-prefill-instance-id` -> `prefill_worker_id`
-/// - `x-enable-local-updates` -> `enable_local_updates` (set to false to disable router bookkeeping)
 ///
 /// Headers take priority over existing nvext values when present.
 /// If no headers are present, returns the original nvext unchanged.
@@ -35,17 +31,7 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
        .and_then(|v| v.to_str().ok())
        .and_then(|s| s.parse::<u64>().ok());
-    // Parse enable_local_updates header: "true" or "false"
+    if worker_id.is_none() && prefill_id.is_none() {
-    let enable_local_updates = headers
-        .get(HEADER_ENABLE_LOCAL_UPDATES)
-        .and_then(|v| v.to_str().ok())
-        .and_then(|s| match s.to_lowercase().as_str() {
-            "true" | "1" => Some(true),
-            "false" | "0" => Some(false),
-            _ => None,
-        });
-    if worker_id.is_none() && prefill_id.is_none() && enable_local_updates.is_none() {
        return nvext;
    }
@@ -57,9 +43,6 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
    if let Some(id) = prefill_id {
        ext.prefill_worker_id = Some(id);
    }
-    if let Some(enabled) = enable_local_updates {
-        ext.enable_local_updates = Some(enabled);
-    }
    Some(ext)
 }
@@ -169,17 +152,6 @@ pub struct NvExt {
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub decode_worker_id: Option<u64>,
-    /// Controls whether the router should manage local bookkeeping (add_request,
-    /// mark_prefill_completed, free) for this request.
-    ///
-    /// - `None` or `true`: Router handles bookkeeping locally (default behavior)
-    /// - `false`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
-    ///
-    /// Set to `false` for GAIE Stage 2 when the EPP/sidecar manages request lifecycle.
-    #[builder(default, setter(strip_option))]
-    #[serde(default, skip_serializing_if = "Option::is_none")]
-    pub enable_local_updates: Option<bool>,
    /// Agent-provided hints for request handling.
    #[builder(default, setter(strip_option))]
    #[serde(default, skip_serializing_if = "Option::is_none")]
@@ -187,7 +159,7 @@ pub struct NvExt {
 }
 /// Hints from the agent/caller about request characteristics.
-#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default)]
+#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default, PartialEq)]
 pub struct AgentHints {
    /// Latency sensitivity in seconds for queue ordering.
    /// Higher values cause the request to be scheduled sooner when the router queue is enabled.
@@ -249,7 +221,7 @@ mod tests {
        assert_eq!(nv_ext.extra_fields, None);
        assert_eq!(nv_ext.prefill_worker_id, None);
        assert_eq!(nv_ext.decode_worker_id, None);
-        assert_eq!(nv_ext.enable_local_updates, None);
+        assert_eq!(nv_ext.agent_hints, None);
    }
    // Test valid builder configurations
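
Reviewer note: with `x-enable-local-updates` removed, the header override path reduces to two optional IDs. The sketch below illustrates that reduced behavior under stated assumptions. `RoutingExt` is a hypothetical stand-in for the routing fields of `NvExt`, a `HashMap<String, String>` stands in for the real `HeaderMap`, and the `unwrap_or_default` step for a missing `nvext` is assumed because the diff does not show those lines.

```rust
use std::collections::HashMap;

pub const HEADER_WORKER_INSTANCE_ID: &str = "x-worker-instance-id";
pub const HEADER_PREFILL_INSTANCE_ID: &str = "x-prefill-instance-id";

/// Hypothetical stand-in for the routing-related fields of `NvExt`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct RoutingExt {
    pub backend_instance_id: Option<u64>,
    pub decode_worker_id: Option<u64>,
    pub prefill_worker_id: Option<u64>,
}

/// Parse a header value as u64; missing or malformed values yield None.
fn parse_id(headers: &HashMap<String, String>, name: &str) -> Option<u64> {
    headers.get(name).and_then(|s| s.parse::<u64>().ok())
}

/// Apply the two remaining header overrides. Headers win over existing values.
pub fn apply_overrides(
    ext: Option<RoutingExt>,
    headers: &HashMap<String, String>,
) -> Option<RoutingExt> {
    let worker_id = parse_id(headers, HEADER_WORKER_INSTANCE_ID);
    let prefill_id = parse_id(headers, HEADER_PREFILL_INSTANCE_ID);

    // No routing headers: return the input unchanged.
    if worker_id.is_none() && prefill_id.is_none() {
        return ext;
    }

    // Assumption: a missing ext is replaced by a default one before overriding.
    let mut ext = ext.unwrap_or_default();
    if let Some(id) = worker_id {
        // Per the doc comment, the worker header maps to both fields.
        ext.backend_instance_id = Some(id);
        ext.decode_worker_id = Some(id);
    }
    if let Some(id) = prefill_id {
        ext.prefill_worker_id = Some(id);
    }
    Some(ext)
}
```

In this sketch a request carrying only `x-enable-local-updates` is treated the same as a request with no routing headers, which is the observable effect of dropping the third condition from the early-return check.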

--- a/lib/runtime/src/pipeline/network/egress/push_router.rs
+++ b/lib/runtime/src/pipeline/network/egress/push_router.rs
@@ -83,15 +83,18 @@ pub enum RouterMode {
    #[default]
    RoundRobin,
    Random,
-    Direct(u64),
-    // Marker value, KV routing itself is in dynamo-llm
    KV,
+    Direct,
 }
 impl RouterMode {
    pub fn is_kv_routing(&self) -> bool {
        *self == RouterMode::KV
    }
+    pub fn is_direct_routing(&self) -> bool {
+        *self == RouterMode::Direct
+    }
 }
 async fn addressed_router(endpoint: &Endpoint) -> anyhow::Result<Arc<AddressedPushRouter>> {
@@ -415,14 +418,17 @@ where
    U: Data + for<'de> Deserialize<'de> + MaybeError,
 {
    async fn generate(&self, request: SingleIn<T>) -> Result<ManyOut<U>, Error> {
-        //InstanceSource::Static => self.r#static(request).await,
        match self.router_mode {
            RouterMode::Random => self.random(request).await,
            RouterMode::RoundRobin => self.round_robin(request).await,
-            RouterMode::Direct(instance_id) => self.direct(request, instance_id).await,
            RouterMode::KV => {
                anyhow::bail!("KV routing should not call generate on PushRouter");
            }
+            RouterMode::Direct => {
+                anyhow::bail!(
+                    "Direct routing should not call generate on PushRouter directly; use DirectRoutingRouter wrapper"
+                );
+            }
        }
    }
 }
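
Reviewer note: `Direct` changes from a variant carrying an instance ID to a unit variant, so the target must now come from the request rather than from the router configuration. The sketch below illustrates that shape under stated assumptions. `parse_router_mode` and `direct_target` are hypothetical helpers, not functions from this PR, and the derive list on the enum is assumed; the actual per-request lookup lives in the `DirectRoutingRouter` wrapper named in the error message.

```rust
/// Mirrors the variants shown in the diff; the derive list is an assumption.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub enum RouterMode {
    #[default]
    RoundRobin,
    Random,
    KV,
    Direct,
}

impl RouterMode {
    pub fn is_kv_routing(&self) -> bool {
        *self == RouterMode::KV
    }

    pub fn is_direct_routing(&self) -> bool {
        *self == RouterMode::Direct
    }
}

/// Hypothetical helper mirroring the `--router-mode` mapping in the frontend:
/// unrecognized values fall through to round-robin.
pub fn parse_router_mode(s: &str) -> RouterMode {
    match s {
        "kv" => RouterMode::KV,
        "random" => RouterMode::Random,
        "direct" => RouterMode::Direct,
        _ => RouterMode::RoundRobin,
    }
}

/// Hypothetical helper: in direct mode the instance ID is supplied per request
/// (for example via `x-worker-instance-id`), so a missing ID is an error.
pub fn direct_target(mode: RouterMode, requested: Option<u64>) -> Result<u64, String> {
    if !mode.is_direct_routing() {
        return Err(format!("{mode:?} does not use a caller-supplied instance id"));
    }
    requested.ok_or_else(|| "direct routing requires a worker instance id".to_string())
}
```

Keeping `Direct` as a unit variant lets the enum derive `Copy` and compare with `==`, which is what `is_direct_routing` relies on.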
--- a/recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml
+++ b/recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml
@@ -16,12 +16,7 @@ spec:
      replicas: 1
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/frontend:0.8.0
+          image: nvcr.io/nvidia/ai-dynamo/frontend:my-tag
-          env:
-            - name: DYN_KV_BLOCK_SIZE
-              value: "128"
-            - name: DYN_MODEL
-              value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
      eppConfig:
        # This configuration uses Dynamo's KV-aware scorer for intelligent routing
        config:
@@ -49,8 +44,15 @@ spec:
          mountPoint: /opt/models
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.frontend
+            - --router-mode
+            - direct
      envs:
        - name: HF_HOME
          value: /opt/models
@@ -79,7 +81,7 @@ spec:
          command:
          - /bin/sh
          - -c
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          workingDir: /workspace/examples/backends/vllm
      replicas: 1
      resources: