Commit 52271760 authored by atchernych, committed by GitHub

feat: Decomposed pipeline for EPP integration [DEP-730] (#5446)


Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
parent 16a28058
......@@ -167,7 +167,7 @@ class FrontendArgGroup(ArgGroup):
env_var="DYN_ROUTER_MODE",
default="round-robin",
help="How to route the request.",
choices=["round-robin", "random", "kv"],
choices=["round-robin", "random", "kv", "direct"],
)
add_argument(
g,
......
......@@ -7,7 +7,7 @@
# - OpenAI HTTP server.
# - Auto-discovery: Watches etcd for engine/worker registration (via `register_llm`).
# - Pre-processor: Prompt templating and tokenization.
# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv).
# - Router, defaulting to round-robin. Use --router-mode to switch (round-robin, random, kv, direct).
#
# Pass `--interactive` or `-i` for text chat instead of HTTP server.
#
......@@ -197,6 +197,9 @@ async def async_main():
elif config.router_mode == "random":
router_mode = RouterMode.Random
kv_router_config = None
elif config.router_mode == "direct":
router_mode = RouterMode.Direct
kv_router_config = None
else:
router_mode = RouterMode.RoundRobin
kv_router_config = None
......
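The branch added above maps the new `direct` CLI choice onto `RouterMode.Direct` and keeps round-robin as the fallback. A minimal sketch of the same mapping follows; the `RouterMode` enum here is a stand-in written for illustration (only the member names come from the diff), and `resolve_router_mode` is a hypothetical helper, not a function in the frontend.

```python
from enum import Enum


class RouterMode(Enum):
    # Stand-in for the real binding; only the member names appear in the diff.
    RoundRobin = "round-robin"
    Random = "random"
    KV = "kv"
    Direct = "direct"


def resolve_router_mode(name: str) -> RouterMode:
    """Map a --router-mode string to a mode, falling back to round-robin."""
    try:
        return RouterMode(name)
    except ValueError:
        # Mirrors the final `else` branch in async_main.
        return RouterMode.RoundRobin
```

In the real frontend the `kv` branch also builds a KV router config, which this sketch leaves out.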
......@@ -63,26 +63,6 @@ kubectl get gateway inference-gateway
### 3. Setup secrets ###
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Make sure to enable kv-routing by adding the env var in the FrontEnd.
```bash
mainContainer:
image: ...
env:
- name: DYN_ROUTER_MODE
value: "kv"
```
Sample commands to deploy model:
```bash
cd <dynamo-source-root>
cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget docker registry secret if needed.
```bash
......@@ -93,7 +73,7 @@ kubectl create secret docker-registry docker-imagepullsecret \
--namespace=$NAMESPACE
```
Do not forget to include the HuggingFace token if required.
Do not forget to include the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
......@@ -139,12 +119,33 @@ make info # Check image tag
We recommend deploying the Inference Gateway's Endpoint Picker as a component managed by the Dynamo operator. Alternatively,
you could deploy it as a standalone pod.
#### 5.a. Deploy as a DGD component
#### 5.a. Deploy as a DGD component (recommended)
We provide an example for llama-3-70b vLLM below.
```bash
# Deploy PVC, first Update `storageClassName` in recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster before deploying
kubectl apply -f recipes/llama-3-70b/model-cache/model-cache.yaml
kubectl apply -f recipes/llama-3-70b/model-cache/model-download.yaml
# Deploy your model
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml -n ${NAMESPACE}
# Deploy the GAIE http-route CR.
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/http-route.yaml -n ${NAMESPACE}
```
- When using GAIE, the FrontEnd does not choose the workers; routing is determined in the EPP.
- You must enable direct routing in the FrontEnd CLI as shown below.
```bash
kubectl apply -f operator-managed/examples/agg.yaml -n ${NAMESPACE}
kubectl apply -f operator-managed/examples/http-route.yaml -n ${NAMESPACE}
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
- The pre-selected workers (decode, plus prefill in the case of disaggregated serving) are passed in the request headers.
- The flag ensures that routing respects this selection.
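To make the header contract concrete, here is a minimal sketch of a request that already carries a worker selection. The two header names are the constants defined in this commit (`x-worker-instance-id`, `x-prefill-instance-id`); the `/v1/chat/completions` path and the helper names `routing_headers` and `build_request` are assumptions made for illustration. In a GAIE deployment the EPP is expected to set these headers, so a client would normally not do this itself.

```python
import json
import urllib.request

# Header names taken from the constants in this commit.
HEADER_WORKER_INSTANCE_ID = "x-worker-instance-id"
HEADER_PREFILL_INSTANCE_ID = "x-prefill-instance-id"


def routing_headers(decode_worker_id, prefill_worker_id=None):
    """Headers that name the pre-selected decode (and optional prefill) worker."""
    headers = {HEADER_WORKER_INSTANCE_ID: str(decode_worker_id)}
    if prefill_worker_id is not None:
        headers[HEADER_PREFILL_INSTANCE_ID] = str(prefill_worker_id)
    return headers


def build_request(gateway_url, body, decode_worker_id, prefill_worker_id=None):
    """Build (but do not send) a POST request with the routing headers attached."""
    headers = {"Content-Type": "application/json"}
    headers.update(routing_headers(decode_worker_id, prefill_worker_id))
    return urllib.request.Request(
        f"{gateway_url}/v1/chat/completions",  # assumed OpenAI-style path
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Building the request without sending it keeps the sketch usable for inspecting what a frontend in `direct` mode would receive.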
**Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example,
......@@ -166,6 +167,18 @@ If you installed it into a different namespace, you need to adjust the HttpRoute
##### 5.b.1 Deploy Your Model ###
We provide an example for Qwen vLLM below.
Before deploying, you must pass `--router-mode direct` to the FrontEnd CLI in your Dynamo Graph.
```bash
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
```
Follow the steps in [model deployment](../../examples/backends/vllm/deploy/README.md) to deploy `Qwen/Qwen3-0.6B` model in aggregate mode using [agg.yaml](../../examples/backends/vllm/deploy/agg.yaml) in `my-model` kubernetes namespace.
Sample commands to deploy model:
......@@ -176,10 +189,6 @@ cd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n my-model
```
Take a note of or change the DYNAMO_IMAGE in the model deployment file.
Do not forget docker registry secret if needed.
##### 5.b.2 Install Dynamo GIE helm chart ###
```bash
......@@ -214,14 +223,14 @@ You can configure the plugin by setting environment variables in the EPP compone
Common Vars for Routing Configuration:
- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.
- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available the request will be served in the aggregated manner.
- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
- If using kv-routing:
- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
- See the [Router Guide](../../docs/pages/components/router/router-guide.md) for details.
- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)
- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing (default: true)
- `DYN_ROUTER_TEMPERATURE` — Temperature for worker sampling via softmax (default: 0.0)
- `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false)
- `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true)
- `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` — Track output blocks during generation (default: false)
- See the [KV cache routing design](../../docs/pages/design-docs/router-design.md) for details.
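The effect of `DYN_ROUTER_TEMPERATURE` can be illustrated with a small softmax sketch. This is not the router's implementation: it assumes each worker has a single score where higher is better (the real router may combine costs differently), and `selection_probabilities` is a hypothetical helper name.

```python
import math


def selection_probabilities(scores, temperature):
    """Turn per-worker scores into selection probabilities via softmax.

    Assumption: higher score is better. A temperature of 0 (or below)
    degenerates to always picking the top-scoring worker.
    """
    if temperature <= 0.0:
        best = max(scores, key=scores.get)
        return {worker: 1.0 if worker == best else 0.0 for worker in scores}
    top = max(scores.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = {w: math.exp((s - top) / temperature) for w, s in scores.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}
```

With temperature 0.0 (the documented default) the choice is deterministic; raising it spreads probability toward lower-scoring workers.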
Stand-Alone installation only:
- Overwrite the `DYN_NAMESPACE` env var if needed to match your model's dynamo namespace.
......@@ -272,7 +281,7 @@ b. use port-forward to expose the gateway to the host
```bash
# in first terminal
kubectl port-forward svc/inference-gateway 8000:80 -n my-model
kubectl port-forward svc/inference-gateway 8000:80 -n kgateway-system
# in second terminal where you want to send inference requests
GATEWAY_URL=http://localhost:8000
......@@ -359,6 +368,14 @@ Sample inference output:
}
```
***If you have more than one HttpRoute running on the cluster***
Add the hostname to your HttpRoute.yaml and send the matching `Host` header with every request, for example `curl -H "Host: llama3-70b-agg.example.com" ...`.
```bash
spec:
hostnames:
- llama3-70b-agg.example.com
```
### 8. Deleting the installation ###
If you need to uninstall, run:
......
......@@ -19,7 +19,6 @@
{{- $useDynamo := default false .Values.epp.useDynamo -}}
{{- $resolvedDynNs := (include "dynamo-gaie.dynamoNamespace" .) | trim -}}
{{- $ns := ternary (required "set dynamoGraphDeploymentName when epp.useDynamo=true" $resolvedDynNs) "" $useDynamo -}}
{{- $kv := default "16" .Values.epp.dynamo.kvBlockSize -}}
{{- $useEtcd := default false .Values.epp.dynamo.useEtcd -}}
{{- $eppImage := required "extension.image is required - set via --set-string extension.image=$EPP_IMAGE or in values file" .Values.extension.image }}
......@@ -113,8 +112,6 @@ spec:
value: "nats://{{ $platformName }}-nats.{{ $platformNs }}:4222"
- name: DYN_NAMESPACE
value: "{{ $ns }}"
- name: DYN_KV_BLOCK_SIZE
value: "{{ $kv }}"
- name: USE_STREAMING
value: "true"
# HuggingFace token for downloading model config files
......
......@@ -72,7 +72,6 @@ epp:
# Dynamo-specific settings (only used when useDynamo: true)
configFile: "/etc/epp/epp-config-dynamo.yaml"
dynamo:
kvBlockSize: "16"
# Use ETCD for discovery instead of Kubernetes (default: false)
# Set to true via --set epp.dynamo.useEtcd=true to enable ETCD discovery
useEtcd: false
......
......@@ -87,10 +87,6 @@ func (e *EPPDefaults) GetBaseContainer(context ComponentContext) (corev1.Contain
// EPP-specific environment variables
container.Env = append(container.Env, []corev1.EnvVar{
{
Name: "DYN_KV_BLOCK_SIZE",
Value: "16",
},
{
Name: "USE_STREAMING",
Value: "true",
......
......@@ -1635,6 +1635,7 @@ dependencies = [
"ahash",
"aho-corasick",
"akin",
"aligned-vec",
"anyhow",
"async-nats",
"async-stream",
......@@ -1651,6 +1652,7 @@ dependencies = [
"bytes",
"candle-core",
"chrono",
"cudarc",
"dashmap 5.5.3",
"derive-getters",
"derive_builder",
......@@ -1685,6 +1687,8 @@ dependencies = [
"ndarray",
"ndarray-interp",
"ndarray-npy",
"nix 0.26.4",
"nixl-sys",
"object_store",
"offset-allocator",
"oneshot",
......@@ -3962,6 +3966,15 @@ version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "38d1115007560874e373613744c6fba374c17688327a71c1476d1a5954cc857b"
[[package]]
name = "memoffset"
version = "0.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5de893c32cde5f383baa4c04c5d6dbdd735cfd4a794b0debdb2bb1b421da5ff4"
dependencies = [
"autocfg",
]
[[package]]
name = "memoffset"
version = "0.9.1"
......@@ -4255,6 +4268,19 @@ dependencies = [
"thiserror 1.0.69",
]
[[package]]
name = "nix"
version = "0.26.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "598beaf3cc6fdd9a5dfb1630c2800c7acd31df7aaf0f565796fba2b53ca1af1b"
dependencies = [
"bitflags 1.3.2",
"cfg-if 1.0.4",
"libc",
"memoffset 0.7.1",
"pin-utils",
]
[[package]]
name = "nix"
version = "0.29.0"
......@@ -5501,7 +5527,7 @@ dependencies = [
"cfg-if 1.0.4",
"indoc",
"libc",
"memoffset",
"memoffset 0.9.1",
"once_cell",
"portable-atomic",
"pyo3-build-config",
......
......@@ -46,6 +46,9 @@ pub enum RouterMode {
RoundRobin,
Random,
KV,
/// Direct routing - reads worker ID from each request's routing hints.
/// Used when an external orchestrator (e.g., EPP) handles worker selection.
Direct,
}
impl From<RouterMode> for RsRouterMode {
......@@ -54,6 +57,7 @@ impl From<RouterMode> for RsRouterMode {
RouterMode::RoundRobin => Self::RoundRobin,
RouterMode::Random => Self::Random,
RouterMode::KV => Self::KV,
RouterMode::Direct => Self::Direct,
}
}
}
......
......@@ -950,6 +950,7 @@ class RouterMode:
RoundRobin: "RouterMode"
Random: "RouterMode"
KV: "RouterMode"
Direct: "RouterMode"
...
class RouterConfig:
......@@ -968,7 +969,7 @@ class RouterConfig:
Create a RouterConfig.
Args:
mode: The router mode (RoundRobin, Random, or KV)
mode: The router mode (RoundRobin, Random, KV, or Direct)
config: Optional KV router configuration (used when mode is KV)
active_decode_blocks_threshold: Threshold percentage (0.0-1.0) for decode blocks busy detection
active_prefill_tokens_threshold: Literal token count threshold for prefill busy detection
......
......@@ -9,7 +9,7 @@ use crate::{
engines::StreamingEngineAdapter,
entrypoint::{EngineConfig, RouterConfig},
http::service::metrics::Metrics,
kv_router::{KvPushRouter, KvRouter, PrefillRouter},
kv_router::{DirectRoutingRouter, KvPushRouter, KvRouter, PrefillRouter},
migration::Migration,
model_card::ModelDeploymentCard,
preprocessor::{OpenAIPreprocessor, prompt::PromptFormatter},
......@@ -274,10 +274,10 @@ where
.await?;
let service_backend = match router_mode {
RouterMode::Random | RouterMode::RoundRobin | RouterMode::Direct(_) => {
// Non-KV routing: use PushRouter directly.
// Note: Per-worker metrics (active_prefill_tokens, active_decode_blocks) are only
// available in KV routing mode where the router has actual bookkeeping.
RouterMode::Direct => {
ServiceBackend::from_engine(Arc::new(DirectRoutingRouter::new(router)))
}
RouterMode::Random | RouterMode::RoundRobin => {
ServiceBackend::from_engine(Arc::new(router))
}
RouterMode::KV => {
......
......@@ -41,7 +41,7 @@ pub mod worker_query;
pub use config::{KvRouterConfig, RouterConfigOverride};
pub use prefill_router::PrefillRouter;
pub use push_router::KvPushRouter;
pub use push_router::{DirectRoutingRouter, KvPushRouter};
use crate::{
discovery::RuntimeConfigWatch,
......
......@@ -81,14 +81,6 @@ impl InnerPrefillRouter {
InnerPrefillRouter::KvRouter(_) => None,
}
}
/// Peek next worker without incrementing state (for non-KV modes only)
fn peek_next_worker(&self) -> Option<u64> {
match self {
InnerPrefillRouter::SimpleRouter(router) => router.peek_next_worker(),
InnerPrefillRouter::KvRouter(_) => None,
}
}
}
/// PrefillRouter is a forward-only operator that sits between Migration and the decode router.
......@@ -273,7 +265,7 @@ impl PrefillRouter {
preselected_worker: Option<u64>,
) -> Option<(u64, u32, BootstrapInfo)> {
let endpoint_id = self.endpoint_id.get()?;
let prefill_router = self.prefill_router.get()?;
let _prefill_router = self.prefill_router.get()?;
// Worker selection
let (worker_id, dp_rank) = if let Some(id) = preselected_worker {
......@@ -284,12 +276,8 @@ impl PrefillRouter {
"Using pre-selected prefill worker for bootstrap"
);
(id, dp_rank)
} else if self.router_mode.is_kv_routing() {
// KV mode: use find_best_match
let kv_router = match prefill_router {
InnerPrefillRouter::KvRouter(r) => r,
_ => return None,
};
} else {
// Use shared worker selection logic (update_states=false for peek behavior)
// Extract LORA name and priority jump from routing hints
let lora_name = req.routing.as_ref().and_then(|r| r.lora_name.clone());
let priority_jump = req
......@@ -297,24 +285,14 @@ impl PrefillRouter {
.as_ref()
.and_then(|r| r.priority_jump)
.unwrap_or(0.0);
match async {
kv_router
.chooser
.find_best_match(None, &req.token_ids, None, false, lora_name, priority_jump)
.await
}
.instrument(tracing::info_span!("kv_find_best_match"))
match self
.query_prefill_worker(&req.token_ids, false, lora_name, priority_jump)
.instrument(tracing::info_span!("query_prefill_worker"))
.await
{
Ok((worker, _overlap)) => (worker.worker_id, worker.dp_rank),
Ok((worker_id, dp_rank)) => (worker_id, dp_rank),
Err(_) => return None,
}
} else {
// Non-KV mode: use PushRouter's stateful selection
// We use peek_next_worker instead of select_next_worker to avoid double-incrementing the counter
// if we fall back to the original path.
let worker_id = prefill_router.peek_next_worker()?;
(worker_id, 0)
};
// Get bootstrap info from ModelManager (works for ANY mode)
......@@ -489,6 +467,55 @@ impl PrefillRouter {
// No phase permit needed - we wait for completion before changing phase
Self::execute_prefill(self.prefill_router.get().cloned(), request, None, None).await
}
/// Query the best prefill worker without executing a request.
/// Returns (worker_id, dp_rank).
///
/// This is the shared worker selection logic used by both `build_bootstrap_info`
/// and `query_route`.
pub async fn query_prefill_worker(
&self,
token_ids: &[u32],
update_states: bool,
lora_name: Option<String>,
priority_jump: f64,
) -> Result<(u64, u32)> {
let prefill_router = self
.prefill_router
.get()
.ok_or_else(|| anyhow::anyhow!(PrefillError::NotActivated))?;
match prefill_router {
InnerPrefillRouter::KvRouter(r) => {
let (worker, _overlap) = r
.chooser
.find_best_match(
None,
token_ids,
None,
update_states,
lora_name,
priority_jump,
)
.await?;
Ok((worker.worker_id, worker.dp_rank))
}
InnerPrefillRouter::SimpleRouter(r) => {
let worker_id = if update_states {
r.select_next_worker()
} else {
r.peek_next_worker()
}
.ok_or_else(|| anyhow::anyhow!("No workers available for prefill"))?;
Ok((worker_id, 0))
}
}
}
/// Check if disaggregated mode is currently active (prefill router activated)
pub fn is_activated(&self) -> bool {
self.prefill_router.get().is_some()
}
}
impl Drop for PrefillRouter {
......
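The `update_states` flag in `query_prefill_worker` separates "peek" from "select" for the non-KV case. The sketch below shows why that distinction matters for a round-robin picker. The counter-based `RoundRobinPicker` is an assumption made for illustration (the internals of `PushRouter` are not shown in this diff); only the method names `peek_next_worker` and `select_next_worker` and the fixed `dp_rank` of 0 come from the source.

```python
class RoundRobinPicker:
    """Hypothetical stand-in for a simple router with round-robin state."""

    def __init__(self, worker_ids):
        self._workers = list(worker_ids)
        self._counter = 0

    def peek_next_worker(self):
        # Report the next worker without advancing the counter.
        if not self._workers:
            return None
        return self._workers[self._counter % len(self._workers)]

    def select_next_worker(self):
        # Report the next worker and advance the counter.
        worker = self.peek_next_worker()
        if worker is not None:
            self._counter += 1
        return worker


def query_prefill_worker(picker, update_states):
    """Return (worker_id, dp_rank); dp_rank is 0 for the simple router."""
    if update_states:
        worker_id = picker.select_next_worker()
    else:
        worker_id = picker.peek_next_worker()
    if worker_id is None:
        raise RuntimeError("No workers available for prefill")
    return worker_id, 0
```

Calling with `update_states=False` repeatedly returns the same worker, which is what `build_bootstrap_info` relies on to avoid advancing the counter twice for one request.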
......@@ -48,7 +48,6 @@ struct WorkerSelection {
struct RequestGuard {
chooser: Arc<KvRouter>,
context_id: String,
handle_local_updates: bool,
tracker: Option<Arc<RequestTracker>>,
request_metrics: Arc<RouterRequestMetrics>,
cumulative_osl: usize,
......@@ -59,9 +58,7 @@ struct RequestGuard {
impl RequestGuard {
async fn finish(&mut self) {
self.record_metrics();
if self.handle_local_updates
&& let Err(e) = self.chooser.free(&self.context_id).await
{
if let Err(e) = self.chooser.free(&self.context_id).await {
tracing::warn!("Failed to free request {}: {e}", self.context_id);
}
self.freed = true;
......@@ -86,7 +83,7 @@ impl RequestGuard {
impl Drop for RequestGuard {
fn drop(&mut self) {
self.record_metrics();
if !self.freed && self.handle_local_updates {
if !self.freed {
let chooser = self.chooser.clone();
let context_id = self.context_id.clone();
let Ok(handle) = tokio::runtime::Handle::try_current() else {
......@@ -112,15 +109,13 @@ impl KvPushRouter {
/// Select a worker for the request, either using a preselected worker or finding the best match.
///
/// When `is_query_only` is false and `handle_local_updates` is true, this also registers
/// the request with the scheduler via `add_request`.
/// When `is_query_only` is false, this also registers the request with the scheduler via `add_request`.
async fn select_worker(
&self,
context_id: &str,
request: &PreprocessedRequest,
phase: RequestPhase,
is_query_only: bool,
handle_local_updates: bool,
) -> Result<WorkerSelection, Error> {
let routing = request.routing.as_ref();
let lora_name = routing.and_then(|r| r.lora_name.clone());
......@@ -172,7 +167,7 @@ impl KvPushRouter {
.get_overlap_blocks(&request.token_ids, worker)
.await?;
if !is_query_only && handle_local_updates {
if !is_query_only {
self.chooser
.add_request(
context_id.to_string(),
......@@ -234,15 +229,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
// Simple query-only detection: presence of query_instance_id annotation means query-only mode
let is_query_only = request.get_annotation_value("query_instance_id").is_some();
// Determine if this router should handle local state updates (add_request, free, etc.)
// Default is true (router handles bookkeeping). Set to false for GAIE Stage 2 where
// an external orchestrator (e.g., EPP sidecar) handles bookkeeping via C FFI.
let handle_local_updates = request
.routing
.as_ref()
.and_then(|r| r.enable_local_updates)
.unwrap_or(true);
// Get phase from tracker (defaults to Aggregated if no tracker or phase not set)
let phase = request
.tracker
......@@ -252,13 +238,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
let block_size = self.chooser.block_size() as usize;
let selection = self
.select_worker(
&context_id,
&request,
phase,
is_query_only,
handle_local_updates,
)
.select_worker(&context_id, &request, phase, is_query_only)
.await?;
let WorkerSelection {
instance_id,
......@@ -335,8 +315,7 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
.routing
.as_ref()
.and_then(|r| r.expected_output_tokens);
let track_output_blocks =
self.chooser.kv_router_config().router_track_output_blocks && handle_local_updates;
let track_output_blocks = self.chooser.kv_router_config().router_track_output_blocks;
let tracker = request.tracker.clone();
let (mut backend_input, context) = request.into_parts();
......@@ -360,7 +339,6 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
let mut guard = RequestGuard {
chooser: chooser.clone(),
context_id: context_id.clone(),
handle_local_updates,
tracker: tracker.clone(),
request_metrics: request_metrics.clone(),
cumulative_osl: 0,
......@@ -385,7 +363,9 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
break;
};
if handle_local_updates && !prefill_marked {
if !prefill_marked {
// Only mark prefill completed when we receive actual tokens,
// not empty bootstrap info (token_ids: []) from disaggregated prefill
let has_tokens = item.data.as_ref()
.map(|d| !d.token_ids.is_empty())
.unwrap_or(false);
......@@ -451,3 +431,48 @@ impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutpu
Ok(ResponseStream::new(wrapped_stream, stream_context))
}
}
/// A direct routing wrapper for `RouterMode::Direct`.
///
/// This wraps a `PushRouter` and reads worker IDs from each request's routing hints,
/// then routes directly to the specified worker. Used when an external router
/// (e.g., EPP) handles worker selection.
pub struct DirectRoutingRouter {
inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>,
}
impl DirectRoutingRouter {
pub fn new(inner: PushRouter<PreprocessedRequest, Annotated<LLMEngineOutput>>) -> Self {
DirectRoutingRouter { inner }
}
/// Extract worker ID from request routing hints.
/// Returns an error if no worker ID is found (required in direct routing mode).
fn get_worker_id(request: &PreprocessedRequest) -> Result<u64, Error> {
let routing = request.routing.as_ref();
let worker_id = routing.and_then(|r| r.decode_worker_id.or(r.backend_instance_id));
worker_id.ok_or_else(|| {
anyhow::anyhow!(
"Worker ID required (--direct-route) but none found in request. \
Expected decode_worker_id or backend_instance_id to be set by external router (e.g., EPP)."
)
})
}
}
#[async_trait]
impl AsyncEngine<SingleIn<PreprocessedRequest>, ManyOut<Annotated<LLMEngineOutput>>, Error>
for DirectRoutingRouter
{
async fn generate(
&self,
request: SingleIn<PreprocessedRequest>,
) -> Result<ManyOut<Annotated<LLMEngineOutput>>, Error> {
let worker_id = Self::get_worker_id(&request)?;
tracing::debug!(worker_id = worker_id, "Direct routing to specified worker");
self.inner.direct(request, worker_id).await
}
}
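The lookup order in `get_worker_id` (prefer `decode_worker_id`, then `backend_instance_id`, otherwise fail) can be sketched in a few lines. The `RoutingHints` dataclass below is a reduced stand-in with only the two fields this lookup reads, and the error type is an assumption; the Rust code returns an `anyhow` error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingHints:
    # Reduced stand-in: only the fields read by the direct router.
    decode_worker_id: Optional[int] = None
    backend_instance_id: Optional[int] = None


def get_worker_id(routing: Optional[RoutingHints]) -> int:
    """Resolve the target worker, preferring decode_worker_id."""
    if routing is not None:
        # Compare against None explicitly: 0 is a valid worker ID.
        if routing.decode_worker_id is not None:
            return routing.decode_worker_id
        if routing.backend_instance_id is not None:
            return routing.backend_instance_id
    raise ValueError("worker ID required in direct routing mode but none found")
```

Failing loudly when no ID is present matches the intent stated in the doc comment: in direct mode the frontend never picks a worker on its own.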
......@@ -280,7 +280,6 @@ impl OpenAIPreprocessor {
prefill_worker_id: nvext.prefill_worker_id,
decode_worker_id: nvext.decode_worker_id,
dp_rank: None, // dp_rank is set later in the pipeline
enable_local_updates: nvext.enable_local_updates,
expected_output_tokens: hints.and_then(|h| h.osl),
priority_jump: hints.and_then(|h| h.latency_sensitivity),
lora_name,
......
......@@ -34,14 +34,6 @@ pub struct RoutingHints {
#[serde(default, skip_serializing_if = "Option::is_none")]
pub dp_rank: Option<u32>,
/// Controls whether the router should manage local bookkeeping (add_request,
/// mark_prefill_completed, free) for this request.
///
/// - `None` or `Some(true)`: Router handles bookkeeping locally (default behavior)
/// - `Some(false)`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
#[serde(default, skip_serializing_if = "Option::is_none")]
pub enable_local_updates: Option<bool>,
/// Expected number of output tokens for this request.
/// Used as a hint for routing decisions to estimate resource requirements.
#[serde(default, skip_serializing_if = "Option::is_none")]
......
......@@ -11,16 +11,12 @@ pub use crate::protocols::common::timing::TimingInfo;
pub const HEADER_WORKER_INSTANCE_ID: &str = "x-worker-instance-id";
pub const HEADER_PREFILL_INSTANCE_ID: &str = "x-prefill-instance-id";
/// Header to disable local bookkeeping updates (for GAIE Stage 2)
/// When set to "false", the router skips add_request, mark_prefill_completed, and free calls.
pub const HEADER_ENABLE_LOCAL_UPDATES: &str = "x-enable-local-updates";
/// Apply routing overrides from HTTP headers to nvext.
///
/// Header mappings:
/// - `x-worker-instance-id` -> `backend_instance_id` and `decode_worker_id`
/// - `x-prefill-instance-id` -> `prefill_worker_id`
/// - `x-enable-local-updates` -> `enable_local_updates` (set to false to disable router bookkeeping)
///
/// Headers take priority over existing nvext values when present.
/// If no headers are present, returns the original nvext unchanged.
......@@ -35,17 +31,7 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u64>().ok());
// Parse enable_local_updates header: "true" or "false"
let enable_local_updates = headers
.get(HEADER_ENABLE_LOCAL_UPDATES)
.and_then(|v| v.to_str().ok())
.and_then(|s| match s.to_lowercase().as_str() {
"true" | "1" => Some(true),
"false" | "0" => Some(false),
_ => None,
});
if worker_id.is_none() && prefill_id.is_none() && enable_local_updates.is_none() {
if worker_id.is_none() && prefill_id.is_none() {
return nvext;
}
......@@ -57,9 +43,6 @@ pub fn apply_header_routing_overrides(nvext: Option<NvExt>, headers: &HeaderMap)
if let Some(id) = prefill_id {
ext.prefill_worker_id = Some(id);
}
if let Some(enabled) = enable_local_updates {
ext.enable_local_updates = Some(enabled);
}
Some(ext)
}
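After this change only two headers feed the override logic. A minimal sketch of the documented mapping follows, using plain dicts in place of `NvExt` and `HeaderMap`. It assumes header keys are already lowercase (a real `HeaderMap` is case-insensitive), and `parse_u64` only approximates Rust's `parse::<u64>()`.

```python
from typing import Optional

HEADER_WORKER_INSTANCE_ID = "x-worker-instance-id"
HEADER_PREFILL_INSTANCE_ID = "x-prefill-instance-id"
U64_MAX = 2**64 - 1


def parse_u64(value: Optional[str]) -> Optional[int]:
    """Approximate u64 parsing: ASCII digits only, within range."""
    if value is None or not (value.isascii() and value.isdigit()):
        return None
    number = int(value)
    return number if number <= U64_MAX else None


def apply_header_routing_overrides(nvext: Optional[dict], headers: dict) -> Optional[dict]:
    """Copy worker selections from headers into nvext; headers take priority."""
    worker_id = parse_u64(headers.get(HEADER_WORKER_INSTANCE_ID))
    prefill_id = parse_u64(headers.get(HEADER_PREFILL_INSTANCE_ID))
    if worker_id is None and prefill_id is None:
        return nvext  # nothing to override, return the input unchanged
    ext = dict(nvext or {})
    if worker_id is not None:
        # One header populates both fields, as in the documented mapping.
        ext["backend_instance_id"] = worker_id
        ext["decode_worker_id"] = worker_id
    if prefill_id is not None:
        ext["prefill_worker_id"] = prefill_id
    return ext
```

An unparsable header value is treated the same as a missing header, so a malformed ID never overwrites an existing `nvext` value.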
......@@ -169,17 +152,6 @@ pub struct NvExt {
#[serde(default, skip_serializing_if = "Option::is_none")]
pub decode_worker_id: Option<u64>,
/// Controls whether the router should manage local bookkeeping (add_request,
/// mark_prefill_completed, free) for this request.
///
/// - `None` or `true`: Router handles bookkeeping locally (default behavior)
/// - `false`: External caller (e.g., GAIE sidecar) handles bookkeeping via C FFI
///
/// Set to `false` for GAIE Stage 2 when the EPP/sidecar manages request lifecycle.
#[builder(default, setter(strip_option))]
#[serde(default, skip_serializing_if = "Option::is_none")]
pub enable_local_updates: Option<bool>,
/// Agent-provided hints for request handling.
#[builder(default, setter(strip_option))]
#[serde(default, skip_serializing_if = "Option::is_none")]
......@@ -187,7 +159,7 @@ pub struct NvExt {
}
/// Hints from the agent/caller about request characteristics.
#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default)]
#[derive(ToSchema, Serialize, Deserialize, Builder, Debug, Clone, Default, PartialEq)]
pub struct AgentHints {
/// Latency sensitivity in seconds for queue ordering.
/// Higher values cause the request to be scheduled sooner when the router queue is enabled.
......@@ -249,7 +221,7 @@ mod tests {
assert_eq!(nv_ext.extra_fields, None);
assert_eq!(nv_ext.prefill_worker_id, None);
assert_eq!(nv_ext.decode_worker_id, None);
assert_eq!(nv_ext.enable_local_updates, None);
assert_eq!(nv_ext.agent_hints, None);
}
// Test valid builder configurations
......
......@@ -83,15 +83,18 @@ pub enum RouterMode {
#[default]
RoundRobin,
Random,
Direct(u64),
// Marker value, KV routing itself is in dynamo-llm
KV,
Direct,
}
impl RouterMode {
pub fn is_kv_routing(&self) -> bool {
*self == RouterMode::KV
}
pub fn is_direct_routing(&self) -> bool {
*self == RouterMode::Direct
}
}
async fn addressed_router(endpoint: &Endpoint) -> anyhow::Result<Arc<AddressedPushRouter>> {
......@@ -415,14 +418,17 @@ where
U: Data + for<'de> Deserialize<'de> + MaybeError,
{
async fn generate(&self, request: SingleIn<T>) -> Result<ManyOut<U>, Error> {
//InstanceSource::Static => self.r#static(request).await,
match self.router_mode {
RouterMode::Random => self.random(request).await,
RouterMode::RoundRobin => self.round_robin(request).await,
RouterMode::Direct(instance_id) => self.direct(request, instance_id).await,
RouterMode::KV => {
anyhow::bail!("KV routing should not call generate on PushRouter");
}
RouterMode::Direct => {
anyhow::bail!(
"Direct routing should not call generate on PushRouter directly; use DirectRoutingRouter wrapper"
);
}
}
}
}
......@@ -16,12 +16,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/frontend:0.8.0
env:
- name: DYN_KV_BLOCK_SIZE
value: "128"
- name: DYN_MODEL
value: "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic"
image: nvcr.io/nvidia/ai-dynamo/frontend:my-tag
eppConfig:
# This configuration uses Dynamo's KV-aware scorer for intelligent routing
config:
......@@ -49,8 +44,15 @@ spec:
mountPoint: /opt/models
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
command:
- python3
args:
- -m
- dynamo.frontend
- --router-mode
- direct
envs:
- name: HF_HOME
value: /opt/models
......@@ -79,7 +81,7 @@ spec:
command:
- /bin/sh
- -c
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
replicas: 1
resources:
......