@@ -46,7 +46,7 @@ For all CLI arguments, environment variables, K8s deployment examples, and tunin
**Limitations:**
- Static endpoints not supported—KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
For basic model registration without KV routing, use `--router-mode round-robin`, `--router-mode random`, or `--router-mode least-loaded` with both static and dynamic endpoints.
For basic model registration without KV routing, use `--router-mode round-robin`, `--router-mode random`, `--router-mode least-loaded`, or `--router-mode device-aware-weighted` with both static and dynamic endpoints.
| **Frontend + Least-Loaded** | `python -m dynamo.frontend --router-mode least-loaded` | Fewest active connections | None | Aggregated or disaggregated fallback | Simple load-aware balancing without KV awareness |
| **Frontend + Device-Aware Weighted** | `python -m dynamo.frontend --router-mode device-aware-weighted` | Device-aware budget + least-loaded within selected device group | None | Aggregated or disaggregated fallback | Heterogeneous fleet balancing (CPU/non-CPU); degenerates to least-loaded when only one device class is present |
@@ -32,8 +33,71 @@ The Dynamo router can be deployed in several configurations. The table below sho
| **Random** | `random` | Selects a random worker for each request |
| **KV** | `kv` | Evaluates KV cache overlap and decode load per worker; picks lowest cost |
| **Least-Loaded** | `least-loaded` | Routes to the worker with fewest active connections; in disaggregated prefill paths it skips bootstrap optimization and falls back to synchronous prefill |
| **Device-Aware Weighted** | `device-aware-weighted` | Partitions workers into CPU and non-CPU groups, applies capability-normalized ratio budgeting using `DYN_ENCODER_CUDA_TO_CPU_RATIO` to decide which group receives the request, then selects the least-loaded worker within that group; when only one device class exists, behavior degenerates to least-loaded |
| **Direct** | `direct` | Reads the target `worker_id` from the request's routing hints; no selection logic |
### Device-Aware Weighted Routing
`device-aware-weighted` is designed for **heterogeneous fleets** where workers of different compute capability — for example CPU embedding encoders alongside GPU embedding encoders — share the same endpoint.
Raw in-flight counts are not directly comparable across device types: a GPU can sustain far more concurrent requests than a CPU before reaching the same relative load. Comparing raw counts would permanently starve CPU workers because GPUs would always look less loaded.
#### Budget policy
Workers are split into two groups: *CPU* and *non-CPU* (CUDA/GPU/etc.). A **capability-normalized load** is compared between the two groups:
where `throughput_weight` is **1** for CPU workers and **`DYN_ENCODER_CUDA_TO_CPU_RATIO`** for non-CPU workers. The request is routed to the group with the **lower normalized load**, i.e. the group that has more headroom relative to its compute capacity.
This comparison rearranges to an equivalent integer budget check (avoiding floating-point division):
The right-hand side is the *allowed CPU in-flight budget*: the number of concurrent CPU requests that would produce the same relative load as the current non-CPU workload. When actual CPU in-flight is below that budget, CPUs are underloaded relative to GPUs and get the next request; once the budget is exhausted the router switches back to the non-CPU group.
#### Example
Ratio = 8 (default; each GPU handles 8× the requests of a CPU at equal normalized load).
Current per-instance load:
- GPU instance g1: 8 in-flight
- GPU instance g2: 8 in-flight
- CPU instance c1: 0 in-flight
- CPU instance c2: 0 in-flight
So `non_cpu_count = 2`, `cpu_count = 2`, `total_non_cpu_inflight = 16`, and `total_cpu_inflight = 0`:
```
allowed_cpu_inflight = 16 × 2 / (8 × 2) = 2
total_cpu_inflight = 0 < 2 → route to CPU group
```
Interpretation: with this non-CPU load, the CPU group budget is 2 total in-flight requests, i.e. roughly 1 request per CPU instance before normalized load matches the GPU group.
After c1=1 and c2=1 (total CPU in-flight = 2):
```
total_cpu_inflight = 2 < 2 is false → route to non-CPU group
```
This keeps both groups at roughly equal *normalized* load.
#### Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_ENCODER_CUDA_TO_CPU_RATIO` | `8` | Throughput ratio of a non-CPU (GPU) worker relative to one CPU worker. Set to the approximate ratio of requests-per-second each device type can sustain under your workload. |
> [!TIP]
> When only one device class is present (all GPU or all CPU) the policy returns all instances and behavior degenerates to standard least-loaded selection within that single group.
### KV Event Transport Modes (within `--router-mode kv`)
When using KV routing, the router needs to know what each worker has cached. There are four ways to get this information:
...
...
@@ -268,6 +332,7 @@ We can then use the default routing methods exposed by the client class to send
-**Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
-**Least-loaded routing**: Routes to the worker with fewest active connections via `--router-mode least-loaded`
In disaggregated prefill paths it skips bootstrap optimization and uses the synchronous prefill path, matching power-of-two routing.
-**Device-aware weighted routing**: Routes using CPU/non-CPU ratio budgeting plus least-loaded selection within the selected device group via `--router-mode device-aware-weighted`. Tune ratio with `DYN_ENCODER_CUDA_TO_CPU_RATIO` (default `8`). See [Device-Aware Weighted Routing](#device-aware-weighted-routing) below for a detailed explanation of the budget policy.
KV Cache routing uses direct routing with a special worker selection algorithm.