@@ -392,6 +392,13 @@ For this A/B comparison, we use the [**Mooncake FAST'25 Toolagent Trace**](https
These two requests share blocks 46–57 (12 blocks × 512 tokens = ~6,144 tokens of shared prefix) — a tool agent continuing the same session with accumulated context. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks, preserving the pattern of prefix reuse while protecting user privacy. The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits.
If you reproduce this benchmark with `python -m dynamo.replay`, keep the trace's block size separate from
the replay engine configuration:
- use `--trace-block-size 512` for the Mooncake/toolagent trace itself
- keep engine `block_size` in `--extra-engine-args` aligned with the runtime you want to mimic
  (for the published vLLM deployment, that is typically `64`)
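To make the two block sizes concrete, here is a minimal sketch of chained block hashing, where each block's ID depends on the block itself and on every block before it. The function names and the SHA-256 scheme are illustrative assumptions; the actual trace generator and Dynamo's router may hash differently. It shows that the same 6,144-token shared prefix spans 12 blocks at a block size of 512 and 96 blocks at a block size of 64.

```python
import hashlib


def chained_block_hashes(tokens, block_size):
    """Return one hash per full block; each hash also covers all earlier blocks.

    Hypothetical scheme for illustration only (not the trace's real hash).
    """
    hashes = []
    parent = b""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex()[:16])
    return hashes


def shared_prefix_blocks(hashes_a, hashes_b):
    """Count leading blocks whose chained hashes match."""
    count = 0
    for a, b in zip(hashes_a, hashes_b):
        if a != b:
            break
        count += 1
    return count


# Two requests that share a 6,144-token prefix and then diverge.
common = list(range(6144))
request_a = common + [1] * 1024
request_b = common + [2] * 1024

shared_512 = shared_prefix_blocks(
    chained_block_hashes(request_a, 512), chained_block_hashes(request_b, 512)
)
shared_64 = shared_prefix_blocks(
    chained_block_hashes(request_a, 64), chained_block_hashes(request_b, 64)
)
print(shared_512, shared_64)
```

Because the hash is chained, two requests stop matching at the first block that differs, so a hash match always implies a full prefix match up to that block.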
**Key Dataset Properties:**
- ✅ **Realistic timing:** Request arrival patterns from production tool-agent workloads
- ✅ **High prefix overlap:** 59% cache ratio ([Mooncake FAST'25 paper](https://github.com/kvcache-ai/Mooncake/blob/main/FAST25-release/Mooncake-FAST25.pdf)); iterative tool calls within sessions produce natural prefix reuse
| `--aic-backend` | `DYN_AIC_BACKEND` | — | Backend family to model in AIC, for example `vllm` or `sglang` |
| `--aic-system` | `DYN_AIC_SYSTEM` | — | AIC hardware/system identifier, for example `h200_sxm` |
| `--aic-model-path` | `DYN_AIC_MODEL_PATH` | — | Model path or model identifier used for AIC perf lookup |
| `--aic-backend-version` | `DYN_AIC_BACKEND_VERSION` | backend-specific | Pinned AIC database version. If omitted, Dynamo uses the backend default |
| `--aic-tp-size` | `DYN_AIC_TP_SIZE` | `1` | Tensor-parallel size to model in AIC |
When enabled, the frontend's embedded KV router predicts one expected prefill duration per admitted request, using the selected worker's overlap-derived cached prefix. The router then decays only the oldest active prefill request on each worker for prompt-side load accounting.
| `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens` | `--router-track-prefill-tokens` | Include prompt-side load in active worker load accounting |
| `--router-prefill-load-model <none\|aic>` | `none` | Prompt-side load model. `aic` decays only the oldest active prefill using an AIC-predicted duration |
- stores one prompt-load hint for the admitted request
- decays only the **oldest** active prefill request on each worker over time
This affects router-side prompt load accounting only. It does not change backend execution or decode-side accounting.
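The accounting described above can be sketched as follows. This is a simplified illustration, not the router's implementation: the `WorkerPromptLoad` class, the linear decay shape, and the rule that the next request starts decaying when the previous one completes are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ActivePrefill:
    tokens: int               # prompt tokens still charged to the worker
    predicted_seconds: float  # expected prefill duration (e.g. from AIC)


class WorkerPromptLoad:
    """Hypothetical per-worker prompt load with oldest-only decay."""

    def __init__(self):
        self._active = deque()
        self._oldest_since = 0.0  # when the current oldest request began decaying

    def admit(self, tokens, predicted_seconds, now):
        if predicted_seconds <= 0:
            raise ValueError("predicted_seconds must be positive")
        if not self._active:
            self._oldest_since = now
        self._active.append(ActivePrefill(tokens, predicted_seconds))

    def complete_oldest(self, now):
        self._active.popleft()
        self._oldest_since = now  # the next request becomes the oldest

    def load(self, now):
        """Prompt-side load: the oldest request decays, the rest count in full."""
        if not self._active:
            return 0.0
        oldest = self._active[0]
        elapsed = max(0.0, now - self._oldest_since)
        remaining = max(0.0, 1.0 - elapsed / oldest.predicted_seconds)
        queued = sum(p.tokens for p in list(self._active)[1:])
        return oldest.tokens * remaining + queued


worker = WorkerPromptLoad()
worker.admit(tokens=1000, predicted_seconds=2.0, now=0.0)
worker.admit(tokens=1000, predicted_seconds=4.0, now=0.0)
print(worker.load(now=1.0))  # oldest is half decayed, second counts in full
```

Computing the decay inside `load` keeps the update lazy: nothing runs on a timer, and the value is only recomputed when the router asks for it.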
Enable it on the frontend like this:
```bash
python -m dynamo.frontend \
--router-mode kv \
--router-prefill-load-model aic \
--aic-backend vllm \
--aic-system h200_sxm \
--aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8
```
The standalone router uses the same AIC flags:
```bash
python -m dynamo.router \
--endpoint dynamo.prefill.generate \
--router-prefill-load-model aic \
--aic-backend vllm \
--aic-system h200_sxm \
--aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8
```
Required when `--router-prefill-load-model=aic` is enabled:
- `--router-mode kv` on the frontend
- `--router-track-prefill-tokens`
- `--aic-backend`
- `--aic-system`
- `--aic-model-path`
Optional AIC knobs:
- `--aic-backend-version`: pinned AIC database version; if omitted, Dynamo uses a backend-specific default
- `--aic-tp-size`: tensor-parallel size for the modeled backend; defaults to `1`
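As a pre-flight check for the requirements listed above, the following sketch validates a settings dictionary before launch. The dictionary keys and the `missing_aic_requirements` helper are hypothetical names chosen for this example; they are not part of Dynamo's API.

```python
REQUIRED_AIC_KEYS = ("aic_backend", "aic_system", "aic_model_path")


def missing_aic_requirements(settings, is_frontend=True):
    """Return the flags that still need to be set for the `aic` load model.

    `settings` is a plain dict with hypothetical keys mirroring the CLI flags.
    """
    if settings.get("router_prefill_load_model", "none") != "aic":
        return []  # nothing to check when the AIC load model is off

    missing = []
    if is_frontend and settings.get("router_mode") != "kv":
        missing.append("--router-mode kv")
    # Prefill token tracking is enabled by default, so only flag an explicit opt-out.
    if not settings.get("router_track_prefill_tokens", True):
        missing.append("--router-track-prefill-tokens")
    for key in REQUIRED_AIC_KEYS:
        if not settings.get(key):
            missing.append("--" + key.replace("_", "-"))
    return missing


incomplete = {
    "router_mode": "kv",
    "router_prefill_load_model": "aic",
    "aic_backend": "vllm",
}
print(missing_aic_requirements(incomplete))
```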
### Kubernetes Deployment
To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
...
...
@@ -235,6 +283,10 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--router-track-prefill-tokens`: Enables prompt-side load accounting in the worker cost model. This should stay enabled if you want queue thresholds, `active_prefill_tokens`, and AIC prefill load decay to reflect prompt work.
- `--router-prefill-load-model`: Selects the router's prompt-side load model. `none` keeps the existing static prompt load accounting. `aic` predicts one expected prefill duration per admitted request and lazily decays only the oldest active prefill request on each worker.
- `--router-queue-threshold`: Queue threshold fraction for prefill token capacity (default: 4.0). The router holds incoming requests in a priority queue while all workers exceed this fraction of `max_num_batched_tokens`, releasing them when capacity frees up. This defers dispatch (not rejection) so that routing decisions use the most up-to-date load metrics at the moment the request is actually sent to a worker. It also enables **priority scheduling** via `priority` hints in `nvext.agent_hints`: higher values shift a request's effective arrival time earlier in the queue, giving it priority over lower-valued requests. Must be > 0. Set to None to disable queueing (requests are dispatched immediately).
- `--router-queue-policy`: Scheduling policy for the router queue (default: `fcfs`). Three policies are available:
...
...
@@ -292,6 +344,8 @@ Use `--router-track-output-blocks` **(experimental)** when your workload is outp
The `--router-queue-threshold` (default: 4.0) controls when incoming requests are held in a priority queue. The router holds requests while all workers exceed the given fraction of `max_num_batched_tokens`, releasing them as capacity frees up. This defers the routing decision so it is made with the freshest load metrics, rather than dispatching into an already-saturated system. It also enables priority scheduling via `nvext.agent_hints.priority`. Set to None to disable queueing entirely.
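The hold condition described above can be written as a small predicate. This is a sketch under assumptions: the function name and the idea of passing a per-worker list of active prefill tokens are illustrative, and the real router may measure load differently.

```python
def should_hold_request(active_prefill_tokens, max_num_batched_tokens, queue_threshold):
    """Return True when every worker is above the threshold fraction.

    active_prefill_tokens: per-worker prompt-side load (hypothetical input shape).
    queue_threshold: fraction of max_num_batched_tokens, or None to disable queueing.
    """
    if queue_threshold is None:
        return False  # queueing disabled: dispatch immediately
    if queue_threshold <= 0:
        raise ValueError("queue_threshold must be > 0")
    limit = queue_threshold * max_num_batched_tokens
    # Hold only if there is at least one worker and all of them are saturated.
    return bool(active_prefill_tokens) and all(
        tokens > limit for tokens in active_prefill_tokens
    )


# With max_num_batched_tokens=8192 and the default 4.0, the limit is 32,768 tokens.
print(should_hold_request([40000, 50000], 8192, 4.0))
print(should_hold_request([40000, 1000], 8192, 4.0))
```

A single worker below the limit is enough to release the request, which is why the check uses `all` rather than `any`.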
Use `--router-prefill-load-model aic` **(experimental)** when you want prompt-side load tracking to decay the oldest active prefill request using an AIC-predicted duration instead of keeping prompt load static until first token. This requires `--router-track-prefill-tokens` and the shared `--aic-*` config (`--aic-backend`, `--aic-system`, and `--aic-model-path`; `--aic-tp-size` defaults to `1`, and `--aic-backend-version` is optional). This path is still experimental because the decay model is based on expected prefill duration rather than observed worker-side progress.
Use `--router-queue-policy wspt` when your workload has a mix of short and long requests and you want to minimize **average** TTFT. WSPT (Smith's rule) schedules short or high-priority requests first, reducing mean latency across the batch. Use the default `fcfs` when you want to minimize **tail** TTFT — no request waits longer than necessary, since ordering is purely by (adjusted) arrival time.
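The difference between the two orderings can be illustrated with a short sketch of Smith's rule, which sorts jobs by weight divided by processing time, largest ratio first. The `QueuedRequest` fields, the use of prefill tokens as the processing-time estimate, and the use of a priority value as the weight are assumptions for illustration; the source does not state how the router derives these quantities.

```python
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    request_id: str
    prefill_tokens: int   # assumed proxy for processing time, must be > 0
    weight: float = 1.0   # assumed to come from a priority hint
    arrival: float = 0.0  # arrival time in seconds


def wspt_order(requests):
    """Smith's rule: highest weight-to-processing-time ratio first."""
    return sorted(requests, key=lambda r: (-r.weight / r.prefill_tokens, r.arrival))


def fcfs_order(requests):
    """First come, first served: ordered by arrival time only."""
    return sorted(requests, key=lambda r: r.arrival)


queue = [
    QueuedRequest("long", prefill_tokens=8000, arrival=0.0),
    QueuedRequest("short", prefill_tokens=500, arrival=1.0),
    QueuedRequest("urgent", prefill_tokens=2000, weight=2.0, arrival=2.0),
]
print([r.request_id for r in wspt_order(queue)])
print([r.request_id for r in fcfs_order(queue)])
```

Under WSPT the long request is served last even though it arrived first, which lowers the average wait but can lengthen the wait for that one request; FCFS makes the opposite trade.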
For trace-file replay, `--trace-block-size` controls how many tokens each `hash_id` represents in
the dataset, while engine `block_size` still controls the replay engine and router hashing. Public
Mooncake/toolagent traces use `--trace-block-size 512`; engine `block_size` can still stay at `64`
to match the live runtime configuration.
The standalone replay CLI prints an AIPerf-style summary table to stdout and writes the full replay
report JSON to disk.
...
...
@@ -264,12 +270,19 @@ The AIC model automatically uses `--model-path` and `--engine-type` to select th
Important notes:
- AIC is opt-in. If you do not pass `--aic-perf-model`, `python -m dynamo.mocker` does not use AIC.
- `python -m dynamo.replay` also does not use AIC unless you explicitly put AIC fields in the engine-args JSON.
- `python -m dynamo.replay` has two separate AIC surfaces:
  - engine timing AIC through `--extra-engine-args` / staged engine JSON
  - router-side prefill-load AIC through top-level `--aic-*` flags plus `router_prefill_load_model="aic"` in `--router-config`
- The Python AIC session bridge is now shared with the live KV router path via the internal `dynamo._internal.aic` module. Mocker CLI behavior is unchanged; this just removes duplicate AIC session code.
- `aiconfigurator` must be able to load the requested performance database for the selected `system/backend/version`. If the SDK is installed but the backing systems data is missing or unreadable, mocker now fails fast at startup with a clear error instead of failing later on first request.
- In development environments, this may require pointing Python at a source checkout of `aiconfigurator` with real Git LFS payloads materialized in its `systems/` directory.
When using `python -m dynamo.replay`, there are no dedicated AIC flags. For aggregated replay,
pass the equivalent fields via `--extra-engine-args`:
This mocker AIC path is separate from the router-side prefill-load estimator. Live router,
frontend, and replay all use `router_prefill_load_model="aic"` plus top-level `--aic-*` flags for
oldest-prefill prompt-load decay. Replay still uses engine-args AIC separately when you want the
mocked worker timing model itself to come from AIC.
For aggregated replay, engine timing AIC still comes from `--extra-engine-args`:
The `aic_backend` field enables the AIC perf model and should match `engine_type` (`"vllm"` or `"sglang"`). The `aic_model_path` field is the equivalent of `--model-path` in `dynamo.mocker`.
Replay router-side AIC prompt-load modeling is configured separately with top-level flags: