Commit f9050aae authored by Jonathan Tong, committed by GitHub

docs: migrate existing docs to fern (#5445)


Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
parent f238d23a
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "SLA-based Planner"
---
<Tip>
**New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Profiling + Planner Quick Start Guide](sla-planner-quickstart.md).
</Tip>
This document covers information regarding the SLA-based planner in `examples/common/utils/planner_core.py`.
The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
<Note>
Currently, the SLA-based planner supports only disaggregated setups.
</Note>
<Warning>
Bare-metal deployment with the local connector is deprecated. Please deploy the SLA planner in Kubernetes.
</Warning>
## Architecture Overview
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s by default (configurable in the PodMonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling once per adjustment interval
- **Workers**: Prefill and backend workers that handle inference
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/planner/defaults.py).
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
## Features
* **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets
* **Predictive load forecasting**: Uses ARIMA, Prophet, or constant predictors to forecast future load
* **Performance interpolation**: Uses results from pre-deployment profiling to make accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data
## Design
The SLA planner consists of several key components:
1. **Load Predictors**: Forecast future request patterns (number of requests, input/output sequence lengths)
2. **Performance Interpolators**: Estimate TTFT and ITL based on profiled performance data
3. **Correction Factors**: Adjust predictions based on observed vs. expected performance
4. **Scaling Logic**: Calculate optimal number of prefill/decode replicas to meet SLA targets
## SLA-Driven Pre-Deployment Profiling
**Prerequisite**: SLA-based planner requires pre-deployment profiling to be completed before deployment. The profiling process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters that the planner will use during operation.
See [Pre-Deployment Profiling](../benchmarks/sla-driven-profiling.md) for detailed instructions on running the profiling process.
## Load Prediction
The SLA planner uses a load predictor to predict the number of requests, ISL, and OSL in the next adjustment interval. Currently, three load prediction models are supported:
### Constant Predictor
- **Use case**: Stable load and long prediction intervals
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
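As a minimal sketch of the behavior described above, a constant predictor only needs to remember the most recent observation. The class and method names here are illustrative and are not taken from the planner's code:

```python
from typing import Optional


class ConstantPredictorSketch:
    """Illustrative only: predicts that the next value equals the last observed one."""

    def __init__(self) -> None:
        self._last: Optional[float] = None

    def add_data_point(self, value: float) -> None:
        # Only the most recent observation matters for a constant predictor.
        self._last = value

    def predict_next(self) -> Optional[float]:
        # With no history there is nothing to predict, so return None.
        return self._last


predictor = ConstantPredictorSketch()
for num_requests in (120, 135, 128):
    predictor.add_data_point(num_requests)
print(predictor.predict_next())  # 128
```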
### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
## Scaling Algorithm
At each adjustment interval, the SLA planner performs the following steps:
### 1. Metric Collection
Every adjustment interval, collect:
- Average Time to First Token (TTFT)
- Average Inter-Token Latency (ITL)
- Request count and duration
- Input/Output sequence lengths
### 2. Correction Factor Calculation
Using the collected metrics, the SLA planner applies the interpolator to compute the expected TTFT/ITL and calibrates the interpolation model. This step matters because the actual TTFT/ITL often differ from the profiled, ideal values:
- **TTFT**: Actual TTFT depends heavily on request queueing and the prefix cache hit rate (if KV reuse is enabled). For example, if all requests arrive at the beginning of the adjustment interval, they queue heavily and TTFT will be significantly higher. If the prefix cache hit rate is very high, the actual number of tokens in the prefill will be very low and TTFT will be significantly lower.
- **ITL**: Actual ITL may be affected by small chunked prefill requests in the decode engine.
- **Metric variances**: Large variances in request rate, ISL, and OSL may lead to inaccurate TTFT/ITL estimates, since the SLA planner considers only the averages when interpolating.
The SLA planner calculates the correction factors as:
- **Prefill correction**: `actual_ttft / expected_ttft`
- **Decode correction**: `actual_itl / expected_itl`
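The two ratios can be sketched as a single helper. The function name and the latency values below are hypothetical and only illustrate the arithmetic; a factor above 1 means the system is slower than the profiled data predicts, and a factor below 1 means it is faster:

```python
def correction_factor(actual: float, expected: float) -> float:
    """Ratio of the observed latency to the latency expected by the interpolator."""
    if expected <= 0:
        raise ValueError("expected latency must be positive")
    return actual / expected


# Hypothetical measurements in milliseconds.
prefill_correction = correction_factor(actual=300.0, expected=200.0)
decode_correction = correction_factor(actual=18.0, expected=20.0)
print(prefill_correction, decode_correction)  # 1.5 0.9
```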
### 3. Load Prediction
The SLA planner uses the load predictor to forecast the following metrics for the next interval:
- Number of requests
- Input sequence length
- Output sequence length
### 4. Calculating Number of Replicas
**Prefill replicas**: The SLA planner assumes the prefill correction factor has a linear effect on the prefill throughput per GPU, as prefill is single-batched.
```
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```
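A worked instance of the prefill formula, as a sketch. The function and parameter names are assumptions chosen to mirror the pseudocode above, and the input numbers are made up:

```python
import math


def prefill_replicas(
    next_requests: float,
    next_isl: float,
    interval_s: float,
    prefill_correction: float,
    interpolated_throughput: float,  # prefill tokens/s per GPU, from profiling
    gpus_per_engine: int,
) -> int:
    # Predicted prefill load in tokens/s, scaled down when the correction factor is below 1.
    predicted_load = next_requests * next_isl / interval_s * min(1.0, prefill_correction)
    return math.ceil(predicted_load / interpolated_throughput / gpus_per_engine)


# 600 requests of 3000 input tokens over 180 s is 10000 tokens/s.
print(prefill_replicas(600, 3000, 180, 1.5, 4000.0, 1))  # 3
print(prefill_replicas(600, 3000, 180, 0.5, 4000.0, 1))  # 2
```

Note that `min(1, prefill_correction)` caps the factor at 1, so a correction above 1 does not increase the predicted load in this formula.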
**Decode replicas**:
```
# 1. apply d_correction_factor to the ITL SLA
corrected_itl = self.args.itl / self.d_correction_factor
# 2. reversely find out what is best throughput/gpu that can achieve corrected_itl under the predicted context length
pred_decode_thpt_per_gpu = self.decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2
)
# 3. compute number of decode replicas needed
next_num_d = math.ceil(next_num_req * next_osl / self.args.adjustment_interval / pred_decode_thpt_per_gpu / self.args.decode_engine_num_gpu)
```
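The same three steps can be run standalone with a stand-in for the interpolator. In this sketch `toy_best_throughput_per_gpu` is a hypothetical placeholder with made-up values, not the planner's `find_best_throughput_per_gpu`, and all inputs are illustrative:

```python
import math


def toy_best_throughput_per_gpu(itl: float, context_length: float) -> float:
    """Hypothetical stand-in: a looser ITL target allows a higher throughput per GPU."""
    return 1000.0 if itl >= 15.0 else 500.0  # decode tokens/s per GPU (made up)


def decode_replicas(
    itl_sla: float,
    d_correction_factor: float,
    next_num_req: float,
    next_isl: float,
    next_osl: float,
    adjustment_interval: float,
    decode_engine_num_gpu: int,
) -> int:
    # 1. Tighten or relax the ITL target according to the correction factor.
    corrected_itl = itl_sla / d_correction_factor
    # 2. Look up the best throughput per GPU at the average context length.
    thpt_per_gpu = toy_best_throughput_per_gpu(
        itl=corrected_itl, context_length=next_isl + next_osl / 2
    )
    # 3. Convert the predicted decode load (tokens/s) into a replica count.
    load = next_num_req * next_osl / adjustment_interval
    return math.ceil(load / thpt_per_gpu / decode_engine_num_gpu)


# ITL SLA of 20 ms with a correction of 1.25 gives a corrected target of 16 ms.
print(decode_replicas(20.0, 1.25, 1800, 3000, 250, 180, 1))  # 3
```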
### 5. Scaling
Finally, the SLA planner applies the change by scaling the number of prefill and decode workers up or down to the calculated number of replicas for the next interval.
<Note>
The SLA planner scales the P/D engines up and down without blocking. If `adjustment-interval` is too short, previous scaling operations may not finish before new ones are issued. Make sure to set a large enough `adjustment-interval`.
</Note>
## Deploying
For complete deployment instructions, see the [SLA Planner Quick Start Guide](sla-planner-quickstart.md).
<Note>
The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
</Note>
### Virtual Deployment
The SLA planner supports a virtual deployment mode for custom environments (e.g., a custom cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing the deployment infrastructure.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions and waits for the deployment environment to acknowledge completion.
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Before making new decisions, the planner verifies that previous scaling operations have completed by checking if `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If a scaling decision isn't acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions anyway
5. **Completion Tracking**: The planner can optionally wait for scaling completion confirmation (blocking mode)
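The change detection, readiness check, and timeout steps above can be sketched as two small predicates. The function names are hypothetical, and treating the timeout as inclusive (`>=`) is an assumption; only the `scaled_decision_id >= decision_id` comparison and the 1800-second limit come from the flow above:

```python
import time
from typing import Optional, Tuple

SCALING_TIMEOUT_S = 1800  # 30 minutes, as stated in the flow above


def needs_scaling(current: Tuple[int, int], target: Tuple[int, int]) -> bool:
    """Change detection: scale only when the (prefill, decode) targets differ."""
    return current != target


def ready_for_new_decision(
    decision_id: int,
    scaled_decision_id: int,
    issued_at: float,
    now: Optional[float] = None,
) -> bool:
    """Readiness check with timeout handling."""
    now = time.monotonic() if now is None else now
    if scaled_decision_id >= decision_id:
        return True  # the previous decision was acknowledged
    # Not acknowledged: proceed anyway once the timeout has elapsed.
    return (now - issued_at) >= SCALING_TIMEOUT_S


print(needs_scaling((2, 4), (2, 4)))  # False
print(ready_for_new_decision(5, 4, issued_at=0.0, now=100.0))  # False
print(ready_for_new_decision(5, 4, issued_at=0.0, now=1800.0))  # True
```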
#### Configuration
To use virtual deployment mode:
```yaml
environment: "virtual"
backend: "vllm" # or "sglang"
```
#### Deployment Environment Requirements
The external deployment environment must use `VirtualConnectorClient`:
```
from dynamo._core import DistributedRuntime, VirtualConnectorClient
client = VirtualConnectorClient(distributed_runtime, namespace)
```
1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()`. This blocks until there is a change.
2. **Parse Decisions**: Read `num_prefill_workers` and `num_decode_workers` values: `decision = await client.get()`
3. **Execute Scaling**: Apply the scaling decisions to the actual deployment infrastructure
4. **Acknowledge Completion**: Mark the decision completed when scaling is finished: `await client.complete(decision)`
A scaling decision (returned by `client.get()`) contains the following fields, which are -1 if not set yet:
- `num_prefill_workers`: Integer specifying the target number of prefill workers
- `num_decode_workers`: Integer specifying the target number of decode workers
- `decision_id`: Integer with incremental ID for each scaling decision
See `components/planner/test/test_virtual_connector.py` for a full example.
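The four steps above can be sketched as a loop. To keep the example runnable without a Dynamo runtime, `FakeClient` and `FakeDecision` are hypothetical in-memory stand-ins that only mimic the `wait()`, `get()`, and `complete()` calls named above; with a real deployment you would pass a `VirtualConnectorClient` instead:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeDecision:
    num_prefill_workers: int = -1
    num_decode_workers: int = -1
    decision_id: int = -1


class FakeClient:
    """Hypothetical stand-in for VirtualConnectorClient (not the real class)."""

    def __init__(self, decisions):
        self._pending = list(decisions)
        self.completed = []

    async def wait(self):
        # The real client blocks until there is a change; here we only yield.
        await asyncio.sleep(0)

    async def get(self):
        return self._pending[0]

    async def complete(self, decision):
        self.completed.append(decision.decision_id)
        self._pending.pop(0)


async def handle_decisions(client, scale, count):
    """Process `count` decisions: wait, read, scale, acknowledge."""
    for _ in range(count):
        await client.wait()
        decision = await client.get()
        if decision.decision_id < 0:
            continue  # fields are -1 until a decision has been set
        await scale(decision.num_prefill_workers, decision.num_decode_workers)
        await client.complete(decision)


applied = []


async def scale(prefill, decode):
    # Replace with calls into your own deployment infrastructure.
    applied.append((prefill, decode))


client = FakeClient([FakeDecision(2, 4, 1), FakeDecision(3, 4, 2)])
asyncio.run(handle_decisions(client, scale, count=2))
print(applied)  # [(2, 4), (3, 4)]
```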
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Run"
---
`dynamo-run` is a Rust binary that lets you easily run a model and explore the Dynamo components, and that demonstrates the Rust API. It supports the `mistral.rs` engine, as well as the testing engines `echo` and `mocker`.
It is primarily for development and rapid prototyping. For production use, we recommend the Python-wrapped components; see the main project README.
## Basics
Usage: See `dynamo-run --help`
Example: `dynamo-run Qwen/Qwen3-0.6B`
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
To adjust verbosity, use `-v` to enable debug logging or `-vv` to enable full trace logging. For example:
```bash
dynamo-run in=http out=mistralrs <model> -v # enables debug logging
```
### Use model from Hugging Face
To automatically download Qwen3 4B from Hugging Face (16 GiB download) and to start it in interactive text mode:
```
dynamo-run Qwen/Qwen3-4B
```
The general format for HF download follows this pattern:
```
dynamo-run out=<engine> <HUGGING_FACE_ORGANIZATION/MODEL_NAME>
```
For gated models (such as meta-llama/Llama-3.2-3B-Instruct), you must set an `HF_TOKEN` environment variable.
The parameter can be the ID of a HuggingFace repository (which will be downloaded) or a folder containing safetensors, config.json, or similar (perhaps a locally checked out HuggingFace repository).
### Run a model from local file
To run a model from local file:
- Download the model from Hugging Face
- Run the model from local file
See the following sections for details.
#### Download model from Hugging Face
This model, available from Hugging Face, should be high quality and fast on almost any machine: https://huggingface.co/Qwen/Qwen3-0.6B
To run the model:
*Text interface*
```
dynamo-run Qwen/Qwen3-0.6B
```
You can also pipe a prompt into `dynamo-run`:
```
echo 'What is the capital of Tuvalu?' | dynamo-run Qwen/Qwen3-0.6B --context-length 4096
```
*HTTP interface*
```
dynamo-run in=http out=mistralrs Qwen/Qwen3-0.6B
```
You can also list models or send a request:
*List the models*
```
curl localhost:8080/v1/models
```
*Send a request*
```
curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
```
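The same chat request can be built from Python using only the standard library. This is a sketch: the helper names are illustrative, and the host, port, and payload simply mirror the `curl` command above. Calling `send` requires the HTTP server to be running:

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", base_url="http://localhost:8080"):
    """Build (but do not send) the same POST request as the curl example."""
    body = {
        "model": model,
        "max_completion_tokens": 2049,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    # Only works while `dynamo-run in=http ...` is listening on the target address.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


request = build_chat_request("What is the capital of South Africa?")
print(request.get_method(), request.full_url)
```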
## Distributed System
You can run the ingress side (HTTP server and pre-processing) on one machine, for example a CPU node, and the worker on a different machine (a GPU node).
You will need [etcd](https://etcd.io/) and [nats](https://nats.io) with jetstream installed and accessible from both nodes. For development I run NATS like this: `nats-server -js --trace --store_dir $(mktemp -d)`.
**Node 1:** OpenAI compliant HTTP server, optional pre-processing, worker discovery:
```
dynamo-run in=http out=auto
```
**Node 2:** Engine. Receives and returns requests over the network:
```
dynamo-run in=dyn://llama3B.backend.generate out=mistralrs ~/llms/Llama-3.2-3B-Instruct
```
This uses etcd to auto-discover the model and NATS to talk to it. You can
run multiple instances on the same endpoint; it picks one based on the
`--router-mode` (round-robin by default if left unspecified).
Run `dynamo-run --help` for more options.
### Network names
The `in=dyn://` URLs have the format `dyn://namespace.component.endpoint`. For a quick start, use any string, such as `dyn://test`; `dynamo-run` will default any missing parts for you. The pieces matter for a larger system.
* *Namespace*: A pipeline. Usually a model, e.g., "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.
If you run two models, that is two pipelines. Speculative decoding is an exception: the draft model is part of the pipeline of the bigger model.
If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
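To make the naming scheme concrete, here is a small parsing sketch. `parse_dyn_url` and `DynAddress` are hypothetical helpers, not part of Dynamo, and the sketch only accepts fully qualified names; the defaulting that `dynamo-run` applies to shorter strings is not reproduced:

```python
from typing import NamedTuple


class DynAddress(NamedTuple):
    namespace: str
    component: str
    endpoint: str


def parse_dyn_url(url: str) -> DynAddress:
    """Split a fully qualified dyn://namespace.component.endpoint URL."""
    prefix = "dyn://"
    if not url.startswith(prefix):
        raise ValueError(f"not a dyn:// URL: {url!r}")
    parts = url[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {url!r}")
    return DynAddress(*parts)


a = parse_dyn_url("dyn://qwen3-32b.backend.generate")
b = parse_dyn_url("dyn://qwen3-32b.backend.generate")
print(a.namespace, a.component, a.endpoint)  # qwen3-32b backend generate
# Two workers with equal addresses are instances of the same endpoint.
print(a == b)  # True
```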
Example 1: Data parallel load balanced, one model one pipeline two instances.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
Node 2: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
```
Example 2: Two models, two pipelines.
```
Node 1: dynamo-run in=dyn://qwen3-32b.backend.generate /data/Qwen3-32B
Node 2: dynamo-run in=dyn://llama3-1-8b.backend.generate /data/Llama-3.1-8B-Instruct/
```
Example 3: Different endpoints.
The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM, it will also expose `llama3-1-8b.backend.load_metrics`.
Example 4: Multiple components in a pipeline.
In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
For output, always use `out=auto`. This tells Dynamo to auto-discover the instances, group them by model, and load balance appropriately (depending on the `--router-mode` flag).
### KV-aware routing
```
dynamo-run in=http out=auto --router-mode kv
```
The only difference from the distributed system above is `--router-mode kv`. vLLM announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
The KV-aware routing arguments:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.
- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, the router uses KV events to track block creation and deletion from workers. If false, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Set false if your backend engine does not emit KV events.
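The effect of `--router-temperature` can be illustrated with a generic softmax-sampling sketch over per-worker costs, where lower is better. This is not the router's actual cost function or implementation; `pick_worker` and the cost values are assumptions used only to show how temperature 0 reduces to picking the minimum:

```python
import math
import random
from typing import Optional, Sequence


def pick_worker(
    costs: Sequence[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Return a worker index; lower cost is more likely to be chosen."""
    if not costs:
        raise ValueError("costs must not be empty")
    if temperature <= 0:
        # Deterministic behavior: always pick the lowest cost.
        return min(range(len(costs)), key=costs.__getitem__)
    rng = rng or random.Random()
    lowest = min(costs)
    # Softmax over negated costs; subtracting the minimum keeps exp() stable.
    weights = [math.exp(-(c - lowest) / temperature) for c in costs]
    return rng.choices(range(len(costs)), weights=weights, k=1)[0]


costs = [3.0, 1.0, 2.0]
print(pick_worker(costs, temperature=0.0))  # 1
print(pick_worker(costs, temperature=1.0, rng=random.Random(0)) in (0, 1, 2))  # True
```

Higher temperatures flatten the weights, so traffic spreads more evenly across workers at the cost of choosing the best match less often.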
### Request Migration
In a [Distributed System](#distributed-system), you can enable [request migration](../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
```bash
dynamo-run in=dyn://... out=<engine> ... --migration-limit=3
```
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../fault-tolerance/request-migration.md) documentation for details on how this works.
### Request Cancellation
When using the HTTP interface (`in=http`), if the HTTP request connection is dropped by the client, Dynamo automatically cancels the downstream request to the worker. This ensures that computational resources are not wasted on generating responses that are no longer needed.
For detailed information about how request cancellation works across the system, see the [Request Cancellation Architecture](../fault-tolerance/request-cancellation.md) documentation.
## Development
`dynamo-run` is also an example of what can be built in Rust with the `dynamo-llm` and `dynamo-runtime` crates. The following guide shows how to build from source with all the features.
### Step 1: Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libssl-dev libclang-dev protobuf-compiler python3-dev cmake
```
**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)
```
brew install cmake protobuf
## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
### Step 2: Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
### Step 3: Build
- Linux with GPU and CUDA (tested on Ubuntu):
```
cargo build --features cuda
```
- macOS with Metal:
```
cargo build --features metal
```
- CPU only:
```
cargo build
```
Optionally you can run `cargo build` from any location with arguments:
```
--target-dir /path/to/target_directory # specify target_directory with write privileges
--manifest-path /path/to/project/Cargo.toml # if cargo build is run outside of `launch/` directory
```
The binary is called `dynamo-run` in `target/debug`
```
cd target/debug
```
Build with `--release` for a smaller binary and better performance, but longer build times. The binary will be in `target/release`.
## Engines
The input defaults to `in=text`. The output defaults to the `out=mistralrs` engine, unless it is disabled with `--no-default-features`, in which case an engine that echoes back your input is used.
### mistralrs
[mistral.rs](https://github.com/EricLBuehler/mistral.rs) is a pure Rust engine that is fast to run and fast to load, and runs well on CPU as well as GPU. For those reasons it is the default engine.
```
dynamo-run Qwen/Qwen3-4B
```
is equivalent to
```
dynamo-run in=text out=mistralrs Qwen/Qwen3-4B
```
If you have multiple GPUs, `mistral.rs` does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
### Mocker engine
The mocker engine is a mock vLLM implementation designed for testing and development purposes. It simulates realistic token generation timing without requiring actual model inference, making it useful for:
- Testing distributed system components without GPU resources
- Benchmarking infrastructure and networking overhead
- Developing and debugging Dynamo components
- Load testing and performance analysis
**Basic usage:**
The `--model-path` is required but can point to any valid model path. The mocker doesn't actually load the model weights, but the pre-processor needs the tokenizer. The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real vLLM engine.
The following arguments are mocker-specific:
- `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulated engines run faster.
- `dp_size`: Number of data parallel workers to simulate (default: 1)
- `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real vLLM engine but cannot be passed as an engine arg.
```bash
echo '{"speedup_ratio": 10.0}' > mocker_args.json
dynamo-run in=dyn://dynamo.mocker.generate out=mocker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args mocker_args.json
dynamo-run in=http out=auto --router-mode kv
```
### echo
The `echo` engine echoes the prompt back as the response.
```
dynamo-run in=http out=echo --model-name my_model
```
The echo engine uses a configurable delay between tokens to simulate generation speed. You can adjust this using the `DYN_TOKEN_ECHO_DELAY_MS` environment variable:
```
# Set token echo delay to 1ms (1000 tokens per second)
DYN_TOKEN_ECHO_DELAY_MS=1 dynamo-run in=http out=echo
```
The default delay is 10ms, which produces approximately 100 tokens per second.
### Other engines, multi-node, production
The production-grade engines `vllm`, `sglang`, and `trtllm` are available in `examples/backends`. They run as Python components, using the Rust bindings. See the main README.
`dynamo-run` is an exploration, development, and prototyping tool, as well as an example of using the Rust API. Multi-node and production setups should use the main engine components.
## Batch mode
`dynamo-run` can take a jsonl file full of prompts and evaluate them all:
```
dynamo-run in=batch:prompts.jsonl out=mistralrs <model>
```
The input file should look like this:
```
{"text": "What is the capital of France?"}
{"text": "What is the capital of Spain?"}
```
Each one is passed as a prompt to the model. The output is written back to the same folder in `output.jsonl`. At the end of the run some statistics are printed.
The output looks like this:
```
{"text":"What is the capital of France?","response":"The capital of France is Paris.","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}
{"text":"What is the capital of Spain?","response":".The capital of Spain is Madrid.","tokens_in":7,"tokens_out":7,"elapsed_ms":855}
```
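Preparing the input file and summarizing the output can be scripted. This sketch assumes only the JSONL shapes shown above; the helper names are illustrative, and the output rows are copied from the sample rather than produced by a real run:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_prompts(path: Path, prompts: List[str]) -> None:
    """Write one {"text": ...} object per line."""
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"text": prompt}) + "\n")


def summarize_output(path: Path) -> Dict[str, int]:
    """Total the per-prompt statistics found in an output.jsonl file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    return {
        "prompts": len(rows),
        "tokens_out": sum(r["tokens_out"] for r in rows),
        "elapsed_ms": sum(r["elapsed_ms"] for r in rows),
    }


with tempfile.TemporaryDirectory() as tmp:
    prompts_file = Path(tmp) / "prompts.jsonl"
    write_prompts(prompts_file, ["What is the capital of France?", "What is the capital of Spain?"])
    written = prompts_file.read_text(encoding="utf-8")

    # Stand-in for a real run: reuse the statistics from the sample output above.
    output_file = Path(tmp) / "output.jsonl"
    output_file.write_text(
        '{"text":"What is the capital of France?","tokens_in":7,"tokens_out":7,"elapsed_ms":1566}\n'
        '{"text":"What is the capital of Spain?","tokens_in":7,"tokens_out":7,"elapsed_ms":855}\n',
        encoding="utf-8",
    )
    summary = summarize_output(output_file)

print(summary)  # {'prompts': 2, 'tokens_out': 14, 'elapsed_ms': 2421}
```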
## Writing your own engine in Python
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. All of the main backend components in `examples/backends/` work like this.
The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler
```
import asyncio

import uvloop

from dynamo.llm import ModelInput, ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker

# 1. Decorate a function to get the runtime
#
@dynamo_worker()
async def worker(runtime: DistributedRuntime):
    # 2. Register ourselves on the network
    #
    component = runtime.namespace("namespace").component("component")
    model_path = "Qwen/Qwen3-0.6B"  # or "/data/models/Qwen3-0.6B"
    model_input = ModelInput.Tokens  # or ModelInput.Text if engine handles pre-processing
    model_type = ModelType.Chat  # or ModelType.Chat | ModelType.Completions if model can be deployed on chat and completions endpoints
    endpoint = component.endpoint("endpoint")
    # Optional last param to register_llm is model_name. If not present derives it from model_path
    await register_llm(model_input, model_type, endpoint, model_path)

    # Initialize your engine here
    engine = ...  # placeholder: replace with your engine instance

    # 3. Attach request handler
    #
    await endpoint.serve_endpoint(RequestHandler(engine).generate)


class RequestHandler:
    def __init__(self, engine):
        ...

    async def generate(self, request):
        # Call the engine
        # yield result dict
        ...


if __name__ == "__main__":
    uvloop.install()
    asyncio.run(worker())
```
The `model_path` can be:
- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a HuggingFace repo - any folder containing safetensor files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
The `model_input` can be:
- ModelInput.Tokens. Your engine expects pre-processed input (token IDs). Dynamo handles tokenization and pre-processing.
- ModelInput.Text. Your engine expects raw text input and handles its own tokenization and pre-processing.
The `model_type` can be:
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat).
- ModelType.Completions. Your `generate` method receives a `request` and must return a response dict in the older [Completions](https://platform.openai.com/docs/api-reference/completions) format.
`register_llm` can also take the following kwargs:
- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.
Here are some example engines:
- Backend:
* [vllm](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_vllm.py)
* [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang.py)
- Chat:
* [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)
More fully-featured Python engines are in `examples/backends`.
## Debugging
`dynamo-run` and `dynamo-runtime` support [tokio-console](https://github.com/tokio-rs/console). Build with the feature to enable:
```
cargo build --features cuda,tokio-console -p dynamo-run
```
The listener uses the default tokio console port, and all interfaces (0.0.0.0).
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "NVIDIA Dynamo Glossary"
---
## B
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
## C
**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).
**Conditional Disaggregation** - Dynamo's intelligent decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine based on prefill length and queue status.
## D
**Decode Phase** - The second phase of LLM inference that generates output tokens one at a time.
**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.
**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.
**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.
**Dynamo Kubernetes Platform** - A Kubernetes platform providing managed deployment experience for Dynamo inference graphs.
## E
**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.
## F
**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.
## G
**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.
## I
**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.
## K
**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.
**KV Cache** - Key-Value cache that stores computed attention states from previous tokens to avoid recomputation during inference.
**KV Router** - Dynamo's intelligent routing system that directs requests to workers with the highest cache overlap to maximize KV cache reuse. Determines routing based on KV cache hit rates and worker metrics.
**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers using a prefix tree structure to calculate cache hit rates.
**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.
## M
**Model Deployment Card (MDC)** - A configuration structure containing all information required for distributed model serving. When a worker loads a model, it creates an MDC containing references to components such as the tokenizer, templates, runtime config. Workers publish their MDC to make the model discoverable to frontends. Frontends use the MDC to configure request preprocessing (tokenization, prompt formatting).
## N
**Namespace** - Dynamo's logical grouping mechanism for related components. Similar to directories in a file system, they prevent collisions between different deployments.
**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.
## P
**PagedAttention** - Memory management technique from vLLM that efficiently manages KV cache by chunking requests into blocks.
**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.
**Prefill Phase** - The first phase of LLM inference that processes the input prompt and generates KV cache.
**Prefix Caching** - Optimization technique that reuses previously computed KV cache for common prompt prefixes.
**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.
## R
**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.
**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.
## S
**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.
## T
**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.
**TensorRT-LLM** - NVIDIA's optimized LLM inference engine with multinode MPI distributed support.
**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
## V
**vLLM** - High-throughput LLM serving engine with distributed tensor/pipeline parallelism and PagedAttention.
## W
**Wide Expert Parallelism (WideEP)** - Mixture-of-Experts deployment strategy that spreads experts across many GPUs (e.g., 64-way EP) so each GPU hosts only a few experts.
## X
**xPyD (x Prefill y Decode)** - Dynamo notation describing disaggregated serving configurations where x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Dynamo Support Matrix"
---
This document provides the support matrix for Dynamo, including hardware, software and build instructions.
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| :------------------- | :----------- |
| **x86_64** | Supported |
| **ARM64** | Supported |
### GPU Compatibility
If you are using a **GPU**, the following GPU models and architectures are supported:
| **GPU Architecture** | **Status** |
| :----------------------------------- | :--------- |
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
| **Operating System** | **Version** | **Architecture** | **Status** |
| :------------------- | :---------- | :--------------- | :----------- |
| **Ubuntu** | 22.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **CentOS Stream** | 9 | x86_64 | Experimental |
<Note>
Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04).
Compatibility with other Linux distributions is expected but has not been officially verified yet.
</Note>
<Error>
KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
</Error>
## Software Compatibility
### Runtime Dependency
| **Python Package** | **Version** | glibc version | CUDA Version |
| :----------------- | :---------- | :------------------------------------ | :----------- |
| ai-dynamo | 0.8.0 | >=2.28 | |
| ai-dynamo-runtime | 0.8.0 | >=2.28 (Python 3.12 has known issues) | |
| NIXL | 0.8.0 | >=2.27 | >=11.8 |
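
The glibc minimums in the table above can be checked on a target host before installing the wheels. The following is a minimal sketch using only the Python standard library; the helper names `parse_version` and `meets_minimum` are illustrative and not part of Dynamo.

```python
import platform
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '2.28' into (2, 28)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def meets_minimum(found: str, minimum: str) -> bool:
    """Return True when `found` is numerically at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    libc_name, libc_version = platform.libc_ver()
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print(f"libc: {libc_name or 'unknown'} {libc_version or 'unknown'}")
    if libc_name == "glibc" and libc_version:
        # 2.28 is the ai-dynamo / ai-dynamo-runtime minimum listed in the table.
        print("glibc >= 2.28:", meets_minimum(libc_version, "2.28"))
```

Versions are compared as integer tuples rather than strings so that, for example, `2.4` is correctly treated as older than `2.28`.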
### Build Dependency
The following table shows the dependency versions included with each Dynamo release:
| **Dependency** | **main (ToT)** | **v0.8.0 (unreleased)** | **v0.7.1** | **v0.7.0.post1** | **v0.7.0** |
| :------------- | :------------- | :---------------------- | :--------- | :--------------- | :--------- |
| SGLang | 0.5.7 | 0.5.7 | 0.5.3.post4| 0.5.3.post4 | 0.5.3.post4|
| TensorRT-LLM | 1.2.0rc6 | 1.2.0rc6 | 1.2.0rc3 | 1.2.0rc3 | 1.2.0rc2 |
| vLLM | 0.13.0 | 0.12.0 | 0.11.0 | 0.11.0 | 0.11.0 |
| NIXL | 0.8.0 | 0.8.0 | 0.8.0 | 0.8.0 | 0.8.0 |
<Note>
**main (ToT)** reflects the current development branch. **v0.8.0** is the upcoming release (planned for January 14, 2026) and is not yet available.
</Note>
<Warning>
Specific versions of TensorRT-LLM supported by Dynamo are subject to change. TensorRT-LLM does not currently support Python 3.11, so installing `ai-dynamo[trtllm]` on Python 3.11 will fail.
</Warning>
### CUDA Support by Framework
| **Dynamo Version** | **SGLang** | **TensorRT-LLM** | **vLLM** |
| :------------------- | :-----------------------| :-----------------------| :-----------------------|
| **Dynamo 0.7.1** | CUDA 12.8 | CUDA 13.0 | CUDA 12.9 |
## Cloud Service Provider Compatibility
### AWS
| **Host Operating System** | **Version** | **Architecture** | **Status** |
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported¹ |
<Error>
There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with a more precise networking configuration that maps only the necessary ports (e.g., 4222 for NATS, 2379/2380 for etcd, 8000 for the frontend).
</Error>
## Build Support
**Dynamo** currently provides build support in the following ways:
- **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
  - [ai-dynamo](https://pypi.org/project/ai-dynamo/)
  - [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/)
  - **New as of Dynamo v0.7.0:** [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
- **Dynamo Runtime Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Runtime for each of the LLM inference frameworks on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
  - [SGLang](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime)
  - [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime)
  - [vLLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
- **Dynamo Kubernetes Operator Images**: We distribute multi-arch images (x86 & ARM64 compatible) of the Dynamo Operator on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
  - [kubernetes-operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) to simplify deployments of Dynamo Graphs.
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo:
  - [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds)
  - [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform)
  - [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph)
- **Rust Crates**:
  - [dynamo-runtime](https://crates.io/crates/dynamo-runtime/)
  - [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai/)
  - [dynamo-parsers](https://crates.io/crates/dynamo-parsers/)
  - [dynamo-llm](https://crates.io/crates/dynamo-llm/)
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KV Router"
---
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
- Tracks the state of all registered workers
- Makes routing decisions based on KV cache overlap
- Balances load across available workers
### Kubernetes Deployment
To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    Worker:
      # ... worker configuration ...
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
**Complete K8s Examples:**
- [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
## Configuration Options
### CLI Arguments (Python Deployment)
The KV Router supports several key configuration options:
- **`--router-mode kv`**: Enable KV cache-aware routing (required)
- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
  - `0.0`: Deterministic selection of the best worker
  - `> 0.0`: Probabilistic selection using softmax sampling
  - Higher values increase randomness, helping prevent worker saturation
- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
- **`--kv-overlap-score-weight <float>`**: Balance between prefill and decode optimization (default: 1.0)
  - Higher values (> 1.0): Prioritize reducing prefill cost (better TTFT)
  - Lower values (< 1.0): Prioritize decode performance (better ITL)
For a complete list of available options:
```bash
python -m dynamo.frontend --help
```
### Kubernetes Environment Variables
All CLI arguments can be configured via environment variables in Kubernetes deployments. Use the `DYN_` prefix with uppercase parameter names:
| CLI Argument | K8s Environment Variable | Default | Description |
|--------------|-------------------------|---------|-------------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` | Enable KV router |
| `--router-temperature <float>` | `DYN_ROUTER_TEMPERATURE=<float>` | `0.0` | Routing randomness |
| `--kv-cache-block-size <size>` | `DYN_KV_CACHE_BLOCK_SIZE=<size>` | Backend-specific | KV cache block size |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | Disable KV event tracking |
| `--kv-overlap-score-weight <float>` | `DYN_KV_OVERLAP_SCORE_WEIGHT=<float>` | `1.0` | Prefill vs decode weight |
| `--http-port <port>` | `DYN_HTTP_PORT=<port>` | `8000` | HTTP server port |
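
The naming rule in the table above (the `DYN_` prefix plus the uppercase parameter name) can be sketched as a small helper. `flag_to_env` is a hypothetical function written for illustration, not a Dynamo API. Treating every `--no-...` flag like `--no-kv-events` is an assumption generalized from the single row in the table.

```python
def flag_to_env(flag: str, value: str = "true") -> tuple[str, str]:
    """Map a frontend CLI flag to the (name, value) of its environment variable."""
    name = flag.lstrip("-")
    if name.startswith("no-"):
        # Assumption: negated boolean flags map to the positive variable set to
        # "false", as the table shows for --no-kv-events -> DYN_KV_EVENTS=false.
        name, value = name[len("no-"):], "false"
    return "DYN_" + name.replace("-", "_").upper(), value


if __name__ == "__main__":
    for flag, value in [("--router-mode", "kv"), ("--router-temperature", "0.5")]:
        env_name, env_value = flag_to_env(flag, value)
        print(f"{env_name}={env_value}")
```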
### Example with Advanced Configuration
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
        - name: DYN_ROUTER_TEMPERATURE
          value: "0.5"  # Add some randomness to prevent worker saturation
        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
          value: "1.5"  # Prioritize TTFT over ITL
        - name: DYN_KV_CACHE_BLOCK_SIZE
          value: "16"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
```
### Alternative: Using Command Args in K8s
You can also pass CLI arguments directly in the container command:
```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    command:
      - /bin/sh
      - -c
    args:
      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
```
**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
## KV Router Architecture
The KV Router tracks two key metrics for each worker:
1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request were routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The number of blocks' worth of tokens that would have to be computed from scratch on a worker, calculated as:
   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
   - Potential prefill blocks = New prefill tokens / Block size
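
The two formulas above can be worked through in a few lines. This is a sketch only: the function name is illustrative, and clamping the token count at zero is an added assumption that the source does not state.

```python
def potential_prefill_blocks(total_input_tokens: int, overlap_blocks: int, block_size: int) -> float:
    """Blocks' worth of tokens a worker would still have to prefill."""
    # New prefill tokens = total input tokens - (overlap blocks x block size).
    # The max(..., 0) clamp is an assumption made for this sketch.
    new_prefill_tokens = max(total_input_tokens - overlap_blocks * block_size, 0)
    # Potential prefill blocks = new prefill tokens / block size (may be fractional).
    return new_prefill_tokens / block_size


if __name__ == "__main__":
    # 1000 input tokens, 40 cached blocks of 16 tokens: 1000 - 640 = 360 tokens left.
    print(potential_prefill_blocks(1000, overlap_blocks=40, block_size=16))
```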
### Block Tracking Mechanisms
The router maintains block information through two complementary systems:
- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
  - Incremented when adding a new request
  - Updated during token generation
  - Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
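
The active-block lifecycle described above (increment on a new request, update during generation, decrement on completion) might be tracked along the following lines. `ActiveBlockTracker` and its method names are hypothetical and do not reflect the router's actual data structures. Counting a partially filled block as a whole block is also an assumption.

```python
import math
from collections import defaultdict


class ActiveBlockTracker:
    """Hypothetical per-router bookkeeping of active decode blocks per worker."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.requests: dict[str, tuple[str, int]] = {}  # request_id -> (worker, tokens)
        self.active_blocks: dict[str, int] = defaultdict(int)

    def _blocks(self, tokens: int) -> int:
        # Assumption: a partially filled block still occupies a whole block.
        return math.ceil(tokens / self.block_size)

    def add_request(self, request_id: str, worker: str, prompt_tokens: int) -> None:
        """Increment the worker's active blocks when a request is routed to it."""
        self.requests[request_id] = (worker, prompt_tokens)
        self.active_blocks[worker] += self._blocks(prompt_tokens)

    def on_token(self, request_id: str) -> None:
        """Update the count when a generated token spills into a new block."""
        worker, tokens = self.requests[request_id]
        self.active_blocks[worker] += self._blocks(tokens + 1) - self._blocks(tokens)
        self.requests[request_id] = (worker, tokens + 1)

    def complete(self, request_id: str) -> None:
        """Decrement the worker's active blocks when the request finishes."""
        worker, tokens = self.requests.pop(request_id)
        self.active_blocks[worker] -= self._blocks(tokens)
```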
## Cost Function
The KV Router's routing decision is based on a simple cost function:
```
logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
```
Where:
- Lower logit values are better (less computational cost)
- The router uses softmax sampling with optional temperature to select workers
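
A minimal sketch of the cost function and temperature-based selection follows. It assumes the softmax is taken over negated costs so that cheaper workers are more likely, and that a temperature of 0 picks the cheapest worker outright. The router's real implementation may normalize or scale the logits differently, and the function names here are illustrative.

```python
import math
import random
from typing import Optional


def worker_cost(prefill_blocks: float, active_blocks: float, overlap_weight: float = 1.0) -> float:
    """logit = kv_overlap_score_weight * potential_prefill_blocks + potential_active_blocks"""
    return overlap_weight * prefill_blocks + active_blocks


def select_worker(costs: dict[str, float], temperature: float = 0.0,
                  rng: Optional[random.Random] = None) -> str:
    """Pick a worker: deterministic at temperature 0, softmax sampling otherwise."""
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    rng = rng or random.Random()
    # Lower cost is better, so the softmax runs over the negated, scaled costs.
    scaled = {worker: -cost / temperature for worker, cost in costs.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    workers = list(scaled)
    weights = [math.exp(scaled[worker] - peak) for worker in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


if __name__ == "__main__":
    costs = {
        "worker_1": worker_cost(prefill_blocks=100.5, active_blocks=25.0),
        "worker_2": worker_cost(prefill_blocks=10.0, active_blocks=40.0),
    }
    print(select_worker(costs))                   # deterministic choice
    print(select_worker(costs, temperature=0.5))  # sampled choice
```

Raising `overlap_weight` increases the penalty for uncached prefill work, so workers with more cache overlap win more often. Lowering it lets the active-block term dominate.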
### Key Parameter: kv-overlap-score-weight
The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
- **Higher values (> 1.0)**: Emphasize reducing prefill cost
  - Prioritizes routing to workers with better cache hits
  - Optimizes for Time To First Token (TTFT)
  - Best for workloads where initial response latency is critical
- **Lower values (< 1.0)**: Emphasize decode performance
  - Distributes active decoding blocks more evenly
  - Optimizes for Inter-Token Latency (ITL)
  - Best for workloads with long generation sequences
## KV Events vs. Approximation Mode
The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
- **With KV Events (default)**:
  - Calculates overlap accurately using actual cached blocks
  - Provides higher accuracy at the cost of event processing overhead
  - Recommended for production deployments
- **Without KV Events (`--no-kv-events`)**:
  - Router predicts cache state based on routing decisions with TTL-based expiration and pruning
  - Tracks blocks from recent requests with configurable time-to-live
  - Reduces overhead at the cost of routing accuracy
  - Suitable for testing or when event processing becomes a bottleneck
## Tuning Guidelines
### 1. Understand Your Workload Characteristics
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
### 2. Monitor Key Metrics
The router logs the cost calculation for each worker:
```
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```
This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
### 3. Temperature-Based Routing
The `--router-temperature` parameter controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution
### 4. Iterative Optimization
1. Begin with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
   - To reduce TTFT: Increase the weight
   - To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "KV Cache Routing"
---
This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
To enable KV cache-aware routing, start the frontend node like this:
```bash
python -m dynamo.frontend --router-mode kv
```
When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
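
To make the event flow concrete, the sketch below applies stored/removed events to an index and scores each worker by how many leading blocks of a request it already holds. It is a simplification rather than the router's radix tree. It assumes each block hash already encodes its full prefix, so a flat map from hash to workers suffices. `BlockIndex` and the event kind strings are illustrative names.

```python
from collections import defaultdict
from typing import Iterable, Optional


class BlockIndex:
    """Simplified global view of which workers hold which cached blocks."""

    def __init__(self) -> None:
        self.workers_by_block: dict[int, set[str]] = defaultdict(set)

    def apply_event(self, worker: str, kind: str, block_hashes: Iterable[int]) -> None:
        """Apply a 'stored' or 'removed' KV event reported by a worker."""
        for block_hash in block_hashes:
            if kind == "stored":
                self.workers_by_block[block_hash].add(worker)
            elif kind == "removed":
                self.workers_by_block[block_hash].discard(worker)

    def overlap(self, request_block_hashes: list[int]) -> dict[str, int]:
        """Count, per worker, the leading request blocks that are already cached."""
        scores: dict[str, int] = {}
        candidates: Optional[set[str]] = None
        for depth, block_hash in enumerate(request_block_hashes, start=1):
            holders = self.workers_by_block.get(block_hash, set())
            candidates = set(holders) if candidates is None else candidates & holders
            if not candidates:
                break  # no worker holds the prefix up to this block
            for worker in candidates:
                scores[worker] = depth
        return scores
```

A worker that does not appear in the result has no overlap with the request, so its entire prompt would count as new prefill work.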
To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
The main KV-aware routing arguments:
- `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--no-kv-events`: Disables KV event tracking. By default (when this flag is not provided), the router uses KV events to monitor block creation and deletion from workers. When disabled with this flag, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Use this flag if your backend doesn't support KV events (or you are not confident in the accuracy or responsiveness of the events).
- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas. When enabled, routers share their active sequence information and local predictions of block usage, improving routing consistency across instances. Note that this does not sync the radix tree or cached KV block states themselves; those are synchronized through JetStream events.
- `--router-reset-states`: When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. By default (when this flag is not provided), the router persists state across restarts, downloading any available snapshot from NATS object store and continuing to consume events from where it left off. This enables routers to maintain KV cache awareness across restarts. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate.
- `--router-snapshot-threshold`: Sets the number of messages in the JetStream stream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in the NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
- `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.
- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines emit `ForwardPassMetrics`. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).
- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.
- `--router-ttl`: Time-to-live in seconds for blocks in the router's local cache predictions. Blocks older than this duration will be automatically expired and removed from the router's radix tree. Defaults to 120.0 seconds when `--no-kv-events` is used. This helps manage memory usage by removing stale cache predictions that are unlikely to be accurate.
- `--router-max-tree-size`: Maximum tree size (number of blocks) before pruning is triggered. When the total number of blocks in the radix tree exceeds this threshold, the router will prune the least recently used blocks. Defaults to 1048576 (2^20 blocks) when `--no-kv-events` is used. This prevents unbounded memory growth in long-running deployments.
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-kv-events` is used. This creates headroom before the next pruning cycle.
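
The two busy-detection thresholds in the list above (`--active-decode-blocks-threshold` and `--active-prefill-tokens-threshold`) can be illustrated with a small predicate. `WorkerLoad` and `is_busy` are hypothetical names. Treating either exceeded threshold as sufficient to mark a worker busy is an assumption based on the flag descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerLoad:
    """Hypothetical snapshot of the load metrics a worker reports."""
    active_decode_blocks: int
    total_blocks: int
    active_prefill_tokens: int


def is_busy(load: WorkerLoad,
            decode_blocks_threshold: Optional[float] = None,
            prefill_tokens_threshold: Optional[int] = None) -> bool:
    """Return True when a configured threshold is exceeded; unset thresholds are skipped."""
    if decode_blocks_threshold is not None and load.total_blocks > 0:
        # Fraction (0.0-1.0) of KV cache blocks currently active.
        if load.active_decode_blocks / load.total_blocks > decode_blocks_threshold:
            return True
    if prefill_tokens_threshold is not None:
        # Literal token count, not a fraction.
        if load.active_prefill_tokens > prefill_tokens_threshold:
            return True
    return False
```

With neither threshold set, the function always returns `False`, mirroring the statement that busy detection is disabled unless a threshold is provided.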
>[!Note]
> **State persistence** depends on the event transport mode:
> - **JetStream mode** (default): State persists across router restarts via JetStream and NATS object store snapshots.
> - **NATS Core with Local Indexer mode** (`--enable-local-indexer` on workers): State persists on workers—router rebuilds state by querying workers on startup.
> - **No KV events** (`--no-kv-events`): State persistence is not supported.
>
> **Request plane is independent of KV event transport.**
> `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing still uses **NATS** for KV events in both JetStream and NATS Core + Local Indexer modes.
> If you run with `DYN_REQUEST_PLANE=tcp` (or `http`) and KV events enabled (default), you must also configure NATS, e.g. `NATS_SERVER=nats://...`.
> Only `--no-kv-events` removes the NATS requirement.
>
> When `--kv-overlap-score-weight` is set to 0, no KvIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KvIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning. In both cases, it's recommended to stop your backend workers from publishing events through `KvEventPublisher` to avoid event accumulation in JetStream. Work is in progress to allow KV event publishing to be disabled completely in these cases.
>
> The CLI arguments `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
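
How `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` interact can be sketched with a flat map of block hashes to last-use times. This is an illustration under stated assumptions, not the router's radix tree: `PredictedCache`, `prune_target`, and the explicit `now` arguments are hypothetical.

```python
def prune_target(max_tree_size: int, prune_target_ratio: float) -> int:
    """Number of blocks kept after pruning, rounded down."""
    return int(max_tree_size * prune_target_ratio)


class PredictedCache:
    """Hypothetical stand-in for the router's predicted cache in --no-kv-events mode."""

    def __init__(self, ttl: float = 120.0, max_tree_size: int = 1_048_576,
                 prune_target_ratio: float = 0.8) -> None:
        self.ttl = ttl
        self.max_tree_size = max_tree_size
        self.prune_target_ratio = prune_target_ratio
        self.last_used: dict[int, float] = {}  # block hash -> last time it was routed

    def touch(self, block_hash: int, now: float) -> None:
        """Record that a routing decision used this block at time `now`."""
        self.last_used[block_hash] = now

    def expire(self, now: float) -> None:
        """Drop predictions older than the TTL."""
        cutoff = now - self.ttl
        self.last_used = {h: t for h, t in self.last_used.items() if t >= cutoff}

    def prune(self) -> None:
        """When over the size limit, keep only the most recently used blocks."""
        if len(self.last_used) <= self.max_tree_size:
            return
        keep = prune_target(self.max_tree_size, self.prune_target_ratio)
        newest_first = sorted(self.last_used.items(), key=lambda item: item[1], reverse=True)
        self.last_used = dict(newest_first[:keep])
```

With the defaults, `prune_target(1_048_576, 0.8)` gives 838860, which matches the figure quoted for `--router-prune-target-ratio`.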
## Prerequisites and Limitations
>[!Note]
> **KV Router Requirements**: The KV router currently works only with **dynamic endpoints** that are registered via [`register_llm()`](../development/backend-guide.md) with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
**Current Limitations (WIP):**
- **Static endpoints**: Not yet supported. The KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states.
- **Multimodal models**: Not yet supported. The KV router currently tracks token-based blocks only.
**What this means for your setup:**
1. Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md) or [example implementations](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/hello_world))
2. Your handler receives requests with pre-tokenized `token_ids`, not raw text or multimodal inputs
3. You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
For basic model registration without KV routing, you can use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
## Disaggregated Serving (Prefill and Decode)
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
### Automatic Prefill Router Activation
The prefill router is automatically created when:
1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`)
2. A prefill worker is detected with the same model name and `ModelType.Prefill`
**Key characteristics of the prefill router:**
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
### Setup Example
When both workers are registered, requests are automatically routed.
```python
# Decode worker registration (in your decode worker)
decode_endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate")
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)
await decode_endpoint.serve_endpoint(decode_handler.generate)

# Prefill worker registration (in your prefill worker)
prefill_endpoint = runtime.namespace("dynamo").component("prefill").endpoint("generate")
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,  # <-- Mark as prefill worker
    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",  # Must match decode model name
    # ... other parameters
)
await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```
<Note>
The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/disagg_router.sh).
</Note>
### Request Flow
The following diagram shows an overview of the major components in disaggregated serving:
```mermaid
graph TD
HTTP[HTTP]
ROUTER[Router]
PREFILL[Prefill Worker]
DECODE[Decode Worker]
classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333;
classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;
class PREFILL,DECODE worker_style
class ROUTER router_style
HTTP <--> |"request/response"| ROUTER
ROUTER --> |"1. send to prefill"| PREFILL
PREFILL --> |"2. return NIXL metadata"| ROUTER
ROUTER --> |"3. send with metadata"| DECODE
DECODE --> |"4. stream response"| ROUTER
PREFILL -.-> |"publish kv events"| ROUTER
linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px
linkStyle 5 stroke:#2196f3,stroke-width:2px
```
## Overview
The KV-aware router operates on two key principles to optimize request routing:
### Global KV Cache State Synchronization
KV events from engines are collected by the router to maintain a global view of cached blocks across all workers. The router supports two event transport modes:
#### Mode 1: JetStream (Default)
KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
- **Best for**: Production deployments requiring durability and multi-replica router consistency
- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees
```mermaid
graph TD
subgraph Engines
E1[Engine 1<br/>KVPublisher]
E2[Engine 2<br/>KVPublisher]
E3[Engine 3<br/>KVPublisher]
end
subgraph "NATS JetStream"
JS[(Persistent KV Events Stream<br/>- Block created<br/>- Block removed)]
end
subgraph "NATS Object Store"
OS[(Radix Tree<br/>State Snapshot)]
end
subgraph "Router Replicas"
R1[Router 1<br/>KVIndexer]
R2[Router 2<br/>KVIndexer]
end
E1 -->|Publish Events| JS
E2 -->|Publish Events| JS
E3 -->|Publish Events| JS
JS -->|Consume as Durable Consumer| R1
JS -->|Consume as Durable Consumer| R2
JS -->|Periodic Snapshot| OS
style JS fill:#e1f5fe,stroke:#333,color:#333
style OS fill:#e1f5fe,stroke:#333,color:#333
style E1 fill:#f3e5f5,stroke:#333,color:#333
style E2 fill:#f3e5f5,stroke:#333,color:#333
style E3 fill:#f3e5f5,stroke:#333,color:#333
style R1 fill:#2e8b57,stroke:#333,color:#fff
style R2 fill:#2e8b57,stroke:#333,color:#fff
linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
```
#### Mode 2: NATS Core with Local Indexer
When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly.
- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios
- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker)
```mermaid
graph TD
subgraph Engines
E1[Engine 1<br/>LocalKvIndexer]
E2[Engine 2<br/>LocalKvIndexer]
E3[Engine 3<br/>LocalKvIndexer]
end
subgraph "NATS Core"
NC[KV Events Pub/Sub<br/>- Block created<br/>- Block removed]
end
subgraph "Router Replicas"
R1[Router 1<br/>KVIndexer]
R2[Router 2<br/>KVIndexer]
end
E1 -->|Publish Events| NC
E2 -->|Publish Events| NC
E3 -->|Publish Events| NC
NC -->|Subscribe| R1
NC -->|Subscribe| R2
style NC fill:#e1f5fe,stroke:#333,color:#333
style E1 fill:#f3e5f5,stroke:#333,color:#333
style E2 fill:#f3e5f5,stroke:#333,color:#333
style E3 fill:#f3e5f5,stroke:#333,color:#333
style R1 fill:#2e8b57,stroke:#333,color:#fff
style R2 fill:#2e8b57,stroke:#333,color:#fff
linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
```
**How gap detection works:**
1. Each worker assigns monotonically increasing event IDs starting from 0
2. The router tracks the last received event ID per worker
3. If an event arrives with `event_id > last_id + 1`, the router detects a gap
4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]`
5. On worker discovery (Added event), the router dumps the worker's entire local indexer state
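The gap check in steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the router's actual code: `GapTracker` and `on_event` are hypothetical names, and treating a never-seen worker as `last_id = -1` is an assumption made so that the first event (ID 0) reports no gap.

```python
from dataclasses import dataclass, field


@dataclass
class GapTracker:
    """Tracks the last event ID seen per worker (hypothetical helper)."""

    last_id: dict[int, int] = field(default_factory=dict)

    def on_event(self, worker_id: int, event_id: int) -> range:
        """Record an event and return the range of missed event IDs, if any."""
        # Event IDs start at 0, so an unseen worker behaves as if last_id == -1.
        last = self.last_id.get(worker_id, -1)
        if event_id <= last:
            return range(0)  # duplicate or stale event: nothing to recover
        missing = range(last + 1, event_id)  # covers [last_id+1, event_id-1]
        self.last_id[worker_id] = event_id
        return missing


tracker = GapTracker()
tracker.on_event(worker_id=7, event_id=0)
tracker.on_event(worker_id=7, event_id=1)
missed = list(tracker.on_event(worker_id=7, event_id=5))  # IDs to request from the worker
```

In this sketch, a non-empty result is the range the router would then request from the worker's local indexer.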
**Startup behavior:**
- When a worker is discovered, the router queries and ingests its full local indexer state
- When a worker is removed, the router removes all its blocks from the global radix tree
>[!Note]
> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode.
### Local Active Block Management with Replica Sync
Second, in addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
- The router receives and routes a request
- The first token is generated (prefill complete)
- The response ends (request freed)
This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
```mermaid
sequenceDiagram
participant C1 as Client 1
participant R1 as Router 1<br/>(Slot Manager)
participant R2 as Router 2<br/>(Slot Manager)
participant C2 as Client 2
Note over R1,R2: Router Replica Sync Enabled
C1->>R1: Request A
activate R1
R1->>R1: Predict blocks & route to worker
R1-->>R2: Sync: AddRequest(A)
C2->>R2: Request B
activate R2
R2->>R2: Predict blocks & route to worker
R2-->>R1: Sync: AddRequest(B)
R1->>R1: First token received<br/>(prefill complete)
R1-->>R2: Sync: MarkPrefillCompleted(A)
R1->>C1: Stream response
R2->>R2: First token received<br/>(prefill complete)
R2-->>R1: Sync: MarkPrefillCompleted(B)
R2->>C2: Stream response
R1->>R1: Response complete<br/>(free blocks)
R1-->>R2: Sync: Free(A)
deactivate R1
R2->>R2: Response complete<br/>(free blocks)
R2-->>R1: Sync: Free(B)
deactivate R2
Note over R1,R2: Both routers have consistent<br/>view of active blocks
```
This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, create a client tied to a component's endpoint using the labels defined above. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.
```python
client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
```
We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
KV Cache routing uses direct routing with a special worker selection algorithm.
## Serving Multiple Router Replicas
For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use a different HTTP port for each instance. (Separating the frontend from the router is a work in progress.)
### Router State Management
The KV Router tracks two types of state (see [KV Router Architecture](README.md) for details):
1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.
2. **Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is **ephemeral** - when a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests.
### Enabling Router Replica Synchronization
```bash
# Router replica 1
python -m dynamo.frontend --router-mode kv --port 8000 --router-replica-sync
# Router replica 2 (can be started later)
python -m dynamo.frontend --router-mode kv --port 8001 --router-replica-sync
```
The `--router-replica-sync` flag enables active block synchronization between replicas:
- Active blocks are shared via NATS Core messaging (fire-and-forget)
- Replicas exchange routing decisions to maintain consistent load estimates
- A new replica starts with zero active blocks but quickly converges, both by handling its own requests and by syncing with other replicas
Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing.
### Persistence and Recovery
Persistence behavior depends on which event transport mode is active:
**JetStream Mode (default):**
- Prefix blocks are stored in NATS JetStream with 1-hour retention
- Snapshots saved to NATS object store at configurable thresholds
- New replicas automatically restore this state on startup
- You can launch a third Router replica even if the first two are down, and it will recover the full prefix state
```bash
python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync
```
**NATS Core with Local Indexer Mode:**
- State persists on workers—events are fire-and-forget but workers retain their local indexer state
- On startup, the router queries each worker's local indexer to rebuild state
- Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered
- Simpler infrastructure (no JetStream required) but less resilient
>[!Note]
> If you need to start with a fresh state in JetStream mode, you have two options:
> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
## Understanding KV Cache
The leading Large Language Models (LLMs) today are auto-regressive and based on the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization technique is to cache the already computed keys and values and reuse them for future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).
### KV Cache Optimizations
Every inference framework maintains a KV Cache for each worker. A popular inference framework is [vLLM](https://github.com/vllm-project/vllm), whose key contribution was [PagedAttention](https://arxiv.org/abs/2309.06180), which manages the KV Cache efficiently by chunking requests into blocks.
Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104), which introduced a prefix tree that allows efficient matching, insertion, and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally.
To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
1. Request tokenization: The incoming prompt is converted into tokens
2. Block partitioning: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
3. Block hashing: Each block of tokens is hashed to create a unique identifier
4. Cache lookup:
   - For each block, the system checks if a matching block already exists in the KV cache
   - If a match is found, the existing KV cache block is reused
   - If no match is found, the system proceeds to the next step
5. Resource allocation:
   - For blocks without matches, the system attempts to allocate new memory space
   - If sufficient memory is available, allocate memory space and proceed to step 7
   - If memory is constrained, proceed to step 6
6. Cache eviction (when necessary):
   - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
   - Selected blocks are evicted from the cache
   - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
   - Alternatively, some systems may offload less-frequently used blocks to CPU memory.
7. KV computation:
   - For new blocks, the model computes key and value tensors
   - These tensors are stored in the newly allocated cache blocks
   - **KVPublisher emits a KV stored event notifying KVIndexer about the newly stored blocks.**
Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
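The flow above can be imitated with a toy model. The sketch below is illustrative only and makes several assumptions that are not taken from any engine: each block hash is chained to its parent block's hash, trailing partial blocks are ignored, eviction is plain LRU, and the `events` list stands in for what a KVPublisher would emit. `block_hashes` and `ToyBlockCache` are hypothetical names.

```python
import hashlib
from collections import OrderedDict


def block_hashes(token_ids, block_size=16):
    """Split tokens into full blocks and return one chained hash per block."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Chaining with the parent digest makes a hash identify a whole prefix.
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest.hex()[:16])
        parent = digest
    return hashes


class ToyBlockCache:
    """LRU cache of block hashes that records stored/removed events."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.events = []  # stand-in for events a KVPublisher would emit

    def admit(self, hashes):
        """Look up each block, allocate or evict as needed; return blocks reused."""
        reused = 0
        for h in hashes:
            if h in self.blocks:  # step 4: cache hit, reuse the block
                self.blocks.move_to_end(h)
                reused += 1
                continue
            if len(self.blocks) >= self.capacity:  # step 6: evict the LRU block
                evicted, _ = self.blocks.popitem(last=False)
                self.events.append(("removed", evicted))
            self.blocks[h] = True  # steps 5 and 7: allocate and store
            self.events.append(("stored", h))
        return reused


cache = ToyBlockCache(capacity=2)
first = block_hashes(list(range(8)), block_size=4)
cache.admit(first)           # cold cache: both blocks are stored
reused = cache.admit(first)  # same prompt again: both blocks are reused
```

A second prompt that shares only its first block with the earlier one reuses that block, and under memory pressure triggers one removed event followed by one stored event.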
## KV Cache Routing and Load Balancing
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
### Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower costs indicate better routing choices
- `overlap_score_weight` balances cache hit optimization against load distribution
- Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
### Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
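The calculation above can be reproduced with a short script. This is a sketch rather than the router's implementation: `worker_costs` and `select_worker` are hypothetical helpers, and the min-max normalization applied before the softmax is an assumption, since the exact normalization is not specified here.

```python
import math
import random


def worker_costs(loads, overlap_score_weight=1.0):
    """Apply cost = overlap_score_weight * prefill_blocks + decode_blocks."""
    return {
        worker: overlap_score_weight * prefill + decode
        for worker, (prefill, decode) in loads.items()
    }


def select_worker(costs, temperature=0.0, rng=random):
    """Pick the lowest-cost worker, or sample via softmax when temperature > 0."""
    if temperature == 0.0:
        return min(costs, key=costs.get)
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all costs are equal
    # Assumed normalization: scale costs to [0, 1], negate so low cost = high logit.
    weights = [math.exp(-((c - lo) / span) / temperature) for c in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


# (prefill_blocks, decode_blocks) per worker, taken from the diagram above
loads = {"worker-1": (8, 10), "worker-2": (5, 5), "worker-3": (2, 9)}
costs = worker_costs(loads, overlap_score_weight=1.0)
chosen = select_worker(costs)  # deterministic: lowest cost wins
```

Raising `overlap_score_weight` to 2.0 in this sketch changes the costs to 26, 15, and 13, so Worker 3 (the one with the most cached blocks) would be selected instead.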
## Events
### KVPublisher
The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
The two types of events are:
- KV stored event
- KV removed event
The publisher can be initialized and used through C bindings or Python bindings.
### Deterministic Event IDs
Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KVIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions and configuration. If your engine relies on Python's built-in `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect.
### KVIndexer
The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
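To make the data structure concrete, here is a toy prefix tree that stores worker IDs on each node and counts matched blocks per worker. It is a simplified sketch, not the real KVIndexer: `ToyKvIndexer` and `store` are hypothetical, blocks are keyed by their raw token tuples instead of hashes, and removal is omitted.

```python
class ToyKvIndexer:
    """Prefix tree over token blocks; each node records which workers hold it."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.root = {"workers": set(), "children": {}}

    def _blocks(self, tokens):
        """Split tokens into full blocks, ignoring a partial trailing block."""
        n = self.block_size
        return [tuple(tokens[i:i + n]) for i in range(0, len(tokens) - n + 1, n)]

    def store(self, worker_id, tokens):
        """Record that a worker has cached every full block of this sequence."""
        node = self.root
        for block in self._blocks(tokens):
            node = node["children"].setdefault(
                block, {"workers": set(), "children": {}}
            )
            node["workers"].add(worker_id)

    def find_matches_for_request(self, tokens):
        """Return {worker_id: number of matched prefix blocks}."""
        matches, node = {}, self.root
        for block in self._blocks(tokens):
            node = node["children"].get(block)
            if node is None:
                break  # no worker holds this prefix any further
            for worker_id in node["workers"]:
                matches[worker_id] = matches.get(worker_id, 0) + 1
        return matches


indexer = ToyKvIndexer(block_size=2)
indexer.store(worker_id=1, tokens=[1, 2, 3, 4, 5, 6])
indexer.store(worker_id=2, tokens=[1, 2, 3, 4])
matches = indexer.find_matches_for_request([1, 2, 3, 4, 5, 6, 7, 8])
```

Here worker 1 matches three blocks and worker 2 matches two; the per-worker overlap counts are what feed the prefill side of the cost function.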
### Inter-Router Communication
In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
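A rough model of how a replica might apply these three events is shown below. The field names, the `ToySlotManager` class, and the choice to zero out prefill blocks on `MarkPrefillCompleted` are assumptions made for illustration; only the event names and the self-event filter come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class SyncEvent:
    router_id: str   # originating router, used to skip self-events
    kind: str        # "AddRequest", "MarkPrefillCompleted", or "Free"
    request_id: str
    worker_id: int = 0
    prefill_blocks: int = 0
    decode_blocks: int = 0


@dataclass
class ToySlotManager:
    router_id: str
    # request_id -> [worker_id, prefill_blocks, decode_blocks]
    requests: dict = field(default_factory=dict)

    def apply_remote(self, event: SyncEvent) -> bool:
        """Apply an event from another replica; ignore events we published."""
        if event.router_id == self.router_id:
            return False
        self._update(event)
        return True

    def _update(self, event: SyncEvent) -> None:
        if event.kind == "AddRequest":
            self.requests[event.request_id] = [
                event.worker_id, event.prefill_blocks, event.decode_blocks
            ]
        elif event.kind == "MarkPrefillCompleted":
            entry = self.requests.get(event.request_id)
            if entry is not None:
                entry[1] = 0  # prefill work no longer counts toward load
        elif event.kind == "Free":
            self.requests.pop(event.request_id, None)

    def load(self, worker_id: int) -> tuple:
        """Return (prefill_blocks, decode_blocks) tracked for one worker."""
        rows = [r for r in self.requests.values() if r[0] == worker_id]
        return sum(r[1] for r in rows), sum(r[2] for r in rows)


replica = ToySlotManager(router_id="router-2")
replica.apply_remote(
    SyncEvent("router-1", "AddRequest", "A", worker_id=3,
              prefill_blocks=4, decode_blocks=6)
)
load_after_add = replica.load(3)
```

Replaying the remaining lifecycle events for request `A` brings the tracked load for worker 3 back to zero, which is the consistency property the synchronization is meant to provide.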
## Using KvPushRouter Python API
Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
>[!Warning]
> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
### Methods
The `KvPushRouter` provides the following methods:
- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
- Without `request_id`: Query-only, doesn't update router state
- With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
- **`best_worker_id(token_ids, router_config_override=None, request_id=None)`**: **[DEPRECATED - use `best_worker()` instead]** Query which worker would be selected for given tokens. Returns `(worker_id, overlap_blocks)`.
- Without `request_id`: Query-only, doesn't update router state
- With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker_id()` for manual routing instead of `generate()`.
- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker_id()` for manual routing instead of `generate()`.
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
### Setup
First, launch your backend engines:
```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```
### Example Script
```python
import asyncio

from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig


async def main():
    # Get runtime and create endpoint
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")

    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,  # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,  # Add routing randomness
        }
    )

    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            generated_tokens.extend(response["token_ids"])

    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")


if __name__ == "__main__":
    asyncio.run(main())
```
### Routing Patterns
The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
#### 1. Automatic Routing (Recommended)
Call `generate()` directly and let the router handle everything:
```python
stream = await router.generate(token_ids=tokens, model="model-name")
```
- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
#### 2. Manual State Management (Advanced)
Use `best_worker_id(request_id=...)` to select and track, then manage the request yourself:
```python
worker_id, overlap = await router.best_worker_id(tokens, request_id="req-123")
response = await client.generate(tokens, request_id="req-123")
# await anext(response) # Get first token
await router.mark_prefill_complete("req-123") # After first token
# async for _ in response: # Continue generating
# ...
await router.free("req-123") # After completion
```
- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
#### 3. Hierarchical Router Probing
Query without state updates, then route through a chosen router:
```python
# Probe multiple routers without updating state
worker_id_1, overlap_1 = await router_1.best_worker_id(tokens) # No request_id
worker_id_2, overlap_2 = await router_2.best_worker_id(tokens)
# Pick the router (and the worker it proposed) with the larger overlap
if overlap_1 > overlap_2:
    chosen_router, worker_id = router_1, worker_id_1
else:
    chosen_router, worker_id = router_2, worker_id_2
stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
```
- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one
#### 4. Custom Load-Based Routing
Use `get_potential_loads()` to implement custom routing logic:
```python
loads = await router.get_potential_loads(tokens)
# Apply custom logic (e.g., weighted scoring, constraints)
best_worker = min(loads, key=lambda x: custom_cost_fn(x))
stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
```
- **Best for**: Custom optimization strategies beyond the built-in cost function
- **Advantage**: Full control over worker selection logic
- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
### Custom Routing Example: Minimizing TTFT
Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
```python
import asyncio

from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig


async def minimize_ttft_routing():
    # Setup router
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=KvRouterConfig()
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Get potential loads for all workers
    potential_loads = await router.get_potential_loads(token_ids)

    # Find worker with minimum prefill tokens (best for TTFT)
    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
    print(f"Worker loads: {potential_loads}")
    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")

    # Route directly to the selected worker
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
        stop_conditions={"max_tokens": 20}
    )

    # Process response
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            print(f"Generated tokens: {response['token_ids']}")


if __name__ == "__main__":
    asyncio.run(minimize_ttft_routing())
```
This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. For example:
- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
- **Maximize cache reuse**: Use `best_worker_id()` which considers both prefill and decode loads
- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
See [KV Router Architecture](README.md) for performance tuning details.
## Dynamic Threshold Configuration
The busy thresholds can be updated at runtime without restarting the frontend. The frontend exposes HTTP endpoints at `/busy_threshold`:
**Get or set a model's thresholds (POST):**
```bash
# Set both thresholds for a model
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
# Set only active decode blocks threshold
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": <current_value>}
# Get current thresholds (omit threshold fields)
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf"}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
# Or if not configured: {"model": "...", "active_decode_blocks_threshold": null, "active_prefill_tokens_threshold": null}
```
**List all configured thresholds (GET):**
```bash
curl http://localhost:8000/busy_threshold
# Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
```
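The same calls can be issued from Python using only the standard library. The sketch below builds the POST request shown in the `curl` examples; `busy_threshold_request` is a hypothetical helper, and the host and port are the ones assumed above. Threshold fields left as `None` are omitted from the payload, mirroring the partial-update and read-only requests.

```python
import json
import urllib.request


def busy_threshold_request(base_url, model, decode_blocks=None, prefill_tokens=None):
    """Build a POST request for /busy_threshold; omitted fields stay unchanged."""
    payload = {"model": model}
    if decode_blocks is not None:
        payload["active_decode_blocks_threshold"] = decode_blocks
    if prefill_tokens is not None:
        payload["active_prefill_tokens_threshold"] = prefill_tokens
    return urllib.request.Request(
        f"{base_url}/busy_threshold",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = busy_threshold_request(
    "http://localhost:8000", "meta-llama/Llama-2-7b-hf", decode_blocks=0.85
)

# To send it against a running frontend:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Passing only `model` produces the read-only form of the request, which returns the current thresholds without changing them.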
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Navigation structure for Latest version
# Matching https://docs.nvidia.com/dynamo/latest/
navigation:
# ==================== Getting Started ====================
- section: Getting Started
contents:
- page: Quickstart
path: ../pages/getting-started/quickstart.md
- page: Installation
path: ../pages/getting-started/installation.md
- page: Support Matrix
path: ../pages/reference/support-matrix.md
- page: Examples
path: ../pages/getting-started/examples.md
# ==================== Kubernetes Deployment ====================
- section: Kubernetes Deployment
contents:
- section: Deployment Guide
contents:
- page: Kubernetes Quickstart
path: ../pages/kubernetes/README.md
- page: Detailed Installation Guide
path: ../pages/kubernetes/installation-guide.md
- page: Dynamo Operator
path: ../pages/kubernetes/dynamo-operator.md
- page: Minikube Setup
path: ../pages/kubernetes/deployment/minikube-setup.md
- page: Managing Models with DynamoModel
path: ../pages/kubernetes/deployment/dynamomodel-guide.md
- section: Observability (K8s)
contents:
- page: Metrics
path: ../pages/kubernetes/observability/metrics.md
- page: Logging
path: ../pages/kubernetes/observability/logging.md
- section: Multinode
contents:
- page: Multinode Deployments
path: ../pages/kubernetes/deployment/multinode-deployment.md
- page: Grove
path: ../pages/kubernetes/grove.md
# ==================== User Guides ====================
- section: User Guides
contents:
- page: Tool Calling
path: ../pages/agents/tool-calling.md
- page: Multimodality Support
path: ../pages/multimodal/index.md
- page: Finding Best Initial Configs
path: ../pages/performance/aiconfigurator.md
- page: Dynamo Benchmarking Guide
path: ../pages/benchmarks/benchmarking.md
- page: Tuning Disaggregated Performance
path: ../pages/performance/tuning.md
- page: Writing Python Workers in Dynamo
path: ../pages/development/runtime-guide.md
- section: Observability (Local)
contents:
- page: Overview
path: ../pages/observability/README.md
- page: Prometheus + Grafana Setup
path: ../pages/observability/prometheus-grafana.md
- page: Metrics
path: ../pages/observability/metrics.md
- page: Metrics Developer Guide
path: ../pages/observability/metrics-developer-guide.md
- page: Health Checks
path: ../pages/observability/health-checks.md
- page: Tracing
path: ../pages/observability/tracing.md
- page: Logging
path: ../pages/observability/logging.md
- section: Fault Tolerance
contents:
- page: Overview
path: ../pages/fault-tolerance/README.md
- page: Request Migration
path: ../pages/fault-tolerance/request-migration.md
- page: Request Cancellation
path: ../pages/fault-tolerance/request-cancellation.md
- page: Graceful Shutdown
path: ../pages/fault-tolerance/graceful-shutdown.md
- page: Request Rejection
path: ../pages/fault-tolerance/request-rejection.md
- page: Testing
path: ../pages/fault-tolerance/testing.md
- page: Glossary
path: ../pages/reference/glossary.md
# ==================== Components ====================
- section: Components
contents:
- section: Backends
contents:
- page: vLLM
path: ../pages/backends/vllm/README.md
- page: SGLang
path: ../pages/backends/sglang/README.md
- page: TensorRT-LLM
path: ../pages/backends/trtllm/README.md
- page: Router
path: ../pages/router/README.md
- section: Planner
contents:
- page: Overview
path: ../pages/planner/planner-intro.md
- page: SLA Planner Quick Start
path: ../pages/planner/sla-planner-quickstart.md
- page: SLA-Driven Profiling
path: ../pages/benchmarks/sla-driven-profiling.md
- page: SLA-based Planner
path: ../pages/planner/sla-planner.md
- section: KVBM
contents:
- page: Overview
path: ../pages/kvbm/kvbm-intro.md
- page: Motivation
path: ../pages/kvbm/kvbm-motivation.md
- page: Architecture
path: ../pages/kvbm/kvbm-architecture.md
- page: Components
path: ../pages/kvbm/kvbm-components.md
- page: Design Deep Dive
path: ../pages/kvbm/kvbm-design-deepdive.md
- page: Integrations
path: ../pages/kvbm/kvbm-integrations.md
- page: KVBM in vLLM
path: ../pages/kvbm/vllm-setup.md
- page: KVBM in TRTLLM
path: ../pages/kvbm/trtllm-setup.md
- page: LMCache Integration
path: ../pages/backends/vllm/LMCache-Integration.md
- page: Further Reading
path: ../pages/kvbm/kvbm-reading.md
# ==================== Design Docs ====================
- section: Design Docs
contents:
- page: Overall Architecture
path: ../pages/design-docs/architecture.md
- page: Architecture Flow
path: ../pages/design-docs/dynamo-flow.md
- page: Disaggregated Serving
path: ../pages/design-docs/disagg-serving.md
- page: Distributed Runtime
path: ../pages/design-docs/distributed-runtime.md
- page: Event Plane
path: ../pages/design-docs/event-plane.md
# ==================== Additional Resources ====================
# Hidden section - these pages are accessible via direct URL but not shown in navigation
- section: Additional Resources
hidden: true
contents:
- section: Advanced Kubernetes
contents:
- page: Create Deployment
path: ../pages/kubernetes/deployment/create-deployment.md
- page: Autoscaling
path: ../pages/kubernetes/autoscaling.md
- page: Service Discovery
path: ../pages/kubernetes/service-discovery.md
- page: Model Caching with Fluid
path: ../pages/kubernetes/model-caching-with-fluid.md
- page: FluxCD
path: ../pages/kubernetes/fluxcd.md
- page: Webhooks
path: ../pages/kubernetes/webhooks.md
- page: API Reference
path: ../pages/kubernetes/api-reference.md
- section: Multimodal Details
contents:
- page: vLLM
path: ../pages/multimodal/vllm.md
- page: SGLang
path: ../pages/multimodal/sglang.md
- page: TensorRT-LLM
path: ../pages/multimodal/trtllm.md
- section: Router Details
contents:
- page: KV Cache Routing
path: ../pages/router/kv-cache-routing.md
- section: Benchmarks
contents:
- page: KV Router A/B Testing
path: ../pages/benchmarks/kv-router-ab-testing.md
- section: Frontends
contents:
- page: KServe
path: ../pages/frontends/kserve.md
- section: Development
contents:
- page: Backend Guide
path: ../pages/development/backend-guide.md
- section: Guides
contents:
- page: Request Plane
path: ../pages/guides/request-plane.md
- page: Jail Stream
path: ../pages/guides/jail-stream-readme.md
- page: Load Planner
path: ../pages/planner/load-planner.md
- page: CLI Reference
path: ../pages/reference/cli.md
- section: API Reference
contents:
- section: NIXL Connect
contents:
- page: Overview
path: ../pages/api/nixl-connect/README.md
- page: Connector
path: ../pages/api/nixl-connect/connector.md
- page: Device
path: ../pages/api/nixl-connect/device.md
- page: Device Kind
path: ../pages/api/nixl-connect/device-kind.md
- page: Descriptor
path: ../pages/api/nixl-connect/descriptor.md
- page: Read Operation
path: ../pages/api/nixl-connect/read-operation.md
- page: Write Operation
path: ../pages/api/nixl-connect/write-operation.md
- page: Readable Operation
path: ../pages/api/nixl-connect/readable-operation.md
- page: Writable Operation
path: ../pages/api/nixl-connect/writable-operation.md
- page: Operation Status
path: ../pages/api/nixl-connect/operation-status.md
- page: RDMA Metadata
path: ../pages/api/nixl-connect/rdma-metadata.md
- section: Backend Details
contents:
- section: vLLM
contents:
- page: DeepSeek-R1
path: ../pages/backends/vllm/deepseek-r1.md
- page: GPT-OSS
path: ../pages/backends/vllm/gpt-oss.md
- page: Multi-Node
path: ../pages/backends/vllm/multi-node.md
- page: Speculative Decoding
path: ../pages/backends/vllm/speculative-decoding.md
- page: Prompt Embeddings
path: ../pages/backends/vllm/prompt-embeddings.md
- page: Prometheus
path: ../pages/backends/vllm/prometheus.md
- section: SGLang
contents:
- page: GPT-OSS
path: ../pages/backends/sglang/gpt-oss.md
- page: Disaggregation
path: ../pages/backends/sglang/sglang-disaggregation.md
- page: Expert Distribution (EPLB)
path: ../pages/backends/sglang/expert-distribution-eplb.md
- page: HiCache Example
path: ../pages/backends/sglang/sgl-hicache-example.md
- page: Profiling
path: ../pages/backends/sglang/profiling.md
- page: Prometheus
path: ../pages/backends/sglang/prometheus.md
- section: TensorRT-LLM
contents:
- page: GPT-OSS
path: ../pages/backends/trtllm/gpt-oss.md
- page: KV Cache Transfer
path: ../pages/backends/trtllm/kv-cache-transfer.md
- page: Gemma3 Sliding Window
path: ../pages/backends/trtllm/gemma3-sliding-window-attention.md
- page: Llama4 + Eagle
path: ../pages/backends/trtllm/llama4-plus-eagle.md
- page: Multinode Examples
path: ../pages/backends/trtllm/multinode/multinode-examples.md
- page: Prometheus
path: ../pages/backends/trtllm/prometheus.md