Commit 39d645e5, authored by Jonathan Tong and committed by GitHub

docs: migrate Fern docs from fern/ into docs/ (#6206)


Signed-off-by: Jont828 <jt572@cornell.edu>
parent d381e6ff
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM Guide
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM is modular: install it standalone with `pip install kvbm`, or use it as the memory management component of the full Dynamo stack. This guide covers installation, configuration, and deployment of KVBM and of other KV cache management systems such as SGLang HiCache.
## Table of Contents
- [Quick Start](#quick-start)
- [Run KVBM Standalone](#run-kvbm-standalone)
- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
- [Configuration](#configuration)
- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
- [Benchmarking KVBM](#benchmarking-kvbm)
- [Troubleshooting](#troubleshooting)
- [Developing Locally](#developing-locally)
## Quick Start
## Run KVBM Standalone
KVBM can be used on its own, without the rest of the Dynamo stack:
```bash
pip install kvbm
```
See the [support matrix](../../reference/support-matrix.md) for version compatibility.
### Build from Source
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../../lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo vLLM container (KVBM is built in by default)
./container/build.sh --framework vllm
# Launch the container
./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false,
    "max_tokens": 10
  }'
```
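
The same check can be scripted. The following is a minimal Python sketch using only the standard library; the function names `build_chat_request` and `send_chat_request` are hypothetical, and the response parsing assumes the frontend returns an OpenAI-style chat completion body (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_tokens=10):
    """Return the same JSON body that the curl command sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the frontend and return the first completion's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]
```

With the deployment running, `send_chat_request(build_chat_request("Qwen/Qwen3-0.6B", "Hello, how are you?"))` should return a short completion.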
#### Alternative: Using Direct vllm serve
You can also use `vllm serve` directly with KVBM:
```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```
## Run KVBM in Dynamo with TensorRT-LLM
> [!NOTE]
> **Prerequisites:**
> - Ensure `etcd` and `nats` are running before starting
> - KVBM only supports TensorRT-LLM's PyTorch backend
> - Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo TRTLLM container (KVBM is built in by default)
./container/build.sh --framework trtllm
# Launch the container
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
# Write the LLM API config
cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
kv_connector_config:
  connector_module: kvbm.trtllm_integration.connector
  connector_scheduler_class: DynamoKVBMConnectorLeader
  connector_worker_class: DynamoKVBMConnectorWorker
EOF
# Start dynamo frontend
python3 -m dynamo.frontend --http-port 8000 &
# Serve the model with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false,
    "max_tokens": 30
  }'
```
#### Alternative: Using trtllm-serve
```bash
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
## Run Dynamo with SGLang HiCache
SGLang's Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo's memory infrastructure.
### Quick Start
```bash
# Start SGLang worker with HiCache enabled
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl
# In a separate terminal, start the frontend
python -m dynamo.frontend --http-port 8000
# Send a test request
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false,
    "max_tokens": 30
  }'
```
> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting.
## Disaggregated Serving with KVBM
KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.
### Disaggregated Serving with vLLM
```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: requires at least 2 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh
# 2P2D - two prefill workers and two decode workers
# NOTE: requires at least 4 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```
### Disaggregated Serving with TRT-LLM
> [!NOTE]
> The latest TensorRT-LLM release (1.3.0rc1) is currently experiencing a request hang when running disaggregated serving with KVBM.
> Please include the TensorRT-LLM commit id `18e611da773026a55d187870ebcfa95ff00c8482` when building the Dynamo TensorRT-LLM runtime image to test the KVBM + disaggregated serving feature.
```bash
# Build the Dynamo TensorRT-LLM container using commit ID 18e611da773026a55d187870ebcfa95ff00c8482. Note: This build can take a long time.
./container/build.sh --framework trtllm --tensorrtllm-commit 18e611da773026a55d187870ebcfa95ff00c8482 --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git
# Launch the container
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
```
> [!NOTE]
> Important: After logging into the Dynamo TensorRT-LLM runtime container, copy the Triton kernels into the container’s virtual environment as a separate Python module.
```bash
# Clone the TensorRT-LLM repo and copy the triton_kernels folder into the container as a Python module.
git clone https://github.com/NVIDIA/TensorRT-LLM.git /tmp/TensorRT-LLM && \
cd /tmp/TensorRT-LLM && \
git checkout 18e611da773026a55d187870ebcfa95ff00c8482 && \
cp -r triton_kernels /opt/dynamo/venv/lib/python3.12/site-packages/ && \
cd /workspace && \
rm -rf /tmp/TensorRT-LLM
```
```bash
# Launch prefill worker with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml \
--disaggregation-mode prefill &
```
## Configuration
### Cache Tier Configuration
Configure KVBM cache tiers using environment variables:
```bash
# Option 1: CPU cache only (GPU -> CPU offloading)
export DYN_KVBM_CPU_CACHE_GB=4 # 4GB of pinned CPU memory
# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4
export DYN_KVBM_DISK_CACHE_GB=8 # 8GB of disk
# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
# NOTE: Experimental, may not provide optimal performance
# NOTE: Disk offload filtering not supported with this option
export DYN_KVBM_DISK_CACHE_GB=8
```
You can also specify exact block counts instead of GB:
- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`
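
To make the three options concrete, here is a small Python sketch that reports which offload path a given environment selects. It is illustrative only: `describe_offload_path` is a hypothetical helper, it ignores the `*_OVERRIDE_NUM_BLOCKS` variables, and treating an unset or non-positive value as "tier disabled" is an assumption, not documented KVBM behavior.

```python
import os


def _cache_gb(env, name):
    """Parse a cache size in GB; unset, empty, or non-positive means disabled (assumption)."""
    raw = env.get(name, "").strip()
    return float(raw) if raw and float(raw) > 0 else 0.0


def describe_offload_path(env=None):
    """Map the two cache-size variables to the offload path named in the options above."""
    env = os.environ if env is None else env
    cpu_gb = _cache_gb(env, "DYN_KVBM_CPU_CACHE_GB")
    disk_gb = _cache_gb(env, "DYN_KVBM_DISK_CACHE_GB")
    if cpu_gb and disk_gb:
        return "GPU -> CPU -> Disk"          # Option 2
    if cpu_gb:
        return "GPU -> CPU"                  # Option 1
    if disk_gb:
        return "GPU -> Disk (experimental)"  # Option 3
    return "no offloading configured"
```

Calling `describe_offload_path()` with no arguments inspects the current process environment, which can be a quick sanity check before launching a worker.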
### SSD Lifespan Protection
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step.
To disable disk offload filtering:
```bash
export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
```
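
The frequency policy described in this section can be sketched in a few lines of Python. This is a toy model rather than KVBM's implementation: the class and method names are hypothetical, and flooring the frequency at 1 during decay is an assumption (the guide only states the initial value, the doubling, and the decrement).

```python
from dataclasses import dataclass

OFFLOAD_THRESHOLD = 2  # From the guide: offload CPU -> disk only if frequency >= 2


@dataclass
class BlockFrequency:
    frequency: int = 1  # Initialized at 1

    def on_cache_hit(self):
        """Frequency doubles on each cache hit."""
        self.frequency *= 2

    def on_decay_step(self):
        """Frequency drops by 1 per decay step (floor of 1 is an assumption)."""
        self.frequency = max(1, self.frequency - 1)

    def eligible_for_disk_offload(self):
        return self.frequency >= OFFLOAD_THRESHOLD


block = BlockFrequency()
print(block.eligible_for_disk_offload())  # A block that was never reused is filtered out
block.on_cache_hit()
print(block.eligible_for_disk_offload())  # One hit raises the frequency to 2
```

Under this model a block that is written once and never reused never reaches the disk tier, which is the behavior that limits SSD writes.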
## Enable and View KVBM Metrics
### Setup Monitoring Stack
```bash
# Start basic services (etcd & natsd), along with Prometheus and Grafana
docker compose -f deploy/docker-observability.yml up -d
```
### Enable Metrics for vLLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--connector kvbm
```
### Enable Metrics for TensorRT-LLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
### Firewall Configuration (Optional)
```bash
# If firewall blocks KVBM metrics ports
sudo ufw allow 6880/tcp
```
### View Metrics
Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.
### Available Metrics
| Metric | Description |
|--------|-------------|
| `kvbm_matched_tokens` | Number of matched tokens |
| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |
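
Outside Grafana, these metrics can also be inspected from raw Prometheus text-format output. The sketch below parses such text and keeps only `kvbm_*` samples. The function name is hypothetical, the sample values are made up for illustration, and how the text is fetched (for example from the metrics port opened above) is left out because the exact endpoint path is not stated in this guide.

```python
def parse_kvbm_metrics(exposition_text):
    """Collect kvbm_* samples from Prometheus text-format output.

    If a metric appears with several label sets, the last sample wins.
    """
    metrics = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labeled sample: name{labels} value [timestamp]
            head, _, tail = line.rpartition("}")
            name = head.split("{", 1)[0]
            fields = tail.split()
        else:
            # Unlabeled sample: name value [timestamp]
            name, *fields = line.split()
        if name.startswith("kvbm_") and fields:
            metrics[name] = float(fields[0])
    return metrics


# Illustrative input only; these numbers are not real measurements.
sample = """\
# HELP kvbm_host_cache_hit_rate Host cache hit rate
# TYPE kvbm_host_cache_hit_rate gauge
kvbm_host_cache_hit_rate 0.75
kvbm_onboard_blocks_h2d{worker="0"} 128
unrelated_metric 1
"""
print(parse_kvbm_metrics(sample))
```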
## Benchmarking KVBM
Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.
### Setup
```bash
git clone https://github.com/LMCache/LMBenchmark.git
cd LMBenchmark/synthetic-multi-round-qa
```
### Run Benchmark
```bash
# Synthetic multi-turn chat dataset
# Arguments: model, endpoint, output prefix, qps
./long_input_short_output_run.sh \
"Qwen/Qwen3-0.6B" \
"http://localhost:8000" \
"benchmark_kvbm" \
1
```
The benchmark output reports average TTFT and other performance numbers.
> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.
### Baseline Comparison
#### vLLM Baseline (without KVBM)
```bash
vllm serve Qwen/Qwen3-0.6B
```
#### TensorRT-LLM Baseline (without KVBM)
```bash
# Create config without kv_connector_config
cat > "/tmp/llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
EOF
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml
```
## Troubleshooting
### No TTFT Performance Gain
**Symptom:** Enabling KVBM does not show TTFT improvement or causes performance degradation.
**Cause:** Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
![Grafana Example](../../images/kvbm_metrics_grafana.png)
### KVBM Worker Initialization Timeout
**Symptom:** KVBM fails to start when allocating large memory or disk storage.
**Solution:** Increase the leader-worker initialization timeout (default: 1800 seconds):
```bash
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600 # 1 hour
```
### Disk Offload Fails to Start
**Symptom:** KVBM fails to start when disk offloading is enabled.
**Cause:** `fallocate()` is not supported on the filesystem (e.g., Lustre, certain network filesystems).
**Solution:** Enable disk zerofill fallback:
```bash
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
If you encounter "write all error" or EINVAL (errno 22), also try:
```bash
export DYN_KVBM_DISK_DISABLE_O_DIRECT=true
```
## Developing Locally
Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):
```bash
cd /workspace/lib/bindings/kvbm
uv pip install maturin[patchelf]
maturin build --release --out /workspace/dist
uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
```
## See Also
- [KVBM Overview](README.md) for a quick overview of KV Caching, KVBM and its architecture
- [KVBM Design](../../design_docs/kvbm_design.md) for a deep dive into KVBM architecture
- [LMCache Integration](../../integrations/lmcache_integration.md)
- [FlexKV Integration](../../integrations/flexkv_integration.md)
- [SGLang HiCache](../../integrations/sglang_hicache.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner_guide.md) for a complete workflow including profiling and deployment.
## Feature Matrix
| Category | Feature | Status |
|----------|---------|--------|
| **Backend** | Local (bare metal) | Deprecated |
| | Kubernetes | Supported |
| **LLM Framework** | vLLM | Supported |
| | TensorRT-LLM | Supported |
| | SGLang | Supported |
| **Serving Type** | Aggregated | Unsupported |
| | Disaggregated | Supported |
| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
| | Load-based (KV cache/queue thresholds) | Deprecated |
| **Load Predictors** | ARIMA | Supported |
| | Prophet | Supported |
| | Kalman filter | Supported |
| | Constant (current = next) | Supported |
| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
| | VirtualConnector (external environments) | Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
- Pre-deployment profiling completed ([Profiling Guide](/docs/components/profiler/profiler_guide.md))
### Deploy with DGDR (Recommended)
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner_guide.md) for the full workflow.
### Deploy with DGD (Manual)
For manual control, use the disaggregated planner templates:
```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Documentation
| Document | Description |
|----------|-------------|
| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [SLA Planner Guide](planner_guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| [SLA-based Planner](planner_guide.md) | Scaling algorithm, correction factors, load prediction details |
| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) |
| [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) | Pre-deployment profiling process and configuration |
| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
## Configuration Reference
### Key Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
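
As an illustration of how the GPU-related arguments interact, the sketch below checks whether a proposed replica count fits the budget. It reflects one plausible reading of the table (total GPUs = replicas times GPUs per engine, summed over prefill and decode); the function names are hypothetical and the planner's actual accounting may differ.

```python
def total_gpus(num_prefill, num_decode, prefill_engine_num_gpu=1, decode_engine_num_gpu=1):
    """GPUs consumed by a given number of prefill and decode replicas."""
    return num_prefill * prefill_engine_num_gpu + num_decode * decode_engine_num_gpu


def fits_budget(num_prefill, num_decode, *, max_gpu_budget=8, min_endpoint=1,
                prefill_engine_num_gpu=1, decode_engine_num_gpu=1):
    """True if both worker types meet the minimum and the total stays within budget."""
    if num_prefill < min_endpoint or num_decode < min_endpoint:
        return False
    used = total_gpus(num_prefill, num_decode, prefill_engine_num_gpu, decode_engine_num_gpu)
    return used <= max_gpu_budget


print(fits_budget(4, 4))                            # 8 GPUs against the default budget of 8
print(fits_budget(2, 1, prefill_engine_num_gpu=4))  # 2*4 + 1 = 9 GPUs exceeds the budget
```

The defaults in the sketch mirror the defaults listed in the table.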
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
## Monitoring
### Grafana Dashboard
Deploy the planner dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
### Prometheus Metrics
The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
```{toctree}
:hidden:
planner_guide
planner_examples
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Examples
Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner_guide.md). For a quick overview, see the [Planner README](README.md).
## Basic Examples
### Minimal DGDR with AIC (Fastest)
The simplest way to deploy with the SLA planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-aic
spec:
  model: Qwen/Qwen3-32B
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
### Online Profiling (Real Measurements)
Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-online
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: false
        prefillInterpolationGranularity: 16
        decodeInterpolationGranularity: 6
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
```
Available sample DGDRs in `benchmarks/profiler/deploy/`:
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
> **Profiling Config Cases**: Before 0.8.1, fields under `profilingConfig.config` use snake_case; from 0.8.1 onward they use camelCase. snake_case is still accepted for backwards compatibility, but the example DGDRs use camelCase.
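
If you maintain older configs written in snake_case, a small script can convert the keys to camelCase. The sketch below is a generic key converter, not a Dynamo tool; the function names are hypothetical, and the snake_case spellings in the example are inferred by reversing the camelCase names used in this document.

```python
def snake_to_camel(name):
    """Convert a snake_case key to camelCase; keys without underscores are unchanged."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(config):
    """Recursively convert all dictionary keys in a nested config."""
    if isinstance(config, dict):
        return {snake_to_camel(key): camelize_keys(value) for key, value in config.items()}
    if isinstance(config, list):
        return [camelize_keys(item) for item in config]
    return config


old_style = {
    "sla": {"isl": 3000, "osl": 150},
    "sweep": {"use_ai_configurator": True, "aic_hf_id": "Qwen/Qwen3-32B"},
}
print(camelize_keys(old_style))
```

Only keys are rewritten; values such as model identifiers pass through untouched.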
## Kubernetes Examples
### MoE Models (SGLang)
For Mixture-of-Experts models like DeepSeek-R1, use SGLang backend:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-moe
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: false
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
```
### Using Existing DGD Configs (Custom Setups)
Reference an existing DynamoGraphDeployment config via ConfigMap:
**Step 1: Create ConfigMap from your DGD config:**
```bash
kubectl create configmap deepseek-r1-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
**Step 2: Reference it in your DGDR:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml  # Must match the key used in --from-file
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.
### Inline Configuration (Simple Use Cases)
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:
```yaml
profilingConfig:
  config:
    sla:
      isl: 8000
      osl: 200
      ttft: 200.0
      itl: 10.0
    hardware:
      minNumGpusPerEngine: 2
      maxNumGpusPerEngine: 8
      gpuType: h200_sxm
    sweep:
      prefillInterpolationGranularity: 16
      decodeInterpolationGranularity: 6
```
### Mocker Deployment (Testing)
Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
- Large-scale experiments without GPU resources
- Testing planner behavior and infrastructure
- Validating deployment configurations
```yaml
spec:
  model: <model-name>
  backend: trtllm   # Real backend for profiling
  useMocker: true   # Deploy mocker instead of real backend
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h100_sxm
  autoApply: true
```
Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.
### Model Cache PVC (0.8.1+)
For large models, use a pre-populated PVC instead of downloading from HuggingFace.
See [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) for configuration details.
## Advanced Examples
### Custom Load Predictors
#### Warm-starting with Trace Data
Pre-load predictors with historical request patterns before live traffic:
```yaml
# In planner arguments
args:
- --load-predictor arima
- --load-predictor-warmup-trace /data/trace.jsonl
- --load-predictor-log1p
```
The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
#### Kalman Filter Tuning
For workloads with rapid changes, tune the Kalman filter:
```yaml
args:
- --load-predictor kalman
- --kalman-q-level 2.0 # Higher = more responsive to level changes
- --kalman-q-trend 0.5 # Higher = trend changes faster
- --kalman-r 5.0 # Lower = trusts new measurements more
- --kalman-min-points 3 # Fewer points before forecasting starts
- --load-predictor-log1p # Often helps with request-rate series
```
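
To build intuition for these knobs, here is a generic local-linear-trend Kalman filter in plain Python. It is a textbook sketch, not the planner's implementation: the class name is hypothetical, and mapping `--kalman-q-level`, `--kalman-q-trend`, `--kalman-r`, and `--kalman-min-points` onto the process noise, measurement noise, and warm-up count below is an assumption based on the flag comments above.

```python
class LocalLinearTrendKalman:
    """State = [level, trend]; each step the level advances by the trend."""

    def __init__(self, q_level=2.0, q_trend=0.5, r=5.0, min_points=3):
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.min_points = min_points
        self.count = 0
        self.level = 0.0
        self.trend = 0.0
        # Large initial covariance: little confidence in the starting state.
        self.p = [[1e3, 0.0], [0.0, 1e3]]

    def update(self, measurement):
        self.count += 1
        if self.count == 1:
            self.level = float(measurement)  # Seed the level with the first sample
            return
        # Predict: x = F x, P = F P F^T + Q with F = [[1, 1], [0, 1]]
        level = self.level + self.trend
        (p00, p01), (p10, p11) = self.p
        a00 = p00 + p01 + p10 + p11 + self.q_level
        a01 = p01 + p11
        a10 = p10 + p11
        a11 = p11 + self.q_trend
        # Update with observation matrix H = [1, 0]
        innovation = measurement - level
        s = a00 + self.r                      # Innovation variance
        k0, k1 = a00 / s, a10 / s             # Kalman gain
        self.level = level + k0 * innovation
        self.trend = self.trend + k1 * innovation
        self.p = [[(1 - k0) * a00, (1 - k0) * a01],
                  [a10 - k1 * a00, a11 - k1 * a01]]

    def forecast(self):
        """One-step-ahead forecast, or None until min_points samples were seen."""
        if self.count < self.min_points:
            return None
        return self.level + self.trend


kf = LocalLinearTrendKalman()
for rate in range(10, 101, 10):  # A steadily rising request rate
    kf.update(rate)
print(kf.forecast())  # Should land close to the next ramp value
```

In this model, a larger `q_level` or `q_trend` increases the gain so the estimate follows changes more quickly, while a larger `r` lowers the gain so noisy samples move the estimate less.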
#### Prophet for Seasonal Workloads
For workloads with daily/weekly patterns:
```yaml
args:
- --load-predictor prophet
- --prophet-window-size 100 # Larger window for seasonal detection
- --load-predictor-log1p
```
### Virtual Connector
For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient

# Initialize client (distributed_runtime and namespace come from your environment)
client = VirtualConnectorClient(distributed_runtime, namespace)

# Main loop: watch for planner decisions and execute them.
# The awaits below must run inside an async function.
while True:
    # Block until the planner makes a new scaling decision
    await client.wait()

    # Read the decision
    decision = await client.get()
    print(f"Scale to: prefill={decision.num_prefill_workers}, "
          f"decode={decision.num_decode_workers}, "
          f"id={decision.decision_id}")

    # Execute scaling in your environment (user-provided functions)
    scale_prefill_workers(decision.num_prefill_workers)
    scale_decode_workers(decision.num_decode_workers)

    # Report completion
    await client.complete(decision)
```
See `components/planner/test/test_virtual_connector.py` for a full working example.
### Planner Configuration Passthrough
Pass planner-specific settings through the DGDR:
```yaml
profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2
```
### Review Before Deploy (autoApply: false)
Disable auto-deployment to inspect the generated DGD:
```yaml
spec:
  autoApply: false
```
After profiling completes:
```bash
# Extract and review generated DGD
kubectl get dgdr sla-aic -n $NAMESPACE \
-o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
# Review and modify as needed
vi my-dgd.yaml
# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
### Profiling Artifacts with PVC
Save detailed profiling artifacts (plots, logs, raw data) to a PVC:
```yaml
spec:
  profilingConfig:
    outputPVC: "dynamo-pvc"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
```
Setup:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```
Access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
## Related Documentation
- [Planner README](README.md) -- Overview and quick start
- [Planner Guide](planner_guide.md) -- Deployment, configuration, integration
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive
- [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference)
- [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Guide
Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](/docs/design_docs/planner_design.md).
## Deployment
### Prerequisites
Before deploying the planner, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner metric collection)
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
### Container Images
Each DGDR requires container images for the profiling and deployment process:
**profilingConfig.profilerImage** (Required):
The container image used for the profiling job. Must contain the profiler code and dependencies for SLA-based profiling.
**deploymentOverrides.workersImage** (Optional):
The container image used for DGD worker components (frontend, workers, planner). Used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. Public images are available from 0.6.1 onward.
```yaml
spec:
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
```
### What is a DynamoGraphDeploymentRequest (DGDR)?
A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface for deploying models with specific performance and resource constraints. It specifies:
- **What** model to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
### DGDR Workflow
The DGDR workflow automates the entire process from SLA specification to deployment:
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information
2. **Automatic Profiling**: The operator profiles your model to find optimal configurations
3. **Auto-Deploy**: The system deploys the optimal configuration that meets your SLAs
```mermaid
flowchart TD
A[Create DGDR] --> B[DGDR Controller]
B --> C{Profiling Method}
C -->|Online| D[Run Profiling Job<br/>2-4 hours]
C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
D --> F[Generate DGD Config]
E --> F
F --> G[Auto-Deploy DGD]
G --> H[Monitor & Scale]
style A fill:#e1f5fe
style D fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style H fill:#fff8e1
```
### Monitoring Progress
Watch DGDR status:
```bash
# View status
kubectl get dgdr -n $NAMESPACE
# Detailed status
kubectl describe dgdr sla-aic -n $NAMESPACE
# Watch profiling job logs
kubectl logs -f job/profile-sla-aic -n $NAMESPACE
```
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
### Relationship to DGD
- **DGDR**: High-level "intent" -- what you want deployed
- **DGD**: Low-level "implementation" -- how it's deployed
The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes the SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: sla-aic
    dgdr.nvidia.com/namespace: your-namespace
```
## Configuration
### DGDR Configuration
#### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
#### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD workers. If omitted, uses image from base config. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.useMocker` | boolean | Deploy mocker instead of real backend (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata and image for auto-created DGD |
#### SLA Configuration
```yaml
sla:
  isl: 3000 # Average input sequence length (tokens)
  osl: 150 # Average output sequence length (tokens)
  ttft: 200 # Target Time To First Token (milliseconds, float)
  itl: 20 # Target Inter-Token Latency (milliseconds, float)
```
**Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
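As a rough sanity check on a set of SLA values, the end-to-end latency of a single request can be approximated from TTFT, ITL, and OSL. The sketch below is a back-of-the-envelope estimate, not a formula the profiler or planner is documented to use; `estimated_request_latency_ms` is a hypothetical helper, and it assumes the first token arrives after `ttft` and every remaining token after one `itl`.

```python
def estimated_request_latency_ms(ttft_ms: float, itl_ms: float, osl: int) -> float:
    """Approximate end-to-end latency: first token, then (osl - 1) inter-token gaps."""
    if osl < 1:
        raise ValueError("osl must be at least 1")
    return ttft_ms + itl_ms * (osl - 1)


# SLA values from the example above: ttft=200 ms, itl=20 ms, osl=150 tokens.
latency = estimated_request_latency_ms(200.0, 20.0, 150)
print(f"{latency:.0f} ms")  # 200 + 20 * 149 = 3180 ms
```

If the estimate is well above what your users tolerate, tighten `itl` first: with long outputs it dominates the total.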
For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference).
### Profiling Methods
Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
```yaml
# Online Profiling (Default)
sweep:
  useAiConfigurator: false
# Offline Profiling (AI Configurator)
sweep:
  useAiConfigurator: true
  aicSystem: h200_sxm
  aicHfId: Qwen/Qwen3-32B
  aicBackendVersion: "0.20.0"
```
For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/components/profiler/profiler_guide.md#profiling-methods).
### Load Predictors
The SLA planner forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
#### Constant Predictor
- **Use case**: Stable workloads with long prediction intervals
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
#### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
- **Tunable parameters**:
  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA starts in raw space, and if it collapses to `(0,d,0)`, it falls back to `log1p` automatically.
#### Kalman Predictor
- **Use case**: Low-latency online forecasting (observe 1 -> predict 1) with smooth adaptation
- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
- **Configuration**: `load-predictor: "kalman"`
- **Tunable parameters**:
  - `--kalman-q-level`: process noise for level (higher = more responsive)
  - `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
  - `--kalman-r`: measurement noise (lower = trusts new measurements more)
  - `--kalman-min-points`: minimum points before forecasting
  - `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
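To make the Kalman parameters concrete, here is a minimal sketch of a local linear trend Kalman filter in plain Python. It is not the planner's implementation: the class name `LocalLinearTrendKalman`, the default values, and the initial covariance are assumptions chosen for illustration. Only the roles of `q_level`, `q_trend`, `r`, and `min_points` mirror the flags listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LocalLinearTrendKalman:
    q_level: float = 1.0  # process noise for level (cf. --kalman-q-level); illustrative value
    q_trend: float = 0.1  # process noise for trend (cf. --kalman-q-trend); illustrative value
    r: float = 1.0        # measurement noise (cf. --kalman-r); illustrative value
    min_points: int = 3   # observations required before forecasting (cf. --kalman-min-points)
    level: float = 0.0
    trend: float = 0.0
    # State covariance, initialised large so that early observations dominate.
    p: List[List[float]] = field(default_factory=lambda: [[1e3, 0.0], [0.0, 1e3]])
    n_obs: int = 0

    def observe(self, z: float) -> None:
        # Predict: level advances by trend; P = F P F^T + Q with F = [[1, 1], [0, 1]].
        level = self.level + self.trend
        trend = self.trend
        (a, b), (c, d) = self.p
        p00 = a + b + c + d + self.q_level
        p01 = b + d
        p10 = c + d
        p11 = d + self.q_trend
        # Update: only the level is observed (H = [1, 0]).
        innovation = z - level
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        self.level = level + k0 * innovation
        self.trend = trend + k1 * innovation
        # P = (I - K H) P
        self.p = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]
        self.n_obs += 1

    def predict_next(self) -> Optional[float]:
        """One-step-ahead forecast, or None until min_points observations are seen."""
        if self.n_obs < self.min_points:
            return None
        return self.level + self.trend


kf = LocalLinearTrendKalman()
for y in range(20):  # a perfectly linear load ramp: 0, 1, ..., 19
    kf.observe(float(y))
forecast = kf.predict_next()  # close to 20 for this ramp
```

Raising `q_level` or `q_trend` makes the filter follow sudden changes faster at the cost of noisier forecasts; raising `r` has the opposite effect.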
#### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
- **Tunable parameters**:
  - `--prophet-window-size`: bounds internal history to control refit cost
  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`
#### Warm-starting Load Predictors (Optional)
You can warm-start load predictors with a mooncake-style JSONL trace file:
- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
- **Effect**: preloads predictors with historical request-count / ISL / OSL samples extracted from the trace
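The following sketch shows one way request-count, ISL, and OSL samples could be derived from such a trace. It is an illustration only: the field names `timestamp` (milliseconds), `input_length`, and `output_length` are assumptions about the mooncake-style format, `bucket_trace` is a hypothetical helper, and the planner's own extraction logic may differ.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_trace(lines: Iterable[str], interval_s: float) -> List[Tuple[int, float, float]]:
    """Return (request_count, mean_isl, mean_osl) for each interval that has requests."""
    buckets: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        rec = json.loads(line)
        # Assumed: timestamps are in milliseconds from the start of the trace.
        idx = int((rec["timestamp"] / 1000.0) // interval_s)
        buckets[idx].append((rec["input_length"], rec["output_length"]))

    samples = []
    for idx in sorted(buckets):
        reqs = buckets[idx]
        n = len(reqs)
        mean_isl = sum(isl for isl, _ in reqs) / n
        mean_osl = sum(osl for _, osl in reqs) / n
        samples.append((n, mean_isl, mean_osl))
    return samples


trace = [
    '{"timestamp": 0, "input_length": 3000, "output_length": 100}',
    '{"timestamp": 1000, "input_length": 1000, "output_length": 200}',
    '{"timestamp": 200000, "input_length": 500, "output_length": 50}',
]
samples = bucket_trace(trace, interval_s=180)  # 180 s matches the default adjustment interval
```

Intervals without any requests are skipped here; a real warm-up would likely need to emit zero-count samples for them.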
### Planner Scaling Parameters
| Argument | Default | Description |
|----------|---------|-------------|
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
#### Planner Configuration Passthrough
Add planner-specific settings in the DGDR:
```yaml
profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2
```
## Integration
### Prometheus Setup
The planner queries Prometheus to collect frontend request metrics. The architecture:
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (configurable in podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: Prefill and backend workers handle inference
The planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with request count, ISL, OSL, TTFT, and ITL in the correct format. The Dynamo frontend provides these metrics automatically.
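To check connectivity to Prometheus from inside the cluster, you can issue an instant query against the standard Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. `build_query_url` and `instant_query` are hypothetical helpers, not part of Dynamo, and the example queries the built-in `up` series rather than any specific frontend metric name.

```python
import json
import urllib.parse
import urllib.request

DEFAULT_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


def build_query_url(endpoint: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{endpoint.rstrip('/')}/api/v1/query?{params}"


def instant_query(endpoint: str, promql: str, timeout_s: float = 5.0) -> list:
    """Run the query and return the list of result samples."""
    url = build_query_url(endpoint, promql)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]


# Building the URL needs no network access; instant_query() must run where
# the Prometheus service is reachable.
url = build_query_url(DEFAULT_ENDPOINT, "up")
```

If `instant_query` raises a name-resolution error, the endpoint is wrong for your cluster.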
**Prometheus endpoint configuration:**
| Variable | Default |
|----------|---------|
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` |
If you see errors like "Failed to resolve prometheus service", ensure `PROMETHEUS_ENDPOINT` points to your Prometheus service.
### Virtual Deployment
The SLA planner supports virtual deployment mode for customized environments (e.g., custom orchestrators) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing Kubernetes resources.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of PATCHing DGD resources, it writes scaling decisions and waits for the external environment to acknowledge completion.
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: Skips scaling if target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Verifies previous scaling operations completed by checking `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If not acknowledged within 30 minutes (1800 seconds), proceeds with new decisions
5. **Completion Tracking**: Optionally waits for scaling completion confirmation (blocking mode)
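The change-detection, readiness, and timeout steps above can be sketched as two small predicates. This is an illustration of the described behavior rather than the `VirtualConnector` source: the function names and the `issued_at` bookkeeping are assumptions.

```python
import time
from typing import Optional, Tuple

ACK_TIMEOUT_S = 1800  # 30 minutes, per the timeout-handling step


def needs_scaling(current: Tuple[int, int], target: Tuple[int, int]) -> bool:
    """Change detection: (prefill, decode) counts differ from the target."""
    return current != target


def can_issue_new_decision(
    decision_id: int,
    scaled_decision_id: int,
    issued_at: float,
    now: Optional[float] = None,
) -> bool:
    """Readiness check with timeout fallback.

    issued_at and now are seconds on the same monotonic clock.
    """
    now = time.monotonic() if now is None else now
    if scaled_decision_id >= decision_id:
        return True  # the previous decision was acknowledged
    # Not acknowledged: proceed anyway once the timeout has elapsed.
    return (now - issued_at) >= ACK_TIMEOUT_S
```

For example, `can_issue_new_decision(3, 2, issued_at=0.0, now=10.0)` is `False` because decision 3 has not been acknowledged and only 10 seconds have passed.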
#### Configuration
To use virtual deployment mode:
```yaml
environment: "virtual"
backend: "vllm" # or "sglang"
```
#### Deployment Environment Requirements
The external deployment environment must use `VirtualConnectorClient`:
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient
client = VirtualConnectorClient(distributed_runtime, namespace)
```
1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()` (blocks until change)
2. **Parse Decisions**: Read values: `decision = await client.get()`
3. **Execute Scaling**: Apply the scaling decisions to your infrastructure
4. **Acknowledge Completion**: Mark done: `await client.complete(decision)`
A scaling decision (returned by `client.get()`) contains:
- `num_prefill_workers`: Target number of prefill workers (-1 if not set)
- `num_decode_workers`: Target number of decode workers (-1 if not set)
- `decision_id`: Incremental ID for each scaling decision
See `components/planner/test/test_virtual_connector.py` for a full example.
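The four steps above can be combined into one handler. The sketch below is a minimal illustration: it relies only on the `wait()`, `get()`, and `complete()` calls and the decision fields described in this section, while `handle_next_decision` and the `apply_scaling` callback are hypothetical names for your own integration code.

```python
import asyncio
from typing import Awaitable, Callable


async def handle_next_decision(
    client,
    apply_scaling: Callable[[int, int], Awaitable[None]],
) -> int:
    """Wait for one scaling decision, apply it, and acknowledge it."""
    await client.wait()            # blocks until the planner publishes a change
    decision = await client.get()  # read the target worker counts
    # A value of -1 means the planner did not set that worker type;
    # apply_scaling is expected to leave such worker types unchanged.
    await apply_scaling(decision.num_prefill_workers, decision.num_decode_workers)
    await client.complete(decision)  # acknowledge so the planner can proceed
    return decision.decision_id


async def serve_forever(client, apply_scaling) -> None:
    """Process decisions until the task is cancelled."""
    while True:
        await handle_next_decision(client, apply_scaling)
```

Acknowledge only after your infrastructure has actually finished scaling; otherwise the planner may issue the next decision against stale worker counts.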
### Grafana Dashboard
Deploy the planner Grafana dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
Follow [Dynamo Metrics Collection on Kubernetes](/docs/kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your deployment namespace.
## DGDR Immutability
DGDRs are **immutable**. To update SLAs or configuration:
1. Delete the existing DGDR: `kubectl delete dgdr sla-aic -n $NAMESPACE`
2. Create a new DGDR with updated specifications
## Manual Deployment Control
### Option 1: Use DGDR-Generated Configuration (Recommended)
Disable auto-deployment to review the generated DGD before applying:
```yaml
spec:
  autoApply: false
```
Then manually extract and apply:
```bash
# Extract generated DGD from DGDR status
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
# Or save to file first for review/modification
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
vi my-dgd.yaml
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
### Option 2: Use Standalone Planner Templates (Advanced)
For advanced use cases, use the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
```bash
# After profiling completes, profiling data is stored in ConfigMaps
kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
# Update PROMETHEUS_ENDPOINT in the template, then deploy
kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Accessing Profiling Artifacts
By default, profiling jobs save essential data to ConfigMaps. For detailed artifacts, configure the DGDR to use `dynamo-pvc`:
**ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)
**PVC (optional):**
- Performance plots (PNGs)
- DGD configuration and logs for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler log
```bash
# Setup PVC
deploy/utils/setup_benchmarking_resources.sh
# Access results after profiling
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
## Troubleshooting
### Quick Diagnostics
```bash
# Check DGDR status and events
kubectl describe dgdr sla-aic -n $NAMESPACE
# Check operator logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
# Check profiling job logs
kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
```
### Common Issues
| Issue | Quick Fix |
|-------|-----------|
| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |
For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/components/profiler/profiler_guide.md#troubleshooting).
## Related Documentation
- [Planner README](README.md) -- Overview and quick start
- [Planner Examples](planner_examples.md) -- DGDR YAML examples and sample configurations
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive for contributors
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling](/docs/components/profiler/profiler_guide.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiler
The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
## Feature Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|------|--------|--------------|
| Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | 🚧 | ✅ | 🚧 |
| AI Configurator (Offline) | ❌ | ❌ | ✅ |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
## Quick Start
### Prerequisites
- Dynamo platform installed (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required for SLA planner)
### Using DynamoGraphDeploymentRequest (Recommended)
The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-profiling
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000 # Average input sequence length
        osl: 150 # Average output sequence length
        ttft: 200.0 # Target Time To First Token (ms)
        itl: 20.0 # Target Inter-Token Latency (ms)
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  autoApply: true
```
```bash
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
### Using AI Configurator (Fast Offline Profiling)
For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
```yaml
profilingConfig:
  config:
    sweep:
      useAiConfigurator: true
      aicSystem: h200_sxm
      aicHfId: Qwen/Qwen3-32B
      aicBackendVersion: "0.20.0"
```
### Direct Script Usage (Advanced)
For advanced scenarios, run the profiler directly:
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `sla.isl` | - | Average input sequence length (tokens) |
| `sla.osl` | - | Average output sequence length (tokens) |
| `sla.ttft` | - | Target Time To First Token (milliseconds) |
| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
## Profiling Methods
| Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
## Output
The profiler generates:
1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
2. **Performance Data**: Interpolation models for the SLA Planner
3. **Generated DGD**: Complete deployment manifest with optimized settings
Example recommendations:
```text
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
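If you need these recommendations in a script, the two lines can be parsed with a regular expression. The sketch below assumes the log lines keep exactly the shape shown in the example above, which is not a documented interface; `parse_recommendations` is a hypothetical helper.

```python
import re
from typing import Dict

# Matches lines such as:
#   Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
_LINE = re.compile(
    r"Suggested (prefill|decode) TP:(\d+) "
    r"\((?:TTFT|ITL) ([\d.]+) ms, throughput ([\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text: str) -> Dict[str, dict]:
    """Map 'prefill'/'decode' to the suggested TP size, latency, and throughput."""
    result = {}
    for phase, tp, latency, throughput in _LINE.findall(text):
        result[phase] = {
            "tp": int(tp),
            "latency_ms": float(latency),  # TTFT for prefill, ITL for decode
            "throughput_per_gpu": float(throughput),
        }
    return result


log = """\
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
"""
recs = parse_recommendations(log)
```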
## Next Steps
| Document | Description |
|----------|-------------|
| [Profiler Guide](profiler_guide.md) | Configuration, methods, and troubleshooting |
| [Profiler Examples](profiler_examples.md) | Complete DGDR YAMLs, WebUI, script examples |
| [SLA Planner Guide](/docs/components/planner/planner_guide.md) | End-to-end deployment workflow |
| [SLA Planner Architecture](/docs/design_docs/planner_design.md) | How the Planner uses profiling data |
```{toctree}
:hidden:
profiler_guide
profiler_examples
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiler Examples
Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
## DGDR Examples
### Dense Model: AIPerf on Real Engines
Standard online profiling with real GPU measurements:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0
      hardware:
        minNumGpusPerEngine: 1
        maxNumGpusPerEngine: 8
      sweep:
        useAiConfigurator: false
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  autoApply: true
```
### Dense Model: AI Configurator Simulation
Fast offline profiling (~30 seconds, TensorRT-LLM only):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300.0
        itl: 10.0
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
  autoApply: true
```
### MoE Model
Multi-node MoE profiling with SGLang:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    config:
      sla:
        isl: 2048
        osl: 512
        ttft: 300.0
        itl: 25.0
      hardware:
        numGpusPerNode: 8
        maxNumGpusPerEngine: 32
      engine:
        isMoeModel: true
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
  autoApply: true
```
### Using Existing DGD Config (ConfigMap)
Reference a custom DGD configuration via ConfigMap:
```bash
# Create ConfigMap from your DGD config file
kubectl create configmap deepseek-r1-config \
--from-file=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
  autoApply: true
```
## Interactive WebUI
Launch an interactive configuration selection interface:
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--pick-with-webui \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15
```
The WebUI launches on port 8000 by default (configurable with `--webui-port`).
### Features
- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows Pareto-optimal configurations that balance latency and throughput
- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
### Selection Methods
1. **GPU Hours Table** (recommended): Click any row to select both the prefill and decode configurations at once, based on the Pareto-optimal combination
2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
### Example DGD Config Output
When you click "Show Config", you see a DynamoGraphDeployment configuration:
```yaml
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=4
```
Once you select a configuration, the full DGD manifest is saved as `config_with_planner.yaml`.
## Direct Script Examples
### Basic Profiling
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
### With GPU Constraints
```bash
python -m benchmarks.profiler.profile_sla \
--backend sglang \
--config examples/backends/sglang/deploy/disagg.yaml \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150 \
--min-num-gpus 2 \
--max-num-gpus 8
```
### AI Configurator (Offline)
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15 \
--isl 4000 --osl 500
```
## SGLang Runtime Profiling
Profile SGLang workers at runtime via HTTP endpoints:
```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
# Run inference requests to generate profiling data...
# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:
```bash
python examples/backends/sglang/test_sglang_profile.py
```
View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Profiler Guide
This guide covers deployment, configuration, integration, and troubleshooting for the Dynamo Profiler.
## What is a DynamoGraphDeploymentRequest (DGDR)?
A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. You specify:
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
**Relationship to DGD:**
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
## Support Matrix
| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 |
The profiler sweeps over the following parallelization mappings for prefill and decode:
| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
|---------|-------------|------------|
| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
| Other Models | TP | TP |
> [!NOTE]
> Exact support for each combination of model and parallelization mapping depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
## Deployment
### Kubernetes Deployment (DGDR)
The recommended deployment method is through DGDRs. Sample configurations are provided in `benchmarks/profiler/deploy/`:
| Sample | Description |
|--------|-------------|
| `profile_sla_dgdr.yaml` | Standard online profiling with AIPerf |
| `profile_sla_aic_dgdr.yaml` | Fast offline profiling with AI Configurator |
| `profile_sla_moe_dgdr.yaml` | MoE model profiling (SGLang) |
#### Container Images
Each DGDR requires container images for profiling and deployment:
- **`profilingConfig.profilerImage`** (Required): Container image for the profiling job. Must contain the profiler code and dependencies.
- **`deploymentOverrides.workersImage`** (Optional): Container image for DGD worker components (frontend, workers, planner). If omitted, uses image from the base config file.
```yaml
spec:
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
```
#### Quick Start: Deploy with DGDR
**Step 1: Create Your DGDR**
Use a sample configuration or create your own:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-profiling
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  autoApply: true
```
**Step 2: Apply the DGDR**
```bash
export NAMESPACE=your-namespace
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
**Step 3: Monitor Progress**
```bash
# View status
kubectl get dgdr -n $NAMESPACE
# Detailed status
kubectl describe dgdr my-model-profiling -n $NAMESPACE
# Watch profiling job logs
kubectl logs -f job/profile-my-model-profiling -n $NAMESPACE
```
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
**Step 4: Access Your Deployment**
```bash
# Find the frontend service
kubectl get svc -n $NAMESPACE | grep frontend
# Port-forward to access locally
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n $NAMESPACE
# Test the endpoint
curl http://localhost:8000/v1/models
```
> [!NOTE]
> DGDRs are **immutable**. To update SLAs or configuration, delete the existing DGDR and create a new one.
### Direct Script Execution
For advanced use cases or local development:
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150 \
--min-num-gpus 1 \
--max-num-gpus 8
```
## Profiling Method
The profiler follows a 5-step process:
1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
2. **Identify Sweep Ranges**: Automatically determines the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is one node for dense models and four nodes for MoE models.
3. **Parallelization Mapping Sweep**: Test performance of engines with different parallelization mappings using the input ISL and OSL.
   - For dense models, test different TP sizes for both prefill and decode.
   - For MoE models (SGLang), evaluate both TEP and DEP as candidates for prefill and decode.
   - **Prefill**:
     - TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
     - DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (`attn_dp_num_req_ratio` defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst.
   ![Prefill Performance](../../images/h100_prefill_performance.png)
   - **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without interference from piggy-backed prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring.
   ![Decode Performance](../../images/h100_decode_performance.png)
4. **Recommendation**: Select optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL.
5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation.
   ![ITL Interpolation](../../images/pd_interpolation.png)
   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.
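The DEP TTFT formula from step 3 and the selection rule from step 4 can be sketched in a few lines of Python. This is a minimal illustration, not the profiler's implementation: the `Candidate` record, its field names, and the sample numbers are hypothetical. Only the formula `time_to_first_token.max / attn_dp_num_req_ratio` and the "highest per-GPU throughput within the SLA" rule come from the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One profiled parallelization mapping (hypothetical record, for illustration)."""
    name: str
    latency_ms: float    # TTFT for prefill candidates, ITL for decode candidates
    thpt_per_gpu: float  # tokens/s/GPU


def dep_ttft(max_ttft_ms: float, attn_dp_num_req_ratio: int = 4) -> float:
    """Reported TTFT for a DEP burst: time_to_first_token.max / attn_dp_num_req_ratio."""
    return max_ttft_ms / attn_dp_num_req_ratio


def recommend(candidates: list[Candidate], sla_ms: float) -> Optional[Candidate]:
    """Return the candidate with the highest per-GPU throughput that meets the SLA."""
    feasible = [c for c in candidates if c.latency_ms <= sla_ms]
    return max(feasible, key=lambda c: c.thpt_per_gpu) if feasible else None


# Made-up numbers, only to exercise the selection rule.
prefill = [
    Candidate("TP1", 310.0, 9800.0),
    Candidate("TP2", 180.0, 8700.0),
    Candidate("TP4", 110.0, 7100.0),
]
best = recommend(prefill, sla_ms=200.0)
print(best.name if best else "no mapping meets the SLA")
```

With these sample values, TP1 misses the 200 ms target, and TP2 wins over TP4 because it delivers more tokens per second per GPU while still meeting the SLA.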
### AIPerf on Real Engines
Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM
```yaml
profilingConfig:
  config:
    sweep:
      useAiConfigurator: false  # Default
```
### AI Configurator Simulation
Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
- **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
```yaml
profilingConfig:
  config:
    sweep:
      useAiConfigurator: true
      aicSystem: h200_sxm
      aicHfId: Qwen/Qwen3-32B
      aicBackendVersion: "0.20.0"  # TRT-LLM version simulated by AIC
```
> [!NOTE]
> `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.
**Currently supports:**
- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) for the full list.
### Automatic GPU Discovery
Cluster-scoped operators can optionally enable automatic GPU discovery:
```yaml
spec:
  enableGpuDiscovery: true
```
This is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions.
## Configuration
### DGDR Configuration Structure
All profiler configuration goes under `spec.profilingConfig.config`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-deployment
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    configMapRef:  # Optional: base DGD config
      name: my-config
      key: disagg.yaml
    config:
      sla: { ... }
      hardware: { ... }
      sweep: { ... }
      planner: { ... }
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
```
### SLA Configuration (Required)
```yaml
sla:
  isl: 3000    # Average input sequence length (tokens)
  osl: 150     # Average output sequence length (tokens)
  ttft: 200.0  # Target Time To First Token (milliseconds)
  itl: 20.0    # Target Inter-Token Latency (milliseconds)
```
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
- **Trade-offs**: Tighter SLAs require more GPU resources
### Hardware Configuration (Optional)
```yaml
hardware:
  minNumGpusPerEngine: 2  # Auto-determined from model size and VRAM if not provided
  maxNumGpusPerEngine: 8  # Maximum GPUs to test
  numGpusPerNode: 8       # GPUs per node (for multi-node MoE)
  gpuType: h200_sxm       # GPU type hint (informational, auto-detected)
```
- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **numGpusPerNode**: Determine the upper bound of GPUs per node for dense models and configure Grove for multi-node MoE engines
- **gpuType**: Informational only, auto-detected by the controller. For AI Configurator, use `aicSystem` in the [sweep configuration](#ai-configurator-configuration) instead
> [!TIP]
> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
### Sweep Configuration (Optional)
```yaml
sweep:
  useAiConfigurator: false             # Use real profiling (default)
  prefillInterpolationGranularity: 16  # Samples for prefill TTFT curve
  decodeInterpolationGranularity: 6    # Samples for decode ITL curve
```
- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefillInterpolationGranularity**: Samples for prefill TTFT curve (lower = faster but less accurate)
- **decodeInterpolationGranularity**: Samples for decode ITL curve. Since ITL interpolation is 3D and takes longer, we default to fewer samples. Increasing this value may quadratically increase profiling time.
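To see why raising the decode granularity is costly, assume as a simplification that the decode sweep samples a grid over KV usage and context length with `decodeInterpolationGranularity` points per axis, while the prefill sweep samples a single ISL axis. The function names below are illustrative, and the real profiler may skip or add points, so treat the decode figure as a rough upper bound.

```python
def prefill_samples(granularity: int) -> int:
    """Prefill sweeps one axis (ISL), so the sample count grows linearly."""
    return granularity


def decode_samples(granularity: int) -> int:
    """Assumed 2D grid (KV usage x context length): the sample count grows quadratically."""
    return granularity * granularity


for g in (6, 12):
    print(f"granularity={g}: prefill={prefill_samples(g)}, decode<={decode_samples(g)}")
```

Under this assumption, doubling the decode granularity from 6 to 12 roughly quadruples the number of decode measurements.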
### AI Configurator Configuration
Required if `useAiConfigurator: true`:
```yaml
sweep:
  useAiConfigurator: true
  aicSystem: h200_sxm          # h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
  aicHfId: Qwen/Qwen3-32B      # HuggingFace model ID
  aicBackendVersion: "0.20.0"  # TensorRT-LLM version
```
### Planner Configuration (Optional)
Pass arguments to the SLA planner:
```yaml
planner:
  planner_min_endpoint: 2          # Minimum endpoints to maintain
  planner_adjustment_interval: 60  # Adjustment interval (seconds)
  planner_load_predictor: linear   # Load prediction method
```
> [!NOTE]
> Planner arguments use `planner_` prefix. See [SLA Planner documentation](/docs/components/planner/planner_guide.md) for full list.
### Model Cache PVC (Advanced)
For large models, use a pre-populated PVC containing model weights instead of downloading from HuggingFace:
```yaml
deployment:
  modelCache:
    pvcName: "model-cache"
    pvcPath: "hub/models--deepseek-ai--DeepSeek-R1"
    mountPath: "/opt/model-cache"
```
Requirements:
- The PVC must exist in the same namespace as the DGDR
- The model weights must be accessible at `{mountPath}/{pvcPath}`
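As a quick sanity check, the expected weights location can be computed by joining the two fields. This is a small sketch: `model_weights_path` is a hypothetical helper name, and it only mirrors the `{mountPath}/{pvcPath}` rule stated above.

```python
import posixpath


def model_weights_path(mount_path: str, pvc_path: str) -> str:
    """Join mountPath and pvcPath as described by the {mountPath}/{pvcPath} rule."""
    return posixpath.join(mount_path, pvc_path.lstrip("/"))


print(model_weights_path("/opt/model-cache", "hub/models--deepseek-ai--DeepSeek-R1"))
```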
### Engine Configuration (Auto-configured)
The controller automatically injects these from high-level fields:
```yaml
# You specify:
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
# Controller auto-injects:
profilingConfig:
  config:
    deployment:
      model: "Qwen/Qwen3-0.6B"
    engine:
      backend: vllm
      config: /path/to/configmap
```
You should **not** manually set `deployment.model` or `engine.backend` in `profilingConfig.config`.
### Using Existing DGD Configs (ConfigMap)
Reference an existing DGD config via ConfigMap:
```bash
kubectl create configmap my-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
```yaml
profilingConfig:
  configMapRef:
    name: my-config
    key: disagg.yaml
```
The profiler uses the DGD config as a **base template**, then optimizes it based on your SLA targets.
### CLI Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--backend` | string | - | Inference backend: vllm, sglang, trtllm |
| `--config` | string | - | Path to DGD YAML config file |
| `--model` | string | - | HuggingFace model ID |
| `--ttft` | float | - | Target TTFT in milliseconds |
| `--itl` | float | - | Target ITL in milliseconds |
| `--isl` | int | - | Average input sequence length |
| `--osl` | int | - | Average output sequence length |
| `--min-num-gpus` | int | auto | Minimum GPUs per engine |
| `--max-num-gpus` | int | 8 | Maximum GPUs per engine |
| `--use-ai-configurator` | flag | false | Use offline AI Configurator |
| `--pick-with-webui` | flag | false | Launch interactive WebUI |
| `--webui-port` | int | 8000 | Port for WebUI |
> [!NOTE]
> CLI arguments map to DGDR config fields: `--min-num-gpus` = `hardware.minNumGpusPerEngine`, `--max-num-gpus` = `hardware.maxNumGpusPerEngine`, `--use-ai-configurator` = `sweep.useAiConfigurator`. See [DGDR Configuration Structure](#dgdr-configuration-structure) for all field mappings.
## Integration
### With SLA Planner
The Profiler generates interpolation data that the SLA Planner uses for autoscaling decisions.
**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
- `prefill_isl`: 1D array of input sequence lengths tested
- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
- `max_kv_tokens`: Total KV tokens capacity in decode engine
- `x_kv_usage`: 1D array of active KV usage percentages [0, 1]
- `y_context_length`: 1D array of average context lengths tested
- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
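The prefill arrays can be turned into a TTFT estimate for an arbitrary ISL by interpolating between profiled points. The sketch below uses plain piecewise-linear interpolation on made-up values. It assumes the arrays have already been loaded from `raw_data.npz` (for example with `numpy.load`), and it is not necessarily the interpolation method the Planner itself uses; `interp_ttft` is a hypothetical helper.

```python
from bisect import bisect_left
from typing import Sequence


def interp_ttft(isl: float, prefill_isl: Sequence[float], prefill_ttft: Sequence[float]) -> float:
    """Piecewise-linear TTFT estimate at `isl`, clamped to the profiled range.

    `prefill_isl` must be sorted in ascending order.
    """
    if isl <= prefill_isl[0]:
        return prefill_ttft[0]
    if isl >= prefill_isl[-1]:
        return prefill_ttft[-1]
    hi = bisect_left(prefill_isl, isl)
    lo = hi - 1
    frac = (isl - prefill_isl[lo]) / (prefill_isl[hi] - prefill_isl[lo])
    return prefill_ttft[lo] + frac * (prefill_ttft[hi] - prefill_ttft[lo])


# Made-up samples standing in for the `prefill_isl` / `prefill_ttft` arrays.
prefill_isl = [1000, 2000, 4000]
prefill_ttft = [60.0, 120.0, 260.0]
print(interp_ttft(3000, prefill_isl, prefill_ttft))
```

The decode data can be handled the same way in two dimensions, interpolating `z_itl` over `x_kv_usage` and `y_context_length`.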
### With Dynamo Operator
When using DGDR, the Dynamo Operator:
1. Creates profiling jobs automatically
2. Stores profiling data in ConfigMaps (`planner-profile-data`)
3. Generates optimized DGD configurations
4. Deploys the DGD with SLA Planner integration
The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: my-deployment
    dgdr.nvidia.com/namespace: your-namespace
```
### With Observability
Monitor profiling jobs:
```bash
kubectl logs -f job/profile-<dgdr-name> -n $NAMESPACE
kubectl describe dgdr <name> -n $NAMESPACE
```
## Advanced Topics
### Manual Deployment Control
Disable auto-deployment to review the generated DGD before applying:
```yaml
spec:
  autoApply: false
```
Then manually extract and apply:
```bash
# Extract generated DGD from DGDR status
kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
# Or save to file for review
kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
```
### Mocker Deployment
Deploy a mocker deployment that simulates engines without GPUs:
```yaml
spec:
  model: <model-name>
  backend: trtllm
  useMocker: true  # Deploy mocker instead of real backend
  autoApply: true
```
Profiling still runs against the real backend to collect performance data. The mocker uses this data to simulate realistic timing behavior. Useful for large-scale experiments, testing Planner behavior, and validating configurations.
### Accessing Profiling Artifacts
By default, profiling data is stored in ConfigMaps. For detailed artifacts (plots, logs, raw data), attach a PVC:
```yaml
profilingConfig:
  outputPVC: "dynamo-pvc"
```
**ConfigMaps (always created):**
- `dgdr-output-<name>`: Generated DGD configuration
- `planner-profile-data`: Profiling data for Planner (JSON)
**PVC artifacts (optional):**
- Performance plots (PNGs)
- DGD configurations for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler logs
Access PVC results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
### Output Performance Plots
The profiler generates plots to visualize performance data:
**Parallelization Mapping Sweep Plots:**
- `prefill_performance.png`: TTFT vs Parallelization Mapping size
- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
**In-Depth Profiling Plots:**
- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL
- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL
- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length
- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length
## Runtime Profiling (SGLang)
SGLang workers expose profiling endpoints for runtime performance analysis:
```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
# Run inference requests...
# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
## Troubleshooting
### Profiling Takes Too Long
**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
```yaml
sweep:
  useAiConfigurator: true
```
**Solution 2**: Reduce search space:
```yaml
hardware:
  minNumGpusPerEngine: 4  # Skip TP1, TP2
  maxNumGpusPerEngine: 8  # Don't test beyond TP8
```
### SLA Cannot Be Met
**Symptoms**: Profiler reports no configuration meets targets
**Solutions:**
1. Relax SLA targets (increase TTFT/ITL)
2. Add more GPU resources
3. Try a different backend
4. Use a smaller model
### AI Configurator: Attention Head Constraint Error
**Symptoms**: Profiling fails with error:
```text
AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
```
**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
**Affected Models:**
- **Qwen3-0.6B** (16 heads): Max TP = 4
- **GPT-2** (12 heads): Max TP = 3
- Most models **<1B parameters**: May hit this constraint
**Solution**: Limit `maxNumGpusPerEngine`:
```yaml
hardware:
  maxNumGpusPerEngine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
```
**Calculate Max TP**: `max_tp = num_attention_heads / 4`
> [!NOTE]
> This is an AI Configurator limitation. Online profiling doesn't have this constraint.
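The assertion message states two conditions: the head count must be divisible by the TP size, and the quotient must be at least 4. The helper below is a small sketch that checks both, so it also covers head counts where `num_attention_heads / 4` is not itself a valid TP size. The function name `max_tp_for_aic` is hypothetical.

```python
def max_tp_for_aic(num_attention_heads: int, min_heads_per_gpu: int = 4) -> int:
    """Largest TP size that divides the head count and leaves >= min_heads_per_gpu heads per GPU."""
    valid = [
        tp
        for tp in range(1, num_attention_heads + 1)
        if num_attention_heads % tp == 0 and num_attention_heads // tp >= min_heads_per_gpu
    ]
    if not valid:
        raise ValueError("no TP size satisfies the constraint")
    return max(valid)


print(max_tp_for_aic(16))  # Qwen3-0.6B: 16 heads
print(max_tp_for_aic(12))  # GPT-2: 12 heads
```

Use the result as the value for `maxNumGpusPerEngine` when profiling with AI Configurator.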
### Image Pull Errors
**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
**Solution**: Ensure image pull secrets are configured:
```bash
kubectl create secret docker-registry nvcr-imagepullsecret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=<NGC_API_KEY> \
--namespace <your-namespace>
```
### Out of Memory During Profiling
**Symptoms**: OOM errors in profiling jobs
**Solutions:**
1. Reduce `gpu_memory_utilization` in engine config
2. Reduce `--max-context-length`
3. Skip larger TP configurations
4. Use fewer GPUs per test
### Unsupported Parallelization Mapping in Backend
**Symptoms**: Startup/runtime error in the backend (e.g., prime number of attention heads constraining TP to 1, or backend not supporting different TP sizes for prefill and decode).
**Solutions:**
1. Ask the backend maintainers to add support, then bump the backend version in Dynamo
2. Constrain the max and min number of GPUs per engine to the supported range
## See Also
- [Profiler Examples](profiler_examples.md) - Complete DGDR YAML examples
- [SLA Planner Guide](/docs/components/planner/planner_guide.md) - End-to-end deployment workflow
- [SLA Planner Architecture](/docs/components/planner/planner_guide.md) - How the Planner uses profiling data
- [DGDR API Reference](/docs/kubernetes/api_reference.md) - DGDR specification
- [Profiler Arguments Reference](/benchmarks/profiler/utils/profiler_argparse.py) - Full CLI reference
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Router
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
#### CLI Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
For all available options: `python -m dynamo.frontend --help`
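One common way to implement a temperature knob like `--router-temperature` is softmax sampling over negated costs. The sketch below illustrates that general technique only. It is an assumption made for illustration, not a description of Dynamo's internal code, and `pick_worker` and the cost values are hypothetical.

```python
import math
import random


def pick_worker(costs: dict[int, float], temperature: float, rng: random.Random) -> int:
    """Pick a worker id: lowest cost when temperature is 0, softmax-sampled otherwise."""
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    # Lower cost -> higher logit; subtract the max logit for numerical stability.
    logits = {w: -c / temperature for w, c in costs.items()}
    peak = max(logits.values())
    weights = {w: math.exp(logit - peak) for w, logit in logits.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]


costs = {0: 12.0, 1: 10.0, 2: 30.0}  # hypothetical per-worker costs
rng = random.Random(0)
print(pick_worker(costs, temperature=0.0, rng=rng))  # always the cheapest worker
print(pick_worker(costs, temperature=5.0, rng=rng))  # may pick any worker
```

At temperature 0 the choice is deterministic. As the temperature rises, the weights flatten and more expensive workers are chosen more often, which spreads load at the price of fewer cache hits.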
### Kubernetes Deployment
To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
#### Environment Variables
All CLI arguments can be configured via environment variables using the `DYN_` prefix:
| CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
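
The naming pattern in the table can be expressed as a one-line transformation. This is a sketch inferred from the rows above, with a hypothetical helper name. It does not cover negated flags: `--no-kv-events` maps to `DYN_KV_EVENTS=false` rather than to its own variable.

```python
def flag_to_env(flag: str) -> str:
    """Map a CLI flag such as '--router-temperature' to its DYN_ environment variable name."""
    return "DYN_" + flag.lstrip("-").replace("-", "_").upper()


for flag in ("--router-mode", "--router-temperature",
             "--kv-cache-block-size", "--kv-overlap-score-weight"):
    print(f"{flag} -> {flag_to_env(flag)}")
```
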
For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md).
## Prerequisites and Limitations
**Requirements:**
- **Dynamic endpoints only**: Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md)). Your backend handler then receives pre-tokenized requests with `token_ids` instead of raw text.
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
**Multimodal Support:**
- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
- **SGLang**: Image routing not yet supported
- **Other modalities** (audio, video, etc.): Not yet supported
**Limitations:**
- Static endpoints are not supported: the KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
## Next Steps
- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
```{toctree}
:hidden:
router_guide
router_examples
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Router Examples
For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
## Table of Contents
- [Using KvPushRouter Python API](#using-kvpushrouter-python-api)
- [K8s Examples](#k8s-examples)
- [Routing Patterns](#routing-patterns)
- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
## Using KvPushRouter Python API
Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
> [!WARNING]
> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported: they share the same cancellation token from the component's primary lease, so dropping one router cancels all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
### Methods
The `KvPushRouter` provides the following methods:
- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
- Without `request_id`: Query-only, doesn't update router state
- With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
### Setup
First, launch your backend engines:
```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```
### Example Script
```python
import asyncio

from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig


async def main():
    # Get runtime and create endpoint
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")

    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config,
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,    # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,    # Add routing randomness
        },
    )

    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            generated_tokens.extend(response["token_ids"])
    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")


if __name__ == "__main__":
    asyncio.run(main())
```
## K8s Examples
For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](README.md#kubernetes-deployment) in the Quick Start guide.
### Complete K8s Examples
- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
### Example with Advanced Configuration
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
        - name: DYN_ROUTER_TEMPERATURE
          value: "0.5"  # Add some randomness to prevent worker saturation
        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
          value: "1.5"  # Prioritize TTFT over ITL
        - name: DYN_KV_CACHE_BLOCK_SIZE
          value: "16"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
```
### Alternative: Using Command Args in K8s
You can also pass CLI arguments directly in the container command:
```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    command:
      - /bin/sh
      - -c
    args:
      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
```
**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
## Routing Patterns
The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
### 1. Automatic Routing (Recommended)
Call `generate()` directly and let the router handle everything:
```python
stream = await router.generate(token_ids=tokens, model="model-name")
```
- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
### 2. Manual State Management (Advanced)
Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
```python
worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
# `client` is your own client for the selected worker (not shown here)
response = await client.generate(tokens, request_id="req-123")
# await anext(response)  # Get first token
await router.mark_prefill_complete("req-123")  # After first token
# async for _ in response:  # Continue generating
#     ...
await router.free("req-123")  # After completion
```
- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
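
One way to reduce the risk of leaked request state is to wrap the manual lifecycle in an async context manager so that `free()` always runs. The following is a minimal sketch under stated assumptions: `tracked_request` is a hypothetical helper, and `FakeRouter` is a stand-in that only records calls so the example runs without a Dynamo deployment.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def tracked_request(router, request_id: str):
    """Guarantee that router.free() runs, even if the request fails midway."""
    try:
        yield
    finally:
        await router.free(request_id)


class FakeRouter:
    """Stand-in for KvPushRouter that records lifecycle calls."""

    def __init__(self):
        self.calls = []

    async def mark_prefill_complete(self, request_id):
        self.calls.append(("prefill_complete", request_id))

    async def free(self, request_id):
        self.calls.append(("free", request_id))


async def demo() -> list:
    router = FakeRouter()
    try:
        async with tracked_request(router, "req-123"):
            await router.mark_prefill_complete("req-123")  # after the first token
            raise RuntimeError("simulated failure while streaming")
    except RuntimeError:
        pass
    return router.calls


print(asyncio.run(demo()))
```

Even though the simulated stream fails, the recorded calls end with `free`, so the router's load accounting stays consistent.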
### 3. Hierarchical Router Probing
Query without state updates, then route through a chosen router:
```python
# Probe multiple routers without updating state
worker_id_1, _dp_rank_1, overlap_1 = await router_1.best_worker(tokens)  # No request_id
worker_id_2, _dp_rank_2, overlap_2 = await router_2.best_worker(tokens)
# Pick the best router (and its chosen worker) based on results
if overlap_1 > overlap_2:
    chosen_router, worker_id = router_1, worker_id_1
else:
    chosen_router, worker_id = router_2, worker_id_2
stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
```
- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one
### 4. Custom Load-Based Routing
Use `get_potential_loads()` to implement custom routing logic:
```python
loads = await router.get_potential_loads(tokens)
# Apply custom logic (e.g., weighted scoring, constraints)
best_worker = min(loads, key=lambda x: custom_cost_fn(x))
stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
```
- **Best for**: Custom optimization strategies beyond the built-in cost function
- **Advantage**: Full control over worker selection logic
- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
## Custom Routing Example: Minimizing TTFT
Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
```python
import asyncio

from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig


async def minimize_ttft_routing():
    # Setup router
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=KvRouterConfig(),
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Get potential loads for all workers
    potential_loads = await router.get_potential_loads(token_ids)

    # Find worker with minimum prefill tokens (best for TTFT)
    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
    print(f"Worker loads: {potential_loads}")
    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")

    # Route directly to the selected worker
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
        stop_conditions={"max_tokens": 20},
    )

    # Process response
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            print(f"Generated tokens: {response['token_ids']}")


if __name__ == "__main__":
    asyncio.run(minimize_ttft_routing())
```
This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. As some examples:
- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
See [Router Design](../../design_docs/router_design.md) for architecture details and the cost function algorithm.
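
For the "balance load" case, a custom cost function can blend both load fields into one score. The sketch below is one possible weighting, not the router's built-in cost function. The `prefill_weight` and `block_size` parameters and the sample load values are assumptions; the dictionary keys are the ones used in the example above.

```python
def custom_cost_fn(load: dict, prefill_weight: float = 1.0, block_size: int = 16) -> float:
    """Blend prefill and decode load into one score (lower is better)."""
    prefill_blocks = load["potential_prefill_tokens"] / block_size
    return prefill_weight * prefill_blocks + load["potential_decode_blocks"]


# Hypothetical output of get_potential_loads()
loads = [
    {"worker_id": 0, "potential_prefill_tokens": 160, "potential_decode_blocks": 40},
    {"worker_id": 1, "potential_prefill_tokens": 640, "potential_decode_blocks": 5},
    {"worker_id": 2, "potential_prefill_tokens": 320, "potential_decode_blocks": 12},
]
best = min(loads, key=custom_cost_fn)
print(best["worker_id"])
```

Raising `prefill_weight` favors workers with more cached prefix (better TTFT), while lowering it favors workers with fewer active decode blocks (better ITL).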
## KV Event Publishing for Custom Engines
The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions. There are two main publishing pathways:
- **Direct NATS publishing** (`KvEventPublisher`): publishes events directly to NATS and is the simplest approach for custom engines.
- **ZMQ-based publishing**: for engines with ZMQ event output (like vLLM); uses a ZMQ publisher in the engine and `ZmqKvEventPublisher` to forward events to NATS.
### Event Types
The KV cache supports three event types:
| Event Type | Description | When to Publish |
|------------|-------------|-----------------|
| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
### Event Structure
Each event contains:
- **`event_id`**: Monotonically increasing identifier per worker
- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
- **`data`**: One of `Stored`, `Removed`, or `Cleared`
For `BlockStored` events:
- **`token_ids`**: List of token IDs for the stored blocks
- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
For `BlockRemoved` events:
- **`block_hashes`**: List of sequence block hashes being evicted
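The cumulative nature of sequence block hashes can be illustrated with a small sketch. It is not the hashing scheme of any particular engine: the use of SHA-256 truncated to 64 bits, the helper name `sequence_block_hashes`, and the choice to skip a trailing partial block are all assumptions made for illustration. The point it shows is that each block's hash is chained to its parent, so two requests sharing a prefix produce identical leading hashes.

```python
import hashlib
import struct


def sequence_block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Return one cumulative 64-bit hash per full block of tokens.

    Hypothetical helper: each hash covers the parent hash plus the tokens of
    the current block, so it depends on every token from the sequence start.
    """
    hashes: list[int] = []
    parent = b""  # the first block has no parent
    num_full_blocks = len(token_ids) // block_size  # assumption: partial blocks are skipped
    for i in range(num_full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        # Pack token IDs as unsigned 32-bit integers (assumes IDs fit in 32 bits).
        payload = parent + struct.pack(f"<{len(block)}I", *block)
        digest = hashlib.sha256(payload).digest()[:8]
        hashes.append(int.from_bytes(digest, "little"))
        parent = digest  # chain the next block to this one
    return hashes


shared_prefix = sequence_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
other_suffix = sequence_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
# The first block matches (same prefix); the second block differs.
print(shared_prefix[0] == other_suffix[0], shared_prefix[1] == other_suffix[1])
```

In this sketch the `parent_hash` for a `BlockStored` event would be the hash of the block preceding the first block in the event, or `None` when the event starts at the beginning of the sequence.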
### Option 1: Direct NATS Publishing (Recommended)
The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines.
```mermaid
flowchart LR
subgraph Engine["Custom Engine"]
cache["KV Cache Manager"]
end
subgraph Worker["Dynamo Worker Process"]
pub["KvEventPublisher"]
end
subgraph NATS["NATS"]
subject["kv-events subject"]
end
subgraph Router["KV Router"]
indexer["KvIndexer"]
end
cache -->|"on_blocks_stored()<br/>on_blocks_removed()"| pub
pub -->|"publish to NATS"| subject
subject --> indexer
```
**When to use:**
- Building a custom inference engine from scratch
- Your engine doesn't have a ZMQ-based event system
- You want the simplest integration path
#### Basic Setup
```python
from dynamo.llm import KvEventPublisher


class CustomEnginePublisher:
    def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0):
        self.block_size = block_size
        self.event_id = 0
        self.kv_publisher = KvEventPublisher(
            component=component,
            worker_id=worker_id,
            kv_block_size=block_size,
            dp_rank=dp_rank,
            enable_local_indexer=False,
        )

    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
                         lora_id: int = 0, parent_hash: int | None = None):
        """Call after KV cache blocks are allocated."""
        self.event_id += 1
        num_block_tokens = [self.block_size] * len(block_hashes)
        self.kv_publisher.publish_stored(
            event_id=self.event_id,
            token_ids=token_ids,
            num_block_tokens=num_block_tokens,
            block_hashes=block_hashes,
            lora_id=lora_id,
            parent_hash=parent_hash,
        )

    def on_blocks_removed(self, block_hashes: list[int]):
        """Call when KV cache blocks are evicted."""
        self.event_id += 1
        self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes)
```
#### Integration with Your Engine
```python
from dynamo.llm import register_llm


async def main():
    # Register your engine with Dynamo
    component, endpoint = await register_llm(
        model="my-model",
        generator=my_generate_fn,
    )

    # Initialize publisher
    publisher = CustomEnginePublisher(
        component=component,
        worker_id=endpoint.connection_id(),
        block_size=16,  # Match your engine's block size
    )

    # Hook into your engine's cache events
    def on_prefill_complete(request_id, token_ids, blocks):
        block_hashes = [block.hash for block in blocks]
        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)

    def on_cache_eviction(evicted_blocks):
        block_hashes = [block.hash for block in evicted_blocks]
        publisher.on_blocks_removed(block_hashes=block_hashes)
```
### Option 2: ZMQ-based Publishing
For engines that publish events via ZMQ (like vLLM), this option uses two components that work together:
1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket
2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS
```mermaid
flowchart LR
subgraph Engine["Custom Engine / vLLM"]
cache["KV Cache Manager"]
zmq_pub["ZMQ Publisher<br/>(Pure Python)"]
end
subgraph ZMQ["ZMQ Socket"]
socket["tcp://127.0.0.1:5557"]
end
subgraph Worker["Dynamo Worker Process"]
zmq_sub["ZmqKvEventPublisher<br/>(Rust bindings)"]
end
subgraph NATS["NATS"]
subject["kv-events subject"]
end
subgraph Router["KV Router"]
indexer["KvIndexer"]
end
cache --> zmq_pub
zmq_pub -->|"PUB"| socket
socket -->|"SUB"| zmq_sub
zmq_sub --> subject
subject --> indexer
```
**When to use:**
- Your engine already has a ZMQ-based event system (like vLLM)
- You're integrating with a consolidator (like KVBM)
- You want to decouple event publishing from your engine's main loop
#### Part 1: ZMQ Subscriber (Dynamo Bindings)
If your engine already publishes to ZMQ, use `ZmqKvEventPublisher` to subscribe and forward to NATS:
```python
from dynamo.llm import ZmqKvEventPublisher, ZmqKvEventPublisherConfig

# Configure the ZMQ subscriber
config = ZmqKvEventPublisherConfig(
    worker_id=endpoint.connection_id(),
    kv_block_size=block_size,
    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
    zmq_topic="",  # Subscribe to all topics
    enable_local_indexer=False,
)

# Create publisher - it automatically subscribes to ZMQ and forwards to NATS
kv_publisher = ZmqKvEventPublisher(
    component=component,
    config=config,
)
```
#### Part 2: ZMQ Publisher (Pure Python)
If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol:
```python
import zmq
import msgpack
import time


class ZmqKvEventPublisher:
    """Pure Python ZMQ publisher for KV events (vLLM-compatible format)."""

    def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""):
        self.kv_block_size = kv_block_size
        self.topic = topic
        self.ctx = zmq.Context()
        self.socket = self.ctx.socket(zmq.PUB)
        self.socket.bind(zmq_endpoint)
        self.sequence = 0
        self.data_parallel_rank = 0

    def _to_signed_i64(self, value: int | None) -> int | None:
        # Map an unsigned 64-bit hash onto the signed 64-bit range
        if value is None:
            return None
        return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value

    def publish_stored(self, event_id: int, token_ids: list[int], num_block_tokens: list[int],
                       block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None):
        event = {
            "type": "BlockStored",
            "block_hashes": [self._to_signed_i64(h) for h in block_hashes],
            "parent_block_hash": self._to_signed_i64(parent_hash),
            "token_ids": token_ids,
            "block_size": self.kv_block_size,
            "lora_id": lora_id if lora_id != 0 else None,
        }
        self._publish_event(event)

    def publish_removed(self, event_id: int, block_hashes: list[int]):
        event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]}
        self._publish_event(event)

    def publish_all_cleared(self):
        self._publish_event({"type": "AllBlocksCleared"})

    def _publish_event(self, event: dict):
        # Batch layout: [timestamp, [events], dp_rank]
        batch = [time.time(), [event], self.data_parallel_rank]
        payload = msgpack.packb(batch, use_bin_type=True)
        sequence_bytes = self.sequence.to_bytes(8, byteorder="big")
        self.sequence += 1
        self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload])

    def shutdown(self):
        self.socket.close()
        self.ctx.term()
```
### ZMQ Wire Format
The ZMQ message format (compatible with vLLM):
| Frame | Description |
|-------|-------------|
| 1 | Topic (empty string for all topics) |
| 2 | Sequence number (8 bytes, big-endian) |
| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |
Each event in the payload is a dictionary with a `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
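The three-frame layout above can be unpacked on the receiving side with a few lines of standard-library Python. The sketch below is illustrative only: `parse_kv_event_frames` and `SequenceGapDetector` are hypothetical names, not part of Dynamo or vLLM, and the msgpack payload is left undecoded so the example has no third-party dependencies. It shows how the 8-byte big-endian sequence number might be used to notice dropped messages.

```python
from typing import Optional


def parse_kv_event_frames(frames: list[bytes]) -> tuple[str, int, bytes]:
    """Split a multipart message into (topic, sequence number, raw payload)."""
    if len(frames) != 3:
        raise ValueError(f"expected 3 frames, got {len(frames)}")
    topic_bytes, seq_bytes, payload = frames
    if len(seq_bytes) != 8:
        raise ValueError("sequence frame must be exactly 8 bytes")
    sequence = int.from_bytes(seq_bytes, byteorder="big")
    return topic_bytes.decode(), sequence, payload


class SequenceGapDetector:
    """Tracks the next expected sequence number (hypothetical helper)."""

    def __init__(self) -> None:
        self.expected: Optional[int] = None

    def observe(self, sequence: int) -> int:
        """Return how many messages were missed before this one."""
        missed = 0 if self.expected is None else max(0, sequence - self.expected)
        self.expected = sequence + 1
        return missed


# Frames as they would arrive from a SUB socket's recv_multipart().
frames = [b"", (7).to_bytes(8, byteorder="big"), b"<msgpack bytes>"]
topic, sequence, payload = parse_kv_event_frames(frames)
print(topic, sequence, len(payload))
```

A real subscriber would then decode `payload` with msgpack into `[timestamp, [events], dp_rank]`.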
### Best Practices
1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter)
2. **Block size must match** your engine's actual `kv_block_size`
3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
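For the first practice, one way to get a thread-safe, monotonically increasing counter is to guard the increment with a lock. This is a minimal sketch; `EventIdCounter` is a hypothetical helper and not part of the Dynamo API.

```python
import threading


class EventIdCounter:
    """Hands out strictly increasing event IDs, safe to call from many threads."""

    def __init__(self, start: int = 0) -> None:
        self._value = start
        self._lock = threading.Lock()

    def next(self) -> int:
        # The lock makes increment-and-read atomic, so no ID is issued twice.
        with self._lock:
            self._value += 1
            return self._value


counter = EventIdCounter()
print(counter.next(), counter.next())
```

A publisher wrapper could call `counter.next()` in place of `self.event_id += 1` when cache callbacks may fire from more than one thread.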
## See Also
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Guide](router_guide.md)**: Configuration, tuning, and production setup
- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes
# Router Guide
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
## Quick Start
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
#### CLI Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
For all available options: `python -m dynamo.frontend --help`
For detailed configuration options and tuning parameters, see [Using the KV Cache Router](#using-the-kv-cache-router).
### Kubernetes Deployment
To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
#### Environment Variables
All CLI arguments can be configured via environment variables using the `DYN_` prefix:
| CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
## KV Cache Routing
KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make routing decisions.
### Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower costs indicate better routing choices
- `overlap_score_weight` balances cache hit optimization against load distribution
- Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
### Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
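The calculation above can be reproduced with a short sketch. The function names and the dictionary layout are assumptions made for illustration and do not mirror the router's internal code; only the formula `cost = overlap_score_weight * prefill_blocks + decode_blocks` comes from this guide.

```python
def routing_cost(prefill_blocks: float, decode_blocks: float,
                 overlap_score_weight: float = 1.0) -> float:
    """Cost formula from this guide: lower is better."""
    return overlap_score_weight * prefill_blocks + decode_blocks


def select_worker(workers: dict[str, tuple[float, float]],
                  overlap_score_weight: float = 1.0) -> tuple[str, dict[str, float]]:
    """Return the lowest-cost worker and the per-worker costs (temperature 0)."""
    costs = {
        name: routing_cost(prefill, decode, overlap_score_weight)
        for name, (prefill, decode) in workers.items()
    }
    best = min(costs, key=costs.get)
    return best, costs


# (prefill_blocks, decode_blocks) per worker, taken from the diagram above.
workers = {"worker_1": (8, 10), "worker_2": (5, 5), "worker_3": (2, 9)}

best, costs = select_worker(workers, overlap_score_weight=1.0)
print(best, costs)

# Raising the weight makes prefill blocks more expensive, which favors the
# worker with the most cached blocks.
best_heavy, _ = select_worker(workers, overlap_score_weight=2.0)
print(best_heavy)
```

With a weight of 2.0 the costs become 26, 15, and 13, so Worker 3 (the one with the most cached blocks) wins instead of Worker 2.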
### Using the KV Cache Router
To enable KV cache-aware routing, start the frontend node like this:
```bash
python -m dynamo.frontend --router-mode kv
```
When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
The main KV-aware routing arguments:
- `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--no-kv-events`: Disables KV event tracking. By default (when this flag is not provided), the router uses KV events to monitor block creation and deletion from workers. When disabled with this flag, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Use this flag if your backend doesn't support KV events (or you are not confident in the accuracy or responsiveness of the events).
- `--durable-kv-events`: Enables JetStream mode for KV event transport. Must be specified on **both** the frontend **and** all workers. When enabled, workers publish to JetStream instead of the local indexer, and the frontend consumes from JetStream as a durable consumer. Without this flag (default), workers use the local indexer with NATS Core or ZMQ event plane.
- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas. When enabled, routers share their active sequence information and local predictions of block usage, improving routing consistency across instances. Note that this does not sync the radix tree or cached KV block states themselves - in JetStream mode those are synchronized through JetStream events; in local indexer mode (default) each router queries workers directly.
- `--router-reset-states`: Only applies in JetStream mode (`--durable-kv-events`). When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate.
- `--router-snapshot-threshold`: Only applies in JetStream mode (`--durable-kv-events`). Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
- `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.
- `--track-output-blocks`: Enables tracking of output blocks during generation (default: disabled). When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward `expected_output_tokens`. This improves load balancing accuracy for long-running generation requests by accounting for output-side KV cache growth.
- `--no-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. By default (`router_assume_kv_reuse=true`), the router computes actual block hashes for sequence tracking to deduplicate blocks and optimize load balancing. When disabled via this flag, the router generates random hashes for sequence blocks, treating each request's blocks as unique. This is useful in disaggregated setups where prefill transfers blocks to decode workers that may already have those blocks cached, but the engine cannot coordinate transfers to avoid duplication. Without this flag, the router's load balancing heuristics would undercount decode blocks when duplicates exist.
- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this fraction of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines publish load metrics. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).
- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.
- `--active-prefill-tokens-threshold-frac`: Fraction of `max_num_batched_tokens` for busy detection. A worker is marked busy when `active_prefill_tokens > frac * max_num_batched_tokens`. Uses OR logic with `--active-prefill-tokens-threshold` (worker is busy if either threshold is exceeded). If not set, fractional busy detection is disabled.
- `--router-ttl`: Time-to-live in seconds for blocks in the router's local cache predictions. Blocks older than this duration will be automatically expired and removed from the router's radix tree. Defaults to 120.0 seconds when `--no-kv-events` is used. This helps manage memory usage by removing stale cache predictions that are unlikely to be accurate.
- `--router-max-tree-size`: Maximum tree size (number of blocks) before pruning is triggered. When the total number of blocks in the radix tree exceeds this threshold, the router will prune the least recently used blocks. Defaults to 1048576 (2^20 blocks) when `--no-kv-events` is used. This prevents unbounded memory growth in long-running deployments.
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-kv-events` is used. This creates headroom before the next pruning cycle.
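The three busy-detection thresholds described in the list above can be sketched as a single predicate. This is an illustration only: the function and parameter names are hypothetical, and combining the decode-blocks check with the prefill-token checks via OR is an assumption (the guide states OR logic only for the two prefill-token thresholds).

```python
from typing import Optional


def is_worker_busy(
    active_decode_blocks: int,
    total_blocks: int,
    active_prefill_tokens: int,
    max_num_batched_tokens: int,
    decode_blocks_threshold: Optional[float] = None,
    prefill_tokens_threshold: Optional[int] = None,
    prefill_tokens_threshold_frac: Optional[float] = None,
) -> bool:
    """Return True if any configured threshold is exceeded.

    A threshold left as None is treated as disabled, matching the
    "if not set, ... busy detection is disabled" behavior described above.
    """
    if decode_blocks_threshold is not None and total_blocks > 0:
        if active_decode_blocks / total_blocks > decode_blocks_threshold:
            return True
    if prefill_tokens_threshold is not None:
        if active_prefill_tokens > prefill_tokens_threshold:
            return True
    if prefill_tokens_threshold_frac is not None:
        if active_prefill_tokens > prefill_tokens_threshold_frac * max_num_batched_tokens:
            return True
    return False


# 90 of 100 blocks active exceeds a 0.85 threshold, so the worker is busy.
print(is_worker_busy(90, 100, 0, 8192, decode_blocks_threshold=0.85))
```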
>[!Note]
> **State persistence** depends on the event transport mode:
> - **NATS Core / Event Plane mode** (default): State persists on workers, and the router rebuilds state by querying workers on startup. This is the default when workers have `local_indexer` enabled (which is the default). Works with both NATS Core and ZMQ event planes.
> - **JetStream mode** (`--durable-kv-events` on **both** frontend **and** workers): State persists across router restarts via JetStream and NATS object store snapshots.
> - **No KV events** (`--no-kv-events`): State persistence is not supported.
>
> **Request plane is independent of KV event transport.**
> The router can run without etcd or NATS when using ZMQ event plane (`--event-plane zmq`) and file/mem store (`--store-kv file` or `--store-kv mem`); in this case, KV events use ZMQ transport instead of NATS.
> `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing uses **NATS** for KV events only in JetStream or NATS Core modes (not ZMQ mode).
> When KV events are enabled (default) with NATS-based event plane, NATS is automatically initialized. You can optionally set `NATS_SERVER=nats://...` to specify a custom NATS server; otherwise, it defaults to `localhost:4222`.
> `--no-kv-events` disables KV event transport entirely.
>
> When `--kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
>
> **Backend Configuration:** When using `--no-kv-events`, configure your backend workers to disable KV event publishing:
> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
> - **SGLang**: Do not use `--kv-events-config`
> - **TRT-LLM**: Do not use `--publish-events-and-metrics`
>
> The CLI arguments `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md).
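To make the interaction of `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` concrete, here is a simplified sketch of TTL expiration and least-recently-used pruning. It is an assumption-laden illustration: the real router keeps predictions in a radix tree, whereas this hypothetical `PredictedBlockCache` uses a flat ordered dictionary keyed by block hash.

```python
import time
from collections import OrderedDict
from typing import Optional


class PredictedBlockCache:
    """Flat stand-in for the router's predicted cache state (hypothetical)."""

    def __init__(self, ttl_secs: float = 120.0, max_tree_size: int = 1048576,
                 prune_target_ratio: float = 0.8) -> None:
        self.ttl_secs = ttl_secs
        self.max_tree_size = max_tree_size
        self.prune_target_ratio = prune_target_ratio
        self._blocks: "OrderedDict[int, float]" = OrderedDict()  # hash -> last touched

    def touch(self, block_hash: int, now: Optional[float] = None) -> None:
        """Record a predicted block; prune if the size limit is exceeded."""
        now = time.monotonic() if now is None else now
        self._blocks[block_hash] = now
        self._blocks.move_to_end(block_hash)  # most recently used goes last
        if len(self._blocks) > self.max_tree_size:
            self._prune()

    def expire(self, now: Optional[float] = None) -> None:
        """Drop blocks whose last touch is older than the TTL."""
        now = time.monotonic() if now is None else now
        while self._blocks:
            _, touched = next(iter(self._blocks.items()))
            if now - touched <= self.ttl_secs:
                break
            self._blocks.popitem(last=False)

    def _prune(self) -> None:
        # Evict least recently used blocks until the target size is reached.
        target = int(self.max_tree_size * self.prune_target_ratio)
        while len(self._blocks) > target:
            self._blocks.popitem(last=False)

    def __len__(self) -> int:
        return len(self._blocks)

    def __contains__(self, block_hash: int) -> bool:
        return block_hash in self._blocks
```

Pruning down to a ratio rather than to the limit itself leaves headroom, so the next few insertions do not each trigger another pruning pass.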
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, we must create a client tied to a component's endpoint; we can do this using the labels defined above. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.
```python
client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
```
We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
KV Cache routing uses direct routing with a special worker selection algorithm.
For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For custom routing logic and advanced patterns, see [Routing Patterns](router_examples.md#routing-patterns) in the examples documentation.
## Tuning Guidelines
### 1. Understand Your Workload Characteristics
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
### 2. Monitor Key Metrics
The router logs the cost calculation for each worker:
```text
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```
This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
### 3. Temperature-Based Routing
The `router_temperature` parameter controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution
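The effect of temperature can be seen in a small softmax sketch. The exact normalization the router applies to cost logits is not specified in this guide, so dividing each cost by the maximum cost before negating it is an assumption, as are the function names.

```python
import math
import random


def selection_probabilities(costs: list[float], temperature: float) -> list[float]:
    """Map per-worker costs to selection probabilities (lower cost = more likely)."""
    if temperature <= 0.0:
        # Deterministic: all probability mass on the lowest-cost worker.
        best = costs.index(min(costs))
        return [1.0 if i == best else 0.0 for i in range(len(costs))]
    scale = max(costs) or 1.0  # assumed normalization: divide by the largest cost
    logits = [-(cost / scale) / temperature for cost in costs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_worker(costs: list[float], temperature: float) -> int:
    """Draw a worker index according to the softmax probabilities."""
    probs = selection_probabilities(costs, temperature)
    return random.choices(range(len(costs)), weights=probs, k=1)[0]


costs = [18.0, 10.0, 11.0]
for temperature in (0.0, 0.1, 10.0):
    probs = selection_probabilities(costs, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature most of the probability stays on the cheapest worker; at a high temperature the distribution approaches uniform, which spreads load at the expense of cache reuse.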
### 4. Iterative Optimization
1. Begin with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
- To reduce TTFT: Increase the weight
- To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting
## Disaggregated Serving
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
### Automatic Prefill Router Activation
The prefill router is automatically created when:
1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`)
2. A prefill worker is detected with the same model name and `ModelType.Prefill`
**Key characteristics of the prefill router:**
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
### Setup Example
When both workers are registered, requests are automatically routed.
```python
# Decode worker registration (in your decode worker)
decode_endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate")
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)
await decode_endpoint.serve_endpoint(decode_handler.generate)

# Prefill worker registration (in your prefill worker)
prefill_endpoint = runtime.namespace("dynamo").component("prefill").endpoint("generate")
await register_llm(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,  # <-- Mark as prefill worker
    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",  # Must match decode model name
    # ... other parameters
)
await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```
> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
### Request Flow
The following diagram shows an overview of the major components in disaggregated serving:
```mermaid
graph TD
HTTP[HTTP]
ROUTER[Router]
PREFILL[Prefill Worker]
DECODE[Decode Worker]
classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333;
classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;
class PREFILL,DECODE worker_style
class ROUTER router_style
HTTP <--> |"request/response"| ROUTER
ROUTER --> |"1. send to prefill"| PREFILL
PREFILL --> |"2. return NIXL metadata"| ROUTER
ROUTER --> |"3. send with metadata"| DECODE
DECODE --> |"4. stream response"| ROUTER
PREFILL -.-> |"publish kv events"| ROUTER
linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px
linkStyle 5 stroke:#2196f3,stroke-width:2px
```
## Serving Multiple Router Replicas
For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.)
### Router State Management
The KV Router tracks two types of state (see [Router Design](../../design_docs/router_design.md) for details):
1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - in local indexer mode (default) state is rebuilt from workers on startup; in JetStream mode (`--durable-kv-events`) it is backed by JetStream events and object store snapshots.
2. **Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is **ephemeral** - when a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests.
### Enabling Router Replica Synchronization
```bash
# Router replica 1
python -m dynamo.frontend --router-mode kv --http-port 8000 --router-replica-sync
# Router replica 2 (can be started later)
python -m dynamo.frontend --router-mode kv --http-port 8001 --router-replica-sync
```
The `--router-replica-sync` flag enables active block synchronization between replicas:
- Active blocks are shared via NATS core messaging (fire-and-forget)
- Replicas exchange routing decisions to maintain consistent load estimates
- A new replica starts with zero active blocks but quickly converges, both through its own request handling and through active syncing with other replicas
Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing.
### Persistence and Recovery
Persistence behavior depends on which event transport mode is active:
**NATS Core / Event Plane with Local Indexer Mode (default):**
- State persists on workers: events are fire-and-forget, but workers retain their local indexer state
- On startup, the router queries each worker's local indexer to rebuild state
- Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered
- Simpler infrastructure (no JetStream required)
**JetStream Mode** (`--durable-kv-events` on **both** frontend **and** workers):
- Prefix blocks are stored in NATS JetStream with 1-hour retention
- Snapshots saved to NATS object store at configurable thresholds
- New replicas automatically restore this state on startup
- You can launch a third Router replica even if the first two are down, and it will recover the full prefix state
```bash
python -m dynamo.frontend --router-mode kv --http-port 8002 --router-replica-sync
```
>[!Note]
> If you need to start with a fresh state in JetStream mode, you have two options:
> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](/docs/design_docs/distributed_runtime.md)) which will start a new stream and NATS object store path
> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
## Dynamic Threshold Configuration
Dynamic threshold configuration lets you adjust worker busy thresholds at runtime without restarting the frontend, enabling real-time tuning of load balancing behavior based on observed system performance. The frontend exposes HTTP endpoints at `/busy_threshold`:
**Get or set a model's thresholds (POST):**
```bash
# Set both thresholds for a model
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
# Set only active decode blocks threshold
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": <current_value>}
# Get current thresholds (omit threshold fields)
curl -X POST http://localhost:8000/busy_threshold \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-hf"}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
# Or if not configured: {"model": "...", "active_decode_blocks_threshold": null, "active_prefill_tokens_threshold": null}
```
**List all configured thresholds (GET):**
```bash
curl http://localhost:8000/busy_threshold
# Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
```
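The same calls can be made from Python. The following is a minimal sketch using only the standard library. The helper names `build_threshold_payload` and `post_busy_threshold` are illustrative and not part of Dynamo, and the base URL assumes the frontend from the examples above is listening on port 8000.

```python
import json
import urllib.request


def build_threshold_payload(model, decode_blocks=None, prefill_tokens=None):
    """Build the JSON body for POST /busy_threshold.

    Threshold fields left as None are omitted. Per the curl examples above,
    an omitted field keeps its current value, and omitting both fields
    reads the current thresholds.
    """
    payload = {"model": model}
    if decode_blocks is not None:
        payload["active_decode_blocks_threshold"] = decode_blocks
    if prefill_tokens is not None:
        payload["active_prefill_tokens_threshold"] = prefill_tokens
    return payload


def post_busy_threshold(payload, base_url="http://localhost:8000"):
    """Send the payload to the frontend and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/busy_threshold",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Building a payload needs no running frontend:
print(build_threshold_payload("meta-llama/Llama-2-7b-hf", decode_blocks=0.85))
```

With a frontend running, `post_busy_threshold(build_threshold_payload("meta-llama/Llama-2-7b-hf"))` would correspond to the "get current thresholds" request shown above.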
## See Also
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Configuration file for the Sphinx documentation builder.
import os
import sys
# -- Project information -----------------------------------------------------
project = "NVIDIA Dynamo"
copyright = "2024-2026, NVIDIA CORPORATION & AFFILIATES"
author = "NVIDIA"
# Version is set via DYNAMO_DOCS_VERSION env var during build (e.g., "0.3.0")
# Defaults to "dev" for main branch and PR builds
release = os.environ.get("DYNAMO_DOCS_VERSION", "dev")
# -- General configuration ---------------------------------------------------
# Standard extensions
extensions = [
"ablog",
"myst_parser",
"sphinx_copybutton",
"sphinx_design",
"sphinx_prompt",
# "sphinxcontrib.bibtex",
"sphinx_tabs.tabs",
"sphinx_sitemap",
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.mathjax",
"sphinx.ext.napoleon",
"sphinx.ext.ifconfig",
"sphinx.ext.extlinks",
"sphinxcontrib.mermaid",
"sphinx_reredirects",
]
# Redirects configuration
redirects = {
# Frontend migration
"frontends/kserve": "../components/frontend/frontend_guide.html",
# PR #3802
"guides/tool-calling": "../agents/tool-calling.html", # key format corrected
"architecture/architecture": "../design_docs/architecture.html",
"architecture/disagg_serving": "../design_docs/disagg_serving.html",
"architecture/distributed_runtime": "../design_docs/distributed_runtime.html",
"architecture/dynamo_flow": "../design_docs/dynamo_flow.html",
"architecture/request_cancellation": "../fault_tolerance/request_cancellation.html",
"architecture/request_migration": "../fault_tolerance/request_migration.html",
"kubernetes/create_deployment": "../kubernetes/deployment/create_deployment.html",
"kubernetes/minikube": "../kubernetes/deployment/minikube.html",
"kubernetes/multinode-deployment": "../kubernetes/deployment/multinode-deployment.html",
"kubernetes/logging": "../kubernetes/observability/logging.html",
"kubernetes/metrics": "../kubernetes/observability/metrics.html",
"architecture/kv_cache_routing": "../components/router/router_guide.html",
# PR #3658
"API/nixl_connect/README": "../../api/nixl_connect/README.html",
"API/nixl_connect/connector": "../../api/nixl_connect/connector.html",
"API/nixl_connect/descriptor": "../../api/nixl_connect/descriptor.html",
"API/nixl_connect/device": "../../api/nixl_connect/device.html",
"API/nixl_connect/device_kind": "../../api/nixl_connect/device_kind.html",
"API/nixl_connect/operation_status": "../../api/nixl_connect/operation_status.html",
"API/nixl_connect/rdma_metadata": "../../api/nixl_connect/rdma_metadata.html",
"API/nixl_connect/read_operation": "../../api/nixl_connect/read_operation.html",
"API/nixl_connect/readable_operation": "../../api/nixl_connect/readable_operation.html",
"API/nixl_connect/writable_operation": "../../api/nixl_connect/writable_operation.html",
"API/nixl_connect/write_operation": "../../api/nixl_connect/write_operation.html",
"guides/backend": "../development/backend-guide.html",
"runtime/README": "../development/runtime-guide.html",
"guides/tool_calling": "../agents/tool-calling.html",
"architecture/kvbm_architecture": "../design_docs/kvbm_design.html",
"architecture/kvbm_components": "../design_docs/kvbm_design.html",
"architecture/kvbm_intro": "../components/kvbm/README.html",
"architecture/kvbm_motivation": "../design_docs/kvbm_design.html",
"architecture/kvbm_reading": "../design_docs/kvbm_design.html",
"guides/run_kvbm_in_trtllm": "../components/kvbm/kvbm_guide.html",
"guides/run_kvbm_in_vllm": "../components/kvbm/kvbm_guide.html",
"guides/health_check": "../observability/health-checks.html",
"guides/logging": "../observability/logging.html",
"guides/metrics": "../observability/metrics.html",
"guides/disagg_perf_tuning": "../performance/tuning.html",
"architecture/load_planner": "../components/planner/README.html",
"architecture/planner_intro": "../components/planner/README.html",
"architecture/sla_planner": "../components/planner/planner_guide.html",
"kubernetes/sla_planner_quickstart": "../components/planner/planner_guide.html",
"guides/dynamo_run": "../reference/cli.html",
"dynamo_glossary": "../reference/glossary.html",
"support_matrix": "../reference/support-matrix.html",
# Multimodal documentation consolidation (all redirect to features/multimodal/)
"backends/vllm/multimodal": "../../features/multimodal/multimodal_vllm.html",
"backends/vllm/multimodal_vllm_guide": "../../features/multimodal/multimodal_vllm.html",
"backends/trtllm/multimodal_support": "../../features/multimodal/multimodal_trtllm.html",
"backends/trtllm/multimodal_trtllm_guide": "../../features/multimodal/multimodal_trtllm.html",
"backends/trtllm/multinode/multinode-multimodal-example": "../../../features/multimodal/multimodal_trtllm.html",
"backends/sglang/multimodal_epd": "../../features/multimodal/multimodal_sglang.html",
"backends/sglang/multimodal_sglang_guide": "../../features/multimodal/multimodal_sglang.html",
"multimodal/multimodal_intro": "../features/multimodal/README.html",
# Speculative decoding consolidation
"backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
# Multimodal migration to features/multimodal/
"multimodal/index": "../features/multimodal/README.html",
"multimodal/vllm": "../features/multimodal/multimodal_vllm.html",
"multimodal/sglang": "../features/multimodal/multimodal_sglang.html",
"multimodal/trtllm": "../features/multimodal/multimodal_trtllm.html",
# Component consolidation into docs/components/
"router/README": "../components/router/README.html",
"router/kv_cache_routing": "../components/router/router_guide.html",
"router/kv_events": "../integrations/kv_events_custom_engines.html",
"planner/planner_intro": "../components/planner/README.html",
"planner/README": "../components/planner/README.html",
"planner/planner_guide": "../components/planner/planner_guide.html",
"planner/planner_examples": "../components/planner/planner_examples.html",
"planner/sla_planner_quickstart": "../components/planner/planner_guide.html",
"planner/sla_planner": "../components/planner/planner_guide.html",
"planner/load_planner": "../components/planner/README.html",
"kvbm/kvbm_intro": "../components/kvbm/README.html",
"kvbm/README": "../components/kvbm/README.html",
"kvbm/kvbm_guide": "../components/kvbm/kvbm_guide.html",
"kvbm/kvbm_design": "../design_docs/kvbm_design.html",
# Profiler consolidation
"benchmarks/sla_driven_profiling": "../components/profiler/profiler_guide.html",
}
# Custom extensions
sys.path.insert(0, os.path.abspath("_extensions"))
extensions.append("github_alerts")
# Treat ```mermaid fenced blocks as directives so sphinxcontrib-mermaid (listed in extensions above) renders them
myst_fence_as_directive = ["mermaid"]
# File extensions (myst_parser automatically handles .md files)
source_suffix = [".rst", ".md"]
# MyST parser configuration
myst_enable_extensions = [
"colon_fence", # ::: code blocks
"deflist", # Definition lists
"html_image", # HTML images
"tasklist", # Task lists
]
# Templates path
templates_path = ["_templates"]
# List of patterns to ignore when looking for source files
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "build"]
# -- Options for HTML output -------------------------------------------------
html_theme = "nvidia_sphinx_theme"
html_static_path = ["_static"]
html_extra_path = ["project.json"]
html_theme_options = {
"collapse_navigation": False,
"icon_links": [
{
"name": "GitHub",
"url": "https://github.com/ai-dynamo/dynamo",
"icon": "fa-brands fa-github",
}
],
"switcher": {
# Use single shared URL so all versions see the same switcher list
# When a new version is added, all old docs automatically see it
"json_url": "https://docs.nvidia.com/dynamo/versions1.json",
"version_match": release,
},
"extra_head": {
"""
<script src="https://assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js" ></script>
"""
},
"extra_footer": {
"""
<script type="text/javascript">if (typeof _satellite !== "undefined") {_satellite.pageBottom();}</script>
"""
},
"navbar_start": ["navbar-logo"],
"primary_sidebar_end": [],
}
# Document settings
master_doc = "index"
html_title = f"{project} Documentation"
html_short_title = project
html_baseurl = "https://docs.nvidia.com/dynamo/latest/"
# Suppress warnings for external links and missing references
suppress_warnings = [
"myst.xref_missing", # Missing cross-references of relative links outside docs folder
]
# Additional MyST configuration
myst_heading_anchors = 7 # Generate anchors for headers
myst_substitutions = {} # Custom substitutions
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# High Level Architecture
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer**: Reduces inference response time using NIXL
- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach.
## Motivation behind Dynamo
Scaling inference for generative AI and reasoning models presents complex challenges in three key areas: performance, correctness, and efficiency. The main challenges we are solving are:
- *Difficult UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separates prefill and decode ensures optimal GPU utilization and increases overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
- *Expensive KV cache re-computation*: When requests aren't efficiently routed, KV caches (intermediate states of the transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency ([DeepSeek](https://arxiv.org/abs/2501.12948)).
- *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe or remote storage) enables models to scale beyond GPU memory limits and reduces latency ([Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [LMCache](https://lmcache.ai/)).
- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and dynamic: demand surges are inherently unpredictable, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API.
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
## Key benefits
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](../components/router/README.md)
- [Dynamo KV Cache Block Manager](../components/kvbm/README.md)
- [Planner](../components/planner/README.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../images/architecture.png "Dynamo Architecture")
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers.
Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization.
## Performance benefits of key features
### Disaggregated serving
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../images/disagg_perf_benefit.png)
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
### KV aware routing
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../images/kv_routing.png)
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The preceding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
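To make the idea concrete, here is a toy sketch of KV-aware worker selection. It is not Dynamo's router: the block size, the `overlap_weight` parameter, the cost formula, and the per-worker state layout are all assumptions chosen for illustration. Each worker is scored by how many leading prompt blocks it already caches, offset by its current load. A real router tracks many cached prefixes per worker in a radix tree, whereas this sketch keeps a single cached sequence per worker.

```python
def to_blocks(tokens, block_size=4):
    """Split a token list into full fixed-size blocks (tuples are hashable)."""
    usable = len(tokens) - len(tokens) % block_size
    return [tuple(tokens[i:i + block_size]) for i in range(0, usable, block_size)]


def prefix_overlap(request_blocks, cached_blocks):
    """Count leading request blocks that match the worker's cached prefix."""
    matched = 0
    for want, have in zip(request_blocks, cached_blocks):
        if want != have:
            break
        matched += 1
    return matched


def pick_worker(request_tokens, workers, overlap_weight=1.0):
    """Return the worker id with the lowest cost.

    cost = active_blocks - overlap_weight * overlapping_blocks
    (an illustrative formula: cache reuse lowers cost, load raises it).
    """
    blocks = to_blocks(request_tokens)

    def cost(item):
        _, state = item
        overlap = prefix_overlap(blocks, state["cached"])
        return state["active_blocks"] - overlap_weight * overlap

    return min(workers.items(), key=cost)[0]


prompt = list(range(12))  # three blocks of four tokens
workers = {
    "worker-a": {"cached": to_blocks(prompt[:8]), "active_blocks": 3},
    "worker-b": {"cached": [], "active_blocks": 2},
}
print(pick_worker(prompt, workers))                      # cache reuse wins
print(pick_worker(prompt, workers, overlap_weight=0.0))  # pure load balancing
```

Setting `overlap_weight` to zero reduces the sketch to load-based routing, which is the baseline the paragraph above contrasts against.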
### KV cache manager
The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../images/kvbm_agg_performance.png)
* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.
### NVIDIA Inference Transfer Library (NIXL)
NIXL streamlines data transfer through simplified synchronization and batching, along with simplified source and destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support a single tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and throughput.
## Acknowledgements
We'd like to acknowledge several open source software stacks that motivated our creation of Dynamo.
- vLLM and vLLM-project
- SGLang
- DistServe
- Mooncake
- AIBrix
- BentoML
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger tensor-parallel (TP) size for the memory-bound decoding phase and a smaller TP size for the computation-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows the ongoing decoding requests to be efficiently processed without being blocked by these long prefills.
Disaggregated execution of a request has three main steps:
1. The prefill engine computes the prefill phase and generates the KV cache.
2. The prefill engine transfers the KV cache to the decode engine.
3. The decode engine computes the decode phase.
The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.
## Efficient KV Transfer
The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The KV transfer is non-blocking, allowing GPU forward passes to continue serving other requests during the transfer.
### Router Orchestration
The disaggregated serving flow is orchestrated by the `PrefillRouter`:
```mermaid
sequenceDiagram
participant Client
participant Frontend
participant Router as PrefillRouter
participant Prefill as Prefill Worker
participant Decode as Decode Worker
Client->>Frontend: Request
Frontend->>Router: Preprocessed Request
Router->>Router: Select prefill worker
Router->>Prefill: Prefill request
Prefill->>Prefill: Compute KV cache
Prefill-->>Router: disaggregated_params
Router->>Router: Select decode worker
Router->>Decode: Decode request + transfer metadata
Decode<<->>Prefill: KV transfer (NIXL)
Decode->>Decode: Generate tokens
Decode-->>Frontend: Stream tokens
Frontend-->>Client: Response
```
1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing.
2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata.
3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker.
4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.).
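The four steps above can be sketched as a small in-process simulation. This is not the `PrefillRouter` implementation: the `Worker` classes, the `route` function, and the least-loaded selection policy are hypothetical stand-ins, and the KV transfer itself is reduced to passing a metadata dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    load: int = 0  # in-flight requests (illustrative load signal)


@dataclass
class PrefillWorker(Worker):
    def prefill(self, request):
        # Stand-in for computing the KV cache; returns transfer metadata.
        return {"source": self.name, "request_id": request["id"]}


@dataclass
class DecodeWorker(Worker):
    received: list = field(default_factory=list)

    def decode(self, request):
        # A real decode worker would use the metadata to pull KV blocks via NIXL.
        self.received.append(request)
        source = request["disaggregated_params"]["source"]
        return f"tokens for {request['id']} (KV from {source})"


def least_loaded(workers):
    """Simple load balancing; KV-aware selection would also weigh cache overlap."""
    return min(workers, key=lambda w: w.load)


def route(request, prefill_workers, decode_workers):
    prefill_worker = least_loaded(prefill_workers)   # 1. worker selection
    params = prefill_worker.prefill(request)         # 2. prefill execution
    decode_request = {**request, "disaggregated_params": params}  # 3. inject result
    decode_worker = least_loaded(decode_workers)
    return decode_worker.decode(decode_request)      # 4. decode uses the metadata


prefills = [PrefillWorker("prefill-0", load=2), PrefillWorker("prefill-1", load=0)]
decodes = [DecodeWorker("decode-0")]
print(route({"id": "req-1", "prompt": "hello"}, prefills, decodes))
```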
### Backend-Specific Transfer Metadata
The transfer metadata format varies by backend:
- **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel.
- **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
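The differences above can be summarized as a small lookup table. In this sketch the metadata field names come from the list above, while the backend keys (`"sglang"`, `"vllm"`, `"trtllm"`), the flat layout of `disaggregated_params`, and the sample values are assumptions made for illustration.

```python
# Field names are taken from the list above; the backend keys and the flat
# dictionary layout of disaggregated_params are assumptions.
BACKENDS = {
    "sglang": {"metadata_field": "bootstrap_info", "decode_waits_for_prefill": False},
    "vllm": {"metadata_field": "kv_transfer_params", "decode_waits_for_prefill": True},
    "trtllm": {"metadata_field": "opaque_state", "decode_waits_for_prefill": True},
}


def transfer_metadata(backend, disaggregated_params):
    """Pull the backend-specific metadata out of a prefill result."""
    field_name = BACKENDS[backend]["metadata_field"]
    if field_name not in disaggregated_params:
        raise KeyError(f"{backend} prefill result is missing {field_name!r}")
    return disaggregated_params[field_name]


# Hypothetical SGLang prefill result with made-up host, port, and room_id values.
sglang_result = {"bootstrap_info": {"host": "10.0.0.5", "port": 8998, "room_id": 42}}
print(transfer_metadata("sglang", sglang_result))
print(BACKENDS["vllm"]["decode_waits_for_prefill"])
```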
## Runtime-Reconfigurable xPyD
Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:
- **Add worker**: Worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
- **Remove worker**: Worker drains active requests and deregisters from discovery.
The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.
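As a rough illustration of how xPyD changes at runtime, the sketch below models the discovery service as an in-memory registry. The `DiscoveryService` class, its method names, and the `kv_capacity_blocks` field are hypothetical and do not reflect Dynamo's discovery API or the actual shape of `RuntimeConfig`.

```python
class DiscoveryService:
    """In-memory stand-in for the discovery service."""

    def __init__(self):
        self._workers = {}

    def register(self, worker_id, role, kv_capacity_blocks):
        # Mirrors "registers and publishes its RuntimeConfig (including KV capacity)".
        self._workers[worker_id] = {"role": role, "kv_capacity_blocks": kv_capacity_blocks}

    def deregister(self, worker_id):
        # In a real system the worker drains active requests before this call.
        self._workers.pop(worker_id, None)

    def workers(self, role):
        return sorted(w for w, cfg in self._workers.items() if cfg["role"] == role)


def topology(discovery):
    """Return the current (x, y) of xPyD as a router would see it."""
    return len(discovery.workers("prefill")), len(discovery.workers("decode"))


discovery = DiscoveryService()
discovery.register("prefill-0", "prefill", kv_capacity_blocks=1024)
discovery.register("decode-0", "decode", kv_capacity_blocks=4096)
print(topology(discovery))  # one prefill worker, one decode worker

discovery.register("prefill-1", "prefill", kv_capacity_blocks=1024)
print(topology(discovery))  # scaled up to two prefill workers

discovery.deregister("prefill-0")
print(topology(discovery))  # back to one prefill worker
```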
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Distributed Runtime
## Overview
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from one another.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
- `Frontend` uses the `make_engine` function which handles HTTP serving, request preprocessing, and worker discovery automatically
- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder` depending on their role
- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`
Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("backend")`), and their `Endpoint`s are created using the `component.endpoint()` method.
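The hierarchy can be pictured with a toy model. The classes below are not the Dynamo Python bindings; they only mimic the chained `namespace(...).component(...).endpoint(...)` shape mentioned above, with no networking or discovery. The `ToyRuntime` name and the `path` property are made up for this sketch.

```python
class Endpoint:
    def __init__(self, component, name):
        self.component, self.name = component, name

    @property
    def path(self):
        # Illustrative identifier only; not the format used by any backend.
        namespace = self.component.namespace.name
        return f"{namespace}/{self.component.name}/{self.name}"


class Component:
    def __init__(self, namespace, name):
        self.namespace, self.name = namespace, name

    def endpoint(self, name):
        return Endpoint(self, name)


class Namespace:
    def __init__(self, runtime, name):
        self.runtime, self.name = runtime, name

    def component(self, name):
        return Component(self, name)


class ToyRuntime:
    """Toy stand-in for DistributedRuntime; no networking or discovery."""

    def namespace(self, name):
        return Namespace(self, name)


runtime = ToyRuntime()
endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate")
print(endpoint.path)
```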
## Initialization
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment.
```{caution}
The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
```
### Service Discovery Backends
The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`:
- **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=kv_store`): Uses etcd for service discovery. **This is the global default** for all deployments unless explicitly overridden.
- **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.**
> **Note:** There is no automatic detection of the deployment environment. The runtime always defaults to `kv_store`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments.
When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed.
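The selection rule described above can be sketched in a few lines. The function names are hypothetical, and rejecting unknown values with `ValueError` is an assumption of this sketch rather than documented runtime behavior. The environment variable name, its two values, and the `kv_store` default come from the text above.

```python
import os

VALID_BACKENDS = {"kv_store", "kubernetes"}


def resolve_discovery_backend(env=None):
    """Return the discovery backend, defaulting to kv_store when unset."""
    env = os.environ if env is None else env
    backend = env.get("DYN_DISCOVERY_BACKEND", "kv_store")
    if backend not in VALID_BACKENDS:
        # Assumption: this sketch rejects unknown values outright.
        raise ValueError(f"unsupported DYN_DISCOVERY_BACKEND: {backend!r}")
    return backend


def kv_store_storage(backend):
    """Kubernetes discovery needs no etcd, so the KV store uses memory instead."""
    return "memory" if backend == "kubernetes" else "etcd"


# No auto-detection: an empty environment always resolves to kv_store.
backend = resolve_discovery_backend({})
print(backend, kv_store_storage(backend))

# What the operator-injected variable would select.
backend = resolve_discovery_backend({"DYN_DISCOVERY_BACKEND": "kubernetes"})
print(backend, kv_store_storage(backend))
```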
### Runtime Initialization
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend:
- **Kubernetes mode**: Uses K8s API for service registration via DynamoWorkerMetadata CRD. No external dependencies required.
- **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle.
- **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via `--no-kv-events` flag, which enables prediction-based routing without event persistence.
- **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via `DYN_REQUEST_PLANE` environment variable.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints.
- `Endpoint`: When an Endpoint object is created and started, it performs registration based on the discovery backend:
- **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery.
- **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (e.g., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`.
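As a rough illustration of the KV Store naming scheme, the sketch below builds keys for two workers of the same type. The helper `endpoint_key`, the hexadecimal rendering of the lease ID, and the example lease values are assumptions made for this example.

```python
def endpoint_key(namespace, component, endpoint, lease_id):
    """Build the etcd key for one endpoint instance.

    Rendering lease_id in hexadecimal is an assumption for illustration.
    """
    return f"/services/{namespace}/{component}/{endpoint}-{lease_id:x}"


# Two workers of the same type share every name and differ only by lease.
key_a = endpoint_key("dynamo", "backend", "generate", 0x694D98147D54BE25)
key_b = endpoint_key("dynamo", "backend", "generate", 0x694D98147D54BE26)
prefix = "/services/dynamo/backend/generate"

print(key_a)
print(key_b)
print(key_a.startswith(prefix) and key_b.startswith(prefix))
```

Because both keys share one prefix, a single prefix watch is enough to see every instance of the endpoint.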
## Calling Endpoints
Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes:
- **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates.
- **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`.
The watcher continuously updates the `Client` with information about available `Endpoint`s.
When calling an `Endpoint` from a `Client`, the user chooses a load balancing strategy. The strategies are implemented in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the instance ID
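The three strategies can be sketched as follows. This is a simplified Python model, not a port of `push_router.rs`; the class `InstanceSelector` and its method names are hypothetical.

```python
import itertools
import random


class InstanceSelector:
    """Hypothetical selector over instance IDs kept current by a watcher."""

    def __init__(self, instance_ids, seed=None):
        self.instance_ids = list(instance_ids)
        self._rng = random.Random(seed)
        self._counter = itertools.count()

    def random(self):
        # Pick any known instance with equal probability.
        return self._rng.choice(self.instance_ids)

    def round_robin(self):
        # Cycle through instances in a fixed order.
        return self.instance_ids[next(self._counter) % len(self.instance_ids)]

    def direct(self, instance_id):
        # The caller names the instance; fail if it is not currently known.
        if instance_id not in self.instance_ids:
            raise LookupError(f"instance {instance_id} is not available")
        return instance_id


selector = InstanceSelector([101, 102, 103], seed=0)
print([selector.round_robin() for _ in range(4)])  # [101, 102, 103, 101]
print(selector.direct(102))                        # 102
print(selector.random() in selector.instance_ids)  # True
```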
After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport:
- **TCP** (default): Direct TCP connection with connection pooling
- **HTTP**: HTTP/2-based transport
- **NATS**: Message broker-based transport (legacy)
## Examples
We provide native Rust and Python (through bindings) examples for basic usage of `DistributedRuntime`:
- Rust: `/lib/runtime/examples/`
- Python: The engines in `components/src/dynamo` are complete examples of using `DistributedRuntime`; refer to them for full implementation details.
# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
2. **Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing
## 🟢 Prefill Flow (Green)
The prefill processing pipeline:
4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache
5. **Return Metadata (S5)**: Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata
## 🟠 Decode Routing Flow (Orange)
Router orchestration to decode phase:
6. **Route to Decode (S6)**: PrefillRouter injects prefill result into decode request and routes to decode worker
7. **KV Transfer (S7)**: Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL
## 🟣 Completion Flow (Purple)
The response generation and delivery:
8. **Decode (S8)**: Decode worker generates tokens using the transferred KV cache
9. **Response (S9)**: Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client
## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
### Service Discovery
- **On Kubernetes** (when `DYN_DISCOVERY_BACKEND=kubernetes` is set, as the Dynamo operator does): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **On bare metal**: Uses etcd or filesystem for service discovery and endpoint registration.
### Request Plane
- **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
- **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
### NATS Connections (Optional, for KV routing)
- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
### Planning Connections (Dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for workers
## Technical Implementation Details
### PrefillRouter Orchestration
- The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving
- Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
- Injects transfer metadata into decode requests for KV cache coordination
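The two responsibilities above, worker selection and metadata injection, can be illustrated with a small sketch. The linear score, the field names `overlap_blocks` and `active_blocks`, and both function names are assumptions for this example; the real router's cost function is not described here.

```python
def select_prefill_worker(workers, overlap_weight=1.0):
    """Pick the worker with the best score; higher overlap and lower load win.

    The linear score is an assumption, not Dynamo's actual cost function.
    """
    def score(worker):
        return overlap_weight * worker["overlap_blocks"] - worker["active_blocks"]

    return max(workers, key=score)["id"]


def build_decode_request(request, prefill_response):
    """Copy the request and attach the transfer metadata from prefill."""
    decode_request = dict(request)
    decode_request["disaggregated_params"] = prefill_response["disaggregated_params"]
    return decode_request


workers = [
    {"id": "prefill-0", "overlap_blocks": 8, "active_blocks": 20},
    {"id": "prefill-1", "overlap_blocks": 6, "active_blocks": 2},
]
chosen = select_prefill_worker(workers)
# The metadata is backend-specific and opaque to the router; this dict is a stand-in.
prefill_response = {"disaggregated_params": {"opaque": "backend-specific"}}
decode_request = build_decode_request({"token_ids": [1, 2, 3]}, prefill_response)

print(chosen)
print(decode_request)
```

In this example the lightly loaded worker wins even though the other one has slightly more cache overlap.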
### NIXL (NVIDIA Interchange Library)
- Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Transfer metadata exchanged via `disaggregated_params` in prefill response
- Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
### Disaggregated KV Cache
- Each worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL
- Non-blocking transfers allow GPU forward passes to continue during KV transfer
```mermaid
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
graph TD
%% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S1[["<b>1 REQUEST</b>"]]
S2[["<b>2 PREPROCESS</b>"]]
%% Router Layer
PrefillRouter["<b>PrefillRouter</b><br/><i>Orchestrates Disaggregated Serving</i>"]
S3[["<b>3 ROUTE TO PREFILL</b>"]]
%% Infrastructure
subgraph INF["<b>Infrastructure Layer</b>"]
Discovery[("<b>Discovery</b><br/><i>Service Registry<br/>(ETCD or K8s)</i>")]
NATS[("<b>NATS</b><br/><i>KV Events<br/>(Optional)</i>")]
Planner["<b>Planner</b><br/><i>Auto-scaling</i>"]
end
%% Worker Layer
subgraph WL["<b>Worker Layer</b>"]
%% Prefill Worker
PrefillWorker["<b>Prefill Worker</b><br/><i>Computes KV Cache</i>"]
S4[["<b>4 PREFILL</b>"]]
S5[["<b>5 RETURN METADATA</b>"]]
%% Decode Worker
DecodeWorker["<b>Decode Worker</b><br/><i>Token Generation</i>"]
S6[["<b>6 ROUTE TO DECODE</b>"]]
S7[["<b>7 KV TRANSFER</b>"]]
S8[["<b>8 DECODE</b>"]]
S9[["<b>9 RESPONSE</b>"]]
%% KV Cache
PrefillKVCache[("<b>Prefill KV Cache</b><br/><i>GPU VRAM</i>")]
DecodeKVCache[("<b>Decode KV Cache</b><br/><i>GPU VRAM</i>")]
end
%% Main Request Flow (Blue)
Client --> S1
S1 -->|HTTP API Call| Frontend
Frontend --> S2
S2 -->|Tokenize & Validate| PrefillRouter
PrefillRouter --> S3
S3 -->|Select Prefill Worker| PrefillWorker
%% Prefill Flow (Green)
PrefillWorker --> S4
S4 -->|Compute KV Cache| PrefillKVCache
PrefillWorker --> S5
S5 -->|disaggregated_params| PrefillRouter
%% Decode Routing Flow (Orange)
PrefillRouter --> S6
S6 -->|Inject Transfer Metadata| DecodeWorker
DecodeWorker --> S7
S7 -->|NIXL GPU-to-GPU| PrefillKVCache
PrefillKVCache -.->|Direct Transfer| DecodeKVCache
%% Completion Flow (Purple)
DecodeWorker --> S8
S8 -->|Generate Tokens| DecodeKVCache
DecodeWorker --> S9
S9 -->|Stream Tokens| Frontend
Frontend -->|HTTP Response| Client
%% Infrastructure Connections
Frontend -.->|Service Discovery| Discovery
PrefillRouter -.->|Worker Discovery| Discovery
PrefillWorker -.->|Register| Discovery
DecodeWorker -.->|Register| Discovery
Planner -.->|Service Discovery| Discovery
%% NATS for KV events (optional)
PrefillWorker -.->|KV Events| NATS
DecodeWorker -.->|KV Events| NATS
%% Planning Connections
Frontend -.->|Metrics| Planner
Planner -.->|Auto-scaling| PrefillWorker
Planner -.->|Auto-scaling| DecodeWorker
%% Styling
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
classDef router fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
classDef prefillWorker fill:#e8f5e9,stroke:#388E3C,stroke-width:3px
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
classDef discovery fill:#fff9c4,stroke:#F9A825,stroke-width:3px
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
class Client client
class Frontend frontend
class PrefillRouter router
class DecodeWorker worker
class PrefillWorker prefillWorker
class Planner planner
class PrefillKVCache,DecodeKVCache storage
class Discovery discovery
class NATS nats
class INF infraLayer
class WL workerLayer
%% Flow Colors
%% Main Request Flow - Blue
linkStyle 0,1,2,3,4,5 stroke:#1565C0,stroke-width:4px
%% Prefill Flow - Green
linkStyle 6,7,8,9 stroke:#2E7D32,stroke-width:4px
%% Decode Routing Flow - Orange
linkStyle 10,11,12,13,14 stroke:#E65100,stroke-width:4px
%% Completion Flow - Purple
linkStyle 15,16,17,18,19 stroke:#6A1B9A,stroke-width:4px
%% Infrastructure - Gray dotted
linkStyle 20,21,22,23,24,25,26,27,28,29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
```
# Event Plane Architecture
This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
## Overview
Dynamo's coordination layer adapts to the deployment environment:
| Deployment | Service Discovery | KV Events | Request Plane |
|------------|-------------------|-----------|---------------|
| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
> **Note:** The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
```
┌─────────────────────────────────────────────────────────────────────┐
│                         Coordination Layer                          │
│                                                                     │
│  ┌─────────────────────────┐  ┌─────────────────────────────────┐   │
│  │    Service Discovery    │  │              NATS               │   │
│  │                         │  │           (Optional)            │   │
│  │  • K8s: CRDs + API      │  │  • KV Cache Events              │   │
│  │  • Bare metal: etcd     │  │  • Router Replica Sync          │   │
│  │                         │  │  • JetStream Persistence        │   │
│  └─────────────────────────┘  └─────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
                │                                │
     ┌──────────┴──────────┐           ┌─────────┴──────────┐
     ▼                     ▼           ▼                    ▼
┌─────────┐                 ┌─────────┐                ┌─────────┐
│Frontend │                 │ Planner │                │ Worker  │
└─────────┘                 └─────────┘                └─────────┘
```
## Kubernetes-Native Service Discovery
When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
### Configuration
The operator explicitly sets:
```bash
DYN_DISCOVERY_BACKEND=kubernetes
```
> **Important:** This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
### How It Works
1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
2. **EndpointSlices**: Used to signal readiness status to the system
3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints
### Benefits
- No external etcd cluster required
- Native integration with Kubernetes lifecycle
- Automatic cleanup when pods terminate
- Works with standard K8s RBAC
### Environment Variables (Injected by Operator)
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Current pod name |
| `POD_NAMESPACE` | Current namespace |
| `POD_UID` | Pod unique identifier |
---
## etcd Architecture (Default for All Deployments)
When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.
### Connection Configuration
etcd connection is configured via environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
| `ETCD_AUTH_USERNAME` | Basic auth username | None |
| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |
Example:
```bash
export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
```
### Lease Management
Each `DistributedRuntime` maintains a primary lease with etcd:
```
┌────────────────────┐         ┌───────────────┐
│ DistributedRuntime │◄────────│ Primary Lease │
│                    │         │   TTL: 10s    │
│  • Namespace       │         └───────┬───────┘
│  • Components      │                 │
│  • Endpoints       │                 │ Keep-Alive
│                    │                 │ Heartbeat
└────────────────────┘                 ▼
                               ┌───────────────┐
                               │     etcd      │
                               └───────────────┘
```
**Lease Lifecycle:**
1. **Creation**: Lease created during `DistributedRuntime` initialization
2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
4. **Cleanup**: All keys associated with the lease are automatically deleted
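The lifecycle steps above can be traced with a toy lease driven by an explicit clock. This is a minimal sketch: the `Lease` class and its method names are hypothetical, and real leases live in etcd rather than in process memory.

```python
class Lease:
    """Toy lease with a TTL, driven by an explicit clock for determinism."""

    def __init__(self, ttl=10.0, now=0.0):
        self.ttl = ttl
        self.expires_at = now + ttl

    def remaining(self, now):
        return max(0.0, self.expires_at - now)

    def next_heartbeat_delay(self, now):
        # Heartbeat at 50% of the remaining TTL.
        return self.remaining(now) / 2

    def keep_alive(self, now):
        if self.remaining(now) == 0.0:
            raise RuntimeError("lease already expired")
        self.expires_at = now + self.ttl

    def expired(self, now):
        return now >= self.expires_at


lease = Lease(ttl=10.0, now=0.0)
keys = {"/services/vllm-agg/backend/generate/694d98147d54be25": lease}

t = lease.next_heartbeat_delay(0.0)  # first heartbeat is due after 5 s
lease.keep_alive(t)                  # the lease now expires at t + 10
# Heartbeats stop here (worker crash); the lease lapses one TTL later.
live = {key for key, held in keys.items() if not held.expired(t + 10.0)}

print(t, lease.expires_at, live)
```

Once the lease has expired, every key attached to it drops out of the live set, which models the automatic cleanup step.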
**Automatic Recovery:**
- Reconnection with exponential backoff (50ms to 5s)
- Deadline-based retry logic
- Cancellation token propagation
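A reconnect schedule with these bounds might look like the sketch below. The doubling factor and both function names are assumptions; the text states only the 50ms and 5s bounds and that retries are deadline-based.

```python
def backoff_delays(initial=0.05, maximum=5.0, factor=2.0):
    """Yield reconnect delays in seconds, growing from 50 ms up to a 5 s cap.

    The doubling factor is an assumption; only the bounds are documented.
    """
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def retry_schedule(deadline):
    """Return the delays that fit before a deadline (in seconds)."""
    schedule, elapsed = [], 0.0
    for delay in backoff_delays():
        if elapsed + delay > deadline:
            break
        schedule.append(delay)
        elapsed += delay
    return schedule


print(retry_schedule(deadline=2.0))  # [0.05, 0.1, 0.2, 0.4, 0.8]
```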
### Service Discovery
Endpoints are registered in etcd for dynamic discovery:
**Key Format:**
```
/services/{namespace}/{component}/{endpoint}/{instance_id}
```
**Example:**
```
/services/vllm-agg/backend/generate/694d98147d54be25
```
**Registration Data:**
```json
{
  "namespace": "vllm-agg",
  "component": "backend",
  "endpoint": "generate",
  "instance_id": 7587888160958628000,
  "transport": {
    "tcp": "192.168.1.10:9999"
  }
}
```
### Discovery Queries
The discovery system supports multiple query patterns:
| Query Type | Pattern | Use Case |
|------------|---------|----------|
| `AllEndpoints` | `/services/` | List all services |
| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |
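The query patterns in the table reduce to key prefixes of increasing specificity. The sketch below shows that mapping; `query_prefix` and the sample keys are hypothetical.

```python
def query_prefix(namespace=None, component=None, endpoint=None):
    """Build the key prefix for a discovery query from its most specific part."""
    parts = []
    for part in (namespace, component, endpoint):
        if part is None:
            break
        parts.append(part)
    return "/services/" + "".join(part + "/" for part in parts)


# Sample keys for illustration only.
keys = [
    "/services/vllm-agg/backend/generate/694d98147d54be25",
    "/services/vllm-agg/backend/clear_kv_blocks/694d98147d54be25",
    "/services/vllm-agg/frontend/http/1a2b3c",
    "/services/other/backend/generate/4d5e6f",
]

queries = [(), ("vllm-agg",), ("vllm-agg", "backend"), ("vllm-agg", "backend", "generate")]
counts = {}
for args in queries:
    prefix = query_prefix(*args)
    counts[prefix] = sum(key.startswith(prefix) for key in keys)
    print(prefix, counts[prefix])
```

The trailing slash on each prefix matters: without it, a query for namespace `vllm-agg` would also match a namespace such as `vllm-agg-2`.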
### Watch Functionality
Clients watch etcd prefixes for real-time updates:
```python
# Illustrative pseudocode: `etcd`, `add_endpoint`, and `remove_endpoint` are placeholders.
# Client watches for endpoint changes
watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")
for event in watcher:
    if event.type == "PUT":
        # New endpoint registered
        add_endpoint(event.value)
    elif event.type == "DELETE":
        # Endpoint removed (worker died)
        remove_endpoint(event.key)
```
**Watch Features:**
- Initial state retrieval with `get_and_watch_prefix()`
- Automatic reconnection on stream failure
- Revision tracking for no-event-loss guarantees
- Event types: `PUT` (create/update) and `DELETE`
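How a client might combine the initial snapshot with revision tracking is sketched below. The `EndpointTable` class and the event dictionary shape are assumptions, and the sketch assumes one event per revision, which is a simplification of etcd's behavior.

```python
class EndpointTable:
    """Local view of registered endpoints, kept current from watch events."""

    def __init__(self, snapshot, revision):
        # Initial state from a get-and-watch style call (snapshot plus revision).
        self.endpoints = dict(snapshot)
        self.revision = revision

    def apply(self, event):
        # Skip events at or below the last seen revision (replays after reconnect).
        # Simplification: assumes each revision carries a single event.
        if event["revision"] <= self.revision:
            return False
        if event["type"] == "PUT":
            self.endpoints[event["key"]] = event["value"]
        elif event["type"] == "DELETE":
            self.endpoints.pop(event["key"], None)
        self.revision = event["revision"]
        return True


table = EndpointTable({"/services/ns/c/e/1": "tcp://10.0.0.1:9999"}, revision=7)
events = [
    {"type": "PUT", "key": "/services/ns/c/e/2", "value": "tcp://10.0.0.2:9999", "revision": 8},
    # Replayed after a reconnect; ignored because revision 8 was already applied.
    {"type": "PUT", "key": "/services/ns/c/e/2", "value": "tcp://10.0.0.2:9999", "revision": 8},
    {"type": "DELETE", "key": "/services/ns/c/e/1", "value": None, "revision": 9},
]
applied = [table.apply(event) for event in events]
print(applied, sorted(table.endpoints), table.revision)
```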
### Distributed Locks
etcd provides distributed locking for coordination:
**Lock Types:**
| Type | Key Pattern | Behavior |
|------|-------------|----------|
| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |
**Operations:**
```rust
// Non-blocking write lock
let lock = client.try_write_lock("my_resource").await?;
// Blocking read lock with polling (100ms intervals)
let lock = client.read_lock_with_wait("my_resource").await?;
```
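The exclusive and shared semantics of the two key patterns can be modeled in memory as below. This is only a sketch: `LockTable` is hypothetical, and in etcd the check and the key creation would need to happen atomically (for example in a transaction) rather than as separate steps.

```python
class LockTable:
    """In-memory stand-in for the etcd keys used by read and write locks."""

    def __init__(self):
        self.keys = set()

    def try_write_lock(self, prefix):
        writer = f"v1/{prefix}/writer"
        readers = f"v1/{prefix}/readers/"
        # Exclusive: refuse if any writer or reader key exists.
        if writer in self.keys or any(key.startswith(readers) for key in self.keys):
            return None
        self.keys.add(writer)
        return writer

    def try_read_lock(self, prefix, reader_id):
        # Shared: only a writer blocks readers.
        if f"v1/{prefix}/writer" in self.keys:
            return None
        key = f"v1/{prefix}/readers/{reader_id}"
        self.keys.add(key)
        return key

    def release(self, key):
        self.keys.discard(key)


locks = LockTable()
r1 = locks.try_read_lock("my_resource", "a")
r2 = locks.try_read_lock("my_resource", "b")
blocked = locks.try_write_lock("my_resource")   # None: readers hold the lock
locks.release(r1)
locks.release(r2)
writer = locks.try_write_lock("my_resource")    # succeeds once readers are gone
late_reader = locks.try_read_lock("my_resource", "c")  # None: writer holds it

print(r1, r2, blocked, writer, late_reader)
```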
## NATS Architecture
### When NATS is Used
NATS is used for:
1. **KV Cache Events**: Real-time KV cache state updates for routing
2. **Router Replica Sync**: Synchronizing router state across replicas
3. **Legacy Request Plane**: NATS-based request transport (optional)
### Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |
### Disabling NATS
For deployments without KV-aware routing:
```bash
# Disable NATS and KV events
python -m dynamo.frontend --no-kv-events
```
This enables "approximate mode" for KV routing without event persistence.
### Event Publishing
Components publish events to NATS subjects:
```rust
pub trait EventPublisher {
    async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
    async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
}
```
**Subject Naming:**
```
{base_subject}.{event_name}
```
Example:
```
vllm-agg.backend.kv_cache_update
```
### Event Subscription
Components subscribe to events:
```rust
pub trait EventSubscriber {
    async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
    async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
}
```
### JetStream Persistence
For durable event delivery, NATS JetStream provides:
- Message persistence
- Replay from offset
- Consumer groups for load balancing
- Acknowledgment tracking
## Key-Value Store Abstraction
Dynamo provides a unified KV store interface supporting multiple backends:
### Supported Backends
| Backend | Use Case | Configuration |
|---------|----------|---------------|
| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
| `MemoryStore` | Testing, development | Default |
| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
| `FileStore` | Local persistence | File path |
### Store Interface
```rust
pub trait KvStore {
    async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
    async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
    async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
    async fn watch(&self, bucket: &str) -> Result<WatchStream>;
}
```
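To make the interface concrete, here is a synchronous Python analogue of an in-memory backend. It is a sketch under assumptions: the callback-based `watch` and the event tuple shape are invented for this example and do not reflect how `MemoryStore` or `WatchStream` are implemented.

```python
from collections import defaultdict


class MemoryStore:
    """Synchronous in-memory analogue of the bucketed store interface."""

    def __init__(self):
        self._buckets = defaultdict(dict)
        self._watchers = defaultdict(list)

    def get(self, bucket, key):
        return self._buckets[bucket].get(key)

    def put(self, bucket, key, value):
        self._buckets[bucket][key] = bytes(value)
        self._notify(bucket, ("PUT", key, bytes(value)))

    def delete(self, bucket, key):
        # Only notify when a key was actually removed.
        if self._buckets[bucket].pop(key, None) is not None:
            self._notify(bucket, ("DELETE", key, None))

    def watch(self, bucket, callback):
        # Assumption: a callback replaces the stream returned by the real API.
        self._watchers[bucket].append(callback)

    def _notify(self, bucket, event):
        for callback in self._watchers[bucket]:
            callback(event)


store = MemoryStore()
seen = []
store.watch("v1/instances", seen.append)
store.put("v1/instances", "generate/1", b"{}")
value = store.get("v1/instances", "generate/1")
store.delete("v1/instances", "generate/1")

print(value, seen)
```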
### Buckets
Data is organized into logical buckets:
| Bucket | Purpose |
|--------|---------|
| `v1/instances` | Endpoint instance registry |
| `v1/mdc` | Model deployment cards |
## Typed Prefix Watcher
For type-safe watching of etcd prefixes:
```rust
// Watch and maintain HashMap of deserialized values
let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
    &etcd_client,
    "/services/vllm-agg/",
    lease_id_extractor,
    value_extractor,
).await?;
// Receive updates via watch channel
let instances = watcher.borrow();
```
**Key Extractors:**
| Extractor | Description |
|-----------|-------------|
| `lease_id()` | Use lease ID as key |
| `key_string()` | Extract key with prefix stripping |
| `full_key_string()` | Use full etcd key |
## Reliability Features
### Connection Resilience
**etcd Reconnection:**
- Exponential backoff: 50ms to 5s
- Deadline-based retry logic
- Mutex ensures single concurrent reconnect
**NATS Reconnection:**
- Built-in reconnection in NATS client
- Configurable max reconnect attempts
- Buffering during disconnection
### Lease-Based Cleanup
When a worker crashes or loses connectivity:
1. Keep-alive heartbeats stop
2. Lease expires after TTL (10 seconds)
3. All registered endpoints automatically deleted
4. Clients receive DELETE watch events
5. Traffic reroutes to healthy workers
### Transaction Safety
etcd transactions ensure atomic operations:
```rust
// Atomic create-if-not-exists
let txn = Txn::new()
    .when([Compare::create_revision(key, CompareOp::Equal, 0)])
    .and_then([Op::put(key, value, options)]);
etcd_client.txn(txn).await?;
```
This prevents race conditions in concurrent service registration.
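The create-if-not-exists pattern can be demonstrated with an in-memory store in which the comparison and the write happen under one lock. `TinyKv` and `put_if_absent` are hypothetical names; the sketch mirrors the semantics of comparing the create revision to 0, not etcd's implementation.

```python
import threading


class TinyKv:
    """In-memory store that tracks a create revision per key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._revision = 0
        self._data = {}  # key -> (value, create_revision)

    def create_revision(self, key):
        # A key that does not exist reports create revision 0.
        return self._data.get(key, (None, 0))[1]

    def put_if_absent(self, key, value):
        # Compare and put run under one lock, so they are atomic together.
        with self._lock:
            if self.create_revision(key) != 0:
                return False
            self._revision += 1
            self._data[key] = (value, self._revision)
            return True


kv = TinyKv()
results = []
threads = [
    threading.Thread(target=lambda i=i: results.append(kv.put_if_absent("/services/a", i)))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results.count(True))  # 1
```

Eight concurrent registrations race for the same key, and exactly one of them succeeds.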
## Operational Modes
### Kubernetes Mode (Requires Explicit Configuration)
Native Kubernetes service discovery:
```bash
# Operator explicitly sets this (not auto-detected):
export DYN_DISCOVERY_BACKEND=kubernetes
# Workers register via K8s CRDs
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
# Frontend discovers workers via K8s API
python -m dynamo.frontend
```
No etcd or NATS required for basic operation when using K8s discovery.
### KV Store Mode (Global Default)
Full service discovery with etcd:
```bash
# This is the default - no configuration needed
# export DYN_DISCOVERY_BACKEND=kv_store # (implicit)
# Workers register with etcd
python -m dynamo.vllm --model Qwen/Qwen3-0.6B
# Frontend discovers workers via etcd
python -m dynamo.frontend
```
### KV-Aware Routing (Optional)
Enable NATS for KV cache event tracking:
```bash
# Default: KV events enabled (requires NATS)
python -m dynamo.frontend --router-mode kv
# Disable KV events for prediction-based routing (no NATS)
python -m dynamo.frontend --router-mode kv --no-kv-events
```
With `--no-kv-events`:
- Router predicts cache state based on routing decisions
- TTL-based expiration and LRU pruning
- No NATS infrastructure required
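The idea of predicting cache state from routing decisions can be sketched as follows. Everything here is an assumption made for illustration: the class `PredictedCache`, the TTL and capacity values, and the prefix-based overlap count are not taken from the router's implementation.

```python
from collections import OrderedDict


class PredictedCache:
    """Router-side guess at which blocks a worker holds, without KV events."""

    def __init__(self, ttl=120.0, max_blocks=4):
        self.ttl = ttl
        self.max_blocks = max_blocks
        self._blocks = OrderedDict()  # block_hash -> time of last routing

    def record_route(self, block_hashes, now):
        # Assume the worker will cache every block of a request routed to it.
        for block_hash in block_hashes:
            self._blocks[block_hash] = now
            self._blocks.move_to_end(block_hash)
        # LRU pruning: drop the least recently used entries over capacity.
        while len(self._blocks) > self.max_blocks:
            self._blocks.popitem(last=False)

    def overlap(self, block_hashes, now):
        # TTL expiration: forget entries that were not refreshed in time.
        stale = [h for h, seen in self._blocks.items() if now - seen > self.ttl]
        for block_hash in stale:
            del self._blocks[block_hash]
        # Count the matching prefix of the request's blocks.
        count = 0
        for block_hash in block_hashes:
            if block_hash not in self._blocks:
                break
            count += 1
        return count


cache = PredictedCache(ttl=10.0, max_blocks=4)
cache.record_route(["a", "b", "c"], now=0.0)
first = cache.overlap(["a", "b", "x"], now=1.0)
cache.record_route(["d", "e"], now=2.0)          # capacity exceeded, "a" is pruned
after_prune = cache.overlap(["a", "b"], now=3.0)
after_ttl = cache.overlap(["d"], now=11.0)       # "b" and "c" have expired by now

print(first, after_prune, after_ttl)
```

Because the prediction is never confirmed by the worker, it can drift from the real cache contents, which is the accuracy trade-off of running without KV events.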
## Best Practices
### 1. Use Kubernetes Discovery on K8s
The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.
### 2. For Bare Metal: Deploy etcd Cluster
For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
### 3. Configure Appropriate TTLs (etcd mode)
Balance between detection speed and overhead:
- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
- **Long TTL (30s)**: Less overhead, slower detection
### 4. KV Routing Without NATS
For simpler deployments without NATS:
```bash
# Use prediction-based KV routing
python -m dynamo.frontend --router-mode kv --no-kv-events
```
This provides KV-aware routing with reduced accuracy but no NATS dependency.
## Related Documentation
- [Distributed Runtime](distributed_runtime.md) - Runtime architecture
- [Request Plane](request_plane.md) - Request transport configuration
- [Fault Tolerance](../fault_tolerance/README.md) - Failure handling
# KVBM Design
This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
## KVBM Components
![Internal Components of Dynamo KVBM](../images/kvbm-components.png)
*Internal Components of Dynamo KVBM*
### Core
- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
### Layouts and Blocks
- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
### Transfer Manager
- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
### Storage & Pools
- **Device Pool (G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
- **Host Pool (G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
- **Disk Pool (G3)**: Local NVMe SSD-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
- **Remote Storage (G4)**: Remote or cloud-backed KV block storage. KVBM treats G4 as an opaque blob store accessed through NIXL, unaware of internal layout optimizations.
## KVBM Data Flows
![KVBM Data Flows](../images/kvbm-data-flows.png)
*KVBM Data Flows from device to other memory hierarchies*
### Device → Host (Offload)
- Triggered when explicitly requested to offload by the connector scheduler
- Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy
- Host pool registers the new immutable block (dedup by sequence hash)
### Host → Disk (Offload)
- **Local Disk (G3)**: NIXL Write via POSIX; GDS when available
- **Remote Disk (G4)** (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching/concurrency identical
- Triggered on registered host blocks or explicit offload requests
- Worker allocates a Disk block and performs NIXL Write (Host→Disk)
- Disk pool registers the new immutable block (dedup by sequence hash)
### Host → Device (Onboard)
- Called to bring a host block into GPU memory
- Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy
- Device pool registers the new immutable block
### Disk → Device (Onboard)
- Called to bring a disk block directly into GPU memory
- Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS
- Device pool registers the new immutable block
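The four flows share one pattern: copy the block to the target tier, then register it there with deduplication by sequence hash. The sketch below models only that bookkeeping. `Pool` and `transfer` are hypothetical, and the actual copies (CUDA, NIXL, GDS) are replaced by passing a bytes payload.

```python
class Pool:
    """Toy block pool keyed by sequence hash."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # sequence_hash -> payload

    def register(self, sequence_hash, payload):
        # Dedup by sequence hash: keep the existing block if already present.
        if sequence_hash in self.blocks:
            return False
        self.blocks[sequence_hash] = payload
        return True


def transfer(src, dst, sequence_hash):
    """Copy a registered block from src to dst; returns False if deduplicated."""
    return dst.register(sequence_hash, src.blocks[sequence_hash])


device, host, disk = Pool("G1"), Pool("G2"), Pool("G3")
device.register(0xABC, b"kv-bytes")

offloaded = transfer(device, host, 0xABC)   # Device -> Host offload
duplicate = transfer(device, host, 0xABC)   # same hash again: deduplicated
to_disk = transfer(host, disk, 0xABC)       # Host -> Disk offload
del device.blocks[0xABC]                    # block evicted from GPU memory
onboarded = transfer(disk, device, 0xABC)   # Disk -> Device onboard

print(offloaded, duplicate, to_disk, onboarded)
```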
## Internal Architecture Deep Dive
![Internal architecture and key modules in the Dynamo KVBM](../images/kvbm-internal-arch.png)
*Internal architecture and key modules in the Dynamo KVBM*
### KvBlockManager as Orchestration Layer
The `KvBlockManager<H, D>` acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
`KvBlockManager<H, D>` owns:
- A device-side `BlockPool<Device>`
- A host-side `BlockPool<Host>`
- A remote NIXL agent that supports communication and memory sharing across nodes
- A block set registry for remote lookup and import/export of block metadata
Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
### Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
```text
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
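A worked example of the stride computation follows. It assumes that `layer_stride` is `page_size × inner_dim × dtype_bytes`; that decomposition, the function names, and the numbers are illustrative only.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


def block_stride_in_bytes(num_layers, page_size, inner_dim, dtype_bytes, alignment):
    # Assumption: one layer of one block holds page_size x inner_dim elements.
    layer_stride = page_size * inner_dim * dtype_bytes
    return align_up(num_layers * layer_stride, alignment)


# 3 layers x (16 x 100 x 2 bytes) = 9600 bytes, rounded up to a multiple of 256.
stride = block_stride_in_bytes(num_layers=3, page_size=16, inner_dim=100,
                               dtype_bytes=2, alignment=256)
print(stride)           # 9728
print(5 * stride)       # byte offset of block 5 in a contiguous region
```

With a fixed stride, the start of block `i` is simply `i × block_stride_in_bytes`, which is what makes the fully contiguous layout easy to address.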
Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
- `DeviceStorage` → CUDA device buffer
- `PinnedStorage` → page-locked host memory
- `SystemStorage` → CPU heap memory (fallback/test)
- `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a `StorageAllocator`.
### BlockPool and Memory Pools (Active and Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, etc.) tracks two sub-pools:
- **ActivePool**: Contains blocks currently in use by sequences
- **InactivePool**: Recycled blocks ready for allocation (free list)
When a token block is requested (e.g., `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
### Block State Machine
The state machine (`BlockState`) tracks block lifecycle transitions:
| State | Description | Ownership | Valid Actions/Transitions |
|-------|-------------|-----------|---------------------------|
| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | `init_sequence(salt_hash)` → Partial |
| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | `add_token()` / `add_tokens()` (accumulate), `commit()` → Complete, `reset()` → Reset |
| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | `register()` → Registered, `reset()` → Reset |
| Registered | Block is finalized and visible for reuse. Available in the deduplication cache. | Shared ownership (global registry) | Auto `drop()` → triggers Remove event and transitions to Reset |
#### Valid State Transitions
| From → To | Trigger | Validation |
|-----------|---------|------------|
| Reset → Partial | `init_sequence(salt_hash)` | Must not be in use |
| Partial → Complete | `commit()` | Must be full |
| Complete → Registered | `register()` | Must be finalized |
| Registered → Reset | Drop of `RegistrationHandle` | Automatic |
| Partial → Reset | Aborted sequence | Explicit or drop |
| Complete → Reset | Invalidated | Explicit or drop |
#### Example Block Lifecycle
A sequence requests a new KV block:
1. Allocator pops from InactivePool → Block is in Reset
2. `init_sequence()` → Transitions to Partial
3. Tokens are appended → State remains Partial
4. On full → `commit()` → State becomes Complete
5. `register()` → Block is hashed and moved to Registered. Blocks can now be used for lookup.
6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
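The lifecycle above can be modeled as a small state machine. The sketch below is illustrative Python, not the Rust `BlockState` type: the `Block` class, its `capacity` parameter, and the hash returned by `register()` are assumptions made for the example.

```python
from enum import Enum, auto


class BlockState(Enum):
    RESET = auto()
    PARTIAL = auto()
    COMPLETE = auto()
    REGISTERED = auto()


class Block:
    """Hypothetical block that enforces the transitions in the table above."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.state = BlockState.RESET
        self.tokens: list[int] = []
        self.salt_hash = None

    def _require(self, expected: BlockState) -> None:
        if self.state is not expected:
            raise RuntimeError(f"expected {expected.name}, got {self.state.name}")

    def init_sequence(self, salt_hash: int) -> None:
        self._require(BlockState.RESET)
        self.salt_hash = salt_hash
        self.state = BlockState.PARTIAL

    def add_token(self, token: int) -> None:
        self._require(BlockState.PARTIAL)
        if len(self.tokens) >= self.capacity:
            raise ValueError("block is full")
        self.tokens.append(token)

    def commit(self) -> None:
        self._require(BlockState.PARTIAL)
        if len(self.tokens) != self.capacity:
            raise ValueError("commit requires a full block")
        self.state = BlockState.COMPLETE

    def register(self) -> int:
        self._require(BlockState.COMPLETE)
        self.state = BlockState.REGISTERED
        # Stand-in for the sequence hash used by the deduplication cache.
        return hash((self.salt_hash, tuple(self.tokens)))

    def reset(self) -> None:
        # Valid from any state; returns the block to the reusable pool state.
        self.tokens.clear()
        self.salt_hash = None
        self.state = BlockState.RESET
```

Calling a method from the wrong state raises instead of silently corrupting the block, which mirrors the "Validation" column of the transition table.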
### Lifecycle Management using RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`:
- Creating a `PublishHandle` at registration triggers a Register event
- Dropping the handle triggers a Remove event
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
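The handle pattern can be approximated in Python with an explicit release method and a context manager. This is only an analogy for Rust's drop semantics: `EventLog` is a hypothetical stand-in for the event plane, and this `RegistrationHandle` class is not the KVBM type of the same name.

```python
class EventLog:
    """Stand-in for the event plane: records published events in order."""

    def __init__(self):
        self.events = []

    def publish(self, kind: str, sequence_hash: int) -> None:
        self.events.append((kind, sequence_hash))


class RegistrationHandle:
    """Publishes Register on creation and Remove exactly once on release."""

    def __init__(self, log: EventLog, sequence_hash: int):
        self._log = log
        self._hash = sequence_hash
        self._released = False
        log.publish("Register", sequence_hash)

    def release(self) -> None:
        # Idempotent, so a double release cannot emit two Remove events.
        if not self._released:
            self._released = True
            self._log.publish("Remove", self._hash)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False


log = EventLog()
with RegistrationHandle(log, sequence_hash=42):
    pass  # the block is visible to subscribers while the handle is alive
print(log.events)  # [('Register', 42), ('Remove', 42)]
```

Tying the Remove event to the end of the handle's lifetime means callers cannot forget to deregister a block.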
### Remote Memory Integration using NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
- `nixl_register()`: Registers memory region with NIXL runtime
- `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
- `import_remote_blockset()`: Loads remote node's block layouts into the manager
- `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
#### Remote Memory Registration Protocol
The following describes a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL:
**1. Agent Creation & Memory Registration**
Each worker independently sets up a NixlAgent:
- Registers its memory regions (i.e., device memory) through `nixl_register()`
- These regions correspond to blocks managed in the local BlockPool
Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
**2. Metadata Exchange**
After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
Why is this step critical?
- LLM inference workloads often differ in *tensor parallel (TP)* configurations:
  - Worker 1 might have TP=4, while Worker 2 has TP=8
  - Even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ
- The metadata exchange bridges this semantic mismatch by sharing:
  - LayoutConfig (num_layers, page_size, inner_dim, dtype)
  - BlockSetID
  - Base address + stride information (including alignment)
  - Device ID + memory type (host/device)
- Once workers share metadata, each can reconstruct the layout on its side using `deserialize()`
This enables NIXL to:
- Understand where each layer/block resides
- Perform correct gather-scatter operations during RDMA-like transfers
Without this step, remote fetches would result in data corruption or misaligned tokens.
**3. Serialization & Deserialization: Making Layouts Portable**
In the serialization stage, KVBM exports the layout through `FullyContiguous::serialize()`, which encodes:
- FullyContiguousConfig
- base_offset
- Physical memory descriptors (NixlStorage), including:
  - Memory type (VRAM, DRAM)
  - Address & size
  - Device ID
The system sends this using NIXL transfer and then injects it into a KVBM scheduler state.
In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
- A fully reconstructed memory layout view
- Local representation of a remote memory slice with correct offsets and size semantics
It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
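The round trip can be illustrated with a plain dataclass. The sketch below is an assumption-heavy model: `LayoutDescriptor`, its field names, and the JSON wire format are hypothetical and do not reflect the actual `SerializedNixlBlockLayout` encoding. It only shows why both sides need the same strides and base address to agree on where a block lives.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LayoutDescriptor:
    """Hypothetical portable description of a fully contiguous layout."""
    num_layers: int
    page_size: int
    inner_dim: int
    dtype_bytes: int
    base_address: int
    block_stride: int
    layer_stride: int
    device_id: int
    memory_type: str  # e.g. "VRAM" or "DRAM"

    def serialize(self) -> bytes:
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @staticmethod
    def deserialize(payload: bytes) -> "LayoutDescriptor":
        return LayoutDescriptor(**json.loads(payload.decode("utf-8")))

    def address_of(self, block_idx: int, layer_idx: int) -> int:
        """Address of one layer of one block, computed from the shared metadata."""
        if not 0 <= layer_idx < self.num_layers:
            raise IndexError("layer index out of range")
        return (self.base_address
                + block_idx * self.block_stride
                + layer_idx * self.layer_stride)


local = LayoutDescriptor(3, 16, 10, 2, 0x1000, 1024, 320, 0, "VRAM")
remote_view = LayoutDescriptor.deserialize(local.serialize())
# Both sides now compute the same address for block 2, layer 1.
print(remote_view.address_of(2, 1) == local.address_of(2, 1))  # True
```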
**4. Ownership Handles and Lifetime Tracking**
Memory ownership in NIXL is tightly coupled with RAII-based handles:
- Registering a block returns a `PublishHandle`, which wraps a `RegistrationHandle`
- On drop of this handle, an automatic Remove event is published, which:
  - Deregisters the block from the NIXL layer
  - Removes it from the remote block registry
- This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes
This mechanism avoids:
- Stale memory access
- Dangling pointers on GPU or host
- Manual deregistration bugs
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
### Storage Backends and Pluggability
You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers.
```mermaid
---
title: Example KVBM System Architecture
---
flowchart TD
A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
```
#### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
- `registerVolume(descriptor)`: Register a logical volume for KV cache data
- `unregisterVolume()`: Cleanly deregister and release volume mappings
- `get() / put()`: Block-level APIs used by KVBM to fetch and store token blocks
These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Note that these APIs are still being finalized.
#### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations:
- **StoreEvent**: Emitted when a KV block is registered
- **RemoveEvent**: Emitted when a KV block is released or evicted
Each KVEvent (~100 bytes) contains:
| Field | Description |
|-------|-------------|
| `sequence_hash` | Unique identifier of the KV block |
| `prefix_hash` | Prefix grouping for query-level aggregation |
| `block_size` | Size in bytes |
| `storage_location` | Logical volume identifier |
| `event_type` | Store or Remove |
| `extra_metadata` | Reserved fields for partner-specific optimization |
For scalability, the system batches and publishes these events periodically (e.g., every ~10s, or dynamically based on system load).
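The event shape and batching behavior can be sketched as follows. The field names come from the table above; the Python types, the `EventBatcher` class, and its `max_batch` parameter are assumptions for illustration, since the actual wire format and batching policy are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class KVEvent:
    sequence_hash: int
    prefix_hash: int
    block_size: int          # size in bytes
    storage_location: str    # logical volume identifier
    event_type: str          # "Store" or "Remove"
    extra_metadata: dict = field(default_factory=dict)


class EventBatcher:
    """Buffers events and hands them to `publish` in batches."""

    def __init__(self, publish, max_batch: int = 256):
        self._publish = publish
        self._max_batch = max_batch
        self._pending = []

    def add(self, event: KVEvent) -> None:
        self._pending.append(event)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Intended to be called from a periodic timer as well as on overflow.
        if self._pending:
            batch, self._pending = self._pending, []
            self._publish(batch)


published = []
batcher = EventBatcher(published.append, max_batch=2)
batcher.add(KVEvent(1, 10, 4096, "vol-a", "Store"))
batcher.add(KVEvent(2, 10, 4096, "vol-a", "Store"))
print(len(published))  # 1
```

Batching trades event freshness for lower publish overhead, so subscribers should treat their view of the cache as eventually consistent.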
#### Conceptual Design of a Storage Advisor
This section provides an overview for storage providers interested in integrating as a custom backend to KVBM. **This is optional for KVBM integration with a backend.**
External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
1. Storage volumes are pre-provisioned and mounted by the storage provider
2. These volumes are registered with Dynamo through the NIXL Storage Agent using `registerVolume()` APIs
3. Dynamo KV Block Manager interacts only with logical block-level APIs (`get()` and `put()`)
4. The Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel
5. Storage vendors implement a lightweight subscriber process that listens to these events
To enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream:
- On receiving a **StoreEvent**: Insert a record into an internal prefix tree, hash map, or LRU index with `prefix_hash`, `sequence_hash`, and associated metadata
- On receiving a **RemoveEvent**: Delete or prune the corresponding record, optionally triggering cleanup or tier migration workflows
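A subscriber-side index that applies these two rules might look like the sketch below. It assumes events arrive as dictionaries keyed by the field names from the KVEvent table. The `BlockIndex` class is hypothetical, and a production subscriber would likely use a prefix tree rather than a flat map.

```python
from collections import defaultdict


class BlockIndex:
    """Maps prefix_hash -> {sequence_hash: metadata}, built from the event stream."""

    def __init__(self):
        self._by_prefix = defaultdict(dict)

    def apply(self, event: dict) -> None:
        prefix, seq = event["prefix_hash"], event["sequence_hash"]
        if event["event_type"] == "Store":
            self._by_prefix[prefix][seq] = {
                "block_size": event["block_size"],
                "storage_location": event["storage_location"],
            }
        elif event["event_type"] == "Remove":
            blocks = self._by_prefix.get(prefix)
            if blocks is not None:
                blocks.pop(seq, None)   # tolerate Remove for an unknown block
                if not blocks:
                    del self._by_prefix[prefix]
        else:
            raise ValueError(f"unknown event_type: {event['event_type']!r}")

    def blocks_for_prefix(self, prefix_hash: int) -> dict:
        return dict(self._by_prefix.get(prefix_hash, {}))
```

Tolerating a Remove for an unknown block keeps the index robust when the subscriber starts after some Store events were already published.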
With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies:
- **Hot block promotion**: Frequently accessed KV blocks can be migrated to fast SSD volumes
- **Cold block demotion**: Infrequently used blocks can be demoted to slower storage (HDDs, cloud object storage)
- **Proactive compaction**: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
## Framework Integrations
KVBM integrates with inference frameworks (vLLM, TensorRT-LLM, SGLang) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
### Connector Architecture
There are two components of the interface:
- **Scheduler (Leader)**: Orchestrates KV block offload/onboard and builds metadata that specifies the transfer data for the workers. It also maintains hooks for handling asynchronous transfer completion.
- **Worker**: Reads the metadata built by the scheduler (leader) and performs asynchronous onboarding/offloading at the end of the forward pass.
![vLLM KVBM Integration](../images/kvbm-integrations.png)
*Typical integration of KVBM with inference frameworks (vLLM shown as example)*
### Onboarding Operations
![Onboarding blocks from Host to Device](../images/kvbm-onboard-host2device.png)
*Onboarding blocks from Host to Device*
![Onboarding blocks from Disk to Device](../images/kvbm-onboard-disk2device.png)
*Onboarding blocks from Disk to Device*
### Offloading Operations
![Offloading blocks from Device to Host & Disk](../images/kvbm-offload.png)
*Offloading blocks from Device to Host & Disk*
## Further Reading
- [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
- [SGLang HiCache Benchmarks](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache)
- [EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal](https://arxiv.org/abs/2006.06890)
## See Also
- [KVBM Overview](../components/kvbm/README.md)
- [KVBM Guide](../components/kvbm/kvbm_guide.md)
- [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Design
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](/docs/components/planner/).
## Overview
The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
## Architecture
```text
┌──────────────────────────────────────────────────────────┐
│ Planner Component │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Metric │ │ Load │ │ Performance │ │
│ │ Collector │ │ Predictor │ │ Interpolator │ │
│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Scaling Algorithm │ │
│ └───────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────────────────▼───────────────────────────┐ │
│ │ Connector Layer │ │
│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │
│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
│ │ └───────────────────┘ └───────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
## Scaling Algorithm
### Step 1: Metric Collection
Every `adjustment_interval` seconds, the planner queries Prometheus for:
- Average TTFT and ITL over the interval
- Total request count
- Average input sequence length (ISL) and output sequence length (OSL)
The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
### Step 2: Correction Factor Calculation
The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
```text
prefill_correction = actual_ttft / expected_ttft
decode_correction = actual_itl / expected_itl
```
These factors account for hard-to-model effects such as:
- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
- **Metric variance**: Average ISL/OSL may not represent the actual distribution
The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
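As a worked example of the ratios above, the sketch below computes both factors from hypothetical latencies in milliseconds. The `correction_factor` helper and the sample numbers are illustrative and are not taken from the planner code.

```python
def correction_factor(actual: float, expected: float) -> float:
    """Ratio of observed latency to the latency predicted from profiling data."""
    if expected <= 0:
        raise ValueError("expected latency must be positive")
    return actual / expected


# Observed TTFT of 300 ms against a profiled 200 ms: prefill is slower than predicted.
prefill_correction = correction_factor(actual=300, expected=200)
# Observed ITL of 18 ms against a profiled 20 ms: decode is faster than predicted.
decode_correction = correction_factor(actual=18, expected=20)

print(prefill_correction)  # 1.5
print(decode_correction)   # 0.9
```

A factor above 1 means the system is running slower than the profile predicted, and a factor below 1 means it is running faster.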
### Step 3: Load Prediction
The planner forecasts three values for the next interval:
- `next_num_req`: Number of requests
- `next_isl`: Average input sequence length
- `next_osl`: Average output sequence length
Four predictor implementations are available:
| Predictor | Algorithm | Best For |
| ------------ | ---------------------------------------- | -------------------------------- |
| **Constant** | `next = current` | Stable workloads, long intervals |
| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns |
| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
| **Prophet** | Facebook Prophet time-series model | Complex seasonality |
All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
### Step 4: Replica Calculation
**Prefill replicas:**
```python
predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```
The prefill correction factor has a linear effect on throughput because prefill is single-batched.
**Decode replicas:**
```python
# Apply correction to the ITL SLA target
corrected_itl = target_itl / decode_correction_factor
# Find best throughput/GPU that achieves corrected ITL at predicted context length
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2
)
# Calculate required replicas
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
```
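The two formulas above can be exercised with concrete numbers. In this sketch the per-GPU throughput values are passed in directly as hypothetical inputs. In the planner they come from the interpolators, and the function names here are illustrative only.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval, prefill_correction,
                     throughput_per_gpu, gpus_per_engine):
    # Predicted prefill load in tokens/s, scaled by the capped correction factor.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / throughput_per_gpu / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval,
                    throughput_per_gpu, gpus_per_engine):
    # throughput_per_gpu is assumed to already satisfy the corrected ITL target.
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


# 900 requests of 2000 input tokens over 180 s is 10,000 prefill tokens/s.
print(prefill_replicas(900, 2000, 180, 1.5, throughput_per_gpu=4000, gpus_per_engine=1))  # 3
# 900 requests of 200 output tokens over 180 s is 1,000 decode tokens/s.
print(decode_replicas(900, 200, 180, throughput_per_gpu=300, gpus_per_engine=1))  # 4
```

Because of `min(1, prefill_correction)`, a correction factor above 1 never increases the predicted prefill load in this formula. Only factors below 1 reduce it.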
### Step 5: Scaling Execution
The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
## Connector Design
### Interface
```python
from abc import ABC


class PlannerConnector(ABC):
    async def add_component(self, component_name): ...
    async def remove_component(self, component_name): ...

    # Extended interface (not on ABC, but implemented by both connectors):
    #   async def set_component_replicas(self, targets, blocking)
    #   async def validate_deployment(self, ...)
    #   async def wait_for_deployment_ready(self)
```
### KubernetesConnector
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
**Design decisions:**
- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
### VirtualConnector
For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
**Scaling decision flow:**
1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
2. External system reads decision via `client.wait()`
3. External system executes scaling
4. External system reports completion via `client.complete(decision)`
5. Planner sees `scaled_decision_id >= decision_id` and proceeds
**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
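The acknowledgement check and timeout can be modeled without the runtime. The `DecisionTracker` class below is a hypothetical, in-memory sketch of the bookkeeping. It is not the `VirtualConnectorCoordinator` API, and it takes timestamps as arguments so the logic is easy to test.

```python
class DecisionTracker:
    """Tracks the latest scaling decision and whether it was acknowledged."""

    def __init__(self, timeout_s: float = 1800.0):
        self.timeout_s = timeout_s
        self.decision_id = -1         # last decision written by the planner
        self.scaled_decision_id = -1  # last decision acknowledged externally
        self._issued_at = None

    def write_decision(self, num_prefill: int, num_decode: int, now: float) -> tuple:
        self.decision_id += 1
        self._issued_at = now
        return (num_prefill, num_decode, self.decision_id)

    def complete(self, decision_id: int) -> None:
        # Acknowledgements may arrive out of order; keep the highest one seen.
        self.scaled_decision_id = max(self.scaled_decision_id, decision_id)

    def can_proceed(self, now: float) -> bool:
        if self._issued_at is None or self.scaled_decision_id >= self.decision_id:
            return True
        # Unacknowledged: proceed only once the timeout has elapsed.
        return now - self._issued_at >= self.timeout_s


tracker = DecisionTracker()
decision = tracker.write_decision(num_prefill=2, num_decode=4, now=0.0)
print(tracker.can_proceed(now=10.0))  # False
tracker.complete(decision[2])
print(tracker.can_proceed(now=10.0))  # True
```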
## Performance Interpolation
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
Two interpolators are maintained:
- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
The interpolators use the profiling sweep granularity to determine precision. Finer granularity means more profiling samples but more accurate interpolation.
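To show how sweep granularity affects a lookup, here is a one-dimensional piecewise-linear interpolation over hypothetical profiling samples. The planner's interpolators work over more dimensions and load their data from profiling output, so treat this as a simplified illustration only.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation over sorted xs; clamps outside the sampled range."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # xs[i - 1] < x <= xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical sweep: TTFT in milliseconds measured at three input sequence lengths.
isl_samples = [1000, 2000, 4000]
ttft_ms = [100, 200, 500]

print(interpolate(isl_samples, ttft_ms, 3000))  # 350.0
```

With only three samples, the estimate at ISL 3000 assumes TTFT grows linearly between 2000 and 4000. A finer sweep would capture any curvature in that range.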
## Initialization
The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
After the delay:
1. Initialize the connector (K8s or Virtual based on `--environment`)
2. Validate deployment structure
3. Load profiling results
4. Build interpolators
5. Initialize load predictor
6. Enter main scaling loop
## Performance Considerations
- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
## Known Limitations
1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
## Future Work
- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
- Multi-DGD coordination for shared-cluster scenarios
- Distribution-aware interpolation (beyond mean ISL/OSL)
- Adaptive adjustment interval based on observed scaling latency
## File Map
| File | Size | Purpose |
| ---------------------------- | ---- | ----------------------------------------------------- |
| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |
| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors |
| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader |
| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling |
| `kube.py` | 7.4k | Low-level K8s client wrapper |
| `exceptions.py` | 7.2k | Custom exception hierarchy |
| `prometheus.py` | 7.3k | Prometheus query builder and client |
| `defaults.py` | 8.1k | Default configs, backend name mappings |
| `planner_argparse.py` | 6.2k | CLI argument definitions |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Request Planes User Guide
## Overview
Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements:
- **TCP** (default): Direct TCP connection for optimal performance
- **NATS**: Message broker-based request plane
- **HTTP**: HTTP/2-based request plane
This guide explains how to configure and use the request plane in your Dynamo deployment.
## What is a Request Plane?
The request plane is the transport layer that handles communication between Dynamo services (e.g., frontend to backend, worker to worker). Different request planes offer different trade-offs:
| Request Plane | Suitable For | Characteristics |
|--------------|----------|-----------------|
| **NATS** | Production deployments with KV routing | Requires NATS infrastructure, provides pub/sub patterns, highest flexibility |
| **TCP** | Low-latency direct communication | Direct connections, minimal overhead |
| **HTTP** | Standard deployments, debugging | HTTP/2 protocol, easier observability with standard tools, widely compatible |
## Request Plane vs KV Event Plane
Dynamo has **two independent communication planes**:
- **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
- **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
**Note:** If you use the `tcp` or `http` request plane with KV events enabled (the default), NATS is initialized automatically. You can optionally set the `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. To disable NATS completely, use `--no-kv-events` on the frontend.
Because they are independent, you can mix them.
For example, a deployment with TCP request plane can use different KV event planes:
- **JetStream KV events**: requests use TCP, KV routing still uses NATS JetStream + object store for persistence.
- **NATS Core KV events (local indexer)**: requests use TCP, KV events use NATS Core pub/sub and persistence lives on workers.
- **No KV events**: requests use TCP and KV routing runs without events (no NATS required, but no event-backed persistence).
## Configuration
### Environment Variable
Set the request plane mode using the `DYN_REQUEST_PLANE` environment variable:
```bash
export DYN_REQUEST_PLANE=<mode>
```
Where `<mode>` is one of:
- `tcp` (default)
- `nats`
- `http`
The value is case-insensitive.
### Default Behavior
If `DYN_REQUEST_PLANE` is not set or contains an invalid value, Dynamo defaults to `tcp`.
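The selection rule described in this section (case-insensitive value, fallback to `tcp`) can be expressed in a few lines. This sketch mirrors the documented behavior only. The real parsing lives in the Rust runtime, and `resolve_request_plane` is a hypothetical helper.

```python
import os

VALID_MODES = ("tcp", "nats", "http")


def resolve_request_plane(environ=os.environ) -> str:
    """Return the request plane mode, falling back to tcp when unset or invalid."""
    mode = environ.get("DYN_REQUEST_PLANE", "").lower()
    return mode if mode in VALID_MODES else "tcp"


print(resolve_request_plane({"DYN_REQUEST_PLANE": "HTTP"}))  # http
print(resolve_request_plane({}))                             # tcp
```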
## Usage Examples
### Using TCP (Default)
TCP is the default request plane and provides direct, low-latency communication between services.
**Configuration:**
```bash
# TCP is the default, so no need to set DYN_REQUEST_PLANE explicitly
# But you can explicitly set it if desired:
export DYN_REQUEST_PLANE=tcp
# Optional: Configure TCP server host and port
export DYN_TCP_RPC_HOST=0.0.0.0 # Default host
# export DYN_TCP_RPC_PORT=9999 # Optional: specify a fixed port
# Run your Dynamo service
DYN_REQUEST_PLANE=tcp python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
**Note:** By default, TCP uses an OS-assigned free port (port 0). This is ideal for environments where multiple services may run on the same machine or when you want to avoid port conflicts. If you need a specific port (e.g., for firewall rules), set `DYN_TCP_RPC_PORT` explicitly.
**When to use TCP:**
- Simple deployments with direct service-to-service communication (e.g. frontend to backend)
- Minimal infrastructure requirements (NATS is initialized by default for KV events but can be disabled with `--no-kv-events`)
- Low-latency requirements
**TCP Configuration Options:**
Additional TCP-specific environment variables:
- `DYN_TCP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_TCP_RPC_PORT`: Server port. If not set, the OS assigns a free port automatically (recommended for most deployments). Set explicitly only if you need a specific port for firewall rules.
- `DYN_TCP_MAX_MESSAGE_SIZE`: Maximum message size for TCP client (default: 32MB)
- `DYN_TCP_REQUEST_TIMEOUT`: Request timeout for TCP client (default: 10 seconds)
- `DYN_TCP_POOL_SIZE`: Connection pool size for TCP client (default: 50)
- `DYN_TCP_CONNECT_TIMEOUT`: Connect timeout for TCP client (default: 3 seconds)
- `DYN_TCP_CHANNEL_BUFFER`: Request channel buffer size for TCP client (default: 100)
### Using HTTP
HTTP/2 provides a standards-based request plane that's easy to debug and widely compatible.
**Configuration:**
```bash
# Optional: Configure HTTP server host and port
export DYN_HTTP_RPC_HOST=0.0.0.0 # Default host
export DYN_HTTP_RPC_PORT=8888 # Default port
export DYN_HTTP_RPC_ROOT_PATH=/v1/rpc # Default path
# Run your Dynamo service
DYN_REQUEST_PLANE=http python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=http python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
**When to use HTTP:**
- Standard deployments requiring HTTP compatibility
- Debugging scenarios (use curl, browser tools, etc.)
- Integration with HTTP-based infrastructure
- Load balancers and proxies that work with HTTP
**HTTP Configuration Options:**
Additional HTTP-specific environment variables:
- `DYN_HTTP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_HTTP_RPC_PORT`: Server port (default: 8888)
- `DYN_HTTP_RPC_ROOT_PATH`: Root path for RPC endpoints (default: /v1/rpc)
- `DYN_HTTP2_*`: Various HTTP/2 client configuration options
  - `DYN_HTTP2_MAX_FRAME_SIZE`: Maximum frame size for HTTP client (default: 1MB)
  - `DYN_HTTP2_MAX_CONCURRENT_STREAMS`: Maximum concurrent streams for HTTP client (default: 1000)
  - `DYN_HTTP2_POOL_MAX_IDLE_PER_HOST`: Maximum idle connections per host for HTTP client (default: 100)
  - `DYN_HTTP2_POOL_IDLE_TIMEOUT_SECS`: Idle timeout for HTTP client (default: 90 seconds)
  - `DYN_HTTP2_KEEP_ALIVE_INTERVAL_SECS`: Keep-alive interval for HTTP client (default: 30 seconds)
  - `DYN_HTTP2_KEEP_ALIVE_TIMEOUT_SECS`: Keep-alive timeout for HTTP client (default: 10 seconds)
  - `DYN_HTTP2_ADAPTIVE_WINDOW`: Enable adaptive flow control (default: true)
### Using NATS
NATS provides durable JetStream messaging for the request plane and can also be used for KV events (and router replica sync).
**Prerequisites:**
- NATS server must be running and accessible
- Configure NATS connection via standard Dynamo NATS environment variables
```bash
# Explicitly set to NATS
export DYN_REQUEST_PLANE=nats
# Run your Dynamo service
DYN_REQUEST_PLANE=nats python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
**When to use NATS:**
- Production deployments with service discovery
- KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-kv-events`) provides KV routing without NATS but with reduced accuracy.
- Need for message replay and persistence features
**Limitations:**
- NATS does not support payloads beyond 16MB (use TCP for larger payloads)
## Complete Example
See [`examples/backends/vllm/launch/agg_request_planes.sh`](../../examples/backends/vllm/launch/agg_request_planes.sh) for a complete working example that launches a Dynamo deployment with the TCP, HTTP, or NATS request plane.
## Real-World Example
The Dynamo repository includes a complete example demonstrating all three request planes:
**Location:** `examples/backends/vllm/launch/agg_request_planes.sh`
```bash
cd examples/backends/vllm/launch
# Run with TCP
./agg_request_planes.sh --tcp
# Run with HTTP
./agg_request_planes.sh --http
# Run with NATS
./agg_request_planes.sh --nats
```
## Architecture Details
### Network Manager
The request plane implementation is centralized in the Network Manager (`lib/runtime/src/pipeline/network/manager.rs`), which:
1. Reads the `DYN_REQUEST_PLANE` environment variable at startup
2. Creates the appropriate server and client implementations
3. Provides a transport-agnostic interface to the rest of the codebase
4. Manages all network configuration and lifecycle
### Transport Abstraction
All request plane implementations conform to common trait interfaces:
- `RequestPlaneServer`: Server-side interface for receiving requests
- `RequestPlaneClient`: Client-side interface for sending requests
This abstraction means your application code doesn't need to change when switching request planes.
### Configuration Loading
Request plane configuration is loaded from environment variables at startup and cached globally. The configuration hierarchy is:
1. **Mode Selection**: `DYN_REQUEST_PLANE` (defaults to `tcp`)
2. **Transport-Specific Config**: Mode-specific environment variables (e.g., `DYN_TCP_*`, `DYN_HTTP2_*`)
## Migration Guide
### From NATS to TCP
1. Stop your Dynamo services
2. Set environment variable `DYN_REQUEST_PLANE=tcp`
3. Optionally configure TCP-specific settings (e.g., `DYN_TCP_RPC_HOST`). Note: `DYN_TCP_RPC_PORT` is optional; if not set, an OS-assigned free port is used automatically.
4. Restart your services
### From NATS to HTTP
1. Stop your Dynamo services
2. Set environment variable `DYN_REQUEST_PLANE=http`
3. Optionally configure HTTP-specific settings (`DYN_HTTP_RPC_PORT`, etc.)
4. Restart your services
### Testing the Migration
After switching request planes, verify your deployment:
```bash
# Test with a simple request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
## Troubleshooting
### Issue: Services Can't Communicate
**Symptoms:** Requests time out or fail to reach the backend
**Solutions:**
- Verify all services use the same `DYN_REQUEST_PLANE` setting
- Check that server ports are not blocked by k8s network policies or firewalls
- For TCP/HTTP: Ensure host/port configurations are correct and accessible
- For NATS: Verify NATS server is running and accessible
### Issue: "Invalid request plane mode" Error
**Symptoms:** Service fails to start with configuration error
**Solutions:**
- Check `DYN_REQUEST_PLANE` spelling (valid values: `nats`, `tcp`, `http`)
- Value is case-insensitive but must be one of the three options
- If not set, defaults to `tcp`
### Issue: Port Conflicts
**Symptoms:** Server fails to start due to "address already in use"
**Solutions:**
- TCP: By default, TCP uses an OS-assigned free port, so port conflicts should be rare. If you explicitly set `DYN_TCP_RPC_PORT` to a specific port and get conflicts, either change the port or remove the setting to use automatic port assignment.
- HTTP default port: 8888 (adjust environment variable `DYN_HTTP_RPC_PORT`)
## Performance Considerations
### Latency
- **TCP**: Lowest latency due to direct connections and binary serialization
- **HTTP**: Moderate latency with HTTP/2 overhead
- **NATS**: Moderate latency due to NATS JetStream persistence
### Resource Usage
- **TCP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
- **HTTP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
- **NATS**: Requires running NATS server (additional memory/CPU)